%\VignetteIndexEntry{Using the qrqc package to gather information about sequence qualities}
%\VignettePackage{qrqc}

\documentclass{article}

\usepackage{Sweave}
\usepackage[a4paper]{geometry}
\usepackage{hyperref,graphicx}

\SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5} 
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{{\small\texttt{#1}}}

\author{Vince Buffalo\\[1em]Bioinformatics Core\\ UC Davis Genome Center\\[1em]
\texttt{vsbuffalo@ucdavis.edu}}

\title{\textsf{\textbf{Using the \Rpackage{qrqc} package to gather information about sequence qualities}}}

\date{2010-01-19}

\begin{document}

\maketitle


\begin{abstract}
  Many projects in bioinformatics begin with raw sequences from a
  high-throughput sequencer that must be quality checked before
  additional analysis steps can proceed. The \Rpackage{qrqc} (Quick Read
  Quality Control) package is a fast and extensible package that
  reports basic quality and summary statistics on FASTQ and FASTA
  files, including base and quality distribution by position, sequence
  length distribution, and common sequences.
\end{abstract}

\section{Reading in a Sequence File}

The \Rpackage{qrqc} package reads and processes FASTA and FASTQ files
in C for speed, through the function \texttt{readSeqFile}. Optionally,
sequences can be hashed (also done at the C level), to see the most
frequent sequences in the file.

<<echo=TRUE>>=
library(qrqc)
s.fastq <- readSeqFile(system.file('extdata', 'test.fastq', package='qrqc'))
@ 
<<echo=FALSE>>=
s.fastq@filename <- 'test.fastq' # otherwise a long temp dir will be here
@

Note that there is a maximum sequence length argument in
\texttt{readSeqFile}, \texttt{max.length}. By default, this is
1,000. It is used to pre-allocate the matrices in C, and could be much
larger than the largest sequence encountered without many downsides
(its memory usage is relatively low). If a sequence larger than
\texttt{max.length} is encountered, the function will stop, and the
user can call the function again with a larger \texttt{max.length}.

\texttt{Readseqfile} produces a \texttt{FASTQSummary} object, which inherits from the
\texttt{SequenceSummary} class. Printing the object lists a very short summary:

<<echo=TRUE>>=
s.fastq
@

Optionally, \texttt{readSeqFile} can be run without hashing, with
\texttt{hash=FALSE}. \texttt{readSeqFile} also works on FASTA files,
but \texttt{type=FASTA} must be specified.

\section{Plotting Quality of FASTQSummary Objects}

If the file read and summarized with \texttt{readSeqFile} is a FASTQ
file (and thus the resulting object is from the \texttt{FASTQSummary}
class), quality information by position can be plotted with
\texttt{plotQuals}, which produces a graphic as in Figure \ref{fig:PlotQuals}:

\begin{figure}
\centering
\includegraphics{qrqc-figPlotQuals}
\caption{A plot of quality by base position, with sequence length
  histogram.}
\label{fig:PlotQuals}
\end{figure}

<<figPlotQuals,fig=TRUE,width=7,height=6>>=
plotQuals(s.fastq)
@ 

By default, quality plots include a histogram of sequence lengths. This
is recommended, as low quality at a particular base position is less
worrisome if there are few reads with this sequence length. The grey
lines indicate the 10\% and 90\% quantiles, orange lines indicate the
lower and upper quartiles, the blue dot is the median, and the green
dash the mean. A purple lowess curve is fit through the
distributions. This line is fit by first randomly drawing values from
the empirical (binned) distribution of qualities at a particular base, then
fitting through these points.

\texttt{plotQuals} can be very useful in inspecting base qualities
before and after read quality control pipelines. For example, the
package contains \texttt{test-trimmed.fastq}, which has the same
sequences as \texttt{test.fastq} after being trimmed with Nik Joshi's
\texttt{Sickle}, a windowed adaptive quality trimmer.

\begin{figure}
\centering
\includegraphics{qrqc-figPlotQuals-trimmed}
\caption{A plot of quality by base position after being trimmed with \texttt{Sickle}.}
\label{fig:PlotQuals-trimmed}
\end{figure}
<<figPlotQuals-trimmed,fig=TRUE,width=7,height=6>>=
s.trimmed.fastq <- readSeqFile(system.file('extdata', 'test-trimmed.fastq', package='qrqc'))
plotQuals(s.trimmed.fastq)
@ 

\section{Other Plots of FASTQSummary and FASTASummary Objects}
The following S4 methods work for FASTQSummary and FASTASummary
objects. Base frequencies (counts) and base proportions by position
can be plotted with \texttt{plotBases}. When used with
\texttt{type='freq'}, \texttt{plotBases} produces a graphic as in Figure
\ref{fig:PlotBase-freqs}:

\begin{figure}
\centering
\includegraphics{qrqc-figPlotBase-freqs}
\caption{Base frequencies by position in sequence.}
\label{fig:PlotBase-freqs}
\end{figure}

<<figPlotBase-freqs,fig=TRUE,width=7,height=6>>=
plotBases(s.fastq, type="freq")
@ 

\texttt{plotBases} uses the Sanger base color scheme: blue is
Cytosine, green is Adenine, black is Guanine, red is Thymine, and
purple in N (any base). When further IUPAC nucleotides are found in
the sequences, a legend is added.

\texttt{plotBases} also accepts a \texttt{bases} parameter, which
can be used to specify specific bases. This is useful for plotting
just the frequency of `N'.

Base proportions by position can be plotted with \texttt{plotBases},
with \texttt{type='prop'}. This plot is basically identical to
the plot produced with \texttt{type='freq'} with a different y scale:

\begin{figure}
\centering
\includegraphics{qrqc-figPlotBase-prop}
\caption{Base proportions by position in sequence.}
\label{fig:PlotBase-prop}
\end{figure}

<<figPlotBase-prop,fig=TRUE,width=7,height=6>>=
plotBases(s.fastq, type="prop")
@

Sequence length distribution can be plotted with
\texttt{plotSeqLengths} (graphic shown in Figure \ref{fig:PlotSeqLengths}):

\begin{figure}
\centering
\includegraphics{qrqc-figPlotSeqLengths}
\caption{Histogram of sequence lengths after quality trimming.}
\label{fig:PlotSeqLengths}
\end{figure}

<<figPlotSeqLengths,fig=TRUE,width=7,height=6>>=
plotSeqLengths(s.trimmed.fastq)
@

The GC content can be plotted with \texttt{plotGC} (graphic shown
in Figure \ref{fig:PlotGC}):

\begin{figure}
\centering
\includegraphics{qrqc-figPlotGC}
\caption{GC content by position}
\label{fig:PlotGC}
\end{figure}

<<figPlotGC,fig=TRUE,width=7,height=6>>=
plotGC(s.fastq)
@

\section{HTML Report Generation}

With the help of \Rpackage{brew} and \Rpackage{xtable},
\Rpackage{qrqc} can generate an HTML summary report. This is created
with \texttt{makeReport}. Reports are stored in their own directories,
with images kept under `images/' and the report is ``report.html''. A
specific output directory can be passed through the \texttt{outputDir}
argument, but the present directory is used as a default. Multiple
reports generated in the same directory will have an incremental
naming scheme.

\section{Working with the FASTQSummary and FASTASummary classes}
\Rpackage{qrqc} provides the \texttt{FASTQSummary} and
\texttt{FASTASummary} classes for users to build functions and
applications around. Both inherit from \texttt{SequenceSummary}, which
should not be used directly.

\texttt{FASTASummary} has the same slots as \texttt{FASTQSummary},
except the latter provides additional slots for quality
information. Both contain:

\begin{itemize}
\item \texttt{filename}: the filename of the file read and summarized
  with \texttt{readSeqFile}.
\item \texttt{base.freqs}: a dataframe containing the frequency counts
  of each base.
\item \texttt{seq.lengths}: a numeric vector containing the sequences
  lengths (counts by position).
\item \texttt{hash}: a numeric vector containing the counts of unique
  sequences (the actual sequences are the names attribute).
\item \texttt{hashed}: a logical indicating whether sequence hashing
  to count unique sequences was done.
\end{itemize}

Additionally, \texttt{FASTQSummary} provides:

\begin{itemize}
\item \texttt{qual.freqs}: a dataframe containing the counts of bases
  of a particular quality by position.
\item \texttt{mean.qual}: a numeric giving the mean quality, weighted
  by sequence lengths.
\end{itemize}

\section{Acknowledgements}
Thanks to Simon Andrews for his work on
\href{http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/}{FastQC}
(a similar program written in Java) from which this project was
inspired. Also thanks to Joseph Fass and Dawei Lin for their advice
and helpful comments.

Thanks to Heng Li for his work on \texttt{kseq.h} \texttt{khash.h}
(both MIT License) which this package uses through
\Rpackage{RSamtools}. More on these header files can be found at
\url{http://lh3lh3.users.sourceforge.net/kseq.shtml} and
\url{http://attractivechaos.awardspace.com/khash.h.html}.

\texttt{Sickle} can be downloaded or cloned from the UC Davis
Bioinformatics Github repository:
\url{https://github.com/ucdavis-bioinformatics/sickle}.

\section{Session Info}
<<>>=
sessionInfo()
@

\end{document}