%\VignetteIndexEntry{Using the qrqc package to gather information about sequence qualities} %\VignettePackage{qrqc} \documentclass{article} \usepackage{Sweave} \usepackage[a4paper]{geometry} \usepackage{hyperref,graphicx} \SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{{\small\texttt{#1}}} \author{Vince Buffalo\\[1em]Bioinformatics Core\\ UC Davis Genome Center\\[1em] \texttt{vsbuffalo@ucdavis.edu}} \title{\textsf{\textbf{Using the \Rpackage{qrqc} package to gather information about sequence qualities}}} \date{2012-02-19} \begin{document} \maketitle \begin{abstract} Many projects in bioinformatics begin with raw sequences from a high-throughput sequencer that must be quality checked before additional analysis steps can proceed. The \Rpackage{qrqc} (Quick Read Quality Control) package is a fast and extensible package that reports basic quality and summary statistics on FASTQ and FASTA files, including base and quality distribution by position, sequence length distribution, k-mers by position, and common sequences. \end{abstract} \section{Reading in a Sequence File} The \Rpackage{qrqc} package reads and processes FASTA and FASTQ files in C for speed, through the function \texttt{readSeqFile}. Optionally, sequences can be hashed (also done at the C level), to see the most frequent sequences in the file. Hashing is memory intensive (all unique sequence reads are kept in memory once), so by default, \texttt{readSeqFile} will randomly sample 10\% of the reads and hash these. This proportion can be adjusted with the \texttt{hash.prop}. Also, k-mers are hashed (when \texttt{kmer=TRUE}) with the same proportion as set by \texttt{hash.prop}. <>= library(qrqc) s.fastq <- readSeqFile(system.file('extdata', 'test.fastq', package='qrqc')) @ <>= s.fastq@filename <- 'test.fastq' # otherwise a long temp dir will be here @ Note that there is a maximum sequence length argument in \texttt{readSeqFile}, \texttt{max.length}. By default, this is 1,000. It is used to pre-allocate the matrices in C, and could be much larger than the largest sequence encountered without many downsides (its memory usage is relatively low). If a sequence larger than \texttt{max.length} is encountered, the function will stop, and the user can call the function again with a larger \texttt{max.length}. \texttt{Readseqfile} produces a \texttt{FASTQSummary} object, which inherits from the \texttt{SequenceSummary} class. Printing the object lists a very short summary: <>= s.fastq @ Optionally, \texttt{readSeqFile} can be run without hashing, with \texttt{hash=FALSE}. \texttt{readSeqFile} also works on FASTA files, but \texttt{type=FASTA} must be specified. \section{Plotting Quality of FASTQSummary Objects} If the file read and summarized with \texttt{readSeqFile} is a FASTQ file (and thus the resulting object is from the \texttt{FASTQSummary} class), quality information by position can be plotted with \texttt{qualPlot}, which produces a graphic as in Figure \ref{fig:qualPlot}: \begin{figure} \centering \includegraphics{qrqc-figQualPlot} \caption{A plot of quality by base position, with sequence length histogram.} \label{fig:QualPlot} \end{figure} <>= qualPlot(s.fastq) @ If there's variability in sequence length, one should interpret this quality plot with the sequence length histogram (as produced by \texttt{seqlenPlot}), as low quality at a particular base position is less worrisome if there are few reads with this sequence length. The grey lines indicate the 10\% and 90\% quantiles, orange lines indicate the lower and upper quartiles, the blue dot is the median, and the green dash the mean. A purple smooth curve is fit through the distributions. This line is fit by first randomly drawing values from the empirical (binned) distribution of qualities at a particular base, then using \Rpackage{ggplot2}'s \texttt{geom\_smooth} with these points. \texttt{qualPlot} can be very useful in inspecting base qualities before and after read quality control pipelines. For example, the package contains \texttt{test-trimmed.fastq}, which has the same sequences as \texttt{test.fastq} after being trimmed with Nik Joshi's \texttt{Sickle}, a windowed adaptive quality trimmer. Each plot can be made separately by passing each \texttt{FASTQSummary} object directly to \texttt{qualPlot}. However, \texttt{qualPlot} has methods to handle a named list of \texttt{FASTQSummary} objects. Each will be plotted in a panel of its own. \begin{figure} \centering \includegraphics{qrqc-figQualPlot-list} \caption{A plot of quality by base position after being trimmed with \texttt{Sickle}.} \label{fig:QualPlot-list} \end{figure} <>= s.trimmed.fastq <- readSeqFile(system.file('extdata', 'test-trimmed.fastq', package='qrqc')) qualPlot(list("trimmed"=s.trimmed.fastq, "untrimmed"=s.fastq)) @ \section{Base Plots of FASTQSummary and FASTASummary Objects} \texttt{qualPlot} is the only plotting method that works only on \texttt{FASTQSummary} objects. Other plotting methods work for FASTQSummary and FASTASummary objects. Base frequencies (counts) and base proportions by position can be plotted with \texttt{basePlot}. When used with \texttt{type='frequency'}, \texttt{basePlot} produces a graphic as in Figure \ref{fig:BasePlot-freqs}: \begin{figure} \centering \includegraphics{qrqc-figBasePlot-freqs} \caption{Base frequencies by position in sequence.} \label{fig:BasePlot-freqs} \end{figure} <>= basePlot(s.fastq) @ \texttt{basePlot} uses a color scheme from \Rpackage{biovizBase}, which (according to their goals) make biological sense, are aesthetically pleasing, and accommodate those that are colorblind. By default, the color scheme is determined by \Rpackage{biovizBase}'s \texttt{getBioColor} with ``DNA\_BASES\_N'' (which includes A, T, C, G, and N). A scheme for IUPAC codes can be used too; use \texttt{colorvalues=getBioColor('IUPAC\_CODE\_MAP')}. If one is making custom graphics, these color schemes can be more easily accessed through \texttt{scale\_color\_dna} and \texttt{scale\_color\_iupac}. \texttt{basePlot} also accepts a \texttt{bases} parameter, which can be used to specify specific bases. This is useful for plotting just the frequency of `N'. Base proportions by position can be plotted with \texttt{basePlot}, with \texttt{type='proportion'}. This plot is basically identical to the plot produced with \texttt{type='frequency'} with a different y scale. Different geoms (see \Rpackage{ggplot2} for more details) can be specified too with the \texttt{geom} parameter. By default, a line graph is used, but bar graphs (through \texttt{geom='bar'}) and bar graphs with dodged bars (through \texttt{geom='dodge'}) can be plotted also, as in Figure \ref{fig:BasePlot-prop}. \begin{figure} \centering \includegraphics{qrqc-figBasePlot-prop} \caption{Base proportions by position in sequence.} \label{fig:BasePlot-prop} \end{figure} <>= basePlot(s.fastq, bases=c("G", "C"), geom="bar", type="proportion") @ Sequence length distribution can be plotted with \texttt{seqlenPlot} (graphic shown in Figure \ref{fig:seqlenPlot}): \begin{figure} \centering \includegraphics{qrqc-figSeqlenPlot} \caption{Histogram of sequence lengths after quality trimming.} \label{fig:SeqlenPlot} \end{figure} <>= seqlenPlot(s.trimmed.fastq) @ The GC content can be plotted with \texttt{gcPlot} (graphic shown in Figure \ref{fig:gcPlot}). Here, we can start to see how using \Rpackage{ggplot2} can easily add to \Rpackage{qrqc}'s plotting functions. \begin{figure} \centering \includegraphics{qrqc-figGcPlot} \caption{GC content by position} \label{fig:GcPlot} \end{figure} <>= gcPlot(s.fastq) + geom_hline(yintercept=0.5, color="purple") @ \section{k-mer Hashing and Investigating Contamination with Shannon Entropy and K-L Divergence} \texttt{readSeqFile} also has an option to hash k-mers. By default, \texttt{readSeqFile} randomly selects 10\% of the reads and hashes their k-mers with \texttt{k=6}. The accessor function \texttt{getKmer} will return a data frame of k-mer frequency by position. The function \texttt{kmerEntropyPlot} will plot the Shannon entropy by position, as in Figure \ref{fig:Entropy}. \texttt{kmerEntropyPlot} will handle any object that inherits from \texttt{SequenceSummary} or a named list of such objects. \begin{figure} \centering \includegraphics{qrqc-figEntropy} \caption{Shannon Entropy for an Illumina file and a FASTA file made from uniformly and randomly sampling bases.} \label{fig:Entropy} \end{figure} <>= s.rand <- readSeqFile(system.file('extdata', 'random.fasta', package='qrqc'), type="fasta") kmerEntropyPlot(list("contaminated"=s.fastq, "random"=s.rand)) @ To look for possible contamination in this k-mer frequency data, the function \texttt{calcKL} calculates the Kullback-Leibler divergence between the distribution of k-mers at a particular positions (for all positions) and the distribution of k-mers across all positions. The result is a data frame of the K-L terms, of which a subset are stacked in the plot produced by \texttt{kmerKLPlot}, as in Figure \ref{fig:KmerKL}. \begin{figure} \centering \includegraphics{qrqc-figKmerKL} \caption{K-L terms for a subset of top k-mers.} \label{fig:KmerKL} \end{figure} <>= kmerKLPlot(s.fastq) @ Note that the total height is \emph{proportional} to the actual K-L divergence; because only a subset of the k-mers are used to avoid over plotting, this is not the full K-L divergence (it's a subset of the sample space of k-mers). The number of top k-mers (determined by finding the k-mers with the highest K-L terms) is controlled by setting the number of k-mers to include with \texttt{n.kmers}. Passing a list of objects that inherit from \texttt{SequenceSummary} will facet the K-L divergence term plot, as in Figure \ref{fig:KmerKLRand} \begin{figure} \centering \includegraphics{qrqc-figKmerKLRand} \caption{K-L terms for a subset of top k-mers for many \texttt{SequenceSummary} objects.} \label{fig:KmerKLRand} \end{figure} <>= contam.file <- system.file('extdata', 'test-contam.fastq', package='qrqc') s.contam <- readSeqFile(contam.file, kmer=TRUE, k=5) kmerKLPlot(list("contaminated"=s.fastq, "random"=s.rand, "highly contaminated"=s.contam)) @ \section{Customizing Graphics with \Rpackage{ggplot2}} A major motivating factor of using \Rpackage{ggplot2} with \Rpackage{qrqc} is that it allows easy customization of plots without requiring plotting methods to have excessive arguments. For example, if we wish to flip the coordinates of a plot, we easily with \texttt{coord\_flip()}, as in Figure \ref{fig:Coord}. \begin{figure} \centering \includegraphics{qrqc-figCoord} \caption{Flipped coordinate base plot.} \label{fig:Coord} \end{figure} <>= basePlot(s.fastq, geom="bar") + coord_flip() @ \texttt{scale\_x\_continuous} and \texttt{scale\_y\_continuous} can be used to change axis labels and limits of the window. For example, one can focus on only the 3'-end bases, as in Figure \ref{fig:Zoom}. In this example, we also use \texttt{theme\_bw()} to remove \Rpackage{ggplot2}'s default grey background. \begin{figure} \centering \includegraphics{qrqc-figZoom} \caption{Controlling the x-axis.} \label{fig:Zoom} \end{figure} <>= qualPlot(s.fastq) + scale_x_continuous(limits=c(60, 85)) + theme_bw() @ \section{Accessor Functions} One can access the summary data gathered by \texttt{readSeqFile} through the slots in an object that inherits from \texttt{SequenceSummary}. However, these are straight from the C function used by \texttt{readSeqFile} and not easy to work with. The accessor functions \texttt{getBase}, \texttt{getBaseProp}, \texttt{getQual}, \texttt{getSeqlen}, \texttt{getKmer}, and \texttt{getMCQual} return data frames that can then be used directly with \Rpackage{ggplot2}. \begin{figure} \centering \includegraphics{qrqc-figAltqual} \caption{Accessor functions with custom \Rpackage{ggplot2} plots.} \label{fig:Altqual} \end{figure} <>= ggplot(getQual(s.fastq)) + geom_linerange(aes(x=position, ymin=lower, ymax=upper, color=mean)) + scale_color_gradient("mean quality", low="red", high="green") + scale_y_continuous("quality") @ \section{HTML Report Generation} With the help of \Rpackage{brew} and \Rpackage{xtable}, \Rpackage{qrqc} can generate an HTML summary report. This is created with \texttt{makeReport}. Reports are stored in their own directories, with images kept under `images/' and the report is ``report.html''. A specific output directory can be passed through the \texttt{outputDir} argument, but the present directory is used as a default. Multiple reports generated in the same directory will have an incremental naming scheme. \section{Working with the FASTQSummary and FASTASummary classes} \Rpackage{qrqc} provides the \texttt{FASTQSummary} and \texttt{FASTASummary} classes for users to build functions and applications around. Both inherit from \texttt{SequenceSummary}, which should not be used directly. \texttt{FASTASummary} has the same slots as \texttt{FASTQSummary}, except the latter provides additional slots for quality information. Both contain: \begin{itemize} \item \texttt{filename}: the filename of the file read and summarized with \texttt{readSeqFile}. \item \texttt{base.freqs}: a data frame containing the frequency counts of each base. \item \texttt{seq.lengths}: a numeric vector containing the sequences lengths (counts by position). \item \texttt{hash}: a numeric vector containing the counts of unique sequences (the actual sequences are the names attribute). \item{\texttt{hash.prop}}{a numeric value indicating the proportion of sequences that were sampled for hashing.} \item{\texttt{kmer}}{a data frame of k-mer frequency by position.} \item{\texttt{k}{an integer value indicating the k-mer size 'k'.}} \item{\texttt{hashed}: a logical indicating whether sequence hashing to count unique sequences was done.} \item{\texttt{kmers.hashed}: a logical indicating whether k-mer hashing was done.} \end{itemize} Additionally, \texttt{FASTQSummary} provides: \begin{itemize} \item \texttt{qual.freqs}: a data frame containing the counts of bases of a particular quality by position. \item \texttt{mean.qual}: a numeric giving the mean quality, weighted by sequence lengths. \end{itemize} \section{Acknowledgements} Thanks to Simon Andrews for his work on \href{http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/}{FastQC} (a similar program written in Java) from which this project was inspired. Also thanks to Joseph Fass and Dawei Lin for their advice and helpful comments. Thanks to Heng Li for his work on \texttt{kseq.h} \texttt{khash.h} (both MIT License) which this package uses through \Rpackage{RSamtools}. More on these header files can be found at \url{http://lh3lh3.users.sourceforge.net/kseq.shtml} and \url{http://attractivechaos.awardspace.com/khash.h.html}. \texttt{Sickle} can be downloaded or cloned from the UC Davis Bioinformatics Github repository: \url{https://github.com/ucdavis-bioinformatics/sickle}. \section{Session Info} <<>>= sessionInfo() @ \end{document}