%\VignetteIndexEntry{Using the qrqc package to gather information about sequence qualities} %\VignettePackage{qrqc} \documentclass{article} \usepackage{Sweave} \usepackage[a4paper]{geometry} \usepackage{hyperref,graphicx} \SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{{\small\texttt{#1}}} \author{Vince Buffalo\\[1em]Bioinformatics Core\\ UC Davis Genome Center\\[1em] \texttt{vsbuffalo@ucdavis.edu}} \title{\textsf{\textbf{Using the \Rpackage{qrqc} package to gather information about sequence qualities}}} \date{2010-01-19} \begin{document} \maketitle \begin{abstract} Many projects in bioinformatics begin with raw sequences from a high-throughput sequencer that must be quality checked before additional analysis steps can proceed. The \Rpackage{qrqc} (Quick Read Quality Control) package is a fast and extensible package that reports basic quality and summary statistics on FASTQ and FASTA files, including base and quality distribution by position, sequence length distribution, and common sequences. \end{abstract} \section{Reading in a Sequence File} The \Rpackage{qrqc} package reads and processes FASTA and FASTQ files in C for speed, through the function \texttt{readSeqFile}. Optionally, sequences can be hashed (also done at the C level), to see the most frequent sequences in the file. <>= library(qrqc) s.fastq <- readSeqFile(system.file('extdata', 'test.fastq', package='qrqc')) @ <>= s.fastq@filename <- 'test.fastq' # otherwise a long temp dir will be here @ Note that there is a maximum sequence length argument in \texttt{readSeqFile}, \texttt{max.length}. By default, this is 1,000. It is used to pre-allocate the matrices in C, and could be much larger than the largest sequence encountered without many downsides (its memory usage is relatively low). If a sequence larger than \texttt{max.length} is encountered, the function will stop, and the user can call the function again with a larger \texttt{max.length}. \texttt{Readseqfile} produces a \texttt{FASTQSummary} object, which inherits from the \texttt{SequenceSummary} class. Printing the object lists a very short summary: <>= s.fastq @ Optionally, \texttt{readSeqFile} can be run without hashing, with \texttt{hash=FALSE}. \texttt{readSeqFile} also works on FASTA files, but \texttt{type=FASTA} must be specified. \section{Plotting Quality of FASTQSummary Objects} If the file read and summarized with \texttt{readSeqFile} is a FASTQ file (and thus the resulting object is from the \texttt{FASTQSummary} class), quality information by position can be plotted with \texttt{plotQuals}, which produces a graphic as in Figure \ref{fig:PlotQuals}: \begin{figure} \centering \includegraphics{qrqc-figPlotQuals} \caption{A plot of quality by base position, with sequence length histogram.} \label{fig:PlotQuals} \end{figure} <>= plotQuals(s.fastq) @ By default, quality plots include a histogram of sequence lengths. This is recommended, as low quality at a particular base position is less worrisome if there are few reads with this sequence length. The grey lines indicate the 10\% and 90\% quantiles, orange lines indicate the lower and upper quartiles, the blue dot is the median, and the green dash the mean. A purple lowess curve is fit through the distributions. This line is fit by first randomly drawing values from the empirical (binned) distribution of qualities at a particular base, then fitting through these points. \texttt{plotQuals} can be very useful in inspecting base qualities before and after read quality control pipelines. For example, the package contains \texttt{test-trimmed.fastq}, which has the same sequences as \texttt{test.fastq} after being trimmed with Nik Joshi's \texttt{Sickle}, a windowed adaptive quality trimmer. \begin{figure} \centering \includegraphics{qrqc-figPlotQuals-trimmed} \caption{A plot of quality by base position after being trimmed with \texttt{Sickle}.} \label{fig:PlotQuals-trimmed} \end{figure} <>= s.trimmed.fastq <- readSeqFile(system.file('extdata', 'test-trimmed.fastq', package='qrqc')) plotQuals(s.trimmed.fastq) @ \section{Other Plots of FASTQSummary and FASTASummary Objects} The following S4 methods work for FASTQSummary and FASTASummary objects. Base frequencies (counts) and base proportions by position can be plotted with \texttt{plotBases}. When used with \texttt{type='freq'}, \texttt{plotBases} produces a graphic as in Figure \ref{fig:PlotBase-freqs}: \begin{figure} \centering \includegraphics{qrqc-figPlotBase-freqs} \caption{Base frequencies by position in sequence.} \label{fig:PlotBase-freqs} \end{figure} <>= plotBases(s.fastq, type="freq") @ \texttt{plotBases} uses the Sanger base color scheme: blue is Cytosine, green is Adenine, black is Guanine, red is Thymine, and purple in N (any base). When further IUPAC nucleotides are found in the sequences, a legend is added. \texttt{plotBases} also accepts a \texttt{bases} parameter, which can be used to specify specific bases. This is useful for plotting just the frequency of `N'. Base proportions by position can be plotted with \texttt{plotBases}, with \texttt{type='prop'}. This plot is basically identical to the plot produced with \texttt{type='freq'} with a different y scale: \begin{figure} \centering \includegraphics{qrqc-figPlotBase-prop} \caption{Base proportions by position in sequence.} \label{fig:PlotBase-prop} \end{figure} <>= plotBases(s.fastq, type="prop") @ Sequence length distribution can be plotted with \texttt{plotSeqLengths} (graphic shown in Figure \ref{fig:PlotSeqLengths}): \begin{figure} \centering \includegraphics{qrqc-figPlotSeqLengths} \caption{Histogram of sequence lengths after quality trimming.} \label{fig:PlotSeqLengths} \end{figure} <>= plotSeqLengths(s.trimmed.fastq) @ The GC content can be plotted with \texttt{plotGC} (graphic shown in Figure \ref{fig:PlotGC}): \begin{figure} \centering \includegraphics{qrqc-figPlotGC} \caption{GC content by position} \label{fig:PlotGC} \end{figure} <>= plotGC(s.fastq) @ \section{HTML Report Generation} With the help of \Rpackage{brew} and \Rpackage{xtable}, \Rpackage{qrqc} can generate an HTML summary report. This is created with \texttt{makeReport}. Reports are stored in their own directories, with images kept under `images/' and the report is ``report.html''. A specific output directory can be passed through the \texttt{outputDir} argument, but the present directory is used as a default. Multiple reports generated in the same directory will have an incremental naming scheme. \section{Working with the FASTQSummary and FASTASummary classes} \Rpackage{qrqc} provides the \texttt{FASTQSummary} and \texttt{FASTASummary} classes for users to build functions and applications around. Both inherit from \texttt{SequenceSummary}, which should not be used directly. \texttt{FASTASummary} has the same slots as \texttt{FASTQSummary}, except the latter provides additional slots for quality information. Both contain: \begin{itemize} \item \texttt{filename}: the filename of the file read and summarized with \texttt{readSeqFile}. \item \texttt{base.freqs}: a dataframe containing the frequency counts of each base. \item \texttt{seq.lengths}: a numeric vector containing the sequences lengths (counts by position). \item \texttt{hash}: a numeric vector containing the counts of unique sequences (the actual sequences are the names attribute). \item \texttt{hashed}: a logical indicating whether sequence hashing to count unique sequences was done. \end{itemize} Additionally, \texttt{FASTQSummary} provides: \begin{itemize} \item \texttt{qual.freqs}: a dataframe containing the counts of bases of a particular quality by position. \item \texttt{mean.qual}: a numeric giving the mean quality, weighted by sequence lengths. \end{itemize} \section{Acknowledgements} Thanks to Simon Andrews for his work on \href{http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/}{FastQC} (a similar program written in Java) from which this project was inspired. Also thanks to Joseph Fass and Dawei Lin for their advice and helpful comments. Thanks to Heng Li for his work on \texttt{kseq.h} \texttt{khash.h} (both MIT License) which this package uses through \Rpackage{RSamtools}. More on these header files can be found at \url{http://lh3lh3.users.sourceforge.net/kseq.shtml} and \url{http://attractivechaos.awardspace.com/khash.h.html}. \texttt{Sickle} can be downloaded or cloned from the UC Davis Bioinformatics Github repository: \url{https://github.com/ucdavis-bioinformatics/sickle}. \section{Session Info} <<>>= sessionInfo() @ \end{document}