%\VignetteIndexEntry{Assessing and Adjusting for Technical Bias in High Throughput Sequencing Data} %\VignettePackage{seqbias} %\VignettePackage{seqbias, Rsamtools} \documentclass{article} \usepackage{Sweave} \usepackage{graphicx} \usepackage{utopia} \SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE} \title{\texttt{seqbias} \\ Assessing and Adjusting for Technical Bias in High Throughput Sequencing Data} \author{Daniel Jones \\ \small{\texttt{}} \\ \small{Computer Science \& Engineering} \\ \small{University of Washington}} \date{\today} \begin{document} \maketitle \section{Introduction} This package is designed as a means to assess and adjust for technical bias in high-throughput sequencing datasets, RNA-Seq being a specific target. As noted in previous studies, RNA-Seq is often subject to protocol specific bias. That is, the number of reads mapping to a particular position of the genome is dependent on the the surrounding nucleotide sequence (as well as the abundance of the RNA transcript) \cite{Hansen2010} \cite{Li2010}. Accounting for this bias increases uniformity of coverage and may result in more accurate quantification. The approach implemented here trains a simple Bayesian network classifier and uses it to evaluate the per position bias. This builds off work done by Hansen, et. al. \cite{Hansen2010}, available in the \texttt{Genominator} Bioconductor package. Another approach is taken by Li, et. al. \cite{Li2010} in the \texttt{mseq} package, available from CRAN. For this vignette, we will use some example data taken from Mortazavi, et. al. \cite{Mortazavi2008} (NCBI accession number SRR001358). Because of space constraints, we have mapped the reads (using Bowtie \cite{Langmead2009}) to an artificial genome consisting of approximately 100kb of exonic DNA. This ``artificial genome'' is given as, <<>>= library(seqbias) library(Rsamtools) ref_fn <- system.file( "extra/example.fa", package = "seqbias" ) ref_f <- FaFile( ref_fn ) open.FaFile( ref_f ) @ And the mapped reads, <<>>= reads_fn <- system.file( "extra/example.bam", package = "seqbias" ) @ \section{Assessment} As a natural first step, we would like to assess whether our sample is significantly biased. If this proves to be the case, we may wish to correct for this. A simple procedure to do so will be covered in the next section. To assess the nucleotide frequency we will use a very simple procedure: \begin{enumerate} \item Generate a random sample of intervals across our reference genome. \item Extract sequences for these intervals from a FASTA file. \item Extract read counts across these intervals from a BAM file. \item Using these sequences and counts, compute and plot nucleotide or $k$-mer frequencies. \end{enumerate} \subsection*{Sampling} For this step, we could use collection of known exons, but trustworthy annotations are not always available, and biasing the analysis by known exons may be a concern in some instances. Fortunately, \texttt{seqbias} provides a function to generate random intervals. First, we extract a vector of sequence lengths, in the reference genome. Given an FASTA file than has been indexed with the \texttt{samtools faidx} command, we can use the \texttt{Rsamtools} package to read off the sequence lengths and to extract the sequence. First, the lengths, <<>>= ref_seqs <- scanFaIndex( ref_f ) @ Once we have this, we generate 5 intervals of 100kb. It most cases we would want to generate a larger sample, but since we are working here with small reference sequence with dense coverage, we can get an accurate measurement with a few intervals. <<>>= I <- random.intervals( ref_seqs, n = 5, m = 100000 ) @ \subsection*{Sequences} Next we extract the nucleotide sequences, <<>>= seqs <- scanFa( ref_f, I ) @ The \texttt{scanFa} function does not respect strand, so we must be sure to perform the reverse complement ourselves. <<>>= neg_idx <- as.logical( I@strand == '-' ) seqs[neg_idx] <- reverseComplement( seqs[neg_idx] ) @ \subsection*{Counts} Finally, we count the number of reads mapping to each position in our sampled intervals. <<>>= counts <- count.reads( reads_fn, I, binary = T ) @ Unless the \texttt{binary} argument is FALSE, this function returns a 0-1 vector, where a position is 0 if no reads map to it, and 1 if \emph{at least} one read maps to it. This is a more robust way to measure sequencing bias, as the frequencies can not get dominated by a few very high peaks. \subsection*{Frequencies} At last, we compute the $k$-mer frequency (where $k=1$, by default). <<>>= freqs <- kmer.freq( seqs, counts ) @ A nice way to plot this is with the \texttt{ggplot2} package, if available. <>= if( require(ggplot2) ) { P <- qplot( x = pos, y = freq, ylim = c(0.15,0.4), color = seq, data = freqs, geom = "line" ) P <- P + facet_grid( seq ~ . ) print(P) } else { par( mar = c(5,1,1,1), mfrow = c(4,1) ) with( subset( freqs, seq == "a" ), plot( freq ~ pos, ylim = c(0.15,0.4), sub = "a", type = 'l' ) ) with( subset( freqs, seq == "c" ), plot( freq ~ pos, ylim = c(0.15,0.4), sub = "c", type = 'l' ) ) with( subset( freqs, seq == "g" ), plot( freq ~ pos, ylim = c(0.15,0.4), sub = "g", type = 'l' ) ) with( subset( freqs, seq == "t" ), plot( freq ~ pos, ylim = c(0.15,0.4), sub = "t", type = 'l' ) ) } @ Doing so will produce the following plot, \begin{center} \includegraphics[width=\textwidth]{overview-fig1} \end{center} The x-axis shows the nucleotide position relative to the read start. Negative number occur in the genome to the left of mapped reads. In this set, the reads consist of positions 0-24. We can see a clear bias here in positions 0-15, approximately. The rest of the plot looks relatively flat, as we would expect if the experiment was measuring abundance only and not a biased by the nucleotide sequence. In the next section we will adjust read counts to account for this. \section{Compensation} \subsection*{Training} To begin, we must fit a seqbias model to our dataset. This done very easily with the \texttt{seqbias.fit} function. This will take only a few seconds, but when more reads are available a more accurate model can be trained at the expense of the training procedure taking up to several minutes. <>= sb <- seqbias.fit( ref_fn, reads_fn, L = 5, R = 15 ) @ The \texttt{L} and \texttt{R} arguments control the maximum number of positions to the left and right of the read start that may be considered. The model tries to consider only informative positions, so increasing these numbers will increase training time, but should never have a negative effect the accuracy of the model. \subsection*{Prediction} Once we have trained the seqbias model, we can use it to predict the sequencing bias across a set of intervals. <<>>= bias <- seqbias.predict( sb, I ) @ \subsection*{Adjustment} To adjust, we will can simply divide the \texttt{counts} vectors by the \texttt{bias} vectors. <<>>= counts.adj <- mapply( FUN = `/`, counts, bias, SIMPLIFY = F ) @ The post-adjustment nucleotide frequencies can then be measured as before, <<>>= freqs.adj <- kmer.freq( seqs, counts.adj ) @ And plotted, <>= if( require(ggplot2) ) { P <- qplot( x = pos, y = freq, ylim = c(0.15,0.4), color = seq, data = freqs.adj, geom = "line" ) P <- P + facet_grid( seq ~ . ) print(P) } else { par( mar = c(5,1,1,1), mfrow = c(4,1) ) with( subset( freqs.adj, seq == "a" ), plot( freq ~ pos, ylim = c(0.15,0.4), sub = "a", type = 'l' ) ) with( subset( freqs.adj, seq == "c" ), plot( freq ~ pos, ylim = c(0.15,0.4), sub = "c", type = 'l' ) ) with( subset( freqs.adj, seq == "g" ), plot( freq ~ pos, ylim = c(0.15,0.4), sub = "g", type = 'l' ) ) with( subset( freqs.adj, seq == "t" ), plot( freq ~ pos, ylim = c(0.15,0.4), sub = "t", type = 'l' ) ) } @ The plot below results, \begin{center} \includegraphics[width=\textwidth]{overview-fig2} \end{center} Compared to the first figure, the improvement is clear. \section{Save / Load} If the model is fit using a large number of reads, it can take several minutes to train. To avoid repeatedly refitting the model, \texttt{seqbias} provides a mechanism to save and load the model to a YAML file with the \texttt{seqbias.save} and \texttt{seqbias.load} functions. <<>>= seqbias.save( sb, "my_seqbias_model.yml" ) # load the model sometime later sb <- seqbias.load( ref_fn, "my_seqbias_model.yml" ) @ Note when loading the model, we need to provide a reference sequence. The \texttt{seqbias} object keeps track of the reference sequence to make \texttt{seqbias.predict} more convenient. \section{Session Info} <<>>= sessionInfo() @ \bibliographystyle{plain} \bibliography{overview} \end{document}