%\VignetteIndexEntry{Rolexa}
%\VignetteDepends{mclust}
%\VignetteKeywords{High-throughput sequencing}
%\VignettePackage{Rolexa}
\documentclass[11pt,a4paper]{article}
\SweaveOpts{engine=R}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\usepackage{hyperref}
\usepackage{Sweave}
\usepackage{a4wide}
\setkeys{Gin}{width=\textwidth}

\begin{document}

\title{\bf Rolexa: Probabilistic Base Calling of Solexa Sequencing Data}
\author{Jacques Rougemont\\
  Bioinformatics \& Biostatistics Core Facility,\\
  EPFL School of Life Sciences,\\
  Lausanne, Switzerland}
\maketitle
\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{intro.sect}

This package provides an alternative base calling algorithm that uses
model-based clustering (\Rpackage{mclust}) and probability theory to identify
ambiguous bases and encode them with IUPAC symbols. We also select an optimal
sub-tag for each read, using a score based on information content, in order to
remove uncertain bases towards the ends of the reads. A few diagnostic
plotting functions are provided as well. Details of the algorithms were
published in \cite{Rougemont2008}.

\section{Environment variables}\label{Renv.sect}

The \Rpackage{Rolexa} package uses a \Robject{RolexaRun} object to store the
various parameters of a run, and relies on the \Rpackage{ShortRead} package
for data manipulation; in particular, many \Rpackage{Rolexa} functions take a
\Robject{SolexaPath} object as an argument.

We load the library and create a configuration with default parameters,
except for the \Robject{idsep} variable:
<<>>=
library(Rolexa)
rolenv = SetModel(idsep="_")
GetModel(rolenv)
@
The meaning of these parameters is as follows:
\begin{description}
\item[MinimumTagLength]{tags shorter than this will not be saved}
\item[SequencingLength]{number of sequencing cycles, used to calculate the number of columns in files}
\item[Barcode]{number of bases used as a barcode at the beginning of the tag}
\item[HThresholds]{entropy thresholds separating 1- from 2-base ambiguities, 2- from 3-base ambiguities, and 3-base ambiguities from undecided calls (the default is $\log_2(c(1.5,2.5,3.5))$)}
\item[IThresholds]{total entropy thresholds, as a function of tag length (the default is $\log_2(4+1:36/6)$)}
\item[PET]{paired-end sequencing run}
\item[fit]{use full EM convergence instead of only one-step optimization if \Robject{TRUE}}
\item[normal]{use tile-level normalization before base calling if \Robject{TRUE}}
\item[decorrelate]{use the 'cycle'-level decorrelation procedure, the 'channel'-level one, 'both' or 'none'}
\item[idsep]{character separating coordinate fields in sequence headers (default is ":")}
\item[verbose]{print debugging information if $>0$}
\end{description}
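These parameters can also be modified after the run object has been created.
The following sketch is illustrative only and is not evaluated: it resets the
two threshold parameters to their documented defaults by direct slot
assignment (as is done for \Robject{MinimumTagLength} later in this vignette),
assuming that the slot names match the parameter names listed above.
<<eval=FALSE>>=
## Illustrative sketch (not run): reset the ambiguity and total-entropy
## thresholds to their documented defaults by direct slot assignment.
## The slot names are assumed to match the parameter names listed above.
rolenv@HThresholds = log2(c(1.5, 2.5, 3.5))
rolenv@IThresholds = log2(4 + 1:36/6)
GetModel(rolenv)
@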
\section{Loading data}\label{load.sect}

Loading data is done using the \Rpackage{ShortRead} utilities (in particular
the \Robject{SolexaPath} class), with two additional wrappers,
\Rfunction{CombineReads} and \Rfunction{CombineFastQ}:
<<>>=
path = SolexaPath(system.file("extdata", package="ShortRead"))
@
We then use the loading functions to read a selection of those files:
<<>>=
(int = readIntensities(path, pattern="s_1_0001", withVariability=FALSE))
(seq = CombineReads(run=rolenv, path=path, pattern="s_1_0001_seq*"))
(seq_fastq = readFastq(path))
@

\section{Data transforms}\label{transform.sect}

Before going into the base calling itself, we can perform several data
transformations to remove some of the systematic biases:
\begin{enumerate}
\item Reduce cross-talk between color channels
<<>>=
(theta = OptimizeAngle(int=int))[1:10,]
int = DeCorrelateChannels(int=int, theta=theta)
@
\item Reduce dephasing along cycles
<<>>=
(rate = OptimizeRate(int=int))
int = DeCorrelateCycles(int=int, rate=rate)
@
\item Reduce position-dependent bias within each tile
<<>>=
int2 = TileNormalize(run=rolenv, int=int)
@
\end{enumerate}

\section{Base calling}\label{basecall.sect}

The base calling algorithm fits a Gaussian mixture model to the
four-dimensional intensity values from each cycle. Sequences from a previous
base calling, if available, are used to seed the algorithm:
<<>>=
(res = SeqScore(run=rolenv, int=int, seqInit=seq, cycles=1:36))$sread
@

\section{Filtering and saving}\label{filter.sect}

The base calling results consist of a full-length tag with per-base entropy
scores, which can then be filtered to extract the best sequence tag for each
colony. This is where the parameter \Robject{IThresholds} comes into play:
\begin{center}
<<fig=TRUE>>=
rolenv@MinimumTagLength = as.integer(1)
(res2 = FilterResults(run=rolenv, results=res))$sread
str = as.matrix(res$sread[241:253])
nt = DNA_ALPHABET
post.entropy = matrix(0, nrow=nrow(str), ncol=36)
post.entropy[which(str %in% nt[5:10])] = 1
post.entropy[which(str %in% nt[11:14])] = log2(3)
post.entropy[which(str == 'N')] = 2
matplot(1:36, y=apply(post.entropy, 1, cumsum), t='l', lty=1, col=rainbow(6),
        ylim=c(0,30), xlim=c(1,36), xlab="cycles", ylab="cumulative entropy",
        main="Tag length cutoff")
lines(1:36, rolenv@IThresholds, t='l', lty=2, lwd=2, col="tomato")
abline(v=nchar(res2$sread[241:253]), col=rainbow(6), lty=2)
legend(x=0, y=30, res2$sread[241:253], col=rainbow(6), lty=1, bg="white", cex=.8)
@
\end{center}
The final step is to save the results:
<<>>=
SaveResults(run=rolenv, results=res2, outpath="./")
@

\section{Batch execution}\label{batch.sect}

The whole sequence of operations needed to load, analyse, filter and save a
sequencing run can be performed in parallel (using calls to the
\Rpackage{fork} package) via the function \Rfunction{ForkBatch}:
<<eval=FALSE>>=
library(fork)
ForkBatch(run=rolenv, path=path, outpath="./", prefix="rs_",
          nthreads=2, nfiles=5, lane=1, tiles=1, idsep="_")
@
Each of the \Robject{nthreads} threads will execute a call of the form:
<<eval=FALSE>>=
OneBatch(run=rolenv, path=path, lane=1, tiles=tiles[n:m],
         outpath="./", prefix="rs_")
@
This function can be used in a loop on single-processor systems or in
independent jobs distributed on a computing cluster.
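As an illustration, a purely sequential run could look like the following
sketch, which is not evaluated here: it simply wraps the \Rfunction{OneBatch}
call shown above in a loop over a hypothetical vector \Robject{all.tiles} of
tile numbers.
<<eval=FALSE>>=
## Sketch (not run): process tiles one at a time instead of forking.
## 'all.tiles' is a hypothetical vector of tile numbers for lane 1.
all.tiles = 1:5
for (tile in all.tiles) {
    OneBatch(run=rolenv, path=path, lane=1, tiles=tile,
             outpath="./", prefix="rs_")
}
@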
\section{Diagnostic plots}\label{plot.sect}

There are multiple possibilities for evaluating the quality of the base
calling, at the level of individual sequences, tiles or lanes.

Given a sequence tag, the corresponding raw intensities and a base quality
score, we can use \Rfunction{CombinedPlot}:
\begin{center}
<<fig=TRUE>>=
CombinedPlot(run=rolenv, int=int, seq=seq,
             scores=as(quality(seq_fastq), "matrix"),
             colonies=sample(1:nrow(int), 4),
             par=list(mfrow=c(2,2), cex=.6, mar=c(4, 4, 2, 1)+.1))
@
\end{center}
We can also evaluate the distribution of intensity values at selected cycles
via 1- and 2-dimensional projections:
\begin{center}
<<fig=TRUE>>=
ChannelHistogram(int=int, cycles=1, par=list(mfrow=c(2,2), mar=c(4, 4, 2, 1)+.1))
@
<<fig=TRUE>>=
par(mfrow=c(2,2), mar=c(4, 4, 2, 1)+.1)
PlotCycles(run=rolenv, int=int, seq=seq, cycles=c(1,20))
@
\end{center}
We can further look at global statistics of a base calling:
\begin{center}
<<fig=TRUE>>=
par(mfrow=c(2,2), cex=.8, mar=c(4, 4, 2, 1)+.1)
BatchAnalysis(run=rolenv, seq=res2$sread, scores=res2$entropy,
              what="length", main="Length distribution")
BatchAnalysis(run=rolenv, seq=res$sread, scores=res$entropy,
              what="ratio", main="Complementary bases ratio")
BatchAnalysis(run=rolenv, seq=res$sread, scores=res$entropy,
              what="iupac", main="IUPAC codes")
BatchAnalysis(run=rolenv, seq=res2$sread, scores=res2$entropy,
              what="information", main="Base quality")
@
\end{center}
Finally, we can visualize the positional bias over a tile:
\begin{center}
<<fig=TRUE>>=
par(mfrow=c(1,2))
TileImage(int=int, cycle=1, tile=readInfo(int)$tile[1], ncell=5, channel='A')
TileImage(int=int, cycle=1, tile=readInfo(int)$tile[1], ncell=5, channel='C')
@
\end{center}

\section{Session Information}

The version of R and the versions of the packages loaded to generate this
vignette were:
<<results=tex>>=
toLatex(sessionInfo())
@

\bibliographystyle{plain}
\bibliography{Rolexa-vignette}

\end{document}