%\VignetteIndexEntry{Data preprocessing and creation of the data objects pasillaGenes and pasillaExons}
\documentclass{article}

\usepackage{Sweave}
\usepackage[a4paper]{geometry}
\usepackage{hyperref,graphicx}
\usepackage{cite}
\usepackage{color}
\usepackage{url}
\usepackage{listings}

\SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,height=4.5,width=4} 
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{{\small\texttt{#1}}}
\newcommand{\fixme}[1]{{\textbf{Fixme:} \textit{\textcolor{blue}{#1}}}}

\title{\textsf{\textbf{Data preprocessing and creation of the data objects pasillaGenes and pasillaExons}}}
\author{Alejandro Reyes}
\begin{document}
\maketitle
\begin{abstract}
This vignette describes the steps that were followed for the generation 
of the data objects contained in the package \Rpackage{pasilla}.
\end{abstract}

\tableofcontents

%--------------------------------------------------
\section{Downloading the files}
%--------------------------------------------------
We used the RNA-Seq data from the publication by Brooks et al.~\cite{Brooks2010}.
The experiment investigated the effect of siRNA knock-down of pasilla, a gene 
that is known to bind to mRNA in the spliceosome, and which is thought to be involved 
in the regulation of splicing.  The data set contains 3 biological replicates of the 
knockdown as well as 4 biological replicates for the untreated control. 
Data files are publicly available in the NCBI Gene Expression Omnibus under the 
accession GSE18508\footnote{\url{http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508}}. 
The read sequences in FASTQ format were extracted from the NCBI short read archive file 
(.sra files), using the sra 
toolkit\footnote{\url{http://www.ncbi.nlm.nih.gov/books/NBK47540/\#SRA\_Download\_Guid\_B.5\_Converting\_SRA\_for}}.
\\
<<samples,echo=FALSE,results=hide>>=
tab = data.frame(
  file = c("treated1fb", "treated2fb", "treated3fb", "untreated1fb", "untreated2fb", "untreated3fb", "untreated4fb"),
  type = c("single-read", "paired-end", "paired-end", "single-read", "single-read", "paired-end", "paired-end"),
  "number of lanes" = as.integer(c(5,2,2,2,6,2,2)),
  "total number of reads" = c("35158667", "12242535 (x2)", "12443664 (x2)", "17812866", "34284521", "10542625 (x2)", "12214974 (x 2)"),
  "exon counts" = as.integer(c(15679615, 15620018, 12733865, 14924838, 20764558, 10283129, 11653031)),
  stringsAsFactors = TRUE,
  check.names = FALSE)
library("xtable")
print(xtable(tab, caption = "Read numbers and alignment statistics. The column \\emph{exon counts} refers to the number of reads that could be uniquely aligned to an exon.", label="tab:samples"), file="create_objects_tabsamples.tex")
@ 

%--------------------------------------------------
\section{Read alignment and filtering}
%--------------------------------------------------

The reads in the FASTQ files were aligned using tophat version 1.2.0
with default parameters against the reference Drosophila melanogaster
genome.  Table~\ref{tab:samples} summarizes the read number and
alignment statistics.

\input{create_objects_tabsamples.tex}

The reference genome fasta files were obtained from the Ensembl ftp 
server\footnote{\url{http://www.ensembl.org/info/data/ftp/index.html}}. 
We ran \texttt{bowtie-build} to index the
fasta file. For more information on this procedure see
the bowtie webpage\footnote{\url{http://bowtie-bio.sourceforge.net/tutorial.shtml}}.
The indexed form is required by bowtie, and thus tophat. 

\begin{lstlisting}
wget ftp://ftp.ensembl.org/pub/release-62/fasta/drosophila_melanogaster/ \
dna/Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa.gz

gunzip Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa.gz
bowtie-build Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa \
    d_melanogaster_BDGP5.25.62
\end{lstlisting}

We generated the alignment BAM file using tophat.  For the single-reads data:

\begin{lstlisting}
tophat bowtie_index reads1.fastq,reads2.fastq,...,readsN.fastq
\end{lstlisting}
  
For the paired-end data:

\begin{lstlisting}
tophat -r inner-fragment-size bowtie_index \
    reads1_1.fastq,reads2_1.fastq,...,readsN_1.fastq \
    reads1_2.fastq,reads2_2.fastq,...,readsN_2.fastq
\end{lstlisting}

More information on tophat is provided on its webpage\footnote{\url{http://tophat.cbcb.umd.edu/tutorial.html}}.
The SAM alignment files from which \Rpackage{pasilla} was generated are
available at \url{http://www-huber.embl.de/pub/DEXSeq/analysis/brooksetal/bam/}. 

%--------------------------------------------------
\section{Exon count files}
%--------------------------------------------------
To generate the per-exon read counts, we first needed to define the exonic regions.  
To this end, we downloaded the file Drosophila\_melanogaster.BDGP5.25.62.gtf.gz from 
Ensembl\footnote{\url{ftp://ftp.ensembl.org/pub/release-62/gtf/drosophila\_melanogaster}}.
The script \texttt{dexseq\_prepare\_annotation.py} contained in the \Rpackage{DEXSeq}
package was used to extract the exons of the transcripts from the file, define new 
non-overlapping exonic regions and reformat it to create the file Dmel.BDGP5.25.62.DEXSeq.chr.gff
contained in pasilla/extdata. For example, for this file we ran:

\begin{lstlisting}
wget ftp://ftp.ensembl.org/pub/release-62/gtf/ \
drosophila_melanogaster/Drosophila_melanogaster.BDGP5.25.62.gtf.gz

gunzip Drosophila_melanogaster.BDGP5.25.62.gtf.gz
python dexseq_prepare_annotation.py Drosophila_melanogaster.BDGP5.25.62.gtf \
    Dmel.BDGP5.25.62.DEXSeq.chr.gff
\end{lstlisting}

To count the reads that fell into each non-overlapping exonic part, the script 
\texttt{dexseq\_count.py}, which is also contained in the \Rpackage{DEXSeq} package, was used. 
It took the alignment results in the form of a SAM file
(sorted by position in the case of a paired end data) and the \texttt{gtf} file 
Dmel.BDGP5.25.62.DEXSeq.chr.gff and returned one file for each biological replicate 
with the exon counts. For example, for the file treated1.bam, which contained single-end
alignments, we ran:

\begin{lstlisting}
samtools index treated1.bam
samtools view treated1.bam > treated1.sam
python dexseq_count.py Dmel.BDGP5.25.62.DEXSeq.chr.gff \
    treated1.sam treated1fb.txt
\end{lstlisting}

For the file treated2.bam, which contained paired-end alignments:

\begin{lstlisting}
samtools index treated2.bam
samtools view treated2.bam > treated2.sam
sort -k1,1 -k2,2n treated2.sam > treated2_sorted.sam
python dexseq_count.py -p yes Dmel.BDGP5.25.62.DEXSeq.chr.gff \
    treated2_sorted.sam treated2fb.txt
\end{lstlisting}

The output of the two HTSeq python scripts is provided in the 
\Rpackage{pasilla} package:
%
<<pasilla,results=hide>>=
library("pasilla")
<<extdata>>=
inDir = system.file("extdata", package="pasilla", mustWork=TRUE)
dir(inDir)
@ 
%
The Python scripts are built upon the HTSeq 
library\footnote{\url{http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html}}.
  
 
%--------------------------------------------------
\section{Creation of the \Rclass{ExonCountSet} \Robject{pasillaExons}}
%--------------------------------------------------
<<samples1, echo=FALSE>>=
z = sub("fb$", "", tab$file)
nc = nchar(z)
samples = data.frame(
  condition = substr(z, 1, nc-1),
#  replicate = as.integer(substr(z, nc, nc)),
  type = tab$type,
  row.names = tab$file,
  stringsAsFactors = TRUE,
  check.names = FALSE)

#stopifnot(!any(is.na(samples$replicate)))
@ 
%$
To create an \Rclass{ExonCountSet} object, we started with a data frame
\Robject{samples} that contained the sample annotations, as in
Table~\ref{tab:samples}.

<<samples>>=
head(samples)
@ 

We also needed the annotation file with the per exon annotation.
%
<<ecs>>=
annotationfile = file.path(inDir, "Dmel.BDGP5.25.62.DEXSeq.chr.gff")
@

With these, we could call the function \Rfunction{read.HTSeqCounts} to construct the object \Robject{ecs}.

<<read, results=hide>>=
library("DEXSeq")
ecs = read.HTSeqCounts(countfiles = file.path(inDir, paste(rownames(samples), "txt", sep=".")), 
           design = samples, 
           flattenedfile = annotationfile)
sampleNames(ecs) = rownames(samples)
@ 

We only wanted to work with data from a subset of genes, which was defined in the following file.
%
<<gfs>>=
genesforsubset = readLines(file.path(inDir, "geneIDsinsubset.txt"))
pasillaExons = subsetByGenes(ecs, genes=genesforsubset)
@

We added the experiment data:
<<expdata>>=
expdata = new("MIAME", 
   name="pasilla knockdown", 
   lab="Genetics and Developmental Biology, University of Connecticut Health Center", 
   contact="Dr. Brenton Graveley", 
   title="modENCODE Drosophila pasilla RNA Binding Protein RNAi knockdown RNA-Seq Studies", 
   url="http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508", 
   abstract="RNA-seq of 3 biological replicates of from the Drosophila melanogaster 
      S2-DRSC cells that have been RNAi depleted of mRNAs encoding pasilla, a mRNA binding 
      protein and 4 biological replicates of the the untreated cell line.")
   pubMedIds(expdata) <- "20921232"
experimentData(pasillaExons) <- expdata
@

%--------------------------------------------------
\section{Creation of the \Rclass{CountDataSet} \Robject{pasillaGenes}}
%--------------------------------------------------
The \Rclass{CountDataSet} class is analogous to the
\Rclass{ExonCountSet} class; 
the latter is specifically designed to store exon level counts,
while the \Rclass{CountDataSet} class is useful more generally for whatever one wishes to count
(e.\,g.\ ChIP peaks, gene levels counts).  We made use of the
function \Rfunction{geneCountTable} from the package \Rpackage{DEXSeq} to get a data frame
containing the number of reads falling on each of the genes. We used the function \Rfunction{newCountDataSet} to
create the object \Robject{pasillaGenes}.

<<DESeq, results=hide>>=
library("DESeq")
genetable = geneCountTable(ecs)

pasillaGenes = newCountDataSet(genetable, 
  conditions = samples)

experimentData(pasillaGenes) = expdata
@

We saved the objects in the data directory of the package:
<<save,eval=FALSE>>=
save(pasillaExons, file=file.path("..", "..", "data", "pasillaExons.RData"))
save(pasillaGenes, file=file.path("..", "..", "data", "pasillaGenes.RData"))
@ 

%---------------------------------------------------------
% SessionInfo
%---------------------------------------------------------
\begin{table*}[tbp]
\begin{minipage}{\textwidth}
<<sessionInfo, results=tex, print=TRUE>>=
toLatex(sessionInfo())
@ 
\end{minipage}
\caption{\label{tab:sessioninfo}%
The output of \Rfunction{sessionInfo} on the build system 
after running this vignette.}
\end{table*}


%--------------------------------------------------
\bibliography{library}
\bibliographystyle{plain}
\end{document}