%\VignetteIndexEntry{Analyzing RNA-seq data for differential exon usage with the "DEXSeq" package}
%\VignettePackage{DEXSeq}
%\VignetteEngine{knitr::knitr}

% To compile this document
% library('knitr'); rm(list=ls()); knit('DEXSeq.Rnw')

\documentclass[12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{titling}

<<knitr, echo=FALSE, results='hide'>>=
library("knitr")
opts_chunk$set(tidy=FALSE,dev="pdf",fig.show="hide",
               fig.width=4,fig.height=4.5,
               message=FALSE)
@ 

<<style, eval=TRUE, echo=FALSE, results='asis'>>=
BiocStyle::latex()
@

% My changes to the Bioc style:
%\IfFileExists{inconsolata.sty}{\usepackage{inconsolata}}{\usepackage{zi4}}  % I like Inconsolata as terminal font
%\renewcommand{\familydefault}{\rmdefault}  % I want serifs in my body text
%\renewcommand{\maketitlehooka}{\sffamily\bfseries} % The title should be sans-serif, though
%\renewcommand{\maketitlehookb}{\rmfamily\mdseries}  % But not the authors
%\fancyhead[R]{\sffamily\small\thepage}  % The header should not be the same font and size as the body,
%\fancyhead[L]{\sffamily\small\thetitle} % to set the two visually apart.

\newcommand{\tttildemiddle}{\raise.17ex\hbox{$\scriptstyle\mathtt{\sim}$}}

\usepackage[sort]{cite}
\usepackage{xstring}
\usepackage[fleqn]{amsmath}


\author{Alejandro Reyes, Simon Anders, Wolfgang Huber\\[1em]
  European Molecular Biology Laboratory (EMBL),\\
  Heidelberg, Germany}

\title{Inferring differential exon usage in RNA-Seq data with the DEXSeq package}

\date{}
%\date{\Rpackage{DEXSeq} version \Sexpr{packageDescription("DEXSeq")$Version} %
%(Last revision \StrMid{$Date: 2014-10-23 06:00:24 -0700 (Thu, 23 Oct 2014) $}{8}{18})}

\begin{document}

\maketitle

\noindent This vignette describes version \Sexpr{packageDescription("DEXSeq")$Version} of the \Rpackage{DEXSeq} package.

\noindent Last revision of this document: \StrMid{$Date: 2014-10-23 06:00:24 -0700 (Thu, 23 Oct 2014) $}{8}{18}
% Note: The preceding line uses SVN's keyword substitution mechanism: The text between "$Date:" and the second "$" 
% is automatically replaced by a current time stamp when "svn ci" is used. The \StrMid function (from the 
% xstring package) takes out the relevant part. (To activate keyword substitution for another file, use
% "svn propset svn:keywords Date filename.txt".)

<<options,results='hide',echo=FALSE>>=
options(digits=2, width=80, prompt=" ", continue=" ")
@

\tableofcontents

%-----------------------------------------------------------
\section{Overview}\label{sec:praeludium}
%-----------------------------------------------------------

The Bioconductor package \Rpackage{DEXSeq} implements a method to test for differential exon usage
in comparative RNA-Seq experiments. By \emph{differential exon usage} (DEU), we mean changes in the relative 
usage of exons caused by the experimental condition. The relative usage of an exon is defined as
\begin{equation}\label{eq:reu}
  \frac{\text{number of transcripts from the gene that contain this exon}}%
  {\text{number of all transcripts from the gene}}.
\end{equation}
The statistical method used by \Rpackage{DEXSeq} was introduced in our paper \cite{DEXSeqPaper}. The basic concept can 
be summarized as follows. For each exon (or part of an exon) and each sample, we count how many 
reads map to this exon and how many reads map to any of the other exons of the same gene. 
We consider the ratio of these two counts, and how it changes across conditions, to infer changes in 
the relative exon usage~(\ref{eq:reu}).
In the case of an inner exon, a change in relative exon usage is typically due to a change
in the rate with which this exon is spliced into transcripts (alternative splicing). 
Note, however, that DEU is a more
general concept than alternative splicing, since it also includes 
changes in the usage of alternative transcript start sites
and polyadenylation sites, which can cause
differential usage of  exons at the 5' and 3' boundary of transcripts.

As with differential gene expression, we need to make sure that observed 
differences of values of the ratio~(\ref{eq:reu}) between conditions
are statistically significant, i.\,e., are sufficiently unlikely to be just due to random fluctuations such as those 
seen even between samples from the same condition, i.\,e., between replicates. 
To this end, \Rpackage{DEXSeq} assesses the strength of these fluctuations 
(quantified by the so-called \emph{dispersion}) by comparing replicates 
before comparing the averages between the sample groups.

The preceding description is somewhat simplified (and perhaps over-simplified), and we recommend that 
users consult the paper \cite{DEXSeqPaper} for a more complete description, as well as Appendix~\ref{changes} of 
this vignette, which describes how the current implementation of \Rpackage{DEXSeq} differs from 
the original approach described in the paper. Nevertheless, two important aspects should be 
mentioned already here: First, \Rpackage{DEXSeq} does not actually work 
on the ratios~(\ref{eq:reu}), but on the counts in the numerator
and denominator, to be able to make use of the information that is
contained in the magnitude of count values. (3000 reads versus 1000 reads is the same ratio as 3 reads
versus 1 read, but the latter is a far less reliable estimate of the
underlying true value, because of statistical sampling.) Second, \Rpackage{DEXSeq} is not limited to
simple two-group comparisons; rather, it uses so-called generalized linear models
(GLMs) to permit ANOVA-like analysis of potentially complex experimental designs.
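
As a rough illustration of the first point, consider a simplified sketch that assumes pure binomial sampling and ignores the overdispersion that \Rpackage{DEXSeq} actually models. If $k$ of a total of $n$ reads fall onto the exon, the estimated proportion $\hat{p}=k/n$ has an approximate standard error of
\begin{equation*}
  \mathrm{se}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}.
\end{equation*}
With 3000 reads versus 1000 reads, $\hat{p}=0.75$ and $n=4000$, which gives $\mathrm{se}(\hat{p})\approx 0.007$. With 3 reads versus 1 read, $\hat{p}=0.75$ and $n=4$, which gives $\mathrm{se}(\hat{p})\approx 0.22$. The ratio is identical, but under these assumptions the second estimate is roughly thirty times less precise.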

\section{Preparations} \label{preps}

\subsection{Example data}

To demonstrate the use of \emph{DEXSeq}, we use the \emph{pasilla} dataset, an
RNA-Seq dataset generated by Brooks et al.~\cite{Brooks2010}. They investigated the effect of 
siRNA knock-down of the gene \emph{pasilla} on the transcriptome of fly S2-DRSC cells. The 
RNA-binding protein \emph{pasilla} is thought to be involved
in the regulation of splicing. (Its mammalian orthologs, NOVA1 and NOVA2, are well-studied examples
of splicing factors.) Brooks et al.\ prepared seven cell cultures, treated three with siRNA to knock
down \emph{pasilla} and left four as untreated controls, and performed RNA-Seq on all samples. They 
deposited the raw sequencing reads with the NCBI Gene Expression Omnibus (GEO) under the accession number
GSE18508.\footnote{\url{http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508}}

\paragraph{Executability of the code.} 
Usually, Bioconductor vignettes contain automatically executable code, i.\,e., you can follow the vignette 
by directly running the code shown, using functionality and data provided with the package. However, it would not be
practical to include the voluminous raw data of the pasilla experiment here. Therefore, the
code in this section is not automatically executable. You may download the raw data yourself 
from GEO, as well as the required extra tools, 
and follow the work flow shown here and in the \Rpackage{pasilla} vignette~\cite{pasillaVignette}. 
From Section~\ref{stdAnalysis} on, code is directly executable, as usual. 
Therefore, we recommend that you just read this section, and try following our analysis in R only from
the next section onwards. Once you work with your own data, you will want to come back and adapt the 
work flow shown here to your data.

\subsection{Alignment}
The first step of the analysis is to align the reads to a reference genome. It is important
to align them to the genome, not to the transcriptome, 
and to use a splice-aware aligner (i.\,e., a short-read
alignment tool that can deal with reads that span across introns) such as TopHat2 \cite{TopHat2}, 
GSNAP \cite{GSNAP}, or STAR \cite{STAR}. The explanation
of the analysis work-flow presented here starts with the aligned reads in the SAM format. If you
are unfamiliar with the process of aligning reads to obtain SAM files, you can find a summary of how
we proceeded in preparing the pasilla data in the vignette for the \Rpackage{pasilla} data 
package~\cite{pasillaVignette} and a more extensive explanation, using the same  data set, in our 
protocol article on differential expression calling~\cite{DEProt}.

\subsection{HTSeq} \label{HTSeq}

The initial steps of a \emph{DEXSeq} analysis, described in the following two sections,
are typically done outside R, by using two provided Python scripts. You do not need
to know how to use Python; however, you have to install the Python package \software{HTSeq},
following the explanations given on the HTSeq web page:

\url{http://www-huber.embl.de/users/anders/HTSeq/doc/install.html}

Once you have installed \emph{HTSeq}, you can use the two Python scripts, \verb|dexseq_prepare_annotation.py|
(described in Section~\ref{prepAnno}) and \verb|dexseq_count.py| (Section~\ref{sec:counting}), that 
come with the \Rpackage{DEXSeq} package. If you have trouble finding them, start R and ask for 
the installation directory with
%
<<systemFile>>=
pythonScriptsDir = system.file( "python_scripts", package="DEXSeq" )
list.files(pythonScriptsDir)
@
<<systemFileCheck,echo=FALSE,results='hide'>>=
system.file( "python_scripts", package="DEXSeq", mustWork=TRUE )
@
%
The displayed path should contain the two files. If it does not, try to re-install \Rpackage{DEXSeq} 
(as usual, with \Rfunction{biocLite}).

An alternative work flow, which replaces the two Python-based steps with R-based code, is also available
and is demonstrated in the vignette of the \Rpackage{parathyroidSE} package~\cite{parathyroidSEVignette}. 


\subsection{Preparing the annotation} \label{prepAnno}

The Python scripts expect a GTF file with gene models for your species. We have tested our tools chiefly
with GTF files from Ensembl and hence recommend using these, as files from other providers sometimes do 
not adhere fully to the GTF standard and cause the preprocessing to fail.
Ensembl GTF files can be found in the ``FTP Download'' sections of
the Ensembl web sites (i.\,e., Ensembl, EnsemblPlants, EnsemblFungi, etc.). Make sure that your GTF 
file uses a coordinate system that matches the reference genome that you have used for aligning your 
reads. (The safest way to ensure this is to download the reference genome from Ensembl, too.)
If you cannot use an Ensembl GTF file, see Appendix~\ref{GTF} for advice on converting GFF files from other
sources to make them suitable as input for the  \verb|dexseq_prepare_annotation.py| script.

In a GTF file, many exons appear multiple times, once for each transcript that contains them. We
need to ``collapse'' this information to define \emph{exon counting bins}, i.\,e., a list of intervals, 
each corresponding to one exon or part of an exon. Counting bins for parts of exons arise when 
an exonic region appears with different boundaries in different transcripts. See Figure~1 of the 
DEXSeq paper~\cite{DEXSeqPaper} for an illustration. The Python script \verb|dexseq_prepare_annotation.py| 
takes an Ensembl GTF file and translates it into a GFF file with collapsed exon counting bins.

Make sure that your current working directory contains the GTF file and call the script 
(from the command line shell, not from within R) with

\begin{verbatim}
python /path/to/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py 
    Drosophila_melanogaster.BDGP5.72.gtf Dmel.BDGP5.25.62.DEXSeq.chr.gff
\end{verbatim}

In this command, which should be entered as a single line, replace
\verb|/path/to|...\verb|/python_scripts| with the correct path to the Python scripts,
which you have found with the call to \Rfunction{system.file} shown above.
\verb|Drosophila_melanogaster.BDGP5.72.gtf| is the Ensembl GTF file (here the one for 
fruit fly, already de-compressed) and \verb|Dmel.BDGP5.25.62.DEXSeq.chr.gff| is the name of the output
file.

In the process of forming the counting bins, the script might come across overlapping genes. If two genes on the
same strand are found with an exon of the first gene overlapping with an exon of the second gene, the script's
default behaviour is to combine the genes into a single ``aggregate gene'' which is subsequently referred to
with the IDs of the individual genes, joined by a plus ('+') sign. If you do not like this behaviour, you can
disable aggregation with the option ``\verb|-r no|''. Without aggregation, exons that overlap with other exons from
different genes are simply skipped.


\subsection{Counting reads}\label{sec:counting}

For each SAM file, we next count the number of reads that overlap with each of the exon counting bins defined
in the flattened GFF file. This is done with the script \verb|dexseq_count.py|:

\begin{verbatim}
python /path/to/library/DEXSeq/python_scripts/dexseq_count.py 
    Dmel.BDGP5.25.62.DEXSeq.chr.gff untreated1.sam untreated1fb.txt
\end{verbatim}

This command (again, to be entered as a single line) expects two files in the current working directory, namely the GFF file produced in the previous
step (here \verb|Dmel.BDGP5.25.62.DEXSeq.chr.gff|) and a SAM file with the aligned reads from a sample (here the
file \verb|untreated1.sam| with the aligned reads from the first control sample). The command generates an
output file, here called \verb|untreated1fb.txt|, with one line for each exon counting bin defined in
the flattened GFF file. The lines contain the exon counting bin IDs (which are composed of gene IDs and
exon bin numbers), followed by an integer number which indicates the number of reads that were aligned such that
they overlap with the counting bin.

Use the script multiple times to produce a count file from each of your SAM files.
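
If you have many samples, the repeated calls can also be scripted from within R. The following chunk is a minimal sketch rather than part of the tested work flow. It assumes that \texttt{python} is on the search path and that the SAM files are named after the samples (e.g., \verb|untreated1.sam|) and lie in the current working directory. It uses the base R function \Rfunction{system2} and is not evaluated when this vignette is built. Any of the options discussed below (for instance those for paired-end data) would have to be inserted before the file names.

<<countAllSamplesSketch, eval=FALSE>>=
# Locate the counting script shipped with DEXSeq
pythonScriptsDir = system.file( "python_scripts", package="DEXSeq" )
countScript = file.path( pythonScriptsDir, "dexseq_count.py" )

# Sample names; the <sample>.sam naming scheme is an assumption
samples = c( "treated1", "treated2", "treated3",
   "untreated1", "untreated2", "untreated3", "untreated4" )

# One call of the script per sample, writing <sample>fb.txt
for( s in samples ) {
   system2( "python", args = c( shQuote(countScript),
      "Dmel.BDGP5.25.62.DEXSeq.chr.gff",
      paste0( s, ".sam" ), paste0( s, "fb.txt" ) ) )
}
@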

There are a number of crucial points to pay attention to when using the \verb|dexseq_count.py| script:

\emph{Paired-end data:} If your data is from a paired-end sequencing run, you need to 
add the option ``\verb|-p yes|'' to 
the command to call the script. (As usual, options have to be placed before the file names, surrounded by spaces.)
In addition, the SAM file needs to be sorted, either by read name or by position. Most aligners produce
sorted SAM files; if your SAM file is not sorted, use \verb|samtools sort -n| to sort by read name
(or \verb|samtools sort| to sort by position). (See, e.g., reference~\cite{DEProt} if you need further explanations
on how to sort SAM files.) Use the option ``\verb|-r pos|'' or ``\verb|-r name|'' to indicate whether
your paired-end data is sorted by alignment position or by read name.\footnote{The possibility to process
paired-end data from a file sorted by position is based on recent contributions of Paul-Theodor Pyl
to \software{HTSeq}.}

\emph{Strandedness:} By default, the counting script assumes your library to be \emph{strand-specific}, 
i.e., reads are aligned to
the same strand as the gene they originate from. If you have used a library preparation protocol that does
not preserve strand information (i.e., reads from a given gene can appear equally likely on either strand),
you need to inform the script by specifying the option ``\verb|-s no|''. If your library preparation
protocol reverses the strand (i.e., reads appear on the strand opposite to their gene of origin), use 
``\verb|-s reverse|''. In case of paired-end data, the default (\verb|-s yes|) means that the read from the
first sequence pass is on the same strand as the gene and the read from the second pass on the opposite strand
(``forward-reverse'' or ``fr'' order in the parlance of the Bowtie/TopHat manual) and the option \verb|-s reverse|
specifies the opposite case.

\emph{SAM and BAM files:} By default, the script expects its input to be in plain-text SAM format.
However, it can also read BAM files, i.e., files in the compressed binary variant of the SAM format.
If you wish to do so, use the option ``\verb|-f bam|''. This works only if you have installed the Python
package \software{pysam}, which can be found at \url{https://code.google.com/p/pysam/}.

\emph{Alignment quality:} The script takes a further option, \verb|-a|, to specify the 
minimum alignment quality (as given in the fifth column of the SAM file). All reads with a lower quality 
than specified (with default \verb|-a 10|) are skipped.

\emph{Help pages:} Calling either script without arguments displays a help page with an overview of all
options and arguments.


\subsection{Reading the data into R} \label{sec:readData}

The remainder of the analysis is now done in \R. We will use 
the output of the Python scripts for the \emph{pasilla} experiment, 
which can be found in the package \Rpackage{pasilla}. 
Open an \R session and type:

%
<<loadDEXSeq>>=
inDir = system.file("extdata", package="pasilla")
countFiles = list.files(inDir, pattern="fb.txt$", full.names=TRUE)
countFiles
flattenedFile = list.files(inDir, pattern="gff$", full.names=TRUE)
flattenedFile
@
%
Now, we need to prepare a sample table. This table should contain one row for each library, and columns for
all relevant information such as name of the file with the read counts, experimental conditions, technical
information and further covariates. To keep this vignette simple, we construct the table on the fly.
%
<<sampleTable>>=
sampleTable = data.frame(
   row.names = c( "treated1", "treated2", "treated3", 
      "untreated1", "untreated2", "untreated3", "untreated4" ),
   condition = c("knockdown", "knockdown", "knockdown",  
      "control", "control", "control", "control" ),
   libType = c( "single-end", "paired-end", "paired-end", 
      "single-end", "single-end", "paired-end", "paired-end" ) )
@
%
While this is a simple way to prepare the table, it may be less error-prone and
more prudent to use an existing table that had already been prepared 
when the experiments were done, save it in CSV format and use the R function 
\Rfunction{read.csv} to load it.
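
A minimal sketch of this alternative is shown below. The file name \texttt{sampleTable.csv} is hypothetical, and we assume that the first column of the file holds the sample names and that the remaining columns are named as in the table above. The chunk is not evaluated when this vignette is built.

<<sampleTableFromCsvSketch, eval=FALSE>>=
# Hypothetical file; the first column is assumed to contain the sample names
sampleTable = read.csv( "sampleTable.csv", row.names=1,
   stringsAsFactors=FALSE )

# Basic sanity checks before the table is used any further
stopifnot( "condition" %in% colnames(sampleTable) )
stopifnot( nrow(sampleTable) == length(countFiles) )
@

The two checks only catch gross mistakes (a missing \texttt{condition} column, or a number of rows that does not match the number of count files). They do not verify that the rows are in the same order as the count files.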

In any case, it is vital to check the table carefully for correctness.
%
<<displaySampleTable>>=
sampleTable
@
%
Our table contains the sample names as row names and the two covariates that 
vary between samples: first the experimental condition (factor \texttt{condition} 
with levels \texttt{control} and \texttt{knockdown}) and the library type 
(factor \texttt{libType}), which we included because the samples in this 
particular experiment were sequenced partly in single-end runs and partly 
in paired-end runs. 

For now, we will ignore this latter piece of information, and postpone 
the discussion of how to include such additional covariates to Section~\ref{sec:glm}. 
If you have only a single covariate and want to perform a simple analysis, 
the column with this covariate should be named \texttt{condition}.

Now, we construct a \Rclass{DEXSeqDataSet} object from this data. This object holds all
the input data and will be passed along the stages of the subsequent analysis.
We construct the object with the \Rpackage{DEXSeq} function \Rfunction{DEXSeqDataSetFromHTSeq}, 
as follows:
%
<<makeecs, eval=TRUE>>=
suppressPackageStartupMessages( library( "DEXSeq" ) )

dxd = DEXSeqDataSetFromHTSeq(
   countFiles,
   sampleData=sampleTable,
   design= ~ sample + exon + condition:exon,
   flattenedfile=flattenedFile )
@   
%
The function takes four arguments. The first is a vector with the names of the count files, i.e., of the files
that have been generated with the \verb|dexseq_count.py| script. The function will read these
files and arrange the count data in a matrix, which is stored in the \Rclass{DEXSeqDataSet}
object \Robject{dxd}. The second argument is our sample table, with one row for each of the 
files listed in the first argument. This information is simply stored as is in the object's
meta-data slot (see below). The third argument is a formula of the form 
\verb|~ sample + exon + condition:exon|, which specifies an interaction between a variable from the 
sample table columns and the \verb|exon| variable. With this formula, we are 
interested in differences in exon usage due to changes in the \verb|condition| variable. Later in this vignette,
we will show how to add additional variables for complex designs. The fourth argument is the name
of the flattened GFF file that was generated with \verb|dexseq_prepare_annotation.py| 
and used as input to \verb|dexseq_count.py| when creating the count files.

There are other ways to get a \Rpackage{DEXSeq} analysis started. See Appendix \ref{sec:creatingInR} and 
Ref.~\cite{parathyroidSEVignette} for details.

\section{Standard analysis work-flow} \label{stdAnalysis}

\subsection{Loading and inspecting the example data}

To demonstrate the \Rpackage{DEXSeq} work flow, we will use the \Robject{DEXSeqDataSet} constructed 
in the previous section. However, in order to keep the run-time of this vignette small, we will 
subset the object to only a few genes.

%
<<start>>=
genesForSubset = read.table( 
  file.path(inDir, "geneIDsinsubset.txt"), 
  stringsAsFactors=FALSE)[[1]]

dxd = dxd[geneIDs( dxd ) %in% genesForSubset,]
@
%

The \Rclass{DEXSeqDataSet} class is derived from the \Rclass{DESeqDataSet}. As
such, it contains the usual accessor functions for the column data, row data, 
and some specific ones. The core data in a \Rclass{DEXSeqDataSet} 
object are the counts per exon. Each row of the 
\Rclass{DEXSeqDataSet} contains in each column the count data from a given 
exon ('this') as well as the count data from the sum of the other exons 
belonging to the same gene ('others'). This annotation, as well
as all the information regarding each column of the \Rclass{DEXSeqDataSet}, is 
specified in the colData.

<<seeColData>>=
colData(dxd)
@

We can access the first 5 rows from the count data by doing, 

<<seeCounts>>=
head( counts(dxd), 5 )
@

Notice that the number of columns is 14: the first seven (we have seven samples) 
correspond to the number of reads mapping to our exonic regions, and the
last seven correspond to the sum of the counts mapping to the rest of the exons
from the same gene in each sample.

<<seeSplitted>>=
split( seq_len(ncol(dxd)), colData(dxd)$exon )
@

We can also access only the first five rows of the counts belonging to the exonic regions
('this') (without showing the sum of counts from the rest of the exons from the same gene)
by doing,

<<seeCounts2>>=
head( featureCounts(dxd), 5 )
@


%
In both cases, the rows are labelled with gene IDs (here Flybase IDs), 
followed by a colon and the counting bin number. (As a counting bin 
corresponds to an exon or part of an exon, this ID is called the \emph{feature ID} 
or \emph{exon ID} within \Rpackage{DEXSeq}.) The table content indicates the
number of reads that have been mapped to each counting bin in the respective sample.

To see details on the counting bins, we also print the first
3 lines of the feature data annotation:
%
{\small
<<fData>>=
head( rowData(dxd), 3 )
@
}
%
So far, this table contains information on the annotation data, such as gene and exon IDs,
genomic coordinates of the exon, and the list of transcripts that contain an exon. 

The accessor function \Rfunction{sampleAnnotation} shows the design table with the sample 
annotation (which was passed as the second argument to \Rfunction{DEXSeqDataSetFromHTSeq}):
%
<<pData>>=
sampleAnnotation( dxd )
@
%

In the following sections, we will update the object by calling a number of analysis functions,
always using the idiom ``\verb| dxd = |\textit{\texttt{someFunction}}\verb|( dxd )|'', which
takes the \verb|dxd| object, fills in the results of the performed computation and writes the returned
and updated object back into the variable \verb|dxd|.

 %--------------------------------------------------
\subsection{Normalisation}\label{sec:norm}
%--------------------------------------------------
Different samples might be sequenced with different depths. In order to adjust for such coverage biases,
we estimate \emph{size factors}, which measure relative sequencing depth.  \Rpackage{DEXSeq} uses the same method
as \Rpackage{DESeq} and \Rpackage{DESeq2}, which is provided in the function \Rfunction{estimateSizeFactors}.
%
<<sizeFactors1>>=
dxd = estimateSizeFactors( dxd )
@
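
For reference, the estimator used by \Rpackage{DESeq} is the median-of-ratios estimator. The notation below is introduced here for illustration only: $k_{ij}$ denotes the read count of feature $i$ in sample $j$, and $m$ is the number of samples. The size factor of sample $j$ is then
\begin{equation*}
  \hat{s}_j = \operatorname*{median}_{i} \;
  \frac{k_{ij}}{\left( \prod_{v=1}^{m} k_{iv} \right)^{1/m}},
\end{equation*}
where the median is taken over all features whose geometric mean across samples is non-zero. Each count is thus compared to a pseudo-reference sample, and the typical ratio serves as the measure of relative sequencing depth.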
%


%--------------------------------------------------
\subsection{Dispersion estimation}\label{sec:dispest}
%--------------------------------------------------
To test for differential exon usage, we need to estimate the variability of the data. This is necessary
to be able to distinguish technical and biological variation (noise) from real effects
on exon usage due to the different conditions.
The information on the strength of the noise is inferred from the biological replicates in the data set
and characterized by the so-called \emph{dispersion}. In RNA-Seq experiments the number of replicates is 
typically too small to reliably estimate variance or dispersion parameters individually exon by exon, 
and therefore, variance information is shared across exons and genes, in an intensity-dependent manner.

In this section, we discuss simple one-way designs: In this setting, samples with the same experimental 
condition, as indicated in the \texttt{condition} factor of the design table (see above), are 
considered as replicates -- and therefore, the design table needs to contain a column with the
name \verb|condition|. In Section~\ref{sec:glm}, we discuss how to treat more complicated experimental
designs which are not accommodated by a single \Robject{condition} factor.

To estimate the dispersions, \Rpackage{DEXSeq} uses the approach of the package \Rpackage{DESeq2}.
Internally, the functions from \Rpackage{DESeq2} are called, adapting the parameters of the functions for the specific 
case of the \Rpackage{DEXSeq} model. Briefly, per-exon dispersions are calculated using a Cox-Reid adjusted
profile likelihood estimation, then a dispersion-mean relation is fitted to these individual dispersion values, and
finally, the fitted values are taken as a prior in order to shrink the per-exon estimates towards the fitted
values. See the \Rpackage{DESeq2} paper for the rationale behind the shrinkage approach~\cite{deseq2}.

<<estDisp1>>=
dxd = estimateDispersions( dxd )
@
%

As a shrinkage diagnostic, the \Rclass{DEXSeqDataSet} inherits the method \Rfunction{plotDispEsts}, which allows us
to plot the per-exon dispersion estimates versus the mean normalised count, the resulting fitted values,
and the \emph{a posteriori} (shrunken) dispersion estimates (Figure~\ref{figure/fitdiagnostics-1}).

%
<<fitdiagnostics, dev='png', resolution=220>>=
plotDispEsts( dxd )
@
%
\incfig{figure/fitdiagnostics-1}{.5\textwidth}{Fit Diagnostics.}{The initial per-exon dispersion estimates 
  (shown by black points), the fitted mean-dispersion function (red line), and the shrunken values
  in blue.}
%


%--------------------------------------------------
\subsection{Testing for differential exon usage}\label{sec:deu}
%--------------------------------------------------
Having the dispersion estimates and the size factors, we can now test for differential exon usage.
For each gene, \Rpackage{DEXSeq} fits a generalized linear model with the formula
\begin{equation}\label{eq:altmodel}
\mbox{\texttt{\tttildemiddle\ sample + exon + condition:exon}}
\end{equation}
and compares it to the smaller model (the null model)
\begin{equation}\label{eq:nullmodel}
\mbox{\texttt{\tttildemiddle\ sample + exon}.}
\end{equation}
In these formulae (which use the standard notation for linear model formulae in \R; consult a text book
on \R\ if you are unfamiliar with it), \verb|sample| is a factor with different levels for each sample,
\verb|condition| is the factor of experimental conditions that we defined when constructing the
\Rclass{DEXSeqDataSet} object at the beginning of the analysis, and \verb|exon| is a factor with
two levels, \verb|this| and \verb|others|, that were specified when we generated our \Rclass{DEXSeqDataSet}
object. The two models described by these formulae are fit for each counting bin, where the data 
supplied to the fit comprise \emph{two} read count values for each sample, corresponding to the 
two levels of the \Robject{exon} factor: the number of reads mapping to the bin in question 
(level \Robject{this}), and the sum of the read counts from all other bins of the same gene 
(level \Robject{others}). Note that this approach differs from the approach described in 
the paper~\cite{DEXSeqPaper} and used in older versions of \Rpackage{DEXSeq}; see Appendix~\ref{changes} 
for further discussion.

Readers familiar with linear model formulae might find one aspect of Equation~(\ref{eq:altmodel}) surprising: 
We have an interaction term \verb|condition:exon|, but denote no main effect for \verb|condition|.
Note, however, that all observations from the same sample are also from the same condition, i.e., the
\verb|condition| main effects are absorbed in the \verb|sample| main effects, because the \verb|sample| 
factor is nested within the \verb|condition| factor. 

The deviances of both fits are compared using a $\chi^2$-distribution, giving rise to a $p$ value.
Based on that, we can decide whether the null model
(\ref{eq:nullmodel}) is sufficient to explain the data, or whether it
may be rejected in favour of the alternative,
model~(\ref{eq:altmodel}), which contains an interaction coefficient for \verb|condition:exon|. The latter
means that the fraction of the gene's reads that fall onto the exon under the test differs significantly
between the experimental conditions.
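
The comparison of deviances can be sketched in plain R as follows. The deviance values are made up for illustration, and the chunk is not evaluated when this vignette is built. The actual computation is performed internally by the package and may differ in detail.

<<lrtSketch, eval=FALSE>>=
# Hypothetical deviances of the two fits for a single counting bin
devianceNull = 25.3   # reduced model: ~ sample + exon
devianceFull = 14.1   # full model:    ~ sample + exon + condition:exon

# With two conditions, the full model has one additional coefficient
df = 1

# p value from the upper tail of the chi-squared distribution
pchisq( devianceNull - devianceFull, df=df, lower.tail=FALSE )
@

The larger the drop in deviance achieved by the interaction term, the smaller the resulting $p$ value.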

The function \Rfunction{testForDEU} performs these tests for each exon in each gene.
%
<<testForDEU1,cache=TRUE>>=
dxd = testForDEU( dxd )
@

The resulting \Rclass{DEXSeqDataSet} object contains slots with information
regarding the test.  

%

For some uses, we may also want to estimate relative exon usage fold changes. To this end, 
we call \Rfunction{estimateExonFoldChanges}. Exon usage fold changes are 
calculated by fitting, for each gene, a GLM to the joint data of all its 
exons. The model frame can be found in the slot \verb|object@modelFrameBM| of a 
\Rclass{DEXSeqDataSet} object. The model \verb|~ sample + fitExpToVar * exon| is fitted.  
The resulting coefficients are arranged and reformatted in order to remove gene
expression effects (absorbed by the \verb|sample| variable in the formula), leaving 
only exon usage effects for each individual exon in each level of the parameter 
\verb|fitExpToVar|.

%
<<estFC,cache=TRUE>>=
dxd = estimateExonFoldChanges( dxd, fitExpToVar="condition")
@
%

So far in the pipeline, the intermediate and final results have been 
stored in the meta data of a \Rclass{DEXSeqDataSet} object; they can 
be accessed using the function \Rfunction{mcols}. In order to summarize
the results without showing the values from intermediate steps, we call the 
function \Rfunction{DEXSeqResults}. The result is a \Rclass{DEXSeqResults}
object, which is a subclass of a \Rclass{DataFrame} object. 

<<results1,cache=TRUE>>=
dxr1 = DEXSeqResults( dxd )
dxr1
@

The description of each of the columns of the \Rclass{DEXSeqResults} object 
can be found in the element meta data.

<<results2,cache=TRUE>>=
elementMetadata(dxr1)$description
@

From this object, we can ask how many exonic regions are significant with a false
discovery rate of 10\%:

%
<<tallyExons>>=
table ( dxr1$padj < 0.1 )
@
%
We may also ask how many genes are affected:
<<tallyGenes>>=
table ( tapply( dxr1$padj < 0.1, dxr1$groupID, any ) )
@
%
Remember that our example data set contains only a selection of genes. We have chosen these to
contain interesting cases; so the fact that such a large fraction of genes is significantly affected 
here is not typical.
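
To look at the hits themselves rather than just their number, the results object can be subset like any other \Rclass{DataFrame}. The following chunk is a small sketch that uses only the columns \texttt{padj} and \texttt{groupID} already used above. The variable name \Robject{sig} is introduced here for illustration, and the chunk is not evaluated when this vignette is built.

<<listSignificantSketch, eval=FALSE>>=
# Counting bins with an adjusted p value below 0.1;
# which() drops the rows for which padj is NA
sig = dxr1[ which( dxr1$padj < 0.1 ), ]

# Order the hits by adjusted p value, most significant first
sig = sig[ order( sig$padj ), ]

# IDs of the genes with at least one significant counting bin
unique( sig$groupID )
@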

To see how the power to detect differential exon usage depends on the number of reads that
map to an exon, a so-called MA plot is useful, which plots the logarithm of fold change versus average
normalized count per exon and marks by red colour the exons which are considered significant; here,
the exons with an adjusted $p$ value of less than 0.1 (Figure~\ref{figure/MvsA-1}).
There is of course nothing special about the number 0.1, and you can specify other thresholds in the call to \Rfunction{plotMA}.
%
<<MvsA, dev='png', resolution=200>>=
plotMA( dxr1, cex=0.8 )
@
\incfig{figure/MvsA-1}{.5\textwidth}{MA plot.}{
Mean expression versus $\log_2$ fold change plot. Significant hits (at \Robject{padj}<0.1) are coloured in red.
}

%$
%------------------------------------------------------------
\section{Additional technical or experimental variables}\label{sec:glm}
%------------------------------------------------------------
In the previous section we performed a simple analysis of differential exon usage, in which each sample was assigned
to one of two experimental conditions. If your experiment is of this type, you can use the work flow shown above. All
you have to make sure is that you indicate which sample belongs to which experimental condition when you construct
the \Rclass{DEXSeqDataSet} object (Section \ref{sec:readData}). Do so by means of a column called \verb|condition| in the sample table.

If you have a more complex experimental design, you can provide different or additional columns in the sample table. You
then have to indicate the design by providing explicit formulae for the test.

In the \Rpackage{pasilla} dataset, some samples were sequenced in single-end and others in paired-end mode. Possibly,
this influenced counts and should hence be accounted for. We therefore use this as an example for a complex design.

When we constructed the \Rclass{DEXSeqDataSet} object in Section \ref{sec:readData}, we provided in the sample table an
additional column called \verb|libType|, which has been stored in the object:
%
<<design>>=
sampleAnnotation(dxd)
@
%
We specify two design formulae, which indicate that the \verb|libType| \Rclass{factor} should be treated
as a blocking factor:
%
<<formulas2>>=
formulaFullModel    =  ~ sample + exon + libType:exon + condition:exon
formulaReducedModel =  ~ sample + exon + libType:exon 
@
%
Compare these formulae with the default formulae (\ref{eq:altmodel}, \ref{eq:nullmodel}) 
given in Section \ref{sec:deu}. We have added, in both 
the full model and the reduced model, the term \verb|libType:exon|. Therefore, any dependence of exon
usage on library type will be absorbed by this term and accounted for equally in the full
and the reduced model, and the likelihood ratio test comparing them will only detect differences
in exon usage that can be attributed to \verb|condition|, independent of \verb|libType|.
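
To make the relation between the two formulae concrete, the following sketch builds ordinary
R model matrices for a small made-up design with four samples. The object \Robject{toyDesign}
and its sample and level names are hypothetical, and \Rpackage{DEXSeq} constructs its own
model frames internally, so this is only an illustration of the formulae, not of the package's
implementation. It lists the columns that are present in the full but not in the reduced model.
%
<<sketchModelMatrix>>=
# hypothetical design: each sample contributes a row for "this" and "others"
toyDesign = expand.grid( exon = c( "this", "others" ),
   sample = paste0( "s", 1:4 ) )
toyDesign$condition = factor( 
   ifelse( toyDesign$sample %in% c( "s1", "s2" ), "untreated", "treated" ),
   levels = c( "untreated", "treated" ) )
toyDesign$libType = factor( 
   ifelse( toyDesign$sample %in% c( "s1", "s3" ), "single", "paired" ) )
mmFull    = model.matrix( ~ sample + exon + libType:exon + condition:exon, 
   toyDesign )
mmReduced = model.matrix( ~ sample + exon + libType:exon, toyDesign )
# columns that only the full model contains
onlyInFull = setdiff( colnames( mmFull ), colnames( mmReduced ) )
onlyInFull
@
%
All remaining columns, including those of the \verb|libType:exon| term, are shared by both matrices, 
which is why the comparison of the two fits isolates the effect of \verb|condition|.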

Next, we estimate the dispersions. This time, we need to inform the \Rfunction{estimateDispersions} function
about our design by providing the full model's formula, which should be used instead of the default 
formula (\ref{eq:altmodel}).
%
<<estDisps_again, cache=TRUE, results='hide'>>=
dxd = estimateDispersions( dxd, formula = formulaFullModel )
@

%
The test function now needs to be informed about both formulae:
<<test_again, cache=TRUE >>=
dxd = testForDEU( dxd, 
   reducedModel = formulaReducedModel, 
   fullModel = formulaFullModel )
@
%
Finally, we get a summary table, as before.
<<res_again>>=
dxr2 = DEXSeqResults( dxd )
@
%
How many significant DEU cases have we got this time?
<<table2>>=
table( dxr2$padj < 0.1 )
@
%
We can now compare with the previous result:
<<table3>>=
table( before = dxr1$padj < 0.1, now = dxr2$padj < 0.1 )
@

Accounting for the library type has allowed us to find six more hits, which
suggests that accounting for this covariate improves power for these data.

%--------------------------------------------------
\section{Visualization}
%--------------------------------------------------
The function \Rfunction{plotDEXSeq} provides a means to visualize the results of an analysis.
%
<<plot1, fig.height=8, fig.width=12>>=
plotDEXSeq( dxr2, "FBgn0010909", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
@
%
\incfig{figure/plot1-1}{\textwidth}{Fitted expression.}{
The plot represents the expression estimates from a call to \texttt{testForDEU}.
Shown in red is the exon that showed significant differential exon usage.
}
%
<<checkClaim,echo=FALSE>>=
wh = (dxr2$groupID=="FBgn0010909")
stopifnot(sum(dxr2$padj[wh] < formals(plotDEXSeq)$FDR)==1)
@
%
The result is shown in Figure~\ref{figure/plot1-1}. This plot shows the fitted expression values of each of the exons
of gene FBgn0010909, for each of the two conditions, treated and untreated.
The function \Rfunction{plotDEXSeq} expects at least two arguments, the \Rclass{DEXSeqResults} object and the gene ID.
The option \texttt{legend=TRUE} causes a legend to be included. The three remaining
arguments in the code chunk above are ordinary plotting parameters, which are simply handed over to \R's standard
plotting functions. They are not strictly needed and are included here to improve the appearance of
the plot. See the help page for \Rfunction{par} for details.

Optionally, one can also visualize the transcript models (Figure~\ref{figure/plot2-1}), which can be
useful for putting differential exon usage results into the context of isoform regulation.
%
<<plot2, fig.height=8, fig.width=12>>=
plotDEXSeq( dxr2, "FBgn0010909", displayTranscripts=TRUE, legend=TRUE,
   cex.axis=1.2, cex=1.3, lwd=2 )
@
%
\incfig{figure/plot2-1}{\textwidth}{Transcripts.}{
As in Figure~\ref{figure/plot1-1}, but including the annotated transcript models.}
%
Another useful option is to look at the count values from the individual samples, rather than at the
model effect estimates. For this display (option \texttt{norCounts=TRUE}), the counts are normalized by 
dividing them by the size factors (Figure~\ref{figure/plot3-1}).
%
<<plot3, fig.height=8, fig.width=12>>=
plotDEXSeq( dxr2, "FBgn0010909", expression=FALSE, norCounts=TRUE,
   legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
@
%
\incfig{figure/plot3-1}{\textwidth}{Normalized counts.}{
As in Figure~\ref{figure/plot1-1}, with normalized count values of each exon in each of the samples.}
%
As explained in Section~\ref{sec:praeludium}, \Rpackage{DEXSeq} is designed to find
changes in relative exon usage, i.\,e., changes in the expression of individual exons that are
not simply the consequence of overall up- or down-regulation of the gene. To visualize such changes, it is
sometimes advantageous to remove overall changes in expression from the
plots. Use the (somewhat misnamed) option \texttt{splicing=TRUE} for this purpose.
%
<<plot4, fig.height=8, fig.width=12>>=
plotDEXSeq( dxr2, "FBgn0010909", expression=FALSE, splicing=TRUE,
   legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
@
%
\incfig{figure/plot4-1}{\textwidth}{Fitted splicing.}{
The plot represents the estimated effects, as in Figure~\ref{figure/plot1-1},
but after subtraction of overall changes in gene expression.}
%
To generate an easily browsable, detailed overview of all analysis results,
the package provides an HTML report generator, implemented in the function \Rfunction{DEXSeqHTML}.
This function uses the package \Rpackage{hwriter} \cite{hwriter} to create a result table with links to plots for the
significant results, allowing a more detailed exploration of the results. 
%
<<DEXSeqHTML,cache=TRUE, eval=FALSE>>=
DEXSeqHTML( dxr2, FDR=0.1, color=c("#FF000080", "#0000FF80") )
@

%--------------------------------------------------
\section{Parallelization} \label{parallelization}
%--------------------------------------------------
\Rpackage{DEXSeq} analyses can be computationally heavy, especially with data
sets that comprise a large number of samples, or with genomes
containing genes with large numbers of exons.  While some steps 
of the analysis work on the whole data set, the computational load
of other steps can be parallelized. For this purpose, we use the package \Rpackage{BiocParallel}: 
the functions \Rfunction{estimateDispersions}, \Rfunction{testForDEU} and 
\Rfunction{estimateExonFoldChanges} accept a \Robject{BPPARAM} argument:

<<para1,cache=TRUE,results='hide', eval=FALSE>>=
BPPARAM = MulticoreParam(workers=4)
dxd = estimateSizeFactors( dxd )
dxd = estimateDispersions( dxd, BPPARAM=BPPARAM)
dxd = testForDEU( dxd, BPPARAM=BPPARAM)
dxd = estimateExonFoldChanges(dxd, BPPARAM=BPPARAM)
@


%--------------------------------------------------
\section{Perform a standard differential exon usage analysis in one command}
%--------------------------------------------------
In the previous sections, we went through the analysis step by step.
Once you are sufficiently confident about the work flow for your data,
its invocation can be streamlined by the wrapper function
\Rfunction{DEXSeq}, which runs the analysis shown
above through a single function call. 

In the simplest case, construct the \Rclass{DEXSeqDataSet} object as shown in Section \ref{preps}
or in Appendix \ref{sec:creatingInR}, then run \Rfunction{DEXSeq}, passing the
\Rclass{DEXSeqDataSet} object as the only argument. The function returns
a \Rclass{DEXSeqResults} object.
%

<<alldeu, cache=TRUE>>=
dxr = DEXSeq(dxd)
class(dxr)
@


\appendix
\clearpage
\begin{center}
{\Large\sffamily\bfseries\color{BiocBlue} APPENDIX} \addcontentsline{toc}{section}{APPENDIX}
\end{center}

%--------------------------------------------------
\section{Preprocessing within R}\label{sec:creatingInR}
%--------------------------------------------------

As an alternative to the approach described in Section \ref{preps}, users can also create  
\Rclass{DEXSeqDataSet} objects from other \Rpackage{Bioconductor} data objects.
The code implementing these functions was kindly contributed by Michael I.\ Love. For details, see the
\Rpackage{parathyroidSE} package vignette \cite{parathyroidSEVignette}. The work flow is similar to the one 
using the \Rpackage{HTSeq} Python scripts.

\emph{Note:} The code in this section is not run when the vignette is built, as some of the commands
have long run times. Therefore, no output is given.

We use functionality from the following Bioconductor packages:
<<buildExonCountSetLoadPacks,cache=TRUE, eval=FALSE>>=
library(GenomicRanges)
library(GenomicFeatures)
library(GenomicAlignments)
@
%
We demonstrate the workflow briefly (for more details, see \cite{parathyroidSEVignette}) on the data set
of Haglund et al.\ \cite{parathyroidPaper}, which is provided as example data in the 
\Rpackage{parathyroidSE} data package.

First, we download the current human gene model annotation from Ensembl via Biomart and create
a transcript database from it. Note that this step takes some time.
<<buildExonCountSetDownloadAnno,cache=TRUE, eval=FALSE>>=
hse = makeTranscriptDbFromBiomart( biomart="ensembl", 
   dataset="hsapiens_gene_ensembl" )
@
%
Next, we collapse the gene models into counting bins, analogous to Section~\ref{prepAnno}.
%
<<buildExonCountSetDisjoin,cache=TRUE, eval=FALSE>>=
exonicParts = disjointExons( hse, aggregateGenes=FALSE )
@
%
As before, we have to choose how to handle genes with overlapping exons. The \verb|aggregateGenes| option
here plays the same role as the \verb|-r| option to \verb|dexseq_prepare_annotation.py| described at the 
end of Section \ref{prepAnno}.
The \Robject{exonicParts} object is a \Rclass{GRanges} object that contains our counting bins. We use it to
count the number of read fragments that overlap with the bins by means of the function 
\Rfunction{summarizeOverlaps}. To demonstrate this, we first determine the paths to the
example BAM files in the \Rpackage{parathyroidSE} data package.
%
<<buildExonCountSet2FindBAMs,cache=TRUE, eval=FALSE>>=
bamDir = system.file( "extdata", package="parathyroidSE", mustWork=TRUE )
fls = list.files( bamDir, pattern="bam$", full.names=TRUE )
@
%
Then, use the following code to count the reads overlapping the bins.
%
<<buildExonCountSet2FindBAMs2,cache=TRUE, eval=FALSE>>=
bamlst = BamFileList( fls, index=character(), yieldSize=100000, obeyQname=TRUE )
SE = summarizeOverlaps( exonicParts, bamlst, mode="Union", singleEnd=FALSE, 
   ignore.strand=TRUE, inter.feature=FALSE, fragments=TRUE )
@
%
We can now build a \Rclass{DEXSeqDataSet} object with the function \Rfunction{DEXSeqDataSetFromSE}.
First, we modify the \Rfunction{colData}
slot to specify the sample annotation, indicating that the first two BAM files form
one experimental condition and the third one the other. Then
we create our \Rclass{DEXSeqDataSet} object.

%     
<<buildExonCountSet3,cache=TRUE, eval=FALSE>>=
colData(SE)$condition = c("A", "A", "B")
DEXSeqDataSetFromSE( SE, design= ~ sample + exon + condition:exon )
@

\subsection{Further accessors}

The function \Rfunction{geneIDs} returns the gene ID column of the feature data as a character
vector, and the function \Rfunction{exonIDs} returns the exon ID column as a \Rclass{factor}.
%
<<acc>>=
head( geneIDs(dxd) )
head( exonIDs(dxd) )
@
%
These functions are useful for subsetting a \Rclass{DEXSeqDataSet} object.

\subsection{Overlap operations}

The methods \Rfunction{subsetByOverlaps} and \Rfunction{findOverlaps} have
been implemented for \Rclass{DEXSeqResults} objects. The \Robject{query} argument 
must be a \Rclass{DEXSeqResults} object. 

%
<<grmethods>>=
interestingRegion = GRanges("chr2L", IRanges(start=3872658, end=3875302))
subsetByOverlaps( query=dxr, subject=interestingRegion )
findOverlaps( query=dxr, subject=interestingRegion )
@
%
These functions can be useful for further downstream analysis.


%--------------------------------------------------
\section{Methodological changes since publication of the paper} \label{changes}
%--------------------------------------------------

In our paper \cite{DEXSeqPaper}, we suggested fitting, for each exon, a model that includes 
the counts for all the gene's exons separately. However, this turned out to be computationally inefficient
for genes with many exons, because the many exons required large model matrices,
which are computationally expensive to deal with. We have therefore modified the approach: when fitting
a model for an exon, we now sum up the counts from all the other exons and use only the total, rather than
the individual counts, in the model. Now, computation time per exon is independent of the number of other exons in the gene,
which improves \Rpackage{DEXSeq}'s scalability. While the $p$ values returned by the two approaches are not
exactly equal, the differences were very minor in the tests that we performed.

For now, the functions for our original approach (which we now call the ``big model'' or ``BM'' approach)
are still included; all relevant functions, however, have been renamed to carry the suffix \verb|_BM|
in their name. The new approach, which is now the default and is used by the work flow described in this 
vignette, has no special name (in some previous releases of \Rpackage{DEXSeq}, which had first included it 
on an experimental basis, it was termed the ``TRT'' approach).

In the following, we describe the current default (``TRT'') approach in detail (though the exposition assumes
the reader's familiarity with our paper).

Deviating from the paper's notation, we now use the index $i$ to indicate a specific counting bin, 
with $i$ running through all counting bins of all genes. The samples are indexed with $j$, as in the paper.
We write $K_{ij0}$ for the count of reads mapped to counting bin $i$ in sample $j$ and $K_{ij1}$ for the sum
of the read counts from all other counting bins in the same gene. Hence, when we write $K_{ijl}$, the third index
$l$ indicates whether we mean the read count for bin $i$ ($l=0$) or the sum of counts for all other bins of the 
same gene ($l=1$). As before, we fit a GLM of the negative binomial (NB) family
\begin{equation} 
  K_{ijl} \sim \operatorname{NB}(\text{mean}=s_j\mu_{ijl},\text{dispersion}=\alpha_i), 
\end{equation}
now with the model specified in Equation (\ref{eq:altmodel}), which we write out as
\begin{equation} 
  \log_2 \mu_{ijl} = \beta^\text{S}_{ij} + l \beta^\text{E}_{i} + \beta^\text{EC}_{i\rho_j}. 
\end{equation}

This model is fit separately for each counting bin $i$. The coefficient $\beta^\text{S}_{ij}$ accounts for
the sample-specific contribution (factor \verb|sample| in Equation (\ref{eq:altmodel})). The term $\beta^\text{E}_{i}$
is only included if $l=1$ and hence estimates the logarithm of the ratio $K_{ij1}/K_{ij0}$ between the counts for all
other exons and the counts for the tested exon. As this coefficient is estimated from the data of all samples, it can 
be considered a measure of ``average exon usage''. In the R model formula, it is represented by the term \verb|exon|
with its two levels \verb|this| ($l=0$) and \verb|others| ($l=1$). Finally, the last term, 
$\beta^\text{EC}_{i\rho_j}$, captures the interaction \verb|condition:exon|, i.e., the change in exon usage
if sample $j$ is from experimental condition group $\rho_j$. Here, the first condition, $\rho_j=0$,
is absorbed in the sample coefficients, i.e., $\beta^\text{EC}_{i0}$ is fixed to zero and does not appear
in the model matrix.
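
The construction of the two counts $K_{ij0}$ and $K_{ij1}$ can be sketched in a few lines of base R.
The matrix \Robject{toyCounts} and its bin and sample names are made up for illustration, and the code
is not taken from the package's implementation; it only shows that the ``others'' count of a bin is the 
gene total of the sample minus the bin's own count.
%
<<sketchThisOthers>>=
# hypothetical counts for one gene: rows are counting bins, columns are samples
toyCounts = matrix( c( 10, 20, 70,
                       12, 18, 50 ), nrow = 3,
   dimnames = list( c( "E001", "E002", "E003" ), c( "s1", "s2" ) ) )
# K_{ij0}: the count of each bin itself
this = toyCounts
# K_{ij1}: per sample, the gene total minus the bin's own count
others = sweep( -toyCounts, 2, colSums( toyCounts ), "+" )
others
@
%
Each bin is thus described by only two numbers per sample, irrespective of how many bins the gene has.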

For the dispersion estimation, one dispersion value $\alpha_i$ per counting bin is estimated with Cox-Reid-adjusted maximum likelihood
using the full model given above. A mean-variance relation is fitted using the individual dispersion values. Finally, 
the individual values are shrunk towards the fitted values. For more details about this shrinkage approach, see
the \Rpackage{DESeq2} vignette and/or its manuscript~\cite{deseq2}. For the likelihood ratio test, this full model 
is fit and compared with the fit of the reduced model, which lacks the interaction term $\beta^\text{EC}_{i\rho_j}$. 
As described in Section~\ref{sec:glm}, alternative model formulae can be specified.

%--------------------------------------------------
\section{Requirements on GTF files} \label{GTF}
%--------------------------------------------------

In the initial preprocessing step described in Section \ref{prepAnno}, the Python script \verb|dexseq_prepare_annotation.py| 
is used to convert a GTF file with gene models into a GFF file with collapsed gene models. We recommend
using GTF files downloaded from Ensembl as input for this script, as files from other sources may deviate from the format
expected by the script. Hence, if you need to use a GTF or GFF file from another source, you may need to
convert it to the expected format. To help with this task, we give details here on
how the \verb|dexseq_prepare_annotation.py| script interprets a GFF file.

\begin{itemize}
\item The script only looks at \texttt{exon} lines, i.e., at lines which contain the term \texttt{exon} in the 
third (``type'') column. All other lines are ignored.
\item Of the data in these lines, the information about chromosome, start, end, and strand (1st, 4th, 5th, and 7th column) 
is used, and, from the last column, the attributes \verb|gene_id| and \verb|transcript_id|. The rest is ignored.
\item The \verb|gene_id| attribute is used to see which exons belong to the same gene. It must be called \verb|gene_id|
(and not \verb|Parent| as in GFF3 files, or \verb|GeneID| as in some older GFF files), and it must give the same identifier to 
all exons from the same gene, even if they are from different transcripts of this gene. (This last requirement is not met
by GTF files generated by the Table Browser function of the UCSC Genome Browser.)
\item The \texttt{transcript\_id} attribute is used to build the \texttt{transcripts} attribute in the flattened GFF file, 
which indicates which transcripts contain the described counting bin. This information is needed only to draw the transcript 
models at the bottom of the plots when the \Rcode{displayTranscripts} option to \Rfunction{plotDEXSeq} is used.
\end{itemize}

Therefore, converting a GFF file to make it suitable as input to \verb|dexseq_prepare_annotation.py| amounts to making sure 
that the exon lines have type \texttt{exon} and that the attributes giving gene ID (or gene symbol) and transcript ID are 
called \texttt{gene\_id} and \texttt{transcript\_id}, with this exact spelling. Remember to also take care that the chromosome 
names match those in your SAM files, and that the coordinates refer to the reference assembly that you used when aligning
your reads.
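
Before running the script on a file from another source, it can help to check these requirements
on a few lines. The following sketch does so in base R for three made-up annotation lines; the object names
and the lines themselves are hypothetical. It assumes Ensembl-style attributes with quoted values and is 
not a substitute for the parsing done by the Python script.
%
<<sketchCheckGTF>>=
# three hypothetical annotation lines: a proper exon line, a CDS line,
# and an exon line with a GFF3-style Parent attribute
gtfLines = c(
   'chr2L\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
   'chr2L\tsrc\tCDS\t120\t180\t.\t+\t0\tgene_id "g1"; transcript_id "t1";',
   'chr2L\tsrc\texon\t300\t400\t.\t+\t.\tParent=t2' )
fields = strsplit( gtfLines, "\t", fixed=TRUE )
type   = sapply( fields, `[`, 3 )   # third column: feature type
attrs  = sapply( fields, `[`, 9 )   # last column: attributes
isExon = type == "exon"
hasIDs = grepl( 'gene_id "[^"]+"', attrs ) & 
   grepl( 'transcript_id "[^"]+"', attrs )
# lines that are of type exon and carry both required attributes
usable = isExon & hasIDs
usable
@
%
Lines of type \texttt{exon} that are flagged as not usable are the ones whose attributes need to be renamed or added.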


%--------------------------------------------------
\section{Session Information}
%--------------------------------------------------

The session information records the versions of all the packages used in the generation of the present document.

<<sessionInfo>>=
sessionInfo()
@

%--------------------------------------------------
\section{References}
%--------------------------------------------------
\begingroup
\renewcommand{\section}[2]{}%
\bibliography{DEXSeq}
\endgroup


\end{document}