%\VignetteIndexEntry{Analyzing RNA-seq data for differential exon usage with the "DEXSeq" package}
%\VignettePackage{DEXSeq}

% To compile this document
% library('cacheSweave'); rm(list=ls()); Sweave('DEXSeq.Rnw', driver=cacheSweaveDriver())

\documentclass[12pt]{article}

<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

% My changes to the Bioc style:
\IfFileExists{inconsolata.sty}{\usepackage{inconsolata}}{\usepackage{zi4}}  % I like Inconsolata as terminal font
\renewcommand{\familydefault}{\rmdefault}  % I want serifs in my body text
\renewcommand{\maketitlehooka}{\sffamily\bfseries} % The title should be sans-serif, though
\renewcommand{\maketitlehookb}{\rmfamily\mdseries}  % But not the authors
\fancyhead[R]{\sffamily\small\thepage}  % The header should not be the same font and size as the body,
\fancyhead[L]{\sffamily\small\thetitle} % to set the two visually apart.

\newcommand{\tttildemiddle}{\raise.17ex\hbox{$\scriptstyle\mathtt{\sim}$}}

\usepackage[sort]{cite}
\usepackage{xstring}
\usepackage[fleqn]{amsmath}

\usepackage{Sweave}
\SweaveOpts{keep.source=TRUE, eps=FALSE, pdf=TRUE, png=FALSE, include=FALSE, width=4, height=4.5}


\author{Alejandro Reyes, Simon Anders, Wolfgang Huber\\[1em]
  European Molecular Biology Laboratory (EMBL),\\
  Heidelberg, Germany}

\title{Inferring differential exon usage in RNA-Seq data with the DEXSeq package}

\date{}
%\date{\Rpackage{DEXSeq} version \Sexpr{packageDescription("DEXSeq")$Version} %
%(Last revision \StrMid{$Date: 2013-10-08 08:23:56 -0700 (Tue, 08 Oct 2013) $}{8}{18})}

\begin{document}

\maketitle

\noindent This vignette describes version \Sexpr{packageDescription("DEXSeq")$Version} of the \Rpackage{DEXSeq} package.

\noindent Last revision of this document: \StrMid{$Date: 2013-10-08 08:23:56 -0700 (Tue, 08 Oct 2013) $}{8}{18}
% Note: The preceding line uses SVN's keyword substitution mechanism: The text between "$Date:" and the second "$" 
% is automatically replaced by a current time stamp when "svn ci" is used. The \StrMid function (from the 
% xstring package) takes out the relevant part. (To activate keyword substitution for another file, use
% "svn propset svn:keywords Date filename.txt".)

<<options,results=hide,echo=FALSE>>=
options(digits=2, width=80)
@

\tableofcontents

%-----------------------------------------------------------
\section{Overview}\label{sec:praeludium}
%-----------------------------------------------------------

The Bioconductor package \Rpackage{DEXseq} implements a method to test for differential exon usage
in comparative RNA-Seq experiments. By \emph{differential exon usage} (DEU), we mean changes in the relative 
usage of exons caused by the experimental condition. The relative usage of an exon is defined as
\begin{equation}\label{eq:reu}
  \frac{\text{number of transcripts from the gene that contain this exon}}%
  {\text{number of all transcripts from the gene}}.
\end{equation}
The statistical method used by \Rpackage{DEXSeq} was introduced in our paper \cite{DEXSeqPaper}. The basic concept can 
be summarized as follows. For each exon (or part of an exon) and each sample, we count how many 
reads map to this exon and how many reads map to any of the other exons of the same gene. 
We consider the ratio of these two counts, and how it changes across conditions, to infer changes in 
the relative exon usage~(\ref{eq:reu}).
In the case of an inner exon, a change in relative exon usage is typically due to a change
in the rate with which this exon is spliced into transcripts (alternative splicing). 
Note, however, that DEU is a more
general concept than alternative splicing, since it also includes 
changes in the usage of alternative transcript start sites
and polyadenylation sites, which can cause
differential usage of  exons at the 5' and 3' boundary of transcripts.

Similar as with differential gene expression, we need to make sure that observed 
differences of values of the ratio~(\ref{eq:reu}) between conditions
are statistically significant, i.\,e., are sufficiently unlikely to be just due to random fluctuations such as those 
seen even between samples from the same condition, i.\,e., between replicates. 
To this end, \Rpackage{DEXSeq} assesses the strength 
of these fluctuations (quantified by the so-called \emph{dispersion}) by comparing replicates 
before comparing the averages between the sample groups.

The preceding description is somewhat simplified (and perhaps over-simplified), and we recommend that 
users consult the paper \cite{DEXSeqPaper} for a more complete description, as well as Appendix~\ref{changes} of 
this vignette, which describes how the current implementation of \Rpackage{DEXSeq} differs from 
the original approach described in the paper. Nevertheless, two important aspects should be 
mentioned already here: First, \Rpackage{DEXSeq} does not actually work 
on the ratios~(\ref{eq:reu}), but on the counts in the numerator
and denominator, to be able to make use of the information that is
contained in the magnitude of count values. (3000 reads versus 1000 reads is the same ratio as 3 reads
versus 1 read, but the latter is a far less reliable estimate of the
underlying true value, because of statistical sampling.) Second, \Rpackage{DEXSeq} is not limited to
simple two-group comparisons; rather, it uses so-called generalized linear models
(GLMs) to permit ANOVA-like analysis of potentially complex experimental designs.

\section{Preparations} \label{preps}

\subsection{Example data}

To demonstrate the use of \emph{DEXSeq}, we use the \emph{pasilla} dataset, an
RNA-Seq dataset generated by Brooks et al.~\cite{Brooks2010}. They investigated the effect of 
siRNA knock-down of the gene \emph{pasilla} on the transcriptome of fly S2-DRSC cells. The 
RNA-binding protein \emph{pasilla} protein is thought to be involved
in the regulation of splicing. (Its mammalian orthologs, NOVA1 and NOVA2, are well-studied examples
of splicing factors.) Brooks et al.\ prepared seven cell cultures, treated three with siRNA to knock
down \emph{pasilla} and left four as untreated controls, and performed RNA-Seq on all samples. They 
deposited the raw sequencing reads with the NCBI Gene Expression Omnibus (GEO) under the accession number
GSE18508.\footnote{\url{http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508}}

\paragraph{Executability of the code.} 
Usually, Bioconductor vignettes contain automatically executable code, i.\,e., you can follow the vignette 
by directly running the code shown, using functionality and data provided with the package. However, it would not be
practical to include the voluminous raw data of the pasilla experiment here. Therefore, the
code in this section is not automatically executable. You may download the raw data yourself 
from GEO, as well as the required extra tools, 
and follow the work flow shown here and in the \Rpackage{pasilla} vignette~\cite{pasillaVignette}. 
From Section~\ref{stdAnalysis} on, code is directly executable, as usual. 
Therefore, we recommend that you just read this section, and try following our analysis in R only from
the next section onwards. Once you work with your own data, you will want to come back and adapt the 
work flow shown here to your data.

\subsection{Alignment}
The first step of the analysis is to align the reads to a reference genome. It is important
to align them to the genome, not to the transcriptome, 
and to use a splice-aware aligner (i.\,e., a short-read
alignment tool that can deal with reads that span across introns) such as TopHat2 \cite{TopHat2}, 
GSNAP \cite{GSNAP}, or STAR \cite{STAR}. The explanation
of the analysis work-flow presented here starts with the aligned reads in the SAM format. If you
are unfamiliar with the process of aligning reads to obtain SAM files, you can find a summary how
we proceeded in preparing the pasilla data in the vignette for the \Rpackage{pasilla} data 
package~\cite{pasillaVignette} and a more extensive explanation, using the same  data set, in our 
protocol article on differential expression calling~\cite{DEProt}.

\subsection{HTSeq} \label{HTSeq}

The initial steps of a \emph{DEXSeq} analysis, described in the following two sections,
is typically done outside R, by using two provided Python scripts. You do not need
to know how to use Python; however you have to install the Python package \software{HTSeq},
following the explanations given on the HTSeq web page:

\url{http://www-huber.embl.de/users/anders/HTSeq/doc/install.html}

Once you have installed \emph{HTSeq}, you can use the two Python scripts, \verb|dexseq_prepare_annotation.py|
(described in Section~\ref{prepAnno}) and \verb|dexseq_count.py| (Section~\ref{sec:counting}), that 
come with the \Rpackage{DEXSeq} package. If you have trouble finding them, start R and ask for 
the installation directory with
%
<<systemFile,results=hide>>=
system.file( "python_scripts", package="DEXSeq" )
@
<<systemFileCheck,echo=FALSE,results=hide>>=
system.file( "python_scripts", package="DEXSeq", mustWork=TRUE )
@
%
The displayed path should contain the two files. If it does not, try to re-install \Rpackage{DEXSeq} 
(as usual, with \Rfunction{biocLite}).

An alternative work flow, which replaces the two Python-based steps with R=based code, is also available
and is demonstrated in the vignette of the \Rpackage{parathyroidSE} package~\cite{parathyroidSEVignette}. 


\subsection{Preparing the annotation} \label{prepAnno}

The Python scripts expect a GTF file with gene models for your species. We have tested our tools chiefly
with GTF files from Ensembl and hence recommend to prefer these, as files from other providers sometimes do 
not adhere fully to the GTF standard and cause the preprocessing to fail.
Ensembl GTF files can be found in the ``FTP Download'' sections of
the Ensembl web sites (i.\,e., Ensembl, EnsemblPlants, EnsemblFungi, etc.). Make sure that your GTF file uses a coordinate system that matches the reference genome that you have used
for aligning your reads. (The safest way to ensure this is to download the reference genome from Ensembl, too.)
If you cannot use an Ensembl GTF file, see Appendix~\ref{GTF} for advice on converting GFF files from other
sources to make them suitable as input for the  \verb|dexseq_prepare_annotation.py| script.

In a GTF file, many exons appear multiple times, once for each transcript that contains them. We
need to ``collapse'' this information to define \emph{exon counting bins}, i.\,e., a list of intervals, each corresponding to one exon or part of an exon. Counting bins for parts of exons arise when an exonic
region appears with different boundaries in different transcripts. See Figure~1 of the DEXSeq paper~\cite{DEXSeqPaper}
for an illustration.
The Python script \verb|dexseq_prepare_annotation.py| takes an Ensembl GTF file and translates it into
a GFF file with collapsed exon counting bins.

Make sure that your current working directory contains the GTF file and call the script 
(from the command line shell, not from within R) with

\begin{verbatim}
python /path/to/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py 
    Drosophila_melanogaster.BDGP5.72.gtf Dmel_flattened.gff
\end{verbatim}

In this command, which should be entered as a single line, replace
\verb|/path/to|...\verb|/python_scripts| with the correct path to the Python scripts,
which you have found with the call to \Rfunction{system.file} shown above.
\verb|Drosophila_melanogaster.BDGP5.72.gtf| is the Ensembl GTF file (here the one for 
fruit fly, already de-compressed) and \verb|Dmel_flattenend.gff| is the name of the output
file.

In the process of forming the counting bins, the script might come across overlapping genes. If two genes on the
same strand are found with an exon of the first gene overlapping with an exon of the second gene, the script's
default behaviour is to combine the genes into a single ``aggregate gene'' which is subsequently referred to
with the IDs of the individual genes, joined by a plus ('+') sign. If you do not like this behaviour, you can
disable aggregation with the option ``\verb|-r no|''. Without aggregation, exons that overlap with other exons from
different genes are simply skipped.


\subsection{Counting reads}\label{sec:counting}

For each SAM file, we next count the number of reads that overlap with each of the exon counting bins defined
in the flattened GFF file. This is done with the script \verb|python_count.py|:

\begin{verbatim}
python /path/to/library/DEXSeq/python_scripts/dexseq_count.py 
    Dmel_flattenend.gff untreated1.sam untreated1.counts
\end{verbatim}

This command (again, to be entered as a single line) expects two files in the current working directory, namely the GFF file produced in the previous
step (here \verb|Dmel_flattened.py|) and a SAM file with the aligned reads from a sample (here the
file \verb|untreated1.sam| with the aligned reads from the first control sample). The command generates an
output file, here called \verb|untreated1.counts|, with one line for each exon counting bin defined in
the flattened GFF file. The lines contain the exon counting bin IDs (which are composed of gene IDs and
exon bin numbers), followed by a integer number which indicates the number of reads that were aligned such that
they overlap with the counting bin.

Use the script multiple times to produce a count file from each of your SAM files.

There are a number of crucial points to pay attention to when using the \verb|python_count.py| script:

\emph{Paired-end data:} If your data is from a paired-end sequencing run, you need to 
add the option ``\verb|-p yes|'' to 
the command to call the script. (As usual, options have to be placed before the file names, surrounded by spaces.)
In addition, the SAM file needs to be sorted, either by read name or by position. Most aligners produce
sorted SAM files; if your SAM file is not sorted, use \verb|samtools sort -n| to sort by read name
(or \verb|samtools sort|) to sort by position. (See e.g. reference \cite{DEProt}, if you need further explanations
on how to sort SAM files.) Use the option ``\verb|-r pos|'' or ``\verb|-r name|'' to indicate whether
your paired-end data is sorted by alignment position or by read name.\footnote{The possibility to process
paired-end data from a file sorted by position is based on recent contributions of Paul-Theodor Pyl
to \software{HTSeq}.}

\emph{Strandedness:} By default, the counting script assumes your library to be \emph{strand-specific}, 
i.e., reads are aligned to
the same strand as the gene they originate from. If you have used a library preparation protocol that does
not preserve strand information (i.e., reads from a given gene can appear equally likely on either strand),
you need to inform the script by specifying the option ``\verb|-s no|''. If your library preparation
protocol reverses the strand (i.e., reads appear on the strand opposite to their gene of origin), use 
``\verb|-s reverse|''. In case of paired-end data, the default (\verb|-s yes|) means that the read from the
first sequence pass is on the same strand as the gene and the read from the second pass on the opposite strand
(``forward-reverse'' or ``fr'' order in the parlance of the Bowtie/TopHat manual) and the options \verb|-s reverse|
specifies the opposite case.

\emph{SAM and BAM files:} By default, tehs cript expects its input to be in plain-text SAM format.
However, it can also read BAM files, i.e., files in the the compressed binary variant of the SAM format.
If you wish to do so, use the option ``\verb|-f bam|''. This works only if you have installed the Python
package \software{pysam}, which can be found at \url{https://code.google.com/p/pysam/}.

\emph{Alignment quality:} The scripts takes a further option, \verb|-a| to specify the 
minimum alignment quality (as given in the fifth column of the SAM file). All reads with a lower quality 
than specified (with default \verb|-a 10|) are skipped.

\emph{Help pages:} Calling either script without arguments displays a help page with an overview of all
options and arguments.


\subsection{Reading the data in to R} \label{sec:readData}

The remainder of the analysis process is now done in \R. Start \R\ and load \Rpackage{DEXSeq}
%
<<loadDEXSeq,resuls=hide>>=
library( "DEXSeq" )
@
%
Now, we need to prepare a sample table. This table should contain one row for each library, and columns for
all relevant information such as name of the file with the read counts, experimental conditions, technical
information and further covariates.
To keep this vignette simple, we construct the table on the fly.
%
<<sampleTable>>=
sampleTable <- data.frame(
   row.names = c( "untreated1", "untreated2", "untreated3", 
      "untreated4", "treated1", "treated2", "treated3" ),
   countFile = c( "untreated1.counts", "untreated2.counts", 
      "untreated3.counts", "untreated4.counts",
      "treated1.counts", "treated2.counts","treated3.counts" ),
   condition = c( "control", "control", "control", 
      "control", "knockdown", "knockdown", "knockdown" ),
   libType = c( "single-end", "paired-end", "paired-end", 
      "single-end", "single-end", "paired-end", "paired-end" ) )
@
%
While this is a simple way to prepare the table, it may be less error-prone and
more prudent to used an existing table that had already been prepared 
when the experiments were done, save it in CSV format and use the R function 
\Rfunction{read.csv} to load it.

In any case, it is vital to check the table carefully for correctness.
%
<<displaySampleTable>>=
sampleTable
@
%
Our table contains the sample names as row names, the names of the count files that we
prepared in the previous step and the two covariates that vary between samples: first
the experimental condition (factor \texttt{condition} with levels \texttt{control} and
\texttt{treatment}) and the library type (factor \texttt{libType}), which we included because
the samples in this particular experiment were sequenced partly in single-end
runs and partly in paired-end runs. 

For now, we will, however, ignore this latter piece of information, and postpone the discussion of how to include
such additional covariates to Section~\ref{sec:glm}. If you have only a single covariate and
want to perform a simple analysis, the column with this covariate should be
named \texttt{condition}.

Now, we construct an \Rclass{ExonCountSet} object from this data. This object holds all
the input data and will be passed along the stages of the subsequent analysis.
We construct the object with the \Rpackage{DEXSeq} function \Rfunction{read.HTSeqCounts}, 
as follows:
%
<<makeecs, eval=FALSE>>=
ecs <- read.HTSeqCounts(
   sampleTable$countFile,
   sampleTable,
   "Dmel_flattenend.gff" )
@   
%
The function takes three arguments. First, a vector with names of count files, i.e., of files
that have been generated with the \verb|dexseq_count.py| script. The function will read these
files and arrange the count data in a matrix, which is stored in the \Rclass{ExonCountSet}
object \Robject{ecs}. The second argument is our sample table, with one row for each of the 
files listed in the first argument. This information is simply stored as is in the object's
meta-data slot (see below). The third argument is a file name again, now of the flattened GFF file
that was generated with \verb|dexseq_prepare_annotation.py| and used as input to \verb|dexseq_count.py|
when creating the count file.

There are other ways to get a \Rpackage{DEXSeq} analysis started. See Appendix \ref{sec:creatingInR} and 
Ref.~\cite{parathyroidSEVignette} for details.

\section{Standard analysis work-flow} \label{stdAnalysis}

\subsection{Loading and inspecting the example data}

To demonstrate the \Rpackage{DEXSeq} work flow, we have prepared, in the \Rpackage{pasilla}
data package, an \Rclass{ExonCountSet} similar to the one constructed in the previous section.
However, in order to keep the run-time of this vignette small, we have subset the object to
only a few genes.

To start, we load the package \Rpackage{DEXSeq} and
the example data from the \Rpackage{pasilla} package.
%
<<start,results=hide>>=
library( DEXSeq )
data( "pasillaExons", package="pasilla" )
@
%
The object \Robject{pasillaExons} is an \Rclass{ExonCountSet}, which has been constructed with 
\Rfunction{read.HTSeqCounts} as described above. For consistency, we rename it to \Robject{ecs}
as above, and then inspect it.
%
<<renameEcs>>=
ecs <- pasillaExons
ecs
@
%
The \Rclass{ExonCountSet} class is derived from \Rclass{eSet},
Bioconductor's standard container class for experimental data \cite{ExpressionSet}. As
such, it contains the usual accessor functions for sample, feature and
assay data (including \Rfunction{pData},
\Rfunction{fData}, \Rfunction{experimentData}), and some specific ones.
The core data in an \Rclass{ExonCountSet} object are the counts 
per exon. Let's have a look at the first 7 rows.
<<seeCounts>>=
head( counts(ecs), 7 )
@
%
The rows are labelled with gene IDs (here Flybase IDs), followed by a colon and the counting bin
number. (As a counting bin corresponds to an exon or part of an exon, this ID is called the \emph{exon ID}
within \Rpackage{DEXSeq}.) The table content indicates the
number of reads that have been mapped to each counting bin in the respective sample.

To see details on the counting bins, we also print the first
3 lines of the feature data annotation:
%
{\small
<<fData>>=
head( fData(pasillaExons), 3 )
@
}
%
So far, this table contains information on the annotation data, such as gene and exon IDs,
genomic coordinates of the exon, and the list of transcripts that contain an exon. Further
columns for intermediate and final results are already present (and filled with \Robject{NA}) and will be filled with results
in subsequent steps of the analysis.

The accessor function \Rfunction{design} shows the design table with the sample annotation
(which was passed as the second argument to \Rfunction{read.HTSeqCounts}):
%
<<pData>>=
design(ecs)
@
%

In the following sections, we will update the object by calling a number of analysis functions,
always using the idiom ``\verb| ecs <- |\textit{\texttt{someFunction}}\verb|( ecs )|'', which
takes the \verb|ecs| object, fills in the results of the performed computation and writes the returned
and updated object back into the variable \verb|ecs|.

 %--------------------------------------------------
\subsection{Normalisation}\label{sec:norm}
%--------------------------------------------------
Different samples might be sequenced with different depths. In order to adjust for such coverage biases,
we estimate \emph{size factors}, which measure relative sequencing depth.  \Rpackage{DEXSeq} uses the same method
as \Rpackage{DESeq} and \Rpackage{DESeq2}, which is provided in the function \Rfunction{estimateSizeFactors}.
%
<<sizeFactors1>>=
ecs <- estimateSizeFactors( ecs )
@
%
You may want to inspect the estimated size factors
<<inspectSizeFactors>>=
sizeFactors(ecs)
@


%--------------------------------------------------
\subsection{Dispersion estimation}\label{sec:dispest}
%--------------------------------------------------
To test for differential expression, we need to estimate the variability of the data. This is necessary
to be able to distinguish technical and biological variation (noise) from real effects
on exon expression due to the different conditions.
The information on the strength of the noise is inferred from the biological replicates in the data set
and characterized by the so-called \emph{dispersion}. In RNA-Seq experiments the number of replicates is 
typically too small to reliably estimate variance or dispersion parameters individually exon by exon, 
and therefore, variance information is shared across exons and genes, in an intensity-dependent manner.

In this section, we discuss simple one-way designs: In this setting, samples with the same experimental 
condition, as indicated in the \texttt{condition} factor of the design table (see above), are 
considered as replicates -- and therefore, the design table needs to contain a column with the
name \verb|condition|. In Section~\ref{sec:glm}, we discuss how to treat more complicated experimental
designs which are not accommodated by a single \Robject{condition} factor.

The first step is to get a dispersion estimate for each exon. This task is performed by
the function \Rfunction{estimateDispersions}, using Cox-Reid (CR)
likelihood estimation (our approach here follows that of the package \Rpackage{edgeR}~\cite{edgeR_GLM}).
Before starting estimating the CR dispersion estimates, \Rfunction{estimateDispersions}
first defines the ``testable'' counting bins. In this step, all bins are excluded with less
that \Robject{minCount} reads. By default, \Robject{minCount=10}, i.e., the bin must have at least 10
reads, \emph{summed up} across all samples, but other values can be specified when calling 
\Rfunction{estimateDispersion}. If a gene has only one testable counting bin, then this counting bin
is then considered as not testable, either, because it's usage cannot be compared to any other counting 
bins. The testable bins are marked in the column \texttt{testable} of the feature data.

Then, a dispersion estimate is computed for each bin, and the obtained values are stored
in the feature data column \Robject{dispBeforeSharing}. The following command performs these steps.
<<estDisp1,cache=TRUE,results=hide>>=
ecs <- estimateDispersions( ecs )
@
%
Note that for a full, genome-wide data set, execution of this function can take a while. To indicate
progress, a dot is printed on the console whenever 100 genes have been processed. If you have a machine
with multiple cores, you may want to use the \texttt{nCores} option to instruct the function to parallelize the
task over several CPU cores. See Section \ref{parallelization} and the function's help page for details.

Afterwards, the function \Rfunction{fitDispersionFunction} needs to be called, in which a dispersion-mean
relation $\alpha(\mu) = \alpha_0 + \alpha_1/\mu$ is fitted to the individual CR dispersion values
(as stored in \Robject{dispBeforeSharing}). The fit coefficients are stored in the slot \Robject{dispFitCoefs} and finally,
for each exon, the maximum between the dispersion before sharing and the fitted dispersion value
is taken as the exon's final dispersion value and stored in the \Robject{dispersion}
slot.\footnote{Especially when the dispersion estimates are very large, this fit can be difficult,
  and has occasionally caused the function to fail. In these rare cases please contact the developers.}
See our paper~\cite{DEXSeqPaper} for the rationale behind this ``sharing'' scheme.
%
<<fitDisp1>>=
ecs <- fitDispersionFunction( ecs )
@
%
We can have a look at the results of the CR estimation, the coefficients from the fit, and the
fitted values:
%
<<fitDisp2>>=
head( fData(ecs)$dispBeforeSharing )
ecs@dispFitCoefs
head( fData(ecs)$dispFitted )
@
%
As a fit diagnostic, we plot the per-exon dispersion estimates versus the mean normalised count.
%
<<fitdiagnostics, fig=TRUE>>=
plotDispEsts( ecs )
@
%
\incfig{DEXSeq-fitdiagnostics}{.5\textwidth}{Fit Diagnostics.}{Per-gene dispersion estimates 
  (shown by points) and the fitted mean-dispersion function (red line).}
%
The plot (Figure~\ref{DEXSeq-fitdiagnostics}) shows the estimates for all exons as dots and the fit as red
line. The red line follows the trend of the dots in the upper cluster of dots. The lower cluster
stems from exons for which the sample noise happens to fall below shot noise, i.e., their sample estimates
of the dispersion is zero or close to zero and hence form another cluster at the bottom. The fact that these
two clusters look so well separated is to a large extent an artefact of the logarithmic $y$-axis scaling.
Inspect the fit and make sure that
the regression line follows the trend of the points within the upper cluster.

The intercept coefficient of the dispersion fit (see above), which is the horizontal asymptote of the red
line, can be understood as the square of the coefficient of variation of highly expressed exons, and
is a good rough estimate of the overall noise in the data set and hence of the available power to infer
differential exon usage,


%--------------------------------------------------
\subsection{Testing for differential exon usage}\label{sec:deu}
%--------------------------------------------------
Having the dispersion estimates and the size factors, we can now test for differential exon usage.
For each gene, \Rpackage{DEXSeq} fits a generalized linear model with the formula
\begin{equation}\label{eq:altmodel}
\mbox{\texttt{\tttildemiddle\ sample + exon + condition:exon}}
\end{equation}
and compare it to the smaller model (the null model)
\begin{equation}\label{eq:nullmodel}
\mbox{\texttt{\tttildemiddle\ sample + exon}.}
\end{equation}
In these formulae (which use the standard notation for linear model formulae in \R; consult a text book
on \R\ if you are unfamiliar with it), \verb|sample| is a factor with different levels for each sample,
\verb|condition| is the factor of experimental conditions that we defined when constructing the
\Rclass{ExonCountSet} object at the beginning of the analysis, and \verb|exon| is a factor with
two levels, \verb|this| and \verb|others|. The two models described by these formulae are fit for
each counting bin, where the data supplied to the fit comprise \emph{two} read count values for each sample, 
corresponding to the two levels of the \Robject{exon} factor: the number of reads mapping to the bin in 
question (level \Robject{this}), and the sum of the read counts from all other bins
of the same gene (level \Robject{others}). Note that this approach differs from the approach described in 
the paper~\cite{DEXSeqPaper} and used in older versions of \Rpackage{DEXSeq}; see Appendix~\ref{changes} 
for further discussion.

Readers familiar with linear model formulae might find one aspect of Equation~(\ref{eq:altmodel}) surprising: 
We have an interaction term \verb|condition:exon|, but denote no main effect for \verb|condition|.
Note, however, that all observations from the same sample are also from the same condition, i.e., the
\verb|condition| main effects are absorbed in the \verb|sample| main effects, because the \verb|sample| 
factor is nested within the \verb|condition| factor. 

The deviances of both fits are compared using a $\chi^2$-distribution, giving rise to a $p$ value.
Based on that, we can decide whether the null model
(\ref{eq:nullmodel}) is sufficient to explain the data, or whether it
may be rejected in favour of the alternative,
model~(\ref{eq:altmodel}), which contains an interaction coefficient for \verb|condition:exon|. The latter
means that the fraction of the gene's reads that fall onto the exon under the test differs significantly
between the experimental conditions.

The function \Rfunction{testForDEU} performs these tests for each exon in each gene.
%
<<testForDEU1,cache=TRUE,results=hide>>=
ecs <- testForDEU( ecs )
@
%
The function stores its results in the \Robject{pvalue} and \Robject{padjust} columns of the \Robject{featureData} slots
of the \Rclass{ExonCountSet} object with the results. Here, \Robject{pvalue} contains
the p values from the $\chi^2$ likelihood-ratio test, and \Robject{padjust} is the result of
performing Benjamini-Hochberg adjustment for multiple testing on \Robject{pvalue}:
%
<<testForDEU2>>=
head( fData(ecs)[, c( "pvalue", "padjust" ) ] )
@
%
For some usages, we may also want to estimate fold changes. To this end, we call
\Rfunction{estimatelog2FoldChanges}:
%
<<estFC,cache=TRUE>>=
ecs <- estimatelog2FoldChanges( ecs )
@
%
This function stores its results in the feature data table, too.

A convenient way to extract all relevant columns of the feature data table is
offered by the function \Rfunction{DEUresultTable}:
%
<<DEUresults>>=
res1 <- DEUresultTable(ecs)
head( res1 )
@
%
Controlling false discovery rate (FDR) at 0.1 (10\%), we can now ask
how many counting bins show evidence of differential exon usage:
%
<<tallyExons>>=
table ( res1$padjust < 0.1 )
@
%
We may also ask how many genes are affected
<<tallyGenes>>=
table ( tapply( res1$padjust < 0.1, geneIDs(pasillaExons), any ) )
@
%
Remember that our example data set contains only a selection of genes. We have chosen these to
contain interesting cases; so the fact that such a large fraction of genes is significantly affected 
here is not typical.

To see how the power to detect differential exon usage depends on the number of reads that
map to an exon, a so-called MA plot is useful, which plots the logarithm of fold change versus average
normalized count per exon and marks by red colour the exons which are considered significant; here,
the exons with an adjusted $p$ values of less than 0.1 (Figure \ref{DEXSeq-MvsA}).
There is of course nothing special about the number 0.1, and you can specify other thresholds in the call to \Rfunction{plotMA}.
%
<<MvsA, fig=TRUE, pdf=FALSE, png=TRUE, resolution=180>>=
plotMA( ecs, FDR=0.1, ylim=c(-4,4), cex=0.8 )
@
\incfig{DEXSeq-MvsA}{.5\textwidth}{MA plot.}{
Mean expression versus $\log_2$ fold change plot. Significant hits (at \Robject{padj}<0.1) are coloured in red.
}

%$
%------------------------------------------------------------
\section{Additional technical or experimental variables}\label{sec:glm}
%------------------------------------------------------------
In the previous section we performed a simple analysis of differential exon usage, in which each sample was assigned
to one of two experimental conditions. If your experiment is of this type, you can use the work flow shown above. All
you have to make sure is that you indicate which sample belongs to which experimental condition when you construct
the \Rclass{ExonCountSet} object (Section \ref{sec:readData}. Do so by means of a column called \verb|condition| in the sample table.

If you have a more complex experimental design, you can provide different or additional columns in the sample table. You
then have to indicate the design by providing explicit formulae for the test.

In the \Rpackage{pasilla} dataset, some samples were sequenced in single-end and others in paired-end mode. Possibly,
this influenced counts and should hence be accounted for. We therefore use this as an example for a complex design.

When we constructed the \Rclass{ExonCountSet} object in Section \ref{sec:readData}, we provided in the sample table an
additional column called \verb|type|, which has been stored in the object:
%
<<design>>=
design(pasillaExons)
@
%
We specify two design formulae, which indicate that the \verb|libType| \Rclass{factor} should be treated
as a blocking factor:
%
<<formulas2>>=
formulaFullModel    <-  ~ sample + exon + type:exon + condition:exon
formulaReducedModel <-  ~ sample + exon + type:exon 
@
%
Compare these formulae with the default formulae (\ref{eq:altmodel}, \ref{eq:nullmodel}) 
given in Section \ref{sec:deu}. We have added, in both 
the full model and the reduced model, the term \verb|type:exon|. Therefore, any dependence of exon
usage on library type will be absorbed by this term and accounted for equally in the full
and a reduced model, and the likelihood ratio test comparing them will only detect differences
in exon usage that can be attributed to \verb|condition|, independent of \verb|type|.

To start a fresh analysis, now using these formulae instead of the default ones, we copy the example 
data into a new object.
%
<<new_ecs>>=
ecs2 <- pasillaExons
@
%
As before, we first estimate the size factors
<<sf_again>>=
ecs2 <- estimateSizeFactors( ecs2 )
@
%
Next, we estimate the dispersions. This time, we need to inform the \Rfunction{estimateDispersions} function
about our design by providing the full model's formula, which should be used instead of the default 
formula (\ref{eq:altmodel}).
%
<<estDisps_again, cache=true, results=hide >>=
ecs2 <- estimateDispersions( ecs2, formula = formulaFullModel )
@
%
The fit is performed as before.
%
<<fit_again>>=
ecs2 <- fitDispersionFunction( ecs2 )
@
%
The test function now needs to be informed about both formulae
<<test_again, cache=true >>=
ecs2 <- testForDEU( ecs2, 
	formula0 = formulaReducedModel, formula1 = formulaFullModel )
@
%
Finally, we get a summary table, as before.
<<res_again>>=
res2 <- DEUresultTable( ecs2 )
head(res2)
@
%
How many significant DEU cases have we got this time?
<<table2>>=
table( res2$padjust < 0.1 )
@
%
We can now compare with the previous result:
<<table3>>=
table( before = res1$padjust < 0.1, now = res2$padjust < 0.1 )
@
<<oneMore,echo=false,results=hide>>=
stopifnot( sum( res2$padjust<.1, na.rm=TRUE ) == sum( res1$padjust<.1, na.rm=TRUE ) + 1 )
@
%
Accounting for the library type has allowed us to find one more hit. (With so few genes, this could be
coincidence, of course. However, performing the analysis on the full pasilla data confirms that
accounting for the covariate improves power.)

%--------------------------------------------------
\section{Visualization}
%--------------------------------------------------
The \Rfunction{plotDEXSeq} provides a means  to visualize the results of an analysis.
%
<<plot1, fig=TRUE, height=8, width=12>>=
plotDEXSeq( ecs2, "FBgn0010909", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
@
%
\incfig{DEXSeq-plot1}{\textwidth}{Fitted expression.}{
The plot represents the expression estimates from a call to \texttt{testForDEU}.
Shown in red is the exon that showed significant differential exon usage.
}
%
<<checkClaim,echo=FALSE>>=
wh = (fData(ecs2)$geneID=="FBgn0010909")
stopifnot(sum(fData(ecs2)$padjust[wh] < formals(plotDEXSeq)$FDR)==1)
@
%
The result is shown in Figure~\ref{DEXSeq-plot1}. This plot shows the fitted expression values of each of the exons
of gene FBgn0010909, for each of the two conditions, treated and untreated.
The function \Rfunction{plotDEXSeq} expects at least two arguments, the \Rclass{ExonCountSet} object and the gene ID.
The option \texttt{legend=TRUE} causes a legend to be included. The three remaining
arguments in the code chunk above are ordinary plotting parameters which are simply handed over to \R's standard
plotting functions. They are not strictly needed and included here to improve appearance of
the plot. See the help page for \Rfunction{par} for details.

Optionally, one can also visualize the transcript models (Figure~\ref{DEXSeq-plot2}), which can be
useful for putting differential exon usage results into the context of isoform regulation.
%
<<plot2, fig=TRUE, height=8, width=12>>=
plotDEXSeq( ecs2, "FBgn0010909", displayTranscripts=TRUE, legend=TRUE,
   cex.axis=1.2, cex=1.3, lwd=2 )
@
%
\incfig{DEXSeq-plot2}{\textwidth}{Transcripts.}{
As in Figure~\ref{DEXSeq-plot1}, but including the annotated transcript models.}
%
Other useful options are to look at the count values from the individual samples, rather than at the
model effect estimates. For this display (option \texttt{norCounts=TRUE}), the counts are normalized by 
dividing them by the size factors (Figure~\ref{DEXSeq-plot3}).
%
<<plot3, fig=TRUE, height=8, width=12>>=
plotDEXSeq( ecs2, "FBgn0010909", expression=FALSE, norCounts=TRUE,
   legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
@
%
\incfig{DEXSeq-plot3}{\textwidth}{Normalized counts.}{
As in Figure~\ref{DEXSeq-plot1}, with normalized count values of each exon in each of the samples.}
%
As explained in Section~\ref{sec:praeludium}, \Rpackage{DEXSeq} is designed to find
changes in relative exon usage, i.\,e., changes in the expression of individual exons that are
not simply the consequence of overall up- or down-regulation of the gene. To visualize such changes, it is
sometimes advantageous to remove overall changes in expression from the
plots. Use the (somewhat misnamed) option \texttt{splicing=TRUE} for this purpose.
%
<<plot4, fig=TRUE, height=8, width=12>>=
plotDEXSeq( ecs2, "FBgn0010909", expression=FALSE, splicing=TRUE,
   legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
@
%
\incfig{DEXSeq-plot4}{\textwidth}{Fitted splicing.}{
The plot represents the estimated effects, as in Figure~\ref{DEXSeq-plot1},
but after subtraction of overall changes in gene expression.}
%
To generate an easily browsable, detailed overview over all analysis results,
the package provides an HTML report generator, implemented in the function \Rpackage{DEXSeqHTML}.
This function uses the package \Rpackage{hwriter} \cite{hwriter} to create a result table with links to plots for the
significant results, allowing a more detailed exploration of the results. 
%
<<DEXSeqHTML,cache=TRUE, eval=FALSE>>=
DEXSeqHTML( pasillaExons, FDR=0.1, color=c("#FF000080", "#0000FF80") )
@

%--------------------------------------------------
\section{Parallelization} \label{parallelization}
%--------------------------------------------------
DEXSeq analyses can be computationally heavy, especially with data
sets that comprise a large number of samples, or with genomes
containing genes with large numbers of exons.  While some steps of the
analysis work on the whole data set, the two parts that are most time
consuming -- the functions \Rfunction{estimateDispersions} and
\Rfunction{testForDEU} -- can be parallelized by setting the two functions'
parameter \Rfunction{nCores} to the number of CPU cores to be used. 
These functions will then distribute the
\Rclass{ExonCountSet} object into smaller objects that are processed
in parallel on different cores.  This functionality uses the
\Rpackage{parallel} package, which hence has to be loaded first.
<<para1,cache=TRUE,results=hide, eval=FALSE>>=
data("pasillaExons", package="pasilla")
library("parallel")
pasillaExons <- estimateSizeFactors( pasillaExons )
pasillaExons <- estimateDispersions( pasillaExons, nCores=16)
pasillaExons <- fitDispersionFunction( pasillaExons )
pasillaExons <- testForDEU( pasillaExons, nCores=16)
@


%--------------------------------------------------
\section{Perform a standard differential exon usage analysis in one command}
%--------------------------------------------------
In the previous sections, we went through the analysis step by step.
Once you are sufficiently confident about the work flow for your data,
its invocation can be streamlined by the wrapper function
\Rfunction{doCompleteDEUAnalysis}, which runs the analysis shown
above through a single function call. 

In the simplest case, construct the \Robject{ExonCountSet} as shown in Section \ref{preps}
or in Appendix \ref{sec:creatingInR}, then run \Rfunction{doCompleteDEUAnalysis} passing the
\Robject{ExonCountSet} as only argument, and finally inspect the result with 
\Rfunction{DEUresultTable}. In the following, we also specify design formulae
to account for library type as described in Section \ref{sec:glm} and the \Robject{nCores}
argument for parallelization (Section \ref{parallelization}).
%
<<alldeu, cache=TRUE, results=hide>>=
data( "pasillaExons", package="pasilla" )
pasillaExons <- doCompleteDEUAnalysis(
   pasillaExons,
   formula0 =  ~ sample + exon + type:exon,
   formula1 =  ~ sample + exon + type:exon + condition:exon,
   nCores = 1 )
@
<<alldeuout>>=
head( DEUresultTable( pasillaExons ) )
@

\appendix
\clearpage
\begin{center}
{\Large\sffamily\bfseries\color{BiocBlue} APPENDIX} \addcontentsline{toc}{section}{APPENDIX}
\end{center}

%--------------------------------------------------
\section{Preprocessing within R}\label{sec:creatingInR}
%--------------------------------------------------

As an alternative to the approach described in Section \ref{preps}, users can also create  
\Rclass{ExonCountSet} objects from other \Rpackage{Bioconductor} data objects.
The code for implementationg these functions was kindly contributed by Michael I.\ Love. For details, see the
\Rpackage{parathyroidSE} package vignette \cite{parathyroidSEVignette}. The work flow is similar to the one 
using the \Rpackage{HTSeq} python scripts.

emph{Note:} The code in this section is not run when the vignette is built, as some of the commands
have long run time. Therefore, no output is given.

We use functionality from the following Bioconductor packages
<<buildExonCountSetLoadPacks,cache=TRUE, eval=FALSE>>=
library("GenomicFeatures")
library("GenomicRanges")
library("Rsamtools")
@
%
We demonstrate the workflow briefly (for more details, see \cite{parathyroidSEVignette}) on the data set
of Haglund et al.\ \cite{parathyroidPaper}, which is provided as example data in the 
\Rpackage{parathyroidSE} data package.

First, we download the current human gene model annotation from Ensembl via Biomart and create
a transcript data base from these. Note that this step takes some time.
<<buildExonCountSetDownloadAnno,cache=TRUE, eval=FALSE>>=
hse <- makeTranscriptDbFromBiomart( biomart="ensembl", 
   dataset="hsapiens_gene_ensembl" )
@
%
Next, we collapse the gene models into counting bins, analogous to Section~\ref{prepAnno}.
%
<<buildExonCountSetDisjoin,cache=TRUE, eval=FALSE>>=
exonicParts <- disjointExons( hse, aggregateGenes=TRUE )
@
%
As before, we have to choose how to handle genes with overlapping exons. The \verb|aggregateGenes| option
here plays the same role as the \verb|-r| option to \verb|dexseq_prepare_anotation.py| described at the 
end of Section \ref{prepAnno}.
The \Robject{exonicParts} object contains a GRanges object with our counting bins. We use it to
count the number of read fragments that overlap with the bins by means of the function 
\Rfunction{countReadsForDEXSeq}. To demonstrate this, we first determine the paths to the
example BAM files in the \Rpackage{parathyroidSE} data package.
%
<<buildExonCountSet2FindBAMs,cache=TRUE, eval=FALSE>>=
bamDir <- system.file( "extdata", package="parathyroidSE", mustWork=TRUE )
fls <- list.files( bamDir, pattern="bam$", full=TRUE )
@
%
Then, use the following code to count the reads overlapping the bins.
%
<<buildExonCountSet2FindBAMs,cache=TRUE, eval=FALSE>>=
bamlst <- BamFileList( fls, index=character(), yieldSize=100000, obeyQname=TRUE )
SE <- summarizeOverlaps( exonicParts, bamlst, mode="Union", singleEnd=FALSE, 
   ignore.strand=TRUE, inter.feature=FALSE, fragments=TRUE )
@
%
We can now call the function \Rfunction{buildExonCountSet}
to build an \Robject{ExonCountSet} object. The middle argument in
the following call is the design, indicating that the first two BAM files form
one experimental condition and the third one the other. Generally, you will want
to pass a \Rclass{data.frame} with a sample table as in Section~\ref{sec:readData}.
%     
<<buildExonCountSet3,cache=TRUE, eval=FALSE>>=
ecs <- buildExonCountSet( SE, c("A", "A", "B"), exonicParts )
@

%--------------------------------------------------
\section{Lower-level functions}
%--------------------------------------------------

The following functions are not needed in a standard analysis work-flow, but may be useful for special purposes.
We list them here briefly; see the help pages of these function for further options (e.\,g., to specify formulae).

\subsection{Count tables}

While the function \Rfunction{counts} returns the whole read count table, the function \Rfunction{countTableForGene}
returns the count table for a single gene:
<<countTableForGene, results=hide>>=
countTableForGene( pasillaExons, "FBgn0010909" ) 
@
%
Like the function \Rfunction{counts}, the function \Rfunction{countTableForGene} can also return normalized counts (i.e.,
counts divided by the size factors). Use the option \texttt{normalized=TRUE}.
<<countTableForGeneNorm>>=
head( countTableForGene( pasillaExons, "FBgn0010909", normalized=TRUE ) )
@
%
The function \Rfunction{geneCountTable} computes a table of \emph{gene counts}, which are obtained by summing
the counts from all exons with the same geneID.  This might be useful for the detection of differential expression
of genes, where the table can be used as input e.\,g.\ for the packages \Rpackage{DESeq} or \Rpackage{edgeR}.
This kind of table can also be produced with the package \Rpackage{GenomicRanges}, e.\,g.\ 
with the function \Rfunction{summarizeOverlaps}.
%
<<gct>>=
head( geneCountTable( pasillaExons ) )
@
%
Note that a read that mapped to several exons of a gene is counted for each of these
exons by the \texttt{dexseq\_count.py} script. The table returned \Rfunction{geneCountTable}
will hence count the read several time for the gene, which may skew downstream analyses in
subtle ways. Hence, we recommend to use \Rfunction{geneCountTable} with care and use more
sophisticated counting schemes where appropriate.

\subsection{Model frames}

The function \Rfunction{constructModelFrame} returns the model frame
used for the dispersion fits and the model fits involved in the likelihood ratio test. The terms used
in the formulae passed to \Rfunction{estimateDispersions} and \Rfunction{testForDEU} refer 
to this model frame and hence must appear as column names.
<<modelFrame>>=
constructModelFrame( pasillaExons )
@
%
For the visualization with \Rfunction{plotDEXSeq}, a GLM is fitted over the joint data from all exons of a gene. The function 
\Rfunction{modelFrameForGene} returns the model frame used for this fit for a single gene.
<<modelFrame>>=
mf <- modelFrameForGene( pasillaExons, "FBgn0010909" )
head( mf )
@

\subsection{Further accessors}

The function \Rfunction{geneIDs} returns the gene ID column of the feature data as a character
vector, and the function \Rfunction{exonIDs} return the exon ID column as a \Rclass{factor}.
%
<<acc>>=
head( geneIDs(pasillaExons) )
head( exonIDs(pasillaExons) )
@
%
These functions are useful for subsetting an \Rclass{ExonCountSet} object.

%--------------------------------------------------
\section{Methodological changes since publication of the paper} \label{changes}
%--------------------------------------------------

In our paper \cite{DEXSeqPaper}, we suggested to fit for each exon a model that includes separately 
the counts for all the gene's exons. However, this turned out to be computationally inefficient
for genes with many exons, because the many exons required large model matrices,
which are computationally expensive to deal with. We have therefore modified the approach: when fitting
a model for an exon, we now sum up the counts from all the other exon and use only the total, rather than
the individual counts in the model. Now, computation time per exon is independent of the number of other exons in the gene,
which improved \Rpackage{DEXSeq}'s scalability. While the $p$ values returned by the two approaches are not
exactly equal, the differences were very minor in the tests that we performed.

For now, the function for our original approach (which we now call the ``big model'' or ``BM'' approach)
are still included; all relevant functions, however, have been renamed to carry the suffix \verb|_BM|
in their name. The new approach, which is now default and is used by the work flow described in this 
vignette, has no special name (in some previous releases of \Rpackage{DEXSeq} which had included it 
first on an experimental basis, it was termed the ``TRT'' approach).

In the following, we describe the current default (``TRT'') approach in detail (though the exposition assumes
the reader's familiarity with our paper).

Deviating from the paper's notation, we now use the index $i$ to indicate a specific counting bin, 
with $i$ running through all counting bins of all genes. The samples are indexed with $j$, as in the paper.
We write $K_{ij0}$ for the count or reads mapped to counting bin $i$ in sample $j$ and $K_{ij1}$ for the sum
of the read counts from all other counting bins in the same gene. Hence, when we write $K_{ijl}$, the third index
$l$ indicates whether we mean the read count for bin $i$ ($l=0$) or the sum of counts for all other bins of the 
same gene ($l=1$). As before, we fit a GLM of the negative binomial (NB) family
\begin{equation} 
  K_{ijl} \sim \operatorname{NB}(\text{mean}=s_j\mu_{ijl},\text{dispersion}=\alpha_i), 
\end{equation}
now with the model specified in Equation (\ref{eq:altmodel}), which we write out as
\begin{equation} 
  \log_2 \mu_{ijl} = \beta^\text{S}_{ij} + l \beta^\text{E}_{i} + \beta^\text{EC}_{i\rho_j}. 
\end{equation}

This model is fit separately for each counting bin $i$. The coefficient $\beta^\text{S}_{ij}$ accounts for
the sample-specific contribution (factor \verb|sample| in Equation (\ref{eq:altmodel})), the term $\beta^\text{E}_{i}$
is only included if $l=1$ and hence estimates the logarithm of the ratio $K_{ij1}/K_{ij0}$ between the counts for all
other exons and the counts for the tested exon. As this coefficient is estimated from data from all samples, it can be considered as a measure of ``average exon usage''. In the R model formula, it is represented by the term \verb|exon|
with its two levels \verb|this| ($l=0$) and \verb|others| ($l=1$). Finally, the last term, 
$\beta^\text{EC}_{i,\rho_j}$, captures the interaction \verb|condition:exon|, i.e., the change in exon usage
if sample $j$ is from experimental condition group $\rho(j)$. Here, the first condition, $\rho=0$,
is absorbed in the sample coefficients, i.e., $\beta^\text{EC}_{i0}$ is fixed to zero and does not appear
in the model matrix.

For the dispersion estimation, one dispersion value $\alpha_i$ is estimated with Cox-Reid-adjusted maximum likelihood
using the full model given above. For the likelihood ratio test, this full model is fit and compared with the fit
of the reduced model, which lacks the interaction term $\beta^\text{EC}_{i\rho_j}$. As described in Section~\ref{sec:glm},
alternative model formulae can be specified.

%--------------------------------------------------
\section{Requirements on GTF files} \label{GTF}
%--------------------------------------------------

In the initial preprocessing step described in Section \ref{prepAnno}, the Python script \verb|dexseq_prepare_annotation.py| 
is used to convert a GTF file with gene models into a GFF file with collapsed gene models. We recommend to
use GTF files downloaded from Ensembl as input for this script, as files from other sources may deviate from the format
expected by the script. Hence, if you need to use a GTF or GFF file from another source, you may need to
convert it to the expected format. To help with this task, we here give details on
how the \verb|dexseq_prepare_annotation.py| script interprets a GFF file.

\begin{itemize}
\item The script only looks at \texttt{exon} lines, i.e., at lines which contain the term \texttt{exon} in the 
third (``type'') column. All other lines are ignored.
\item Of the data in these lines, the information about chromosome, start, end, and strand (1st, 4th, 5th, and 7th column) 
are used, and, from the last column, the attributes \verb|gene_id| and \verb|transcript_id|. The rest is ignored.
\item The \verb|gene_id| attribute is used to see which exons belong to the same gene. It must be called \verb|gene_id|
(and not \verb|Parent| as in GFF3 files, or \verb|GeneID| as in some older GFF files), and it must give the same identifier to 
all exons from the same gene, even if they are from different transcripts of this gene. (This last requirement is not met
by GTF files generated by the Table Browser function of the UCSC Genome Browser.)
\item The \texttt{transcript\_id} attribute is used to build the \texttt{transcripts} attribute in the flattened GFF file, 
which indicates which transcripts contain the described counting bin. This information is needed only to draw the transcript 
model at the bottom of the plots when the \Rcode{displayTranscript} option to \Rfunction{plotDEXSeq} is used.
\end{itemize}

Therefore, converting a GFF file to make it suitable as input to \verb|dexseq_prepare_annotation.py| amounts to making sure 
that the exon lines have type \texttt{exon} and that the atributes giving gene ID (or gene symbol) and transcript ID are 
called \texttt{gene\_id} and \texttt{transcript\_id}, with this exact spelling. Remember to also take care that the chromosome 
names match those in your SAM files, and that the coordinates refer to the reference assembly that you used when aligning
your reads.


%--------------------------------------------------
\section{Session Information}
%--------------------------------------------------

The session information records the versions of all the packages used in the generation of the present document.

<<sessionInfo>>=
sessionInfo()
@

%--------------------------------------------------
\section{References}
%--------------------------------------------------
\begingroup
\renewcommand{\section}[2]{}%
\bibliography{DEXSeq}
\bibliographystyle{unsrt}
\endgroup


\end{document}