%\VignetteIndexEntry{Analyzing RNA-seq data for differential exon usage with the "DEXSeq" package} %\VignettePackage{DEXSeq} % To compile this document % library('cacheSweave'); rm(list=ls()); Sweave('DEXSeq.Rnw', driver=cacheSweaveDriver()) \documentclass[12pt]{article} <>= BiocStyle::latex() @ % My changes to the Bioc style: \IfFileExists{inconsolata.sty}{\usepackage{inconsolata}}{\usepackage{zi4}} % I like Inconsolata as terminal font \renewcommand{\familydefault}{\rmdefault} % I want serifs in my body text \renewcommand{\maketitlehooka}{\sffamily\bfseries} % The title should be sans-serif, though \renewcommand{\maketitlehookb}{\rmfamily\mdseries} % But not the authors \fancyhead[R]{\sffamily\small\thepage} % The header should not be the same font and size as the body, \fancyhead[L]{\sffamily\small\thetitle} % to set the two visually apart. \newcommand{\tttildemiddle}{\raise.17ex\hbox{$\scriptstyle\mathtt{\sim}$}} \usepackage[sort]{cite} \usepackage{xstring} \usepackage[fleqn]{amsmath} \usepackage{Sweave} \SweaveOpts{keep.source=TRUE, eps=FALSE, pdf=TRUE, png=FALSE, include=FALSE, width=4, height=4.5} \author{Alejandro Reyes, Simon Anders, Wolfgang Huber\\[1em] European Molecular Biology Laboratory (EMBL),\\ Heidelberg, Germany} \title{Inferring differential exon usage in RNA-Seq data with the DEXSeq package} \date{} %\date{\Rpackage{DEXSeq} version \Sexpr{packageDescription("DEXSeq")$Version} % %(Last revision \StrMid{$Date: 2013-10-08 08:23:56 -0700 (Tue, 08 Oct 2013) $}{8}{18})} \begin{document} \maketitle \noindent This vignette describes version \Sexpr{packageDescription("DEXSeq")$Version} of the \Rpackage{DEXSeq} package. \noindent Last revision of this document: \StrMid{$Date: 2013-10-08 08:23:56 -0700 (Tue, 08 Oct 2013) $}{8}{18} % Note: The preceding line uses SVN's keyword substitution mechanism: The text between "$Date:" and the second "$" % is automatically replaced by a current time stamp when "svn ci" is used. The \StrMid function (from the % xstring package) takes out the relevant part. (To activate keyword substitution for another file, use % "svn propset svn:keywords Date filename.txt".) <>= options(digits=2, width=80) @ \tableofcontents %----------------------------------------------------------- \section{Overview}\label{sec:praeludium} %----------------------------------------------------------- The Bioconductor package \Rpackage{DEXseq} implements a method to test for differential exon usage in comparative RNA-Seq experiments. By \emph{differential exon usage} (DEU), we mean changes in the relative usage of exons caused by the experimental condition. The relative usage of an exon is defined as \begin{equation}\label{eq:reu} \frac{\text{number of transcripts from the gene that contain this exon}}% {\text{number of all transcripts from the gene}}. \end{equation} The statistical method used by \Rpackage{DEXSeq} was introduced in our paper \cite{DEXSeqPaper}. The basic concept can be summarized as follows. For each exon (or part of an exon) and each sample, we count how many reads map to this exon and how many reads map to any of the other exons of the same gene. We consider the ratio of these two counts, and how it changes across conditions, to infer changes in the relative exon usage~(\ref{eq:reu}). In the case of an inner exon, a change in relative exon usage is typically due to a change in the rate with which this exon is spliced into transcripts (alternative splicing). Note, however, that DEU is a more general concept than alternative splicing, since it also includes changes in the usage of alternative transcript start sites and polyadenylation sites, which can cause differential usage of exons at the 5' and 3' boundary of transcripts. Similar as with differential gene expression, we need to make sure that observed differences of values of the ratio~(\ref{eq:reu}) between conditions are statistically significant, i.\,e., are sufficiently unlikely to be just due to random fluctuations such as those seen even between samples from the same condition, i.\,e., between replicates. To this end, \Rpackage{DEXSeq} assesses the strength of these fluctuations (quantified by the so-called \emph{dispersion}) by comparing replicates before comparing the averages between the sample groups. The preceding description is somewhat simplified (and perhaps over-simplified), and we recommend that users consult the paper \cite{DEXSeqPaper} for a more complete description, as well as Appendix~\ref{changes} of this vignette, which describes how the current implementation of \Rpackage{DEXSeq} differs from the original approach described in the paper. Nevertheless, two important aspects should be mentioned already here: First, \Rpackage{DEXSeq} does not actually work on the ratios~(\ref{eq:reu}), but on the counts in the numerator and denominator, to be able to make use of the information that is contained in the magnitude of count values. (3000 reads versus 1000 reads is the same ratio as 3 reads versus 1 read, but the latter is a far less reliable estimate of the underlying true value, because of statistical sampling.) Second, \Rpackage{DEXSeq} is not limited to simple two-group comparisons; rather, it uses so-called generalized linear models (GLMs) to permit ANOVA-like analysis of potentially complex experimental designs. \section{Preparations} \label{preps} \subsection{Example data} To demonstrate the use of \emph{DEXSeq}, we use the \emph{pasilla} dataset, an RNA-Seq dataset generated by Brooks et al.~\cite{Brooks2010}. They investigated the effect of siRNA knock-down of the gene \emph{pasilla} on the transcriptome of fly S2-DRSC cells. The RNA-binding protein \emph{pasilla} protein is thought to be involved in the regulation of splicing. (Its mammalian orthologs, NOVA1 and NOVA2, are well-studied examples of splicing factors.) Brooks et al.\ prepared seven cell cultures, treated three with siRNA to knock down \emph{pasilla} and left four as untreated controls, and performed RNA-Seq on all samples. They deposited the raw sequencing reads with the NCBI Gene Expression Omnibus (GEO) under the accession number GSE18508.\footnote{\url{http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508}} \paragraph{Executability of the code.} Usually, Bioconductor vignettes contain automatically executable code, i.\,e., you can follow the vignette by directly running the code shown, using functionality and data provided with the package. However, it would not be practical to include the voluminous raw data of the pasilla experiment here. Therefore, the code in this section is not automatically executable. You may download the raw data yourself from GEO, as well as the required extra tools, and follow the work flow shown here and in the \Rpackage{pasilla} vignette~\cite{pasillaVignette}. From Section~\ref{stdAnalysis} on, code is directly executable, as usual. Therefore, we recommend that you just read this section, and try following our analysis in R only from the next section onwards. Once you work with your own data, you will want to come back and adapt the work flow shown here to your data. \subsection{Alignment} The first step of the analysis is to align the reads to a reference genome. It is important to align them to the genome, not to the transcriptome, and to use a splice-aware aligner (i.\,e., a short-read alignment tool that can deal with reads that span across introns) such as TopHat2 \cite{TopHat2}, GSNAP \cite{GSNAP}, or STAR \cite{STAR}. The explanation of the analysis work-flow presented here starts with the aligned reads in the SAM format. If you are unfamiliar with the process of aligning reads to obtain SAM files, you can find a summary how we proceeded in preparing the pasilla data in the vignette for the \Rpackage{pasilla} data package~\cite{pasillaVignette} and a more extensive explanation, using the same data set, in our protocol article on differential expression calling~\cite{DEProt}. \subsection{HTSeq} \label{HTSeq} The initial steps of a \emph{DEXSeq} analysis, described in the following two sections, is typically done outside R, by using two provided Python scripts. You do not need to know how to use Python; however you have to install the Python package \software{HTSeq}, following the explanations given on the HTSeq web page: \url{http://www-huber.embl.de/users/anders/HTSeq/doc/install.html} Once you have installed \emph{HTSeq}, you can use the two Python scripts, \verb|dexseq_prepare_annotation.py| (described in Section~\ref{prepAnno}) and \verb|dexseq_count.py| (Section~\ref{sec:counting}), that come with the \Rpackage{DEXSeq} package. If you have trouble finding them, start R and ask for the installation directory with % <>= system.file( "python_scripts", package="DEXSeq" ) @ <>= system.file( "python_scripts", package="DEXSeq", mustWork=TRUE ) @ % The displayed path should contain the two files. If it does not, try to re-install \Rpackage{DEXSeq} (as usual, with \Rfunction{biocLite}). An alternative work flow, which replaces the two Python-based steps with R=based code, is also available and is demonstrated in the vignette of the \Rpackage{parathyroidSE} package~\cite{parathyroidSEVignette}. \subsection{Preparing the annotation} \label{prepAnno} The Python scripts expect a GTF file with gene models for your species. We have tested our tools chiefly with GTF files from Ensembl and hence recommend to prefer these, as files from other providers sometimes do not adhere fully to the GTF standard and cause the preprocessing to fail. Ensembl GTF files can be found in the ``FTP Download'' sections of the Ensembl web sites (i.\,e., Ensembl, EnsemblPlants, EnsemblFungi, etc.). Make sure that your GTF file uses a coordinate system that matches the reference genome that you have used for aligning your reads. (The safest way to ensure this is to download the reference genome from Ensembl, too.) If you cannot use an Ensembl GTF file, see Appendix~\ref{GTF} for advice on converting GFF files from other sources to make them suitable as input for the \verb|dexseq_prepare_annotation.py| script. In a GTF file, many exons appear multiple times, once for each transcript that contains them. We need to ``collapse'' this information to define \emph{exon counting bins}, i.\,e., a list of intervals, each corresponding to one exon or part of an exon. Counting bins for parts of exons arise when an exonic region appears with different boundaries in different transcripts. See Figure~1 of the DEXSeq paper~\cite{DEXSeqPaper} for an illustration. The Python script \verb|dexseq_prepare_annotation.py| takes an Ensembl GTF file and translates it into a GFF file with collapsed exon counting bins. Make sure that your current working directory contains the GTF file and call the script (from the command line shell, not from within R) with \begin{verbatim} python /path/to/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py Drosophila_melanogaster.BDGP5.72.gtf Dmel_flattened.gff \end{verbatim} In this command, which should be entered as a single line, replace \verb|/path/to|...\verb|/python_scripts| with the correct path to the Python scripts, which you have found with the call to \Rfunction{system.file} shown above. \verb|Drosophila_melanogaster.BDGP5.72.gtf| is the Ensembl GTF file (here the one for fruit fly, already de-compressed) and \verb|Dmel_flattenend.gff| is the name of the output file. In the process of forming the counting bins, the script might come across overlapping genes. If two genes on the same strand are found with an exon of the first gene overlapping with an exon of the second gene, the script's default behaviour is to combine the genes into a single ``aggregate gene'' which is subsequently referred to with the IDs of the individual genes, joined by a plus ('+') sign. If you do not like this behaviour, you can disable aggregation with the option ``\verb|-r no|''. Without aggregation, exons that overlap with other exons from different genes are simply skipped. \subsection{Counting reads}\label{sec:counting} For each SAM file, we next count the number of reads that overlap with each of the exon counting bins defined in the flattened GFF file. This is done with the script \verb|python_count.py|: \begin{verbatim} python /path/to/library/DEXSeq/python_scripts/dexseq_count.py Dmel_flattenend.gff untreated1.sam untreated1.counts \end{verbatim} This command (again, to be entered as a single line) expects two files in the current working directory, namely the GFF file produced in the previous step (here \verb|Dmel_flattened.py|) and a SAM file with the aligned reads from a sample (here the file \verb|untreated1.sam| with the aligned reads from the first control sample). The command generates an output file, here called \verb|untreated1.counts|, with one line for each exon counting bin defined in the flattened GFF file. The lines contain the exon counting bin IDs (which are composed of gene IDs and exon bin numbers), followed by a integer number which indicates the number of reads that were aligned such that they overlap with the counting bin. Use the script multiple times to produce a count file from each of your SAM files. There are a number of crucial points to pay attention to when using the \verb|python_count.py| script: \emph{Paired-end data:} If your data is from a paired-end sequencing run, you need to add the option ``\verb|-p yes|'' to the command to call the script. (As usual, options have to be placed before the file names, surrounded by spaces.) In addition, the SAM file needs to be sorted, either by read name or by position. Most aligners produce sorted SAM files; if your SAM file is not sorted, use \verb|samtools sort -n| to sort by read name (or \verb|samtools sort|) to sort by position. (See e.g. reference \cite{DEProt}, if you need further explanations on how to sort SAM files.) Use the option ``\verb|-r pos|'' or ``\verb|-r name|'' to indicate whether your paired-end data is sorted by alignment position or by read name.\footnote{The possibility to process paired-end data from a file sorted by position is based on recent contributions of Paul-Theodor Pyl to \software{HTSeq}.} \emph{Strandedness:} By default, the counting script assumes your library to be \emph{strand-specific}, i.e., reads are aligned to the same strand as the gene they originate from. If you have used a library preparation protocol that does not preserve strand information (i.e., reads from a given gene can appear equally likely on either strand), you need to inform the script by specifying the option ``\verb|-s no|''. If your library preparation protocol reverses the strand (i.e., reads appear on the strand opposite to their gene of origin), use ``\verb|-s reverse|''. In case of paired-end data, the default (\verb|-s yes|) means that the read from the first sequence pass is on the same strand as the gene and the read from the second pass on the opposite strand (``forward-reverse'' or ``fr'' order in the parlance of the Bowtie/TopHat manual) and the options \verb|-s reverse| specifies the opposite case. \emph{SAM and BAM files:} By default, tehs cript expects its input to be in plain-text SAM format. However, it can also read BAM files, i.e., files in the the compressed binary variant of the SAM format. If you wish to do so, use the option ``\verb|-f bam|''. This works only if you have installed the Python package \software{pysam}, which can be found at \url{https://code.google.com/p/pysam/}. \emph{Alignment quality:} The scripts takes a further option, \verb|-a| to specify the minimum alignment quality (as given in the fifth column of the SAM file). All reads with a lower quality than specified (with default \verb|-a 10|) are skipped. \emph{Help pages:} Calling either script without arguments displays a help page with an overview of all options and arguments. \subsection{Reading the data in to R} \label{sec:readData} The remainder of the analysis process is now done in \R. Start \R\ and load \Rpackage{DEXSeq} % <>= library( "DEXSeq" ) @ % Now, we need to prepare a sample table. This table should contain one row for each library, and columns for all relevant information such as name of the file with the read counts, experimental conditions, technical information and further covariates. To keep this vignette simple, we construct the table on the fly. % <>= sampleTable <- data.frame( row.names = c( "untreated1", "untreated2", "untreated3", "untreated4", "treated1", "treated2", "treated3" ), countFile = c( "untreated1.counts", "untreated2.counts", "untreated3.counts", "untreated4.counts", "treated1.counts", "treated2.counts","treated3.counts" ), condition = c( "control", "control", "control", "control", "knockdown", "knockdown", "knockdown" ), libType = c( "single-end", "paired-end", "paired-end", "single-end", "single-end", "paired-end", "paired-end" ) ) @ % While this is a simple way to prepare the table, it may be less error-prone and more prudent to used an existing table that had already been prepared when the experiments were done, save it in CSV format and use the R function \Rfunction{read.csv} to load it. In any case, it is vital to check the table carefully for correctness. % <>= sampleTable @ % Our table contains the sample names as row names, the names of the count files that we prepared in the previous step and the two covariates that vary between samples: first the experimental condition (factor \texttt{condition} with levels \texttt{control} and \texttt{treatment}) and the library type (factor \texttt{libType}), which we included because the samples in this particular experiment were sequenced partly in single-end runs and partly in paired-end runs. For now, we will, however, ignore this latter piece of information, and postpone the discussion of how to include such additional covariates to Section~\ref{sec:glm}. If you have only a single covariate and want to perform a simple analysis, the column with this covariate should be named \texttt{condition}. Now, we construct an \Rclass{ExonCountSet} object from this data. This object holds all the input data and will be passed along the stages of the subsequent analysis. We construct the object with the \Rpackage{DEXSeq} function \Rfunction{read.HTSeqCounts}, as follows: % <>= ecs <- read.HTSeqCounts( sampleTable$countFile, sampleTable, "Dmel_flattenend.gff" ) @ % The function takes three arguments. First, a vector with names of count files, i.e., of files that have been generated with the \verb|dexseq_count.py| script. The function will read these files and arrange the count data in a matrix, which is stored in the \Rclass{ExonCountSet} object \Robject{ecs}. The second argument is our sample table, with one row for each of the files listed in the first argument. This information is simply stored as is in the object's meta-data slot (see below). The third argument is a file name again, now of the flattened GFF file that was generated with \verb|dexseq_prepare_annotation.py| and used as input to \verb|dexseq_count.py| when creating the count file. There are other ways to get a \Rpackage{DEXSeq} analysis started. See Appendix \ref{sec:creatingInR} and Ref.~\cite{parathyroidSEVignette} for details. \section{Standard analysis work-flow} \label{stdAnalysis} \subsection{Loading and inspecting the example data} To demonstrate the \Rpackage{DEXSeq} work flow, we have prepared, in the \Rpackage{pasilla} data package, an \Rclass{ExonCountSet} similar to the one constructed in the previous section. However, in order to keep the run-time of this vignette small, we have subset the object to only a few genes. To start, we load the package \Rpackage{DEXSeq} and the example data from the \Rpackage{pasilla} package. % <>= library( DEXSeq ) data( "pasillaExons", package="pasilla" ) @ % The object \Robject{pasillaExons} is an \Rclass{ExonCountSet}, which has been constructed with \Rfunction{read.HTSeqCounts} as described above. For consistency, we rename it to \Robject{ecs} as above, and then inspect it. % <>= ecs <- pasillaExons ecs @ % The \Rclass{ExonCountSet} class is derived from \Rclass{eSet}, Bioconductor's standard container class for experimental data \cite{ExpressionSet}. As such, it contains the usual accessor functions for sample, feature and assay data (including \Rfunction{pData}, \Rfunction{fData}, \Rfunction{experimentData}), and some specific ones. The core data in an \Rclass{ExonCountSet} object are the counts per exon. Let's have a look at the first 7 rows. <>= head( counts(ecs), 7 ) @ % The rows are labelled with gene IDs (here Flybase IDs), followed by a colon and the counting bin number. (As a counting bin corresponds to an exon or part of an exon, this ID is called the \emph{exon ID} within \Rpackage{DEXSeq}.) The table content indicates the number of reads that have been mapped to each counting bin in the respective sample. To see details on the counting bins, we also print the first 3 lines of the feature data annotation: % {\small <>= head( fData(pasillaExons), 3 ) @ } % So far, this table contains information on the annotation data, such as gene and exon IDs, genomic coordinates of the exon, and the list of transcripts that contain an exon. Further columns for intermediate and final results are already present (and filled with \Robject{NA}) and will be filled with results in subsequent steps of the analysis. The accessor function \Rfunction{design} shows the design table with the sample annotation (which was passed as the second argument to \Rfunction{read.HTSeqCounts}): % <>= design(ecs) @ % In the following sections, we will update the object by calling a number of analysis functions, always using the idiom ``\verb| ecs <- |\textit{\texttt{someFunction}}\verb|( ecs )|'', which takes the \verb|ecs| object, fills in the results of the performed computation and writes the returned and updated object back into the variable \verb|ecs|. %-------------------------------------------------- \subsection{Normalisation}\label{sec:norm} %-------------------------------------------------- Different samples might be sequenced with different depths. In order to adjust for such coverage biases, we estimate \emph{size factors}, which measure relative sequencing depth. \Rpackage{DEXSeq} uses the same method as \Rpackage{DESeq} and \Rpackage{DESeq2}, which is provided in the function \Rfunction{estimateSizeFactors}. % <>= ecs <- estimateSizeFactors( ecs ) @ % You may want to inspect the estimated size factors <>= sizeFactors(ecs) @ %-------------------------------------------------- \subsection{Dispersion estimation}\label{sec:dispest} %-------------------------------------------------- To test for differential expression, we need to estimate the variability of the data. This is necessary to be able to distinguish technical and biological variation (noise) from real effects on exon expression due to the different conditions. The information on the strength of the noise is inferred from the biological replicates in the data set and characterized by the so-called \emph{dispersion}. In RNA-Seq experiments the number of replicates is typically too small to reliably estimate variance or dispersion parameters individually exon by exon, and therefore, variance information is shared across exons and genes, in an intensity-dependent manner. In this section, we discuss simple one-way designs: In this setting, samples with the same experimental condition, as indicated in the \texttt{condition} factor of the design table (see above), are considered as replicates -- and therefore, the design table needs to contain a column with the name \verb|condition|. In Section~\ref{sec:glm}, we discuss how to treat more complicated experimental designs which are not accommodated by a single \Robject{condition} factor. The first step is to get a dispersion estimate for each exon. This task is performed by the function \Rfunction{estimateDispersions}, using Cox-Reid (CR) likelihood estimation (our approach here follows that of the package \Rpackage{edgeR}~\cite{edgeR_GLM}). Before starting estimating the CR dispersion estimates, \Rfunction{estimateDispersions} first defines the ``testable'' counting bins. In this step, all bins are excluded with less that \Robject{minCount} reads. By default, \Robject{minCount=10}, i.e., the bin must have at least 10 reads, \emph{summed up} across all samples, but other values can be specified when calling \Rfunction{estimateDispersion}. If a gene has only one testable counting bin, then this counting bin is then considered as not testable, either, because it's usage cannot be compared to any other counting bins. The testable bins are marked in the column \texttt{testable} of the feature data. Then, a dispersion estimate is computed for each bin, and the obtained values are stored in the feature data column \Robject{dispBeforeSharing}. The following command performs these steps. <>= ecs <- estimateDispersions( ecs ) @ % Note that for a full, genome-wide data set, execution of this function can take a while. To indicate progress, a dot is printed on the console whenever 100 genes have been processed. If you have a machine with multiple cores, you may want to use the \texttt{nCores} option to instruct the function to parallelize the task over several CPU cores. See Section \ref{parallelization} and the function's help page for details. Afterwards, the function \Rfunction{fitDispersionFunction} needs to be called, in which a dispersion-mean relation $\alpha(\mu) = \alpha_0 + \alpha_1/\mu$ is fitted to the individual CR dispersion values (as stored in \Robject{dispBeforeSharing}). The fit coefficients are stored in the slot \Robject{dispFitCoefs} and finally, for each exon, the maximum between the dispersion before sharing and the fitted dispersion value is taken as the exon's final dispersion value and stored in the \Robject{dispersion} slot.\footnote{Especially when the dispersion estimates are very large, this fit can be difficult, and has occasionally caused the function to fail. In these rare cases please contact the developers.} See our paper~\cite{DEXSeqPaper} for the rationale behind this ``sharing'' scheme. % <>= ecs <- fitDispersionFunction( ecs ) @ % We can have a look at the results of the CR estimation, the coefficients from the fit, and the fitted values: % <>= head( fData(ecs)$dispBeforeSharing ) ecs@dispFitCoefs head( fData(ecs)$dispFitted ) @ % As a fit diagnostic, we plot the per-exon dispersion estimates versus the mean normalised count. % <>= plotDispEsts( ecs ) @ % \incfig{DEXSeq-fitdiagnostics}{.5\textwidth}{Fit Diagnostics.}{Per-gene dispersion estimates (shown by points) and the fitted mean-dispersion function (red line).} % The plot (Figure~\ref{DEXSeq-fitdiagnostics}) shows the estimates for all exons as dots and the fit as red line. The red line follows the trend of the dots in the upper cluster of dots. The lower cluster stems from exons for which the sample noise happens to fall below shot noise, i.e., their sample estimates of the dispersion is zero or close to zero and hence form another cluster at the bottom. The fact that these two clusters look so well separated is to a large extent an artefact of the logarithmic $y$-axis scaling. Inspect the fit and make sure that the regression line follows the trend of the points within the upper cluster. The intercept coefficient of the dispersion fit (see above), which is the horizontal asymptote of the red line, can be understood as the square of the coefficient of variation of highly expressed exons, and is a good rough estimate of the overall noise in the data set and hence of the available power to infer differential exon usage, %-------------------------------------------------- \subsection{Testing for differential exon usage}\label{sec:deu} %-------------------------------------------------- Having the dispersion estimates and the size factors, we can now test for differential exon usage. For each gene, \Rpackage{DEXSeq} fits a generalized linear model with the formula \begin{equation}\label{eq:altmodel} \mbox{\texttt{\tttildemiddle\ sample + exon + condition:exon}} \end{equation} and compare it to the smaller model (the null model) \begin{equation}\label{eq:nullmodel} \mbox{\texttt{\tttildemiddle\ sample + exon}.} \end{equation} In these formulae (which use the standard notation for linear model formulae in \R; consult a text book on \R\ if you are unfamiliar with it), \verb|sample| is a factor with different levels for each sample, \verb|condition| is the factor of experimental conditions that we defined when constructing the \Rclass{ExonCountSet} object at the beginning of the analysis, and \verb|exon| is a factor with two levels, \verb|this| and \verb|others|. The two models described by these formulae are fit for each counting bin, where the data supplied to the fit comprise \emph{two} read count values for each sample, corresponding to the two levels of the \Robject{exon} factor: the number of reads mapping to the bin in question (level \Robject{this}), and the sum of the read counts from all other bins of the same gene (level \Robject{others}). Note that this approach differs from the approach described in the paper~\cite{DEXSeqPaper} and used in older versions of \Rpackage{DEXSeq}; see Appendix~\ref{changes} for further discussion. Readers familiar with linear model formulae might find one aspect of Equation~(\ref{eq:altmodel}) surprising: We have an interaction term \verb|condition:exon|, but denote no main effect for \verb|condition|. Note, however, that all observations from the same sample are also from the same condition, i.e., the \verb|condition| main effects are absorbed in the \verb|sample| main effects, because the \verb|sample| factor is nested within the \verb|condition| factor. The deviances of both fits are compared using a $\chi^2$-distribution, giving rise to a $p$ value. Based on that, we can decide whether the null model (\ref{eq:nullmodel}) is sufficient to explain the data, or whether it may be rejected in favour of the alternative, model~(\ref{eq:altmodel}), which contains an interaction coefficient for \verb|condition:exon|. The latter means that the fraction of the gene's reads that fall onto the exon under the test differs significantly between the experimental conditions. The function \Rfunction{testForDEU} performs these tests for each exon in each gene. % <>= ecs <- testForDEU( ecs ) @ % The function stores its results in the \Robject{pvalue} and \Robject{padjust} columns of the \Robject{featureData} slots of the \Rclass{ExonCountSet} object with the results. Here, \Robject{pvalue} contains the p values from the $\chi^2$ likelihood-ratio test, and \Robject{padjust} is the result of performing Benjamini-Hochberg adjustment for multiple testing on \Robject{pvalue}: % <>= head( fData(ecs)[, c( "pvalue", "padjust" ) ] ) @ % For some usages, we may also want to estimate fold changes. To this end, we call \Rfunction{estimatelog2FoldChanges}: % <>= ecs <- estimatelog2FoldChanges( ecs ) @ % This function stores its results in the feature data table, too. A convenient way to extract all relevant columns of the feature data table is offered by the function \Rfunction{DEUresultTable}: % <>= res1 <- DEUresultTable(ecs) head( res1 ) @ % Controlling false discovery rate (FDR) at 0.1 (10\%), we can now ask how many counting bins show evidence of differential exon usage: % <>= table ( res1$padjust < 0.1 ) @ % We may also ask how many genes are affected <>= table ( tapply( res1$padjust < 0.1, geneIDs(pasillaExons), any ) ) @ % Remember that our example data set contains only a selection of genes. We have chosen these to contain interesting cases; so the fact that such a large fraction of genes is significantly affected here is not typical. To see how the power to detect differential exon usage depends on the number of reads that map to an exon, a so-called MA plot is useful, which plots the logarithm of fold change versus average normalized count per exon and marks by red colour the exons which are considered significant; here, the exons with an adjusted $p$ values of less than 0.1 (Figure \ref{DEXSeq-MvsA}). There is of course nothing special about the number 0.1, and you can specify other thresholds in the call to \Rfunction{plotMA}. % <>= plotMA( ecs, FDR=0.1, ylim=c(-4,4), cex=0.8 ) @ \incfig{DEXSeq-MvsA}{.5\textwidth}{MA plot.}{ Mean expression versus $\log_2$ fold change plot. Significant hits (at \Robject{padj}<0.1) are coloured in red. } %$ %------------------------------------------------------------ \section{Additional technical or experimental variables}\label{sec:glm} %------------------------------------------------------------ In the previous section we performed a simple analysis of differential exon usage, in which each sample was assigned to one of two experimental conditions. If your experiment is of this type, you can use the work flow shown above. All you have to make sure is that you indicate which sample belongs to which experimental condition when you construct the \Rclass{ExonCountSet} object (Section \ref{sec:readData}. Do so by means of a column called \verb|condition| in the sample table. If you have a more complex experimental design, you can provide different or additional columns in the sample table. You then have to indicate the design by providing explicit formulae for the test. In the \Rpackage{pasilla} dataset, some samples were sequenced in single-end and others in paired-end mode. Possibly, this influenced counts and should hence be accounted for. We therefore use this as an example for a complex design. When we constructed the \Rclass{ExonCountSet} object in Section \ref{sec:readData}, we provided in the sample table an additional column called \verb|type|, which has been stored in the object: % <>= design(pasillaExons) @ % We specify two design formulae, which indicate that the \verb|libType| \Rclass{factor} should be treated as a blocking factor: % <>= formulaFullModel <- ~ sample + exon + type:exon + condition:exon formulaReducedModel <- ~ sample + exon + type:exon @ % Compare these formulae with the default formulae (\ref{eq:altmodel}, \ref{eq:nullmodel}) given in Section \ref{sec:deu}. We have added, in both the full model and the reduced model, the term \verb|type:exon|. Therefore, any dependence of exon usage on library type will be absorbed by this term and accounted for equally in the full and a reduced model, and the likelihood ratio test comparing them will only detect differences in exon usage that can be attributed to \verb|condition|, independent of \verb|type|. To start a fresh analysis, now using these formulae instead of the default ones, we copy the example data into a new object. % <>= ecs2 <- pasillaExons @ % As before, we first estimate the size factors <>= ecs2 <- estimateSizeFactors( ecs2 ) @ % Next, we estimate the dispersions. This time, we need to inform the \Rfunction{estimateDispersions} function about our design by providing the full model's formula, which should be used instead of the default formula (\ref{eq:altmodel}). % <>= ecs2 <- estimateDispersions( ecs2, formula = formulaFullModel ) @ % The fit is performed as before. % <>= ecs2 <- fitDispersionFunction( ecs2 ) @ % The test function now needs to be informed about both formulae <>= ecs2 <- testForDEU( ecs2, formula0 = formulaReducedModel, formula1 = formulaFullModel ) @ % Finally, we get a summary table, as before. <>= res2 <- DEUresultTable( ecs2 ) head(res2) @ % How many significant DEU cases have we got this time? <>= table( res2$padjust < 0.1 ) @ % We can now compare with the previous result: <>= table( before = res1$padjust < 0.1, now = res2$padjust < 0.1 ) @ <>= stopifnot( sum( res2$padjust<.1, na.rm=TRUE ) == sum( res1$padjust<.1, na.rm=TRUE ) + 1 ) @ % Accounting for the library type has allowed us to find one more hit. (With so few genes, this could be coincidence, of course. However, performing the analysis on the full pasilla data confirms that accounting for the covariate improves power.) %-------------------------------------------------- \section{Visualization} %-------------------------------------------------- The \Rfunction{plotDEXSeq} provides a means to visualize the results of an analysis. % <>= plotDEXSeq( ecs2, "FBgn0010909", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 ) @ % \incfig{DEXSeq-plot1}{\textwidth}{Fitted expression.}{ The plot represents the expression estimates from a call to \texttt{testForDEU}. Shown in red is the exon that showed significant differential exon usage. } % <>= wh = (fData(ecs2)$geneID=="FBgn0010909") stopifnot(sum(fData(ecs2)$padjust[wh] < formals(plotDEXSeq)$FDR)==1) @ % The result is shown in Figure~\ref{DEXSeq-plot1}. This plot shows the fitted expression values of each of the exons of gene FBgn0010909, for each of the two conditions, treated and untreated. The function \Rfunction{plotDEXSeq} expects at least two arguments, the \Rclass{ExonCountSet} object and the gene ID. The option \texttt{legend=TRUE} causes a legend to be included. The three remaining arguments in the code chunk above are ordinary plotting parameters which are simply handed over to \R's standard plotting functions. They are not strictly needed and included here to improve appearance of the plot. See the help page for \Rfunction{par} for details. Optionally, one can also visualize the transcript models (Figure~\ref{DEXSeq-plot2}), which can be useful for putting differential exon usage results into the context of isoform regulation. % <>= plotDEXSeq( ecs2, "FBgn0010909", displayTranscripts=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 ) @ % \incfig{DEXSeq-plot2}{\textwidth}{Transcripts.}{ As in Figure~\ref{DEXSeq-plot1}, but including the annotated transcript models.} % Other useful options are to look at the count values from the individual samples, rather than at the model effect estimates. For this display (option \texttt{norCounts=TRUE}), the counts are normalized by dividing them by the size factors (Figure~\ref{DEXSeq-plot3}). % <>= plotDEXSeq( ecs2, "FBgn0010909", expression=FALSE, norCounts=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 ) @ % \incfig{DEXSeq-plot3}{\textwidth}{Normalized counts.}{ As in Figure~\ref{DEXSeq-plot1}, with normalized count values of each exon in each of the samples.} % As explained in Section~\ref{sec:praeludium}, \Rpackage{DEXSeq} is designed to find changes in relative exon usage, i.\,e., changes in the expression of individual exons that are not simply the consequence of overall up- or down-regulation of the gene. To visualize such changes, it is sometimes advantageous to remove overall changes in expression from the plots. Use the (somewhat misnamed) option \texttt{splicing=TRUE} for this purpose. % <>= plotDEXSeq( ecs2, "FBgn0010909", expression=FALSE, splicing=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 ) @ % \incfig{DEXSeq-plot4}{\textwidth}{Fitted splicing.}{ The plot represents the estimated effects, as in Figure~\ref{DEXSeq-plot1}, but after subtraction of overall changes in gene expression.} % To generate an easily browsable, detailed overview over all analysis results, the package provides an HTML report generator, implemented in the function \Rpackage{DEXSeqHTML}. This function uses the package \Rpackage{hwriter} \cite{hwriter} to create a result table with links to plots for the significant results, allowing a more detailed exploration of the results. % <>= DEXSeqHTML( pasillaExons, FDR=0.1, color=c("#FF000080", "#0000FF80") ) @ %-------------------------------------------------- \section{Parallelization} \label{parallelization} %-------------------------------------------------- DEXSeq analyses can be computationally heavy, especially with data sets that comprise a large number of samples, or with genomes containing genes with large numbers of exons. While some steps of the analysis work on the whole data set, the two parts that are most time consuming -- the functions \Rfunction{estimateDispersions} and \Rfunction{testForDEU} -- can be parallelized by setting the two functions' parameter \Rfunction{nCores} to the number of CPU cores to be used. These functions will then distribute the \Rclass{ExonCountSet} object into smaller objects that are processed in parallel on different cores. This functionality uses the \Rpackage{parallel} package, which hence has to be loaded first. <>= data("pasillaExons", package="pasilla") library("parallel") pasillaExons <- estimateSizeFactors( pasillaExons ) pasillaExons <- estimateDispersions( pasillaExons, nCores=16) pasillaExons <- fitDispersionFunction( pasillaExons ) pasillaExons <- testForDEU( pasillaExons, nCores=16) @ %-------------------------------------------------- \section{Perform a standard differential exon usage analysis in one command} %-------------------------------------------------- In the previous sections, we went through the analysis step by step. Once you are sufficiently confident about the work flow for your data, its invocation can be streamlined by the wrapper function \Rfunction{doCompleteDEUAnalysis}, which runs the analysis shown above through a single function call. In the simplest case, construct the \Robject{ExonCountSet} as shown in Section \ref{preps} or in Appendix \ref{sec:creatingInR}, then run \Rfunction{doCompleteDEUAnalysis} passing the \Robject{ExonCountSet} as only argument, and finally inspect the result with \Rfunction{DEUresultTable}. In the following, we also specify design formulae to account for library type as described in Section \ref{sec:glm} and the \Robject{nCores} argument for parallelization (Section \ref{parallelization}). % <>= data( "pasillaExons", package="pasilla" ) pasillaExons <- doCompleteDEUAnalysis( pasillaExons, formula0 = ~ sample + exon + type:exon, formula1 = ~ sample + exon + type:exon + condition:exon, nCores = 1 ) @ <>= head( DEUresultTable( pasillaExons ) ) @ \appendix \clearpage \begin{center} {\Large\sffamily\bfseries\color{BiocBlue} APPENDIX} \addcontentsline{toc}{section}{APPENDIX} \end{center} %-------------------------------------------------- \section{Preprocessing within R}\label{sec:creatingInR} %-------------------------------------------------- As an alternative to the approach described in Section \ref{preps}, users can also create \Rclass{ExonCountSet} objects from other \Rpackage{Bioconductor} data objects. The code for implementationg these functions was kindly contributed by Michael I.\ Love. For details, see the \Rpackage{parathyroidSE} package vignette \cite{parathyroidSEVignette}. The work flow is similar to the one using the \Rpackage{HTSeq} python scripts. emph{Note:} The code in this section is not run when the vignette is built, as some of the commands have long run time. Therefore, no output is given. We use functionality from the following Bioconductor packages <>= library("GenomicFeatures") library("GenomicRanges") library("Rsamtools") @ % We demonstrate the workflow briefly (for more details, see \cite{parathyroidSEVignette}) on the data set of Haglund et al.\ \cite{parathyroidPaper}, which is provided as example data in the \Rpackage{parathyroidSE} data package. First, we download the current human gene model annotation from Ensembl via Biomart and create a transcript data base from these. Note that this step takes some time. <>= hse <- makeTranscriptDbFromBiomart( biomart="ensembl", dataset="hsapiens_gene_ensembl" ) @ % Next, we collapse the gene models into counting bins, analogous to Section~\ref{prepAnno}. % <>= exonicParts <- disjointExons( hse, aggregateGenes=TRUE ) @ % As before, we have to choose how to handle genes with overlapping exons. The \verb|aggregateGenes| option here plays the same role as the \verb|-r| option to \verb|dexseq_prepare_anotation.py| described at the end of Section \ref{prepAnno}. The \Robject{exonicParts} object contains a GRanges object with our counting bins. We use it to count the number of read fragments that overlap with the bins by means of the function \Rfunction{countReadsForDEXSeq}. To demonstrate this, we first determine the paths to the example BAM files in the \Rpackage{parathyroidSE} data package. % <>= bamDir <- system.file( "extdata", package="parathyroidSE", mustWork=TRUE ) fls <- list.files( bamDir, pattern="bam$", full=TRUE ) @ % Then, use the following code to count the reads overlapping the bins. % <>= bamlst <- BamFileList( fls, index=character(), yieldSize=100000, obeyQname=TRUE ) SE <- summarizeOverlaps( exonicParts, bamlst, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, inter.feature=FALSE, fragments=TRUE ) @ % We can now call the function \Rfunction{buildExonCountSet} to build an \Robject{ExonCountSet} object. The middle argument in the following call is the design, indicating that the first two BAM files form one experimental condition and the third one the other. Generally, you will want to pass a \Rclass{data.frame} with a sample table as in Section~\ref{sec:readData}. % <>= ecs <- buildExonCountSet( SE, c("A", "A", "B"), exonicParts ) @ %-------------------------------------------------- \section{Lower-level functions} %-------------------------------------------------- The following functions are not needed in a standard analysis work-flow, but may be useful for special purposes. We list them here briefly; see the help pages of these function for further options (e.\,g., to specify formulae). \subsection{Count tables} While the function \Rfunction{counts} returns the whole read count table, the function \Rfunction{countTableForGene} returns the count table for a single gene: <>= countTableForGene( pasillaExons, "FBgn0010909" ) @ % Like the function \Rfunction{counts}, the function \Rfunction{countTableForGene} can also return normalized counts (i.e., counts divided by the size factors). Use the option \texttt{normalized=TRUE}. <>= head( countTableForGene( pasillaExons, "FBgn0010909", normalized=TRUE ) ) @ % The function \Rfunction{geneCountTable} computes a table of \emph{gene counts}, which are obtained by summing the counts from all exons with the same geneID. This might be useful for the detection of differential expression of genes, where the table can be used as input e.\,g.\ for the packages \Rpackage{DESeq} or \Rpackage{edgeR}. This kind of table can also be produced with the package \Rpackage{GenomicRanges}, e.\,g.\ with the function \Rfunction{summarizeOverlaps}. % <>= head( geneCountTable( pasillaExons ) ) @ % Note that a read that mapped to several exons of a gene is counted for each of these exons by the \texttt{dexseq\_count.py} script. The table returned \Rfunction{geneCountTable} will hence count the read several time for the gene, which may skew downstream analyses in subtle ways. Hence, we recommend to use \Rfunction{geneCountTable} with care and use more sophisticated counting schemes where appropriate. \subsection{Model frames} The function \Rfunction{constructModelFrame} returns the model frame used for the dispersion fits and the model fits involved in the likelihood ratio test. The terms used in the formulae passed to \Rfunction{estimateDispersions} and \Rfunction{testForDEU} refer to this model frame and hence must appear as column names. <>= constructModelFrame( pasillaExons ) @ % For the visualization with \Rfunction{plotDEXSeq}, a GLM is fitted over the joint data from all exons of a gene. The function \Rfunction{modelFrameForGene} returns the model frame used for this fit for a single gene. <>= mf <- modelFrameForGene( pasillaExons, "FBgn0010909" ) head( mf ) @ \subsection{Further accessors} The function \Rfunction{geneIDs} returns the gene ID column of the feature data as a character vector, and the function \Rfunction{exonIDs} return the exon ID column as a \Rclass{factor}. % <>= head( geneIDs(pasillaExons) ) head( exonIDs(pasillaExons) ) @ % These functions are useful for subsetting an \Rclass{ExonCountSet} object. %-------------------------------------------------- \section{Methodological changes since publication of the paper} \label{changes} %-------------------------------------------------- In our paper \cite{DEXSeqPaper}, we suggested to fit for each exon a model that includes separately the counts for all the gene's exons. However, this turned out to be computationally inefficient for genes with many exons, because the many exons required large model matrices, which are computationally expensive to deal with. We have therefore modified the approach: when fitting a model for an exon, we now sum up the counts from all the other exon and use only the total, rather than the individual counts in the model. Now, computation time per exon is independent of the number of other exons in the gene, which improved \Rpackage{DEXSeq}'s scalability. While the $p$ values returned by the two approaches are not exactly equal, the differences were very minor in the tests that we performed. For now, the function for our original approach (which we now call the ``big model'' or ``BM'' approach) are still included; all relevant functions, however, have been renamed to carry the suffix \verb|_BM| in their name. The new approach, which is now default and is used by the work flow described in this vignette, has no special name (in some previous releases of \Rpackage{DEXSeq} which had included it first on an experimental basis, it was termed the ``TRT'' approach). In the following, we describe the current default (``TRT'') approach in detail (though the exposition assumes the reader's familiarity with our paper). Deviating from the paper's notation, we now use the index $i$ to indicate a specific counting bin, with $i$ running through all counting bins of all genes. The samples are indexed with $j$, as in the paper. We write $K_{ij0}$ for the count or reads mapped to counting bin $i$ in sample $j$ and $K_{ij1}$ for the sum of the read counts from all other counting bins in the same gene. Hence, when we write $K_{ijl}$, the third index $l$ indicates whether we mean the read count for bin $i$ ($l=0$) or the sum of counts for all other bins of the same gene ($l=1$). As before, we fit a GLM of the negative binomial (NB) family \begin{equation} K_{ijl} \sim \operatorname{NB}(\text{mean}=s_j\mu_{ijl},\text{dispersion}=\alpha_i), \end{equation} now with the model specified in Equation (\ref{eq:altmodel}), which we write out as \begin{equation} \log_2 \mu_{ijl} = \beta^\text{S}_{ij} + l \beta^\text{E}_{i} + \beta^\text{EC}_{i\rho_j}. \end{equation} This model is fit separately for each counting bin $i$. The coefficient $\beta^\text{S}_{ij}$ accounts for the sample-specific contribution (factor \verb|sample| in Equation (\ref{eq:altmodel})), the term $\beta^\text{E}_{i}$ is only included if $l=1$ and hence estimates the logarithm of the ratio $K_{ij1}/K_{ij0}$ between the counts for all other exons and the counts for the tested exon. As this coefficient is estimated from data from all samples, it can be considered as a measure of ``average exon usage''. In the R model formula, it is represented by the term \verb|exon| with its two levels \verb|this| ($l=0$) and \verb|others| ($l=1$). Finally, the last term, $\beta^\text{EC}_{i,\rho_j}$, captures the interaction \verb|condition:exon|, i.e., the change in exon usage if sample $j$ is from experimental condition group $\rho(j)$. Here, the first condition, $\rho=0$, is absorbed in the sample coefficients, i.e., $\beta^\text{EC}_{i0}$ is fixed to zero and does not appear in the model matrix. For the dispersion estimation, one dispersion value $\alpha_i$ is estimated with Cox-Reid-adjusted maximum likelihood using the full model given above. For the likelihood ratio test, this full model is fit and compared with the fit of the reduced model, which lacks the interaction term $\beta^\text{EC}_{i\rho_j}$. As described in Section~\ref{sec:glm}, alternative model formulae can be specified. %-------------------------------------------------- \section{Requirements on GTF files} \label{GTF} %-------------------------------------------------- In the initial preprocessing step described in Section \ref{prepAnno}, the Python script \verb|dexseq_prepare_annotation.py| is used to convert a GTF file with gene models into a GFF file with collapsed gene models. We recommend to use GTF files downloaded from Ensembl as input for this script, as files from other sources may deviate from the format expected by the script. Hence, if you need to use a GTF or GFF file from another source, you may need to convert it to the expected format. To help with this task, we here give details on how the \verb|dexseq_prepare_annotation.py| script interprets a GFF file. \begin{itemize} \item The script only looks at \texttt{exon} lines, i.e., at lines which contain the term \texttt{exon} in the third (``type'') column. All other lines are ignored. \item Of the data in these lines, the information about chromosome, start, end, and strand (1st, 4th, 5th, and 7th column) are used, and, from the last column, the attributes \verb|gene_id| and \verb|transcript_id|. The rest is ignored. \item The \verb|gene_id| attribute is used to see which exons belong to the same gene. It must be called \verb|gene_id| (and not \verb|Parent| as in GFF3 files, or \verb|GeneID| as in some older GFF files), and it must give the same identifier to all exons from the same gene, even if they are from different transcripts of this gene. (This last requirement is not met by GTF files generated by the Table Browser function of the UCSC Genome Browser.) \item The \texttt{transcript\_id} attribute is used to build the \texttt{transcripts} attribute in the flattened GFF file, which indicates which transcripts contain the described counting bin. This information is needed only to draw the transcript model at the bottom of the plots when the \Rcode{displayTranscript} option to \Rfunction{plotDEXSeq} is used. \end{itemize} Therefore, converting a GFF file to make it suitable as input to \verb|dexseq_prepare_annotation.py| amounts to making sure that the exon lines have type \texttt{exon} and that the atributes giving gene ID (or gene symbol) and transcript ID are called \texttt{gene\_id} and \texttt{transcript\_id}, with this exact spelling. Remember to also take care that the chromosome names match those in your SAM files, and that the coordinates refer to the reference assembly that you used when aligning your reads. %-------------------------------------------------- \section{Session Information} %-------------------------------------------------- The session information records the versions of all the packages used in the generation of the present document. <>= sessionInfo() @ %-------------------------------------------------- \section{References} %-------------------------------------------------- \begingroup \renewcommand{\section}[2]{}% \bibliography{DEXSeq} \bibliographystyle{unsrt} \endgroup \end{document}