--- title: "Inferring differential exon usage in RNA-Seq data with the DEXSeq package" author: - name: Alejandro Reyes affiliation: Dana-Farber Cancer Institute, Boston, USA - name: Simon Anders affiliation: Heidelberg University, Heidelberg, USA - name: Wolfgang Huber affiliation: European Molecular Biology Laboratory, Heidelberg, Germany output: BiocStyle::html_document: toc_float: true package: DEXSeq bibliography: DEXSeq.bib abstract: | The Bioconductor package `r BiocStyle::Biocpkg("DEXSeq")` implements a method to test for differential exon usage in comparative RNA-Seq experiments. By *differential exon usage* (DEU), we mean changes in the relative usage of exons caused by the experimental condition. The relative usage of an exon is defined as $\frac{\text{number of transcripts from the gene that contain this exon}} {\text{number of all transcripts from the gene}}$. The statistical method used by `r BiocStyle::Biocpkg("DEXSeq")` was introduced in our paper [@DEXSeqPaper]. The basic concept can be summarized as follows. For each exon (or part of an exon) and each sample, we count how many reads map to this exon and how many reads map to any of the other exons of the same gene. We consider the ratio of these two counts, and how it changes across conditions, to infer changes in the relative exon usage. In the case of an inner exon, a change in relative exon usage is typically due to a change in the rate with which this exon is spliced into transcripts (alternative splicing). Note, however, that DEU is a more general concept than alternative splicing, since it also includes changes in the usage of alternative transcript start sites and polyadenylation sites, which can cause differential usage of exons at the 5' and 3' boundary of transcripts. DEXSeq package version: `r packageVersion("DEXSeq")` vignette: | %\VignetteIndexEntry{Inferring differential exon usage in RNA-Seq data with the DEXSeq package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r echo=FALSE, include=FALSE} knitr::opts_chunk$set(tidy = FALSE, cache = TRUE, dev = "png", message = FALSE, error = FALSE, warning = TRUE) ``` # Overview Similar as with differential gene expression, we need to make sure that observed differences of exon usage values between conditions are statistically significant, i.e., are sufficiently unlikely to be just due to random fluctuations such as those seen even between samples from the same condition, i.e., between replicates. To this end, `r BiocStyle::Biocpkg("DEXSeq")` assesses the strength of these fluctuations (quantified by the so-called *dispersion*) by comparing replicates before comparing the averages between the sample groups. The preceding description is somewhat simplified (and perhaps over-simplified), and we recommend that users consult the publication for a more complete description, as well as Appendix \@ref(methodChanges) of this vignette, which describes how the current implementation of `r BiocStyle::Biocpkg("DEXSeq")` differs from the original approach described in the paper. Nevertheless, two important aspects should be mentioned already here: First, `r BiocStyle::Biocpkg("DEXSeq")` does not actually work on the exon usage ratios, but on the counts in the numerator and denominator, to be able to make use of the information that is contained in the magnitude of count values. For example, 3000 reads versus 1000 reads is the same ratio as 3 reads versus 1 read, but the latter is a far less reliable estimate of the underlying true value, because of statistical sampling. Second, `r BiocStyle::Biocpkg("DEXSeq")` is not limited to simple two-group comparisons; rather, it uses so-called generalized linear models (*GLMs*) to permit ANOVA-like analysis of potentially complex experimental designs. # Preparations ## Example data To demonstrate the use of `r BiocStyle::Biocpkg("DEXSeq")`, we use the *pasilla* dataset, an RNA-Seq dataset generated by [@Brooks2010]. They investigated the effect of siRNA knock-down of the gene *pasilla* on the transcriptome of fly *S2-DRSC* cells. The RNA-binding protein *pasilla* protein is thought to be involved in the regulation of splicing. (Its mammalian orthologs, NOVA1 and NOVA2, are well-studied examples of splicing factors.) Brooks et al. prepared seven cell cultures, treated three with siRNA to knock-down *pasilla* and left four as untreated controls, and performed RNA-Seq on all samples. They deposited the raw sequencing reads with the NCBI Gene Expression Omnibus (GEO) under the accession number [GSE18508](http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508). ## Executability of the code Usually, Bioconductor vignettes contain automatically executable code, i.e., you can follow the vignette by directly running the code shown, using functionality and data provided with the package. However, it would not be practical to include the voluminous raw data of the pasilla experiment here. Therefore, the code in the alignment section is not automatically executable. You may download the raw data yourself from GEO, as well as the required extra tools, and follow the work flow shown here and in the *pasilla* vignette [@pasillaVignette]. From Section \@ref(standard-analysis-workflow) on, code is directly executable, as usual. Therefore, we recommend that you just read this section, and try following our analysis in R only from the next section onwards. Once you work with your own data, you will want to come back and adapt the workflow shown here to your data. ## Alignment The first step of the analysis is to align the reads to a reference genome. If you are using classic alignment methods (i.e. not pseudo-alignment approaches) it is important to align them to the genome, not to the transcriptome, and to use a splice-aware aligner (i.e., a short-read alignment tool that can deal with reads that span across introns) such as TopHat2 [@TopHat2], GSNAP [@GSNAP], or STAR [@STAR]. The explanation of the analysis workflow presented here starts with the aligned reads in the SAM format. If you are unfamiliar with the process of aligning reads to obtain SAM files, you can find a summary how we proceeded in preparing the pasilla data in the vignette for the `r BiocStyle::Biocpkg("pasilla")` data package [@pasillaVignette] and a more extensive explanation, using the same data set, in our protocol article on differential expression calling [@DEProt]. ## Preparing the annotation {#preparing-the-annotation-bioc} In a transcript annotations such as a GTF file, many exons appear multiple times, once for each transcript that contains them. We need to "collapse" or "flatten" this information to define *exon counting bins*, i.e., a list of intervals, each corresponding to one exon or part of an exon. Counting bins for parts of exons arise when an exonic region appears with different boundaries in different transcripts. See Figure 1 of the `r BiocStyle::Biocpkg("DEXSeq")` paper (@DEXSeqPaper) for an illustration. Here we use the function `exonicParts()` of the package `r BiocStyle::Biocpkg("GenomicFeatures")` to define exonic counting bins. First, we download a annotation file from *ENSEMBL* in *GTF* format, create an *txdb* object and then run the `exonicParts()` function. ```{r, warning=FALSE} library(GenomicFeatures) download.file( "https://ftp.ensembl.org/pub/release-62/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.25.62.gtf.gz", destfile="Drosophila_melanogaster.BDGP5.25.62.gtf.gz") txdb = makeTxDbFromGFF("Drosophila_melanogaster.BDGP5.25.62.gtf.gz") file.remove("Drosophila_melanogaster.BDGP5.25.62.gtf.gz") flattenedAnnotation = exonicParts( txdb, linked.to.single.gene.only=TRUE ) names(flattenedAnnotation) = sprintf("%s:E%0.3d", flattenedAnnotation$gene_id, flattenedAnnotation$exonic_part) ``` Note that the code above first converted the annotation *GTF* file into a *transcriptDb* object, but transcripts could also start from *txdb* objects available as Bioconductor packages or from objects available through `r BiocStyle::Biocpkg("AnnotationHub")`. ## Counting reads For each BAM file, we count the number of reads that overlap with each of the exon counting bin defined in the previous section. The read counting step is performed by the **summarizeOverlaps()** function of the `r BiocStyle::Biocpkg("GenomicAlignments")` package. To demonstrate this step, we will use a subset of the pasilla BAM files that are available through the `r BiocStyle::Biocpkg("pasillaBamSubset")` package. ```{r} library(pasillaBamSubset) library(Rsamtools) library(GenomicAlignments) bamFiles = c( untreated1_chr4(), untreated3_chr4() ) bamFiles = BamFileList( bamFiles ) seqlevelsStyle(flattenedAnnotation) = "UCSC" se = summarizeOverlaps( flattenedAnnotation, BamFileList(bamFiles), singleEnd=FALSE, fragments=TRUE, ignore.strand=TRUE ) ``` All `singleEnd=`, `fragments=`, `ignore.strand=` and `strandMode=` are important parameters to tune and are dataset specific. Two alternative ways to do the read counting process are the **HTSeq** python scripts provided with `r BiocStyle::Biocpkg("DEXSeq")` and the `r BiocStyle::Biocpkg("Rsubread")` package. Both of these approaches are described in the appendix section of this vignette. ## Building a _**DEXSeqDataSet**_ We can use the function `DEXSeqDataSetFromSE()` to build an _**DEXSeqDataSet**_ object. In a well-designed experiment, we would specify the conditions of the experiment in the `colData()` slot of the object. Since we have only two samples in our example, we just exemplify this step by specify the sample names as conditions. ```{r buildExonCountSet, eval=FALSE} library(DEXSeq) colData(se)$condition = factor(c("untreated1", "untreated3")) colData(se)$libType = factor(c("single-end", "paired-end")) dxd = DEXSeqDataSetFromSE( se, design= ~ sample + exon + condition:exon ) ``` # Standard analysis workflow ## Loading and inspecting the example data To demonstrate the rest of the work flow we will use a _**DEXSeqDataSet**_ that contains a subset of the data with only a few genes. This example object is available through the `r BiocStyle::Biocpkg("pasilla")` package. ```{r startVignette} data(pasillaDEXSeqDataSet, package="pasilla") ``` The _**DEXSeqDataSet**_ class is derived from the _**DESeqDataSet**_. As such, it contains the usual accessor functions for the column data, row data, and some specific ones. The core data in an _**DEXSeqDataSet**_ object are the counts per exon. Each row of the _**DEXSeqDataSet**_ contains in each column the count data from a given exon ('this') as well as the count data from the sum of the other exons belonging to the same gene ('others'). This annotation, as well as all the information regarding each column of the _**DEXSeqDataSet**_, is specified in the `colData()`. ```{r} colData(dxd) ``` We can access the first 5 rows from the count data by doing, ```{r} head( counts(dxd), 5 ) ``` Note that the number of columns is 14, the first seven (we have seven samples) corresponding to the number of reads mapping to out exonic regions and the last seven correspond to the sum of the counts mapping to the rest of the exons from the same gene on each sample. ```{r} split( seq_len(ncol(dxd)), colData(dxd)$exon ) ``` We can also access only the first five rows from the count belonging to the exonic regions ('this') (without showing the sum of counts from the rest of the exons from the same gene) by doing, ```{r} head( featureCounts(dxd), 5 ) ``` In both cases, the rows are labelled with gene IDs (here Flybase IDs), followed by a colon and the counting bin number. Since a counting bin corresponds to an exon or part of an exon, this ID is called the **feature ID** or **exon ID** within `r BiocStyle::Biocpkg("DEXSeq")`. The table content indicates the number of reads that have been mapped to each counting bin in the respective sample. To see details on the counting bins, we also print the first 3 lines of the feature data annotation: ```{r} head( rowRanges(dxd), 3 ) ``` So far, this table contains information on the annotation data, such as gene and exon IDs, genomic coordinates of the exon, and the list of transcripts that contain an exon. The accessor function `annotationData()` shows the design table with the sample annotation (which was passed as the second argument to `DEXSeqDataSetFromHTSeq()`): ```{r} sampleAnnotation( dxd ) ``` In the following sections, we will update the object by calling a number of analysis functions, always using the idiom `dxd = someFunction( dxd )`, which takes the _**dxd**_ object, fills in the results of the performed computation and writes the returned and updated object back into the variable _**dxd**_. ## Normalisation `r BiocStyle::Biocpkg("DEXSeq")` uses the same method as `r BiocStyle::Biocpkg("DESeq")` and `r BiocStyle::Biocpkg("DESeq2")`, which is provided in the function `estimateSizeFactors()`. ```{r} dxd = estimateSizeFactors( dxd ) ``` ## Dispersion estimation To test for differential exon usage, we need to estimate the variability of the data. This is necessary to be able to distinguish technical and biological variation (noise) from real effects on exon usage due to the different conditions. The information on the strength of the noise is inferred from the biological replicates in the data set and characterized by the so-called *dispersion*. In RNA-Seq experiments the number of replicates is typically too small to reliably estimate variance or dispersion parameters individually exon by exon, and therefore, variance information is shared across exons and genes, in an intensity-dependent manner. In this section, we discuss simple one-way designs: In this setting, samples with the same experimental condition, as indicated in the `condition` factor of the design table (see above), are considered as replicates -- and therefore, the design table needs to contain a column with the name `condition`. In Section \@ref(complexdesigns), we discuss how to treat more complicated experimental designs which are not accommodated by a single condition factor. To estimate the dispersion estimates, `r BiocStyle::Biocpkg("DEXSeq")` uses the approach of the package `r BiocStyle::Biocpkg("DESeq2")`. Internally, the functions from `r BiocStyle::Biocpkg("DEXSeq")` are called, adapting the parameters of the functions for the specific case of the `r BiocStyle::Biocpkg("DEXSeq")` model. Briefly, per-exon dispersions are calculated using a Cox-Reid adjusted profile likelihood estimation, then a dispersion-mean relation is fitted to this individual dispersion values andfinally, the fitted values are taken as a prior in order to shrink the per-exon estimates towards the fitted values. See the `r BiocStyle::Biocpkg("DESeq2")` paper for the rational behind the shrinkage approach (@deseq2). ```{r} dxd = estimateDispersions( dxd ) ``` As a shrinkage diagnostic, the _**DEXSeqDataSet**_ inherits the method `plotDispEsts()` that plots the per-exon dispersion estimates versus the mean normalised count, the resulting fitted valuesand the *a posteriori* (shrinked) dispersion estimates (Figure \@ref(fig:fitdiagnostics)). ```{r fitdiagnostics, fig.small=TRUE, fig.cap="Fit Diagnostics. The initial per-exon dispersion estimates (shown by black points), the fitted mean-dispersion values function (red line), and the shrinked values in blue"} plotDispEsts( dxd ) ``` # Testing for differential exon usage Having the dispersion estimates and the size factors, we can now test for differential exon usage. For each gene, `r BiocStyle::Biocpkg("DEXSeq")` fits a generalized linear model with the formula `~sample + exon + condition:exon` and compare it to the smaller model (the null model) `~ sample + exon`. In these formulae (which use the standard notation for linear model formulae in **R**; consult a text book on **R** if you are unfamiliar with it), `sample` is a factor with different levels for each sample, `condition` is the factor of experimental conditions that we defined when constructing the _**DEXSeqDataSet**_ object at the beginning of the analysis, and `exon` is a factor with two levels, `this` and `others`, that were specified when we generated our _**DEXSeqDataSet**_ object. The two models described by these formulae are fit for each counting bin, where the data supplied to the fit comprise **two** read count values for each sample, corresponding to the two levels of the `exon` factor: the number of reads mapping to the bin in question (level `this`), and the sum of the read counts from all other bins of the same gene (level `others`). Note that this approach differs from the approach described in the paper (@DEXSeqPaper) and used in older versions of `r BiocStyle::Biocpkg("DEXSeq")`; see Appendix \@ref(methodChanges) for further discussion. We have an interaction term `condition:exon`, but denote no main effect for `condition`. Note, however, that all observations from the same sample are also from the same condition, i.e., the `condition` main effects are absorbed in the `sample` main effects, because the `sample` factor is nested within the `condition` factor. The deviances of both fits are compared using a $\chi^2$-distribution, providing a $p$ value. Based on this $p$-value, we can decide whether the null model is sufficient to explain the data, or whether it may be rejected in favour of the alternative model, which contains an interaction coefficient for `condition:exon`. The latter means that the fraction of the gene's reads that fall onto the exon under the test differs significantly between the experimental conditions. The function `testForDEU()` performs these tests for each exon in each gene. ```{r testForDEU1} dxd = testForDEU( dxd ) ``` The resulting _**DEXSeqDataSet**_ object contains slots with information regarding the test. For some uses, we may also want to estimate relative exon usage fold changes. To this end, we call `estimateExonFoldChanges()`. Exon usage fold changes are calculated based on the coefficients of a GLM fit that uses the formula `count ~ condition + exon + condition:exon` where `condition` can be replaced with any of the column names of `colData()` (see man pages for details). The resulting coefficients allow the estimation of changes on the usage of exons across different conditions. Note that the differences on exon usage are distinguished from gene expression differences across conditions. For example, consider the case of a gene that is differentially expressed between conditions and has one exon that is differentially used between conditions. From the coefficients of the fitted model, it is possible to distinguish overall gene expression effects, that alter the counts from all the exons, from exon usage effects, that are captured by the interaction term `condition:exon` and that affect the each exon individually. ```{r estFC} dxd = estimateExonFoldChanges( dxd, fitExpToVar="condition") ``` So far in the pipeline, the intermediate and final results have been stored in the meta data of a _**DEXSeqDataSet**_ object, they can be accessed using the function `mcols()`. In order to summarize the results without showing the values from intermediate steps, we call the function `DEXSeqResults()`. The result is a _**DEXSeqResults**_ object, which is a subclass of a _**DataFrame**_ object. ```{r results1} dxr1 = DEXSeqResults( dxd ) dxr1 ``` The description of each of the column of the object _**DEXSeqResults**_ can be found in the metadata columns. ```{r results2} mcols(dxr1)$description ``` From this object, we can ask how many exonic regions are significant with a false discovery rate of 10%: ```{r tallyExons} table ( dxr1$padj < 0.1 ) ``` We may also ask how many genes are affected ```{r tallyGenes} table ( tapply( dxr1$padj < 0.1, dxr1$groupID, any ) ) ``` Remember that our example data set contains only a selection of genes. We have chosen these to contain interesting cases; so the fact that such a large fraction of genes is significantly affected here is not typical. To see how the power to detect differential exon usage depends on the number of reads that map to an exon, a so-called MA plot is useful, which plots the logarithm of fold change versus average normalized count per exon and marks by red colour the exons which are considered significant; here, the exons with an adjusted $p$ values of less than $0.1$ (Figure \@ref(fig:MvsA). There is of course nothing special about the number $0.1$, and you can specify other thresholds in the call to `plotMA()`. ```{r MvsA, fig.small=TRUE, fig.cap="MA plot. Mean expression versus $log{2}$ fold change plot. Significant hits at an $FDR=0.1$ are coloured in red."} plotMA( dxr1, cex=0.8 ) ``` # Additional technical or experimental variables {#complexdesigns} In the previous section we performed a simple analysis of differential exon usage, in which each sample was assigned to one of two experimental conditions. If your experiment is of this type, you can use the work flow shown above. All you have to make sure is that you indicate which sample belongs to which experimental condition when you construct the _**DEXSeqDataSet**_ object. Do so by means of a column called `condition` in the sample table. If you have a more complex experimental design, you can provide different or additional columns in the sample table. You then have to indicate the design by providing explicit formulae for the test. In the `r BiocStyle::Biocpkg("pasilla")` dataset, some samples were sequenced in single-end and others in paired-end mode. Possibly, this influenced counts and should hence be accounted for. We therefore use this as an example for a complex design. When we constructed the _**DEXSeqDataSet**_, we provided in the `colData()` an additional column called `type`, which has been stored in the object: ```{r design} sampleAnnotation(dxd) ``` We specify two design formulae, which indicate that the `type` _**factor**_ should be treated as a blocking factor: ```{r formulas2} formulaFullModel = ~ sample + exon + type:exon + condition:exon formulaReducedModel = ~ sample + exon + type:exon ``` Compare these formulae with the default formulae given in previous sections. We have added, in both the full model and the reduced model, the term `type:exon`. Therefore, any dependence of exon usage on library type will be absorbed by this term and accounted for equally in the full and a reduced model, and the likelihood ratio test comparing them will only detect differences in exon usage that can be attributed to `condition`, independent of `type`. Next, we estimate the dispersions. This time, we need to inform the `estimateDispersions()` function about our design by providing the full model's formula, which should be used instead of the default formula. ```{r estDisps_again} dxd = estimateDispersions( dxd, formula = formulaFullModel ) ``` The test function now needs to be informed about both formulae ```{r test_again} dxd = testForDEU( dxd, reducedModel = formulaReducedModel, fullModel = formulaFullModel ) ``` Finally, we get a summary table, as before. ```{r res_again} dxr2 = DEXSeqResults( dxd ) ``` How many significant DEU cases have we got this time? ```{r table2} table( dxr2$padj < 0.1 ) ``` We can now compare with the previous result: ```{r table3} table( before = dxr1$padj < 0.1, now = dxr2$padj < 0.1 ) ``` Accounting for the library type has allowed us to find six more hits, which confirms that accounting for the covariate improves power. # Visualization The `plotDEXSeq()` provides a means to visualize the results of an analysis. ```{r plot1, fig.height=8, fig.width=12, fig.cap="Fitted expression. The plot represents the expression estimates from a call to `testForDEU()`. Shown in red is the exon that showed significant differential exon usage."} plotDEXSeq( dxr2, "FBgn0010909", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 ) ``` ```{r checkClaim} wh = (dxr2$groupID=="FBgn0010909") stopifnot(sum(dxr2$padj[wh] < formals(plotDEXSeq)$FDR)==1) ``` The result is shown in Figure \@ref(fig:plot1). This plot shows the fitted expression values of each of the exons of gene *FBgn0010909*, for each of the two conditions, treated and untreated. The function `plotDEXSeq()` expects at least two arguments, the _**DEXSeqDataSet**_ object and the gene ID. The option `legend=TRUE` causes a legend to be included. The three remaining arguments in the code chunk above are ordinary plotting parameters which are simply handed over to _**R**_ standard plotting functions. They are not strictly needed and included here to improve appearance of the plot. See the help page for `par()` for details. Optionally, one can also visualize the transcript models (Figure \@ref(fig:plot2)), which can be useful for putting differential exon usage results into the context of isoform regulation. ```{r plot2, fig.height=8, fig.width=12, fig.cap="Transcripts. As in the previous plots, but including the annotated transcript models"} plotDEXSeq( dxr2, "FBgn0010909", displayTranscripts=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 ) ``` Other useful options are to look at the count values from the individual samples, rather than at the model effect estimates. For this display (option `norCounts=TRUE`), the counts are normalized by dividing them by the size factors (Figure \@ref(fig:plot3)) ```{r plot3, fig.height=8, fig.width=12, fig.cap="Normalized counts. As in the previous plots, with normalized count values of each exon in each of the samples."} plotDEXSeq( dxr2, "FBgn0010909", expression=FALSE, norCounts=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 ) ``` As explained in Section \@ref(overview), `r BiocStyle::Biocpkg("DEXSeq")` is designed to find changes in relative exon usage, i.e., changes in the expression of individual exons that are not simply the consequence of overall up- or down-regulation of the gene. To visualize such changes, it is sometimes advantageous to remove overall changes in expression from the plots. Use the option `splicing=TRUE` for this purpose. ```{r plot4, fig.height=8, fig.width=12, fig.cap="Fitted splicing. As in the previous plots, but after subtraction of overall changes in gene expression."} plotDEXSeq( dxr2, "FBgn0010909", expression=FALSE, splicing=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 ) ``` To generate an easily browsable, detailed overview over all analysis results, the package provides an HTML report generator, implemented in the function `DEXSeqHTML()`. This function uses the package `r BiocStyle::CRANpkg("hwriter")` [@hwriter] to create a result table with links to plots for the significant results, allowing a more detailed exploration of the results. ```{r DEXSeqHTML, cache=TRUE, eval=FALSE} DEXSeqHTML( dxr2, FDR=0.1, color=c("#FF000080", "#0000FF80") ) ``` # Parallelization and large number of samples `r BiocStyle::Biocpkg("DEXSeq")` analyses can be computationally heavy, especially with data sets that comprise a large number of samples, or with genomes containing genes with large numbers of exons. While some steps of the analysis work on the whole data set, the computational load can be parallelized for some steps. We use the package `r BiocStyle::Biocpkg("BiocParallel")`, and implemented the `BPPARAM=` parameter of the functions `estimateDispersions()`, `testForDEU()` and `estimateExonFoldChanges()`: ```{r para1,cache=TRUE,results='hide', eval=FALSE} BPPARAM = MultiCoreParam(4) dxd = estimateSizeFactors( dxd ) dxd = estimateDispersions( dxd, BPPARAM=BPPARAM) dxd = testForDEU( dxd, BPPARAM=BPPARAM) dxd = estimateExonFoldChanges(dxd, BPPARAM=BPPARAM) ``` For running analysis with a large number of samples (e.g. more than 20), we recommend configuring `BatchJobsParam()` classes in order to distribute the calculations across a computer cluster and significantly reduce running times. Users might also consider reducing the number of tests by filtering for lowly expressed isoforms or exonic regions with low counts. Appart from reducing running times, this filtering step also leads to more accurate results [@soneson2015]. Nevertheless, `r BiocStyle::Biocpkg("DEXSeq")` was designed to work with small sample numbers, and users might have to consider alternatives for large numbers of samples such as the diffSplice function from `r BiocStyle::Biocpkg("limma")`, `r BiocStyle::Biocpkg("satuRn")` or non-parametric approaches for exon usage testing. # Perform a standard differential exon usage analysis in one command In the previous sections, we went through the analysis step by step. Once you are sufficiently confident about the work flow for your data, its invocation can be streamlined by the wrapper function `DEXseq()`, which runs the analysis shown above through a single function call. In the simplest case, construct the _**DEXSeqDataSet**_ as shown before, then run `DEXSeq()` passing the _**DEXSeqDataSet**_ as only argument, this function will output a _**DEXSeqResults**_ object. ```{r alldeu} dxr = DEXSeq(dxd) class(dxr) ``` # Appendix ## Controlling FDR at the gene level `r BiocStyle::Biocpkg("DEXSeq")` returns one $p$-value for each exonic region. Using these p-values, a rejection threshold is established according to a multiple testing correction method. Consider a setting with $M$ exonic regions, where $p_{k}$ is the $p$-value for the $k$ exonic region and $\theta \in {p_{1}, ..., p_{M}} \subset [0, 1]$ is a rejection threshold, leading to $V=\left|\left\{k: p_k\le\theta\right\}\right|$ rejections (i.e. number of exonic regions being DEU). The Benjamini-Hochberg method establishes that FDR is controlled at the level $q^{*}$ where $q^* = \frac{M \theta}{\left|\left\{k: p_k\le\theta\right\}\right|}$. Note that $\theta$ in the denominator of this expression is the probability of rejecting a single hypothesis if it is true. $q^*$ allows to estimate the number of exons that are differentially used at a specific FDR. However, sometimes it is informative to know what is the FDR at the gene level, i.e. knowing the numer of genes with at least one differentially used exon while keeping control of FDR. For this scenario, let $p_{i,l}$ be the $p$-value from the DEU test for exon $l$ in gene $i$, let $n_i$ the number of exons in gene $i$, and $M$ the number of genes. A gene $i$ has at least one DEU exon if $\exists l$ such that $p_{i,l}\le\theta$. Then, FDR is controlled at the level $\frac{\sum_{i=1}^{M} 1-(1-\theta)^{n_i}}{\left|\left\{i: \exists l:p_{il}\le\theta\right\}\right|}$. The implementation of the formula above is provided by the function `perGeneQValue()`. For the `r BiocStyle::Biocpkg("pasilla")` example, the code below calculates the number of genes (at a FDR of 10%) with at least one differentially used exon. ```{r pergeneqval} numbOfGenes = sum( perGeneQValue(dxr) < 0.1) numbOfGenes ``` ## Preprocessing with python ### Preparing the annotation file with python The preprocessing steps of the analysis are done using two Python scripts that we provide with `r BiocStyle::Biocpkg("DEXSeq")`. You do not need to know how to use Python; however you have to install the Python package **HTSeq**, following the explanations given on the [**HTSeq** web page](http://www-huber.embl.de/users/anders/HTSeq/doc/install.html). Once you have installed **HTSeq**, you can use the two Python scripts, *dexseq_prepare_annotation.py* and *dexseq_count.py*, that come with the `r BiocStyle::Biocpkg("DEXSeq")` package. If you have trouble finding them, start R and ask for the installation directory with: ```{r systemFile} pythonScriptsDir = system.file( "python_scripts", package="DEXSeq" ) list.files(pythonScriptsDir) system.file( "python_scripts", package="DEXSeq", mustWork=TRUE ) ``` The displayed path should contain the two files. If it does not, try to re-install `r BiocStyle::Biocpkg("DEXSeq")` (as usual, with the function `BiocManager::install()`). An alternative work flow, which replaces the two Python-based steps with R based code, is also available in the Appendix section of this vignette. The Python scripts expect a GTF file with gene models for your species. We have tested our tools chiefly with GTF files from Ensembl and hence recommend to use these, as files from other providers sometimes do not adhere fully to the GTF standard and cause the preprocessing to fail. Ensembl GTF files can be found in the "FTP Download" sections of the Ensembl web sites (i.e., Ensembl, EnsemblPlants, EnsemblFungi, etc.). Make sure that your GTF file uses a coordinate system that matches the reference genome that you have used for aligning your reads. The safest way to ensure this is to download the reference genome from Ensembl, too. If you cannot use an Ensembl GTF file, see Appendix \@ref(requirements-on-gtf-files) for advice on converting GFF files from other sources to make them suitable as input for the *dexseq_prepare_annotation.py* script. In a GTF file, many exons appear multiple times, once for each transcript that contains them. We need to "collapse" this information to define *exon counting bins*, i.e., a list of intervals, each corresponding to one exon or part of an exon. Counting bins for parts of exons arise when an exonic region appears with different boundaries in different transcripts. See Figure 1 of the `r BiocStyle::Biocpkg("DEXSeq")` paper (@DEXSeqPaper) for an illustration. The Python script *dexseq_prepare_annotation.py* takes an Ensembl GTF file and translates it into a GFF file with collapsed exon counting bins. Make sure that your current working directory contains the GTF file and call the script (from the command line shell, not from within R) with ``` python /path/to/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py Drosophila_melanogaster.BDGP5.72.gtf Dmel.BDGP5.25.62.DEXSeq.chr.gff ``` In this command, which should be entered as a single line, replace */path/to.../python_scripts* with the correct path to the Python scripts, which you have found with the call to `system.file()` shown above. *Drosophila_melanogaster.BDGP5.72.gtf* is the Ensembl GTF file (here the one for fruit fly, already de-compressed) and *Dmel.BDGP5.25.62.DEXSeq.chr.gff* is the name of the output file. In the process of forming the counting bins, the script might come across overlapping genes. If two genes on the same strand are found with an exon of the first gene overlapping with an exon of the second gene, the script's default behaviour is to combine the genes into a single "aggregate gene" which is subsequently referred to with the IDs of the individual genes, joined by a plus ('+') sign. If you do not like this behaviour, you can disable aggregation with the option `-r no`. Without aggregation, exons that overlap with other exons from different genes are simply skipped. ### Counting reads with python Read counting is done with the script *python_count.py*: ``` python /path/to/library/DEXSeq/python_scripts/dexseq_count.py Dmel.BDGP5.25.62.DEXSeq.chr.gff untreated1.sam untreated1fb.txt ``` This command (again, to be entered as a single line) expects two files in the current working directory, namely the GFF file produced in the previous step (here *Dmel.BDGP5.25.62.DEXSeq.chr.gff*) and a SAM file with the aligned reads from a sample (here the file *untreated1.sam* with the aligned reads from the first control sample). The command generates an output file, here called *untreated1fb.txt*, with one line for each exon counting bin defined in the flattened GFF file. The lines contain the exon counting bin IDs (which are composed of gene IDs and exon bin numbers), followed by a integer number which indicates the number of reads that were aligned such that they overlap with the counting bin. Use the script multiple times to produce a count file from each of your SAM files. There are a number of crucial points to pay attention to when using the *python_count.py* script: **Paired-end data: ** If your data is from a paired-end sequencing run, you need to add the option `-p yes` to the command to call the script. (As usual, options have to be placed before the file names, surrounded by spaces.) In addition, the SAM file needs to be sorted, either by read name or by position. Most aligners produce sorted SAM files; if your SAM file is not sorted, use `samtools sort -n` to sort by read name (or `samtools sort`) to sort by position. (See @DEProt, if you need further explanations on how to sort SAM files.) Use the option `-r pos` or `-r name` to indicate whether your paired-end data is sorted by alignment position or by read name. **Strandedness:** By default, the counting script assumes your library to be *strand-specific*, i.e., reads are aligned to the same strand as the gene they originate from. If you have used a library preparation protocol that does not preserve strand information (i.e., reads from a given gene can appear equally likely on either strand), you need to inform the script by specifying the option `-s no`. If your library preparation protocol reverses the strand (i.e., reads appear on the strand opposite to their gene of origin), use `-s reverse`. In case of paired-end data, the default (`-s yes`) means that the read from the first sequence pass is on the same strand as the gene and the read from the second pass on the opposite strand ("forward-reverse" or "fr" order in the parlance of the Bowtie/TopHat manual) and the options `-s reverse` specifies the opposite case. **SAM and BAM files:** By default, the script expects its input to be in plain-text SAM format. However, it can also read BAM files, i.e., files in the the compressed binary variant of the SAM format. If you wish to do so, use the option `-f bam`. This works only if you have installed the Python package [*pysam*](https://code.google.com/p/pysam/). **Alignment quality:** The scripts takes a further option, `-a` to specify the minimum alignment quality (as given in the fifth column of the SAM file). All reads with a lower quality than specified (with default `-a 10`) are skipped. **Help pages:** Calling either script without arguments displays a help page with an overview of all options and arguments. ### Reading the data from the python ouputs into R The remainder of the analysis is now done in **R**. We will use the output of the python scripts for the *pasilla* experiment, that can be found in the package `r BiocStyle::Rpackage("pasilla")`. Open an **R** session and type: ```{r loadDEXSeq} inDir = system.file("extdata", package="pasilla") countFiles = list.files(inDir, pattern="fb.txt$", full.names=TRUE) basename(countFiles) flattenedFile = list.files(inDir, pattern="gff$", full.names=TRUE) basename(flattenedFile) ``` Now, we need to prepare a sample table. This table should contain one row for each library, and columns for all relevant information such as name of the file with the read counts, experimental conditions, technical information and further covariates. To keep this vignette simple, we construct the table on the fly. ```{r sampleTable} sampleTable = data.frame( row.names = c( "treated1", "treated2", "treated3", "untreated1", "untreated2", "untreated3", "untreated4" ), condition = c("knockdown", "knockdown", "knockdown", "control", "control", "control", "control" ), libType = c( "single-end", "paired-end", "paired-end", "single-end", "single-end", "paired-end", "paired-end" ) ) ``` While this is a simple way to prepare the table, it may be less error-prone and more prudent to used an existing table that had already been prepared when the experiments were done, save it in CSV format and use the **R** function `read.csv()` to load it. In any case, it is vital to check the table carefully for correctness. ```{r check} sampleTable ``` Our table contains the sample names as row names and the two covariates that vary between samples: first the experimental condition (factor _**condition**_ with levels **control** and **treatment**) and the library type (factor **libType**), which we included because the samples in this particular experiment were sequenced partly in single-end runs and partly in paired-end runs. For now, we will ignore this latter piece of information, and postpone the discussion of how to include such additional covariates to Section \@ref(complexdesigns). If you have only a single covariate and want to perform a simple analysis, the column with this covariate should be named **condition**. Now, we construct an _**DEXSeqDataSet**_ object from this data. This object holds all the input data and will be passed along the stages of the subsequent analysis. We construct the object with the function `DEXSeqDataSetFromHTSeq()`, as follows: ```{r message=FALSE} library( "DEXSeq" ) dxd = DEXSeqDataSetFromHTSeq( countFiles, sampleData=sampleTable, design= ~ sample + exon + condition:exon, flattenedfile=flattenedFile ) ``` The function takes four arguments. First, a vector with names of count files, i.e., of files that have been generated with the *dexseq_count.py* script. The function will read these files and arrange the count data in a matrix, which is stored in the _**DEXSeqDataSet**_ object `dxd`. The second argument is our sample table, with one row for each of the files listed in the first argument. This information is simply stored as is in the object's meta-data slot (see below). The third argument is a formula of the form `~ sample + exon + condition:exon` that specifies the contrast with of a variable from the sample table columns and the **'exon'** variable. Using this formula, we are interested in differences in exon usage due to the "condition" variable changes. Later in this vignette, we will how to add additional variables for complex designs. The fourth argument is a file name, now of the flattened GFF file that was generated with *dexseq_prepare_annotation.py* and used as input to *dexseq_count.py* when creating the count file. ## Preprocessing using featureCounts As an alternative to *HTSeq*, one could use the function `featureCounts()` to count the number of reads overlapping with each exonic region. The output files from `featureCounts()` can then be used as input to DEXSeq. In [this](https://github.com/vivekbhr/Subread_to_DEXSeq) *Github* repository, Vivek Bhardwaj provides scripts and documentation that integrate `featureCounts()` with `r BiocStyle::Biocpkg("DEXSeq")`. He also provides a function to import the count files into _**R**_ and generate a _**DEXSeqDataSet**_ object. ## Further accessors The function `geneIDs()` returns the gene ID column of the feature data as a character vector, and the function `exonIDs()` return the exon ID column as a _**factor**_. ```{r acc} head( geneIDs(dxd) ) head( exonIDs(dxd) ) ``` These functions are useful for subsetting an _**DEXSeqDataSet**_ object. ## Overlap operations The methods `subsetByOverlaps()` and `findOverlaps()` have been implemented for the _**DEXSeqResults**_ object, the `query=` argument must be a _**DEXSeqResults**_ object. ```{r grmethods} interestingRegion = GRanges( "chr2L", IRanges(start=3872658, end=3875302) ) subsetByOverlaps( x=dxr, ranges=interestingRegion ) findOverlaps( query=dxr, subject=interestingRegion ) ``` This functions could be useful for further downstream analysis. ## Methodological changes since publication of the paper {#methodChanges} In our paper [@DEXSeqPaper], we suggested to fit for each exon a model that includes separately the counts for all the gene's exons. However, this turned out to be computationally inefficient for genes with many exons, because the many exons required large model matrices, which are computationally expensive to deal with. We have therefore modified the approach: when fitting a model for an exon, we now sum up the counts from all the other exon and use only the total, rather than the individual counts in the model. Now, computation time per exon is independent of the number of other exons in the gene, which improved `r BiocStyle::Biocpkg("DEXSeq")`'s scalability. While the $p$ values returned by the two approaches are not exactly equal, the differences were very minor in the tests that we performed. Deviating from the paper's notation, we now use the index $i$ to indicate a specific counting bin, with $i$ running through all counting bins of all genes. The samples are indexed with $j$, as in the paper. We write $K_{ij0}$ for the count or reads mapped to counting bin $i$ in sample $j$ and $K_{ij1}$ for the sum of the read counts from all other counting bins in the same gene. Hence, when we write $K_{ijl}$, the third index $l$ indicates whether we mean the read count for bin $i$ ($l=0$) or the sum of counts for all other bins of the same gene ($l=1$). As before, we fit a GLM of the negative binomial (NB) family $ K_{ijl} \sim \operatorname{NB}(\text{mean}=s_j\mu_{ijl},\text{dispersion}=\alpha_i)$ and we write $\log_2 \mu_{ijl} = \beta^\text{S}_{ij} + l \beta^\text{E}_{i} + \beta^\text{EC}_{i\rho_j}$. This model is fit separately for each counting bin $i$. The coefficient $\beta^\text{S}_{ij}$ accounts for the sample-specific contribution (factor `sample`), the term $\beta^\text{E}_{i}$ is only included if $l=1$ and hence estimates the logarithm of the ratio $K_{ij1}/K_{ij0}$ between the counts for all other exons and the counts for the tested exon. As this coefficient is estimated from data from all samples, it can be considered as a measure of "average exon usage". In the R model formula, it is represented by the term `exon` with its two levels `this` ($l=0$) and `others` ($l=1$). Finally, the last term, $\beta^\text{EC}_{i,\rho_j}$, captures the interaction `condition:exon`, i.e., the change in exon usage if sample $j$ is from experimental condition group $\rho(j)$. Here, the first condition, $\rho=0$, is absorbed in the sample coefficients, i.e., $\beta^\text{EC}_{i0}$ is fixed to zero and does not appear in the model matrix. For the dispersion estimation, one dispersion value $\alpha_i$ is estimated with Cox-Reid-adjusted maximum likelihood using the full model given above. A mean-variance relation is fitted using the individual dispersion values. Finally, the individual values are shrinked towards the fitted values. For more details about this shrinkage approach look at the `r BiocStyle::Biocpkg("DESeq2")` vignette and/or its manuscript [@deseq2]. For the likelihood ratio test, this full model is fit and compared with the fit of the reduced model, which lacks the interaction term $\beta^\text{EC}_{i\rho_j}$. As described in Section \@ref{complexdesigns}, alternative model formulae can be specified. ## Requirements on GTF files In the initial preprocessing step described in Section \@ref{preparing-the-annotation}, the Python script *dexseq_prepare_annotation.py* is used to convert a GTF file with gene models into a GFF file with collapsed gene models. We recommend to use GTF files downloaded from Ensembl as input for this script, as files from other sources may deviate from the format expected by the script. Hence, if you need to use a GTF or GFF file from another source, you may need to convert it to the expected format. To help with this task, we here give details on how the *dexseq_prepare_annotation.py* script interprets a GFF file. * The script only looks at `exon` lines, i.e., at lines which contain the term `exon` in the third ("type") column. All other lines are ignored. * Of the data in these lines, the information about chromosome, start, end, and strand (1st, 4th, 5th, and 7th column) are used, and, from the last column, the attributes `gene_id` and `transcript_id`. The rest is ignored. * The `gene_id` attribute is used to see which exons belong to the same gene. It must be called `gene_id` (and not `Parent` as in GFF3 files, or `GeneID` as in some older GFF files), and it must give the same identifier to all exons from the same gene, even if they are from different transcripts of this gene. (This last requirement is not met by GTF files generated by the Table Browser function of the UCSC Genome Browser.) * The `transcript_id` attribute is used to build the *transcripts* attribute in the flattened GFF file, which indicates which transcripts contain the described counting bin. This information is needed only to draw the transcript model at the bottom of the plots when the `displayTranscripts=` option to `plotDEXSeq()` is used. Therefore, converting a GFF file to make it suitable as input to *dexseq_prepare_annotation.py* amounts to making sure that the exon lines have type `exon` and that the atributes giving gene ID (or gene symbol) and transcript ID are called `gene_id` and `transcript_id`, with this exact spelling. Remember to also take care that the chromosome names match those in your SAM files, and that the coordinates refer to the reference assembly that you used when aligning your reads. # Session Information The session information records the versions of all the packages used in the generation of the present document. ```{r sessionInfo} sessionInfo() ``` # References