---
title: "Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification"
author:
- name: Michael I. Love
  affiliation: 
  - Department of Biostatistics, UNC-Chapel Hill, Chapel Hill, NC, US
  - Department of Genetics, UNC-Chapel Hill, Chapel Hill, NC, US
- name: Charlotte Soneson
  affiliation:
  - Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- name: Rob Patro
  affiliation: Department of Computer Science, Stony Brook University, Stony Brook, NY, US
date: 1 October, 2018
vignette: >
  %\VignetteIndexEntry{RNA-seq workflow for differential transcript usage following Salmon quantification}
  %\VignetteEngine{knitr::rmarkdown}
abstract: |
  Detection of differential transcript usage (DTU) from RNA-seq data
  is an important bioinformatic analysis that complements differential
  gene expression analysis. Here we present a simple workflow using a
  set of existing R/Bioconductor packages for analysis of DTU. We
  show how these packages can be used downstream of RNA-seq
  quantification using the Salmon software package. The entire
  pipeline is fast, benefiting from inference steps by Salmon
  to quantify expression at the transcript level. The workflow
  includes live, runnable code chunks for analysis using DRIMSeq and
  DEXSeq, as well as for performing two-stage testing of DTU using the
  stageR package, a statistical framework to screen at the gene level
  and then confirm which transcripts within the significant genes show
  evidence of DTU. We evaluate these packages and other related
  packages on a simulated dataset with parameters estimated from real
  data.
keywords: RNA-seq, workflow, differential transcript usage, Salmon, DRIMSeq, DEXSeq, stageR, tximport
bibliography: bibliography.bib
output: BiocStyle::html_document
---

<!-- to compile this: 
  devtools::install()
  rmarkdown::render("rnaseqDTU.Rmd")
-->

<!-- a list of all required libraries:
  reqlibs = sub(".*library\\((.*?)\\).*","\\1",grep("library\\(",readLines("rnaseqDTU.Rmd"),value=TRUE))
  find.package(reqlibs)
-->

```{r style, echo=FALSE, message=FALSE, warning=FALSE, results="asis"}
library(knitr)
library(rmarkdown)
opts_chunk$set(message=TRUE, warning=FALSE, error=FALSE, 
               cache=FALSE, fig.width=5, fig.height=5)
```

**R version**: `r R.version.string`

**Bioconductor version**: `r BiocManager::version()`

**Package**: `r packageVersion("rnaseqDTU")`

**Publication DOI**: [10.12688/f1000research.15398.3](http://dx.doi.org/10.12688/f1000research.15398.3)

# Getting help

If you have questions about this workflow or any Bioconductor
software, please post these to the 
[Bioconductor support site](https://support.bioconductor.org). 
If the questions concern a specific package, you can tag the post with
the name of the package, or for general questions about the workflow,
tag the post with `rnaseqdtu`.  Note the
[posting guide](http://www.bioconductor.org/help/support/posting-guide/)
for crafting an optimal question for the support site.

# Introduction

RNA-seq experiments can be analyzed to detect differences across groups of
samples in total gene expression -- the total expression 
produced by all isoforms of a gene -- and additionally differences in
transcript isoform usage within a gene. If the amount of expression
switches among two or more isoforms of a gene, then the total gene
expression may not change by a detectable amount, but 
the differential transcript usage (DTU) is nevertheless biologically relevant.
DTU is common when comparing expression across cell types: recent
analysis of the Genotype-Tissue Expression Project (GTEx) [@GTEx]
dataset demonstrated that half of all expressed genes contained
tissue-specific isoforms [@Reyes2018]. DTU may produce functionally
different gene products through alternative splicing and changes to
the coding sequence of the transcript, and DTU may result in
transcripts with different untranslated regions on the 5' or 3' end of
the transcript, which may affect binding sites of RNA binding
proteins. [@Reyes2018] found that alternative usage
of transcription start and termination sites was a more common event
than alternative splicing when examining DTU events across tissues in
GTEx.  Specific patterns of DTU have been identified in a number of
diseases, including cancer, retinal diseases, and neurological
disorders, among others [@Scotti2018]. Large-scale analyses of cancer
transcriptomic data, comparing tumor to normal samples, have
identified that protein domain losses are a common feature of DTU in
cancer, including domains involved in protein-protein interactions
[@Vitting2017;@Climente2017].

While many tutorials and workflows in the
Bioconductor project address differential gene expression, there are
fewer workflows for performing a differential transcript usage analysis,
which provides critical and complementary information to a 
gene-level analysis. Some of the existing Bioconductor packages and
functions that can be used for statistical analysis of DTU include 
*DEXSeq* (originally designed for differential exon usage) [@Anders2012Detecting], 
`diffSpliceDGE` from the *edgeR* package [@Robinson2009EdgeR;@McCarthy2012Differential],
`diffSplice` from the *limma* package [@Smyth2004Linear;@Law2014Voom], and
*DRIMSeq* [@Nowicka2016DRIMSeq].
Other Bioconductor packages which are designed around a DTU analysis include 
*stageR* [@Van2017StageR], *SGSeq* [@Goldstein2016SGSeq], and
*IsoformSwitchAnalyzeR* [@Vitting2018].
*stageR* can be used for post-processing of transcript- 
and gene-level p-values from DTU detection methods, and will be discussed in this workflow.
*SGSeq* can be used to visualize 
splice events, and leverages *DEXSeq* or *limma*
for differential testing of splice variant usage.
The Bioconductor package *IsoformSwitchAnalyzeR* is
well documented and the vignette available from the 
[IsoformSwitchAnalyzeR landing page](https://bioconductor.org/packages/IsoformSwitchAnalyzeR)
can be seen as an alternative to this workflow.
*IsoformSwitchAnalyzeR*  is designed for the downstream analysis of 
functional consequences of identified isoform switches.
It allows for import of data from various 
quantification methods, including *Salmon*, and allows for statistical
inference using *DRIMSeq*. In addition, *IsoformSwitchAnalyzeR* includes
functions for obtaining the nucleotide and amino acid sequence
consequences of isoform switching, which is not covered in this workflow.
Other packages related to splicing can be found at the *BiocViews* links for
[DifferentialSplicing](https://bioconductor.org/packages/release/BiocViews.html#___DifferentialSplicing) and
[AlternativeSplicing](https://bioconductor.org/packages/release/BiocViews.html#___AlternativeSplicing).
For more information about the Bioconductor project and its core
infrastructure, please refer to the overview by @Huber2015Orchestrating.

We note that there are numerous other methods for detecting
differential transcript usage outside of the Bioconductor project. The
*DRIMSeq* publication is a good reference for these, having descriptions and
comparisons with many current methods [@Nowicka2016DRIMSeq].
This workflow will build on the methods and vignettes from three
Bioconductor packages: *DRIMSeq*, *DEXSeq*, and *stageR*.

Previously, some of the developers of the Bioconductor packages 
*edgeR* and *DESeq2* have collaborated to develop the
*tximport* package [@Soneson2015Differential] for summarizing the output of fast
transcript-level quantifiers, such as *Salmon* [@Patro2017Salmon],
*Sailfish* [@Patro2014Sailfish], and *kallisto* [@Bray2016Near]. The *tximport* package focuses on
preparing estimated transcript-level counts, abundances
and effective transcript lengths, for gene-level statistical analysis
using *edgeR* [@Robinson2009EdgeR], *DESeq2* [@Love2014Moderated], or
*limma-voom* [@Law2014Voom]. *tximport* produces an 
offset matrix to accompany gene-level counts, that accounts for
a number of RNA-seq biases as well as differences in
transcript usage among transcripts of different length that would bias
an estimator of gene fold change based on the gene-level counts
[@Trapnell2013Differential]. 
*tximport* can alternatively produce a matrix of data that
is roughly on the scale of counts, by scaling transcript-per-million
(TPM) abundances to add up to 
the total number of mapped reads. This counts-from-abundance approach
directly corrects for technical biases and differential transcript
usage across samples, obviating the need for the accompanying offset matrix.

Complementary to an analysis of differential gene expression, one can
use *tximport* to import transcript-level estimated counts, and then pass these
counts to packages such as *DRIMSeq* or *DEXSeq* for statistical
analysis of differential transcript usage. Following a
transcript-level analysis, one can aggregate evidence of differential
transcript usage to the gene level. The *stageR* package in
Bioconductor provides a statistical framework to *screen* at the
gene-level for differential transcript usage with gene-level adjusted p-values,
followed by *confirmation* of which transcripts within the significant
genes show differential usage with transcript-level adjusted p-values
[@Van2017StageR]. The method controls the _overall false discovery
rate_ (OFDR)
[@Heller2009] for such a two-stage procedure, which will be discussed
in more detail later in the 
workflow. We believe that *stageR* represents a principled approach to
analyzing transcript usage changes, as the methods can be evaluated
against a target error rate in a manner that mimics how the methods
will be used in practice. That is, following rejection of
the null hypothesis at the gene-level, investigators would likely
desire to know which transcripts within a gene participated in the
differential usage.

```{r diagram, echo=FALSE, out.width="100%", fig.cap="Diagram of the methods presented in this workflow. The left-side shows two paths for performing differential transcript usage (DTU) using Bioconductor packages and the right-side shows two paths for performing differential gene expression (DGE). DTU and DGE are complementary analyses of changes in transcription across conditions. This workflow focuses mostly on DTU, as there are a number of other published Bioconductor workflows for DGE. In bold are the recommended choices for quantification and filtering Salmon transcript-level data as input to the statistical methods. The recommended filters implemented in DRIMSeq, and applied upstream of DRIMSeq and DEXSeq, are discussed in this workflow."}
knitr::include_graphics("figs/swimdown_diagram.png")
```

Here we provide a basic workflow for detecting
differential transcript usage using Bioconductor packages,
following quantification of transcript abundance using the *Salmon*
method (Figure \@ref(fig:diagram)). 
This workflow includes live, runnable code chunks for analysis
using *DRIMSeq* and *DEXSeq*, as well as for performing stage-wise
testing of differential transcript usage using the *stageR*
package. For the workflow, we use data that is simulated, so that 
we can also evaluate the performance of methods for differential
transcript usage, as well as differential gene and transcript
expression. The simulation was constructed using distributional
parameters estimated from the GEUVADIS project RNA-seq dataset
[@Lappalainen2013Transcriptome] quantified by the *recount2* project
[@Collado2017Recount], including the 
expression levels of the transcripts, the amount of biological
variability of gene expression levels across samples, and realistic coverage
of reads along the transcript.

Note that a [published version of this workflow](http://dx.doi.org/10.12688/f1000research.15398.3) 
exists in the Bioconductor channel of the journal *F1000Research*.
The published version additionally contains evaluations of the methods
shown here alongside other popular methods for DTU, DGE, and DTE.

# Methods

## Simulation

First we describe details of the simulated data, which will be used in
the following workflow. Understanding the details of the
simulation will be useful for assessing the methods in the later sections.
All of the code used to simulate RNA-seq experiments and write
paired-end reads to FASTQ files can be found at an associated GitHub repository for
the simulation code [@swimdown], and the reads and quantification
files can be downloaded from Zenodo [@swimdowndata].
*Salmon* [@Patro2017Salmon] was used to estimate transcript-level
abundances for a single 
sample ([ERR188297](https://www.ebi.ac.uk/ena/data/view/ERR188297)) of the GEUVADIS project
[@Lappalainen2013Transcriptome], and this was used as
a baseline for transcript abundances in the simulation. Transcripts
that were associated with estimated counts less than 10 had abundance
thresholded to 0, all other transcripts were considered "expressed" (n=46,933).
*alpine* [@Love2016Alpine] was used to estimate realistic fragment GC
bias from 12 samples from the GEUVADIS 
project, all from the same sequencing center (the first 12 samples from
CNAG-CRG in Supplementary Table 2 from @Love2016Alpine). 
*DESeq2* [@Love2014Moderated] was used to estimate mean and dispersion
parameters for a Negative 
Binomial distribution for gene-level counts for 458 GEUVADIS samples
provided by the *recount2* project [@Collado2017Recount].
Note that, while gene-level dispersion estimates were used
to generate underlying transcript-level counts, additional uncertainty
on the transcript-level data is a natural consequence of the
simulation, as the transcript-level counts must be estimated (the
underlying transcript counts are not provided to the methods).

*polyester* [@Frazee2015Polyester] was used to simulate paired-end RNA-seq reads for two
groups of 12 samples each, with realistic fragment GC bias, and with
dispersion on transcript-level counts drawn from the joint distribution
of mean and dispersion values estimated from the GEUVADIS samples.
To compare *DRIMSeq* and *DEXSeq* in further detail, we generated
an additional simulation in which dispersion parameters were assigned
to genes via matching on the gene-level count, and then all
transcripts of a gene had counts generated using the same per-gene dispersion.
The first sample for group 1 and the first sample for group 2 followed the
realistic GC bias profile of the same GEUVADIS sample, and so on for all 12
samples. This pairing of the samples was used to generate balanced
data, but not used in the statistical analysis. 
*countsimQC* [@Soneson2017Towards] was used to
examine the properties of the simulation relative to the dataset used
for parameter estimation, and the full report can be accessed at the
associated GitHub repository for simulation code [@swimdown].

Differential expression across two groups was generated as follows:
The 46,933 expressed transcripts as defined above belonged to 15,017 genes.
70% (n=10,514) of these genes with expressed transcripts
were set as null genes, where abundance was not
changed across the two groups. For 10% (n=1,501) of genes, all isoforms were
differentially expressed at a log fold change between 1 and 2.58 (fold
change between 2 and 6). The set of transcripts in these genes was
classified as DGE (differential 
gene expression) by construction, and the expressed transcripts were also
DTE (differential transcript expression), but they did not count as
DTU (differential transcript usage), as the proportions within the
gene remained constant. To simulate balanced differential expression, one of
the two groups was randomly chosen to be the baseline, and the other group
would have its counts multiplied by the fold change. For 10% (n=1,501) of genes,
a single expressed isoform was differentially expressed at a log fold
change between 1 and 2.58. This set of transcripts was DTE by
construction. If the chosen transcript was the only expressed isoform
of a gene, this counted also as DGE and not as DTU, but if there were
other isoforms that were expressed, this counted for both DGE
and DTU, as the proportion of expression among the isoforms was
affected. For 10%  (n=1,501) of genes, differential transcript
usage was constructed by exchanging the TPM abundance of two expressed
isoforms, or, if only one isoform was expressed, exchanging the
abundance of the expressed isoform with a non-expressed one. 
This counted for DTU and DTE, but not for DGE. An MA plot of the
simulated transcript abundances for the two groups is shown in Figure
\@ref(fig:ma-simulated).

```{r ma-simulated, message=FALSE, echo=FALSE, dev="png", out.width="75%", fig.cap="MA plot of simulated abundances. Each point depicts a transcript, with the average log2 abundance in transcripts-per-million (TPM) on the x-axis and the difference between the two groups on the y-axis. Of the 35,850 transcripts which are expressed with TPM > 1 in at least one group, 77\\% (n=27,429) are null transcripts (grey), which fall by construction on the M=0 line, and 23\\% (n=8,421) are differentially expressed (green, orange, or purple). This filtering of 1 TPM is for visualization only and unrelated to the DRIMSeq filtering used in the workflow. As transcripts can belong to multiple categories of differential gene expression (DGE), differential transcript expression (DTE), and differential transcript usage (DTU), here the transcripts are colored by which genes they belong to (those selected to be DGE-, DTE-, or DTU-by-construction)."}
library(rnaseqDTU)
library(rafalib)
data(simulate)
bigpar()
pc <- 1
col <- rep(8, nrow(tpms))
col[iso.dge] <- 1
col[iso.dte] <- 2
col[iso.dtu] <- 3
maplot(log2(tpms[,1]+pc), log2(tpms[,2]+pc),
       n=nrow(tpms), curve.add=FALSE,
       col=col, cex=.25, pch=20)
legend("bottomright",
       c("DGE","DTE","DTU","null"),
       col=c(1:3,8), pch=20, bty="n")
```

## Operation

This workflow was designed to work with R 3.5 or higher, and the
*DRIMSeq*, *DEXSeq*, *stageR*, and *tximport* packages for Bioconductor
version 3.7 or higher. Bioconductor packages should always be
installed following the [official instructions](https://bioconductor.org/install).
The workflow uses a subset of all genes to speed
up the analysis, but the Bioconductor packages can easily be run for
this dataset on all human genes on a laptop in less than an
hour. Timing for the various packages is included within each section.

## Quantification and data import

As described in the Introduction, this workflow uses transcript-level
quantification estimates produced by *Salmon* [@Patro2017Salmon]
and imported into R/Bioconductor with *tximport* [@Soneson2015Differential].
Details about how to run *Salmon*, and which type of transcript-level 
estimated counts should be imported, is covered in the Workflow, 
with the exact code used to run the DTU analysis.
*Salmon* estimates the relative abundance of each annotated transcript 
for each sample in units of transcripts-per-million (TPM);
the estimated TPM values should be proportional to the abundance
of the transcripts in the population of cells that were assayed.
One critical point is that *Salmon* only considers the transcripts that 
are provided in the annotation; it is not able to detect expression
of any novel transcripts.
If many un-annotated transcripts are expressed in the particular set of 
samples, successful application of this workflow may require first building out
a more representative set of annotated transcripts via transcriptome assembly or
full transcript sequencing.

In addition to relative abundance, *Salmon* estimates the *effective length* 
of each transcript, which can take into account a number of sample-specific 
technical biases including fragment size distribution (default), fragment GC content,
random hexamer priming bias, and positional bias along the transcript.
If a transcript had certain properties, related to its length and its
sequence content, that made it difficult to produce cDNA fragments and 
sequence reads from these fragments, then its effective length should be
lower for that sample, than if these technical biases were absent.
The estimates of TPM and the effective lengths can be used to estimate 
the number of fragments that each transcript produced, which will be called 
*estimated counts* in this workflow. Different types of estimated counts
may be correlated with effective transcript length, and so we will discuss 
in the Workflow our recommended type to use for DTU and DGE analysis 
(also diagrammed in Figure \@ref(fig:diagram)).

## DTU testing

We focus on two statistical models for DTU testing in this workflow, 
*DRIMSeq* [@Nowicka2016DRIMSeq] and *DEXSeq* [@Anders2012Detecting].
*DEXSeq* was published first, as a statistical model for detecting differences
in exon usage across samples in different conditions. Most of the *DEXSeq*
functions and documentation refer specifically to *exons* or *exonic parts* 
within a *gene*, while the final results table refers more generally 
to these as *features* within a *group*.
It is this more general usage that we will employ in this workflow, substituting
estimated transcript counts in place of exonic part counts.

*DEXSeq* assumes a Negative Binomial (NB) distribution for the feature counts, 
and considers the counts for each feature (originally, the exonic parts) 
relative to counts for all other features of the group (the gene), 
using an interaction term in a generalized linear model (GLM). 
The GLM framework is an extension of the linear model (LM), but shares
with LM the usage of a design matrix, typically represented by $\mathbf{X}$, 
which is made up of columns of covariates that are multiplied by scalar coefficients, 
typically represented by $\mathbf{\beta}$. The design matrix with its multiple coefficients
is useful for extending statistical models beyond simple group comparisons, allowing for 
more complex situations, such as within-patient comparisons, batch correction,
or testing of ratios.

*DEXSeq* models each feature independently in the GLM fitting stage,
and so does not take into account any correlation among the 
counts across features in a group (e.g. transcript counts within a gene),
except insofar as such correlation is accounted for by columns in the design matrix.
This last point is important, as correlation induced by DTU across condition groups, 
or even DTU that can be associated with cell-type heterogeneity, can be 
modeled in the *DEXSeq* framework by interaction
terms with relevant covariates introduced into the design matrix.
In the current workflow, we provide an example of capturing DTU across
condition using *DEXSeq*, but future iterations of this workflow 
may also include controlling for additional covariates,
such as estimated cell type proportions.
*DEXSeq* was evaluated on transcript counts by @Soneson2016Isoform
and then later by @Nowicka2016DRIMSeq, where it was shown in both cases 
that *DEXSeq* could be used to detect DTU in this configuration, 
though isoform pre-filtering greatly improved its performance in 
reducing the observed false discovery rate (FDR) closer to its nominal level.

In contrast to the NB model, *DRIMSeq* assumes an Dirichlet Multinomial
model (DM) for each gene, where the total count for the gene is 
considered fixed, and the quantity of interest is the proportion for the
transcript within a gene for each sample. If the proportion for one
transcript increases, it must result in a decrease for the proportions
of the other transcripts of the gene. Genes that are detected as having 
statistically significant DTU are those in which the 
proportion changes observed across condition were large, considering the variation
in proportions seen within condition. The variation in proportions
across biological replicates
is modeled using a single *precision* parameter per gene, which will be discussed in the 
workflow below. *DRIMSeq* also uses a design matrix, 
and so can be used to analyze DTU for complex experimental designs,
including within-patient comparisons, batch correction, or testing of ratios.

A critical difference between the two models is that *DRIMSeq*
directly models the correlations among transcript counts 
within a gene via the DM distribution, and so can capture these correlations 
across exchangeable samples *within a condition*.
*DEXSeq* with the NB distribution only can capture correlations among 
transcript counts within a gene when the DTU is *across sample groups* defined by
covariates in the design matrix. On the other hand, *DRIMSeq* is limited
in modeling a single precision parameter per gene. If there are two 
transcripts within a gene with very different biological variability
-- perhaps they have separate promoters under different regulatory control 
-- then *DEXSeq* may do a better job modeling such heterogeneity 
in the biological variability of transcript expression, 
as it estimates a separate dispersion parameter for each transcript.

# Workflow

## Salmon quantification

We used *Salmon* version 0.10.0 to quantify abundance and effective transcript 
lengths for all of the 24 simulated samples. For this workflow, we
will use the first six samples from each group. We quantified against
the [GENCODE](https://www.gencodegenes.org/releases/current.html)
human annotation version 28, which was the same reference used to
generate the simulated reads. We used the transcript sequences FASTA
file that contains "Nucleotide sequences of all transcripts on the
reference chromosomes". When downloading the FASTA file, it is useful
to download the corresponding GTF file, as this will be used in later
sections. 

To build the *Salmon* index, we used the following command. Recent
versions of *Salmon* will discard identical sequence duplicate
transcripts, and keep a log of these within the index directory.
The `--gencode` command trims the *Gencode* FASTA headers to only
include the transcript identifier.

```
salmon index --gencode -t gencode.v28.transcripts.fa -i gencode.v28_salmon-0.10.0
```

To quantify each sample, we used the following command, which says to
quantify with six threads using the GENCODE index, with inward and
unstranded paired end reads, using fragment GC bias correction,
writing out to the directory `sample` and using as input these two
reads files. The library type is specified by `-l IU` (inward and
unstranded) and the options are discussed in the 
[Salmon documentation](http://salmon.readthedocs.io/en/latest/library_type.html). 
Recent versions of Salmon can automatically detect the library type by
setting `-l A`. Such a command can be automated in a bash loop using
bash variables, or one can use more advanced workflow management
systems such as Snakemake [@Koster2012Snakemake] or Nextflow
[@Di2017Nextflow]. 

```
salmon quant -p 6 -i gencode.v28_salmon-0.10.0 -l IU \
      --gcBias -o sample -1 sample_1.fa.gz -2 sample_2.fa.gz
```

## Importing counts into R/Bioconductor

We can use *tximport* to import the estimated counts, abundances and
effective transcript lengths into R. The *tximport* package offers
three methods for producing count matrices from transcript-level
quantification files, as described in detail in
@Soneson2015Differential, and already mentioned in the introduction.
To recap these are: (1) original estimated counts with effective
transcript length information used as a statistical offset, (2)
generating "counts from abundance" by scaling transcript-per-million
(TPM) abundance estimates per sample such that they sum to the total
number of mapped reads (*scaledTPM*), or generating "counts from
abundance" by scaling TPM abundance estimates first by the average
effective transcript length over samples, and then per sample such
that they sum to the total number of mapped reads (*lengthScaledTPM*).
We will use *scaledTPM* for DTU detection, with the statistical
motivation described below, and the original estimated counts with
length offset for DGE detection.

We recommend constructing a CSV file
that keeps track of the sample identifiers and any relevant variables,
e.g. condition, time point, batch, and so on. 
Here we have made a sample CSV file and provided it along with this
workflow's R package, *rnaseqDTU*.  If a user is applying the code in
this workflow with her own RNA-seq data, she does not need to load the
*rnaseqDTU* package. If a user is running through the code in this
workflow with the workflow simulated data, she does need to load the
*rnaseqDTU* package.

In order to find this CSV file, we first need to know where on the
machine this workflow package lives, so we can point to the `extdata`
directory where the CSV file is located. These two lines of code load
the workflow package and find this directory on the machine. Again,
these two lines of code would therefore not be part of a *typical*
analysis using one's own RNA-seq data.

```{r}
library(rnaseqDTU)
csv.dir <- system.file("extdata", package="rnaseqDTU")
```

The CSV file records which samples are condition 1 and which are
condition 2. The columns of this CSV file can have any names, although
`sample_id` will be used later by *DRIMSeq*, and so using this column
name allows us to pass this *data.frame* directly to *DRIMSeq* at a
later step.

```{r}
samps <- read.csv(file.path(csv.dir, "samples.csv"))
head(samps)
samps$condition <- factor(samps$condition)
table(samps$condition)
files <- file.path("/path/to/dir", samps$sample_id, "quant.sf")
names(files) <- samps$sample_id
head(files)
```

We can then import transcript-level counts using *tximport*.
For DTU analysis, we suggest generating counts from abundance, using the
`scaledTPM` method described by @Soneson2015Differential. 
The `countsFromAbundance` option of *tximport* uses estimated
abundances to generate roughly count-scaled data, such that each
column will sum to the number of reads mapped for that library. 
By using `scaledTPM` counts, the estimated proportions 
fit by *DRIMSeq*, which are generated from counts, will be
equivalent to proportions of the abundance of the isoforms.

If instead of `scaledTPM`, we used the original estimated transcript counts
(`countsFromAbundance="no"`), or if we used `lengthScaledTPM`
transcript counts, then a change in transcript usage among
transcripts of different length could result in a changed total count
for the gene, even if there is no change in total gene
expression. For more detail and a diagram of this
effect, we refer the reader to Figure 1 of @Trapnell2013Differential.
Briefly, this is because the original transcript counts and
`lengthScaledTPM` transcript counts scale with transcript
length, while `scaledTPM` transcript counts do not. 
A change in the total count for the gene, not corrected by an offset, could then
bias the calculation of proportions and therefore confound DTU analysis.
For testing DTU using *DRIMSeq* and *DEXSeq*, it is convenient if the count-scale data
do *not* scale with transcript length within a gene.
That this could be corrected by an offset, but this is not
easily implemented in the current DTU analysis packages. 
For *gene-level* analysis (DGE), we can use gene-level original 
counts with an average transcript length offset, 
gene-level `scaledTPM` counts, or gene-level  `lengthScaledTPM` counts, 
as all of these three methods control for the length bias 
described by @Trapnell2013Differential and @Soneson2015Differential.
Our DTU and DGE `countsFromAbundance` recommendations are
summarized in Figure \@ref(fig:diagram). 

A final note is that, the motivation for using `scaledTPM` counts 
hinges on the fact that estimated fragment counts scale with
transcript length in fragmented RNA-seq data. If a different experiment
is performed and a different quantification method used to produce
counts per transcript which *do not* scale with transcript length,
then the recommendation would be to use these counts per transcript directly.
Examples of experiments producing counts per transcript that would potentially
not scale with transcript length include 
counts of full-transcript-length or nearly-full-transcript-length reads, 
or counts of 3' tagged RNA-seq reads aggregated to transcript groups.
In either case, the statistical methods for DTU could be provided directly 
with the transcript counts.

The following code chunk is what one would use in a typical analysis,
but is not evaluated in this workflow because the quantification files
are not provided in the *rnaseqDTU* package due to size restrictions.
Instead we will load a pre-constructed matrix of counts below. In a
typical workflow, the code below would be used to generate the matrix
of counts from the quantification files. We have also trimmed the counts
matrix to two decimal places per count (which is more than enough
precision on estimated counts). All of the quantification files and
simulated reads for this dataset have been made publicly available on
*Zenodo*; see the *Data availability* section at the end of this
workflow.

```{r eval=FALSE}
library(tximport)
txi <- tximport(files, type="salmon", txOut=TRUE,
                countsFromAbundance="scaledTPM")
cts <- txi$counts
cts <- cts[rowSums(cts) > 0,]
```

## Transcript-to-gene mapping

Bioconductor offers numerous approaches for building a *TxDb* object,
a transcript database that can be used to link transcripts to genes
(among other uses).
The following code chunks were used to generate a *TxDb*,
and then used the `select` function with the *TxDb* to produce 
a corresponding *data.frame* called `txdf` which links transcript IDs
to gene IDs. In this *TxDb*, the transcript IDs are called `TXNAME` and
the gene IDs are called `GENEID`. The version 28 human GTF file was
downloaded from the GENCODE website when downloading the transcripts
FASTA file.
Due to size restrictions, neither the `gencode.v28.annotation.gtf.gz`
file nor the generated `.sqlite` file are included in the *rnaseqDTU* package.

```{r eval=FALSE}
library(GenomicFeatures)
gtf <- "gencode.v28.annotation.gtf.gz"
txdb.filename <- "gencode.v28.annotation.sqlite"
txdb <- makeTxDbFromGFF(gtf)
saveDb(txdb, txdb.filename)
```

Once the *TxDb* database has been generated and saved, it can be
quickly reloaded: 

```{r eval=FALSE}
txdb <- loadDb(txdb.filename)
txdf <- select(txdb, keys(txdb, "GENEID"), "TXNAME", "GENEID")
tab <- table(txdf$GENEID)
txdf$ntx <- tab[match(txdf$GENEID, names(tab))]
```

# Statistical analysis of differential transcript usage

## DRIMSeq

We load the `cts` object as created in the *tximport* code chunks. This
contains count-scale data, generated from abundance using the
`scaledTPM` method. The column sums are equal to the number of mapped 
paired-end reads per experiment. The experiments have between 31 and
38 million paired-end reads that were mapped to the transcriptome
using *Salmon*.

```{r}
data(salmon_cts)
cts[1:3,1:3]
range(colSums(cts)/1e6)
```

We also have the `txdf` object giving the transcript-to-gene
mappings (for construction, see previous section). 
This is contained in a file called `simulate.rda` that
contains a number of R objects with information about the
simulation, that we will use later to assess the methods'
performance. 

```{r}
data(simulate)
head(txdf)
all(rownames(cts) %in% txdf$TXNAME)
txdf <- txdf[match(rownames(cts),txdf$TXNAME),]
all(rownames(cts) == txdf$TXNAME)
```

In order to run *DRIMSeq*, we build a *data.frame* with the gene ID,
the feature (transcript) ID, and then columns for each of the samples:

```{r}
counts <- data.frame(gene_id=txdf$GENEID,
                     feature_id=txdf$TXNAME,
                     cts)
```

We can now load the *DRIMSeq* package and create a *dmDSdata* object,
with our `counts` and `samps` *data.frames*. Typing in the object name
and pressing return will give information about the number of genes:

```{r}
library(DRIMSeq)
d <- dmDSdata(counts=counts, samples=samps)
d
```

The *dmDSdata* object has a number of specific methods. Note that the
rows of the object are gene-oriented, so pulling out the first *row*
corresponds to all of the transcripts of the first gene:

```{r}
methods(class=class(d))
counts(d[1,])[,1:4]
```

It will be useful to first filter the object, before running
procedures to estimate model parameters. This greatly speeds up the
fitting and removes transcripts that may be troublesome for
parameter estimation, e.g. estimating the proportion of expression
among the transcripts of a gene when the total count is very low. We first
define `n` to be the total number of samples, and `n.small` to be the
sample size of the smallest group. We use all
three of the possible filters: for a transcript to be retained in the dataset, 
we require that (1) it has a count of at least 10 in at least `n.small`
samples, (2) it has a relative abundance proportion of at least 0.1 in 
at least `n.small` samples, and (3) the total count of the corresponding gene 
is at least 10 in all `n` samples. We used all three possible filters, 
whereas only the two count
filters are used in the *DRIMSeq* vignette example code.

It is important to consider what types of transcripts may be removed
by the filters, and potentially adjust depending on the dataset.  If
`n` was large, it would make sense to allow perhaps a few samples to
have very low counts, so lowering `min_samps_gene_expr` to some
factor multiple ($< 1$) of `n`, and likewise for the first two filters for
`n.small`. The second filter means that if a transcript does not make
up more than 10% of the gene's expression for at least `n.small`
samples, it will be removed. If this proportion seems too high, for
example, if very lowly expressed isoforms are of particular interest,
then the filter can be omitted or the `min_feature_prop` lowered. For
a concrete example, if a transcript goes from a proportion of 0% in the
control group to a proportion of 9% in the treatment group, this would
be removed by the above 10% filter. After filtering, this dataset has
7,764 genes.

```{r}
n <- 12
n.small <- 6
d <- dmFilter(d,
              min_samps_feature_expr=n.small, min_feature_expr=10,
              min_samps_feature_prop=n.small, min_feature_prop=0.1,
              min_samps_gene_expr=n, min_gene_expr=10)
d
```

The *dmDSdata* object only contains genes that have more than one
isoform, which makes sense as we are testing for differential
transcript usage. We can find out how many of the remaining genes have
$N$ isoforms by tabulating the number of times we see a gene ID, then
tabulating the output again:

```{r}
table(table(counts(d)$gene_id))
```

We create a design matrix, using a design formula and the sample
information contained in the object, accessed via *samples*. Here we
use a simple design with just two groups, but more complex designs are
possible. For some discussion of complex designs, one can refer to
the vignettes of the *limma*, *edgeR*, or *DESeq2* packages.

```{r}
design_full <- model.matrix(~condition, data=DRIMSeq::samples(d))
colnames(design_full)
```

Only for speeding up running the live code chunks in this workflow, we
subset to the first 250 genes, representing about one thirtieth of the
dataset. This step would not be run in a typical workflow.

```{r}
d <- d[1:250,]
7764 / 250
```

We then use the following three functions to estimate the model
parameters and test for DTU. We first estimate the *precision*, which is related to the
dispersion in the Dirichlet Multinomial model via the formula below. Because
precision is in the denominator of the right hand side of the
equation, they are inversely related. Higher *dispersion* -- counts more
variable around their expected value -- is associated with lower
*precision*. For full details about the *DRIMSeq* model, one should read
both the detailed software vignette and the publication
[@Nowicka2016DRIMSeq]. After estimating the precision, we fit regression coefficients
and perform null hypothesis testing on the coefficient of
interest. Because we have a simple two-group model, we test the
coefficient associated with the difference between condition 2 and
condition 1, called `condition2`. The following code takes about half
a minute, and so a full analysis on this dataset takes about 15
minutes on a laptop.

\[ \textrm{dispersion} = \frac{1}{1 + \textrm{precision}} \]

```{r}
set.seed(1)
system.time({
  d <- dmPrecision(d, design=design_full)
  d <- dmFit(d, design=design_full)
  d <- dmTest(d, coef="condition2")
})
```

To build a results table, we run the `results` function. We can
generate a single p-value per gene, which tests whether there is any
differential transcript usage within the gene, or a single p-value per transcript,
which tests whether the proportions for this transcript changed within
the gene:

```{r}
res <- DRIMSeq::results(d)
head(res)
res.txp <- DRIMSeq::results(d, level="feature")
head(res.txp)
```

Because the `pvalue` column may contain `NA` values, we use the
following function to turn these into 1's. The `NA` values would
otherwise cause problems for the stage-wise analysis.
From investigating these `NA` p-value cases for *DRIMSeq*,
they all occur when one condition group has all zero counts for a
transcript, but sufficient counts from the other condition group, and
sufficient counts for the gene. *DRIMSeq* will not estimate a
precision for such a gene.  These all happen to be true positive genes
for DTU in the simulation, where the isoform switch is total or nearly
total.  *DEXSeq*, shown in a later section, does not produce `NA`
p-values for these genes.  A potential fix would be to use a
plug-in common or trended precision for such genes, but this is not
implemented in the current version of *DRIMSeq*.


```{r}
no.na <- function(x) ifelse(is.na(x), 1, x)
res$pvalue <- no.na(res$pvalue)
res.txp$pvalue <- no.na(res.txp$pvalue)
```

We can plot the estimated proportions for one of the significant
genes, where we can see evidence of switching (Figure \@ref(fig:plot-prop)).

```{r plot-prop, out.width="75%", fig.cap="Estimated proportions for one of the significant genes."}
idx <- which(res$adj_pvalue < 0.05)[1]
res[idx,]
plotProportions(d, res$gene_id[idx], "condition")
```

## stageR following DRIMSeq

Because we have been working with only a subset of the data, we now
load the results tables that would have been generated by running
*DRIMSeq* functions on the entire dataset.

```{r}
data(drim_tables)
nrow(res)
nrow(res.txp)
```

A typical analysis of differential transcript usage would involve
asking first: "which genes contain any evidence of DTU?", and secondly,
"which transcripts in the genes that contain some evidence may be
participating in the DTU?" Note that a gene may pass the first
stage without exhibiting enough evidence to identify one or more
transcripts that are participating in the DTU. The *stageR* package is designed
to allow for such two-stage testing procedures, where the first stage
is called a *screening* stage and the second stage a *confirmation*
stage [@Van2017StageR]. The methods are general, and can also be applied to
testing, for example, changes across a time series followed by
investigation of individual time points, as shown in the *stageR*
package vignette. We show below how *stageR* is used to detect DTU and
how to interpret its output.

We first construct a vector of p-values for the screening
stage. Because of how the *stageR* package will combine transcript and
gene names, we need to strip the gene and transcript version numbers
from their Ensembl IDs (this is done by keeping only the first 15 characters of 
the gene and transcript IDs).

```{r}
pScreen <- res$pvalue
strp <- function(x) substr(x,1,15)
names(pScreen) <- strp(res$gene_id)
```

We construct a one column matrix of the confirmation p-values:

```{r}
pConfirmation <- matrix(res.txp$pvalue, ncol=1)
rownames(pConfirmation) <- strp(res.txp$feature_id)
```

We arrange a two column *data.frame* with the transcript and gene
identifiers.

```{r}
tx2gene <- res.txp[,c("feature_id", "gene_id")]
for (i in 1:2) tx2gene[,i] <- strp(tx2gene[,i])
```

The following functions then perform the *stageR* analysis. We must specify an
`alpha`, which will be the *overall false discovery rate* target for the
analysis, defined below. Unlike typical adjusted p-values or
q-values, we cannot choose an arbitrary threshold later: after
specifying `alpha=0.05`, we need to use 5% as the target in downstream
steps. There are also convenience functions *getSignificantGenes* and
*getSignificantTx*, which are demonstrated in the *stageR* vignette.

```{r message=FALSE}
library(stageR)
stageRObj <- stageRTx(pScreen=pScreen, pConfirmation=pConfirmation,
                      pScreenAdjusted=FALSE, tx2gene=tx2gene)
stageRObj <- stageWiseAdjustment(stageRObj, method="dtu", alpha=0.05)
suppressWarnings({
  drim.padj <- getAdjustedPValues(stageRObj, order=FALSE,
                                  onlySignificantGenes=TRUE)
})
head(drim.padj)
```

The final table with adjusted p-values summarizes the information from
the two-stage analysis. Only genes that passed the filter are
included in the table, so the table already represents *screened* genes. 
The transcripts with values in the column, `transcript`, less than 0.05 pass the
*confirmation* stage on a target 5% *overall false
discovery rate*, or OFDR. This means that, in expectation, no more
than 5% of the genes that pass screening will either (1) not contain
any DTU, so be falsely screened genes, or (2) contain a falsely confirmed transcript.
A falsely confirmed transcript is a transcript with an adjusted p-value 
less than 0.05 which does not exhibit differential usage across condition.
The *stageR* procedure allows us to look at both the genes that passed the
screening stage and the transcripts with adjusted p-values less than 
our target `alpha`, and understand what kind of *overall* error rate
this procedure entails. This cannot be said for an arbitrary procedure
of looking at standard gene adjusted p-values and transcript adjusted
p-values, where the adjustment was performed independently.

## Post-hoc filtering on the standard deviation in proportions

We found that *DRIMSeq* was sensitive to detect DTU, but could exceed
its false discovery rate (FDR) bounds, particularly on the transcript-level tests, and 
that a post-hoc, non-specific filtering of the *DRIMSeq* transcript
p-values and adjusted p-values improved the FDR and OFDR control. 
We considered the standard deviation (SD) of the
per-sample proportions as a filtering statistic. This statistic does
not use the information about which samples belong to which condition
group. We set the p-values and adjusted p-values for transcripts with
small per-sample proportion SD to 1. We do not recompute adjusted
p-values, although we will provide the filtered p-values to the
*stageR* procedure. 

We note that the p-values are no longer necessarily uniform after
filtering out small effect size transcripts and genes, although we
find that in this simulation at least, the filtering made the
procedure *more conservative*: excluding transcripts with small SD of
the per-sample proportions brought both the observed FDR and the
observed OFDR closer to their nominal targets, as will be shown in the
evaluations below.

```{r}
res.txp.filt <- DRIMSeq::results(d, level="feature")
smallProportionSD <- function(d, filter=0.1) {
  cts <- as.matrix(subset(counts(d), select=-c(gene_id, feature_id)))
  gene.cts <- rowsum(cts, counts(d)$gene_id)
  total.cts <- gene.cts[match(counts(d)$gene_id, rownames(gene.cts)),]
  props <- cts/total.cts
  propSD <- sqrt(rowVars(props))
  propSD < filter
}
filt <- smallProportionSD(d)
res.txp.filt$pvalue[filt] <- 1 
res.txp.filt$adj_pvalue[filt] <- 1
```

The above post-hoc filter is not part of the *DRIMSeq* modeling steps,
and to avoid interfering with the modeling, we run it after
*DRIMSeq*. The other three filters used before have been tested by the
*DRIMSeq* package authors and are therefore a recommended part of an
analysis before the modeling begins. We do not apply this post-hoc
filter to *DEXSeq* in this workflow, as *DEXSeq* seemed to be closer
to controlling its FDR and OFDR in the evaluations, after using the
*DRIMSeq* filters recommended in this workflow.

## DEXSeq

The *DEXSeq* package was originally designed for detecting
differential exon usage [@Anders2012Detecting], but can also be adapted
to run on estimated transcript counts, in order to detect DTU. 
Using *DEXSeq* on transcript counts was evaluated by
@Soneson2016Isoform, showing the benefits in FDR control from
filtering lowly expressed transcripts for a transcript-level analysis.
We benchmarked *DEXSeq* here, beginning with the *DRIMSeq*
filtered object, as these filters are intuitive, they greatly speed up
the analysis, and such filtering was shown to be beneficial in FDR
control. 

The two factors of (1) working on isoform counts rather than
individual exons and (2) using the *DRIMSeq* filtering procedure
dramatically increase the speed of *DEXSeq*, compared to running an
exon-level analysis. Another advantage is that we benefit from
the sophisticated bias models of *Salmon*, which account for drops in
coverage on alternative exons that can otherwise throw off estimates
of transcript abundance [@Love2016Alpine]. A disadvantage over the exon-level analysis
is that we must know in advance all of the possible isoforms that can
be generated from a gene locus, all of which are assumed to be
contained in the annotation files (FASTA and GTF).

We first load the *DEXSeq* package and then build a *DEXSeqDataSet* from
the data contained in the *dmDStest* object (the class of the
*DRIMSeq* object changes as the results are added). The design formula
of the *DEXSeqDataSet* here uses the language "exon" but this should
be read as "transcript" for our analysis. *DEXSeq* will test -- after
accounting for total gene expression for this sample and for the
proportion of this transcript relative to the others -- whether there is
a condition-specific difference in the transcript proportion relative
to the others. 

```{r message=FALSE}
library(DEXSeq)
sample.data <- DRIMSeq::samples(d)
count.data <- round(as.matrix(counts(d)[,-c(1:2)]))
dxd <- DEXSeqDataSet(countData=count.data,
                     sampleData=sample.data,
                     design=~sample + exon + condition:exon,
                     featureID=counts(d)$feature_id,
                     groupID=counts(d)$gene_id)
```

The following functions run the *DEXSeq* analysis. While we are only
working on a subset of the data, the full analysis for this dataset
took less than 3 minutes on a laptop.

```{r}
system.time({
  dxd <- estimateSizeFactors(dxd)
  dxd <- estimateDispersions(dxd, quiet=TRUE)
  dxd <- testForDEU(dxd, reducedModel=~sample + exon)
})
```

We then extract the results table, not filtering on mean counts (as we
have already conducted filtering via *DRIMSeq* functions). We compute a
per-gene adjusted p-value, using the *perGeneQValue* function, which
aggregates evidence from multiple tests within a gene to a single
p-value for the gene and then corrects for multiple testing across
genes [@Anders2012Detecting]. Other methods for aggregative evidence from the
multiple tests within genes have been discussed in a recent
publication and may be substituted at this step [@Yi2018Gene].
Finally, we build a simple results table with the
per-gene adjusted p-values.

```{r}
dxr <- DEXSeqResults(dxd, independentFiltering=FALSE)
qval <- perGeneQValue(dxr)
dxr.g <- data.frame(gene=names(qval),qval)
```

For size consideration of the workflow R package, we reduce also the
transcript-level results table to a simple *data.frame*: 

```{r}
columns <- c("featureID","groupID","pvalue")
dxr <- as.data.frame(dxr[,columns])
head(dxr)
```

## stageR following DEXSeq

Again, as we have been working with only a subset of the data, we now
load the results tables that would have been generated by running
*DEXSeq* functions on the entire dataset.

```{r}
data(dex_tables)
```

If the *stageR* package has not already been loaded, we make sure to
load it, and run code very similar to that used above for *DRIMSeq*
two-stage testing, with a target `alpha=0.05`.

```{r}
library(stageR)
strp <- function(x) substr(x,1,15)
pConfirmation <- matrix(dxr$pvalue,ncol=1)
dimnames(pConfirmation) <- list(strp(dxr$featureID),"transcript")
pScreen <- qval
names(pScreen) <- strp(names(pScreen))
tx2gene <- as.data.frame(dxr[,c("featureID", "groupID")])
for (i in 1:2) tx2gene[,i] <- strp(tx2gene[,i])
```

The following three functions provide a table with the OFDR control
described above. To repeat, the set of genes passing screening should
not have more than 5% of either genes which have in fact no DTU or genes which
contain a transcript with an adjusted p-value less than 5% which do
not participate in DTU.

```{r}
stageRObj <- stageRTx(pScreen=pScreen, pConfirmation=pConfirmation,
                      pScreenAdjusted=TRUE, tx2gene=tx2gene)
stageRObj <- stageWiseAdjustment(stageRObj, method="dtu", alpha=0.05)
suppressWarnings({
  dex.padj <- getAdjustedPValues(stageRObj, order=FALSE,
                                 onlySignificantGenes=TRUE)
})
head(dex.padj)
```

## Citing methods in published research

This concludes the DTU section of the workflow. If you use *DRIMSeq*
[@Nowicka2016DRIMSeq], *DEXSeq* [@Anders2012Detecting], 
*stageR* [@Van2017StageR], *tximport* [@Soneson2015Differential], 
or *Salmon* [@Patro2017Salmon] in
published research, please cite the relevant methods publications,
which can be found in the References section of this workflow.

# DTU analysis complements DGE analysis

## DGE analysis with DESeq2

In the final section of the workflow containing live code examples, we
demonstrate how differential transcript usage, summarized to the
gene-level, can be visualized with respect to differential gene
expression analysis results. We use *tximport* and summarize counts to
the gene level and compute an average transcript length offset for
count-based methods [@Soneson2015Differential]. We will then show code for using *DESeq2*
and *edgeR* to assess differential gene expression. Because we have
simulated the genes according to three different categories, we can
color the final plot by the true simulated state of the genes. We note
that we will pair *DEXSeq* with *DESeq2* results in the following
plot, and *DRIMSeq* with *edgeR* results. However, this pairing is
arbitrary, and any DTU method can reasonably be paired with
any DGE method.

The following line of code is unevaluated, but was used to generate an
object `txi.g` which contains the gene-level counts, abundances and
average transcript lengths.

```{r eval=FALSE}
txi.g <- tximport(files, type="salmon", tx2gene=txdf[,2:1])
``` 

For the workflow, we load the `txi.g` object which is saved in a file
`salmon_gene_txi.rda`. We then load the *DESeq2* package and build a
*DESeqDataSet* from `txi.g`, providing also the sample information and
a design formula.

```{r}
data(salmon_gene_txi)
library(DESeq2)
dds <- DESeqDataSetFromTximport(txi.g, samps, ~condition)
```

The following two lines of code run the *DESeq2* analysis [@Love2014Moderated].

```{r message=FALSE}
dds <- DESeq(dds)
dres <- DESeq2::results(dds)
```

Because we happen to know the true status of each of the genes, we can
make a scatterplot of the results, coloring the genes by their status
(whether DGE, DTE, or DTU by construction). 

```{r}
all(dxr.g$gene %in% rownames(dres))
dres <- dres[dxr.g$gene,]
# we can only color because we simulated...
col <- rep(8, nrow(dres))
col[rownames(dres) %in% dge.genes] <- 1
col[rownames(dres) %in% dte.genes] <- 2
col[rownames(dres) %in% dtu.genes] <- 3
```

Figure \@ref(fig:tuge-plot) displays the evidence for differential
transcript usage over that for differential gene expression. We can
see that the DTU genes cluster on the y-axis (mostly not captured
in the DGE analysis), and the DGE genes cluster on the x-axis (mostly
not captured in the DTU analysis). The DTE genes fall in the middle,
as all of them represent DGE, and some of them additionally represent
DTU (if the gene had other expressed transcripts). Because *DEXSeq*
outputs an adjusted p-value of 0 for some of the genes, we set these
instead to a jittered value around $10^{-20}$, so that their number and
location on the x-axis could be visualized. These jittered values
should only be used for visualization.

```{r tuge-plot, dev="png", out.width="75%", fig.cap="Transcript usage over gene expression plot. Each point represents a gene, and plotted are -log10 adjusted p-values for DEXSeq's test of differential transcript usage (y-axis) and DESeq2's test of differential gene expression (x-axis). Because we simulated the data we can color the genes according to their true category."}
bigpar()
# here cap the smallest DESeq2 adj p-value
cap.padj <- pmin(-log10(dres$padj), 100)
# this vector only used for plotting
jitter.padj <- -log10(dxr.g$qval + 1e-20)
jp.idx <- jitter.padj == 20
jitter.padj[jp.idx] <- rnorm(sum(jp.idx),20,.25)
plot(cap.padj, jitter.padj, col=col,
     xlab="Gene expression",
     ylab="Transcript usage")
legend("topright",
       c("DGE","DTE","DTU","null"),
       col=c(1:3,8), pch=20, bty="n")
```

## DGE analysis with edgeR

We can repeat the same analysis using *edgeR* as the inference engine
[@Robinson2009EdgeR]. The following code incorporates the average transcript length
matrix as an offset for an *edgeR* analysis.

```{r message=FALSE}
library(edgeR)
cts.g <- txi.g$counts
normMat <- txi.g$length
normMat <- normMat / exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts.g/normMat)) + log(colSums(cts.g/normMat))
y <- DGEList(cts.g)
y <- scaleOffset(y, t(t(log(normMat)) + o))
keep <- filterByExpr(y)
y <- y[keep,]
```

The basic *edgeR* model fitting and results extraction can be
accomplished with the following lines: 

```{r}
y <- estimateDisp(y, design_full)
fit <- glmFit(y, design_full)
lrt <- glmLRT(fit)
tt <- topTags(lrt, n=nrow(y), sort="none")[[1]]
```

Again, we can color the genes by their true status in the simulation:

```{r}
common <- intersect(res$gene_id, rownames(tt))
tt <- tt[common,]
res.sub <- res[match(common, res$gene_id),]
# we can only color because we simulated...
col <- rep(8, nrow(tt))
col[rownames(tt) %in% dge.genes] <- 1
col[rownames(tt) %in% dte.genes] <- 2
col[rownames(tt) %in% dtu.genes] <- 3
```

Figure \@ref(fig:tuge-drim-edger) displays the evidence for
differential transcript usage over that for differential gene
expression, now using *DRIMSeq* and *edgeR*. One obvious contrast with
Figure \@ref(fig:tuge-plot) is that *DRIMSeq* outputs lower non-zero
adjusted p-values than *DEXSeq* does, where *DEXSeq* instead outputs 0
for many genes. The plots look more similar when zooming in on the
*DRIMSeq* y-axis, as can be seen in Figure \@ref(fig:tuge-drim-zoom).

```{r tuge-drim-edger, dev="png", out.width="75%", fig.cap="Transcript usage over gene expression plot, as previously, but for DRIMSeq and edgeR."}
bigpar()
plot(-log10(tt$FDR), -log10(res.sub$adj_pvalue), col=col,
     xlab="Gene expression",
     ylab="Transcript usage")
legend("topright",
       c("DGE","DTE","DTU","null"),
       col=c(1:3,8), pch=20, bty="n")
```

```{r tuge-drim-zoom, dev="png", out.width="75%", fig.cap="Transcript usage over gene expression plot, zooming in on the DRIMSeq adjusted p-values."}
bigpar()
plot(-log10(tt$FDR), -log10(res.sub$adj_pvalue), col=col,
     xlab="Gene expression",
     ylab="Transcript usage", ylim=c(0,20))
legend("topright",
       c("DGE","DTE","DTU","null"),
       col=c(1:3,8), pch=20, bty="n")
```

## End of workflow section

This marks the end of the *DTU workflow*. Please consult the 
[published version of this workflow](http://dx.doi.org/10.12688/f1000research.15398.3) 
for a further section on *Evaluation*, including comparison of the
Bioconductor packages and methods shown in the workflow with other
popular packages and methods.

# Discussion 

Here we presented a workflow for analyzing RNA-seq experiments for
differential transcript usage across groups of samples. The
Bioconductor packages used, *DRIMSeq*, *DEXSeq*, and *stageR*, are
simple to use and fast when run on transcript-level data. We show how
these can be used downstream of transcript abundance quantification
with *Salmon*. 

We recommend the use of *stageR* for its formal
statistical procedure involving a screening and confirmation stage, as
this fits closely to what we expect a typical analysis to
entail. 
It is likely that an investigator would want both a list of 
statistically significant genes *and* transcripts participating in DTU, 
and *stageR* provides error control on this pair of lists, 
assuming that the underlying tests are well calibrated.

One potential limitation of this workflow is that, in contrast to
other methods such as the standard *DEXSeq* analysis, *SUPPA2*, or
*LeafCutter* [@Li2018Leaf], 
here we considered and detected expression switching between annotated
transcripts. Other methods such as *DEXSeq* (exon-based),
*SUPPA2*, or *LeafCutter* may benefit in
terms of power and interpretability from performing statistical
analysis directly on exon usage or splice events. Methods such
as *DEXSeq* (exon-based) and *LeafCutter* benefit in the ability to detect
un-annotated events.
The workflow presented here would require further processing to
attribute transcript usage changes to specific splice events, and is
limited to considering the estimated abundance of annotated transcripts.

# Session information

The following provides the session information used when compiling
this document. 

```{r}
devtools::session_info()
```

# Software versions

The samples were quantified with *Salmon* version 0.10.0 and
*kallisto* version 0.44.0. *polyester* version 1.16.0 and *alpine*
version 1.6.0 were used in generating the simulated dataset.

# Data availability 

The simulated paired-end read FASTQ files have been uploaded in
three batches of eight samples each to Zenodo, and the quantification
files are also available as a separate Zenodo dataset
[@swimdowndata]. The scripts used to generate the simulated dataset
are available at the simulation GitHub repository [@swimdown]

# Acknowledgments

The authors thank Koen Van den Berge, Malgorzata Nowicka, and Botond
Sipos for helpful comments on the workflow.

# References