swish
methodThe swish
method for differential expression analysis of RNA-seq data using inferential replicate counts is described in the following reference: Zhu et al. (2019) - doi: 10.1093/nar/gkz622.
We note that swish
extends and builds on another method, SAMseq (Li and Tibshirani 2011), implemented in the samr package, by taking into account inferential uncertainty, and allowing to control for batch effects and matched samples. Additionally, swish
has methods for testing changes in effect size across secondary covariates, which we refer to as “interactions”. swish
calls functions from the qvalue (Storey and Tibshirani 2003) or samr package for calculation of local FDR and q-value. This vignette gives an example of differential analysis of matched samples, and an interaction test for matched samples, to see if a condition effect changes in magnitude across two groups of samples.
The following lines of code will perform a basic transcript-level swish
two group analysis. For more details, read on.
# 'coldata.csv': sample information table
coldata <- read.csv("coldata.csv")
library(tximeta)
y <- tximeta(coldata)
library(swish)
y <- scaleInfReps(y)
y <- labelKeep(y)
set.seed(1)
y <- swish(y, x="condition")
The results can be found in mcols(y)
. For example, one can calculate the number of genes passing a 5% FDR threshold:
One can at any point remove the genes that didn’t pass the expression filter with the following line of code (can be run before or after swish
). These genes are ignored by swish
, and so will have NA
in the results columns in mcols(y)
.
A gene-level analysis looks identical to a transcript-level analysis, only the input data changes. Examples follow.
We begin the fishpond vignette by loading data from a Bioconductor Experiment Data package, macrophage. The package contains RNA-seq quantification from 24 RNA-seq samples, which are a subset of the RNA-seq samples generated and analyzed by Alasoo et al. (2018) - doi: 10.1038/s41588-018-0046-7.
The experiment involved treatment of macrophage cell lines from a number of human donors with IFN gamma, Salmonella infection, or both treatments combined. In the beginning of this vignette, we will focus on comparing the IFN gamma stimulated cell lines with the control cell lines, accounting for the paired nature of the data (cells from the same donor). Later in the vignette we will analyze differences in the Salmonella infection response by IFN gamma treatment status – whether the cells are primed for immune response.
We load the package, and point to the extdata
directory. For a typical analysis, the user would just point dir
to the location on the machine or cluster where the transcript quantifications are stored (e.g. the quant.sf
files).
The data was quantified using Salmon (Patro et al. 2017) 0.12.0 against the Gencode v29 human reference transcripts (Frankish, GENCODE-consoritum, and Flicek 2018). For more details and all code used for quantification, refer to the macrophage package vignette.
Importantly, --numGibbsSamples 20
was used to generate 20 inferential replicates with Salmon’s Gibbs sampling procedure. Inferential replicates, either from Gibbs sampling or bootstrapping of reads, are required for the swish method shown below. We also recommend to use --gcBias
when running Salmon to protect against common sample-specific biases present in RNA-seq data.
We start by reading in a CSV with the column data, that is, information about the samples, which are represented as columns of the SummarizedExperiment object we will construct containing the counts of reads per gene or transcript.
## names sample_id line_id replicate condition_name
## 1 SAMEA103885102 diku_A diku_1 1 naive
## 2 SAMEA103885347 diku_B diku_1 1 IFNg
## 3 SAMEA103885043 diku_C diku_1 1 SL1344
## 4 SAMEA103885392 diku_D diku_1 1 IFNg_SL1344
## 5 SAMEA103885182 eiwy_A eiwy_1 1 naive
## 6 SAMEA103885136 eiwy_B eiwy_1 1 IFNg
## macrophage_harvest salmonella_date ng_ul_mean rna_extraction rna_submit
## 1 11/6/2015 11/13/2015 293.9625 11/30/2015 12/9/2015
## 2 11/6/2015 11/13/2015 293.9625 11/30/2015 12/9/2015
## 3 11/6/2015 11/13/2015 293.9625 11/30/2015 12/9/2015
## 4 11/6/2015 11/13/2015 293.9625 11/30/2015 12/9/2015
## 5 11/25/2015 12/2/2015 193.5450 12/3/2015 12/9/2015
## 6 11/25/2015 12/2/2015 193.5450 12/3/2015 12/9/2015
## library_pool chemistry rna_auto
## 1 3226_RNAauto-091215 V4_auto 1
## 2 3226_RNAauto-091215 V4_auto 1
## 3 3226_RNAauto-091215 V4_auto 1
## 4 3226_RNAauto-091215 V4_auto 1
## 5 3226_RNAauto-091215 V4_auto 1
## 6 3226_RNAauto-091215 V4_auto 1
We will subset to certain columns of interest, and re-name them for later.
coldata
needs to have a column files
which specifies the path to the quantification files. In this case, we’ve gzipped the quantification files, so we point to the quant.sf.gz
file. We make sure that all the files exist in the location we specified.
coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
all(file.exists(coldata$files))
## [1] TRUE
tximeta
We will read in quantification data for some of the samples. First we load the SummarizedExperiment package. We will store out data and the output of the statistical method in a SummarizedExperiment object.
We load in the quantification data with tximeta
:
We can see that all the assays have been loaded:
## [1] "counts" "abundance" "length" "infRep1" "infRep2"
## [6] "infRep3" "infRep4" "infRep5" "infRep6" "infRep7"
## [11] "infRep8" "infRep9" "infRep10" "infRep11" "infRep12"
## [16] "infRep13" "infRep14" "infRep15" "infRep16" "infRep17"
## [21] "infRep18" "infRep19" "infRep20"
tximeta
loads transcript-level data, although it can later be summarized to the gene levels:
## [1] "ENST00000456328.2" "ENST00000450305.2" "ENST00000488147.1"
## [4] "ENST00000619216.1" "ENST00000473358.1" "ENST00000469289.1"
We will rename our SummarizedExperiment y
for the statistical analysis. For speed of the vignette, we subset to the transcripts on chromosome 1.
Two demonstrate a two group comparison, we subset to the “naive” and “IFNg” groups.
swish
at the transcript levelRunning swish
has three steps: scaling the inferential replicates, labeling the rows with sufficient counts for running differential expression, and then calculating the statistics. As swish
makes use of pseudo-random number generation in breaking ties and in calculating permutations (through samr), to obtain identical results, one needs to set a random seed before running swish()
, as we do below:
library(fishpond)
y <- scaleInfReps(y)
y <- labelKeep(y)
y <- y[mcols(y)$keep,]
set.seed(1)
y <- swish(y, x="condition", pair="line")
A note about labelKeep
: by default we keep features with minN=3
samples with a minimal count of 10. For scRNA-seq data with de-duplicated UMI counts, we recommend to lower the count, e.g. a count of 3, across a higher number of minN
cells, depending on the number of cells being compared. You can also set x="condition"
when running labelKeep
which will use the condition variable to set minN
.
The results are stored in mcols(y)
. We will show below how to pull out the top up- and down-regulated transcripts.
We can see how many transcripts are in a 5% FDR set:
##
## FALSE TRUE
## 5081 1329
We can check the distribution of p-values. This looks as expected for a comparison where we expect many transcripts will be affected by the treatment (IFNg stimulation of macrophage cells). There is a flat component and then an enrichment of transcripts with p-values near 0.
Of the transcripts in this set, which have the most extreme log2 fold change? Note that often many transcripts will share the same q-value, so it’s valuable to look at the log2 fold change as well (see further note below on q-value computation). The log2 fold change computed by swish
is the median over inferential replicates, and uses a pseudo-count of 5 on the scaled counts, to stabilize the variance on the fold change from division by small counts. Here we make two vectors that give the significant genes with the lowest (most negative) and highest (most positive) log fold changes.
## sign.lfc
## sig -1 0 1
## FALSE 2827 2 2252
## TRUE 616 0 713
Here we print a small table with just the calculated statistics for the large positive log fold change transcripts (up-regulation):
## [1] "tx_id" "gene_id" "tx_name" "log10mean" "keep"
## [6] "stat" "log2FC" "pvalue" "locfdr" "qvalue"
## log10mean log2FC pvalue qvalue
## ENST00000370459.7 3.85 10.27 5.2e-06 5.14e-05
## ENST00000355754.6 4.46 9.80 5.2e-06 5.14e-05
## ENST00000481145.1 3.41 8.78 5.2e-06 5.14e-05
## ENST00000443807.1 3.35 8.37 5.2e-06 5.14e-05
## ENST00000370473.4 4.87 8.11 5.2e-06 5.14e-05
## ENST00000368042.7 3.44 7.98 5.2e-06 5.14e-05
Likewise for the largest negative log fold change transcripts (down-regulation):
## log10mean log2FC pvalue qvalue
## ENST00000649724.1 2.83 -6.26 9.75e-03 4.83e-02
## ENST00000305352.6 2.90 -5.54 5.20e-06 5.14e-05
## ENST00000348581.9 2.19 -4.33 5.20e-06 5.14e-05
## ENST00000451341.1 1.73 -4.11 5.20e-06 5.14e-05
## ENST00000393688.7 2.18 -3.86 7.71e-03 4.27e-02
## ENST00000610222.2 2.07 -3.45 4.71e-03 3.48e-02
We can plot the scaled counts for the inferential replicates, and also group the samples by a covariate, in this case the cell line. The analysis was paired, so the statistic assessed if the change within pairs was consistent. Here we plot the 100th top up-regulated transcript:
We can make an MA plot, where the transcripts in our FDR set are colored:
Using the addIds
function from tximeta, we can easily add gene symbols. By specifying gene=TRUE
, this will use the gene ID to match to gene symbols for all of the transcripts.
We can then add gene symbols to our MA plot:
swish
at the gene levelWe can also run swish at the gene level. First we summarize all of the data to the gene level, using the summarizeToGene
function from tximeta. Again, we rename the object for statistical analysis, and then we subset to the genes on chromosome 1 for the demonstration.
Two demonstrate a two group comparison, we subset to the “naive” and “IFNg” groups, as before.
gy <- gy[,gy$condition %in% c("naive","IFNg")]
gy$condition <- factor(gy$condition, c("naive","IFNg"))
Next we can run the same steps as before. Again we set a random seed in order to be able to reproduce exact results in the future:
gy <- scaleInfReps(gy)
gy <- labelKeep(gy)
gy <- gy[mcols(gy)$keep,]
set.seed(1)
gy <- swish(gy, x="condition", pair="line")
As before, the number of genes in a 1% FDR set:
##
## FALSE TRUE
## 1047 744
The histogram of p-values:
As before, finding the genes with the most extreme log2 fold change:
## sign.lfc
## sig -1 1
## FALSE 599 448
## TRUE 374 370
sig <- mcols(gy)$qvalue < .05
glo <- order(mcols(gy)$log2FC * sig)
ghi <- order(-mcols(gy)$log2FC * sig)
## log10mean log2FC pvalue qvalue
## ENSG00000154451.14 4.15 10.24 1.86e-05 6.73e-05
## ENSG00000162654.8 4.54 10.04 1.86e-05 6.73e-05
## ENSG00000117228.9 4.90 8.13 1.86e-05 6.73e-05
## ENSG00000163568.14 2.81 6.89 1.86e-05 6.73e-05
## ENSG00000162645.12 4.35 6.61 1.86e-05 6.73e-05
## ENSG00000026751.16 4.36 6.34 1.86e-05 6.73e-05
## log10mean log2FC pvalue qvalue
## ENSG00000170989.9 2.93 -5.26 1.86e-05 6.73e-05
## ENSG00000224968.1 1.75 -4.16 1.86e-05 6.73e-05
## ENSG00000007968.6 2.73 -3.35 1.86e-05 6.73e-05
## ENSG00000183856.10 3.41 -3.06 1.86e-05 6.73e-05
## ENSG00000085999.11 2.47 -2.92 1.05e-02 3.51e-02
## ENSG00000229162.1 1.65 -2.86 1.86e-05 6.73e-05
We can plot a particular one of these genes:
As expected, the highly up-regulated genes are involved in immune response. Many genes encoding guanylate-binding proteins (GBP) are up-regulated, and these proteins are induced by interferon, produced in response to infection by pathogenic microbes.
We can make an MA plot, where the genes in our FDR set are colored:
Again, using the addIds
function from tximeta, we can easily add gene symbols to our gene-level expression analysis:
We can then add gene symbols to our MA plot:
We also provide in swish
methods for testing if a condition effect varies across a secondary covariate, using matched samples for condition, or un-matched samples, which we refer to as “interactions” in the software.
If matched samples are available, we compute the log2 fold change for each pair of samples across condition in the same covariate group, and then we use a Wilcoxon rank sum statistic for comparing the log2 fold changes across the secondary covariate. For permutation significance, the secondary covariate labels of the pairs are permuted. For unmatched samples, multiple random “pseudo-pairs” of samples across condition within the two covariate groups are chosen, and the statistic computed as above, averaging over the random pseudo-pairings. The motivation for the above permutation schemes is to ensure the following condition, that “under the null hypothesis, the likelihood of the data is invariant under these permutations” (Anderson and Ter Braak 2003), where our null hypothesis specifically involves the interaction between condition and the secondary covariate.
For the macrophage dataset we have been working with (Alasoo et al. 2018), we have a 2x2 experimental design, with IFN gamma stimulation, Salmonella infection, and both treatments, as well as control samples. We have these four conditions across 6 cell lines from 6 donors (a subset of all the RNA-seq samples available). So we can use the first method described above, where the cell line is used to match samples across condition. Our implementation does not make use of the pairing information across the secondary covariate, but we will still be well powered to detect differences in the log2 fold change.
We begin the interaction analysis by re-loading the SummarizedExperiment with all the samples, and defining two new factors indicating IFNg status and Salmonella status:
se$ifng <- factor(ifelse(
grepl("IFNg",se$condition),
"treated","control"))
se$salmonella <- factor(ifelse(
grepl("SL1344",se$condition),
"infected","control"))
with(colData(se),
table(ifng, salmonella)
)
## salmonella
## ifng control infected
## control 6 6
## treated 6 6
We will work with the chromosome 1 transcripts for demonstration:
Our implementation of the interaction design for matched samples takes into account matched samples within the x
condition, which we will specify to be the Salmonella infection status. We will specify the secondary covariate cov
to be the IFN gamma treatment. We will look for transcripts where the infection response changes based on IFN gamma treatment.
We actually have matched samples across both IFN gamma treatment and Salmonella infection, but the extra pairing is not used by our current implementation of interactions (it is common that there would not be pairing across the secondary covariate).
To perform the analysis, we create a new variable pair
which will record which samples are related within a group based on IFN gamma treatment status.
## [1] 1 1 2 2 3 3 4 4 5 5 6 6
## [1] 1 1 2 2 3 3 4 4 5 5 6 6
y2$pair[y2$ifng == "treated"] <- rep(7:12,each=2)
y2$pair <- factor(y2$pair)
table(y2$pair, y2$salmonella)
##
## control infected
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 1
## 5 1 1
## 6 1 1
## 7 1 1
## 8 1 1
## 9 1 1
## 10 1 1
## 11 1 1
## 12 1 1
swish
for interaction effectsWe now perform swish
analysis, specifying the Salmonella infection as our main condition, the IFN gamma treatment as the secondary covariate, and providing the pairing within IFN gamma treatment groups. We specify interaction=TRUE
to test for differences in infection response across IFN gamma treatment group.
In this case, we appear to have fewer non-null p-values from first impression of the p-value histogram:
The MA plot shows significant transcripts on either side of log2FC=0
. Note that the log2 fold change reported is the difference between the log2 fold change in the IFN gamma treated and IFN gamma control group. So positive log2FC
in this plot indicates that the effect is higher with IGN gamma treatment than in absence of the treatment.
We can plot some of the transcripts with high log2 fold change difference across IFN gamma treatment group, and which belong to the less than 5% nominal FDR group:
swish
There are currently five types of analysis supported by swish
:
This vignette demonstrated the third in this list, but the others can be run by either not specifying any additional covariates, or by specifying a batch variable with the argument cov
instead of pair
. The two interaction tests can be run by specifying interaction=TRUE
and providing x
, cov
, and optionally pair
.
As with SAMseq and SAM, swish
makes use of the permutation plug-in approach for q-value calculation. swish
calls the empPvals
and qvalue
functions from the qvalue package to calculate the q-values (or optionally similar functions from the samr package). If we plot the q-values against the statistic, or against the log2 fold change, one can see clusters of genes with the same q-value (because they have the same or similar statistic). One consequence of this is that, in order to rank the genes, rather than ranking directly by q-value, it makes more sense to pick a q-value threshold and then within that set of genes, to rank by the log2 fold change, as shown above when the code chunk has log2FC * sig
.
## [1] 6.734007e-05
The Alevin (Srivastava et al. 2018) and tximport / tximeta maintainers are working on an efficient format for storing and importing the sparse scRNA-seq estimated gene counts. We plan to develop an official importer for Alevin bootstrap inferential replicates in March-April 2019, and add to tximport before the Bioconductor 3.9 release (import of Alevin counts without inferential replicates is already supported in tximport). Currently, Alevin outputs dense binary inferential replicates, which is a format we would like to move away from, as the quantification data is then larger on disk than necessary. For example, the 900 mouse neuron dataset from 10x Genomics with 30 bootstrap replicates takes up 365 Mb. We are coordinating with other Bioconductor developers to avoid duplicated efforts.
Between now and the release of Bioconductor 3.9 in April 2019, one can use the following code snippet to import Alevin bootstrap inferential replicates into R/Bioconductor, for analysis with swish
:
This code reads in point estimates of gene counts as a matrix mat
. It then passes once over the data to count the number of non-zero entries per cell and per inferential replicate. It then pre-allocates vectors of certain size, representing the non-zero entries per cell and per inferential replicate, and passes over the data again to fill these vectors. Finally a list of sparse matrices bmat.list
is created which provides all the inferential replicates of estimated gene counts. These matrices can be efficiently subsetted to genes and cells with sufficient counts before building a SummarizedExperiment.
Why not loom to store the data? While loom and the LoomExperiment package in Bioconductor are designed for analyzing large ’omics datasets, note that loom is explicitly not recommended for data storage. We imagine that we may want to support LoomExperiment objects for working with the inferential replicates in memory during an R/Bioconductor session, but it is not a recommended approach for data storage, from the loom documentation:
Loom files are great for distribution of large datasets, and for use in computational pipelines. They do not support writing and reading concurrently. They also do not support journalling, so if something happens during a write, the entire file can be lost. Therefore, do not use loom files as your primary data storage. They are for working with data, not keeping it safe.
The following diagrams describe the permutation schemes used for the interaction designs implemented in swish
. The case with matched samples (pair indicated by number, primary condition indicated by color, the vertical line separating the pairs by secondary covariate):
The case without matched samples (sample indicated by letter, primary condition indicated by color, the vertical line separating the samples by secondary covariate). Here multiple random pseudo-pairs are chosen across condition. The permutation scheme ensures that LFCs are always calculated between samples from the same covariate group.
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GenomicFeatures_1.36.4 org.Hs.eg.db_3.8.2
## [3] AnnotationDbi_1.46.0 fishpond_1.0.2
## [5] tximeta_1.2.2 SummarizedExperiment_1.14.1
## [7] DelayedArray_0.10.0 BiocParallel_1.18.1
## [9] matrixStats_0.54.0 Biobase_2.44.0
## [11] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
## [13] IRanges_2.18.1 S4Vectors_0.22.0
## [15] BiocGenerics_0.30.0 macrophage_1.0.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.1 splines_3.6.1
## [3] bit64_0.9-7 jsonlite_1.6
## [5] gtools_3.8.1 assertthat_0.2.1
## [7] BiocFileCache_1.8.0 blob_1.2.0
## [9] GenomeInfoDbData_1.2.1 Rsamtools_2.0.0
## [11] yaml_2.2.0 progress_1.2.2
## [13] pillar_1.4.2 RSQLite_2.1.2
## [15] backports_1.1.4 lattice_0.20-38
## [17] glue_1.3.1 digest_0.6.20
## [19] XVector_0.24.0 qvalue_2.16.0
## [21] colorspace_1.4-1 plyr_1.8.4
## [23] htmltools_0.3.6 Matrix_1.2-17
## [25] XML_3.98-1.20 pkgconfig_2.0.2
## [27] biomaRt_2.40.3 zlibbioc_1.30.0
## [29] purrr_0.3.2 scales_1.0.0
## [31] tibble_2.1.3 ggplot2_3.2.1
## [33] AnnotationFilter_1.8.0 lazyeval_0.2.2
## [35] magrittr_1.5 crayon_1.3.4
## [37] memoise_1.1.0 evaluate_0.14
## [39] tools_3.6.1 prettyunits_1.0.2
## [41] hms_0.5.0 stringr_1.4.0
## [43] munsell_0.5.0 ensembldb_2.8.0
## [45] Biostrings_2.52.0 compiler_3.6.1
## [47] rlang_0.4.0 grid_3.6.1
## [49] RCurl_1.95-4.12 tximport_1.12.3
## [51] rappdirs_0.3.1 bitops_1.0-6
## [53] rmarkdown_1.14 gtable_0.3.0
## [55] abind_1.4-5 DBI_1.0.0
## [57] curl_4.0 reshape2_1.4.3
## [59] R6_2.4.0 GenomicAlignments_1.20.1
## [61] knitr_1.24 dplyr_0.8.3
## [63] rtracklayer_1.44.2 bit_1.1-14
## [65] zeallot_0.1.0 ProtGenerics_1.16.0
## [67] readr_1.3.1 stringi_1.4.3
## [69] Rcpp_1.0.2 vctrs_0.2.0
## [71] dbplyr_1.4.2 tidyselect_0.2.5
## [73] xfun_0.8
Alasoo, K, J Rodrigues, S Mukhopadhyay, AJ Knights, AL Mann, K Kundu, HIPSCI-Consortium, C Hale, Dougan G, and DJ Gaffney. 2018. “Shared genetic effects on chromatin and gene expression indicate a role for enhancer priming in immune response.” Nature Genetics 50:424–31. https://doi.org/10.1038/s41588-018-0046-7.
Anderson, MJ, and CJF Ter Braak. 2003. “Permutation Tests for Multi-Factorial Analysis of Variance.” Journal of Statistical Computation and Simulation 73 (2):85–113.
Frankish, A, GENCODE-consoritum, and P Flicek. 2018. “GENCODE reference annotation for the human and mouse genomes.” Nucleic Acids Research. https://doi.org/10.1093/nar/gky955.
Li, J, and R Tibshirani. 2011. “Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-Seq data.” Statistical Methods in Medical Research 22 (5):519–36. https://doi.org/10.1177/0962280211428386.
Patro, R, G Duggal, MI Love, RA Irizarry, and C Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nature Methods 14:417–19. https://doi.org/10.1038/nmeth.4197.
Srivastava, A, L Malik, TS Smith, I Sudbery, and R Patro. 2018. “Alevin efficiently estimates accurate gene abundances from dscRNA-seq data.” bioRxiv. https://doi.org/10.1101/335000.
Storey, JD, and R Tibshirani. 2003. “ Statistical significance for genome-wide experiments.” Proceedings of the National Academy of Sciences 100 (16):9440–5. https://doi.org/10.1073/pnas.1530509100.
Zhu, A, A Srivastava, JG Ibrahim, R Patro, and MI Love. 2019. “Nonparametric expression analysis using inferential replicate counts.” bioRxiv. https://doi.org/10.1101/561084.