LineagePulse 1.4.0
LineagePulse is a differential expression algorithm for single-cell RNA-seq (scRNA-seq) data. It is based on a zero-inflated negative binomial noise model and can capture both discrete and continuous population structures. Discrete population structures are groups of cells (e.g. experimental conditions or tSNE clusters). Continuous population structures are, for example, pseudotemporal or temporal orderings of cells. The main use and novelty of LineagePulse lies in its ability to fit gene expression trajectories on pseudotemporal orderings of cells. Note that LineagePulse does not infer a pseudotemporal ordering; it is a downstream analytic tool for analysing gene expression trajectories on a given pseudotemporal ordering (such as one from diffusion pseudotime or monocle2).
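For orientation, a zero-inflated negative binomial (ZINB) distribution is commonly written as a mixture of a point mass at zero (drop-out) and a negative binomial component. The block below gives this standard textbook form; the symbols (mean \mu, dispersion \phi, drop-out probability \pi) are notational assumptions made here and are not taken from the LineagePulse documentation, whose exact parameterisation may differ.

```latex
P(x \mid \mu, \phi, \pi)
  = \pi \, \mathbb{1}[x = 0] + (1 - \pi) \, \mathrm{NB}(x \mid \mu, \phi),
\qquad
\mathrm{NB}(x \mid \mu, \phi)
  = \frac{\Gamma(x + \phi)}{\Gamma(\phi) \, x!}
    \left( \frac{\phi}{\phi + \mu} \right)^{\phi}
    \left( \frac{\mu}{\phi + \mu} \right)^{x}
```

In this parameterisation the negative binomial component has mean \mu and variance \mu + \mu^2 / \phi, so a larger \phi means less overdispersion.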
To run LineagePulse on scRNA-seq data, the user needs to supply a minimal set of input parameters to the wrapper function runLineagePulse, which then performs all normalisation, model fitting and differential expression analysis steps without any further user interaction:
Additionally, one can provide:
Lastly, the experienced user who has a solid grasp of the mathematical and algorithmic basis of LineagePulse may change the defaults of these advanced input options:
Here, we present a differential expression analysis scenario on a longitudinal ordering. The differential expression results are stored in a data frame which can be accessed from the output object via list-like properties ($). The core results are the p-value and the false-discovery-rate-corrected p-value of differential expression, which result from a gene-wise hypothesis test of a non-constant expression model (impulse, splines or groups) against a constant expression model.
library(LineagePulse)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
lsSimulatedData <- simulateContinuousDataSet(
    scaNCells = 100,
    scaNConst = 10,
    scaNLin = 10,
    scaNImp = 10,
    scaMumax = 100,
    scaSDMuAmplitude = 3,
    vecNormConstExternal = NULL,
    vecDispExternal = rep(20, 30),
    vecGeneWiseDropoutRates = rep(0.1, 30))
## Draw mean trajectories
## Setting size factors uniformly =1
## Draw dispersion
## Simulate negative binomial noise
## Simulate drop-out
objLP <- runLineagePulse(
    counts = lsSimulatedData$counts,
    dfAnnotation = lsSimulatedData$annot)
## LineagePulse for count data: v1.4.0
## --- Data preprocessing
## # 0 out of 100 cells did not have a continuous covariate and were excluded.
## # 0 out of 30 genes did not contain non-zero observations and are excluded from analysis.
## # 0 out of 100 cells did not contain non-zero observations and are excluded from analysis.
## --- Compute normalisation constants:
## # All size factors are set to one.
## --- Fit ZINB model for both H1 and H0.
## ### a) Fit ZINB model A (H0: mu=constant disp=constant) with noise model.
## # . Initialisation: ll -24950.2826257684
## # 1. Iteration with ll -13057.3434879246 in 0.02 min.
## # 2. Iteration with ll -12725.6100800047 in 0.05 min.
## # 3. Iteration with ll -12725.6087943848 in 0.02 min.
## Finished fitting zero-inflated negative binomial model A with noise model in 0.12 min.
## ### b) Fit ZINB model B (H1: mu=splines disp=constant).
## # . Initialisation: ll -14388.5875174006
## # 1. Iteration with ll -12337.0338195748 in 0.02 min.
## Finished fitting zero-inflated negative binomial model B in 0.03 min.
## ### c) Fit NB model A (H0: mu=constant disp=constant).
## # . Initialisation: ll -14251.5454698876
## # 1. Iteration with ll -14072.0690962406 in 0.01 min.
## Finished fitting NB model B in 0.02 min.
## ### d) Fit NB model B (H1: mu=splines disp=constant).
## # . Initialisation: ll -14458.6945343055
## # 1. Iteration with ll -13957.998971559 in 0.02 min.
## Finished fitting NB model B in 0.03 min.
## Time elapsed during ZINB fitting: 0.22 min
## --- Run differential expression analysis.
## Finished runLineagePulse().
head(objLP$dfResults)
## gene p padj mean_H0 p_nb padj_nb
## gene_1 gene_1 0.71078558 0.7615560 70.923131 0.71078558 0.9990978
## gene_2 gene_2 0.08328868 0.1469800 6.249991 0.08328868 0.2803342
## gene_3 gene_3 0.64299702 0.7419196 3.149716 0.64299702 0.9990978
## gene_4 gene_4 0.57069463 0.6848336 35.019657 0.57069463 0.9990978
## gene_5 gene_5 0.36380894 0.4547612 87.571101 0.36380894 0.9990978
## gene_6 gene_6 0.07616542 0.1428102 77.777865 0.07616542 0.9990978
## df_full_zinb df_red_zinb df_full_nb df_red_nb loglik_full_zinb
## gene_1 7 2 7 2 -441.3621
## gene_2 7 2 7 2 -279.0289
## gene_3 7 2 7 2 -204.6424
## gene_4 7 2 7 2 -408.0237
## gene_5 7 2 7 2 -454.5199
## gene_6 7 2 7 2 -442.8618
## loglik_red_zinb loglik_full_nb loglik_red_nb allZero
## gene_1 -442.8271 -519.4502 -520.7991 FALSE
## gene_2 -283.8934 -279.6291 -284.4807 FALSE
## gene_3 -206.3279 -206.5198 -208.9529 FALSE
## gene_4 -409.9504 -451.1589 -452.4751 FALSE
## gene_5 -457.2433 -542.6512 -543.0586 FALSE
## gene_6 -447.8455 -522.8982 -523.3171 FALSE
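The p column reported above can be related to the log-likelihood columns of the same table. The following sketch assumes that the gene-wise test is a likelihood ratio test whose deviance is compared against a chi-squared distribution with df_full_zinb - df_red_zinb degrees of freedom. This is an inference from the table, not something stated in the output, and the variable names are illustrative. The second part applies a Benjamini-Hochberg correction with base R's p.adjust, again as an assumption about how padj is obtained.

```r
# Values copied from the gene_2 row of the results table above.
scaLoglikFull <- -279.0289  # loglik_full_zinb (alternative model)
scaLoglikRed <- -283.8934   # loglik_red_zinb (constant model)
scaDfFull <- 7              # df_full_zinb
scaDfRed <- 2               # df_red_zinb

# ASSUMPTION: likelihood ratio test with a chi-squared null distribution.
scaDeviance <- 2 * (scaLoglikFull - scaLoglikRed)
scaP <- pchisq(scaDeviance, df = scaDfFull - scaDfRed, lower.tail = FALSE)
cat("LRT p-value for gene_2:", signif(scaP, 3), "\n")

# ASSUMPTION: padj is a Benjamini-Hochberg correction of p.
# Only the six p-values printed above are used here, so these adjusted
# values will NOT match the padj column, which was computed over all 30 genes.
vecP <- c(gene_1 = 0.71078558, gene_2 = 0.08328868, gene_3 = 0.64299702,
          gene_4 = 0.57069463, gene_5 = 0.36380894, gene_6 = 0.07616542)
vecPadj <- p.adjust(vecP, method = "BH")
print(round(vecPadj, 3))
```

Under the likelihood ratio assumption, the computed value should be close to the entry in the p column for gene_2.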
In addition to the raw p-values, one may be interested in further details of the expression models, such as the shape of the expression mean as a function of pseudotime, log fold changes (LFC) and global expression trends as a function of pseudotime. We address each of these follow-up questions in a separate section below. Note that all of these follow-up questions are answered based on the models that were fit to compute the p-value of differential expression. Therefore, once runLineagePulse() has been called, no further model fitting is required.
# Further inspection of results
## Plot gene-wise trajectories
Multiple options are available for gene-wise expression trajectory plotting. Observations can be coloured by the posterior probability of drop-out (boolColourByDropout). Observations can be normalized based on the alternative expression model or taken as raw observations for the scatter plot (boolH1NormCounts). Lineage contours can be added to aid visual interpretation of effects related to non-uniform population density in pseudotime (boolLineageContour). Log counts can be displayed instead of counts if the fold changes are large (boolLogPlot). In any case, the output of the gene-wise expression trajectory plotting function plotGene is a ggplot2 object which can then be printed or modified.
# plot the gene with the lowest p-value of differential expression
gplotExprProfile <- plotGene(
    objLP = objLP, boolLogPlot = FALSE,
    strGeneID = objLP$dfResults[which.min(objLP$dfResults$p),]$gene,
    boolLineageContour = FALSE)
gplotExprProfile
The function plotGene also shows the H1 model fit under a negative binomial noise model (“H1(NB)”) as a reference to show what the model fit looks like if drop-out is not accounted for.
LineagePulse provides the user with parameter extraction functions that allow the user to interact directly with the raw model fits for analytic tasks or questions not addressed above.
# extract the mean parameter fits per cell of the gene with the lowest p-value.
matMeanParamFit <- getFitsMean(
    lsMuModel = lsMuModelH1(objLP),
    vecGeneIDs = objLP$dfResults[which.min(objLP$dfResults$p),]$gene)
cat("Minimum fitted mean parameter: ", round(min(matMeanParamFit),1) )
## Minimum fitted mean parameter: 69.3
cat("Mean fitted mean parameter: ", round(mean(matMeanParamFit),1) )
## Mean fitted mean parameter: 168.2
Given a discrete population structure, such as tSNE clusters or experimental conditions, a fold change is the ratio of the mean expression values of the two groups. The definition of a fold change is less clear if a continuous expression trajectory is considered: of interest may be, for example, the fold change from the first to the last cell on the expression trajectory, or from the minimum to the maximum expression value. Note that in both cases, we compute fold changes on the model fit of the expression mean parameter, which is corrected for noise and therefore more stable than an estimate based on the raw expression count observations.
# first, extract the model fits for a given gene again
vecMeanParamFit <- getFitsMean(
    lsMuModel = lsMuModelH1(objLP),
    vecGeneIDs = objLP$dfResults[which.min(objLP$dfResults$p),]$gene)
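# compute log2-fold change from first to last cell on trajectory
# NOTE: the empty output below suggests that `$pseudotime` returns NULL here.
# The ordering column may be named `continuous` in this version (an assumption
# based on the "continuous covariate" preprocessing message above); if so,
# use `$continuous` in the two lines below.
idxFirstCell <- which.min(dfAnnotationProc(objLP)$pseudotime)
idxLastCell <- which.max(dfAnnotationProc(objLP)$pseudotime)
cat("LFC first to last cell on trajectory: ",
    round( (log(vecMeanParamFit[idxLastCell]) -
        log(vecMeanParamFit[idxFirstCell])) / log(2), 1) )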
## LFC first to last cell on trajectory:
# compute log2-fold change from minimum to maximum value of expression trajectory
cat("LFC minimum to maximum expression value of model fit: ",
    round( (log(max(vecMeanParamFit)) -
        log(min(vecMeanParamFit))) / log(2), 1) )
## LFC minimum to maximum expression value of model fit: 2.1
Global expression profiles, or expression profiles across large groups of genes, can be visualised via heatmaps of expression z-scores. One could extract the expression mean parameter fits as described above and create such heatmaps from scratch. LineagePulse also offers a wrapper for creating such a heatmap:
# create heatmap with all differentially expressed genes
lsHeatmaps <- sortGeneTrajectories(
    vecIDs = objLP$dfResults[which(objLP$dfResults$padj < 0.01),]$gene,
    lsMuModel = lsMuModelH1(objLP),
    dirHeatmap = NULL)
print(lsHeatmaps$hmGeneSorted)
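The from-scratch route mentioned above can be sketched in base R. The example below uses a small made-up matrix of mean trajectories (genes x cells, cells ordered along pseudotime) as a stand-in for fits extracted with getFitsMean. All names and values are hypothetical, and sorting genes by the position of their peak is one illustrative choice, not necessarily what sortGeneTrajectories does.

```r
# Toy stand-in for fitted mean parameters (genes x cells).
# Cells are assumed to be ordered by pseudotime already.
vecPseudotime <- seq(0, 1, length.out = 50)
matMu <- rbind(
    gene_a = 10 + 50 * vecPseudotime,                          # increasing
    gene_b = 60 - 40 * vecPseudotime,                          # decreasing
    gene_c = 20 + 80 * exp(-(vecPseudotime - 0.5)^2 / 0.02))   # transient peak

# Row-wise z-scores: centre and scale each gene across cells.
matZ <- t(scale(t(matMu)))

# Sort genes by the position of their maximum along the trajectory.
vecPeakIdx <- apply(matZ, 1, which.max)
matZSorted <- matZ[order(vecPeakIdx), , drop = FALSE]

# Draw the heatmap without clustering or additional scaling.
heatmap(matZSorted, Rowv = NA, Colv = NA, scale = "none")
```

Using z-scores puts genes with very different expression levels on a common scale, so the heatmap highlights the shape of each trajectory rather than its magnitude.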
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] LineagePulse_1.4.0 BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 lattice_0.20-38
## [3] circlize_0.4.6 png_0.1-7
## [5] gtools_3.8.1 assertthat_0.2.1
## [7] digest_0.6.18 SingleCellExperiment_1.6.0
## [9] R6_2.4.0 GenomeInfoDb_1.20.0
## [11] plyr_1.8.4 stats4_3.6.0
## [13] evaluate_0.13 ggplot2_3.1.1
## [15] pillar_1.3.1 gplots_3.0.1.1
## [17] GlobalOptions_0.1.0 zlibbioc_1.30.0
## [19] rlang_0.3.4 lazyeval_0.2.2
## [21] gdata_2.18.0 S4Vectors_0.22.0
## [23] GetoptLong_0.1.7 Matrix_1.2-17
## [25] rmarkdown_1.12 labeling_0.3
## [27] splines_3.6.0 BiocParallel_1.18.0
## [29] stringr_1.4.0 RCurl_1.95-4.12
## [31] munsell_0.5.0 DelayedArray_0.10.0
## [33] compiler_3.6.0 xfun_0.6
## [35] pkgconfig_2.0.2 BiocGenerics_0.30.0
## [37] shape_1.4.4 htmltools_0.3.6
## [39] tidyselect_0.2.5 SummarizedExperiment_1.14.0
## [41] tibble_2.1.1 GenomeInfoDbData_1.2.1
## [43] bookdown_0.9 IRanges_2.18.0
## [45] matrixStats_0.54.0 crayon_1.3.4
## [47] dplyr_0.8.0.1 bitops_1.0-6
## [49] grid_3.6.0 gtable_0.3.0
## [51] magrittr_1.5 scales_1.0.0
## [53] KernSmooth_2.23-15 stringi_1.4.3
## [55] XVector_0.24.0 rjson_0.2.20
## [57] RColorBrewer_1.1-2 tools_3.6.0
## [59] Biobase_2.44.0 glue_1.3.1
## [61] purrr_0.3.2 parallel_3.6.0
## [63] yaml_2.2.0 clue_0.3-57
## [65] colorspace_1.4-1 cluster_2.0.9
## [67] BiocManager_1.30.4 caTools_1.17.1.2
## [69] GenomicRanges_1.36.0 ComplexHeatmap_2.0.0
## [71] knitr_1.22