DecontX is a Bayesian hierarchical model to estimate and remove cross-contamination from ambient RNA in single-cell RNA-seq count data generated from droplet-based sequencing devices. DecontX will take the count matrix with/without the cell labels and estimate the contamination level and deliver a decontaminted count matrix for downstream analysis.
In this vignette we will demonstrate how to use DecontX to estimate and remove contamination.
celda can be installed from Bioconductor:
if (!requireNamespace("BiocManager", quietly=TRUE)){
install.packages("BiocManager")}
BiocManager::install("celda")
The package can be loaded using the library
command.
library(celda)
To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/.
Many functions in celda make use of stochastic algorithms or procedures which require the use of random number generator (RNG) for simulation or sampling. To maintain reproducibility, all these functions use a default seed of 12345 to make sure same results are generated each time one of these functions is called. Explicitly setting the seed
arguments is needed for greater control and randomness.
DecontX will take a matrix of counts (referred as observed counts) where each row is a feature, each column is a cell, and each entry in the matrix is the number of counts of each feature in each cell. To illustrate the utility of DecontX, we will apply it to a simulated dataset.
In the function simulateContaminatedMatrix
, the K parameter designates the number of cell clusters, the C parameter determines the number of cells, the G parameter determines the number of genes in the simulated dataset.
simCounts <- simulateContaminatedMatrix(G = 300, C = 100, K = 3)
The nativeCounts
is the natively expressed counts matrix, and observedCounts
is the observed counts matrix that contains both contaminated and natively expressed transctripts. The NByC
is the total number of observed transcripts per cell. The counts matrix which only contains contamianted transcripts can be obtained by subtracting the observed counts matrix from the observed counts matrix.
contamination <- simCounts$observedCounts - simCounts$nativeCounts
The z
variable contains the population label for each cell.
table(simCounts$z)
##
## 1 2 3
## 26 36 38
The phi
and eta
variables contain the expression distributions and contamination distributions for each population, respectively. Each column corresponds to a population, each row represents a gene. The sum of the rows equal to 1.
colSums(simCounts$phi)
## [1] 1 1 1
colSums(simCounts$eta)
## [1] 1 1 1
DecontX uses bayesian method to estimate and remove contamination via varitaional inference.
decontxModel <- decontX(counts = simCounts$observedCounts, z = simCounts$z)
Use log-likelihood to check convergence
plot(decontxModel$resList$logLikelihood)
decontX
estimates a contamination proportion for each cell. We compare the estimated contamination proportion with the real contamination proportion.
plot(decontxModel$resList$estConp,
colSums(contamination) / simCounts$NByC, col = simCounts$z)
abline(0, 1)
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] celda_1.0.4 BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.44.0 httr_1.4.0
## [3] jsonlite_1.6 splines_3.6.0
## [5] foreach_1.4.4 assertthat_0.2.1
## [7] BiocManager_1.30.4 stats4_3.6.0
## [9] GenomeInfoDbData_1.2.1 yaml_2.2.0
## [11] ggrepel_0.8.1 pillar_1.4.1
## [13] lattice_0.20-38 reticulate_1.12
## [15] glue_1.3.1 RcppEigen_0.3.3.5.0
## [17] digest_0.6.19 GenomicRanges_1.36.0
## [19] RColorBrewer_1.1-2 XVector_0.24.0
## [21] minqa_1.2.4 colorspace_1.4-1
## [23] enrichR_1.0 htmltools_0.3.6
## [25] Matrix_1.2-17 plyr_1.8.4
## [27] pkgconfig_2.0.2 bookdown_0.11
## [29] zlibbioc_1.30.0 purrr_0.3.2
## [31] scales_1.0.0 Rtsne_0.15
## [33] BiocParallel_1.18.0 lme4_1.1-21
## [35] tibble_2.1.2 combinat_0.0-8
## [37] IRanges_2.18.1 ggplot2_3.1.1
## [39] withr_2.1.2 umap_0.2.2.0
## [41] SummarizedExperiment_1.14.0 BiocGenerics_0.30.0
## [43] lazyeval_0.2.2 magrittr_1.5
## [45] crayon_1.3.4 evaluate_0.14
## [47] doParallel_1.0.14 nlme_3.1-140
## [49] MASS_7.3-51.4 MAST_1.10.0
## [51] tools_3.6.0 data.table_1.12.2
## [53] matrixStats_0.54.0 stringr_1.4.0
## [55] S4Vectors_0.22.0 munsell_0.5.0
## [57] DelayedArray_0.10.0 compiler_3.6.0
## [59] GenomeInfoDb_1.20.0 rlang_0.3.4
## [61] blme_1.0-4 grid_3.6.0
## [63] RCurl_1.95-4.12 nloptr_1.2.1
## [65] iterators_1.0.10 rjson_0.2.20
## [67] SingleCellExperiment_1.6.0 bitops_1.0-6
## [69] rmarkdown_1.13 boot_1.3-22
## [71] gtable_0.3.0 codetools_0.2-16
## [73] abind_1.4-5 reshape2_1.4.3
## [75] R6_2.4.0 gridExtra_2.3
## [77] knitr_1.23 dplyr_0.8.1
## [79] stringi_1.4.3 parallel_3.6.0
## [81] Rcpp_1.0.1 MCMCprecision_0.3.9
## [83] tidyselect_0.2.5 xfun_0.7