--- title: "Estimate and remove cross-contamination from ambient RNA for scRNA-seq data with DecontX" author: "Shiyi Yang, Sean Corbett, Yusuke Koga, Zhe Wang, W. Evan Johnson, Masanao Yajima, Joshua D. Campbell" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{Estimate and remove cross-contamination from ambient RNA for scRNA-seq data with DecontX} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction DecontX is a Bayesian hierarchical model to estimate and remove cross-contamination from ambient RNA in single-cell RNA-seq count data generated from droplet-based sequencing devices. DecontX will take the count matrix with/without the cell labels and estimate the contamination level and deliver a decontaminted count matrix for downstream analysis. In this vignette we will demonstrate how to use DecontX to estimate and remove contamination. # Installation celda can be installed from Bioconductor: ```{r, eval= FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)){ install.packages("BiocManager")} BiocManager::install("celda") ``` The package can be loaded using the `library` command. ```{r, eval=TRUE, message=FALSE} library(celda) ``` To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/. # Reproducibility note Many functions in *celda* make use of stochastic algorithms or procedures which require the use of random number generator (RNG) for simulation or sampling. To maintain reproducibility, all these functions use a **default seed of 12345** to make sure same results are generated each time one of these functions is called. Explicitly setting the `seed` arguments is needed for greater control and randomness. # Generation of a cross-contaminated dataset DecontX will take a matrix of counts (referred as observed counts) where each row is a feature, each column is a cell, and each entry in the matrix is the number of counts of each feature in each cell. To illustrate the utility of DecontX, we will apply it to a simulated dataset. In the function `simulateContaminatedMatrix`, the K parameter designates the number of cell clusters, the C parameter determines the number of cells, the G parameter determines the number of genes in the simulated dataset. ```{r} simCounts <- simulateContaminatedMatrix(G = 300, C = 100, K = 3) ``` The `nativeCounts` is the natively expressed counts matrix, and `observedCounts` is the observed counts matrix that contains both contaminated and natively expressed transctripts. The `NByC` is the total number of observed transcripts per cell. The counts matrix which only contains contamianted transcripts can be obtained by subtracting the observed counts matrix from the observed counts matrix. ```{r} contamination <- simCounts$observedCounts - simCounts$nativeCounts ``` The `z` variable contains the population label for each cell. ```{r} table(simCounts$z) ``` The `phi` and `eta` variables contain the expression distributions and contamination distributions for each population, respectively. Each column corresponds to a population, each row represents a gene. The sum of the rows equal to 1. ```{r} colSums(simCounts$phi) colSums(simCounts$eta) ``` # Decontamination using DecontX DecontX uses bayesian method to estimate and remove contamination via varitaional inference. ```{r, warning = FALSE, message = FALSE} decontxModel <- decontX(counts = simCounts$observedCounts, z = simCounts$z) ``` ## Check convergence Use log-likelihood to check convergence ```{r, eval = TRUE, fig.width = 5, fig.height = 5} plot(decontxModel$resList$logLikelihood) ``` ## Evaluate model performance `decontX` estimates a contamination proportion for each cell. We compare the estimated contamination proportion with the real contamination proportion. ```{r, eval = TRUE, fig.width = 5, fig.height = 5} plot(decontxModel$resList$estConp, colSums(contamination) / simCounts$NByC, col = simCounts$z) abline(0, 1) ``` # Session Information ```{r} sessionInfo() ```