--- title: "Analysis of single-cell genomic data with celda" author: - name: Joshua Campbell affiliation: Boston University School of Medicine email: camp@bu.edu - name: Zhe Wang affiliation: Boston University School of Medicine - name: Shiyi Yang affiliation: Boston University School of Medicine - name: Sean Corbett affiliation: Boston University School of Medicine - name: Yusuke Koga affiliation: Boston University School of Medicine date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{Analysis of single-cell genomic data with celda} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, dev = "png") ``` # Introduction **CE**llular **L**atent **D**irichlet **A**llocation (celda) is a collection of Bayesian hierarchical models to perform feature and cell bi-clustering for count data generated by single-cell platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications and has shown good performance with sparse data. celda simultaneously clusters features (i.e. gene expression) into modules based on co-expression patterns across cells and cells into subpopulations based on the probabilities of the feature modules within each cell. Starting from Bioconductor release 3.12 (`celda` version 1.6.0), `celda` makes use of `r BiocStyle::Biocpkg("SingleCellExperiment")` (SCE) objects for storing data and results. In this vignette we will demonstrate how to use celda to perform cell and feature clustering with a simple, small simulated dataset. This vignette does not include upstream importing of data, quality control, or filtering. To see a more complete analysis of larger real-world datasets, visit [camplab.net/celda](https://www.camplab.net/celda/) for additional vignettes. # Installation celda can be installed from Bioconductor: ```{r install, eval= FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("celda") ``` To load the package, type the following: ```{r library, message = FALSE} library(celda) ``` A complete list of help files are accessible using the help command with the `package` option. ```{r help, eval = FALSE} help(package = celda) ``` To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/.
# Generation of a simulated single cell dataset celda will take a matrix of counts where each row is a feature, each column is a cell, and each entry in the matrix is the number of counts of each feature in each cell. To illustrate the utility of celda, we will apply it to a simulated dataset. In the function `simulateCells`, the **K** parameter designates the number of cell clusters, the **L** parameter determines the number of feature modules, the **S** parameter determines the number of samples in the simulated dataset, the **G** parameter determines the number of features to be simulated, and **CRange** specifies the lower and upper bounds of the number of cells to be generated in each sample. To simulate a dataset of 5 samples with 5 cell populations, 10 feature modules, 200 features, and between 30 to 50 cells per sample using `celda_CG` model: ```{r simulate} simsce <- simulateCells("celda_CG", S = 5, K = 5, L = 10, G = 200, CRange = c(30, 50)) ``` The `counts` assay slot in `simsce` contains the counts matrix. The dimensions of counts matrix: ```{r assay, message = FALSE} library(SingleCellExperiment) dim(counts(simsce)) ``` Columns `celda_sample_label` and `celda_cell_cluster` in `colData(simsce)` contain sample labels and celda cell population cluster labels. Here are the numbers of cells in each subpopulation and in each sample: ```{r cell_numbers} table(colData(simsce)$celda_cell_cluster) table(colData(simsce)$celda_sample_label) ``` Column `celda_feature_module` in `rowData(simsce)` contains feature module labels. Here is the number of features in each feature module: ```{r module_numbers} table(rowData(simsce)$celda_feature_module) ``` # Feature selection A simple heuristic feature selection is performed to reduce the size of features used for clustering. To speed up the process, only features with at least 3 counts in at least 3 cells are included in downstream clustering for this data. A subset `SingleCellExperiment` object with filtered features is stored in `altExp(simsce, "featureSubset")` slot by default. ```{r, warning = FALSE, message = FALSE} simsce <- selectFeatures(simsce) ``` If the number of features is still too large, then a smaller subset of features can be obtained by selecting the top number of most variable genes. For an example code, see the PBMC3K tutorial in the online celda [documentation](https://www.camplab.net/celda). # Performing bi-clustering with celda There are currently three models within *celda* package: `celda_C` will cluster cells, `celda_G` will cluster features, and `celda_CG` will simultaneously cluster cells and features. Within the functions the `K` parameter will be the number of cell populations to be estimated, while the `L` parameter will be the number of feature modules to be estimated in the output model. ```{r celda_cg, warning = FALSE, message = FALSE} sce <- celda_CG(x = simsce, K = 5, L = 10, verbose = FALSE, nchains = 1) ``` Here is a comparison between the true cluster labels and the estimated cluster labels. ```{r accuracy} table(celdaClusters(sce), celdaClusters(simsce)) table(celdaModules(sce), celdaModules(simsce)) ``` # Visualization ## Plotting cell populations on 2D-embeddings celda contains its own wrapper function for tSNE and UMAP called `celdaTsne` and `celdaUmap`, respectively. Both of these functions can be used to embed cells into 2-dimensions. The output can be used in the downstream plotting functions `plotDimReduceCluster`, `plotDimReduceModule`, and `plotDimReduceFeature` to show cell population clusters, module probabilities, and expression of individual features, respectively. ```{r umap} sce <- celdaUmap(sce) ``` ```{r plot_umap, eval = TRUE, fig.width = 7, fig.height = 7} plotDimReduceCluster(x = sce, reducedDimName = "celda_UMAP") plotDimReduceModule(x = sce, reducedDimName = "celda_UMAP", rescale = TRUE) plotDimReduceFeature(x = sce, reducedDimName = "celda_UMAP", normalize = TRUE, features = "Gene_1") ``` ## Creating an expression heatmap The clustering results can be viewed with a heatmap of the normalized counts using the function `celdaHeatmap`. The top `nfeatures` in each module will be selected according to the factorized module probability matrix. ```{r celda_heatmap, eval = TRUE, fig.width = 7, fig.height = 7} plot(celdaHeatmap(sce = sce, nfeatures = 10)) ``` ## Displaying relationships between modules and cell populations The relationships between feature modules and cell populations can be visualized with `celdaProbabilityMap`. The absolute probabilities of each feature module in each cellular subpopulation is shown on the left. The normalized and z-scored expression of each module in each cell population is shown on the right. ```{r propmap, eval = TRUE, fig.width = 7, fig.height = 7} celdaProbabilityMap(sce) ``` ## Examining co-expression with module heatmaps `moduleHeatmap` creates a heatmap using only the features from a specific feature module. Cells are ordered from those with the lowest probability of the module to the highest. If more than one module is used, then cells will be ordered by the probabilities of the first module. ```{r module_heatmap, eval = TRUE, fig.width = 7, fig.height = 7} moduleHeatmap(sce, featureModule = c(1,2), topCells = 100) ``` # Identifying reasonable numbers of feature modules and cell subpopulations In the previous example, the best `K` (the number of cell clusters) and `L` (the number of feature modules) was already known. However, the optimal `K` and `L` for each new dataset will likely not be known beforehand and multiple choices of `K` and `L` may need to be tried and compared. celda offers two sets of functions to determine the optimum `K` and `L`, `recursiveSplitModule`/`recursiveSplitCell`, and `celdaGridSearch`. ## Using recursive splitting Functions `recursiveSplitModule` and `recursiveSplitCell` offer a fast method to generate a celda model with optimum `K` and `L`. First, `recursiveSplitModule` is used to determine the optimal `L`. `recursiveSplitModule` first splits features into however many modules are specified in `initialL`. The module labels are then recursively split in a way that would generate the highest log-likelihood, all the way up to `maxL`. ```{r, message = FALSE} moduleSplit <- recursiveSplitModule(simsce, initialL = 2, maxL = 15) ``` Perplexity is a statistical measure of how well a probability model can predict new data. Lower perplexity indicates a better model. The perplexity of each model can be visualized with `plotGridSearchPerplexity`. In general, visual inspection of the plot can be used to select the optimal number of modules (`L`) or cell populations (`K`) by identifying the "elbow" - where the rate of decrease in the perplexity starts to drop off. ```{r} plotGridSearchPerplexity(moduleSplit) ``` In this example, the perplexity for `L` stops decreasing at L = 10, thus L = 10 would be a good choice. Sometimes the perplexity alone does not show a clear elbow or "leveling off". However, the rate of perplexity change (RPC) can be more informative to determine when adding new modules does not add much additional information [Zhao et al., 2015](https://doi.org/10.1186/1471-2105-16-S13-S8){target="_blank"}). An RPC closer to zero indicates that the addition of new modules or cell clusters is not substantially decreasing the perplexity. The RPC of models can be visualized using function `plotRPC`: ```{r module_split_rpc, message = FALSE, warning = FALSE} plotRPC(moduleSplit) ``` Once you have identified the optimal `L` (in this case, L is selected to be 10), the module labels are used for initialization in `recursiveSplitCell`. Similarly to `recursiveSplitModule`, cells are initially split into a small number of subpopulations, and the subpopulations are recursively split up. ```{r module_split_select, message = FALSE} moduleSplitSelect <- subsetCeldaList(moduleSplit, params = list(L = 10)) cellSplit <- recursiveSplitCell(moduleSplitSelect, initialK = 3, maxK = 12, yInit = celdaModules(moduleSplitSelect)) ``` ```{r rpc_cell} plotGridSearchPerplexity(cellSplit) plotRPC(cellSplit) ``` In this plot, the perplexity for K stops decreasing at K = 5, with a final K/L combination of K = 5, L = 10. Generally, this method can be used to pick a reasonable `L` and a potential range of `K`. However, manual review of specific selections of `K` is often required to ensure results are biologically coherent. Once users have chosen the K/L parameters for further analysis, the `subsetCeldaList` function can be used to subset the celda list *SCE* object to a single model *SCE* object. ```{r subset_celda, eval = TRUE} sce <- subsetCeldaList(cellSplit, params = list(K = 5, L = 10)) ``` ## Using a grid search Alternativley to recursive splitting, celda is able to run multiple combinations of K and L with multiple chains in parallel via the `celdaGridSearch` function. ```{r grid_search, eval = TRUE, message = FALSE} cgs <- celdaGridSearch(simsce, paramsTest = list(K = seq(4, 6), L = seq(9, 11)), cores = 1, model = "celda_CG", nchains = 2, maxIter = 100, verbose = FALSE, bestOnly = TRUE) ``` Setting `verbose` to `TRUE` will print the output of each model to a text file. These results can be visualized with `plotGridSearchPerplexity`. The major goal is to pick the lowest `K` and `L` combination with relatively good perplexity. In general, visual inspection of the plot can be used to select the number of modules (`L`) or cell populations (`K`) where the rate of decrease in the perplexity starts to drop off. `bestOnly = TRUE` indicates that only the chain with the best log likelihood will be returned for each K/L combination. ```{r plot_grid_search, eval = TRUE, fig.width = 8, fig.height = 8, warning = FALSE, message = FALSE} plotGridSearchPerplexity(cgs) ``` In this example, the perplexity for `L` stops decreasing at L = 10 for the majority of `K` values. For the line corresponding to L = 10, the perplexity stops decreasing at K = 5. Thus L = 10 and K = 5 would be a good choice. Again, manual review of specific selections of K is often be required to ensure results are biologically coherent. Once users have chosen the K/L parameters for further analysis, the `subsetCeldaList` function can be used to subset the celda list *SCE* object to a single model *SCE* object. ```{r subset_grid_search, eval = TRUE} sce <- subsetCeldaList(cgs, params = list(K = 5, L = 10)) ``` If the "bestOnly" parameter is set to FALSE in the `celdaGridSearch`, then the `selectBestModel` function can be used to select the chains with the lowest log likelihoods within each combination of parameters. Alternatively, users can select a specific chain by specifying the index within the `subsetCeldaList` function. ```{r best_only_cgs, eval = FALSE, message=FALSE} cgs <- celdaGridSearch(simsce, paramsTest = list(K = seq(4, 6), L = seq(9, 11)), cores = 1, model = "celda_CG", nchains = 2, maxIter = 100, verbose = FALSE, bestOnly = FALSE) cgs <- resamplePerplexity(cgs, celdaList = cgs, resample = 2) cgsK5L10 <- subsetCeldaList(cgs, params = list(K = 5, L = 10)) sce <- selectBestModel(cgsK5L10) ``` # Miscellaneous utility functions celda also contains several utility functions for the users' convenience. ## Finding the modules for feature with featureModuleLookup `featureModuleLookup` can be used to look up the module a specific feature was clustered to. ```{r module_lookup} featureModuleLookup(sce, feature = c("Gene_99")) ``` ## Reordering cluster labels with recodeClusterZ, recodeClusterY `recodeClusterZ` and `recodeClusterY` allows the user to recode the cell and feature cluster labels, respectively. ```{r recode_clusters} sceZRecoded <- recodeClusterZ(sce, from = c(1, 2, 3, 4, 5), to = c(2, 1, 3, 4, 5)) ``` The model prior to reordering cell labels compared to after reordering cell labels: ```{r recode_clusters_show} table(celdaClusters(sce), celdaClusters(sceZRecoded)) ``` # Session Information ```{r session_info} sessionInfo() ```