--- title: "Sorting and annotation of single-cell clustering results with multiclass decision trees" author: "Eric Reed, Ahmed Youseff, Jing Zhang, Yusuke Koga, Zhe Wang, Evan Johnson, Paola Sebastiani, Masanao Yajima, Joshua Campbell" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{Sorting and annotation of single-cell clustering results with multiclass decision trees} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction **CE**llular **L**atent **D**irichlet **A**llocation (celda), is among multiple approaches to cluster cells from single-cell RNA-seq data into discrete sub-populations. In most cases, annotating a concise set of important features for distinguishing these sub-populations is not a trivial task. In this vignette we will demonstrate the implementation of a multiclass decision tree approach to simultaneously sort and annotate cell cluster label estimations by generating a sequence of univariate rules for each cluster. The procedure has two main deviations from simple multiclass decision tree procedures. First, at each split cells from the same cluster are never separated during tree building. Instead cells from the same population are moved to one-side of a particular split based on majority voting. Second, each cluster split can be determined by one of two heuristics, as follows... 1. Find univariate rules that accurately distinguish individual cell clusters, referred to as a one-off split. 2. If no sufficient one-off split exists, the tree finds thes best more ballanced split that segregates the clusters into two sets of comparatively similar sub-populations. # Installation celda can be installed from Bioconductor: ```{r, eval= FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)){ install.packages("BiocManager")} BiocManager::install("celda") ``` The package can be loaded using the `library` command. ```{r, eval = TRUE, message = FALSE, warning = FALSE} library(celda) ``` Complete list of help files are accessible using the help command with a `package` option. ```{r, eval = FALSE} help(package = celda) ``` To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/.
# Read in data and run `celda_CG()` First, we will read in a single-cell data set and use `celda_CG()` to cluster the genes and samples. The [M3DExampleData](https://www.bioconductor.org/packages/release/data/experiment/html/M3DExampleData.html) is a subset of single-cell RNA-Seq expression data of r mouse 2-cell embryos and blastocysts, originally generated by *Deng et al. (2014)*. In this example we will use 500 genes to generate 10 sample and 10 gene clusters using `celda_CG()`. Note, that these parameters are not based on knowledge of these data, but rather just for running a speedy example. ```{r, eval = TRUE, message = FALSE} library(M3DExampleData) counts <- M3DExampleData::Mmus_example_list$data # Subset 500 genes for fast clustering counts <- as.matrix(counts[1501:2000, ]) # Cluster genes ans samples each into 10 modules cm <- celda_CG(counts = counts, L = 10, K = 10, verbose = FALSE) ``` # Format celda output for decision tree generation The decision trees are generated and evaluated by `findMarkers`. `findMarkers` requires two arguments: **features** and **class**. **features** is a numeric matrix with a row for every variable (or feature) to be used in sorting and a column for every sample. In this example we will use a factorized matrix of the gene clusters as our feature matrix (Check `?factorizedMatrix` for more details). **class** is a vector with a label assignment for every sample. Note, that neither **features** nor **class** may have any missing values. ```{r, eval = TRUE, message = FALSE} # Get features matrix and cluster assignments factorized <- factorizeMatrix(counts, cm) features <- factorized$proportions$cell class <- clusters(cm)$z ``` # Generate celda decision tree `findMarkers` allows for different parameter options. The optimal combination of these parameters is generally analysis specific and may require some trial and error. Parameters include: - **oneoffMetric** A selection for one of two possible metrics for evaluating good one-off splits for segregating a single cluster from every other cluster based on up-regulation of that cluster alone. The performance of either is a value between 0 and 1. The two options include: - *modified F1* - Finds the univariate split that maximizes the harmonic mean of the sensitivity and precision of the up-regulated cluster, as well as the minimum cluster-specific sensitivity across all down-regulated clusters. - *pairwise AUC* - Finds the univariate split for the minimum pairwise AUC between the up-regulated cluster and every down-regulated cluster, individually. - **threshold** The threshold for determining a good one-off split. Generally *modified F1* is more conservative, so the using the same threshold for either one-off metric will result in equal or fewer one-off splits for *modified F1* compared to *pairwise AUC*. - **reuseFeatures** A logical value. Whether a feature can be used as a rule for the same cluster more than once. Note, that even when `reuseFeatures = FALSE` a feature is still allowed to be used more than once in the decision tree. This happens when the same feature is used for different sets of cluster rules after those sets of clusters have diverged. - **reuseFeatures** A logical value. Often times one cluster will never include a positive marker in its set of rules, i.e. it is only sorted by have lower values for each feature in the tree. This option forces a positive marker at its terminal split. - **consecutiveOneoff** A logical value. Whether one-off splits can be used at two consecutive levels in the tree. Note, that even when `consecutiveOneoff = FALSE`, more then one good one-off splits will be assessed at the same level and used if determined to be a good split. ```{r, eval = TRUE, message = FALSE} DecTree <- findMarkers(features, class, oneoffMetric = "modified F1", threshold = 0.95, reuseFeatures = FALSE, altSplit = TRUE, consecutiveOneoff = FALSE) ``` ## `findMarkers` output The `findMarkers` output is a named list of four elements - **rules** A named list with one `data.frame` for every label. Each `data.frame` has five columns and gives the set of rules for distinguishing each label. - *feature* - Feature identifier. - *direction* - Relationship to feature value, -1 if less than, 1 if greater than. - *value* - The feature value which defines the decision boundary - *stat* - The performance value returned by the splitting metric for this split. - *statUsed* - Which performance metric was used. "IG" if information gain and "OO" if one-off. - *level* - The level of the tree at which is rule was defined. 1 is the level of the first split of the tree. - **dendro** A dendrogram object of the decision tree output - **performance** A named list denoting the training performance of the - *accuracy* - (number correct/number of samples) for the whole set of samples. - *balAcc* - Mean sensitivity across all labels. - *meanPrecision* - Mean precision across all labels. - *correct* - Number of correct predictions of each label. - *sizes* Number of actual counts of each label. - *sensitivity* - Sensitivity of the prediction of each label. - *precision* - Precision of the prediction of each label. # Plot decision tree `plotDendro` generates a dendrogram as a ggplot object. The **addSensPrec** argument determines whether to print cluster-specific sensitivities and precisions under the leaf labels for each cluster. **leafSize**, **boxSize**, **boxColor** are all stylistic choices to make the plot labels legible. ```{r, eval = TRUE, message = FALSE} plotDendro(DecTree, classLabel = NULL, addSensPrec = TRUE, leafSize = 24, boxSize = 7, boxColor = "black") ``` In the plot, the feature(s) used for splits determined by the one-off metric are printed above the cluster labels for which they are markers. Alternatively, the features used for the balanced splits are centered above the two sets of clusters defined by that split. The up-regulated set of clusters for a balanced split are one the right side of that split. The size, sensitivitie and precision of each class are printed below the leaf labels, respectively. # Plot decision tree highlight rules for specific cluster In `plotDendro` the **classLabel** argument may used to only print the sequence of relevant rules for a given split. ```{r, eval = TRUE, message = FALSE} plotDendro(DecTree, classLabel = "1", addSensPrec = TRUE, leafSize = 15, boxSize = 7, boxColor = "black") ``` # Get label estimates from features matrix `findMarkers` performs label estimation of each sample in the training set automatically. You can use the set of rules generated by `findMarkers` to predict the labels on an independent feature matrix using `getDecisions`. ```{r, eval = TRUE, message = FALSE} head(getDecisions(DecTree$rules, features)) ``` # Create decision tree with meta clusters If you have a priori understanding of sub-groups of your cluster labels, you can ensure that these sub-groups are not separated up-stream in the tree by using the optional *cellTypes* argument. This is just a named list of labels in your *class* vector. For example, if we knew that clusters, 4, 5, and 1 were of the same subtype of clusters, we could do the following... ```{r, eval = TRUE, message = FALSE} # Run with a hierarchichal split cellTypes <- list(metaLabel = c("4", "5", "1")) DecTreeMeta <- findMarkers(features, class, cellTypes, oneoffMetric = "modified F1", threshold = 1, reuseFeatures = F, consecutiveOneoff = FALSE) plotDendro(DecTreeMeta) ``` # Session Information ```{r} sessionInfo() ```