--- title: "scDblFinder" author: - name: Pierre-Luc Germain affiliation: University and ETH Zürich package: scDblFinder output: BiocStyle::html_document abstract: | An introduction to the scDblFinder method for fast and comprehensive doublet identification in single-cell data. vignette: | %\VignetteIndexEntry{2_scDblFinder} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} library(BiocStyle) ``` # scDblFinder `scDblFinder` identifies doublets in single-cell RNAseq by creating artificial doublets and looking at their prevalence (as well as that of any known doublets) in the neighborhood of each cell, along with a few other covariates. The rough logic is very similar to other methods (e.g. `r Githubpkg("chris-mcginnis-ucsf/DoubletFinder")`, `r Biocpkg("scds")`), with a few twists that make it more efficient and provide extra features. ## Installation ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("scDblFinder") # or, to get that latest developments: BiocManager::install("plger/scDblFinder") ``` ## Usage The input of `scDblFinder` is an object `sce` of class `r Biocpkg("SingleCellExperiment")` (empty drops having already been removed) containing at least the counts (assay 'counts'). If normalized expression (assay 'logcounts') and PCA (reducedDim 'PCA') are also present, these will be used for the clustering step (unless clusters are given.) Given an SCE object: ```{r, warning=FALSE} set.seed(123) library(scDblFinder) # we create a dummy dataset; since it's small we set a higher doublet rate sce <- mockDoubletSCE(dbl.rate=0.1) # we run scDblFinder (providing the unsually high doublet rate) sce <- scDblFinder(sce, dbr=0.1) ``` For 10x data, it is usually safe to leave the `dbr` empty, and it will be automatically estimated. `scDblFinder` will add a number of columns to the colData of `sce` prefixed with 'scDblFinder', the most important of which are: * `sce$scDblFinder.score` : the final doublet score * `sce$scDblFinder.class` : the classification (doublet or singlet) ```{r} table(truth=sce$type, call=sce$scDblFinder.class) ``` ### Multiple samples If you have multiple samples (understood as different cell captures), then it is preferable to look for doublets separately for each sample (for multiplexed samples with cell hashes, this means for each batch). You can do this by simply providing a vector of the sample ids to the `samples` parameter of scDblFinder or, if these are stored in a column of `colData`, the name of the column. In this case, you might also consider multithreading it using the `BPPARAM` parameter (assuming you've got enough RAM!). For example: ```{r, eval=FALSE} library(BiocParallel) sce <- scDblFinder(sce, samples="sample_id", BPPARAM=MulticoreParam(3)) table(sce$scDblFinder.class) ```

## Description of the method Wrapped in the `scDblFinder` function are the following steps: ### Splitting captures Doublets can only arise within a given sample or capture, and for this reason are better sought independently for each sample, which also speeds up the analysis. If the `samples` argument is given, `scDblFinder` will use it to split the cells into samples/captures, and process each of them in parallel if the `BPPARAM` argument is given. The classifier will be trained globally, but thresholds will be optimized on a per-sample basis. If your samples are multiplexed, i.e. the different samples are mixed in different batches, then the batches should be what you provide to this argument. ### Reducing and clustering the data The analysis can be considerably sped up, at little if any cost in accuracy, by reducing the dataset to only the top expressed genes (controlled by the `nfeatures` argument). Then, if the `clusters` argument isn't provided, the cells will be clustered using a fast clustering procedure on the PCA space (taken from the 'PCA' dimRed if present, otherwise generated internally). The aim, here, is to favour over-clustering so as to avoid collapsing pairs of celltypes whose doublets could be distinguishable. The internal clustering procedure can be accessed via the `fastcluster` function, and the cluster assignments are included in the output. If the dataset is not expected to contain distinct subpopulations but rather continuous gradients, e.g. trajectories, then it might be advisable to employ a different approach and setting `clusters` to a positive integer (see below). ### Generating artificial doublets In addition to a proportion of artificial doublets generated by combining random cells, a large fraction of the doublets are generated on all pairs of non-identical clusters (this can be performed manually using the `getArtificialDoublets` function). The rationale is that homotypic doublets are nearly impossible to distinguish without cellular barcodes, and therefore that creating that kind of doublets is a waste of computational resources. If the dataset is not expected to contain distinct subpopulations, but rather trajectories, a different approach can be used. A first option is to generate artificial doublets randomly, which can be done using the `propRandom=1` argument. A better approach is to use `trajectoryMode=TRUE`, and (unless you already have your clusters) setting the `clusters` argument of `scDblFinder` to a positive integer (depending on the number of cells and complexity, e.g. `k=12` for smaller datasets), which will split the cells using k-means clustering (where `k=clusters`). The average cluster profiles are then related, so that fewer doublets are created between very nearby points in the trajectory (where they would be wasted, such doublets being nearly undistinguishable from real cells). ### Examining the k-nearest neighbors (KNN) of each cell A new PCA is performed on the combination of real and artificial cells, from which a KNN network is generated. Using this KNN, a number of parameters are gathered for each cell, such as the proportion of doublets (i.e. artificial doublets or known doublets provided through the `knownDoublets` argument, if given - these could be for instance inter-sample doublets flagged by SNPs) among the KNN, ratio of the distances to the nearest doublet and nearest non-doublet, etc. Several of this features are reported in the output with the 'scDblFinder.' prefix, e.g.: * `distanceToNearest` : distance to the nearest cell (real or artificial) * `nearestClass` : whether the nearest cell is a doublet or singlet * `ratio` : the proportion of the KNN that are doublets. (If more than one value of `k` is given, the various ratios will be used during classification and will be reported) * `weighted` : the proportion of the KNN that are doublets, weighted by their distance (useful for isolated cells) ### Training a classifier Unless the `score` argument is set to 'weighted' or 'ratio' (in which case the aforementioned ratio is directly used as a doublet score), `scDblFinder` then uses gradient boosted trees trained on the KNN-derived properties along with a few additional features (e.g. library size, number of non-zero features, and an estimate of the difficultly of detecting artificial doublets in the cell's neighborhood, etc.) to distinguish doublets (either artificial or given) from other cells, and assigns a score on this basis. If the `use.cxds=TRUE`, the `cxds` score from the `r Biocpkg("scds","scds")` package will also be included among the predictors. One problem of using a classifier for this task is that some of the real cells (the actual doublets) are mislabeled as singlet, so to speak. `scDblFinder` therefore iteratively retrains the classifier, each time excluding from the training the (real) cells called as doublets in the previous step (the number of steps being controlled by the `iter` parameter). This score is available in the output as either `scDblFinder.score` or `scDblFinder.score.global` (when local calibration is used - see below). If the data is multi-sample, a single model is trained for all samples. ### Thresholding and local calibration Rather than thresholding on some arbitrary cutoff of the score, `scDblFinder` uses the expected number of doublets to establish a threshold. Unless it is manually given through the `dbr` argument, the expected doublet rate is first estimated using the empirical rule of thumb applicable to 10X data, namely roughly 1\% per 1000 cells captures (so with 5000 cells, (0.01\*5)\*5000 = 250 doublets, and the more cells you capture the higher the chance of creating a doublet). If samples were specified, and if the `dbr` is automatically calculated, thresholding is performed separately across samples. Thresholding then tries to simultaneously minimize: 1) the classification error (in terms of the proportion of known doublets real cells below the threshold) and 2) the deviation from the expected number of doublets (as a ratio of the total number of expected doublets within the range determined by `dbr.sd`, and adjusted for homotypic doublets), giving an equal weight to each. If `score` is either to 'xgb' (default), 'weighted' or 'ratio', then thresholding is performed directly on the said score. If `score='xgb.local.optim'`, then a local calibration of the score is performed. #### Local calibration (experimental - use with caution!) Doublets that are located in an otherwise sparse region of the space are easier to identify (and hence will have a higher score) that those located in a dense region. In other words, the aforementioned doublet scores are biased against rarer cell states. To correct for this, it is possible to perform threshold optimization separately for different groups of cells, depending on the type of artificial doublets (i.e. their originating clusters) closest to them. In this context, the expected number of doublets is calculated for each combination of clusters. This procedure can be problematic whenever the doublets are not random (e.g. when two cell types preferentially stick together). For this reason, instead of using the local threshold directly, `scDblFinder` moderates it by establishing a relationship between the local threshold and the difficultly in indentifying doublets, and pulling the threshold towards this expectation. The cells' scores are then transformed so that the different local thresholds are equalized, by substracting the logit-threshold to the logit-score and transforming back to a probability. Finally, the global threshold is calculated as described above. When the local calibration is enabled, the calibrated score is available in `scDblFinder.score`, and the original one in `scDblFinder.score.global`. This procedure should be considered experimental and used with care, and only in contexts that demand this (see doublet enrichment, below). In our experience, local calibration often increases FDR in normal circumstances. ### Doublet origins and enrichments Because we generate artificial doublets from clusters, it is most often possible to call the most likely origin (in terms of the combination of clusters) of a given putative real doublet. This information is provided through the `scDblFinder.mostLikelyOrigin` column of the output (and the `scDblFinder.originAmbiguous` column indicates whether this origin is ambiguous or rather clear). This, in turn, allows us to identify enrichment over expectation for specific kinds of doublets. Some statistics on each combination of clusters are saved in `metadata(sce)$scDblFinder.stats`, and the `plotDoubletMap` function can be used to visualize enrichments. Most enrichment or depletion is explained by the difficulty of identifying doublets of certain kinds, but departures from expectation might also point to technical or biological effects.

## Some important parameters `scDblFinder` has a fair number of parameters governing the preprocessing, generation of doublets, classification, etc. (see `?scDblFinder`). Here we describe just a few of the most important ones. ### Expected proportion of doublets The expected proportion of doublets has no impact on the density of artificial doublets in the neighborhood, but impacts the classifier's score and, especially, where the cutoff will be placed. It is specified through the `dbr` parameter and the `dbr.sd` parameter (the latter specifies a +/- range around `dbr` within which the deviation from `dbr` will be considered null). For 10x data, the more cells you capture the higher the chance of creating a doublet, and Chromium documentation indicates a doublet rate of roughly 1\% per 1000 cells captures (so with 5000 cells, (0.01\*5)\*5000 = 250 doublets), and the default expected doublet rate will be set to this value (with a default standard deviation of 0.015). Note however that different protocols may create considerably more doublets, and that this should be updated accordingly. ### Clustering Since doublets are created across clusters, it is important that subpopulations are not misrepresented as belonging to the same cluster. For this reason, we favor over-clustering at this stage, and if you provide your own clusters, the resolution should not be too coarse -- although an overly fine-grained resolution tends to reduce accuracy. `scDblFinder`'s default clustering method is implemented in the `fastcluster` function.

## Combination with other tools If the input SCE already contains a `logcounts` assay or a `reducedDim` slot named 'PCA', scDblFinder will use them for the clustering step. In addition, a clustering can be manually given using the `clusters` argument of `scDblFinder()`. In this way, `r Githubpkg("satijalab.org/seurat")` clustering could for instance be used (in which case we suggest not to use a too low `resolution` parameter) to create the artifical doublets (see `?Seurat::as.SingleCellExperiment.Seurat` for conversion to SCE).