--- title: "The scRNA PipelineDefinition" author: - name: Pierre-Luc Germain affiliation: - &uzh DMLS, University of Zürich - D-HEST Institute for Neuroscience, ETH Zürich - name: Anthony Sonrel affiliation: *uzh - name: Mark D. Robinson affiliation: *uzh package: pipeComp output: BiocStyle::html_document abstract: | A description of the PipelineDefinition for scRNAseq clustering and its evaluation metrics. vignette: | %\VignetteIndexEntry{pipeComp_scRNA} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} library(BiocStyle) ``` # Introduction This vignette is centered around the application of `r Rpackage("pipeComp")` to scRNA-seq clustering pipelines, and assumes a general understanding of `r Rpackage("pipeComp")` (for an overview, see the [pipeComp vignette](pipeComp.html)). The scRNAseq `PipelineDefinition` comes in two variants determined by the object used as a backbone, either `r Biocpkg("SingleCellExperiment")` (SCE) or `r Githubpkg("satijalab.org/seurat")` (see `?scrna_pipeline`). Both use the same evaluation metrics, and most method wrappers included in the package have been made so that they are compatible with both. For simplicity, we will therefore stick to just one variant here, and will focus on few very basic comparisons to illustrate the main functionalities, metrics and evaluation plots. For more detailed benchmarks, refer to our preprint: _pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools_
Pierre-Luc Germain, Anthony Sonrel & Mark D Robinson, bioRxiv [2020.02.02.930578](https://doi.org/10.1101/2020.02.02.930578)

## The PipelineDefinition The `PipelineDefinition` can be obtained with the following function: ```{r} library(pipeComp) # we use the variant of the pipeline used in the paper pipDef <- scrna_pipeline(pipeClass = "seurat") pipDef ``` ## Example run To illustrate the use of the pipeline, we will run a basic comparison using wrappers that are included in the package. However, in order for `r Rpackage("pipeComp")` not to systematically require the installation of all dependencies related to all methods for which there are wrappers, they were not included in the package code but rather as source files, which can be loaded in the following way: ```{r} source(system.file("extdata", "scrna_alternatives.R", package="pipeComp")) ``` (To know which packages are required by the set of wrappers you intend to use, see `?checkPipelinePackages`) Any function that has been loaded in the environment can then be used as alternative. We define a small set of alternatives to test: ```{r} alternatives <- list( doubletmethod=c("none"), filt=c("filt.lenient", "filt.stringent"), norm=c("norm.seurat", "norm.sctransform", "norm.scran"), sel=c("sel.vst"), selnb=2000, dr=c("seurat.pca"), clustmethod=c("clust.seurat"), dims=c(10, 15, 20, 30), resolution=c(0.01, 0.1, 0.2, 0.3, 0.5, 0.8, 1, 1.2, 2) ) ``` We also assume three datasets in \Rclass{SingleCellExperiment} (SCE) format (not included in the package - see the last section of this vignette) and run the pipeline: ```{r, eval=FALSE} # available on https://doi.org/10.6084/m9.figshare.11787210.v1 datasets <- c( mixology10x5cl="/path/to/mixology10x5cl.SCE.rds", simMix2="/path/to/simMix2.SCE.rds", Zhengmix8eq="/path/to/Zhengmix8eq.SCE.rds" ) # not run res <- runPipeline( datasets, alternatives, pipDef, nthreads=3) ``` Instead of running the analyses here, we will load the final example results: ```{r} data("exampleResults", package = "pipeComp") res <- exampleResults ```

# Exploring the metrics Benchmark metrics are organized according to the step at which they are computed, and will be presented here in this fashion. This does not mean that they are relevant only for that step: alternative parameters at a given step can also be evaluated with respect to the metrics defined in all downstream steps. ## Doublet detection and cell filtering The evaluation performed after the first two steps (doublet detection and filtering) is the same: ```{r} head(res$evaluation$filtering) ``` For each method and subpopulation, we report: * `N.before` the number of cells before the step * `N.lost` the number of cells excluded by the step * `pc.lost` the percentage of cells lost (relative to the supopulation) As noted in [our manuscript](https://doi.org/10.1101/2020.02.02.930578), stringent filtering can lead to strong bias against certain supopulations. We therefore especially monitor the max `pc.lost` of different methods in relation to the impact on clustering accuracy (privileging, at this step, metrics that are not dependent on the relative abundances of the subpopulations, such as the mean F1 score per subpopulation). This can conveniently be done using the following function: ```{r, fig.width=6, fig.height=2.5} scrna_evalPlot_filtering(res) ``` ## Evaluation based on the reduced space Evaluations based on the reduced space are much more varied: ```{r} names(res$evaluation$dimreduction) ``` ### Subpopulation silhouette The `silhouette` slot contains information about the silhouettes width of true subpopulations. Depending on the methods used for dimensionality (i.e. fixed vs estimated number of dimensions), there will be a single output or outputs for different sets of dimensions, as it is the case in our example: ```{r} names(res$evaluation$dimreduction$silhouette) ``` For each of them we have a data.frame including, for each subpopulation in each analysis (i.e. combination of parameters), the minimum, maximum, median and mean silhouette width: ```{r} head(res$evaluation$dimreduction$silhouette$top_10_dims) ``` This information can be plotted using the function `scrna_evalPlot_silh`; the function outputs a `r CRANpkg("ComplexHeatmap")`, which means that many arguments of that package and options can be used, for instance: ```{r, fig.width=8.5, fig.height=3} library(ComplexHeatmap) scrna_evalPlot_silh( res ) h <- scrna_evalPlot_silh( res, heatmap_legend_param=list(direction="horizontal") ) draw(h, heatmap_legend_side="bottom", annotation_legend_side = "bottom", merge_legend=TRUE) ``` See `?scrna_evalPlot_silh` for more options. ### Variance in the PCs explained by the subpopulations The slot `varExpl.subpops` indicates, for each analysis, the proportion of variance of each principal component explained by the true supopulations. ```{r} res$evaluation$dimreduction$varExpl.subpops[1:5,1:15] ``` ### Correlation with covariates The following slots in `res$evaluation$dimreduction` track the correlation between principal components (PCs) and predefined cell-level covariates such as library size and number of detected genes: * `corr.covariate` contains the pearson correlation between the covariates and each PC; however, since there are major differences in library sizes between subpopulations, we advise against using this directly. * `meanAbsCorr.covariate2` circumvents this bias by computing the mean absolute correlation (among the first 5 components) for each subpopulation, and averaging them. * `PC1.covar.adjR2` gives the difference in adjusted _R^2_ between a model fit on PC1 containing the covariate along with subpopulations (PC1~subpopulation+covariate) and one without the covariate (PC1~subpopulation). These metrics, as well as the following ones, can be plotted using generic `pipeComp` plotting functions, for example: ```{r, fig.width=8, fig.height=3} evalHeatmap(res, step="dimreduction", what="log10_total_features", what2="meanAbsCorr.covariate2") ``` Since the output of these plotting functions are of class `r CRANpkg("ComplexHeatmap")`, they can be combined: ```{r, eval=FALSE} evalHeatmap(res, step="dimreduction", what="log10_total_features", what2="meanAbsCorr.covariate2") + evalHeatmap(res, step="dimreduction", what="log10_total_counts", what2="meanAbsCorr.covariate2") ``` Alternatively, when the other arguments are shared, the following construction can also be used: ```{r, fig.width=8, fig.height=3} evalHeatmap( res, step="dimreduction", what2="meanAbsCorr.covariate2", what=c("log10_total_features","log10_total_counts"), row_title="mean(abs(correlation))\nwith covariate" ) ``` We see here for instance that `r Githubpkg("ChristophH/sctransform")` successfully reduces the correlation with covariates, and that `r Biocpkg("scran")` is somewhat in the middle. ## Clustering ### Metrics We compute several metrics comparing the clustering to the true cell labels: ```{r} colnames(res$evaluation$clustering) ``` The first columns represent the parameters, while the others are evaluation metrics: * `n_clus`: the number of clusters produced by the method * `mean_pr`, `mean_re`, and `mean_F1`: respectively the mean precision, recall and F1 score (harmonic mean of precision and recall) per (true) subpopulation, using the Hungarian algorithm for label matching (see `?match_evaluate_multiple`). * `min_pr`, `min_re` and `min_F1`: the minimum precision/recall/F1 per (true) subpopulation * `RI` and `ARI`: the Rand index and adjusted Rand index. * `MI` and `AMI`: the mutual information and adjusted mutual information, respectively. * `ID`, `NID`, `VI`, `NVI`: the information difference, variation of information, and their normalized counterparts; these decrease with increasing clustering accuracy. See the `r CRANpkg("aricode")` package for more information. There is a high redundancy between some of these metrics, and their relationship across a vast number of scRNAseq clusterings is represented here (see [our preprint](https://doi.org/10.1101/2020.02.02.930578) for more detail): ```{r} data("clustMetricsCorr", package="pipeComp") ComplexHeatmap::Heatmap(clustMetricsCorr$pearson, name = "Pearson\ncorr") ``` We also included, here, the deviation (`nbClust.diff`) and absolute deviation (`nbClust.absDiff`) from the true number of clusters. This shows that, for instance, most metrics (including the commonly-used ARI) are highly correlated (or anti-correlated) with the absolute deviation from the true number of clusters (`nbClust.absDiff`), making the number of clusters called the primary determinant of the score. Instead, mutual information (MI) is considerably less sensitive to this, but does tend to increase when the number of clusters is increased (positive correlation with `nbClust.diff`). We therefore recommend using a combination of MI, ARI, and ARI at the right number of clusters. ### Plotting Plotting all combinations is undesirable with the parameters such as resolution, which can take very many values. We therefore need to aggregate by parameters of interest: ```{r, fig.height=3, fig.width=7} evalHeatmap(res, step="clustering", what=c("MI","ARI"), agg.by=c("filt","norm")) ``` Steps for which there was a single alternative (after aggregation) are not included in the labels. We could investigate the joint impact of the normalization method and of the number of dimensions included using: ```{r, fig.width=7, fig.height=4} evalHeatmap(res, step = "clustering", what=c("MI","ARI"), agg.by=c("norm", "dims"), row_split=norm) ``` Here, we've used the `row_split` argument to improve the clarity of the figure. The argument can accept either the name of a parameter, or any expression using them (e.g. `row_split=norm!="norm.scran"`). We can also filter the analyses before aggregation. For instance, if we wish to plot the ARI only at the true number of clusters, we can filter to those analyses where the detected number of clusters (`n_clus`) is equal to the true one (`true.nbClusts`): ```{r, fig.width=8.5, fig.height=4} evalHeatmap(res, step = "clustering", what=c("MI","ARI"), agg.by=c("filt","norm")) + evalHeatmap(res, step = "clustering", what="ARI", agg.by=c("filt", "norm"), filter=n_clus==true.nbClusts, title="ARI at\ntrue k") ``` Finally, a pipeline-specific plotting function enables overview heatmaps across steps: ```{r, fig.width=8, fig.height=5} h <- scrna_evalPlot_overall(res) draw(h, heatmap_legend_side="bottom") ``` ## Computing time There is nothing specific to the scRNAseq pipeline about computing times, but the default `pipeComp` functionalities are available: the timings are accessible in `res$elapsed`, and can be plotted either manually or using: ```{r, fig.width=7, fig.height=3} plotElapsed(res, agg.by="norm") ```

# Extension and reuse The scRNAseq `PipelineDefinition` can be modified or extented with new steps or arguments like any other objects of that class (see the [pipeComp vignette](pipeComp.html)). For instance, in the paper we included tests on an additional step that filtered out classes of genes, which we implemented in the following way: ```{r} pipDef <- addPipelineStep(pipDef, "featureExcl", after="filtering") # once the step has been added, we can set its function: stepFn(pipDef, "featureExcl", type="function") <- function(x, classes){ if(classes!="all"){ classes <- strsplit(classes, ",")[[1]] x <- x[subsetFeatureByType(row.names(x), classes=classes),] } x } pipDef ``` Then we can simply add alternatives for this new parameter: ```{r} alternatives$classes <- c("all","Mt","ribo") # runPipeline... ``` In addition, the evaluation functions used at each step can be accessed from the package's namespace and use for other purposes. See in particular `?evaluateDimRed` and `?evaluateClustering`. If you feel like other metrics should be included, please contact us!

# Datasets ## scRNAseq benchmark datasets used in the paper The scRNAseq datasets used in the paper can be downloaded from [figshare](https://doi.org/10.6084/m9.figshare.11787210.v1), for instance in the following way: ```{r, eval=FALSE} download.file("https://ndownloader.figshare.com/articles/11787210/versions/1", "datasets.zip") unzip("datasets.zip", exdir="datasets") datasets <- list.files("datasets", pattern="SCE\\.rds", full.names=TRUE) names(datasets) <- sapply(strsplit(basename(datasets),"\\."),FUN=function(x) x[1]) ``` ## Using new datasets In order to use new datasets with this pipeline, you need to have them in `r Biocpkg("SingleCellExperiment")` format, with the true subpopulations stored in the `phenoid` column of `colData`. In addition, if you wish to use some of the wrappers included in the package, some cell- and gene-statistics should be generated using the following function: ```{r, eval=FALSE} source(system.file("extdata", "scrna_alternatives.R", package="pipeComp")) sce <- add_meta(sce) # requires the variancePartition packages installed: sce <- compute_all_gene_info(sce) ```