--- title: "Supervised Demultiplexing using Cell Hashing and SNPs" author: - name: Michael Lynch affiliation: University of Limerick - name: Aedin Culhane affiliation: University of Limerick output: BiocStyle::html_document: toc_float: true bibliography: references.bib vignette: | %\VignetteIndexEntry{Supervised Demultiplexing using Cell Hashing and SNPs} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 80 --- ```{r eval=TRUE, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, fig.width = 7.5 ) ``` ```{r message=FALSE} library(demuxSNP) library(ComplexHeatmap) library(viridisLite) library(Seurat) library(ggpubr) library(dittoSeq) library(utils) library(EnsDb.Hsapiens.v86) ``` ```{r} colors <- structure(viridis(n = 3), names = c("-1", "0", "1")) ``` # Introduction Multiplexing in scRNAseq involves the sequencing of samples from different patients, treatment types or physiological locations together, resulting in significant cost savings. The cells must then be demultiplexed, or assigned back to their respective groups. A number of experimental and computational methods have been proposed to facilitate this, but a universally robust algorithm remains elusive. Below, we introduce some existing methods, highlight the novel features of our approach and its advantages to the user. ## Existing Methods ### Cell Hashing Cells from each group are labelled with a distinct tag (HTO or LMO) which is sequenced to give a counts matrix. Due to non-specific binding, these counts form a bimodal distribution. Such methods are generally computationally efficient. Their classification performance, however, is highly dependent on the tagging quality and many methods do not account for uncertainty in classification (@boggy_bff_2022, @kim_citefuse_2020 & @stoeckius_cell_2018). More recent methods, including [demuxmix](https://bioconductor.org/packages/release/bioc/html/demuxmix.html), assign a probability that a cell is from a particular group, or made up of multiple groups (doublet). This allows users to define a cut-off threshold for the assignment confidence. Accounting for uncertainty is an important feature for these types of algorithms. But, while they give the user greater flexibility in determining which cells to keep, this ultimately results in a trade-off between keeping cells which cannot be confidently called or discarding them - due to issues with tagging quality rather than RNA quality. ### SNPs The second class of methods exploits natural genetic variation between cells and so can only be used where the groups are genetically distinct. Demuxlet (@kang_multiplexed_2018) uses genotype information from each group to classify samples. This genotyping incurs additional experimental cost. To address this, Souporcell (@heaton_souporcell_2020) and Vireo (@huang_vireo_2019) among other methods were developed to classify cells based on their SNPs in an unsupervised manner. Without prior knowledge of the SNPs associated with each group, these unsupervised methods may confuse groups with lower cell counts for other signals in the data. Demuxlet remains the standard often used to benchmark other methods but its more widespread adoption has been limited by the requirement of sample genotype information. ## demuxSNP Motivation **With cell hashing, we can confidently demultiplex *some* but not *all* cells. Using these high confidence cells, we can learn the SNPs associated with each group. This SNP information can then be used to assign remaining cells (which we could not confidently call using cell hashing) to their most similar group based on their SNP profile.** Novel features: - Uses both cell hashing and SNP data. Current methods are limited to using one or the other. - Selects SNPs based on being located in a gene expressed in a large proportion of cells to reduce noise, computational cost and increase interpretability. Impact: - Users can visually confirm validity (or lack thereof) of existing demultiplexing results in a tangible manner. - Users can recover otherwise high quality cells which could not be confidently assigned using other methods. - Cells from groups which are present in lower proportions may be classified better than with unsupervised SNP approaches. Note: the approach used here differs from most SNP methods in that it is supervised. We attain knowledge of which SNPs are associated with which patients from the high confidence cells then use this to train a classifier. It is similar to demuxlet in the sense that the classifier uses group specific SNP information, **however** our method does not require the expense of genotyping and so may be much more widely applicable. ## Installation ```{r eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("demuxSNP") ## or for development version: devtools::install_github("michaelplynch/demuxSNP") ``` ## Quick Usage For full list of arguments and explanation of each function, please refer to relevant documentation. ## Workflow ```{r} #Data loading data(multiplexed_scrnaseq_sce, commonvariants_1kgenomes_subset, vartrix_consensus_snps, package = "demuxSNP") small_sce<-multiplexed_scrnaseq_sce[,1:100] ensdb <- EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86 #Preprocessing top_genes<-common_genes(small_sce) vcf_sub<-subset_vcf(commonvariants_1kgenomes_subset, top_genes, ensdb) small_sce<-high_conf_calls(small_sce) #Use subsetted vcf with VarTrix in default 'consensus' mode to generate SNPs #matrix small_sce<-add_snps(small_sce,vartrix_consensus_snps[,1:100]) small_sce<-reassign(small_sce,k=5) table(small_sce$knn) ``` ### Function Explanation `top_genes <- common_genes(sce)`: Returns the genes which are expressed (expression \> 0) in the highest proportion of cells. These genes are used below to subset the .vcf file. `new_vcf <- subset_vcf(vcf = vcf, top_genes = top_genes, ensdb = ensdb)`: Subsets a supplied .vcf to SNP locations within the genes supplied. The ranges of the genes are extracted from the EnsDb object. `sce <- high_conf_calls(sce = sce, assay = "HTO")`: Takes a SingleCellExperiment object with HTO altExp, runs [demuxmix](https://bioconductor.org/packages/release/bioc/html/demuxmix.html) and returns a factor of assigned labels, a logical vector indicating high confidence calls and a logical vector indicating which cells to predict (all). `sce <- add_snps(sce = sce, mat = vartrix_consensus_snps, thresh = 0.8)`: Adds the SNP data from VarTrix (default consensus mode) to the SingleCellExperiment object as an altExp. Additionally, filters out SNPs with no reads in less than 'thresh' proportion of cells. ```{r eval=FALSE} sce <- reassign(sce, k = 10, d = 10, train_cells = sce$train, predict_cells = sce$predict ) ``` Reassigns cells based on SNP profiles of high confidence cells. Singlet training data is based on high confidence singlet assignment. Doublets are simulated by systematically sampling and combining the SNP profiles of n cells pairs from each grouping combination. Cells to be used for training data are specified by "train_cells" (logical). Cells to be used for prediction are specified by "predict_cells" (logical), this may also include the training data. # Exploratory Analysis We load three data objects: a SingleCellExperiment object containing RNA and HTO counts, a .vcf file of class CollapsedVCF containing SNPs from 1000 Genomes common variants and a matrix containing SNP information for each cell (we will show you how to generate this SNPs matrix using [VarTrix](https://github.com/10XGenomics/vartrix) outside of R). We have already removed low quality cells (library size\<1,000 and percentage of genes mapping to mitochondrial genes\>10%). ```{r} class(multiplexed_scrnaseq_sce) class(commonvariants_1kgenomes_subset) class(vartrix_consensus_snps) ``` The HTO or LMO distribution is usually bimodal, with a signal (high counts) and background distribution (low counts) caused by non-specific binding. Ideally, these distributions would be clearly separated with no overlap, but in practice, this is not always the case. In our example data, we see that the signal and noise overlap to varying extents in each group. ```{r echo=FALSE, warning=FALSE, fig.height=5,fig.width=8.5} htos <- as.data.frame(t(as.matrix(logcounts(altExp(multiplexed_scrnaseq_sce, "HTO"))))) x1 <- gghistogram(htos, x = "Hashtag1", fill = dittoColors(1)[1], palette = "lancet", xlim = c(0, 8), ylim = c(0, 750), alpha = 1) x2 <- gghistogram(htos, x = "Hashtag2", fill = dittoColors(1)[2], palette = "lancet", xlim = c(0, 8), ylim = c(0, 750), alpha = 1) x3 <- gghistogram(htos, x = "Hashtag3", fill = dittoColors(1)[3], palette = "lancet", xlim = c(0, 8), ylim = c(0, 750), alpha = 1) x4 <- gghistogram(htos, x = "Hashtag4", fill = dittoColors(1)[4], palette = "lancet", xlim = c(0, 8), ylim = c(0, 750), alpha = 1) x5 <- gghistogram(htos, x = "Hashtag5", fill = dittoColors(1)[5], palette = "lancet", xlim = c(0, 8), ylim = c(0, 750), alpha = 1) x6 <- gghistogram(htos, x = "Hashtag6", fill = dittoColors(1)[6], palette = "lancet", xlim = c(0, 8), ylim = c(0, 750), alpha = 1) plot <- ggarrange(x1, x2, x3, x4, x5, x6, align = "hv", ncol = 3, nrow = 2) annotate_figure(plot, top = text_grob("CLR normalised HTO counts", size = 12)) ``` We will begin by running Seurat's HTODemux, a popular HTO demultiplexing algorithm on the data. ```{r} logcounts(multiplexed_scrnaseq_sce) <- counts(multiplexed_scrnaseq_sce) seurat <- as.Seurat(multiplexed_scrnaseq_sce) seurat <- HTODemux(seurat) seurat$hash.ID <- factor(as.character(seurat$hash.ID)) multiplexed_scrnaseq_sce$seurat <- seurat$hash.ID multiplexed_scrnaseq_sce$seurat <- seurat$hash.ID table(multiplexed_scrnaseq_sce$seurat) ``` Although HTO library size of the Negative group is low, the RNA library size is similar to that of other groups, indicating that they may be misclassified as Negative due to their tagging quality rather than overall RNA quality. ```{r} dittoPlot(seurat, "nCount_HTO", group.by = "hash.ID") dittoPlot(seurat, "nCount_RNA", group.by = "hash.ID") ``` For the remainder of this vignette we will outline our method of checking whether or not these cells have been called correctly and how to assign them to their appropriate group! # Preprocessing Common variants files, for example from the 1000 Genomes Project, can contain over 7 million SNPs. To reduce computational cost and cell-type effects, we subset our SNPs list to those located within genes expressed across most cells in our data. We first find the most commonly expressed genes in our RNA data. ```{r} top_genes <- common_genes(sce = multiplexed_scrnaseq_sce, n = 100) top_genes[1:10] ``` We have a sample .vcf preloaded, but you can load your .vcf file in using 'readVcf()' from [VariantAnnotation](https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html). We will subset our .vcf file to SNPs seen in commonly expressed genes from our dataset. Notice that the genome for the vcf and EnsDb object must be compatible! The returned vcf can be written to file and used with VarTrix later on. ```{r} ensdb <- EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86 genome(commonvariants_1kgenomes_subset)[1] == genome(ensdb)[1] new_vcf <- subset_vcf(commonvariants_1kgenomes_subset, top_genes = top_genes, ensdb) commonvariants_1kgenomes_subset new_vcf ``` The subsetted .vcf can be written to disk using 'writeVcf()', again from [VariantAnnotation](https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html) package. Next, we wish to identify cells which we can confidently call to a particular group. There are a number of ways this can be achieved, including probabilistic modelling of the HTO counts, manually setting non-conservative thresholds or using consensus calls. The user may wish to experiment with different approaches. Here we have used [demuxmix](https://bioconductor.org/packages/release/bioc/html/demuxmix.html), a probabilistic model which we have set with a high acceptance threshold to identify high confidence cell calls to use as training data (cells which we can confidently call as a particular singlet group). ```{r } multiplexed_scrnaseq_sce <- high_conf_calls(multiplexed_scrnaseq_sce) table(multiplexed_scrnaseq_sce$train) table(multiplexed_scrnaseq_sce$predict) table(multiplexed_scrnaseq_sce$labels) ``` So, for this particular dataset, we can confidently call 1,045 cells as being from a particular singlet group. 608 cells cannot be called to a group with high confidence. # Variant Calling (VarTrix) Variant calling is not done within the package. Instead, we refer the reader to [VarTrix](https://github.com/10XGenomics/vartrix), where they can use the subsetted .vcf file along with their .bam, barcodes.tsv and reference genome to call SNPs in each cell. A sample VarTrix command looks like the following: ```{bash eval=FALSE} ./vartrix -v -b -f -c -o ``` Using the output matrix from Vartrix and the high confidence classifications from the HTO algorithm, we can reassign cells using k-nearest neighbours. # Cell Reassignment, Visualisation and Evaluation To keep things tidy, we will add the SNP data to our SingleCellExperiment object as an altExp. We recode the SNP matrix as follows: 0=no read, 1=SNP present, -1=SNP absent. This function also filters out SNPs which are observed at a low frequency in the data, and the frequency threshold can be set manually. ```{r} dim(vartrix_consensus_snps) multiplexed_scrnaseq_sce <- add_snps(multiplexed_scrnaseq_sce, vartrix_consensus_snps, thresh = 0.95) altExp(multiplexed_scrnaseq_sce, "SNP") ``` Before we reassign any cells, we will first use the SNPs data to inspect the results from stand-alone algorithms. Splitting the SNP data by Seurat HTODemux classification, we initially see a large number of 'negative' cells which appear of good quality (high proportion of reads) which may be assignable to another group. This is consistent with the library size plot we visualised earlier. ```{r} hm <- Heatmap(counts(altExp(multiplexed_scrnaseq_sce, "SNP")), column_split = multiplexed_scrnaseq_sce$seurat, cluster_rows = FALSE, show_column_names = FALSE, cluster_column_slices = FALSE, column_title_rot = -45, row_title = "SNPs", show_row_names = FALSE, col = colors ) draw(hm, column_title = "SNP profiles split by Seurat HTODemux call", padding = unit(c(2, 15, 2, 2), "mm") ) ``` We will use our knn method to reassign cells based on their SNP profiles. The training data is the high confidence cells ```{r} set.seed(1) multiplexed_scrnaseq_sce <- reassign(multiplexed_scrnaseq_sce, k = 10, d = 10, train_cells = multiplexed_scrnaseq_sce$train, predict_cells = multiplexed_scrnaseq_sce$predict ) table(multiplexed_scrnaseq_sce$knn) ``` ```{r} hm <- Heatmap(counts(altExp(multiplexed_scrnaseq_sce, "SNP")), column_split = multiplexed_scrnaseq_sce$knn, cluster_rows = FALSE, show_column_names = FALSE, cluster_column_slices = FALSE, column_names_rot = 45, column_title_rot = -45, row_title = "SNPs", show_row_names = FALSE, col = colors ) draw(hm, column_title = "SNP profiles split by updated knn classification", padding = unit(c(2, 15, 2, 2), "mm") ) ``` Focusing in on the knn Hashtag5 group, we see that a lot of the Negative cells have now been correctly reclassed to this group, as well as a small number of cells from other groups. ```{r} hm <- Heatmap(counts(altExp(multiplexed_scrnaseq_sce, "SNP"))[, multiplexed_scrnaseq_sce$knn == "Hashtag5"], column_split = multiplexed_scrnaseq_sce$seurat[multiplexed_scrnaseq_sce$knn == "Hashtag5"], cluster_rows = FALSE, show_column_names = FALSE, cluster_column_slices = FALSE, column_names_rot = 45, column_title_rot = -45, row_title = "SNPs", show_row_names = FALSE, col = colors ) draw(hm, column_title = "knn Hashtag5 group split by Seurat HTODemux classification", padding = unit(c(2, 15, 2, 2), "mm") ) ``` ## Performance Next we will run some basic performance checks. We subset our SingleCellExperiment object to only retain cells which we could confidently call as being from a singlet group, then split this into a training and test dataset. ```{r} sce_test <- multiplexed_scrnaseq_sce[, multiplexed_scrnaseq_sce$train == TRUE] sce_test$knn <- NULL sce_test$labels <- droplevels(sce_test$labels) sce_test sce_test$train2 <- rep(FALSE, length(sce_test$train)) sce_test$train2[seq_len(500)] <- TRUE sce_test$test <- sce_test$train2 == FALSE ``` Comparing the predicted labels in the test dataset with the hidden high confidence labels, we see excellent agreement. ```{r} sce_test <- reassign(sce_test, k = 3, train_cells = sce_test$train2, predict_cells = sce_test$test) table(sce_test$labels[sce_test$test == TRUE], sce_test$knn[sce_test$test == TRUE]) ``` We can also show that that the model can correct misclassified cells when predicted back on the training data. We create a new vector "labels2" which is initially identical to the labels used in training previously. ```{r} sce_test$knn <- NULL sce_test$labels2 <- droplevels(sce_test$labels) table(sce_test$labels, sce_test$labels2) ``` We then randomly reassign 25 cells from Hashtag5 to Hashtag2 in the training data. Predicting the model back on itself we see that in the new predicted labels, the 25 altered have been correctly reclassified back to Hashtag5 ```{r} sce_test$labels2[which(sce_test$labels2 == "Hashtag5")[1:25]] <- "Hashtag2" table(sce_test$labels, sce_test$labels2) sce_test <- reassign(sce_test, train_cells = sce_test$train, predict_cells = sce_test$train ) table(sce_test$labels, sce_test$knn) ``` For the knn Hashtag6 group, we see mostly good agreement with the original demuxmix labels. However, it appears one cell now being called a Hashtag6 was originally called Hashtag5. Again, we can investigate this by visualising the corresponding SNP profile. In doing so, we can see that this most likely is a true Hashtag6. ```{r} hm <- Heatmap(counts(altExp(sce_test, "SNP"))[, sce_test$knn == "Hashtag6"], column_split = sce_test$labels[sce_test$knn == "Hashtag6"], cluster_rows = FALSE, show_column_names = FALSE, cluster_column_slices = FALSE, column_names_rot = 45, column_title_rot = -45, row_title = "SNPs", show_row_names = FALSE, col = colors ) draw(hm, column_title = "knn Hashtag6 group split by demuxmix classification", padding = unit(c(2, 15, 2, 2), "mm") ) ``` # Session Info ```{r} sessionInfo() ``` # References