---
output:
html_document
bibliography: ref.bib
---
# Droplet processing
## Motivation
Droplet-based single-cell protocols aim to isolate each cell inside its own droplet in a water-in-oil emulsion, such that each droplet serves as a miniature reaction chamber for highly multiplexed library preparation [@macosko2015highly;@klein2015droplet].
Upon sequencing, reads are assigned to individual cells based on the presence of droplet-specific barcodes.
This enables a massive increase in the number of cells that can be processed in typical scRNA-seq experiments, contributing to the dominance^[As of time of writing.] of technologies such as the 10X Genomics platform [@zheng2017massively].
However, as the allocation of cells to droplets is not known in advance, the data analysis requires some special steps to determine what each droplet actually contains.
This chapter explores some of the more common preprocessing procedures that might be applied to the count matrices generated from droplet protocols.
## Calling cells from empty droplets {#qc-droplets}
### Background
An unique aspect of droplet-based data is that we have no prior knowledge about whether a particular library (i.e., cell barcode) corresponds to cell-containing or empty droplets.
Thus, we need to call cells from empty droplets based on the observed expression profiles.
This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced, resulting in non-zero counts for libraries that do not contain any cell.
To demonstrate, we obtain the **unfiltered** count matrix for the PBMC dataset from 10X Genomics.
``` r
sce.pbmc
```
```
## class: SingleCellExperiment
## dim: 33694 737280
## metadata(1): Samples
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
## TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
```
The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts (Figure \@ref(fig:rankplot)), probably corresponding to cell-containing and empty droplets respectively.
A simple approach would be to apply a threshold on the total count to only retain those barcodes with large totals.
However, this unnecessarily discards libraries derived from cell types with low RNA content.
``` r
library(DropletUtils)
bcrank <- barcodeRanks(counts(sce.pbmc))
# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
xlab="Rank", ylab="Total UMI count", cex.lab=1.2)
abline(h=metadata(bcrank)$inflection, col="darkgreen", lty=2)
abline(h=metadata(bcrank)$knee, col="dodgerblue", lty=2)
legend("bottomleft", legend=c("Inflection", "Knee"),
col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)
```
(\#fig:rankplot)Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts). The inferred locations of the inflection and knee points are also shown.
### Testing for empty droplets
We use the `emptyDrops()` function to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool [@lun2018distinguishing].
Any significant deviation indicates that the barcode corresponds to a cell-containing droplet.
This allows us to discriminate between well-sequenced empty droplets and droplets derived from cells with little RNA, both of which would have similar total counts in Figure \@ref(fig:rankplot).
We call cells at a false discovery rate (FDR) of 0.1%, meaning that no more than 0.1% of our called barcodes should be empty droplets on average.
``` r
# emptyDrops performs Monte Carlo simulations to compute p-values,
# so we need to set the seed to obtain reproducible results.
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
# See ?emptyDrops for an explanation of why there are NA values.
summary(e.out$FDR <= 0.001)
```
```
## Mode FALSE TRUE NA's
## logical 989 4300 731991
```
`emptyDrops()` uses Monte Carlo simulations to compute $p$-values for the multinomial sampling transcripts from the ambient pool.
The number of Monte Carlo iterations determines the lower bound for the $p$-values [@phipson2010permutation].
The `Limited` field in the output indicates whether or not the computed $p$-value for a particular barcode is bounded by the number of iterations.
If any non-significant barcodes are `TRUE` for `Limited`, we may need to increase the number of iterations.
A larger number of iterations will result in a lower $p$-value for these barcodes, which may allow them to be detected after correcting for multiple testing.
``` r
table(Sig=e.out$FDR <= 0.001, Limited=e.out$Limited)
```
```
## Limited
## Sig FALSE TRUE
## FALSE 989 0
## TRUE 1728 2572
```
As mentioned above, `emptyDrops()` assumes that barcodes with low total UMI counts are empty droplets.
Thus, the null hypothesis should be true for all of these barcodes.
We can check whether the hypothesis testing procedure holds its size by examining the distribution of $p$-values for low-total barcodes with `test.ambient=TRUE`.
Ideally, the distribution should be close to uniform (Figure \@ref(fig:ambientpvalhist)).
Large peaks near zero indicate that barcodes with total counts below `lower` are not all ambient in origin.
This can be resolved by decreasing `lower` further to ensure that barcodes corresponding to droplets with very small cells are not used to estimate the ambient profile.
``` r
set.seed(100)
limit <- 100
all.out <- emptyDrops(counts(sce.pbmc), lower=limit, test.ambient=TRUE)
hist(all.out$PValue[all.out$Total <= limit & all.out$Total > 0],
xlab="P-value", main="", col="grey80")
```
(\#fig:ambientpvalhist)Distribution of $p$-values for the assumed empty droplets.
Once we are satisfied with the performance of `emptyDrops()`, we subset our `SingleCellExperiment` object to retain only the detected cells.
Discerning readers will notice the use of `which()`, which conveniently removes the `NA`s prior to the subsetting.
``` r
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]
```
It usually only makes sense to call cells using a count matrix involving libraries from a single sample.
The composition of transcripts in the ambient solution will usually vary between samples, so the same ambient profile cannot be reused.
If multiple samples are present in a dataset, their counts should only be combined after cell calling is performed on each matrix.
### Relationship with other QC metrics
While `emptyDrops()` will distinguish cells from empty droplets, it makes no statement about the quality of the cells.
It is entirely possible for droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis.
This is achieved using the same outlier-based strategy described in [Basic Section 1.3.2](http://bioconductor.org/books/3.19/OSCA.basic/quality-control.html#quality-control-outlier).
Filtering on the mitochondrial proportion provides the most additional benefit in this situation, provided that we check that we are not removing a subpopulation of metabolically active cells (Figure \@ref(fig:qc-mito-pbmc)).
``` r
library(scuttle)
is.mito <- grep("^MT-", rowData(sce.pbmc)$Symbol)
pbmc.qc <- perCellQCMetrics(sce.pbmc, subsets=list(MT=is.mito))
discard.mito <- isOutlier(pbmc.qc$subsets_MT_percent, type="higher")
summary(discard.mito)
```
```
## Mode FALSE TRUE
## logical 3985 315
```
``` r
plot(pbmc.qc$sum, pbmc.qc$subsets_MT_percent, log="x",
xlab="Total count", ylab='Mitochondrial %')
abline(h=attr(discard.mito, "thresholds")["higher"], col="red")
```
(\#fig:qc-mito-pbmc)Percentage of reads assigned to mitochondrial transcripts, plotted against the library size. The red line represents the upper threshold used for QC filtering.
`emptyDrops()` already removes cells with very low library sizes or (by association) low numbers of expressed genes.
Thus, further filtering on these metrics is not strictly necessary.
It may still be desirable to filter on both of these metrics to remove non-empty droplets containing cell fragments or stripped nuclei that were not caught by the mitochondrial filter.
However, this should be weighed against the risk of losing genuine cell types as discussed in Section \@ref(outlier-assumptions).
Note that _CellRanger_ version 3 automatically performs cell calling using an algorithm similar to `emptyDrops()`.
If we had started our analysis with the **filtered** count matrix, we could go straight to computing other QC metrics.
We would not need to run `emptyDrops()` manually as shown here, and indeed, attempting to do so would lead to nonsensical results if not outright software errors.
Nonetheless, it may still be desirable to load the **unfiltered** matrix and apply `emptyDrops()` ourselves, on occasions where more detailed inspection or control of the cell-calling statistics is desired.
## Removing ambient contamination
For routine analyses, there is usually no need to remove the ambient contamination from each library.
A consistent level of contamination across the dataset does not introduce much spurious heterogeneity, so dimensionality reduction and clustering on the original (log-)expression matrix remain valid.
For genes that are highly abundant in the ambient solution, we can expect some loss of signal due to shrinkage of the log-fold changes between clusters towards zero, but this effect should be negligible for any genes that are so strongly upregulated that they are able to contribute to the ambient solution in the first place.
This suggests that ambient removal can generally be omitted from most analyses, though we will describe it here regardless as it can be useful in specific situations.
Effective removal of ambient contamination involves tackling a number of issues.
We need to know how much contamination is present in each cell, which usually requires some prior biological knowledge about genes that should not be expressed in the dataset (e.g., mitochondrial genes in single-nuclei datasets, see Section \@ref(nuclei-ambient-tricks)) or genes with mutually exclusive expression profiles [@young2018soupx].
Those same genes must be highly abundant in the ambient solution to have enough counts in each cell for precise estimation of the scale of the contamination.
The actual subtraction of the ambient contribution also must be done in a manner that respects the mean-variance relationship of the count data.
Unfortunately, these issues are difficult to address for single-cell data due to the imprecision of low counts,
Rather than attempting to remove contamination from individual cells, a more measured approach is to operate on clusters of related cells.
The `removeAmbience()` function from *[DropletUtils](https://bioconductor.org/packages/3.19/DropletUtils)* will remove the contamination from the cluster-level profiles and propagate the effect of those changes back to the individual cells.
Specifically, given a count matrix for a single sample and its associated ambient profile, `removeAmbience()` will:
1. Aggregate counts in each cluster to obtain an average profile per cluster.
2. Estimate the contamination proportion in each cluster with `maximumAmbience()` (see [Multi-sample Chapter 5](http://bioconductor.org/books/3.19/OSCA.multisample/ambient-problems.html#ambient-problems)).
This has the useful property of not requiring any prior knowledge of control or mutually exclusive expression profiles, albeit at the cost of some statistical rigor.
3. Subtract the estimated contamination from the cluster-level average.
4. Perform quantile-quantile mapping of each individual cell's counts from the old average to the new subtracted average.
This preserves the mean-variance relationship while yielding corrected single-cell profiles.
We demonstrate this process on our PBMC dataset below.
``` r
# Not all genes are reported in the ambient profile from emptyDrops,
# as genes with counts of zero across all droplets are just removed.
# So for convenience, we will restrict our analysis to genes with
# non-zero counts in at least one droplet (empty or otherwise).
amb <- metadata(e.out)$ambient[,1]
stripped <- sce.pbmc[names(amb),]
out <- removeAmbience(counts(stripped), ambient=amb, groups=colLabels(stripped))
dim(out)
```
```
## [1] 20112 3985
```
We can visualize the effects of ambient removal on a gene like _IGKC_, which presumably should only be expressed in the B cell lineage.
This gene has some level of expression in each cluster in the original dataset but is "zeroed" in most clusters after removal (Figure \@ref(fig:ambient-removal-igkc)).
``` r
library(scater)
counts(stripped, withDimnames=FALSE) <- out
stripped <- logNormCounts(stripped)
gridExtra::grid.arrange(
plotExpression(sce.pbmc, x="label", colour_by="label", features="IGKC") +
ggtitle("Before"),
plotExpression(stripped, x="label", colour_by="label", features="IGKC") +
ggtitle("After"),
ncol=2
)
```
(\#fig:ambient-removal-igkc)Distribution of _IGKC_ log-expression values in each cluster of the PBMC dataset, before and after removal of ambient contamination.
We observe a similar phenomenon with the _LYZ_ gene (Figure \@ref(fig:ambient-removal-lyz)), which should only be expressed in macrophages and neutrophils.
In fact, if we knew this beforehand, we could specify these two mutually exclusive sets - i.e., _LYZ_ and _IGKC_ and their related genes - in the `features=` argument to `removeAmbience()`.
This knowledge is subsequently used to estimate the contamination in each cluster, an approach that is more conceptually similar to the methods in the *[SoupX](https://CRAN.R-project.org/package=SoupX)* package.
``` r
gridExtra::grid.arrange(
plotExpression(sce.pbmc, x="label", colour_by="label", features="LYZ") +
ggtitle("Before"),
plotExpression(stripped, x="label", colour_by="label", features="LYZ") +
ggtitle("After"),
ncol=2
)
```
(\#fig:ambient-removal-lyz)Distribution of _LYZ_ log-expression values in each cluster of the PBMC dataset, before and after removal of ambient contamination.
While these results look impressive, discerning readers will note that the method relies on having sensible clusters.
This limits the function's applicability to the _end_ of an analysis after all the characterization has already been done.
As such, the stripped matrix can really only be used in downstream steps like the DE analysis (where it is unlikely to have much effect beyond inflating already-large log-fold changes) or - most importantly - in visualization, where users can improve the aesthetics of their plots by eliminating harmless background expression.
Of course, one could repeat the entire analysis on the stripped count matrix to obtain new clusters, but this seems unnecessarily circuituous, especially if the clusters were deemed good enough for use in `removeAmbience()` in the first place.
Finally, it may be worth considering whether a corrected per-cell count matrix is really necessary.
In `removeAmbience()`, counts for each gene are assumed to follow a negative binomial distribution with a fixed dispersion.
This is necessary to perform the quantile-quantile remapping to obtain a corrected version of each individual cell's counts, but violations of these distributional assumptions will introduce inaccuracies in downstream models.
Some analyses may have specific remedies to ambient contamination that do not require corrected per-cell counts ([Multi-sample Chapter 5](http://bioconductor.org/books/3.19/OSCA.multisample/ambient-problems.html#ambient-problems)), so we can avoid these assumptions altogether if such remedies are available.
## Demultiplexing cell hashes {#cell-hashing}
### Background
Cell hashing [@stoeckius2018hashing] is a useful technique that allows cells from different samples to be processed in a single run of a droplet-based protocol.
Cells from a single sample are first labelled with a unique hashing tag oligo (HTOs), usually via conjugation of the HTO to an antibody against a ubiquitous surface marker or a membrane-binding compound like cholesterol [@mcginnis2019multiseq].
Cells from different samples are then mixed together and the multiplexed pool is used for droplet-based library preparation; each cell is assigned back to its sample of origin based on its most abundant HTO.
By processing multiple samples together, we can avoid batch effects and simplify the logistics of studies with a large number of samples.
Sequencing of the HTO-derived cDNA library yields a count matrix where each row corresponds to a HTO and each column corresponds to a cell barcode.
This can be stored as an alternative Experiment in our `SingleCellExperiment`, alongside the main experiment containing the counts for the actual genes.
We demonstrate on some data from the original @stoeckius2018hashing study, which contains counts for a mixture of 4 cell lines across 12 samples.
``` r
library(scRNAseq)
hto.sce <- StoeckiusHashingData(type="mixed")
hto.sce # The full dataset
```
```
## class: SingleCellExperiment
## dim: 25339 25088
## metadata(0):
## assays(1): counts
## rownames(25339): A1BG A1BG-AS1 ... snoU2-30 snoZ178
## rowData names(0):
## colnames(25088): CAGATCAAGTAGGCCA CCTTTCTGTCGGATCC ... CGGTTATCCATCTGCT
## CGGTTCACACGTCAGC
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(1): hto
```
``` r
altExp(hto.sce) # Contains the HTO counts
```
```
## class: SingleCellExperiment
## dim: 12 25088
## metadata(0):
## assays(1): counts
## rownames(12): HEK_A HEK_B ... KG1_B KG1_C
## rowData names(2): cell_line replicate
## colnames(25088): CAGATCAAGTAGGCCA CCTTTCTGTCGGATCC ... CGGTTATCCATCTGCT
## CGGTTCACACGTCAGC
## colData names(1): metrics
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
```
``` r
assay(altExp(hto.sce), "counts")[,1:3] # Preview of the count profiles
```
```
## 12 x 3 sparse Matrix of class "dgCMatrix"
## CAGATCAAGTAGGCCA CCTTTCTGTCGGATCC CATATGGCATGGAATA
## HEK_A 1 . .
## HEK_B 111 . 1
## HEK_C 7 1 7
## THP1_A 15 19 16
## THP1_B 10 8 5
## THP1_C 4 3 6
## K562_A 118 . 245
## K562_B 5 530 131
## K562_C 1 3 4
## KG1_A 30 25 24
## KG1_B 40 14 239
## KG1_C 32 36 38
```
### Cell calling options
Our first task is to identify the libraries corresponding to cell-containing droplets.
This can be applied on the gene count matrix or the HTO count matrix, depending on what information we have available.
We start with the usual application of `emptyDrops()` on the gene count matrix of `hto.sce` (Figure \@ref(fig:barcode-rank-mix-genes)).
``` r
set.seed(10010)
e.out.gene <- emptyDrops(counts(hto.sce))
is.cell <- e.out.gene$FDR <= 0.001
summary(is.cell)
```
```
## Mode FALSE TRUE NA's
## logical 1384 7934 15770
```
``` r
par(mfrow=c(1,2))
r <- rank(-e.out.gene$Total)
plot(r, e.out.gene$Total, log="xy", xlab="Rank", ylab="Total gene count", main="")
abline(h=metadata(e.out.gene)$retain, col="darkgrey", lty=2, lwd=2)
hist(log10(e.out.gene$Total[is.cell]), xlab="Log[10] gene count", main="")
```
(\#fig:barcode-rank-mix-genes)Cell-calling statistics from running `emptyDrops()` on the gene count in the cell line mixture data. Left: Barcode rank plot with the estimated knee point in grey. Right: distribution of log-total counts for libraries identified as cells.
Alternatively, we could also apply `emptyDrops()` to the HTO count matrix but this is slightly more complicated.
As HTOs are sequenced separately from the endogenous transcripts, the coverage of the former is less predictable across studies; this makes it difficult to determine an appropriate default value of `lower=` for estimation of the initial ambient profile.
We instead estimate the ambient profile by excluding the top `by.rank=` barcodes with the largest totals, under the assumption that no more than `by.rank=` cells were loaded.
Here we have chosen 12000, which is largely a guess to ensure that we can directly pick the knee point (Figure \@ref(fig:barcode-rank-mix-hto)) in this somewhat pre-filtered dataset.
``` r
set.seed(10010)
# Setting lower= for correct knee point detection,
# as the coverage in this dataset is particularly low.
e.out.hto <- emptyDrops(assay(altExp(hto.sce), "counts"), by.rank=12000, lower=10)
summary(is.cell.hto <- e.out.hto$FDR <= 0.001)
```
```
## Mode FALSE TRUE NA's
## logical 7067 4933 13088
```
``` r
par(mfrow=c(1,2))
r <- rank(-e.out.hto$Total)
plot(r, e.out.hto$Total, log="xy", xlab="Rank", ylab="Total HTO count", main="")
abline(h=metadata(e.out.hto)$retain, col="darkgrey", lty=2, lwd=2)
hist(log10(e.out.hto$Total[is.cell.hto]), xlab="Log[10] HTO count", main="")
```
(\#fig:barcode-rank-mix-hto)Cell-calling statistics from running `emptyDrops()` on the HTO counts in the cell line mixture data. Left: Barcode rank plot with the knee point shown in grey. Right: distribution of log-total counts for libraries identified as cells.
While both approaches are valid, we tend to favor the cell calls derived from the gene matrix as this directly indicates that a cell is present in the droplet.
Indeed, at least a few libraries have very high total HTO counts yet very low total gene counts (Figure \@ref(fig:hto-total-comp)), suggesting that the presence of HTOs may not always equate to successful capture of that cell's transcriptome.
HTO counts also tend to exhibit stronger overdispersion (i.e., lower `alpha` in the `emptyDrops()` calculations), increasing the risk of violating `emptyDrops()`'s distributional assumptions.
``` r
table(HTO=is.cell.hto, Genes=is.cell, useNA="always")
```
```
## Genes
## HTO FALSE TRUE
## FALSE 504 2834 3729
## TRUE 59 4757 117
## 821 343 11924
```
``` r
plot(e.out.gene$Total, e.out.hto$Total, log="xy",
xlab="Total gene count", ylab="Total HTO count")
abline(v=metadata(e.out.gene)$lower, col="red", lwd=2, lty=2)
abline(h=metadata(e.out.hto)$lower, col="blue", lwd=2, lty=2)
```
(\#fig:hto-total-comp)Total HTO counts plotted against the total gene counts for each library in the cell line mixture dataset. Each point represents a library while the dotted lines represent the thresholds below which libraries were assumed to be empty droplets.
Again, note that if we are picking up our analysis after processing with pipelines like _CellRanger_, it may be that the count matrix has already been subsetted to the cell-containing libraries.
If so, we can skip this section entirely and proceed straight to demultiplexing.
### Demultiplexing on HTO abundance
We run `hashedDrops()` to demultiplex the HTO count matrix for the subset of cell-containing libraries.
This reports the likely sample of origin for each library based on its most abundant HTO after adjusting those abundances for ambient contamination.
For quality control, it returns the log-fold change between the first and second-most abundant HTOs in each barcode libary (Figure \@ref(fig:hto-1to2-hist)), allowing us to quantify the certainty of each assignment.
``` r
hto.mat <- assay(altExp(hto.sce), "counts")[,which(is.cell)]
hash.stats <- hashedDrops(hto.mat)
hist(hash.stats$LogFC, xlab="Log fold-change from best to second HTO", main="")
```
(\#fig:hto-1to2-hist)Distribution of log-fold changes from the first to second-most abundant HTO in each cell.
Confidently assigned cells should have large log-fold changes between the best and second-best HTO abundances as there should be exactly one dominant HTO per cell.
These are marked as such by the `Confident` field in the output of `hashedDrops()`, which can be used to filter out ambiguous assignments prior to downstream analyses.
``` r
# Raw assignments:
table(hash.stats$Best)
```
```
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 664 732 636 595 629 655 570 684 603 726 662 778
```
``` r
# Confident assignments based on (i) a large log-fold change
# and (ii) not being a doublet.
table(hash.stats$Best[hash.stats$Confident])
```
```
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 580 619 560 524 553 573 379 605 427 640 556 607
```
In the absence of an _a priori_ ambient profile, `hashedDrops()` will attempt to automatically estimate it from the count matrix.
This is done by assuming that each HTO has a bimodal distribution where the lower peak corresponds to ambient contamination in cells that do not belong to that HTO's sample.
Counts are then averaged across all cells in the lower mode to obtain the relative abundance of that HTO (Figure \@ref(fig:hto-ambient)).
`hashedDrops()` uses the ambient profile to adjust for systematic differences in HTO concentrations that could otherwise skew the log-fold changes -
for example, this particular dataset exhibits order-of-magnitude differences in the concentration of different HTOs.
The adjustment process itself involves a fair number of assumptions that we will not discuss here; see `?hashedDrops` for more details.
``` r
barplot(metadata(hash.stats)$ambient, las=2,
ylab="Inferred proportion of counts in the ambient solution")
```
(\#fig:hto-ambient)Proportion of each HTO in the ambient solution for the cell line mixture data, estimated from the HTO counts of cell-containing droplets.
If we are dealing with unfiltered data, we have the opportunity to improve the inferences by defining the ambient profile beforehand based on the empty droplets.
This simply involves summing the counts for each HTO across all _known empty_ droplets, marked as those libraries with `NA` FDR values in the `emptyDrops()` output.
Alternatively, if we had called `emptyDrops()` directly on the HTO count matrix, we could just extract the ambient profile from the output's `metadata()`.
For this dataset, all methods agree well (Figure \@ref(fig:hto-ambient)) though providing an _a priori_ profile can be helpful in more extreme situations where the automatic method fails,
e.g., if there are too few cells in the lower mode for accurate estimation of a HTO's ambient concentration.
``` r
estimates <- rbind(
`Bimodal`=proportions(metadata(hash.stats)$ambient),
`Empty (genes)`=proportions(rowSums(assay(altExp(hto.sce), "counts")[,is.na(e.out.gene$FDR)])),
`Empty (HTO)`=metadata(e.out.hto)$ambient[,1]
)
barplot(estimates, beside=TRUE, ylab="Proportion of counts in the ambient solution")
legend("topleft", fill=gray.colors(3), legend=rownames(estimates))
```
(\#fig:hto-ambient2)Proportion of each HTO in the ambient solution for the cell line mixture data, estimated using the bimodal method in `hashedDrops()` or by computing the average abundance across all empty droplets (where the empty state is defined by using `emptyDrops()` on either the genes or the HTO matrix).
Given an estimate of the ambient profile - say, the one derived from empty droplets detected using the HTO count matrix - we can easily use it in `hashedDrops()` via the `ambient=` argument.
This yields very similar results to those obtained with the automatic method, as expected from the similarity in the profiles.
``` r
hash.stats2 <- hashedDrops(hto.mat, ambient=metadata(e.out.hto)$ambient[,1])
table(hash.stats2$Best[hash.stats2$Confident])
```
```
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 575 602 565 526 551 559 354 597 411 632 553 589
```
### Further comments
After demultiplexing, it is a simple matter to subset the `SingleCellExperiment` to the confident assignments.
This actually involves two steps - the first is to subset to the libraries that were actually used in `hashedDrops()`, and the second is to subset to the libraries that were confidently assigned to a single sample.
Of course, we also include the putative sample of origin for each cell.
``` r
sce <- hto.sce[,rownames(hash.stats)]
sce$sample <- hash.stats$Best
sce <- sce[,hash.stats$Confident]
```
We examine the success of the demultiplexing by performing a quick analysis.
Recall that this experiment involved 4 cell lines that were multiplexed together; we see that the separation between cell lines is preserved in Figure \@ref(fig:hto-mix-tsne), indicating that the cells were assigned to their correct samples of origin.
``` r
library(scran)
library(scater)
sce <- logNormCounts(sce)
dec <- modelGeneVar(sce)
set.seed(100)
sce <- runPCA(sce, subset_row=getTopHVGs(dec, n=5000))
sce <- runTSNE(sce, dimred="PCA")
cell.lines <- sub("_.*", "", rownames(altExp(sce)))
sce$cell.line <- cell.lines[sce$sample]
plotTSNE(sce, colour_by="cell.line")
```
(\#fig:hto-mix-tsne)The usual $t$-SNE plot of the cell line mixture data, where each point is a cell and is colored by the cell line corresponding to its sample of origin.
Cell hashing information can also be used to detect doublets - see Chapter \@ref(doublet-detection) for more details.
## Removing swapped molecules
Some of the more recent DNA sequencing machines released by Illumina (e.g., HiSeq 3000/4000/X, X-Ten, and NovaSeq) use patterned flow cells to improve throughput and cost efficiency.
However, in multiplexed pools, the use of these flow cells can lead to the mislabelling of DNA molecules with the incorrect library barcode [@sinha2017index], a phenomenon known as "barcode swapping".
This leads to contamination of each library with reads from other libraries - for droplet sequencing experiments, this is particularly problematic as it manifests as the appearance of artificial cells that are low-coverage copies of their originals from other samples [@griffiths2018detection].
Fortunately, it is easy enough to remove affected reads from droplet experiments with the `swappedDrops()` function from *[DropletUtils](https://bioconductor.org/packages/3.19/DropletUtils)*.
Given a multiplexed pool of samples, we identify potential swapping events as transcript molecules that share the same combination of UMI sequence, assigned gene and cell barcode across samples.
We only keep the molecule if it has dominant coverage in a single sample, which is likely to be its original sample; we remove all (presumably swapped) instances of that molecule in the other samples.
Our assumption is that it is highly unlikely that two molecules would have the same combination of values by chance.
To demonstrate, we will use some multiplexed 10X Genomics data from an _attempted_ study of the mouse mammary gland [@griffiths2018detection].
This experiment consists of 8 scRNA-seq samples from various stages of mammary gland development, sequenced using the HiSeq 4000.
We use the *[DropletTestFiles](https://bioconductor.org/packages/3.19/DropletTestFiles)* package to obtain the molecule information files produced by the CellRanger software suite; as its name suggests, this file format contains information on each individual transcript molecule in each sample.
``` r
library(DropletTestFiles)
swap.files <- listTestFiles(dataset="bach-mammary-swapping")
swap.files <- swap.files[dirname(swap.files$file.name)=="hiseq_4000",]
swap.files <- vapply(swap.files$rdatapath, getTestFile, prefix=FALSE, "")
names(swap.files) <- sub(".*_(.*)\\.h5", "\\1", names(swap.files))
```
We examine the barcode rank plots before making any attempt to remove swapped molecules (Figure \@ref(fig:mammary-swapped-barcode-rank)), using the `get10xMolInfoStats()` function to efficiently obtain summary statistics from each molecule information file.
We see that samples E1 and F1 have different curves but this is not cause for alarm given that they also correspond to a different developmental stage compared to the other samples.
``` r
library(DropletUtils)
before.stats <- lapply(swap.files, get10xMolInfoStats)
max.umi <- vapply(before.stats, function(x) max(x$num.umis), 0)
ylim <- c(1, max(max.umi))
max.ncells <- vapply(before.stats, nrow, 0L)
xlim <- c(1, max(max.ncells))
plot(0,0,type="n", xlab="Rank", ylab="Number of UMIs",
log="xy", xlim=xlim, ylim=ylim)
for (i in seq_along(before.stats)) {
u <- sort(before.stats[[i]]$num.umis, decreasing=TRUE)
lines(seq_along(u), u, col=i, lwd=5)
}
legend("topright", col=seq_along(before.stats),
lwd=5, legend=names(before.stats))
```
(\#fig:mammary-swapped-barcode-rank)Barcode rank curves for all samples in the HiSeq 4000-sequenced mammary gland dataset, before removing any swapped molecules.
We apply the `swappedDrops()` function to the molecule information files to identify and remove swapped molecules from each sample.
While all samples have some percentage of removed molecules, the majority of molecules in samples E1 and F1 are considered to be swapping artifacts.
The most likely cause is that these samples contain no real cells or highly damaged cells with little RNA, which frees up sequencing resources for deeper coverage of swapped molecules.
``` r
after.mat <- swappedDrops(swap.files, get.swapped=TRUE)
cleaned.sum <- vapply(after.mat$cleaned, sum, 0)
swapped.sum <- vapply(after.mat$swapped, sum, 0)
swapped.sum / (swapped.sum + cleaned.sum)
```
```
## A1 B1 C1 D1 E1 F1 G1 H1
## 0.02761 0.02767 0.12274 0.03797 0.82535 0.86561 0.02241 0.03648
```
After removing the swapped molecules, the barcode rank curves for E1 and F1 drop dramatically (Figure \@ref(fig:mammary-swapped-barcode-rank-after)).
This represents the worst-case outcome of the swapping phenomenon, where cells are "carbon-copied"^[For those of you who know what that is.] from the other multiplexed samples.
Proceeding with the cleaned matrices protects us from these egregious artifacts as well as the more subtle effects of contamination.
``` r
plot(0,0,type="n", xlab="Rank", ylab="Number of UMIs",
log="xy", xlim=xlim, ylim=ylim)
for (i in seq_along(after.mat$cleaned)) {
cur.stats <- barcodeRanks(after.mat$cleaned[[i]])
u <- sort(cur.stats$total, decreasing=TRUE)
lines(seq_along(u), u, col=i, lwd=5)
}
legend("topright", col=seq_along(after.mat$cleaned),
lwd=5, legend=names(after.mat$cleaned))
```
(\#fig:mammary-swapped-barcode-rank-after)Barcode rank curves for all samples in the HiSeq 4000-sequenced mammary gland dataset, after removing any swapped molecules.