---
title: Detecting differentially abundant subpopulations in mass cytometry data
author: Aaron Lun
package: cydar
output:
BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{Detecting differentially abundant subpopulations in mass cytometry data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE, results="hide", message=FALSE}
require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
```
```{r setup, echo=FALSE, message=FALSE}
library(cydar)
register(SerialParam())
set.seed(100)
```
# Detecting differentially abundant subpopulations in mass cytometry data
Package: `r Rpackage("cydar")`
Author: Aaron Lun (alun@wehi.edu.au)
Last modified: 2017-03-21
Compilation date: `r Sys.Date()`
# Introduction
Mass cytometry is a technique that allows simultaneous profiling of many (> 30) protein markers on each of millions of cells.
This is frequently used to characterize cell subpopulations based on unique combinations of markers.
One way to analyze this data is to identify subpopulations that change in abundance between conditions, e.g., with or without drug treatment, before and after stimulation.
This vignette will describe the steps necessary to perform this "differential abundance" (DA) analysis.
# Setting up the data
## Mocking up a data set
The analysis starts from a set of Flow Cytometry Standard (FCS) files containing intensities for each cell.
For the purposes of this vignette, we will simulate some data to demonstrate the methods below.
This experiment will assay 30 markers, and contain 3 replicate samples in each of 2 biological conditions.
We add two small differentially abundant subpopulations to ensure that we get something to look at later.
```{r}
ncells <- 20000
nda <- 200
nmarkers <- 31
down.pos <- 1.8
up.pos <- 1.2
conditions <- rep(c("A", "B"), each=3)
combined <- rbind(matrix(rnorm(ncells*nmarkers, 1.5, 0.6), ncol=nmarkers),
matrix(rnorm(nda*nmarkers, down.pos, 0.3), ncol=nmarkers),
matrix(rnorm(nda*nmarkers, up.pos, 0.3), ncol=nmarkers))
combined[,31] <- rnorm(nrow(combined), 1, 0.5) # last marker is a QC marker.
combined <- 10^combined # raw intensity values
sample.id <- c(sample(length(conditions), ncells, replace=TRUE),
sample(which(conditions=="A"), nda, replace=TRUE),
sample(which(conditions=="B"), nda, replace=TRUE))
colnames(combined) <- paste0("Marker", seq_len(nmarkers))
```
We use this to construct a `ncdfFlowSet` for our downstream analysis.
```{r}
library(ncdfFlow)
collected.exprs <- list()
for (i in seq_along(conditions)) {
stuff <- list(combined[sample.id==i,,drop=FALSE])
names(stuff) <- paste0("Sample", i)
collected.exprs[[i]] <- poolCells(stuff)
}
names(collected.exprs) <- paste0("Sample", seq_along(conditions))
collected.exprs <- ncdfFlowSet(as(collected.exprs, "flowSet"))
```
In practice, we can use the `read.ncdfFlowSet` function to load intensities from FCS files into the R session.
The `ncdfFlowSet` object can replace all instances of `collected.exprs` in the downstream steps.
## Pre-processing of intensities
### Pooling cells together
The intensities need to be transformed and gated prior to further analysis.
We first pool all cells together into a single `flowFrame`, which will be used for construction of the transformation and gating functions for all samples.
This avoids spurious differences from using sample-specific functions.
```{r}
pool.ff <- poolCells(collected.exprs)
```
### Estimating transformation parameters
We use the `estimateLogicle` method from the `r Biocpkg("flowCore")` package to obtain a transformation function, and apply it to `pool.ff`.
This performs a biexponential transformation with parameters estimated for optimal display.
```{r}
library(flowCore)
trans <- estimateLogicle(pool.ff, colnames(pool.ff))
proc.ff <- transform(pool.ff, trans)
```
### Gating out uninteresting cells
The next step is to construct gates to remove uninteresting cells.
There are several common gates that are used in mass cytometry data analysis, typically used in the following order:
- Gating out calibration beads (high in Ce140) used to correct for intensity shifts in the mass spectrometer.
- Gating on moderate intensities for the DNA markers, to remove debris and doublets.
This can be done using the `dnaGate` function.
- Gating on a dead/alive marker, to remove dead cells.
Whether high or low values should be removed depends on the marker (e.g., high values to be removed for cisplatin).
- Gating out low values for selected markers, e.g., CD45 when studying leukocytes, CD3 when studying T cells.
To demonstrate, we will construct a gate to remove low values for the last marker, using the `outlierGate` function.
The constructed gate is then applied to the `flowFrame`, only retaining cells falling within the gated region.
```{r}
gate.31 <- outlierGate(proc.ff, "Marker31", type="upper")
gate.31
filter.31 <- filter(proc.ff, gate.31)
summary(filter.31@subSet)
```
We apply the gate before proceeding to the next marker to be gated.
```{r]
proc.ff <- Subset(proc.ff, gate.31)
```
### Applying functions to the original data
Applying the transformation functions to the original data is simple.
```{r}
processed.exprs <- transform(collected.exprs, trans)
```
Applying the gates is similarly easy.
Use methods the `r Biocpkg("flowViz")` package to see how gating diagnostics can be visualized.
```{r}
processed.exprs <- Subset(processed.exprs, gate.31)
```
Markers used for gating are generally ignored in the rest of the analysis.
For example, as long as all cells contain DNA, we are generally not interested in differences in the amount of DNA.
This is achieved by discarding those markers (in this case, marker 31).
```{r}
processed.exprs <- processed.exprs[,1:30]
```
## Normalizing intensities across batches
By default, we do not perform any normalization of intensities between samples.
This is because we assume that barcoding was used with multiplexed staining and mass cytometry.
Thus, technical biases that might affect intensity should be the same in all samples, which means that they cancel out when comparing between samples.
In data sets containing multiple batches of separately barcoded samples, we provide the `normalizeBatch` function to adjust the intensities.
This uses range-based normalization to equalize the dynamic range between batches for each marker.
Alternatively, it can use warping functions to eliminate non-linear distortions due to batch effects.
The problem of normalization is much harder to solve in data sets with no barcoding at all.
In such cases, the best solution is to expand the sizes of the hyperspheres to "smooth over" any batch effects.
See the `expandRadius` function for more details.
# Counting cells into hyperspheres
We quantify abundance by assigning cells to hyperspheres in the high-dimensional marker space, and counting the number of cells from each sample in each hypersphere.
To do this, we first convert the intensity data into a format that is more amenable for counting.
The `prepareCellData` function works with either a list of matrices or directly with a `ncdfFlowSet` object, and generates a `CyData` object containing the reformatted intensities.
```{r}
cd <- prepareCellData(processed.exprs)
```
We then assign cells to hyperspheres using the `countCells` function.
Each hypersphere is centred at a cell to restrict ourselves to non-empty hyperspheres, and has radius equal to 0.5 times the square root of the number of markers.
The square root function adjusts for increased sparsity of the data at higher dimensions, while the 0.5 scaling factor allows cells with 10-fold differences in marker intensity (due to biological variability or technical noise) to be counted into the same hypersphere.
Also see the `neighborDistances` function for guidance on choosing a value of `tol`.
```{r}
cd <- countCells(cd, tol=0.5)
```
The output is another `CyData` object with extra information added to various fields.
In particular, the reported count matrix contains the set of counts for each hypersphere (row) from each sample (column).
```{r}
head(assay(cd))
```
Also reported are the "positions" of the hyperspheres, defined for each marker as the median intensity for all cells assigned to each hypersphere.
This will be required later for interpretation, as the marker intensities are required for defining the function of each subpopulation.
Shown below is the position of the first hypersphere, represented by its set of median intensities across all markers.
```{r}
head(intensities(cd))
```
There is some light filtering in `countCells` to improve memory efficiency, which can be adjusted with the `filter` argument.
# Testing for significant differences in abundance
We can use a number of methods to test the count data for differential abundance.
Here, we will use the quasi-likelihood (QL) method from the `r Biocpkg("edgeR")` package.
This allows us to model discrete count data with overdispersion due to biological variability.
```{r}
library(edgeR)
y <- DGEList(assay(cd), lib.size=cd$totals)
```
First, we do some filtering to remove low-abundance hyperspheres with average counts below 5.
These are mostly uninteresting as they do not provide enough evidence to reject the null hypothesis.
Removing them also reduces computational work and the severity of the multiple testing correction.
Lower values can also be used, but we do not recommend going below 1.
```{r}
keep <- aveLogCPM(y) >= aveLogCPM(5, mean(cd$totals))
cd <- cd[keep,]
y <- y[keep,]
```
We then apply the QL framework to estimate the dispersions, fit a generalized linear model and test for significant differences between conditions.
We refer interested readers to the `r Biocpkg("edgeR")` user's guide for more details.
```{r}
design <- model.matrix(~factor(conditions))
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
res <- glmQLFTest(fit, coef=2)
```
Note that normalization by total cell count per sample is implicitly performed by setting `lib.size=out$totals`.
We do not recommend using `calcNormFactors` in this context, as its assumptions may not be applicable to mass cytometry data.
# Controlling the spatial FDR
To correct for multiple testing, we aim to control the spatial false discovery rate (FDR).
This refers to the FDR across areas of the high-dimensional space.
We do this using the `spatialFDR` function, given the p-values and positions of all tested hyperspheres.
```{r}
qvals <- spatialFDR(intensities(cd), res$table$PValue)
```
Hyperspheres with significant differences in abundance are defined as those detected at a spatial FDR of, say, 5%.
```{r}
is.sig <- qvals <= 0.05
summary(is.sig)
```
This approach is a bit more sophisticated than simply applying the BH method to the hypersphere p-values.
Such a simple approach would fail to account for the different densities of hyperspheres in different parts of the high-dimensional space.
# Visualizing and interpreting the results
## With static plots
To interpret the DA hyperspheres, we use dimensionality reduction to visualize them in a convenient two-dimensional representation.
This is done here with PCA, though for more complex data sets, we suggest using something like `r CRANpkg("Rtsne")`.
```{r}
sig.coords <- intensities(cd)[is.sig,]
sig.res <- res$table[is.sig,]
coords <- prcomp(sig.coords)
```
Each DA hypersphere is represented as a point on the plot below, coloured according to its log-fold change between conditions.
We can see that we've recovered the two DA subpopulations that we put in at the start.
One subpopulation increases in abundance (red) while the other decreases (blue) in the second condition relative to the first.
```{r}
plotSphereLogFC(coords$x[,1], coords$x[,2], sig.res$logFC)
```
This plot should be interpreted by examining the marker intensities, in order to determine what each area of the plot represents.
We suggest using the `plotSphereIntensity` function to make a series of plots for all markers, as shown below.
Colours represent to the median marker intensities of each hypersphere, mapped onto the `r CRANpkg("viridis")` colour scale.
```{r,fig.width=10,fig.height=12}
par(mfrow=c(6,5), mar=c(2.1, 1.1, 3.1, 1.1))
limits <- intensityRanges(cd, p=0.05)
all.markers <- markernames(cd)
for (i in order(all.markers)) {
plotSphereIntensity(coords$x[,1], coords$x[,2], sig.coords[,i],
irange=limits[,i], main=all.markers[i])
}
```
We use the `intensityRanges` function to define the bounds of the colour scale.
This caps the minimum and maximum intensities at the 5^th^ and 95^th^ percentiles, respectively, to avoid colours being skewed by outliers.
Note that both of these functions return a vector of colours, named with the corresponding numeric value of the log-fold change or intensity.
This can be used to construct a colour bar -- see `?plotSphereLogFC` for more details.
## Using a Shiny app
An alternative approach to interpretation is to examine each hypersphere separately, and to determine the cell type corresponding to the hypersphere's intensities.
First, we prune done the number of hyperspheres to be examined in this manner.
This is done by identifying "non-redundant" hyperspheres, i.e., hyperspheres that do not overlap hyperspheres with lower p-values.
```{r}
nonred <- findFirstSphere(intensities(cd), res$table$PValue)
summary(nonred)
```
We pass these hyperspheres to the `interpretSpheres`, which creates a Shiny app where the intensities are displayed.
The idea is to allow users to inspect each hypersphere, annotate it and then save the labels to R once annotation is complete.
See the documentation for more details.
```{r}
all.coords <- prcomp(intensities(cd))
app <- interpretSpheres(cd, select=nonred, metrics=res$table, run=FALSE,
red.coords=all.coords$x[,1:2], red.highlight=is.sig)
# Set run=TRUE if you want the app to run automatically.
```
# Additional notes
Users wanting to identify specific subpopulations may consider using the `selectorPlot` function from `r Biocpkg("scran")`.
This provides an interactive framework by which hyperspheres can be selected and saved to a R session for further examination.
The best markers that distinguish cells in one subpopulation from all others can also be identified using `pickBestMarkers`.
# Session information
```{r}
sessionInfo()
```