---
title: "Quality control with `scater`"
author:
- name: "Davis McCarthy"
  affiliation: 
  - EMBL European Bioinformatics Institute
- name: "Aaron Lun"
  affiliation:
  - Cancer Research UK Cambridge Institute, University of Cambridge
package: scater
output:
    BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Quality control with scater}
  %\VignetteEngine{knitr::rmarkdown}
  %VignetteEncoding{UTF-8}
---

```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE}
## To render an HTML version that works nicely with github and web pages, do:
## rmarkdown::render("vignettes/vignette.Rmd", "all")
library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, 
    error=FALSE, warning=FALSE, message=FALSE, dev = 'png')
library(ggplot2)
theme_set(theme_bw(12))
```

# Overview 

The `scater` package puts a focus on aiding with quality control (QC) and pre-processing of single-cell RNA-seq data before further downstream analysis.
We see QC as consisting of three distinct steps:

1. QC and filtering of cells
2. QC and filtering of features (genes)
3. QC of experimental variables

Following QC, we can proceed with data normalisation before downstream analysis and modelling.
We will demonstrate on some example data generated below:

```{r}
library(scater)
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts), 
    colData = sc_example_cell_info
)
example_sce
```

# Calculating QC metrics

## Introducing the `calculateQCMetrics` function

The `calculateQCMetrics` function computes a number of quality control metrics for each cell and feature, stored in the `colData` and `rowData` respectively.
By default, the QC metrics are computed from the count data, but this can be changed through the `exprs_values` argument.

```{r calc-qc-metrics}
example_sce <- calculateQCMetrics(example_sce)
colnames(colData(example_sce))
colnames(rowData(example_sce))
```

Control sets can be defined for features (e.g., spike-in transcripts, mitochondrial genes) or cells (e.g., empty wells, visually damaged cells).
The function will subsequently compute metrics for each control set, e.g., the proportion of counts assigned to spike-in transcripts.

```{r calc-qc-metrics-multi-feature-controls}
example_sce <- calculateQCMetrics(example_sce, 
    feature_controls = list(ERCC = 1:20, mito = 500:1000),
    cell_controls = list(empty = 1:5, damaged = 31:40))

all_col_qc <- colnames(colData(example_sce))
all_col_qc <- all_col_qc[grep("ERCC", all_col_qc)]
```

Metrics are also computed for the union of all control sets, and for the set of all features/cells that are not marked as controls.
These sets are labelled as `"feature_control"` and `"endogenous"` for features, and `"cell_control"` and `"non_control"` for cells.
Users should avoid using these names for their own sets.

## Cell-level QC metrics

We refer users to the `?calculateQCMetrics` for a full list of the computed cell-level metrics.
However, we will describe some of the more popular ones here:

* `total_counts`: total number of counts for the cell (i.e., the library size).
* `total_features_by_counts`: the number of features for the cell that have counts above the detection limit (default of zero).
* `pct_counts_X`: percentage of all counts that come from the feature control set named X.

If `exprs_values` is set to something other than `"counts"`, the names of the metrics will be changed by swapping `"counts"` for whatever named assay was used.

## Feature-level QC metrics

Feature-level metrics include:

* `mean_counts`: the mean count of the gene/feature.
* `pct_dropout_by_counts`: the percentage of cells with counts of zero for each gene.
* `pct_counts_Y`: percentage of all counts that come from the cell control set named Y.

Again, if a different `exprs_values` was used, the names of the metrics will change correspondingly.

# Producing diagnostic plots for QC

## Examining the most expressed features

We look at a plot that shows the top 50 (by default) most-expressed features. 
Each row in the plot below corresponds to a gene, and each bar corresponds to the expression of a gene in a single cell.
The circle indicates the median expression of each gene, with which genes are sorted.
By default, "expression" is defined using the feature counts (if available), but other expression values can be used instead by changing `exprs_values`.

```{r plot-highest, fig.asp=2, fig.wide=TRUE}
plotQC(example_sce, type = "highest-expression", exprs_values = "counts")
```

We expect to see the "usual suspects", i.e., mitochondrial genes, actin, ribosomal protein, MALAT1. 
A few spike-in transcripts may also be present here, though if all of the spike-ins are in the top 50, it suggests that too much spike-in RNA was added.
A large number of pseudo-genes or predicted genes may indicate problems with alignment.

## Frequency of expression as a function of the mean

Another useful plot is that of the frequency of expression (i.e., number of cells with non-zero expression) against the mean.
These two metrics should be positively correlated with each other for most genes.
 
```{r plot-qc-exprs-freq-vs-mean-default}
plotQC(example_sce, type = "exprs-freq-vs-mean")
```

Outliers from the trend may warrant further investigation.
For example, alignment errors for pseudo-genes of highly-expressed genes will result in features with low means that are expressed in all cells.
Conversely, PCR amplification biases (or the presence of rare populations) may result in genes with very high means that are expressed in very few cells.

## Percentage of counts assigned to feature controls

A particularly useful plot for cell-level QC involves percentage of expression in feature controls against the total number of expressed features.
These two metadata variables can be plotted against each other as shown below.
We take advantage of the `r CRANpkg("ggplot2")` semantics to fine-tune the plot aesthetics and to add a smoothing curve:

```{r plot-pdata-pct-exprs-controls}
plotColData(example_sce, x = "total_features_by_counts",
    y = "pct_counts_feature_control", colour = "Mutation_Status") +
    theme(legend.position = "top") +
    stat_smooth(method = "lm", se = FALSE, size = 2, fullrange = TRUE)
```

Well-behaved cells should have a large number of expressed features and and a low percentage of expression from feature controls. 
High percentage expression from feature controls and few expressed features are indicative of blank and failed cells.

## Cumulative expression plot

The `plotScater` method plots the cumulative proportion of each cell's library assigned to the top highest-expressed features (default 500).
This type of plot visualizes differences in expression distributions for different cells, in the same manner as per-sample boxplots for microarray or bulk RNA-seq data.
Cumulative expression plots are more effective for single-cell data where it is not easy to examine hundreds or thousands of boxplots at once.

With this function, we can split up the cells based on `colData` variables to examine differences in the expression distributions between cells. 
By default, the plot method will try to use count values for the plot, but other data available in the `assays` slot of the object can be used by specifying `exprs_values`.

```{r plot-sceset-blocking}
plotScater(example_sce, block1 = "Mutation_Status", block2 = "Treatment",
     colour_by = "Cell_Cycle", nfeatures = 300, exprs_values = "counts")
```

This approach can allow users to identify large differences in expression distributions across different experimental blocks (e.g., processing batches).

## Plate position plot

For plate-based experiments, it is useful to see how expression or factors vary with the position of cell on the plate.
This can be visualized using the `plotPlatePosition` function:

```{r}
example_sce2 <- example_sce
example_sce2$plate_position <- paste0(
     rep(LETTERS[1:5], each = 8), 
     rep(formatC(1:8, width = 2, flag = "0"), 5)
)
plotPlatePosition(example_sce2, colour_by = "Gene_0001",
    by_exprs_values = "counts") 
```

Systematic trends in expression with the plate position may indicate that there were issues with processing.
The same approach can be used with experimental factors to determine whether cells are appropriately randomized across the plate.

## Other quality control plots

Two feature metadata variables can be easily plotted against each other using the `plotFeatureData` function:

```{r plot-fdata}
plotRowData(example_sce, x = "n_cells_by_counts", y = "mean_counts")
```

The `multiplot` function also allows multiple plots to be generated on the same page, as demonstrated below.

```{r plot-pdata-2, fig.wide=TRUE, fig.asp=0.3}
p1 <- plotColData(example_sce, x = "total_counts", 
    y = "total_features_by_counts")
p2 <- plotColData(example_sce, x = "pct_counts_feature_control",
    y = "total_features_by_counts")
p3 <- plotColData(example_sce, x = "pct_counts_feature_control",
    y = "pct_counts_top_50_features")
multiplot(p1, p2, p3, cols = 3)
```

This is especially useful for side-by-side comparisons between control sets, as demonstrated below for the plot of highest-expressing features.
A plot for non-control cells is shown on the left while the plot for the controls is shown on the right.

```{r highest-2, fig.wide=TRUE}
p1 <- plotQC(example_sce[, !example_sce$is_cell_control],
    type = "highest-expression")
p2 <- plotQC(example_sce[, example_sce$is_cell_control],
    type = "highest-expression")
multiplot(p1, p2, cols = 2)
```

# Filtering the `SingleCellExperiment`

## By cells

### Column subsetting 

Column subsetting of the `SingeCellExperiment` object will only retain the selected cells, thus removing low-quality or otherwise unwanted cells.
We demonstrate below by retaining the first 40 cells.
(This happens to be all the cells in this particular dataset, which are already known to be high-quality.)

```{r}
example_sce <- example_sce[,1:40]
```

`r Biocpkg("scater")` also provides a `filter` function, inspired by the function of the same name in the `dplyr` package and operating in exactly the same manner.
This can be used to very conviently subset (i.e. filter) the cells of an `SingleCellExperiment` object based on its `colData` variables.

```{r}
filter(example_sce, Treatment == "treat1")
```

### Identifying filtering thresholds

We can identify high-quality cells to retain by setting a fixed threshold on particular metrics.
For example, we could retain only cells that have at least 100,000 total counts _and_ at least 500 expressed features:

```{r}
keep.total <- example_sce$total_counts > 1e5
keep.n <- example_sce$total_features_by_counts > 500
filtered <- example_sce[,keep.total & keep.n]
dim(filtered)
```

A more flexible way of choosing thresholds is through the `isOutlier` function.
This defines the threshold at a certain number of median absolute deviations (MADs) away from the median.
Values beyond this threshold are considered outliers and can be filtered out, assuming that they correspond to low-quality cells.
Here, we define small outliers (using `type="lower"`) for the _log_-total counts at 3 MADs from the median.

```{r}
keep.total <- isOutlier(example_sce$total_counts, nmads=3, 
    type="lower", log=TRUE)
filtered <- example_sce[,keep.total]
```

The `isOutlier` approach adjusts to experiment-specific aspects of the data, e.g., sequencing depth, amount of spike-in RNA added, cell type.
In contrast, a fixed threshold would require manual adjustment to account for changes to the experimental protocol or system.
We refer readers to the [_simpleSingleCell_](https://www.bioconductor.org/help/workflows/simpleSingleCell/intro) workflow for more details.

### Identifying outliers on all QC metrics

Outlier cells can also be identified by using the `r CRANpkg("mvoutlier")` package on the QC metrics for all cells.
This will identify cells that have substantially different QC metrics from the others, possibly corresponding to low-quality cells.
We can visualize any outliers using a principal components plot as shown below:

```{r plot-pca-outlier}
example_sce <- runPCA(example_sce, use_coldata = TRUE,
    detect_outliers = TRUE)
plotReducedDim(example_sce, use_dimred="PCA_coldata")
```

Column subsetting can then be performed based on the `$outlier` slot, which indicates whether or not each cell has been designated as an outlier. 
Automatic outlier detection can be informative, but a close inspection of QC metrics and tailored filtering for the specifics of the dataset at hand is strongly recommended.

```{r}
summary(example_sce$outlier)
```

## By features 

It is common to filter out low-abundance features prior to further analyses.
This is easily achieved by row subsetting of the `SingleCellExperiment` object.
In the example below, genes are only retained if they are expressed in four or more cells:

```{r}
keep_feature <- nexprs(example_sce, byrow=TRUE) >= 4
example_sce <- example_sce[keep_feature,]
dim(example_sce)
```

Other filtering can be done using existing annotation.
For example, ribosomal protein genes and predicted genes can be identified (and removed) using regular expressions or biotype information.
Such genes are often (but not always) uninteresting for characterizing population heterogeneity.

# Relationships between experimental factors and expression

We can investigate the relative importance of different explanatory factors with some of the `plotQC` function options. 
We compute the $R^2$ for each factor in `colData(example_sce)` when fitting a linear model regressing expression values for each gene against that factor.
This is best done on the log-expression values to reduce the effect of the mean on the variance - hence, we run `normalize` first.

```{r plot-qc-expl-factors-all}
example_sce <- normalize(example_sce)
plotQC(example_sce, type = "expl")
```

Each line corresponds to one factor and represents the distribution of R^2^ values across all genes.
Alternatively, we can choose a subset of factors to plot in this manner.

```{r plot-qc-expl-variables-select-variables}
plotQC(example_sce, type = "expl",
    variables = c("total_features_by_counts", "total_counts",
        "Mutation_Status", "Treatment", "Cell_Cycle"))
```

We can also produce pairwise scatter plots of potential explanatory variables.
Variables are ranked by their median percentage of expression variance explained in a linear model.

```{r plot-qc-pairs-pc}
plotQC(example_sce, type = "expl", method = "pairs", theme_size = 6)
```

In this small dataset, `total_counts` and `total_features_by_counts` explain a very large proportion of the variance in feature expression. 
The proportion of variance that they explain for a real dataset should be much smaller (say 1-5%).

# Removing technical biases

## Scaling normalization

Scaling normalization accounts for cell-specific biases that scale expression up or down for all genes in a particular cell, e.g., coverage or capture efficiency.
The simplest approach to scaling normalization defines the size factors from the scaled library sizes of all cells.
is done so that the mean size factor is equal to unity, ensuring that the normalized values are on the same scale as the original counts.

```{r}
sizeFactors(example_sce) <- librarySizeFactors(example_sce)
summary(sizeFactors(example_sce))
```

Log-transformed normalized expression values can then be computed with `normalize`, which stores the output in the `"logcounts"` slot.

```{r}
example_sce <- normalize(example_sce)
```

While simple, library size normalization does not account for composition biases that are often present in high-throughput sequencing data.
It also fails to account for differences in the biases affecting spike-in transcripts.
We strongly suggest using the `computeSumFactors` and `computeSpikeFactors` functions from the `r Biocpkg("scran")` package.

## Batch correction

Batch correction accounts for systematic differences in expression between cells in different batches.
Unlike scaling biases, these are usually constant across all cells in a given batch but different for each gene.

Batch effects can be regressed out by using the `removeBatchEffect` function from the `r Biocpkg("limma")` package.
This applies a linear model, usually on the log-expression values to avoid issues with the mean-variance relationship.
To illustrate:

```{r}
library(limma)
batch <- rep(1:2, each=20)
corrected <- removeBatchEffect(logcounts(example_sce), block=batch)
assay(example_sce, "corrected_logcounts") <- corrected
```

Factors of interest can be included in `design` to avoid regressing them out.
This is necessary when they are not orthogonal to the `block`.
However, this assumes that your model is fully specified, which may not be possible when the factors of interest are unknown.
In such cases, an alternative method is to use the `mnnCorrect` approach from `r Biocpkg("scran")`.

# Session information {.unnumbered}

```{r}
sessionInfo()
```