--- title: "An introduction to miQC" author: "Ariel Hippen and Stephanie Hicks" date: "Compiled: `r format(Sys.time(), '%B %d, %Y')`" bibliography: biblio.bib output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{miQC} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- # Installation To install the package, please use the following. ```{r, eval=FALSE} if (!requireNamespace("BiocManager")) install.packages("BiocManager") BiocManager::install("miQC") ``` ```{r options, include=FALSE, message=FALSE, warning=FALSE} knitr::opts_chunk$set(cache=FALSE, error=FALSE, message=FALSE, warning=FALSE) ``` # Introduction This vignette provides a basic example of how to run miQC, which allows users to perform cell-wise filtering of single-cell RNA-seq data for quality control. Single-cell RNA-seq data is very sensitive to tissue quality and choice of experimental workflow; it’s critical to ensure compromised cells and failed cell libraries are removed. A high proportion of reads mapping to mitochondrial DNA is one sign of a damaged cell, so most analyses will remove cells with mtRNA over a certain threshold, but those thresholds can be arbitrary and/or detrimentally stringent, especially for archived tumor tissues. miQC jointly models both the proportion of reads mapping to mtDNA genes and the number of detected genes with mixture models in a probabilistic framework to identify the low-quality cells in a given dataset. You'll need the following packages installed to work this tutorial: ```{r} suppressPackageStartupMessages({ library(SingleCellExperiment) library(scRNAseq) library(scater) library(flexmix) library(splines) library(BiocParallel) library(biomaRt) library(miQC) }) ``` ## Example data To demonstrate how to run miQC on a single-cell RNA-seq dataset, we'll use data from mouse brain cells, originating from an experiment by Zeisel et al [@zeisel_brain_2015], and available through the Bioconductor package _scRNAseq_. ```{r} sce <- ZeiselBrainData() sce ``` ## Scater preprocessing In order to calculate the percent of reads in a cell that map to mitochondrial genes, we first need to establish which genes are mitochondrial. For genes listed as HGNC symbols, this is as simple as searching for genes starting with _mt-_. For other IDs, we recommend using a _biomaRt_ query to map to chromosomal location and identify all mitochondrial genes. ```{r} mt_genes <- grepl("^mt-", rownames(sce)) feature_ctrls <- list(mito = rownames(sce)[mt_genes]) feature_ctrls ``` _miQC_ is designed to be run with the Bioconductor package _scater_, which has a built-in function _addPerCellQC_ to calculate basic QC metrics like number of unique genes detected per cell and total number of reads. When we pass in our list of mitochondrial genes, it will also calculate percent mitochondrial reads. ``` {r} sce <- addPerCellQC(sce, subsets = feature_ctrls, BPPARAM = MulticoreParam()) head(colData(sce)) ``` # miQC Using the QC metrics generated via _scater_, we can use the _plotMetrics_ function to visually inspect the quality of our dataset. ``` {r} plotMetrics(sce) ``` We can see that most cells have a fairly low proportion of mitochondrial reads, given that the graph is much denser at the bottom. We likely have many cells that are intact and biologically meaningful. There are also a few cells that have almost half of their reads mapping to mitochondrial genes, which are likely broken or otherwise compromised and we will want to exclude from our downstream analysis. However, it's not clear what boundaries to draw to separate the two groups of cells. With that in mind, we'll generate a linear mixture model using the _mixtureModel_ function. ```{r} model <- mixtureModel(sce) ``` This function is a wrapper for _flexmix_, which fits a mixture model on our data and returns the parameters of the two lines that best fit the data, as well as the posterior probability of each cell being derived from each distribution. We can look at the parameters and posterior values directly with the functions ``` {r} parameters(model) head(posterior(model)) ``` Or we can visualize the model results using the _plotModel_ function: ```{r} plotModel(sce, model) ``` As expected, the cells at the very top of the graph are almost certainly compromised, most likely to have been derived from the distribution with fewer unique genes and higher baseline mitochondrial expression. We can use these posterior probabilities to choose which cells to keep, and visualize the consequences of this filtering with the _plotFiltering_ function. ```{r} plotFiltering(sce, model) ``` To actually perform the filtering and remove the indicated cells from our SingleCellExperiment object, we can run the _filterCells_ parameter. ```{r} sce <- filterCells(sce, model) sce ``` ## Extras In most cases, a linear mixture model will be satisfactory as well as simplest, but _miQC_ also supports some non-linear mixture models: currently polynomials and b-splines. A user should only need to change the _model_type_ parameter when making the model, and all visualization and filtering functions will work the same as with a linear model. ```{r} sce <- ZeiselBrainData() sce <- addPerCellQC(sce, subsets = feature_ctrls, BPPARAM = MulticoreParam()) model2 <- mixtureModel(sce, model_type = "spline") plotModel(sce, model2) plotFiltering(sce, model2) ``` Also, _miQC_ defaults to removing any cell with 75% or greater posterior probability of being compromised, but if we want to be more or less stringent, we can alter the _posterior_cutoff_ parameter, like so: ```{r} plotFiltering(sce, model2, posterior_cutoff = 0.9) ``` Note that when performing miQC multiple times on different samples for the same experiment, it's recommended to select the same _posterior_cutoff_ for all, to give consistency in addition to the flexibility of sample-specific models. ## When not to use miQC The miQC model is based on the assumption that there are a non-trivial number of compromised cells in the dataset, which is not true in all datasets. We recommend running _plotMetrics_ on a dataset before running miQC to see if the two-distribution model is appropriate. Look for the distinctive triangular shape where cells have a wide variety of mitochondrial percentages at lower gene counts and taper off to lower mitochondrial percentage at higher gene counts, as can be seen in the Zeisel data. For example of a dataset where there's not a significant number of compromised cells, so the two-distribution assumption is not met, look at another dataset from the _scRNAseq_ package, mouse data from Lun et al [@lun_assessing_2017]. Note that this dataset uses ensembl IDs, so we will use biomaRt to determine which genes are mitochondrial and calculate mitochondrial percentage. ```{r} sce <- LunSpikeInData() mouse_mart <- useEnsembl("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl") id_map <- getBM(values = rownames(sce), filters = "ensembl_gene_id", attributes = c("ensembl_gene_id", "chromosome_name"), mart = mouse_mart) mt_genes <- subset(id_map,id_map$chromosome_name=="MT")$ensembl_gene_id feature_ctrls <- list(mito = mt_genes) sce <- addPerCellQC(sce, subsets = feature_ctrls, BPPARAM = BiocParallel::MulticoreParam()) plotMetrics(sce) ``` The _mixtureModel_ function will throw a warning if only one distribution is found, in which case no cells would be filtered. In these cases, we recommend using other filtering methods, such as a cutoff on mitochondrial percentage or identifying outliers using median absolute deviation (MAD). # Session Information ```{r, echo=FALSE} ## Session info options(width = 120) sessionInfo() ``` # References