--- title: "Introduction to `scater`: Single-cell analysis toolkit for expression in R" author: - name: "Davis McCarthy" affiliation: - EMBL European Bioinformatics Institute - name: "Aaron Lun" affiliation: - Cancer Research UK Cambridge Institute, University of Cambridge package: scater output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{1. An introduction to the scater package} %\VignetteEngine{knitr::rmarkdown} %VignetteEncoding{UTF-8} --- ```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE} ## To render an HTML version that works nicely with github and web pages, do: ## rmarkdown::render("vignettes/vignette.Rmd", "all") library(knitr) opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png', warning=FALSE, error=FALSE, message=FALSE) library(ggplot2) theme_set(theme_bw(12)) ``` # Introduction This document gives an introduction to and overview of the quality control functionality of the `r Biocpkg("scater")` package. `r Biocpkg("scater")` contains tools to help with the analysis of single-cell transcriptomic data, focusing on RNA-seq data. The package features: * Use of the `SingleCellExperiment` class as a data container for interoperability with a wide range of other Bioconductor packages; * Functions to import [_kallisto_](http://pachterlab.github.io/kallisto/) and [_Salmon_](https://combine-lab.github.io/salmon/) results; * Simple calculation of many quality control metrics from the expression data; * Many tools for visualising scRNA-seq data, especially diagnostic plots for quality control; * Subsetting and many other methods for filtering out problematic cells and features; * Methods for identifying important experimental variables and normalising data ahead of downstream statistical analysis and modeling. # Creating a `SingleCellExperiment` object We assume that you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.). First, we create a `SingleCellExperiment` object containing the data, as demonstrated below with some example data (`"sc_example_counts"`) and metadata (`"sc_example_cell_info"`): Rows of the object correspond to features, while columns correspond to samples, i.e., cells in the context of single-cell 'omics data. ```{r quickstart-load-data, message=FALSE, warning=FALSE} library(scater) data("sc_example_counts") data("sc_example_cell_info") example_sce <- SingleCellExperiment( assays = list(counts = sc_example_counts), colData = sc_example_cell_info ) example_sce ``` We usually expect (raw) count data to be labelled as `"counts"` in the assays, which can be easily retrieved with the `counts` accessor. Getters and setters are also provided for `exprs`, `tpm`, `cpm`, `fpkm` and versions of these with the prefix `norm_`. ```{r quickstart-add-exprs, results='hide'} str(counts(example_sce)) ``` Row and column-level metadata are easily accessed (or modified) as shown below. There are also dedicated getters and setters for spike-in specifiers (`isSpike`); size factor values (`sizeFactors`); and reduced dimensionality results (`reducedDim`). ```{r} example_sce$whee <- sample(LETTERS, ncol(example_sce), replace=TRUE) colData(example_sce) rowData(example_sce)$stuff <- runif(nrow(example_sce)) rowData(example_sce) ``` Subsetting is very convenient with this class, as both data and metadata are processed in a synchronized manner. For example, we can filter out features (genes) that are not expressed in any cells: ```{r filter-no-exprs} keep_feature <- rowSums(counts(example_sce) > 0) > 0 example_sce <- example_sce[keep_feature,] ``` More details about the `SingleCellExperiment` class can be found in the documentation for `r Biocpkg("SingleCellExperiment")` package. # Calculating a variety of expression values We calculate counts-per-million using the aptly-named `calculateCPM` function. The output is most appropriately stored as an assay named `"cpm"` in the assays of the `SingleCellExperiment` object. ```{r} cpm(example_sce) <- calculateCPM(example_sce) ``` Another option is to use the `normalize` function, which calculates log~2~-transformed normalized expression values. This is done by dividing each count by its size factor (or scaled library size, if no size factors are defined), adding a pseudo-count and log-transforming. The resulting values can be interpreted on the same scale as log-transformed counts, and are stored in `"logcounts"`. ```{r} example_sce <- normalize(example_sce) assayNames(example_sce) ``` Note that `exprs` is a synonym for `logcounts` when accessing or setting data. This is done for backwards compatibility with older verions of _scater_. ```{r} identical(exprs(example_sce), logcounts(example_sce)) ``` Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay. ```{r} assay(example_sce, "is_expr") <- counts(example_sce)>0 ``` The `calcAverage` function will compute the average count for each gene after scaling each cell's counts by its size factor. If size factors are not available, it will compute a size factor from the library size. ```{r} head(calcAverage(example_sce)) ``` # Other methods of data import Count matrices stored as CSV files or equivalent can be easily read into R session using `read.table` from _utils_ or `fread` from the `r CRANpkg("data.table")` package. It is advisable to coerce the resulting object into a matrix before storing it in a `SingleCellExperiment` object. For large data sets, the matrix can be read in chunk-by-chunk with progressive coercion into a sparse matrix from the `r CRANpkg("Matrix")` package. This is performed using `readSparseCounts` and reduces memory usage by not explicitly storing zeroes in memory. Data from 10X Genomics experiments can be read in using the `read10xCounts` function from the `r Biocpkg("DropletUtils")` package. This will automatically generate a `SingleCellExperiment` with a sparse matrix, see the documentation for more details. `r Biocpkg("scater")` also provides wrapper functions `readSalmonResults` or `readKallistoResults` to import transcript abundances from the `kallisto` and `Salmon` pseudo-aligners. This is done using methods from the `r Biocpkg("tximport")` package. # Transitioning from the `SCESet` class As of July 2017, `scater` has switched from the `SCESet` class previously defined within the package to the more widely applicable `SingleCellExperiment` class. From Bioconductor 3.6 (October 2017), the release version of `scater` will use `SingleCellExperiment`. `SingleCellExperiment` is a more modern and robust class that provides a common data structure used by many single-cell Bioconductor packages. Advantages include support for sparse data matrices and the capability for on-disk storage of data to minimise memory usage for large single-cell datasets. It should be straight-forward to convert existing scripts based on `SCESet` objects to `SingleCellExperiment` objects, with key changes outlined immediately below. * The functions `toSingleCellExperiment` and `updateSCESet` (for backwards compatibility) can be used to convert an old `SCESet` object to a `SingleCellExperiment` object; * Create a new `SingleCellExperiment` object with the function `SingleCellExperiment` (actually less fiddly than creating a new `SCESet`); * `scater` functions have been refactored to take `SingleCellExperiment` objects, so once data is in a `SingleCellExperiment` object, the user experience is almost identical to that with the `SCESet` class. Users may need to be aware of the following when updating their own scripts: * Cell names can now be accessed/assigned with the `colnames` function (instead of `sampleNames` or `cellNames` for an `SCESet` object); * Feature (gene/transcript) names should now be accessed/assigned with the `rownames` function (instead of `featureNames`); * Cell metadata, stored as `phenoData` in an `SCESet`, corresponds to `colData` in a `SingleCellExperiment` object and is accessed/assigned with the `colData` function (this replaces the `pData` function); * Individual cell-level variables can still be accessed with the `$` operator (e.g. `sce$total_counts`); * Feature metadata, stored as `featureData` in an `SCESet`, corresponds to `rowData` in a `SingleCellExperiment` object and is accessed/assigned with the `rowData` function (this replaces the `fData` function); * `plotScater`, which produces a cumulative expression, overview plot, replaces the generic `plot` function for `SCESet` objects. # Session information {.unnumbered} ```{r} sessionInfo() ```