---
title: "Introduction to `scater`: Single-cell analysis toolkit for expression in R"
author:
- name: "Davis McCarthy"
  affiliation: 
  - EMBL European Bioinformatics Institute
- name: "Aaron Lun"
  affiliation:
  - Cancer Research UK Cambridge Institute, University of Cambridge
package: scater
output:
    BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{1. An introduction to the scater package}
  %\VignetteEngine{knitr::rmarkdown}
  %VignetteEncoding{UTF-8}
---


```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE}
## To render an HTML version that works nicely with github and web pages, do:
## rmarkdown::render("vignettes/vignette.Rmd", "all")
library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png',
    warning=FALSE, error=FALSE, message=FALSE)
library(ggplot2)
theme_set(theme_bw(12))
```

# Introduction

This document gives an introduction to and overview of the quality control functionality of the `r Biocpkg("scater")` package.
`r Biocpkg("scater")` contains tools to help with the analysis of single-cell transcriptomic data, focusing on RNA-seq data. 
The package features:

* Use of the `SingleCellExperiment` class as a data container for interoperability with a wide range of other Bioconductor packages;
* Functions to import [_kallisto_](http://pachterlab.github.io/kallisto/) and [_Salmon_](https://combine-lab.github.io/salmon/) results;
* Simple calculation of many quality control metrics from the expression data;
* Many tools for visualising scRNA-seq data, especially diagnostic plots for quality control;
* Subsetting and many other methods for filtering out problematic cells and features;
* Methods for identifying important experimental variables and normalising data ahead of downstream statistical analysis and modeling.

# Creating a `SingleCellExperiment` object

We assume that you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.).
First, we create a `SingleCellExperiment` object containing the data, as demonstrated below with some example data (`"sc_example_counts"`) and metadata (`"sc_example_cell_info"`): 
Rows of the object correspond to features, while columns correspond to samples, i.e., cells in the context of single-cell 'omics data.

```{r quickstart-load-data, message=FALSE, warning=FALSE}
library(scater)
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts), 
    colData = sc_example_cell_info
)
example_sce
```

We usually expect (raw) count data to be labelled as `"counts"` in the assays, which can be easily retrieved with the `counts` accessor.
Getters and setters are also provided for `exprs`, `tpm`, `cpm`, `fpkm` and versions of these with the prefix `norm_`.

```{r quickstart-add-exprs, results='hide'}
str(counts(example_sce))
```

Row and column-level metadata are easily accessed (or modified) as shown below.
There are also dedicated getters and setters for spike-in specifiers (`isSpike`); size factor values (`sizeFactors`); and reduced dimensionality results (`reducedDim`).

```{r}
example_sce$whee <- sample(LETTERS, ncol(example_sce), replace=TRUE)
colData(example_sce)
rowData(example_sce)$stuff <- runif(nrow(example_sce))
rowData(example_sce)
```

Subsetting is very convenient with this class, as both data and metadata are processed in a synchronized manner.
For example, we can filter out features (genes) that are not expressed in any cells:

```{r filter-no-exprs}
keep_feature <- rowSums(counts(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]
```

More details about the `SingleCellExperiment` class can be found in the documentation for `r Biocpkg("SingleCellExperiment")` package.

# Calculating a variety of expression values

We calculate counts-per-million using the aptly-named `calculateCPM` function.
The output is most appropriately stored as an assay named `"cpm"` in the assays of the `SingleCellExperiment` object.

```{r}
cpm(example_sce) <- calculateCPM(example_sce)
```

Another option is to use the `normalize` function, which calculates log~2~-transformed normalized expression values.
This is done by dividing each count by its size factor (or scaled library size, if no size factors are defined), adding a pseudo-count and log-transforming.
The resulting values can be interpreted on the same scale as log-transformed counts, and are stored in `"logcounts"`.

```{r}
example_sce <- normalize(example_sce)
assayNames(example_sce)
```

Note that `exprs` is a synonym for `logcounts` when accessing or setting data.
This is done for backwards compatibility with older verions of _scater_.

```{r}
identical(exprs(example_sce), logcounts(example_sce))
```

Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay.

```{r}
assay(example_sce, "is_expr") <- counts(example_sce)>0
```

The `calcAverage` function will compute the average count for each gene after scaling each cell's counts by its size factor.
If size factors are not available, it will compute a size factor from the library size.

```{r}
head(calcAverage(example_sce))
```

# Other methods of data import 

Count matrices stored as CSV files or equivalent can be easily read into R session using `read.table` from _utils_ or `fread` from the `r CRANpkg("data.table")` package.
It is advisable to coerce the resulting object into a matrix before storing it in a `SingleCellExperiment` object.

For large data sets, the matrix can be read in chunk-by-chunk with progressive coercion into a sparse matrix from the `r CRANpkg("Matrix")` package.
This is performed using `readSparseCounts` and reduces memory usage by not explicitly storing zeroes in memory.

Data from 10X Genomics experiments can be read in using the `read10xCounts` function from the `r Biocpkg("DropletUtils")` package.
This will automatically generate a `SingleCellExperiment` with a sparse matrix, see the documentation for more details.

`r Biocpkg("scater")` also provides wrapper functions `readSalmonResults` or `readKallistoResults` to import transcript abundances from the `kallisto` and `Salmon` pseudo-aligners.
This is done using methods from the `r Biocpkg("tximport")` package.

# Transitioning from the `SCESet` class

As of July 2017, `scater` has switched from the `SCESet` class previously defined within the package to the more widely applicable `SingleCellExperiment` class. 
From Bioconductor 3.6 (October 2017), the release version of `scater` will use `SingleCellExperiment`. 
`SingleCellExperiment` is a more modern and robust class that provides a common data structure used by many single-cell Bioconductor packages. 
Advantages include support for sparse data matrices and the capability for on-disk storage of data to minimise memory usage for large single-cell datasets.

It should be straight-forward to convert existing scripts based on `SCESet` objects to `SingleCellExperiment` objects, with key changes outlined immediately below.

* The functions `toSingleCellExperiment` and `updateSCESet` (for backwards compatibility) can be used to convert an old `SCESet` object to a `SingleCellExperiment` object;
* Create a new `SingleCellExperiment` object with the function `SingleCellExperiment` (actually less fiddly than creating a new `SCESet`);
* `scater` functions have been refactored to take `SingleCellExperiment`  objects, so once data is in a `SingleCellExperiment` object, the user experience is almost identical to that with the `SCESet` class.

Users may need to be aware of the following when updating their own scripts:

* Cell names can now be accessed/assigned with the `colnames` function (instead of `sampleNames` or `cellNames` for an `SCESet` object);
* Feature (gene/transcript) names should now be accessed/assigned with the `rownames` function (instead of `featureNames`);
* Cell metadata, stored as `phenoData` in an `SCESet`, corresponds to `colData` in a `SingleCellExperiment` object and is accessed/assigned with the `colData` function (this replaces the `pData` function);
* Individual cell-level variables can still be accessed with the `$` operator (e.g. `sce$total_counts`);
* Feature metadata, stored as `featureData` in an `SCESet`, corresponds to `rowData` in a `SingleCellExperiment` object and is accessed/assigned with the `rowData` function (this replaces the `fData` function);
* `plotScater`, which produces a cumulative expression, overview plot, replaces 
the generic `plot` function for `SCESet` objects.

# Session information {.unnumbered}

```{r}
sessionInfo()
```