---
title: "An introduction to the SingleCellExperiment class"
author: 
- name: Davide Risso
  affiliation: Division of Biostatistics and Epidemiology, Weill Cornell Medicine 
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
package: SingleCellExperiment
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{1. An introduction to the SingleCellExperiment class}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r options, include=FALSE, echo=FALSE}
library(BiocStyle)
knitr::opts_chunk$set(warning=FALSE, error=FALSE, message=FALSE)
```

# Motivation

The `SingleCellExperiment` class is a light-weight container for single-cell genomics data.
It extends the `RangedSummarizedExperiment` class and follows similar conventions, 
i.e., rows should represent features (genes, transcripts, genomic regions) and columns should represent cells.
It provides methods for storing dimensionality reduction results and data for alternative feature sets (e.g., synthetic spike-in transcripts, antibody tags).
It is the central data structure for Bioconductor single-cell packages like `r Biocpkg("scater")` and `r Biocpkg("scran")`.

# Creating SingleCellExperiment instances

`SingleCellExperiment` objects can be created via the constructor of the same name:

```{r construct}
library(SingleCellExperiment)
counts <- matrix(rpois(100, lambda = 10), ncol=10, nrow=10)
sce <- SingleCellExperiment(assays = list(counts = counts))
sce
```

An alternative approach is via coercion from `SummarizedExperiment` objects.

```{r coerce}
se <- SummarizedExperiment(list(counts=counts))
as(se, "SingleCellExperiment")
```

To demonstrate the use of the class, we will use the Allen data set from the `r Biocpkg("scRNAseq")` package.

```{r fluidigm}
library(scRNAseq)
sce <- ReprocessedAllenData("tophat_counts")
sce
```

All operations that can be applied to a `RangedSummarizedExperiment` are also applicable to any `SingleCellExperiment` instance.
This includes access to assay data via `assay()`, column metadata with `colData()`, and so on.
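
As a minimal sketch of these inherited accessors, the chunk below builds a small toy object and queries it.
The names `toy`, `toy_counts`, `batch` and `gene_length` are illustrative only and are not part of the Allen data.

```{r}
library(SingleCellExperiment)

# Toy data: 5 genes x 4 cells; all names and values are illustrative.
set.seed(100)
toy_counts <- matrix(rpois(20, lambda = 5), nrow = 5, ncol = 4,
    dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4)))

toy <- SingleCellExperiment(
    assays = list(counts = toy_counts),
    colData = DataFrame(batch = c("a", "a", "b", "b")),
    rowData = DataFrame(gene_length = c(100, 200, 300, 400, 500))
)

assay(toy, "counts")[1:2, ]       # assay matrix, retrieved by name
colData(toy)$batch                # per-cell annotation
rowData(toy)$gene_length          # per-feature annotation
dim(toy[1:3, toy$batch == "a"])   # rows and columns are subset together
```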

# Adding low-dimensional representations

We compute log-transformed normalized expression values from the count matrix. 
(We note that many of these steps can be performed as one-liners from the `r Biocpkg("scater")` package,
but we will show them here in full to demonstrate the capabilities of the `SingleCellExperiment` class.)

```{r subset}
counts <- assay(sce, "tophat_counts")
libsizes <- colSums(counts)
size.factors <- libsizes/mean(libsizes)
logcounts(sce) <- log2(t(t(counts)/size.factors) + 1)
assayNames(sce)
```
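
To make the arithmetic concrete, here is a small sketch of the same calculation on a made-up matrix (the `demo_*` names are hypothetical).
Because each library size is divided by the mean library size, the size factors average to one, and dividing each column by its size factor brings all cells to the same total.

```{r}
# Made-up counts for 3 genes (rows) in 2 cells (columns).
demo_counts <- matrix(c(10, 20, 70,
                         5, 10, 35), nrow = 3)

demo_libsizes <- colSums(demo_counts)             # 100 and 50
demo_sf <- demo_libsizes / mean(demo_libsizes)    # 100/75 and 50/75
mean(demo_sf)                                     # size factors average to 1

# t(t(x) / v) divides column j of x by v[j].
demo_norm <- t(t(demo_counts) / demo_sf)
colSums(demo_norm)                                # both cells now total 75
log2(demo_norm + 1)
```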

We obtain the PCA and t-SNE representations of the data and add them to the object with the `reducedDims()<-` method.

```{r pca}
# prcomp() expects cells in rows; keep the top 50 principal components.
pca_data <- prcomp(t(logcounts(sce)), rank.=50)

library(Rtsne)
set.seed(5252)
tsne_data <- Rtsne(pca_data$x[,1:50], pca = FALSE)

reducedDims(sce) <- list(PCA=pca_data$x, TSNE=tsne_data$Y)
sce
```

The stored coordinates can be retrieved by name or by numerical index.
Each row of the coordinate matrix is assumed to correspond to a cell, while each column represents a dimension.

```{r}
reducedDims(sce)
reducedDimNames(sce)
head(reducedDim(sce, "PCA")[,1:2])
head(reducedDim(sce, "TSNE")[,1:2])
```
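
A single representation can also be added with the `reducedDim()<-` setter rather than by replacing the whole list.
The sketch below uses a toy object and random placeholder coordinates (the names `rd_toy`, `RandomA` and `RandomB` are purely illustrative) to show retrieval by name and by index, and what happens when the number of rows does not match the number of cells.

```{r}
library(SingleCellExperiment)

# Toy object with 10 genes and 6 cells; the coordinates below are random
# placeholders rather than the output of a real dimensionality reduction.
set.seed(42)
rd_toy <- SingleCellExperiment(
    assays = list(counts = matrix(rpois(60, lambda = 5), nrow = 10, ncol = 6)))

# One row per cell, one column per dimension.
reducedDim(rd_toy, "RandomA") <- matrix(rnorm(12), nrow = 6, ncol = 2)
reducedDim(rd_toy, "RandomB") <- matrix(rnorm(18), nrow = 6, ncol = 3)

reducedDimNames(rd_toy)
dim(reducedDim(rd_toy, "RandomB"))  # retrieved by name
dim(reducedDim(rd_toy, 2))          # the same entry, retrieved by index

# A matrix with 3 rows cannot describe 6 cells, so the assignment is
# expected to fail; tryCatch() reports the outcome without stopping.
tryCatch({
    reducedDim(rd_toy, "Bad") <- matrix(0, nrow = 3, ncol = 2)
    "accepted"
}, error = function(e) "rejected")
```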

Any subsetting by column of `sce` will also lead to subsetting of the dimensionality reduction results by cell.

```{r}
dim(reducedDim(sce, "PCA"))
dim(reducedDim(sce[,1:10], "PCA"))
```

# Convenient access to named assays

In the `SingleCellExperiment`, users can assign arbitrary names to entries of `assays`.
To assist interoperability between packages, we provide some suggestions for what the names should be for particular types of data:

- `counts`: Raw count data, e.g., number of reads or transcripts for a particular gene.
- `normcounts`: Normalized values on the same scale as the original counts.
For example, counts divided by cell-specific size factors that are centred at unity.
- `logcounts`: Log-transformed counts or count-like values.
In most cases, this will be defined as log-transformed `normcounts`, e.g., using log base 2 and a pseudo-count of 1.
- `cpm`: Counts-per-million.
This is the read count for each gene in each cell, divided by the library size of each cell in millions.
- `tpm`: Transcripts-per-million.
This is the number of transcripts for each gene in each cell, divided by the total number of transcripts in that cell (in millions).

Each of these suggested names has an appropriate getter/setter method for convenient manipulation of the `SingleCellExperiment`.
For example, we can take the (very specifically named) `tophat_counts` assay and store it under the name `counts` instead:

```{r}
counts(sce) <- assay(sce, "tophat_counts")
sce
dim(counts(sce))
```

This means that functions expecting count data can simply call `counts()` without worrying about package-specific naming conventions.
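
The other suggested names work the same way.
As a rough sketch on made-up counts (the objects `nm_counts`, `nm_toy`, `lib` and `sf` are hypothetical), the chunk below fills `normcounts`, `logcounts` and `cpm` using the definitions listed above and their respective setters.

```{r}
library(SingleCellExperiment)

# Made-up counts for 4 genes (rows) and 3 cells (columns).
nm_counts <- matrix(c(1, 2, 3, 4,
                      2, 4, 6, 8,
                      0, 5, 5, 10), nrow = 4)
nm_toy <- SingleCellExperiment(assays = list(counts = nm_counts))

lib <- colSums(counts(nm_toy))   # library size per cell
sf <- lib / mean(lib)            # size factors centred at unity

normcounts(nm_toy) <- t(t(counts(nm_toy)) / sf)
logcounts(nm_toy) <- log2(normcounts(nm_toy) + 1)
cpm(nm_toy) <- t(t(counts(nm_toy)) / (lib / 1e6))

assayNames(nm_toy)
colSums(cpm(nm_toy))             # one million per cell
```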

# Adding alternative feature sets

Many scRNA-seq experiments contain sequencing data for multiple feature types beyond the endogenous genes:

- Externally added spike-in transcripts for plate-based experiments.
- Antibody tags for CITE-seq experiments.
- CRISPR tags for CRISPR-seq experiments.
- Allele information for experiments involving multiple genotypes.

Such features can be stored inside the `SingleCellExperiment` via the concept of "alternative Experiments".
These are nested `SummarizedExperiment` instances that are guaranteed to have the same number and ordering of columns as the `SingleCellExperiment` itself.
Data for endogenous genes and other features can thus be kept separate, which is often desirable as they need to be processed differently.

To illustrate, consider the case of the spike-in transcripts in the Allen data. 
We move the corresponding rows out of the assays of the main `SingleCellExperiment` and into a nested alternative Experiment.

```{r}
is.spike <- grepl("^ERCC-", rownames(sce))
sce <- splitAltExps(sce, ifelse(is.spike, "ERCC", "gene"))
altExpNames(sce)
```

The `altExp()` method returns a self-contained `SingleCellExperiment` instance containing only the spike-in transcripts.

```{r}
altExp(sce)
```

Each alternative Experiment can have a different set of assays from the main `SingleCellExperiment`.
This is useful in cases where the other feature types must be normalized or transformed differently.
Similarly, the alternative Experiments can have different `rowData()` from the main object.

```{r}
rowData(altExp(sce))$concentration <- runif(nrow(altExp(sce)))
rowData(altExp(sce))
rowData(sce)
```
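
Alternative Experiments can also be attached directly with the `altExp()<-` setter.
The sketch below is a toy example with a hypothetical set of antibody tags (`ae_toy`, `adt` and the name `"ADT"` are illustrative); it also checks that subsetting the main object by column subsets the nested object, in line with the guarantee that both share the same columns.

```{r}
library(SingleCellExperiment)

set.seed(7)
# Main object: 8 genes x 5 cells. All names and values are illustrative.
ae_toy <- SingleCellExperiment(
    assays = list(counts = matrix(rpois(40, lambda = 5), nrow = 8, ncol = 5)))

# A hypothetical set of 3 antibody tags measured in the same 5 cells.
adt <- SummarizedExperiment(
    assays = list(counts = matrix(rpois(15, lambda = 50), nrow = 3, ncol = 5)))
altExp(ae_toy, "ADT") <- adt

altExpNames(ae_toy)
dim(altExp(ae_toy, "ADT"))

# Subsetting the main object by column subsets the nested object as well.
dim(altExp(ae_toy[, 1:2], "ADT"))
```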

# Session Info

```{r}
sessionInfo()
```