---
title: "An introduction to the SingleCellExperiment class"
author: "Davide Risso and Aaron Lun"
package: SingleCellExperiment
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{1. An introduction to the SingleCellExperiment class}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r options, include=FALSE, echo=FALSE}
library(BiocStyle)
knitr::opts_chunk$set(warning=FALSE, error=FALSE, message=FALSE)
```

# The `SingleCellExperiment` class definition

The `SingleCellExperiment` class is a light-weight container for single-cell genomics data.
It extends the `RangedSummarizedExperiment` class and follows similar conventions, 
i.e., rows should represent features (genes, transcripts, genomic regions) and columns should represent cells.
In addition to the slots already present in the `RangedSummarizedExperiment`, the `SingleCellExperiment` class contains:

* `int_elementMetadata`, to hold internal metadata for each row.
* `int_colData`, to hold internal metadata for each column.
* `int_metadata`, to hold internal metadata for the entire object.
* `reducedDims`, to hold dimensionality reduction results.

The `int_` prefix describes _internal_ slots that are **not** meant for direct manipulation by the user.
Instead, they are set by other user-visible methods, which will be discussed in more detail below.
Package developers should see the other vignette for how to define methods to access these internal fields.

# Creating SingleCellExperiment instances

`SingleCellExperiment` objects can be created via the constructor of the same name:

```{r construct}
library(SingleCellExperiment)
counts <- matrix(rpois(100, lambda = 10), ncol=10, nrow=10)
sce <- SingleCellExperiment(assays = list(counts = counts))
sce
```

An alternative approach is via coercion from `SummarizedExperiment` objects.

```{r coerce}
se <- SummarizedExperiment(list(counts=counts))
as(se, "SingleCellExperiment")
```

To demonstrate the use of the class, we will use a subset of the `allen` data set from the `r Biocpkg("scRNAseq")` package.
These data are stored as a `SummarizedExperiment`, so we will first coerce it into a `SingleCellExperiment`.

```{r fluidigm}
library(scRNAseq)
data(allen)
allen
sce <- as(allen, "SingleCellExperiment")
sce
```

# Adding spike-in information

One of the main additions to `SummarizedExperiment` is the ability for the user to specify the rows corresponding to spike-in transcripts.
This is done with the method `isSpike`, using an appropriate name for the spike-in set.

```{r spikes}
isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))
sce
```

The identities of the spike-in rows can be easily retrieved using the name of the spike-in set, as shown below.
The names of currently available spike-in sets can also be returned with the `spikeNames` method.

```{r}
table(isSpike(sce, "ERCC"))
spikeNames(sce)
```

While most experimental designs use a single set of spike-ins, the class has the flexibility of including more than one set of spikes.
Let us pretend that the members of the _Adam_ gene family have been spiked-in as external genes in these data.

```{r spikes2}
isSpike(sce, "Adam") <- grepl("^Adam[0-9]", rownames(sce))
sce
table(isSpike(sce, "Adam"))
spikeNames(sce)
```

If `isSpike` is used without specifying any name, it will return the union of all spike-in sets.

```{r}
table(isSpike(sce))
```

Running `isSpike<-` with `NULL` will clear the specified set from the `SingleCellExperiment`, while running `clearSpikes` will delete all spike-in information.

```{r}
temp <- sce
isSpike(temp, "Adam") <- NULL
spikeNames(temp)
temp <- clearSpikes(temp)
spikeNames(temp)
```

Note that the `isSpike` and `isSpike<-` methods get and set columns in `int_elementMetadata` and `int_metadata`.
This information is only relevant to package developers and not necessary for routine use of this class.

# Adding size factors

One can also store size factors in the `SingleCellExperiment` object. 
For illustration, we simply compute the total number of reads as size factors here.
Note that more sophisticated methods for computing size factors are available (see, e.g., `r Biocpkg("scran")`).

```{r sizeFactors}
sizeFactors(sce) <- colSums(assay(sce))
head(sizeFactors(sce))
```

We can compute multiple size factors and store them in the object, by providing a name to `sizeFactors`.
This does _not_ affect the values of the unnamed size factors.

```{r sizeFactors2}
sizeFactors(sce, "ERCC") <- colSums(assay(sce)[isSpike(sce, "ERCC"),])
head(sizeFactors(sce, "ERCC"))
head(sizeFactors(sce)) # same as before
```

# Extracting `colData` and `rowData` 

The `colData` and `rowData` methods can be used to retrieve the stored sample- and gene-level metadata. 
By default, this will only return the user-visible metadata fields, i.e., not including the fields stored in the `int_*` slots.

```{r metadata}
colData(sce)
rowData(sce)
```

However, it is sometimes useful to retrieve a `DataFrame` with the internal fields, i.e., spike-in and size factor information.
This can be achieved by specifying `internal=TRUE`.

```{r metadata2}
colData(sce, internal=TRUE)
rowData(sce, internal=TRUE)
```

See below for some discussion of why an internal storage mechanism is used here.

# Adding low-dimensional representations

For simplicity and speed, we work on a subset of 100 genes. 
To avoid ending up with only uninteresting genes, we extract the 100 genes with maximal variance in the log-transformed counts.

```{r subset}
library(magrittr)
assay(sce) %>% log1p %>% rowVars -> vars
names(vars) <- rownames(sce)
vars <- sort(vars, decreasing = TRUE)

sce_sub <- sce[names(vars[1:100]),]
sce_sub
```

We obtain the PCA and t-SNE representations of the data and add them to the object with the `reducedDims` method.

```{r pca}
library(Rtsne)
set.seed(5252)

pca_data <- prcomp(t(log1p(assay(sce_sub))))
tsne_data <- Rtsne(pca_data$x[,1:50], pca = FALSE)

reducedDims(sce_sub) <- SimpleList(PCA=pca_data$x, TSNE=tsne_data$Y)
sce_sub
```

The stored coordinates can be retrieved by name or by numerical index.
Each row of the coordinate matrix is assumed to correspond to a cell, while each column represents a dimension.

```{r}
reducedDims(sce_sub)
reducedDimNames(sce_sub)
head(reducedDim(sce_sub, "PCA")[,1:2])
head(reducedDim(sce_sub, "TSNE")[,1:2])
```

Any subsetting by column of `sce_sub` will also lead to subsetting of the dimensionality reduction results by cell.

```{r}
dim(reducedDim(sce_sub, "PCA"))
dim(reducedDim(sce_sub[,1:10], "PCA"))
```

# Convenient access to named assays

In the `SingleCellExperiment`, users can assign arbitrary names to entries of `assays`.
To assist interoperability between packages, we provide some suggestions for what the names should be for particular types of data:

- `counts`: Raw count data, e.g., number of reads or transcripts for a particular gene.
- `normcounts`: Normalized values on the same scale as the original counts.
For example, counts divided by cell-specific size factors that are centred at unity.
- `logcounts`: Log-transformed counts or count-like values.
In most cases, this will be defined as log-transformed `normcounts`, e.g., using log base 2 and a pseudo-count of 1.
- `cpm`: Counts-per-million.
This is the read count for each gene in each cell, divided by the library size of each cell in millions.
- `tpm`: Transcripts-per-million.
This is the number of transcripts for each gene in each cell, divided by the total number of transcripts in that cell (in millions).

Each of these suggested names has an appropriate getter/setter method for convenient manipulation of the `SingleCellExperiment`.
For example, we can take the (very specifically named) `tophat_counts` name and assign it to `counts` instead:

```{r}
counts(sce) <- assay(sce, "tophat_counts")
sce
dim(counts(sce))
```

This means that functions expecting count data can simply call `counts()` without worrying about package-specific naming conventions.

# Session Info

```{r}
sessionInfo()
```