---
title: 'Interoptability between MAST and SingleCellExperiment-derived packages.'
author: Andrew McDavid
date: "`r Sys.Date()`"
bibliography: mast-interopt.bib
package: MAST
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Interoptability between MAST and SingleCellExperiment-derived packages}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Introduction 
As a `SingleCellExperiment`-derived package, `MAST` can easily be
inserted into workflows with packages such as
`scran`, `scater`, `zinbwave`, `SCnorm` and others.  Moreover, subclassing SingleCellExperiment/SummarizedExperiment provides a flexible abstraction for the `assay` that contains the actual expression data.  It can use sparse `Matrix` and HDF5 as backends to save memory.

To use MAST with such packages, you just need to upcast the `SingleCellExperiment` to MAST's subclass `SingleCellAssay` with the function `SceToSingleCellAssay` that handles the coercion and checks the object for validity.  Going the other direction, generally `SingleCellAssay`s should work in packages that use `SingleCellExperiment`, but if in doubt you could down-cast with `as(sca, 'SingleCellExperiment')`.

## Log-transformation is expected in MAST

The main gotcha in all this is that some SingleCellExperiment-derived packages
assume integer counts have been provided, while MAST assumes that log-transformed approximately scale-normalized data is provided.  We find that MAST performs best
with log-transformed, scale-normalized data that has been thresholded,
such as $\log_2(\text{transcripts per million} + 1)$.

We address this by:

-  testing for log-like data for objects up-cast to `SingleCellAssay`
-  explicitly naming the slot of the `assay` containing such
putatively log-like data
-  by default operating on the slot with such log-like data

## Examples 

In what follows, we show an example of using `scater` to plot some QC
metrics, `SCnorm` to normalize data, and, and conversion
to a `Seurat` object.


#  From MAST to Scater

Scater @scater is a package that provides functions for QC, normalization and visualization of single cell RNAseq data.

```{r, results = 'hide'}
library(MAST)

knitr::opts_chunk$set(message = FALSE,error = FALSE,warning = FALSE)
data(maits, package='MAST')
unlog <- function(x) ceiling(2^x - 1)
sca_raw =  FromMatrix(t(maits$expressionmat), maits$cdat, maits$fdat)
assays(sca_raw)$counts = unlog(assay(sca_raw))
assayNames(sca_raw)
```
Here we make an object with assays `counts` and `et`.  By default,
`MAST` will operate on the `et` assay, but scran wants count-like data
for some of its QC.  The `et` data are log2 + 1 transcripts per
million (TPM), as output by RSEM.

We could specify the assay name at creation with `sca_raw =
FromMatrix(list(logTPM = t(maits$expressionmat)), maits$cdat,
maits$fdat)` or rename an object that contains appropriately transformed data with
`assayNames(sca_raw) = c('logTPM', 'counts')`.

Before calling `scater` functionality, you might pause to
consider if some features should belong in special `control` sets,
such as mitochrondial genes, or spike-ins.

```{r scaterQC,results='hide'}
library(scater)
sca_raw = calculateQCMetrics(sca_raw)
plotRowData(sca_raw, x = 'log10_mean_counts', 'pct_dropout_by_counts')
plotColData(sca_raw, y="total_features_by_counts", x="total_counts")
```

Evidently some features were filtered, so not all cells contain 1
million counts.  We can tell these were rare features based on the
inverse relationship between `total_counts` and
`total_features_by_counts`: the most complex libraries (with the
greatest number of features) are missing the most counts.

```{r}
sca_raw = runPCA(sca_raw, ncomponents=5, exprs_values = 'et')
plotReducedDim(sca_raw, dimred = 'PCA', colour_by = 'condition')
```
We can also run a PCA.

## From scater to MAST

Since scater uses `SingleCellExperiment` objects, the only here consideration is making sure `MAST` can find log-like data, and possibly thresholding the data.

```{r}
if(packageVersion('scater') >= '1.13'){
  example_sce = mockSCE() 
} else{
  data("sc_example_counts")
  data("sc_example_cell_info")
  example_sce = SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info)
}

example_sce = normalize(example_sce)
sca = SceToSingleCellAssay(example_sce)
```

Here we coerce `example_sce` to be a SingleCellAssay object.

```{r}
zlm( ~ Treatment, sca = sca, exprs_value = 'logcounts')
```

We test for differential expression with regards to `Treatment` and explicitly indicate the `logcounts` slot will be used.  Methods in MAST will operate on the default slice returned by `assay`, which has been over-ridden to  return log-like data: the default slice is the first assay whose name, as given by `assayNames(x)`, matches the first element in the sequence `c('thresh', 'et', 'Et', 'lCount', 'logTPM', 'logCounts', 'logcounts')`.  So in the case of `sca`, even if `exprs_value` was not specified, the `logcounts` slot would have been used, even though it comes second in `assayNames(sca)`:

```{r}
assayNames(sca)
```

## Sparse matrix and HDF5 support

```{r}
library(Matrix)
sca_sparse = FromMatrix(
    exprsArray = list(et = Matrix(t(maits$expressionmat), sparse = TRUE)),
    cData = maits$cdat, fData = maits$fdat)
class(assay(sca_sparse))
regular_time = system.time(zlm( ~ condition, sca = sca_raw[1:100,]))
sparse_time = system.time(zlm( ~ condition, sca = sca_sparse[1:100,]))
```

There is no complication to providing a sparse matrix.

```{r}
library(DelayedArray)
library(HDF5Array)
hd5_dat = as(t(maits$expressionmat), 'HDF5Array')
seed(hd5_dat)
```

Write `sc_example_counts` to disk as an `HDF5Array`

```{r}
sca_delay = FromMatrix(
    exprsArray = list(et = hd5_dat),
     cData = maits$cdat, fData = maits$fdat)
class(assay(sca_sparse))

hd5_time = system.time(zlm( ~ condition, sca = sca_delay[1:100,]))
```

Nor is there a complication to using HDF5-backed stores.  
```{r}
knitr::kable(data.frame(method = c('Dense', 'Sparse', 'HDF5'), 'user time(s)' =c( regular_time[1], sparse_time[1], hd5_time[1]), check.names = FALSE))

```

As expected, the dense storage is fastest, followed by the sparse storage.  HDF5 is slowest.  The linear models underlying MAST don't (yet) have any special provision for big $n$ data, and will tend to linearly (or worse) in the number of cells.  So performance may be an issue even if they data do fit in memory.


# MAST and ZINB-wave
```{r zinbwave}
library(zinbwave)
feature_var = apply(assay(sca_raw), 1, var)
sca_top500 = sca_raw[rank(-feature_var)<=500,]
zw = zinbwave(Y = sca_top500, X = '~1', which_assay = 'counts', K = 2, normalizedValues = TRUE)

```

Run zinbwave.  To speed things, we take the
top 500 most variable genes.


```{r, message=FALSE}
rd = data.frame(reducedDim(zw, 'PCA'), reducedDim(zw, 'zinbwave'), colData(zw))
GGally::ggpairs(rd, columns = c('PC1', 'PC2', 'W1', 'W2'), mapping = aes(color = condition))
```

## Using MAST to characterizing genes that drive the factors

```{r, results = 'hide'}
colData(zw) = cbind(colData(zw), reducedDim(zw, 'zinbwave'))
zw = SceToSingleCellAssay(zw)
zz = zlm(~W1 + W2, sca = zw, exprs_values = 'et')
```

```{r, results = 'asis'}
ss = summary(zz)
knitr::kable(print(ss))

```

These are log-fold changes in the top few changes associated with factors 1 and 2.

```{r}
library(dplyr)
library(data.table)
top5 = ss$datatable %>% filter(component=='logFC', contrast %like% 'W') %>% arrange(-abs(z)) %>% head(n=5) %>% left_join(rowData(zw) %>% as.data.table())
dat = zw[top5$primerid,] %>% as('data.table')
dat = dat[,!duplicated(colnames(dat)),with = FALSE]
plt = ggplot(dat, aes(x=W1, color = condition)) + geom_point() + facet_wrap(~symbolid)

```


```{r}
plt + aes(y = et)
```

Expression on "Et" scale ($\log_2( TPM + 1)$)

```{r}
plt + aes(y = normalizedValues)
```

Normalized expression from factor model

# References