---
title: 'Interoptability between MAST and SingleCellExperiment-derived packages.'
author: Andrew McDavid
date: "`r Sys.Date()`"
package: MAST
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Interoptability between MAST and SingleCellExperiment-derived packages}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Introduction 
As a SingleCellExperiment-derived package, `MAST` can easily be
inserted into workflows with packages such as
`scran`, `scater`, `zinbwave`, `SCnorm` and others.  The main gotcha is packages
that assume integer counts vs log-transformed, or log-transformed,
approximately scale-normalized data.  We find that MAST performs best
with log-transformed, scale-normalized data that has been thresholded,
such as $\log_2(\text{transcripts per million} + 1)$.

We address this by:

-  testing for log-like data for objects constructed in MAST
-  explicitly naming the slot of the `assay` containing such
putatively log-like data
-  by default operating on the slot with such log-like data

In objects that were constructed in other packages, we ...

In what follows, we show an example of using `scater` to plot some QC
metrics, `SCnorm` to normalize data, and, and conversion
to a `Seurat` object.


#  From MAST to Scater

Scater (citation) is a package that ...

```{r, results = 'hide'}
library(MAST)

knitr::opts_chunk$set(message = FALSE,error = FALSE,warning = FALSE)
data(maits, package='MAST')
unlog <- function(x) ceiling(2^x - 1)
sca_raw =  FromMatrix(t(maits$expressionmat), maits$cdat, maits$fdat)
assays(sca_raw)$counts = unlog(assay(sca_raw))
assayNames(sca_raw)
```
Here we make an object with assays `counts` and `et`.  By default,
`MAST` will operate on the `et` assay, but scran wants count-like data
for some of its QC.  The `et` data are log2 + 1 transcripts per
million (TPM), as output by RSEM.

We could specify the assay name at creation with `sca_raw =
FromMatrix(list(logTPM = t(maits$expressionmat)), maits$cdat,
maits$fdat)` or rename an object that contains appropriately transformed data with
`assayNames(sca_raw) = c('logTPM', 'counts')`.

Before calling `scater` functionality, you might pause to
consider if some features should belong in special `control` sets,
such as mitochrondial genes, or spike-ins.

```{r scaterQC,results='hide'}
library(scater)
sca_raw = calculateQCMetrics(sca_raw)
plotRowData(sca_raw, x = 'log10_mean_counts', 'pct_dropout_by_counts')
plotColData(sca_raw, y="total_features_by_counts", x="total_counts")
```

Evidently some features were filtered, so not all cells contain 1
million counts.  We can tell these were rare features based on the
inverse relationship between `total_counts` and
`total_features_by_counts`: the most complex libraries (with the
greatest numer of features) are missing the most counts.

```{r}
sca_raw <- runPCA(sca_raw, ncomponents=5, exprs_values = 'et')
plotReducedDim(sca_raw, use_dimred = 'PCA', colour_by = 'condition')
```
We can also run a PCA.

## From scater to MAST

```{r}
data(sc_example_counts)

```

# WIP: From MAST to ZINB-wave
<!-- ```{r zinbwave} -->
<!-- library(zinbwave) -->
<!-- feature_var = apply(assay(sca_raw), 1, var) -->
<!-- sca_top500 = sca_raw[rank(-feature_var)<=500,] -->
<!-- zw <- zinbwave(Y = sca_top500, X = '~condition', which_assay = 'counts') -->

<!-- ``` -->

<!-- Run zinbwave.  To speed things, we take the -->
<!-- top 500 most variable genes. -->


<!-- ```{r} -->
<!-- ggpairs(data.frame(zw = reducedDim(zw), colData(zw)), columns = c('zw.PC1', 'zw.PC2', 'ngeneson'), mapping = aes(color = condition)) -->

<!-- ``` -->

<!-- ## Using MAST to characterizing genes that drive the factors -->

<!-- ```{r, results = 'hide'} -->
<!-- assays(zw)[[1]] <- log2(assay(zw)+1) -->
<!-- colData(zw) <- cbind(colData(zw), zw=as.data.frame(reducedDim(zw))) -->
<!-- zw <- zw %>% as('SingleCellAssay') -->
<!-- zz <- zlm(~W1+W2, sca=zw) -->

<!-- ``` -->

<!-- ```{r, results = 'asis'} -->
<!-- ss <- summary(zz) -->
<!-- knitr::kable(print(ss)) -->

<!-- ``` -->

<!-- These are log-fold changes in the top few changes associated with factors 1 and 2. -->

<!-- ```{r} -->
<!-- top5 <- ss$datatable %>% filter(component=='logFC', contrast %like% 'W') %>% arrange(-abs(z)) %>% head(n=5) %>% left_join(rowData(zw) %>% as.data.table()) -->
<!-- plt <- ggplot(zw[top5$primerid,] %>% as('data.table'), aes(x=W1, color = condition)) + geom_point() + facet_wrap(~symbolid) -->

<!-- ``` -->


<!-- ```{r} -->
<!-- plt + aes(y = et) -->
<!-- ``` -->

<!-- Expression on "Et" scale ($\log_2( TPM + 1)$) -->

<!-- ```{r} -->
<!-- plt + aes(y = normalizedValues) -->
<!-- ``` -->

<!-- Normalized expression from factor model -->