---
title: "Introduction to StatescopeR"
author: 
  - name: Mischa Steketee
    affiliation:
    - Cancer Centre Amsterdam, Amsterdam UMC
    email: m.f.b.steketee@amsterdamumc.nl
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
vignette: >
  %\VignetteIndexEntry{Introduction to StatescopeR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    ## see https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
    crop = NULL
)
```

```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    StatescopeR = citation("StatescopeR")[1]
)
```

# Introduction

StatescopeR is the R version of Statescope, a computational framework designed
to discover cell states from cell type-specific gene expression profiles
inferred from bulk RNA profiles. StatescopeR works in three steps:
1) Multi-omic single cell RNAseq reference-based deconvolution 
2) Refinement of inferred cell type-specific gene expression profiles 
3) Cell state discovery 
In addition to bulk RNA and a single cell RNAseq (scRNAseq) reference,
StatescopeR can also integrate prior expectations of cell fractions in the
bulk RNA. For example tumor fractions estimated by DNA copy number analysis or
immune cell fractions by immunohistochemistry. Below is a manual for running
StatescopeR with some pancreas scRNAseq data used for both the reference and
pseudobulk which is deconvolved.

# Running StatescopeR

## Installation

StatescopeR is available through Bioconductor, install it using the following
commands in R:

```{r "install", eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("StatescopeR")

## Check that you have a valid Bioconductor installation
BiocManager::valid()
```

## Prepare scRNAseq data

After the package is installed, we will be loading it and the scRNAseq package,
from which we will use the SegerstolpePancreas data set for our example. We will
use it for two purposes: 1) Creating a scRNAseq reference, as you would do for
your own application. 2) Creating pseudobulk to be deconvolved, which we can
conveniently use for evaluation as well. In your own application this would
likely be bulk mRNA sequencing.

```{r 'Prepare scRNAseq data'}
## Load StatescopeR & scRNAseq (for example data)
suppressMessages(library(StatescopeR))
suppressMessages(library(scRNAseq))
suppressMessages(library(scuttle))

## Load SegerstolpePancreas data set
scRNAseq <- SegerstolpePancreasData()

## remove duplicate genes
scRNAseq <- scRNAseq[!duplicated(rownames(scRNAseq)), ]

## Subset to 1 healthy and  type 2 diabetes samples
scRNAseq <- scRNAseq[, scRNAseq$individual %in% c("H2",
                                                    "T2D2", "T2D3")]
## remove cells with no cell type label
scRNAseq <- scRNAseq[, !is.na(scRNAseq$`cell type`)]

## remove very rare cell types (<100 cells in total data set)
celltypes_to_remove <-
    names(table(scRNAseq$`cell type`)[(table(scRNAseq$`cell type`) < 100)])
scRNAseq <- scRNAseq[, !scRNAseq$`cell type` %in% celltypes_to_remove]

```

## Prepare StatescopeR input

StatescopeR takes the following inputs: 1) scRNAseq reference/signature
2) bulk mRNA to be deconvolved
3) selected genes for deconvolution, by default this uses
`r BiocStyle::Githubpkg("theislab/AutoGeneS")` 
(optional) Prior expectation of fractions
(optional) Hyperparameters (i.e. cores for parallelization, nrepfinal for
maximum number of optimization iteration etc.)

```{r 'Prepare StatescopeR input'}
## Normalize (cp10k) and logtransform scRNAseq
cpm(scRNAseq) <- calculateCPM(scRNAseq)
logcounts(scRNAseq) <- log1p(cpm(scRNAseq) / 100)

## Create scRNAseq reference/signature with 5 hvg for quick example
signature <- create_signature(scRNAseq,
    hvg_genes = TRUE, n_hvg_genes = 5L,
    labels = scRNAseq$`cell type`
)

## select subset of genes for deconvolution (3/5 hvg to make it quick)
selected_genes <- select_genes(scRNAseq, 3L, 5L,
    labels = scRNAseq$`cell type`
)

## Create pseudobulk and normalize to cp10k (logging is done within Statescope)
pseudobulk <- aggregateAcrossCells(scRNAseq, ids = scRNAseq$individual)
normcounts(pseudobulk) <- calculateCPM(pseudobulk) / 100
pseudobulk <- as(pseudobulk, "SummarizedExperiment")
rownames(pseudobulk) <- rownames(scRNAseq)

## (optional) Create prior expectation
prior <- gather_true_fractions(scRNAseq, ids = scRNAseq$individual,
            label_col = "cell type") # Use True sc fractions for this
prior[rownames(prior) != "ductal cell", ] <- NA # Keep only ductal cells
prior <- t(prior) # Transpose it to nSample x nCelltype
```

## Run StatescopeR

Now that all inputs have been prepared you can deconvolve the (pseudo)bulk mRNA
and discover transcriptional states. In short the modules work as follows: 
BLADE_deconvolution employs Bayesian variational inference, modelling the bulk 
gene expression level of each gene j in sample i by:
\begin{equation}
y_{ij} = log(\sum_{t} f_{i}^t exp(x_{ij}^t)) + \epsilon_{ij}
\end{equation}
with hidden variables $f_{i}^t$ and $x_{ij}^t$ denoting the fraction of cell
type t for sample i and the expression of gene j for sample i in cell type j
respectively. Refinement further optimizes cell type-specific gene expression
profiles, $x_{ij}^t$, by fixing estimated cell fractions and putting more
emphasis on capturing inter-sample heterogeneity over staying close to the
single cell prior knowledge. StateDiscovery uses convex-NMF to cluster inferred
cell type-specific gene expression profiles in an unsupervised manner. These
modules are wrappers around the original Python version
`r BiocStyle::Githubpkg("tgac-vumc/Statescope")`.

```{r 'Run StatescopeR'}
## Run Deconvolution module
Statescope <- BLADE_deconvolution(signature, pseudobulk, selected_genes, prior,
                                    cores = 1L, Nrep = 1L)

## Run Refinement module
Statescope <- Refinement(Statescope, signature, pseudobulk, cores = 2L)


## Run State Discovery module (pick k=2 for quick example)
Statescope <- StateDiscovery(Statescope, k = 2L, Ncores = 2L)
```

## Evaluate Statescope

As we used pseudobulk from scRNAseq in this manual we are able to check if the
output of StatescopeR is correct.

```{r 'Evaluate Statescope'}
## Count fractions of cells per sample in the pseudobulk
true_fractions <- gather_true_fractions(scRNAseq, ids = scRNAseq$individual,
                    label_col = "cell type")

## Plot correlation and RMSE with true fractions per celltype
fraction_eval(Statescope, true_fractions)
```

## Downstream analysis/visualizations

After you have run Statescope you can do all kinds of analyses, here we show
some visualizations to give you an idea, but feel free to try whatever comes to
your mind.

```{r 'Downstream analysis/visualizations'}
## Show predicted fractions
metadata(Statescope)$fractions

## Show predicted ductal cell type specific gene expression profiles
assay(metadata(Statescope)$ct_specific_gep$`ductal cell`, "weighted_gep")

## Show ductal cell state scores per sample
metadata(Statescope)$statescores$`ductal cell`

## Show ductal cell state loadings
metadata(Statescope)$stateloadings$`ductal cell`

## Plot heatmap of fractions
fraction_heatmap(Statescope)

## Plot barplot of top loadings
barplot_stateloadings(Statescope)
```

## Group expectations

If you have an expected fraction for a group of cell types instead of for one
specific cell (i.e. Lymphoid cells instead of T/B cells), StatescopeR also
allows you to give a group prior. Here we show an example where we know the
expected fraction of the pancreatic non-duct cells (acinar & alpha cells)
together, but not their individual contributions.

```{r 'Group expectations'}
## define cell types in one group
grouped_cts <- c("alpha cell", "acinar cell")

## initialize group assignment matrix (nCelltype x nGroup) with 0's
celltype_names <- colnames(signature$mu)
group_names <- c("Group", setdiff(celltype_names, grouped_cts))
nCelltype <- length(celltype_names)
nGroup <- length(group_names)

group <- matrix(0,
    nrow = nGroup, ncol = nCelltype,
    dimnames = list(group_names, celltype_names)
)


## initialize prior fraction matrix (nSamples x nCelltypes) with NA's
nSamples <- ncol(pseudobulk)
sample_names <- colnames(pseudobulk)

prior <- matrix(
    nrow = nSamples, ncol = nGroup,
    dimnames = list(sample_names, group_names)
)


## Assign cell types to groups
for (ct in celltype_names) {
    if (ct %in% grouped_cts) {
        ## assign cell type to group
        group["Group", ct] <- 1
    } else {
        ## Assign cell type to its own group
        group[ct, ct] <- 1
    }
}

## Add grouped prior fractions to prior
prior[names(true_fractions), "Group"] <- colSums(as.matrix(
                                            true_fractions[grouped_cts, ]))


## init group prior
group_prior <- list("Group" = group, "Expectation" = prior)

## Now you can just run Statescope as with a normal prior. Note we do not give
## a ductal cell prior this time
Statescope <- BLADE_deconvolution(signature, pseudobulk, selected_genes,
    group_prior,
    cores = 1L,
    Nrep = 1L
    )
```

# Citing `StatescopeR`

We hope that `r Biocpkg("StatescopeR")` will be useful for your research.
Please use the following information to cite the package and the overall
approach. Thank you!

```{r "citation"}
## Citation info
citation("StatescopeR")
```

# Reproducibility

The `r Biocpkg("StatescopeR")` package `r Citep(bib[["StatescopeR"]])` was made
possible thanks to:

-   R `r Citep(bib[["R"]])`
-   `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])`
-   `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])`
-   `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`
-   `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])`
-   `r CRANpkg("sessioninfo")` `r Citep(bib[["sessioninfo"]])`
-   `r CRANpkg("testthat")` `r Citep(bib[["testthat"]])`

This package was developed using `r BiocStyle::Biocpkg("biocthis")`.

`R` session information.

```{r reproduce3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 80)
session_info()
```