---
title: "scRNA-Seq Population-level Analysis using GloScope"
author:
- name: Hao Wang
  affiliation:
  - &UCB University of California, Berkeley, California, USA
  email: hao_wang@berkeley.edu
- name: William Torous
  affiliation:
  - *UCB
- name: Elizabeth Purdom
  affiliation:
  - *UCB
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
bibliography: gloscope.bib
vignette: >
  %\VignetteIndexEntry{GloScope}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# Introduction

This vignette reviews the steps needed to implement the `GloScope` methodology. `GloScope` is a framework for creating profiles of scRNA-Seq samples in order to globally compare and analyze them across patients or tissue samples. The methodology first estimates a global gene expression distribution for each sample, and then calculates how divergent pairs of samples are from each other. The output from the package's main function, `gloscope()`, is an $n \times n$ divergence matrix containing the pairwise statistical divergences between all samples, where $n$ is the number of samples. This divergence matrix can be the input to other downstream statistical and machine learning tools.

This package has been submitted to Bioconductor because we believe it may be of particular interest to the bioinformatics community and because it depends on other packages from Bioconductor.

`GloScope` estimates the gene expression density from a low-dimensional embedding of the UMI count measurements, such as PCA or scVI embeddings. Users must provide `GloScope` with a `data.frame` containing each cell's embedding coordinates, along with metadata identifying the sample to which each cell belongs. The package provides a helper function `plotMDS` which allows the user to visualize the divergence matrix with multidimensional scaling (MDS), but the divergence matrix can also be visualized with other algorithms.

A standard workflow for `GloScope` consists of:

1. Obtain the dimension reduction embedding of the cells and specify how many dimensions to keep. This is computed outside of the `GloScope` package.
2. Choose a density estimation method (Gaussian mixture or k-nearest neighbours) to estimate each sample's latent distribution.
3. Calculate the symmetric KL divergence or Jensen-Shannon divergence between all pairs of samples.
4. Visualize the distance matrix with the first two dimensions of MDS using the `plotMDS` function.

## Installation

You can install the latest stable release of `GloScope` from Bioconductor. Make sure that you have the `BiocManager` package installed to proceed.

```{r install_bioc, eval=FALSE}
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("GloScope")
```

# Example

In this section, we use a toy example to illustrate the input, output, and visualization steps of the `GloScope` pipeline. The data is a subset of that presented in @stephenson2021single, and was obtained from [ArrayExpress accession E-MTAB-10026](https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-10026). The dataset has a total of 647,366 peripheral blood mononuclear cells (PBMCs) from 130 patients, whose phenotypes include COVID-19 patients, healthy control donors, patients with other non-COVID respiratory diseases, and volunteers administered intravenous lipopolysaccharide (IV-LPS).

To enable faster computation in this tutorial, we subset this dataset to 20 random COVID and healthy control donors and further subsample each patient's count matrix to 500 random cells. This subsampled data is provided in the `SingleCellExperiment` object `example_SCE`. We emphasize that this subsampling procedure is only for demonstration purposes and is not a recommended step in normal analyses.
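For illustration only, the following sketch shows one way such per-donor subsampling could be carried out. It is not evaluated, and `full_sce` is a hypothetical `SingleCellExperiment` holding the complete dataset with a per-cell `sample_id` column; the selection of the 20 donors is omitted.

```{r subsample-sketch, eval=FALSE}
# Hypothetical sketch (not run): keep 500 random cells from each donor of a
# full SingleCellExperiment `full_sce` that stores a per-cell `sample_id`.
set.seed(1)
cells_by_donor <- split(seq_len(ncol(full_sce)), full_sce$sample_id)
keep_cells <- unlist(lapply(cells_by_donor, function(idx) sample(idx, 500)))
subsampled_sce <- full_sce[, keep_cells]
```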
## Data Input

We first load the `GloScope` package and the aforementioned example dataset.

```{r load-libs, message = FALSE, warning = FALSE}
library(GloScope)
data("example_SCE")
```

The example `SingleCellExperiment` object contains the first $50$ principal components of the subsampled cells, as well as the sample and phenotype labels associated with each cell.

```{r}
head(SingleCellExperiment::reducedDim(example_SCE, "PCA")[, 1:10])
head(SingleCellExperiment::colData(example_SCE))
```

The following table confirms that each donor provides $500$ cells, and that both phenotypes are represented.

```{r}
table(SingleCellExperiment::colData(example_SCE)$sample_id,
      SingleCellExperiment::colData(example_SCE)$phenotype)
```

`GloScope` expects the user to provide a `data.frame` containing each cell's low-dimensional embedding (with cells in rows), along with a vector identifying the sample from which each cell in the embedding matrix is drawn. The PCA embeddings in the example dataset were provided by the authors of @stephenson2021single. In general, users of `GloScope` can input any dimensionality reduction to the method. UMI counts are often provided as `Seurat` or `SingleCellExperiment` objects, and many dimensionality reduction strategies, including PCA and scVI (@lopez2018deep), can be computed and saved within those data structures. This is recommended, as it allows the user to keep the counts, embeddings, and metadata (like the sample from which each cell was isolated) in the same structure and minimizes the chance of inadvertently mislabelling the sample of a cell.

The following code, which is only pseudo-code and not evaluated (since we do not provide the `seurat_object`), demonstrates how PCA embeddings and sample labels can be extracted from a `Seurat` object for input into the `gloscope` function.

```{r eval = FALSE}
embedding_df <- seurat_object@reductions$pca@cell.embeddings
sample_ids <- seurat_object@meta.data$sample_id
```

## Divergence Matrix Calculation

With the cell embeddings and the sample labels in the proper format, the `gloscope` function is simple to set up and run. For simplicity, we will save these as separate objects, though for real datasets this would not be recommended, since it unnecessarily makes copies of the data and increases memory usage:

```{r}
embedding_matrix <- SingleCellExperiment::reducedDim(example_SCE, "PCA")[, 1:10]
sample_ids <- SingleCellExperiment::colData(example_SCE)$sample_id
```

Although the example data contains the first 50 principal components, we chose to use only the first 10 for calculations. A large number of latent variables can make the density estimation unstable, so we do not recommend large increases to the number of latent variables.

The base function call is `gloscope(embedding_df, sample_ids)`.
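Before calling `gloscope`, it can be helpful to confirm that the inputs line up: the embedding has one row per cell and the label vector has one entry per cell. This quick check is not part of the package, just a suggested precaution.

```{r input-check}
# One row of the embedding per cell, and one sample label per cell.
stopifnot(nrow(embedding_matrix) == length(sample_ids))
```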
```{r, eval=FALSE}
# Can take a couple of minutes to run:
gmm_divergence <- gloscope(embedding_matrix, sample_ids)
```

The call above uses the default `GMM` option for density estimation; this is the method primarily considered in the manuscript, but it can take longer to run, so we have not evaluated it here. (You can evaluate it by changing `eval=FALSE` to `eval=TRUE`.) An alternative method uses non-parametric density estimation based on a $k$-nearest neighbours algorithm and can be selected with the argument `dens`. We will use this in our examples simply to make the tutorial run quickly:

```{r}
knn_divergence <- gloscope(embedding_matrix, sample_ids, dens = "KNN")
knn_divergence[1:5, 1:5]
```

Note that, unlike PCA, not every dimensionality reduction method retains its statistical properties when only a subset of the coordinates is retained, with scVI being an example. For methods like scVI, you should choose the number of latent variables you want to use *when calculating the latent variables* and not subset them after the fact.

## More on the density method

If the user chooses the default GMM method (`dens="GMM"`), `gloscope` fits sample-level densities with Gaussian mixture models (GMMs) implemented by `mclust`. The `mclust` package uses the Bayesian information criterion (BIC) to select a GMM from a family of models, and the user of `GloScope` can specify how many centroids should be considered in that family. By default, GMMs with $1$ through $9$ centroids are compared. With the optional `num_components` argument to `gloscope`, the user can specify which numbers of centroids `mclust` should consider, with the model achieving the best BIC value being the final choice.

When GMMs are used for density fitting, a Monte-Carlo approximation is then used to compute the pairwise divergences from the estimated densities. This means that the resulting estimate is stochastic, and details about controlling the random seed appear later in this vignette. The number of Monte-Carlo draws from the density of each sample is $10,000$ by default, and this is controlled by the optional parameter `r` in the `gloscope` function.

```{r, eval=FALSE}
# Can take a couple of minutes to run:
gmm_divergence_alt <- gloscope(embedding_matrix, sample_ids, dens = "GMM",
    num_components = c(2, 4, 6), r = 20000)
```

A non-parametric alternative for density and divergence estimation is the $k$-nearest neighbours algorithm. To use this technique, the optional argument `dens="KNN"` should be set. The number of neighbours is a hyperparameter, equal to $50$ by default, and governed by the optional argument `k`. It is important to note that negative divergences between similar samples are possible with this density estimation choice. The `gloscope` function does not censor or round any negative values in the output matrix, leaving that decision to the user.

```{r}
knn_divergence_alt <- gloscope(embedding_matrix, sample_ids, dens = "KNN", k = 25)
```

## More on the divergence method

The default divergence for `GloScope` is the symmetric KL divergence, but the Jensen-Shannon divergence is also implemented. This can be controlled by setting the argument `dist_mat="KL"` or `dist_mat="JS"`, respectively. One beneficial property of the Jensen-Shannon divergence is that its square root is a proper distance metric. Note that `gloscope` returns a matrix of untransformed divergences, and the user must take the square root of the matrix entries themselves if this is desired.
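As a brief illustration of these options (not evaluated here), one could compute Jensen-Shannon divergences with the KNN density estimate and take the element-wise square root to obtain distances; flooring any small negative KNN estimates at zero first avoids `NaN` values.

```{r js-distance, eval=FALSE}
js_divergence <- gloscope(embedding_matrix, sample_ids,
    dens = "KNN", dist_mat = "JS")
# The square root of the Jensen-Shannon divergence is a proper distance metric;
# negative KNN estimates (if any) are floored at zero before the square root.
js_distance <- sqrt(pmax(js_divergence, 0))
```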
# Visualization

The `plotMDS` function provided by this package visualizes the output divergence matrix with multidimensional scaling. It utilizes the `isoMDS` function from the `MASS` package and then creates a scatter plot with samples color-coded by a user-specified covariate such as phenotype. This function requires a `data.frame` with the relevant metadata at the sample level, rather than at the cell level. This can easily be obtained by applying the `unique` function from base R to the cell-level metadata `data.frame`, after dropping any columns that vary at the cell level.

```{r}
pat_info <- as.data.frame(unique(SingleCellExperiment::colData(example_SCE)[, -c(3)]))
head(pat_info)
```

Here we plot the MDS representation with each sample color-coded by the `phenotype` variable. Note that the function call returns both a matrix of MDS embeddings and a `ggplot` visualization of the first two dimensions.

```{r}
mds_result <- plotMDS(dist_mat = knn_divergence, metadata = pat_info,
    "sample_id", "phenotype", k = 2)
mds_result$plot
```

Another classical way to visualize a divergence matrix is with a heatmap. The following code demonstrates that the output of `gloscope` is easily used in plotting functions beyond the package.

```{r}
heatmap(knn_divergence)
```

# Parallelization and Random Seeds

To speed up calculation of the pairwise divergences, `GloScope` allows the computation to be parallelized. The argument `BPPARAM` controls the parameters of the parallelization (see `BiocParallel::bplapply`). The default is no parallelization, but the iteration across sample pairs is still performed via the function `bplapply`. In this case (i.e. no parallelization), the argument is simply `BPPARAM = BiocParallel::SerialParam()`.

IMPORTANT: Due to the construction of the NAMESPACE file, it is essential that any setting of the optional `BPPARAM` argument uses the `BiocParallel::` namespace prefix. For example, `gloscope(..., BPPARAM = MulticoreParam())` will raise an error, and `gloscope(..., BPPARAM = BiocParallel::MulticoreParam())` should be used instead.

## Random seed

The calculation of the KL divergence from the GMM density estimate uses Monte-Carlo approximation, and hence has to randomly sample from the estimated density. To set the seed for the pseudo-random number generator used in this simulation, the seed needs to be set within the argument to `BPPARAM` and **not** by a call to `set.seed` (see https://bioconductor.org/packages/release/bioc/vignettes/BiocParallel/inst/doc/Random_Numbers.html for more information). This is how the seed must be set *even if no parallelization* is chosen (the default), because the iteration over sample pairs is sent through the `bplapply` function regardless, as noted above. Setting the seed outside the function via `set.seed` will have no effect on the function.

The following is an example of how to set the random seed when running the `GMM` option, using the default of no parallelization:

```{r eval = FALSE}
gmm_divergence <- gloscope(embedding_matrix, sample_ids, dens = "GMM",
    dist_mat = "KL", BPPARAM = BiocParallel::SerialParam(RNGseed = 2))
```

The same argument (`RNGseed`) can be passed to other `BiocParallel` back-ends supplied to `BPPARAM`. Note that the `KNN` estimation procedure does not have any Monte-Carlo approximation steps, and thus does not need a random seed.

# References

# SessionInfo {-}

```{r sessionInfo}
sessionInfo()
```