---
title: "An introduction to scMerge2"
author:
- name: Yingxin Lin
  affiliation: School of Mathematics and Statistics, The University of Sydney, Australia
output:
  BiocStyle::html_document:
    toc_float: true
package: BiocStyle
vignette: >
  %\VignetteIndexEntry{scMerge2}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```


# Introduction

The scMerge algorithm allows batch effect removal and normalisation for single cell RNA-Seq data. It comprises of three key components including: 

1.  The identification of stably expressed genes (SEGs) as "negative controls" for estimating unwanted factors; 
2.  The construction of pseudo-replicates to estimate the effects of unwanted factors; and 
3.  The adjustment of the datasets with unwanted variation using a fastRUVIII model. 


The purpose of this vignette is to illustrate some uses of `scMerge` and explain its key components. 

# Loading Packages and Data

We will load the `scMerge` package. We designed our package to be consistent with the popular BioConductor's single cell analysis framework, namely the `SingleCellExperiment` and `scater` package. 

```{r loading packages, warning = FALSE, message = FALSE}
suppressPackageStartupMessages({
    library(SingleCellExperiment)
    library(scMerge)
    library(scater)
})
```


We provided an illustrative mouse embryonic stem cell (mESC) data in our package, as well as a set of pre-computed stably expressed gene (SEG) list to be used as negative control genes. 


The full curated, unnormalised mESC data can be found [here](http://www.maths.usyd.edu.au/u/yingxinl/wwwnb/scMergeData/sce_mESC.rda). The `scMerge` package comes with a sub-sampled, two-batches version of this data (named "batch2" and "batch3" to be consistent with the full data) . 


```{r loading data}
## Subsetted mouse ESC data
data("example_sce", package = "scMerge")
data("segList_ensemblGeneID", package = "scMerge")
```


In this mESC data, we pooled data from 2 different batches from three different cell types. Using a PCA plot, we can see that despite strong separation of cell types, there is also a strong separation due to batch effects. This information is stored in the `colData` of `example_sce`. 

```{r checking raw data}
example_sce = runPCA(example_sce, exprs_values = "logcounts")

scater::plotPCA(example_sce, 
                colour_by = "cellTypes", 
                shape_by = "batch")
```


# scMerge2

## Unsupervised `scMerge2`

In unsupervised `scMerge2`, we will perform graph clustering on shared nearest neighbour graphs within each batch to obtain pseudo-replicates. This requires the users to supply a `k_celltype` vector with the number of neighbour when constructed the nearest neighbour graph in each of the batches. By default, this number is 10.


```{r}
scMerge2_res <- scMerge2(exprsMat = logcounts(example_sce),
                         batch = example_sce$batch,
                         ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                         verbose = FALSE)

assay(example_sce, "scMerge2") <- scMerge2_res$newY

set.seed(2022)
example_sce <- scater::runPCA(example_sce, exprs_values = 'scMerge2')                                       
scater::plotPCA(example_sce, colour_by = 'cellTypes', shape = 'batch')
```


## Semi-supervised `scMerge2`


When cell type information are known (e.g. results from cell type classification using reference), scMerge2 can use this information to construct pseudo-replicates and identify mutual nearest groups with `cellTypes` input. 


```{r}
scMerge2_res <- scMerge2(exprsMat = logcounts(example_sce),
                         batch = example_sce$batch,
                         cellTypes = example_sce$cellTypes,
                         ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                         verbose = FALSE)


assay(example_sce, "scMerge2") <- scMerge2_res$newY

example_sce = scater::runPCA(example_sce, exprs_values = 'scMerge2')                                       
scater::plotPCA(example_sce, colour_by = 'cellTypes', shape = 'batch')
```

# More details of scMerge2

## Number of pseudobulk 

The number of pseudobulk constructed within each cell grouping is set via `k_pseudoBulk`. By default, this number is set as 30. A larger number will create more pseudo-bulk data in model estimation, with longer time in estimation. 

```{r}
scMerge2_res <- scMerge2(exprsMat = logcounts(example_sce),
                         batch = example_sce$batch,
                         ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                         k_pseudoBulk = 50,
                         verbose = FALSE)


assay(example_sce, "scMerge2") <- scMerge2_res$newY

set.seed(2022)
example_sce <- scater::runPCA(example_sce, exprs_values = 'scMerge2')                                       
scater::plotPCA(example_sce, colour_by = 'cellTypes', shape = 'batch')
```

## Return matrix by batch 

When working with large data, we can get the adjusted matrix for a smaller subset of cells each time. This can be achieved by setting `return_matrix` to `FALSE` in `scMerge2()` function, which the function then will not return the adjusted whole matrix but will output the estimated `fullalpha`.
Then to get the adjusted matrix using the estimated `fullalpha`, we first need to performed cosine normalisation on the logcounts matrix and then calculate the row-wise (gene-wise) mean of the cosine normalised matrix (This is because by default, `scMerge2()` perform cosine normalisation on the log-normalised matrix before `RUVIII` step). Then we can use `getAdjustedMat()` to adjust the matrix of a subset of cells each time.


```{r}
scMerge2_res <- scMerge2(exprsMat = logcounts(example_sce),
                         batch = example_sce$batch,
                         ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                         verbose = FALSE,
                         return_matrix = FALSE)

cosineNorm_mat <- batchelor::cosineNorm(logcounts(example_sce))
adjusted_means <- DelayedMatrixStats::rowMeans2(cosineNorm_mat)

newY <- list()
for (i in levels(example_sce$batch)) {
    newY[[i]] <- getAdjustedMat(cosineNorm_mat[, example_sce$batch == i], 
                                scMerge2_res$fullalpha,
                                ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                                ruvK = 20,
                                adjusted_means = adjusted_means)
}
newY <- do.call(cbind, newY)

assay(example_sce, "scMerge2") <- newY[, colnames(example_sce)]

set.seed(2022)
example_sce <- scater::runPCA(example_sce, exprs_values = 'scMerge2')                                       
scater::plotPCA(example_sce, colour_by = 'cellTypes', shape = 'batch')

```


Note that we can also adjust only a subset of genes by input a gene list in `return_subset_genes` in 
both `getAdjustedMat()` and `scMerge2()`.

## Hierarchical scMerge2

scMerge2 provides a hierarchical merging strategy for data integration that requires multiple level adjustment through function `scMerge2h()`. For example, for the dataset with multiple samples, we may want to remove the sample effect first within the dataset before integrate it with other datasets. Below, we will illustrate how we can build a hierarchical merging order as input for `scMerge2h()`.

For illustration purpose, here I first create a fake sample information for the sample data. Now, each batch has two samples. 

```{r}
# Create a fake sample information
example_sce$sample <- rep(c(1:4), each = 50)
table(example_sce$sample, example_sce$batch)
```

To perform `scMerge2h`, we need to create 

1. a hierarchical index list that indicates the indices of the cells that are going into merging;
2. a batch information list that indicates the batch information of each merging.

We will first illustrate that a two-level merging case, where the first level refers to the sample effect removal within each batch, and the second level refers to the merging of two batches. 

First, we will construct the hierarchical index list (`h_idx_list`). The hierarchical index list is a list that indicates for each level, which indices of the cells are going into merging. The number of the element of the list should be the same with the number of level of merging. For each element, it should contain a list of vectors of indices of each merging.


```{r}
# Construct a hierarchical index list
h_idx_list <- list(level1 = split(seq_len(ncol(example_sce)), example_sce$batch),
                   level2 = list(seq_len(ncol(example_sce))))
```

On level 1, we will perform two merging, one for each batch. Therefore, we have a list of two vectors of indices. Each indicates the indices of the cells of the batches.

```{r}
h_idx_list$level1
```

On level 2, we will perform one merging to merge two batches. Therefore, we have a list of one vector of indices, indicates all the indices of the cells of the full data.

```{r}
h_idx_list$level2
```

Next, we need to create the batch information list (`batch_list`), which has the same structure with the `h_idx_list`, but indicates the batch label of the `h_idx_list`. 


```{r}
# Construct a batch information list
batch_list <- list(level1 = split(example_sce$sample, example_sce$batch),
                   level2 = list(example_sce$batch))
```

We can see `batch_list` indicates the batch label for level of the merging.

```{r}
batch_list$level1
```


Now, we can input the `batch_list` and `h_idx_list` in `scMerge2h` to merge the data hierarchically. 
We also need to input a `ruvK_list`, a vector of number of unwanted variation (k in RUV model) for each level of the merging. We suggest a lower `ruvK` in the first level. Here we set `ruvK_list = c(2, 5)` which indicates we will set RUV's k equal to 2 in the first level and 5 in the second level.

```{r warning=FALSE}
scMerge2_res <- scMerge2h(exprsMat = logcounts(example_sce),
                          batch_list = batch_list,
                          h_idx_list = h_idx_list,
                          ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                          ruvK_list = c(2, 5),
                          verbose = FALSE)
```

The output of `scMerge2h` is a list of matrices indicates the adjusted matrix from each level.

```{r}
length(scMerge2_res)
lapply(scMerge2_res, dim)
```


Here we will use the adjusted matrix from the last level as the final adjusted matrix.

```{r}
assay(example_sce, "scMerge2") <- scMerge2_res[[length(h_idx_list)]]
set.seed(2022)
example_sce <- scater::runPCA(example_sce, exprs_values = 'scMerge2')                           
scater::plotPCA(example_sce, colour_by = 'cellTypes', shape = 'batch')
```


# Session Info

```{r session info}
sessionInfo()
```