---
title: "Large-scale single-cell RNA-seq data analysis using GDS files and Seurat"
author: "Dr. Xiuwen Zheng (Genomics Research Center, AbbVie)"
date: "Feb 2023"
output:
    BiocStyle::html_document:
        toc_float: true
    BiocStyle::pdf_document: default
vignette: >
    %\VignetteIndexEntry{scRNA-seq data analysis with GDS files and Seurat}
    %\VignetteKeywords{scRNAseq, GDS, Seurat}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
knitr::opts_chunk$set(warning=FALSE, error=FALSE, message=FALSE)
```


# Introduction

Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of gene expression heterogeneity within complex biological systems. As scRNA-seq technology becomes increasingly accessible and cost-effective, experiments are generating data from larger and larger numbers of cells. However, the analysis of large scRNA-seq data remains a challenge, particularly in terms of scalability. While numerous analysis tools have been developed to tackle the complexities of scRNA-seq data, their scalability is often limited, posing a major bottleneck in the analysis of large-scale experiments. In particular, the R package [Seurat](https://cran.r-project.org/package=Seurat) is one of the most widely used tools for exploring and analyzing scRNA-seq data, but its scalability is often limited by available memory.

To address this issue, we introduce a new R package called “SCArray.sat” that extends the Seurat classes and functions to support Genomic Data Structure ([GDS](http://www.bioconductor.org/packages/gdsfmt)) files as a [DelayedArray](http://www.bioconductor.org/packages/DelayedArray) backend for data representation. GDS files store multiple dense and sparse array-based data sets in a hierarchical structure. This package defines a new class, called “SCArrayAssay” (derived from the Seurat class “Assay”), which wraps raw counts, normalized expressions, and scaled data matrices based on GDS-specific DelayedMatrix. It is designed to integrate seamlessly with the Seurat package to provide common data analysis in a workflow, with optimized algorithms for GDS data files.

~

```{r workflow, echo=FALSE, fig.cap="Overview of the SCArray framework.", fig.wide=TRUE}
knitr::include_graphics("scarray_sat.svg")
```

~


The SeuratObject package defines the `Assay` class with three members/slots `counts`, `data` and `scale.data` storing raw counts, normalized expressions and scaled data matrix respectively. However, `counts` and `data` should be either a dense matrix or a sparse matrix defined in the [Matrix](https://cran.r-project.org/package=Matrix) package. The scalability of the sparse matrix is limited by the number of non-zero values (should be < 2^31), since the Matrix package uses 32-bit indices internally. `scale.data` in the `Assay` class is defined as a dense matrix, so it is also limited by the available memory. The new class `SCArrayAssay` is derived from `Assay`, with three additional slots `counts2`, `data2` and `scale.data2` replacing the old ones. These new slots can be DelayedMatrix wrapping an on-disk data matrix, without loading the data in memory.

The SCArray.sat package takes advantage of the S3 object-oriented methods defined in the SeuratObject and Seurat packages to reduce code redundancy, by implementing the functions specific to the classes `SCArrayAssay` and `SC_GDSMatrix` (GDS-specific DelayedMatrix). Table 1 shows a list of key S3 methods for data analysis. For example, the function `NormalizeData.SC_GDSMatrix()` will be called when a `SC_GDSMatrix` object is passed to the S3 generic `NormalizeData()`, while `NormalizeData.Assay()` and `NormalizeData.Seurat()` are unchanged. In addition, the SCArray and SCArray.sat packages implement the optimized algorithms for the calculations, by reducing the on-disk data access and taking the GDS data structure into account.

~

**Table 1: Key S3 methods implemented in the SCArray.sat package.**

| **Methods**               | **Description**           | **Note**                   |
|:--------------------------|:--------------------------|:---------------------------|
|GetAssayData.SCArrayAssay()         |Accessor function for ‘SCArrayAssay’ objects  |    |
|SetAssayData.SCArrayAssay           |Setter functions for ‘Assay’ objects          |    |
|NormalizeData.SC_GDSMatrix()        |Normalize raw count data | Store a DelayedMatrix  |
|ScaleData.SC_GDSMatrix()            |Scale and center the normalized data |    |
|FindVariableFeatures.SC_GDSMatrix() |Identifies features |     |
|RunPCA.SC_GDSMatrix()               |Run a PCA dimensionality reduction |       |

*SC_GDSMatrix: GDS- and single-cell- specific DelayedMatrix.*

~

~

# Installation

* Requires [SCArray](http://www.bioconductor.org/packages/SCArray/) (≥ v1.7.13), [SeuratObject](https://cran.r-project.org/package=SeuratObject) (≥ v4.0), [Seurat](https://cran.r-project.org/package=Seurat) (≥ v4.0)

* Bioconductor repository

To install this package, start R and enter:
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SCArray.sat")
```


# Examples

## Small Datasets

```{r}
# Load the packages
suppressPackageStartupMessages({
    library(Seurat)
    library(SCArray)
    library(SCArray.sat)
})

# Input GDS file with raw counts
fn <- system.file("extdata", "example.gds", package="SCArray")

# show the file contents
(f <- scOpen(fn))
scClose(f)    # close the file


# Create a Seurat object from the GDS file
d <- scNewSeuratGDS(fn)
class(GetAssay(d))    # SCArrayAssay, derived from Assay

d <- NormalizeData(d)
d <- FindVariableFeatures(d, nfeatures=500)
d <- ScaleData(d)
```

**Let's check the internal data matrices,**
```{r}
# get the file name for the on-disk object
scGetFiles(d)

# raw counts
m <- GetAssayData(d, slot="counts")
scGetFiles(m)    # the file name storing raw count data
m

# normalized expression
# the normalized data does not save in neither the file nor the memory
GetAssayData(d, slot="data")

# scaled and centered data matrix
# in this example, the scaled data does not save in neither the file nor the memory
GetAssayData(d, slot="scale.data")
```

**Perform PCA and UMAP:**
```{r fig.wide=TRUE}
d <- RunPCA(d, ndims.print=1:2)
DimPlot(d, reduction="pca")

d <- RunUMAP(d, dims=1:50)    # use all PCs calculated by RunPCA()
DimPlot(d, reduction="umap")
```

~

## Large Datasets

Let's download a large single-cell RNA-seq dataset from Bioconductor, and convert it to a GDS file. This step will take a while.

If the `TENxBrainData` package is not installed, run:
```R
# install a Bioconductor package
BiocManager::install("TENxBrainData")
```

Then,
```R
library(TENxBrainData)
library(SCArray)

# scRNA-seq data for 1.3 million brain cells from E18 mice (10X Genomics)
# the data will be downloaded automatically at the first time.
# raw count data is stored in HDF5 format
tenx <- TENxBrainData()
rownames(tenx) <- rowData(tenx)$Ensembl  # since rownames(tenx)=NULL

# save it to a GDS file
SCArray::scConvGDS(tenx, "1M_sc_neurons.gds")
```
After the file conversion, users can use this GDS file with Seurat to analyze the data.

~

~

# Benchmarks

## Test Datasets

The datasets used in the benchmark (Table 2) were generated from the 1.3 million brain cells, with the following R codes:
```R
library(SCArray)
library(SCArray.sat)

sce <- scExperiment("1M_sc_neurons.gds")  # load the full 1.3M cells

# D100 dataset
scConvGDS(sce[, 1:1e5], "1M_sc_neurons_d100.gds")  # save to a GDS
# in-memory Seurat object
obj <- scMemory(scNewSeuratGDS("1M_sc_neurons_d100.gds"))
saveRDS(obj, "1M_sc_neurons_d100_seuratobj.rds")  # save to a RDS

# D250 dataset
scConvGDS(sce[, 1:2.5e5], "1M_sc_neurons_d250.gds")
obj <- scMemory(scNewSeuratGDS("1M_sc_neurons_d250.gds"))
saveRDS(obj, "1M_sc_neurons_d250_seuratobj.rds")

# D500 dataset
scConvGDS(sce[, 1:5e5], "1M_sc_neurons_d500.gds")
obj <- scMemory(scNewSeuratGDS("1M_sc_neurons_d500.gds"))
saveRDS(obj, "1M_sc_neurons_d500_seuratobj.rds")

# Dfull dataset
scConvGDS(sce, "1M_sc_neurons_dfull.gds")
```

**Table 2: Datasets in the benchmarks.**

| **Dataset**  | **# of features**  | **# of cells**  | **GDS file**  | **RDS (Seurat Object)**  |
|:-------------|:-------------------|:----------------|:--------------|:-------------------------|
|D100   |27,998  | 100K        | 209MB  | 419MB  |
|D250   |27,998  | 250K        | 529MB  | 1.1GB  |
|D500   |27,998  | 500K        | 1.1GB  | 2.2GB  |
|Dfull  |27,998  | 1.3 million | 2.8GB  | Out of the limit of sparse matrix |

*the number of non-zeros should be < 2^31 in a sparse matrix.*

~

## R Codes in the Benchmark

The following R script is used in the benchmark for testing GDS files, and the R codes for testing the Seurat Object are similar except the input file.
```R
suppressPackageStartupMessages({
    library(Seurat)
    library(SCArray.sat)
})

# the input GDS file can be for d250, d500, dfull
fn <- "1M_sc_neurons_d100.gds"
d <- scNewSeuratGDS(fn)

d <- NormalizeData(d)
d <- FindVariableFeatures(d, nfeatures=2000)  # using the default
d <- ScaleData(d)

d <- RunPCA(d)
d <- RunUMAP(d, dims=1:50)

saveRDS(d, "d100.rds")    # or d250.rds, d500.rds, dfull.rds

q('no')
```

~

## Memory Usage and Elapsed Time

With large test datasets, the SCArray.sat package significantly reduces the memory usages compared with the Seurat package, while the in-memory implementation in Seurat is only 2 times faster than SCArray.sat. When the full dataset "Dfull" was tested, Seurat failed to load the data since the number of non-zeros is out of the limit of sparse matrix.

```{r benchmark, echo=FALSE, fig.cap="The benchmark on PCA & UMAP with large datasets (CPU: Intel Xeon Gold 6248 @2.50GHz, RAM: 176GB).", fig.wide=TRUE}
knitr::include_graphics("benchmark.svg")
```

~

~

# Miscellaneous

## Save SCArrayAssay

The Seurat object with `SCArrayAssay` can be directly saved to a RDS (R object) file, in which the raw counts in the GDS file is not stored in the RDS file. This can avoid data duplication, and is helpful for faster meta data loading. Please keep the GDS and RDS files in the same directory or the same relative paths. The R object can be reloaded later in another R session, and GDS files are reopened internally when accessing the count data.

```{r}
d    # the example for the small dataset

save_fn <- tempfile(fileext=".rds")  # or specify an appropriate location
save_fn
saveRDS(d, save_fn)  # save to a RDS file

remove(d)  # delete the variable d
gc()       # trigger a garbage collection

d <- readRDS(save_fn)  # load from a RDS file
d
GetAssayData(d, slot="counts")  # reopens the GDS file automatically
```

~

## Multicore or Multi-process Implementation

The multicore and multi-process features are supported by SCArray and SCArray.sat via the Bioconductor package "BiocParallel". To enable the parallel feature, users can use the function `setAutoBPPARAM()` in the DelayedArray package to setup multi-process workers. For examples,
```R
library(BiocParallel)

DelayedArray::setAutoBPPARAM(MulticoreParam(4))  # 4 child processes
```


~

## Downgrade SCArrayAssay

The `SCArrayAssay` object can be downgraded to the regular `Assay`. It is useful when the downstream functions or packages do not support DelayedArray. 

```{r}
is(GetAssay(d))

new_d <- scMemory(d)  # downgrade the active assay
is(GetAssay(new_d))
```
If users only want to 'downgrade' the scaled data matrix, then
```{r}
is(GetAssayData(d, slot="scale.data"))  # it is a DelayedMatrix

new_d <- scMemory(d, slot="scale.data")  # downgrade "scale.data" in the active assay
is(GetAssay(new_d))  # it is still SCArrayAssay
is(GetAssayData(new_d, slot="scale.data"))  # in-memory matrix
```

~

## Conversion from Seurat to SingleCellExperiment

A Seurat object with `SCArrayAssay` can be converted to a Bioconductor `SingleCellExperiment` object using `as.SingleCellExperiment()` in the Seurat package. The DelayedMatrix in `SCArrayAssay will be saved in the new SingleCellExperiment object. For example,

```{r}
is(d)

sce <- as.SingleCellExperiment(d)
is(sce)
sce

counts(sce)  # raw counts
```

~

## List of supported functions

Not all of the functions in the Seurat package can be applied to the `SCArrayAssay` object. Here is the list of currently supported and unsupported functions we have tested. The unsupported methods maybe available on request in the future release of SCArray.sat. Note that the supported states may depend on the package versions of Seurat and SeuratObject, and SeuratObject_4.1.3 and Seurat_4.3.0 were tested here.

**Table 3: The states of functions and methods with the support of SCArrayAssay.**

|State|Functions              |Description                             |Notes                      |
|:---:|:----------------------|:---------------------------------------|:--------------------------|
| ✓   |CreateSeuratObject()   |  |  |
| ✓   |FindVariableFeatures() |Identifies the top features |  |
| ✓   |NormalizeData()        |Normalize the count data |  |
| ✓   |RunPCA()               |Principal component analysis |  |
| ✓   |ScaleData()            |Scales and centers features |  |
| ☑   |FindMarkers()          |Differentially expressed genes | data read via blocking |
| ☑   |FoldChange()           |  |  |
| ☑   |RunICA()               |Independent component analysis |  |
| ☑   |RunSPCA()              |Supervised principal component analysis |  |
| ☑   |RunLDA()               |Linear discriminant analysis |  |
| ☑   |RunSLSI()              |Supervised latent semantic indexing |  |
| ☑   |RunUMAP()              |Uniform manifold approx. and projection |  |
| ⦿   |FindNeighbors          |Nearest-neighbor graph construction |  |
| ⦿   |HVFInfo()              |Info for highly variable features |  |
| ⦿   |RunTSNE()              |t-SNE dimensionality reduction |  |
| ⦿   |ProjectDim()           |  |  |
| ⦿   |ProjectUMAP()          |  |  |
| ⦿   |SVFInfo()              |Info for spatially variable features |  |
| ⦿   |VariableFeatures()     |Get/set variable feature information |  |
| ✗   |CreateAssayObject()    |  | use CreateAssayObject2() instead |
| ✗   |as.Seurat()            |  | planned |
| ✗   |RunCCA()               |Canonical correlation analysis | planned |
| ✗   |SCTransform()          |Normalization via regularized NB regression | planned |

* ✓ Supported (implemented in SCArray.sat, optimized for memory)
* ☑ Supported (mainly relying on the implementation in Seurat)
* ⦿ Supported (implemented in Seurat, not in SCArray.sat)
* ✗ Unsupported (raising an error)


~

## Debugging

`options(SCArray.verbose=TRUE)` is used to enable displaying debug information when calling the functions in the SCArray and SCArray.sat packages. For example,

```{r}
options(SCArray.verbose=TRUE)

d <- ScaleData(d)
```

~

~

# Session Information

```{r}
# print version information about R, the OS and attached or loaded packages
sessionInfo()
```

```{r echo=FALSE}
unlink("test.rds", force=TRUE)
```