---
title: "Introduction to Spline-DV"
author: 
- name: Shreyan Gupta
  affiliation: Veterinary Integrative Biosciences, Texas A&M University
- name: Victoria Gatlin
  affiliation: Veterinary Integrative Biosciences, Texas A&M University
- name: Selim Romero
  affiliation: Department of Nutrition, Texas A&M University
- name: James J Cai
  affiliation: Veterinary Integrative Biosciences, Texas A&M University
abstract: "This tutorial shows how to use **Spline-DV** for Differential Variability
  Analysis across two single-cell RNA seq samples and **Spline-HVG** for highly variable
  feature selection"
output:
  BiocStyle::html_document:
    toc: true
    number_sections: true
  html_notebook:
    toc: true
  pdf_document:
    toc: true
    
vignette: >
  %\VignetteIndexEntry{Introduction to Spline-DV}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo=TRUE)
```

# Introduction

## Beyond averages gene expression changes

One of the most intuitive ways to evaluate a gene expression change is using Differential Expression (DE) analysis. Traditionally, DE analysis focuses on identifying genes that are up- or down-regulated (increased or decreased expression) between conditions, typically employing a basic mean-difference approach. We propose a paradigm shift that acknowledges the central role of gene expression variability in cellular function and challenges the current dominance of mean-based DE analysis in single-cell studies. We suggest that scRNA-seq data analysis should embrace the role of inherent gene expression variability in defining cellular function and move beyond mean-based approaches.

# Installation

```{r Installation, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install("Xenon8778/SplineDV")
```

# Import libraries

For this vignette, we'll use SplineDV in conjunction with the scRNAseq package. While standard data preprocessing can be applied beforehand, the package itself handles quality control internally, requiring only a raw count matrix as input.

```{r Importlibraries, message=FALSE}
library(scRNAseq)
library(SplineDV)
```

# Overview of the SplineDV workflow

## Input - scRNA-seq expression matrix from two condtions

Spline-DV is designed to compare expression variability between two experimental conditions. To illustrate this, we'll utilize a publicly available mouse lung single-cell RNA-seq dataset from the scRNAseq Bioconductor package by Zilionis et al. [1]. This dataset includes tumor-infiltrating myeloid cells from both tumor and healthy conditions.

```{r LoadingExampleData}
sce <- fetchDataset('zilionis-lung-2019','2023-12-20', path="mouse")
sce
```
Let us see the number of cells per condition and and the available cell types in the scRNA-seq dataset.
```{r MajorCellTypes}
table(sce$`Major cell type`,sce$`Tissue`)
```

The expression matrix should be formatted with the genes/features as rows and cells as columns. The two input matrices for splineDV must be stored as two separate count expression matrices or SingleCellExperiment objects. We can isolate the two experimental conditions. To focus solely on changes within a specific cell type and eliminate potential variability arising from shifts in cell type distribution, we select only Neutrophils for further analysis.

```{r SplittingExperimentalConditions}
# Extract Healthy data
healthyCount <- sce[, sce$Tissue == "healthy"]

# Extract Tumor data
tumorCount <- sce[, sce$Tissue == "tumor"]

# Select a specific cell type - Neutrophils
healthyCount <- healthyCount[, which(healthyCount$`Major cell type` == "Neutrophils")]
print(healthyCount)

tumorCount <- tumorCount[, which(tumorCount$`Major cell type` == "Neutrophils")]
print(tumorCount)
```

Given our isolation of a single cell type, any observed changes in gene expression variability should be attributed to experimental factors. To mitigate the potential impact of background distribution shifts caused by varying cell type and state proportions, we recommend conducting Spline-DV analysis in a cell state/type-specific manner.

## Running Spline-DV

The main function of SplineDV ios the splineDV function. For the analysis, the test data (X) is always use in contrast with the control data (Y). We use smaller QC parameters for the small example data sets to preserve enough cells and genes. **We recommend using the default QC parameters for large data sets.**

```{r runSplineDV}
dvRes <- splineDV(X = tumorCount, Y = healthyCount)
head(dvRes)
```

The output is a DataFrame containing statistical measures for each gene in the differential variability (DV) analysis. Genes with large vectorDist values exhibit substantial changes in expression variability between the two experimental conditions. Genes with FDR values below 0.05 are considered significantly differentially variable. The Direction column indicates whether the variability increased or decreased. By leveraging this information, we can identify biologically relevant genes that not only demonstrate shifts in mean expression but also alterations in expression variability across conditions.

## Visualize Gene Expression statistics

To visualize the differential variability of genes, we can employ a 3D scatter plot. The `DVPlot` function allows us to visualize the computed spline for each condition, represented by a distinct color, along with the distance vectors of genes from their respective spline. This plot provides a clear representation of the shift in expression variability between the two conditions.

```{r plotDV, out.width="75%", out.height="75%", fig.format='png'}
fig <- DVPlot(dvRes, targetgene='Spp1')
fig 
```

# Highly Variable Genes (HVGs) using Spline-HVG

Next, we'll introduce the Spline HVG algorithm, a feature selection technique designed to identify Highly Variable Genes (HVGs) in scRNA-seq data. This algorithm computes the distance from the spline to identify genes with significant variability. To demonstrate this, we'll utilize the same sample dataset of healthy neutrophils from mice lungs used in the previous section.

## Input - scRNA-seq Expression matrix

```{r loadHVGExampleData}
# Healthy neutrophils data
print(healthyCount)
```

## Running Spline-HVG

The expression matrix should be formatted with the genes/features as rows and cells as columns. The input matrix for `splineHVG` must be stored as a count expression matrices or SingleCellExperiment objects. To focus solely on changes within a specific cell type and eliminate potential variability arising from shifts in cell type distribution, we select only Neutrophils for further analysis. Smaller QC parameters can be used for the small data sets (e.g., cells < 500 and genes < 5000) to preserve enough cells and genes for splineHVG analysis. However, **we recommend using the default QC parameters for large data sets**, if not more stringent.

```{r runSplineHVG}
HVGRes <- splineHVG(healthyCount, nHVGs = 200)
head(HVGRes)
```

The output is a DataFrame containing statistical measures for each gene in the Spline HVG analysis. Genes with large Distance values exhibit significant deviation from expected behavior, as represented by the 3D spline. The top n HVGs can be identified by sorting the DataFrame by Distance in descending order. These highly variable genes, which are biologically relevant, can be utilized in various downstream analyses, such as for dimensional reduction and PCA.

```{r showHVGList}
# Extracting HVG Gene list
HVGList <- rownames(HVGRes)[HVGRes$HVG == TRUE]
head(HVGList)
```

## Visualize Highly Variable Genes

To visualize the highly variable genes, we can employ a 3D scatter plot. The `HVGPlot` function allows us to visualize the computed spline, which represents the expected gene behavior. Each dot on the plot represents a single gene. This plot provides a clear representation of HVGs (highlighted in red), which deviate significantly from the 3D spline, indicating a substantial shift in their gene expression.

```{r plotHVG, out.width="75%", out.height="75%"}
fig <- HVGPlot(HVGRes)
fig
```

# Conclusion

SplineDV is a valuable tool for single-cell RNA sequencing (scRNA-seq) analysis, specifically designed to identify differential gene expression variability between conditions. By examining changes in expression dispersion, SplineDV complements traditional differential expression analysis methods. We recommend using SplineDV in conjunction with other tools from Seurat and Bioconductor to gain a more comprehensive understanding of biological processes.

# Appendix

## Citing SplineDV

```{r "citation"}
citation("SplineDV")
```

## References

1.  Zilionis, R., Engblom, C., Pfirschke, C., Savova, V., Zemmour, D., Saatcioglu, H. D., Krishnan, I., Maroni, G., Meyerovitz, C. V., Kerwin, C. M., Choi, S., Richards, W. G., De Rienzo, A., Tenen, D. G., Bueno, R., Levantini, E., Pittet, M. J., & Klein, A. M. (2019). Single-Cell Transcriptomics of Human and Mouse Lung Cancers Reveals Conserved Myeloid Populations across Individuals and Species. Immunity, 50(5), 1317–1334.e10. <https://doi.org/10.1016/j.immuni.2019.03.009>

# sessionInfo

This is the output of `sessionInfo()` on the system on which this document was compiled:

```{r}
date()
sessionInfo()
```