---
title: Using escape to perform ssGSEA analyses on single-cell RNA-seq data
author: 
- name: Nick Borcherding
  email: ncborch@gmail.com
  affiliation: Washington University in St. Louis, School of Medicine, St. Louis, MO, USA
- name: Jared Andrews
  email: jared.andrews07@gmail.com
  affiliation: Washington University in St. Louis, School of Medicine, St. Louis, MO, USA

date: "January 29, 2021"

output:
  BiocStyle::html_document:
    toc_float: true
package: escape
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Escape-ingToWork}
  %\VignetteEncoding{UTF-8} 
---

```{r, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
library(BiocStyle)
```

# Loading Processed Single-Cell Data


For the demonstration of *escape*, we will use the example "pbmc_small" data from *Seurat* and also generate a `SingleCellExperiment` object from it.


```{r}
suppressPackageStartupMessages(library(escape))
suppressPackageStartupMessages(library(dittoSeq))
suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(SeuratObject))
pbmc_small <- get("pbmc_small")
pbmc_small <- DietSeurat(suppressMessages(UpdateSeuratObject(pbmc_small)))
sce <- as.SingleCellExperiment(pbmc_small, assay = "RNA")
```

# Getting Gene Sets

# Getting Gene Sets

## Option 1: Molecular Signture Database

The first step in the process of performing gene set enrichment analysis is identifying the gene sets we would like to use. The function `getGeneSets()` allows users to isolate a whole or multiple libraries from a list of GSEABase GeneSetCollection objects. We can do this for gene set collections from the built-in [Molecular Signature Database](https://www.gsea-msigdb.org/gsea/msigdb/search.jsp) by setting the parameter **library** equal to library/libraries of interest. For multiple libraries, just set **library = c("Library1", "Library2", etc)**. 

In Addition:    
- Individual pathways/gene sets can be isolated from the libraries selected, by setting **gene.sets** = the name(s) of the gene sets of interest.   
- Subcategories of the invidual libaries can be selected using the **subcategory** parameter.   

If the sequencing of the single-cell data is performed on a species other than "Homo sapiens", make sure to use the **species** parameter in `getGeneSets()` in order to get the correct gene nomenclature.

```{r}
GS.hallmark <- getGeneSets(library = "H")
```

## Option 2: Built-In gene sets

```{r, eval = FALSE}
data("escape.gene.sets", package="escape")
gene.sets <- escape.gene.sets
```


## Option 3: Define personal gene sets

```{r, eval=FALSE}
gene.sets <- list(Tcell_signature = c("CD2","CD3E","CD3D"),
			Myeloid_signature = c("SPI1","FCER1G","CSF1R"))
```

# Enrichment

The next step is performing the enrichment on the RNA count data. The function `enrichIt()` can handle either a matrix of raw count data or will pull that data directly from a SingleCellExperiment or Seurat object. The **gene.sets** parameter in the function is the GeneSets, either generated from `getGeneSets()` or from the user. The enrichment scores will be calculated across all individual cells and **groups** is the n size to break the enrichment by while the **cores** is the number of cores to perform in parallel during the enrichment calculation too. 

`enrichIt()` can utilize two distinct methods for quantification using the **method** parameter - either the *"ssGSEA"* method described by [Barbie et al 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2783335/) or *"UCell"* described by [Andreatta and Carmona 2021](https://pubmed.ncbi.nlm.nih.gov/34285779/).

To prevent issues with enrichment calculations for gene sets with a low number of genes represented in the count data of a single-cell object - **min.size** was instituted to remove gene sets with less than the indicated genes. 

**Important Note:** This is computationally intensive and is highly dependent on the number of cells and the number of gene sets included.

```{r}
ES.seurat <- enrichIt(obj = pbmc_small, 
                      gene.sets = GS.hallmark, 
                      groups = 1000, cores = 2, 
                      min.size = 5)

ES.sce <- enrichIt(obj = sce, 
                   gene.sets = GS.hallmark, 
                   method = "UCell",
                   groups = 1000, cores = 2, 
                   min.size = 5)
```

We can then easily add these results back to our Seurat or SCE object.

```{r}
## if working with a Seurat object
pbmc_small <- Seurat::AddMetaData(pbmc_small, ES.seurat)

## if working with an SCE object
met.data <- merge(colData(sce), ES.sce, by = "row.names", all=TRUE)
row.names(met.data) <- met.data$Row.names
met.data$Row.names <- NULL
colData(sce) <- met.data
```

****

# Visualizations

The easiest way to generate almost any visualization for single cell data is via `r Biocpkg("dittoSeq")`, which is an extremely flexible visualization package for both bulk and single-cell RNA-seq data that works very well for both expression data and metadata. Better yet, it can handle both SingleCellExperiment and Seurat objects.

To keep things consistent, we'll define a pleasing color scheme.
```{r}
colors <- colorRampPalette(c("#0D0887FF","#7E03A8FF","#CC4678FF","#F89441FF","#F0F921FF"))
```


## Heatmaps

A simple way to approach visualizations for enrichment results is the heatmap, especially if you are using a number of gene sets or libraries. 

```{r}
dittoHeatmap(pbmc_small, genes = NULL, metas = names(ES.seurat), 
             annot.by = "groups", 
             fontsize = 7, 
             cluster_cols = TRUE,
             heatmap.colors = colors(50))
```

A user can also produce a heatmap with select gene sets by providing specific names to the **metas** parameter. For example, we can isolated gene sets involved immune response.

```{r}
dittoHeatmap(sce, genes = NULL, 
             metas = c("HALLMARK_IL2_STAT5_SIGNALING", 
                       "HALLMARK_IL6_JAK_STAT3_SIGNALING", 
                       "HALLMARK_INFLAMMATORY_RESPONSE"), 
             annot.by = "groups", 
             fontsize = 7,
             heatmap.colors = colors(50))
```

## Violin Plots

Another way to visualize a subset of gene set enrichment would be to graph the distribution of enrichment using violin, jitter, boxplot, or ridgeplots. We can also compare between categorical variables using the **group.by** parameter.

```{r}
multi_dittoPlot(sce, vars = c("HALLMARK_IL2_STAT5_SIGNALING", 
                       "HALLMARK_IL6_JAK_STAT3_SIGNALING", 
                       "HALLMARK_INFLAMMATORY_RESPONSE"), 
                group.by = "groups", plots = c("jitter", "vlnplot", "boxplot"), 
                ylab = "Enrichment Scores", 
                theme = theme_classic() + theme(plot.title = element_text(size = 10)))
```


## Hex Density Enrichment Plots

We can also compare the distribution of enrichment scores of 2 distinct gene sets across all single cells using the `dittoScatterHex()` function. Here, we use our SingleCellExperiment object with results of `enrichIt()` and specify gene sets to the **x.var** and **y.var** parameters to produce a density plot. We can also add contours to the plot, by passing **do.contour = TRUE**. 

```{r}
dittoScatterHex(sce, x.var = "HALLMARK_IL2_STAT5_SIGNALING", 
                    y.var = "HALLMARK_IL6_JAK_STAT3_SIGNALING", 
                    do.contour = TRUE) + 
        scale_fill_gradientn(colors = colors(11)) 
```

We can also separate the graph using the **split.by** parameter, allowing for the direct comparison of categorical variables.

```{r}
dittoScatterHex(sce, x.var = "HALLMARK_IL2_STAT5_SIGNALING", 
                    y.var = "HALLMARK_IL6_JAK_STAT3_SIGNALING", 
                do.contour = TRUE,
                split.by = "groups")  + 
        scale_fill_gradientn(colors = colors(11)) 
```

## Ridge Plots

Another distribution visualization is using a [Ridge Plot](https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html) from the ggridges R package. This allows the user to incorporate categorical variables to seperate the enrichment scores along the y-axis in addition to the faceting by categorical variables. 

Like above, we can explore the distribution of the **"HALLMARK_DNA_REPAIR"** gene set between groups by calling `ridgeEnrichment()` with the ES2 object. We specify **group = letters.idents**, which will separate groups on the y-axis. We can also add a rug plot (**add.rug = TRUE**) to look at the discrete sample placement along the enrichment ridge plot. 

```{r}
## Seurat object example
ES2 <- data.frame(pbmc_small[[]], Idents(pbmc_small))
colnames(ES2)[ncol(ES2)] <- "cluster"

## plot
ridgeEnrichment(ES2, gene.set = "HALLMARK_IL2_STAT5_SIGNALING", group = "cluster", add.rug = TRUE)
```

In addition to the separation of **letter.idents**, we can also use `ridgeEnrichment()` for better granularity of multiple variables. For example, instead of looking at the difference just between "Type", we can set **group = "cluster"** and then **facet = "letter.idents"**. This gives a visualization of the enrichment of DNA Repair by cluster and Type.

```{r}
ridgeEnrichment(ES2, gene.set = "HALLMARK_IL2_STAT5_SIGNALING", group = "cluster", 
                facet = "letter.idents", add.rug = TRUE)
```

## Split Violin Plots

Another distribution visualization is a violin plot, which we seperate and directly compare using a binary classification. Like `ridgeEnrichment()`, this allows for greater use of categorical variables. For `splitEnrichment()`, the output will be two halves of a violin plot bassed on the `split` parameter with a central boxplot with the relative distribution accross all samples. 

Like above, we can explore the distribution of the **"HALLMARK_DNA_REPAIR"** gene set between groups by calling `splitEnrichment()` with the ES2 object and **split = "groups"**. We can also explore the cluster distribution by assigning the **x-axis = "cluster"**.

```{r}
splitEnrichment(ES2, split = "groups", gene.set = "HALLMARK_IL2_STAT5_SIGNALING")
splitEnrichment(ES2, x.axis = "cluster", split = "groups", gene.set = "HALLMARK_IL2_STAT5_SIGNALING")
```


## Enrichment Plots

New to the dev version of the escape is `enrichmentPlot()` that takes the single-cell expression object (**pbmc_small**) and calculates mean rank order for a gene set across groups. The function requires the name of the specific gene set (**gene.set**) and the library of gene sets (**gene.sets**) to extract the rank positions from the count data stored in the Seurat or single-cell expression object. 

```{r}
enrichmentPlot(pbmc_small, 
               gene.set = "HALLMARK_IL2_STAT5_SIGNALING",
               gene.sets = GS.hallmark,
               group = "groups")
```

***

# Expanded Analysis

One frustration of Gene Set Enrichment is trying to make sense of the values. In order to move away from just selecting pathways that may be of interest, escape offers the ability to `performPCA()` on the enrichment scores. Like the other functions, we will need to provide the output of `enrichIt()` to the **enriched** parameter and the **groups** to include for later graphing.

```{r}
PCA <- performPCA(enriched = ES2, gene.sets = names(GS.hallmark), groups = c("groups", "cluster"))
```

```{r}
pcaEnrichment(PCA, PCx = "PC1", PCy = "PC2", contours = TRUE)
```

We can also look at the `pcaEnrichment()` output separated by categorical factors using the **facet** parameter, for example using the cluster assignment.

```{r}
pcaEnrichment(PCA, PCx = "PC1", PCy = "PC2", contours = FALSE, facet = "cluster") 
```

We can more closely examine the construction of the PCA by looking at the contribution of each gene set to the respective principal component using `masterPCAPlot()` with the same input as above with the `pcaEnrichment()`. We can also control the number of gene sets plotted with **top.contribution**. 

```{r}
masterPCAPlot(ES2, gene.sets = names(GS.hallmark), PCx = "PC1", PCy = "PC2", top.contribution = 10)
```


## Signficance

We can also look for significant differences between groups of variables using `getSignificance()`. For this, we need to assign a **group** parameter and the type of **fit** including: T.test, logistic regression (LR), Wilcoxon Rank Sum Test (Wilcoxon), ANOVA, and Kruskal-Wallis (KW) test. `getSignificance()` will pull any numeric values, to ensure only gene sets, use the **gene.sets** parameter, similar to the PCAplot functions above. 

Returned is a test statistic, raw p value, FDR value , and the median values for each group. In addition, ANOVA and the Kruskal-Wallis test will automatically return the corrected p-values for each comparison in the group. 


```{r}
output <- getSignificance(ES2, 
                          group = "groups", 
                          gene.sets = names(ES.seurat),
                          fit = "T.test")
```

***

# Support

If you have any questions, comments or suggestions, feel free to visit the [github repository](https://github.com/ncborcherding/escape) or [email me](mailto:ncborch@gmail.com).

```{r}
sessionInfo()
```