---
title: "LRcell: Differential **cell** type change analysis using **L**ogistic/linear **R**egression."
shorttitle: "LRcell Vignette"
author: 
- name: Wenjing Ma
  affiliation: Computer Science and Informatics, Emory University 
  email: wenjing.ma@emory.edu
- name: Dr. Zhaohui S. Qin
  affiliation: Department of Biostatistics and Bioinformatics, Emory University 
date: "`r Sys.Date()`"
abstract: >
  The goal of LRcell is to identify specific sub-cell types that drives the changes 
  observed in a bulk RNA-seq differential gene expression experiment. To achieve this, 
  LRcell utilizes sets of cell marker genes acquired from single-cell RNA-sequencing (scRNA-seq) 
  as indicators for various cell types in the tissue of interest. Next, for each cell type, 
  using its marker genes as indicators, we apply Logistic/Linear Regression on the complete 
  set of genes with differential expression p-values to calculate a cell-type significance p-value. 
  Finally, these p-values are compared to predict which one(s) are likely to be responsible 
  for the differential gene expression pattern observed in the bulk RNA-seq experiments. 
  LRcell is inspired by the LRpath[@sartor2009lrpath] algorithm developed by Sartor et al., originally designed for pathway/gene set enrichment analysis. LRcell contains three major components: LRcell analysis, plot generation and marker gene selection. All modules in this package are written in R. This package also provides marker genes in the Prefrontal Cortex (pFC) human brain region and  nine mouse brain regions (Frontal Cortex, Cerebellum, Globus Pallidus, Hippocampus, Entopeduncular, Posterior Cortex, Striatum, Substantia Nigra and Thalamus).
output:
  BiocStyle::html_document
bibliography: library.bib
vignette: >
  %\VignetteIndexEntry{LRcell Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction
Single-cell RNA-sequencing (scRNA-seq) technologies have revealed cell heterogeneity. 
Marker gene expressions are considered as the most intuitive and informative measurements 
distinguishing different cell types. Although several computational methods have 
been proposed to do marker gene selection from clusters or cell types, none of them applied 
marker gene information on bulk RNA-seq experiments to find cell type or cluster enrichment. 
Here, we present LRcell package, which uses Logistic/Linear Regression to
identify the most transcriptionally enriched cell types or clusters when applying cell marker 
information to a bulk experiment with p-values measurements of differentially expressed genes.

**Pre-Loaded Marker genes information**: This package offers marker genes information in 1 human PBMC data, 1 human brain region 
and 9 mouse brain regions.

# Standard Workflow

## Installation

This is a R Bioconductor package and it can be installed by using `BiocManager`.
```{r install}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager") ## this will install the BiocManager package
BiocManager::install("LRcell")
```

To check whether LRcell package is successfully installed:
```{r setup}
library(LRcell)
```

## LRcell usage
Once we have LRcell package loaded, we can start using it to analyze the transcriptional
engagement of cell types or clusters. LRcell **takes both single-cell marker genes list 
and p-values of bulk DE genes as input** to calculate the enrichment of cell-type specific
marker genes in the ranked DE genes. 

As mentioned above, LRcell provides single-cell marker genes list in 1 human PBMC data, 1 human brain region
(Prefrontal Cortex) and 9 mouse brain regions (Frontal Cortex, Cerebellum, Globus Pallidus, 
Hippocampus, Entopeduncular, Posterior Cortex, Striatum, Substantia Nigra and Thalamus).

- The human PBMC data comes from volunteers enrolled in an HIV vaccine trial [@hao2020integrated] at time point of day 0.
- The human brain data comes from control samples in Major Depressive Disorder studies. [@nagy2020single]. 
- The mouse data comes from the whole brain single-cell RNA-seq experiments[@saunders2018molecular]. Another resource for this dataset is [DropViz](http://dropviz.org/).

The data is stored in another Bioconductor ExperimentHub package named [LRcellTypeMarkers](https://github.com/marvinquiet/LRcellTypeMarkers). Users can access the data through ExperimentHub by:

```{r}
## for installing ExperimentHub
# BiocManager::install("ExperimentHub")

## query data
library(ExperimentHub)
eh <- ExperimentHub::ExperimentHub()
eh <- AnnotationHub::query(eh, "LRcellTypeMarkers")  ## query for LRcellTypeMarkers package
eh  ## this will list out EH number to access the calculated gene enrichment scores

## get mouse brain Frontal Cortex enriched genes
enriched.g <- eh[["EH4548"]]
marker.g <- get_markergenes(enriched.g, method="LR", topn=100)
```

Users are also encouraged to process a read count matrix with cell annotation information into a gene enrichment scores matrix.
```{r eval=FALSE}
enriched.g <- LRcell_gene_enriched_scores(expr, annot, power=1, parallel=TRUE, n.cores=4)
```
Here, `expr` is a read count matrix with rows as genes and columns as cells. `annot` is a named-vector with names as cell names (which is in accordance with the column names of `expr`) and values as annotated cell types. `power` is a hyper-parameter controlling how much penalty for the proportion of cells expressing certain gene. `parallel` and `n.cores` are two parameters for executing function in parallel to accelerate the calculation.

### Directly indicate species and brain region in LRcell
Compared to processing data yourself, a much easier way is to indicate species and brain region or tissue. 
In this way, marker genes are extracted from ExperimentHub accordingly. For example, we can use mouse Frontal Cortex
marker genes to do LRcell analysis on the example bulk experiment[@swarup2019identification]. (The example contains 23, 420 genes along with p-values calculated from 
[DESeq2](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html). 
Data is processed from a mouse Alzheimer's disease model (GEO: [GSE90693](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90693)), 
which is 6 months after treatment in Frontal Cortex brain region.)

```{r example}
# load example bulk data
data("example_gene_pvals")
head(example_gene_pvals, n=5)
```

Here, we use Linear Regression:
```{r LRcell}
res <- LRcell(gene.p = example_gene_pvals,
              marker.g = NULL,
              species = "mouse",
              region = "FC",
              method = "LiR")
FC_res <- res$FC
# exclude leading genes for a better view
sub_FC_res <- subset(FC_res, select=-lead_genes)
head(sub_FC_res)
```

Plot out the result:
```{r plot_LRcell, fig.width=8, fig.height=6, dpi=120} 
plot_manhattan_enrich(FC_res, sig.cutoff = .05, label.topn = 5)
```

According to the result, when using enrichment scores as a predictor variable in Linear Regression,
one cluster of Astrocytes (FC_8-2.Astrocytes) is the mostly enriched along with Microglia (FC_11-1.Microglia). 
(Note that although cluster FC_11-4 was annotated as __unknown__ in the publication, according to our research,
FC_11-* belong to Microglia/Macrophage cell types.) Recent publications have shown that Astrocytes are involved 
in Alzheimer's Disease.

### Marker gene download and do LRcell analysis
Marker gene list downloading example (mouse, Frontal Cortex, Logistic Regression):
```{r download_marker}
library(ExperimentHub)
eh <- ExperimentHub::ExperimentHub()  ## use ExperimentHub to download data
eh <- query(eh, "LRcellTypeMarkers")
enriched_genes <- eh[["EH4548"]]  # use title ID which indicates FC region
# get marker genes for LRcell in logistic regression
FC_marker_genes <- get_markergenes(enriched_genes, method="LR", topn=100)

# to have a glance of the marker gene list
head(lapply(FC_marker_genes, head))
```

Then, we can run LRcell analysis by using `LRcellCore()` function using Logistic Regression.
```{r LRcellCore}
res <- LRcellCore(gene.p = example_gene_pvals,
           marker.g = FC_marker_genes,
           method = "LR", min.size = 5, 
           sig.cutoff = 0.05)
## curate cell types
res$cell_type <- unlist(lapply(strsplit(res$ID, '\\.'), '[', 2))
head(subset(res, select=-lead_genes))
```

Plot out the result:
```{r plot_LRcellCore, fig.width=10, fig.height=6, dpi=120} 
plot_manhattan_enrich(res, sig.cutoff = .05, label.topn = 5)
```

We can clearly find that Microglia (FC_11-1.Microglia) is the most transcriptionally enriched 
when using Logistic Regression analysis. The result is clear and sound because 
proliferation and activation of Microglia in the brain is a known feature of
Alzheimer's Disease (AD).

LRcell is used to give a hint on which cell types are potentially involved in diseases when a paired
scRNA-seq and bulk RNA-seq experiments are unavailable. Thus, LRcell can be applied to the considerable
amount of bulk DEG experiments to shed light on biological discoveries.

## Calculate gene enrichment scores from expression dataframe
The gene enrichment score calculation is based on algorithm proposed in [@marques2016oligodendrocyte], 
which takes both enrichment of genes in certain cell types and fraction of cells expressing the gene into 
consideration. If interested, please take a look at the **Supplementary Materials** in the original paper.

`LRcell_gene_enriched_scores()` function takes the gene-by-cell expression matrix and a cell-type annotation as input, 
which means cell type assignments should be done first. The columns of the expression matrix
should be in accordance with cell names in the annotation vector. 

```{r echo=FALSE}
# generate a simulated gene*cell read counts matrix
n.row <- 3; n.col <- 10
sim.expr <- matrix(0, nrow=n.row, ncol=n.col)
rownames(sim.expr) <- paste0("gene", 1:n.row)
colnames(sim.expr) <- paste0("cell", 1:n.col)

# generate a simulated annotation for cells
sim.annot <- c(rep("celltype1", 3), rep("celltype2", 3), rep("celltype3", 4))
names(sim.annot) <- colnames(sim.expr)

sim.expr['gene1', ] <- c(3, 0, 2, 8, 10, 6, 1, 0, 0, 2) # marker gene for celltype2
sim.expr['gene2', ] <- c(7, 5, 8, 1, 0, 5, 2, 3, 2, 1) # marker gene for celltype1
sim.expr['gene3', ] <- c(8, 10, 6, 7, 8, 9, 5, 8, 6, 8) # house keeping
```


Take a randomly-generated data as example.
```{r example_expr}
# print out the generated expression matrix
print(sim.expr)

# print out the cell-type annotation
print(sim.annot)
```

This toy example contains 3 genes and 10 cells. As you can tell from the matrix, __gene1__ is 
a marker gene of __celltype2__; __gene2__ is a marker gene of __celltype1__; __gene3__ is a house keeping gene.


```{r marker_gene_selection}
# generating the enrichment score 
enriched_res <- LRcell_gene_enriched_scores(expr = sim.expr,
                            annot = sim.annot, parallel = FALSE)
enriched_res
```
According to the result, it is a gene-by-celltype dataframe indicating the enrichment score of genes in different cell type.
Since __gene2__ is a marker gene of __celltype1__, thus it has the highest enrichment score.

When choosing marker genes using top 1 as threshold:
```{r}
marker_res <- get_markergenes(enriched.g = enriched_res,
                method = "LR", topn=1)
marker_res
```


# SessionInfo
```{r session_info}
sessionInfo()
```

# References