---
title: "EasyCellType: an example workflow"
output: 
  BiocStyle::html_document:
    toc: true
    number_sections: false
    
bibliography: references.bib 
vignette: >
  %\VignetteIndexEntry{my-vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## 1. Introduction
The `EasyCellType` package was designed to examine an input marker list using 
the databases and provide annotation recommendations in graphical outcomes. 
The package refers to 3 public available marker gene data bases, 
and provides two approaches to conduct the annotation anaysis: 
gene set enrichment analysis(GSEA) and a modified Fisher's exact test.
The package has been submitted to `bioconductor` to achieve an easy access for researchers.  

This vignette shows a simple workflow illustrating how EasyCellType package works. 
The data set that will be used throughout the example is freely available from 
10X Genomics.

### Installation
The package can be installed using `BiocManager` by the following commands
```{r setup, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("EasyCellType")
```

Alternatively, the package can also be installed using `devtools` and launched by
```{r, results=FALSE, warning=FALSE, message=FALSE}
library(devtools)
install_github("rx-li/EasyCellType")
```

After the installation, the package can be loaded with
```{r, results=FALSE, warning=FALSE, message=FALSE}
library(EasyCellType)
```

## 2. Example workflow
We use the Peripheral Blood Mononuclear Cells (PBMC) data freely available from 10X Genomics. 
The data can be downladed from
https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz.
After downloading the data, it can be read using function `Read10X`. 

We have included the data in our package, which can be loaded with
```{r, results=FALSE, warning=FALSE, message=FALSE}
data(pbmc_data)
```

We followed the standard workflow provided by `Seurat` package[@seurat] to process the PBMC data set. 
The detailed technical explanations can be found in
https://satijalab.org/seurat/articles/pbmc3k_tutorial.html.
```{r, results=FALSE, warning=FALSE, message=FALSE}
library(Seurat)
# Initialize the Seurat object
pbmc <- CreateSeuratObject(counts = pbmc_data, project = "pbmc3k", min.cells = 3, min.features = 200)
# QC and select samples
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
# Normalize the data
pbmc <- NormalizeData(pbmc)
# Identify highly variable features
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Scale the data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
# Perfom linear dimensional reduction
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Cluster the cells
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
# Find differentially expressed features
markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
```

Now we get the expressed markers for each cluster. 
We then convert the gene symbols to Entrez IDs.

```{r, results=FALSE, warning=FALSE, message=FALSE}
library(org.Hs.eg.db)
library(AnnotationDbi)
markers$entrezid <- mapIds(org.Hs.eg.db,
                           keys=markers$gene, #Column containing Ensembl gene ids
                           column="ENTREZID",
                           keytype="SYMBOL",
                           multiVals="first")
markers <- na.omit(markers)
```

In case the data is measured in mouse, we would replace the package `org.Hs.eg.db` 
with `org.Mm.eg.db` and do the above analysis.

The input for `EasyCellType` package should be a data frame containing Entrez IDs, 
clusters and expression scores. The order of columns should follow this rule.
In each cluster, the gene should be sorted by the expression score.
```{r, results=FALSE, warning=FALSE, message=FALSE}
library(dplyr)
markers_sort <- data.frame(gene=markers$entrezid, cluster=markers$cluster, 
                      score=markers$avg_log2FC) %>% 
  group_by(cluster) %>% 
  mutate(rank = rank(score),  ties.method = "random") %>% 
  arrange(desc(rank)) 
input.d <- as.data.frame(markers_sort[, 1:3])
```

We have include the processed data in the package. It can be loaded with
```{r, results=FALSE, warning=FALSE, message=FALSE}
data("gene_pbmc")
input.d <- gene_pbmc
```


Now we can call the `annot` function to run annotation analysis.
```{r, results=FALSE, warning=FALSE, message=FALSE}
annot.GSEA <- easyct(input.d, db="cellmarker", species="Human", 
                    tissue=c("Blood", "Peripheral blood", "Blood vessel",
                      "Umbilical cord blood", "Venous blood"), p_cut=0.3,
                    test="GSEA")
```

We used the GSEA approach to do the annotation. In our package, we use `GSEA`
function in `clusterProfiler` package[@clusterprofiler] to conduct the enrichment analysis.
You can replace 'GSEA' with 'fisher' if you would like to use Fisher exact test 
to do the annotation.
The candidate tissues can be seen using `data(cellmarker_tissue)`, 
`data(clustermole_tissue)` and `data(panglao_tissue)`.

The dot plot showing the overall annotation results can be created by
```{r, results=FALSE, warning=FALSE, message=FALSE}
plot_dot(test="GSEA", annot.GSEA)
```

Bar plot can be created by
```{r, results=FALSE, warning=FALSE, message=FALSE, fig.show='hide'}
plot_bar(test="GSEA", annot.GSEA)
```

```{r}
sessionInfo()
```

## 3. Reference