---
title: "EasyCellType: an example workflow"
output:
  BiocStyle::html_document:
    toc: true
    number_sections: false
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{my-vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## 1. Introduction

The `EasyCellType` package examines an input marker list against marker gene databases and presents annotation recommendations graphically. It draws on three publicly available marker gene databases and offers two approaches to annotation analysis: gene set enrichment analysis (GSEA) and a modified Fisher's exact test. The package has been submitted to Bioconductor to give researchers easy access.

This vignette walks through a simple workflow that shows how the `EasyCellType` package works. The data set used throughout the example is freely available from 10X Genomics.

### Installation

The package can be installed using `BiocManager` with the following commands:

```{r setup, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("EasyCellType")
```

Alternatively, the package can be installed from GitHub using `devtools`:

```{r, results=FALSE, warning=FALSE, message=FALSE}
library(devtools)
install_github("rx-li/EasyCellType")
```

After installation, the package can be loaded with

```{r, results=FALSE, warning=FALSE, message=FALSE}
library(EasyCellType)
```

## 2. Example workflow

We use the Peripheral Blood Mononuclear Cells (PBMC) data freely available from 10X Genomics. The data can be downloaded from https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz. Once downloaded, the data can be read with the `Read10X` function.
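
As a minimal sketch of that download-and-read step, the block below wraps it in a single helper. The helper name `read_pbmc3k` is hypothetical and not part of `EasyCellType`, and the `filtered_gene_bc_matrices/hg19` subdirectory is an assumption based on the layout used in the Seurat PBMC tutorial; adjust the path if your extracted archive differs.

```r
# URL of the PBMC 3k archive, as given above
pbmc_url <- "https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz"

# Hypothetical helper: download the archive, unpack it, and read the counts
read_pbmc3k <- function(dest_dir = tempdir()) {
  tarball <- file.path(dest_dir, basename(pbmc_url))
  download.file(pbmc_url, destfile = tarball, mode = "wb")
  untar(tarball, exdir = dest_dir)
  # Assumed layout after extraction: filtered_gene_bc_matrices/hg19/
  Seurat::Read10X(data.dir = file.path(dest_dir, "filtered_gene_bc_matrices", "hg19"))
}

# Usage (requires network access and the Seurat package):
# pbmc_data <- read_pbmc3k()
```
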
We have included the data in our package, which can be loaded with

```{r, results=FALSE, warning=FALSE, message=FALSE}
data(pbmc_data)
```

We followed the standard workflow provided by the `Seurat` package [@seurat] to process the PBMC data set. Detailed technical explanations can be found at https://satijalab.org/seurat/articles/pbmc3k_tutorial.html.

```{r, results=FALSE, warning=FALSE, message=FALSE}
library(Seurat)
# Initialize the Seurat object
pbmc <- CreateSeuratObject(counts = pbmc_data, project = "pbmc3k",
                           min.cells = 3, min.features = 200)
# QC: compute the mitochondrial percentage and filter cells
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc,
               subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
# Normalize the data
pbmc <- NormalizeData(pbmc)
# Identify highly variable features
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Scale the data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
# Perform linear dimensional reduction
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Cluster the cells
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
# Find differentially expressed features
markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25,
                          logfc.threshold = 0.25)
```

We now have the differentially expressed markers for each cluster. Next, we convert the gene symbols to Entrez IDs.

```{r, results=FALSE, warning=FALSE, message=FALSE}
library(org.Hs.eg.db)
library(AnnotationDbi)
markers$entrezid <- mapIds(org.Hs.eg.db,
                           keys = markers$gene, # column containing gene symbols
                           column = "ENTREZID",
                           keytype = "SYMBOL",
                           multiVals = "first")
# Drop markers whose symbol could not be mapped to an Entrez ID
markers <- na.omit(markers)
```

For mouse data, replace the package `org.Hs.eg.db` with `org.Mm.eg.db` and repeat the analysis above.

The input for the `EasyCellType` package should be a data frame containing Entrez IDs, clusters, and expression scores.
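
As a minimal sketch of that layout, the toy data frame below holds the three pieces of information in one table. The column names and the scores are made up for illustration; the Entrez IDs are meant to correspond to the human genes named in the comment.

```r
# Toy marker table: Entrez ID, cluster label, expression score.
# Scores are illustrative values, not results from the PBMC data.
toy_input <- data.frame(
  gene    = c("931", "930", "916", "929"),  # MS4A1, CD19, CD3E, CD14
  cluster = c("1", "1", "0", "2"),
  score   = c(2.4, 0.9, 1.8, 3.1),
  stringsAsFactors = FALSE
)

# Inspect the structure: three columns, one row per marker gene
str(toy_input)
```
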
The columns must appear in that order: Entrez ID, cluster, then expression score. Within each cluster, the genes should be sorted by expression score.

```{r, results=FALSE, warning=FALSE, message=FALSE}
library(dplyr)
markers_sort <- data.frame(gene = markers$entrezid, cluster = markers$cluster,
                           score = markers$avg_log2FC) %>%
  group_by(cluster) %>%
  # Rank genes within each cluster, breaking ties at random
  mutate(rank = rank(score, ties.method = "random")) %>%
  arrange(desc(rank))
input.d <- as.data.frame(markers_sort[, 1:3])
```

We have included the processed data in the package. It can be loaded with

```{r, results=FALSE, warning=FALSE, message=FALSE}
data("gene_pbmc")
input.d <- gene_pbmc
```

Now we can call the `easyct` function to run the annotation analysis.

```{r, results=FALSE, warning=FALSE, message=FALSE}
annot.GSEA <- easyct(input.d, db="cellmarker", species="Human",
                    tissue=c("Blood", "Peripheral blood", "Blood vessel",
                      "Umbilical cord blood", "Venous blood"), p_cut=0.3,
                    test="GSEA")
```

Here we used the GSEA approach for the annotation. The package uses the `GSEA` function from the `clusterProfiler` package [@clusterprofiler] to conduct the enrichment analysis. You can replace 'GSEA' with 'fisher' if you would like to use Fisher's exact test instead. The candidate tissues for each database can be listed using `data(cellmarker_tissue)`, `data(clustermole_tissue)` and `data(panglao_tissue)`.

A dot plot showing the overall annotation results can be created with

```{r, results=FALSE, warning=FALSE, message=FALSE}
plot_dot(test="GSEA", annot.GSEA)
```

A bar plot can be created with

```{r, results=FALSE, warning=FALSE, message=FALSE, fig.show='hide'}
plot_bar(test="GSEA", annot.GSEA)
```

```{r}
sessionInfo()
```

## 3. Reference