--- title: "NeuCA Package User's Guide" author: - name: Ziyi Li affiliation: Department of Biostatistics, The University of Texas MD Anderson Cancer Center - name: Hao Feng affiliation: Department of Population and Quantitative Health Sciences, Case Western Reserve University email: hxf155@case.edu package: NeuCA output: BiocStyle::html_document abstract: | NEUral-network based Cell Annotation, `NeuCA`, is a tool for cell type annotation using single-cell RNA-seq data. It is a supervised cell label assignment method that uses existing scRNA-seq data with known labels to train a neural network-based classifier, and then predict cell labels in single-cell RNA-seq data of interest. vignette: | %\VignetteIndexEntry{NeuCA Package User's Guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction The fast advancing single cell RNA sequencing (scRNA-seq) technology enables transcriptome study in heterogeneous tissues at a single cell level. The initial important step of analyzing scRNA-seq data is to accurately annotate cell labels. We present a neural-network based cell annotation method `NeuCA`. When closely correlated cell types exist, `NeuCA` uses the cell type tree information through a hierarchical structure of neural networks to improve annotation accuracy. Feature selection is performed in hierarchical structure to further improve classification accuracy. When cell type correlations are not high, a feed-forward neural network is adopted. `NeuCA` depends on the following packages: * `r CRANpkg("keras")`, for neural-network interface in _R_, * `r Biocpkg("limma")`, for linear model framework and testing markers, * `r Biocpkg("SingleCellExperiment")`, for data organization formatting, * `r CRANpkg("e1071")`, for probability and predictive functions. # Preparing NeuCA input files: `SingleCellExperiment` class The scRNA-seq data input for `NeuCA` must be objects of the _Bioconductor_ `r Biocpkg("SingleCellExperiment")`. You may need to read corresponding vignettes on how to create a SingleCellExperiment from your own data. An example is provided here to show how to do that, but please note this is not a comprehensive guidance for `r Biocpkg("SingleCellExperiment")`. **Step 1**: Load in example scRNA-seq data. We are using two example datasets here: `Baron_scRNA` and `Seg_scRNA`. `Baron_scRNA` is a droplet(inDrop)-based, single-cell RNA-seq data generated from pancrease ([Baron et al.](https://doi.org/10.1016/j.cels.2016.08.011)). Around 10,000 human and 2,000 mouse pancreatic cells from four cadaveric donors and two strains of mice were sequenced. `Seg_scRNA` is a Smart-Seq2 based, single-cell RNA-seq dataset ([Segerstolpe et al.](https://doi.org/10.1016/j.cmet.2016.08.020)). It has thousands of human islet cells from healthy and type-2 diabetic donors. A total of 3,386 cells were collected, with around 350 cells from each donor. Here, subsets of these two datasets (with cell type labels for each cell) were included as examples. ```{r, eval = TRUE, message = FALSE} library(NeuCA) data("Baron_scRNA") data("Seg_scRNA") ``` **Step 2a**: Prepare training data as a SingleCellExperiment object. ```{r, eval = TRUE, message = FALSE} Baron_anno = data.frame(Baron_true_cell_label, row.names = colnames(Baron_counts)) Baron_sce = SingleCellExperiment( assays = list(normcounts = as.matrix(Baron_counts)), colData = Baron_anno ) # use gene names as feature symbols rowData(Baron_sce)$feature_symbol <- rownames(Baron_sce) # remove features with duplicated names Baron_sce <- Baron_sce[!duplicated(rownames(Baron_sce)), ] ``` **Step 2b**: Similarly, prepare testing data as a SingleCellExperiment object. Note the true cell type labels are not necessary (and of course often not available). ```{r, eval = TRUE, message = FALSE} Seg_anno = data.frame(Seg_true_cell_label, row.names = colnames(Seg_counts)) Seg_sce <- SingleCellExperiment( assays = list(normcounts = as.matrix(Seg_counts)), colData = Seg_anno ) # use gene names as feature symbols rowData(Seg_sce)$feature_symbol <- rownames(Seg_sce) # remove features with duplicated names Seg_sce <- Seg_sce[!duplicated(rownames(Seg_sce)), ] ``` # NeuCA training and prediction **Step 3**: with both training and testing data as objects in `SingleCellExperiment` class, now we can train the classifier in `NeuCA` and predict testing dataset's cell types. This process can be achieved with one line of code: ```{r, eval = TRUE, message = FALSE} predicted.label = NeuCA(train = Baron_sce, test = Seg_sce, model.size = "big", verbose = FALSE) #Baron_scRNA is used as the training scRNA-seq dataset #Seg_scRNA is used as the testing scRNA-seq dataset ``` `NeuCA` can detect whether highly-correlated cell types exist in the training dataset, and automatically determine if a general neural-network model will be adopted or a marker-guided hierarchical neural-network will be adopted for classification. [**Tuning parameter**] In neural-network, the numbers of layers and nodes are tunable parameters. Users have the option to determine the complexity of the neural-network used in `NeuCA` by specifying the desired `model.size` argument. Here, "big", "medium" and "small" are 3 possible choices, reflecting large, medium and small number of nodes and layers in neural-network, respectively. The model size details are shown in the following Table 1. From our experience, "big" or "medium" can often produce high accuracy predictions. ```{r, eval = TRUE, message = FALSE, echo=FALSE} library(knitr) library(kableExtra) df <- data.frame(Cat = c("Small", "Medium", "Big"), Layers = c("3", "4", "5"), Nodes = c("64", "64,128", "64,128,256")) kable(df, col.names = c("", "Number of layers", "Number of nodes in hidden layers"), escape = FALSE, caption = "Tuning model sizes in the neural-network classifier training.") %>% kable_styling(latex_options = "striped") ``` # Predicted cell types `predicted.label` is a vector of the same length with the number of cells in the testing dataset, containing all cell's predicted cell type. It can be viewed directly: ```{r, eval = TRUE} head(predicted.label) table(predicted.label) ``` [**Optional**] If you have the true cell type labels for the testing dataset, you may evaluate the predictive performance by a confusion matrix: ```{r, eval = TRUE} table(predicted.label, Seg_true_cell_label) ``` You may also draw a Sankey diagram to visualize the prediction accuracy: ```{r, eval = TRUE, message = FALSE, echo=F} library(networkD3) source = rep(NA, choose(length(unique(Seg_true_cell_label)),2)+length(unique(Seg_true_cell_label))) target = rep(NA, choose(length(unique(Seg_true_cell_label)),2)+length(unique(Seg_true_cell_label))) value = rep(NA, choose(length(unique(Seg_true_cell_label)),2)+length(unique(Seg_true_cell_label))) links = data.frame(source, target, value) cfm = table(predicted.label, Seg_true_cell_label) id = 1 for(i in 1:ncol(cfm)){ for(j in i:nrow(cfm)){ links[id,1] = paste0(colnames(cfm)[i], "_true") links[id,2] = paste0(rownames(cfm)[j], "_pred") links[id,3] = cfm[j,i] id = id + 1 } } nodes <- data.frame( name=c(paste0(colnames(cfm), "_true"), paste0(rownames(cfm), "_pred")) ) links$IDsource <- match(links$source, nodes$name)-1 links$IDtarget <- match(links$target, nodes$name)-1 p <- sankeyNetwork(Links = links, Nodes = nodes, Source = "IDsource", Target = "IDtarget", Value = "value", NodeID = "name", sinksRight=FALSE) p ``` # Session info {.unnumbered} ```{r sessionInfo, echo=FALSE} sessionInfo() ```