---
title: "Performing scClassify using pretrained model"
author:
- name: Yingxin Lin
  affiliation: School of Mathematics and Statistics, The University of Sydney, Australia
date: "`r BiocStyle::doc_date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{pretrainedModel}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)
```

# Introduction

A common application of single-cell RNA sequencing (scRNA-seq) data is to identify discrete cell types. To take advantage of the large collection of well-annotated scRNA-seq datasets, the `scClassify` package implements a set of methods to perform accurate cell type classification based on *ensemble learning* and *sample size calculation*.

This vignette provides an example showing how to use a pretrained scClassify model to predict cell types. A pretrained model is a `scClassifyTrainModel` object returned by `train_scClassify()`. A list of pretrained models can be found at <https://sydneybiox.github.io/scClassify/index.html>.

First, install `BiocManager` if it is not already available, then use `BiocManager::install()` to install the `scClassify` package.

```{r eval = FALSE}
# installation of scClassify
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("scClassify")
```

# Setting up the data

We assume that you have *log-transformed* (size-factor normalized) matrices as query datasets, where each row refers to a gene and each column to a cell. For demonstration purposes, we will take a subset of a single-cell pancreas dataset from one independent study (Wang et al.).

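
The input format described above can be illustrated with a small sketch. The count matrix `counts`, its gene and cell names, and the choice of library-size factors with a `log2(x + 1)` transform are all assumptions made for illustration. This is one common way to obtain a log-transformed, size-factor normalized matrix, and not necessarily how the example data in this vignette were prepared.

```r
# Hypothetical toy count matrix: 4 genes (rows) x 3 cells (columns).
# matrix() fills column by column, so each line below is one cell.
counts <- matrix(
  c(10, 0, 5, 3,
    20, 2, 0, 6,
     5, 1, 1, 0),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Library-size factors, scaled so that they average to 1 across cells.
lib_sizes <- colSums(counts)
size_factors <- lib_sizes / mean(lib_sizes)

# Divide each cell (column) by its size factor, then log-transform
# with a pseudocount of 1 so that zero counts remain zero.
logcounts <- log2(sweep(counts, 2, size_factors, "/") + 1)

# Genes are in rows and cells are in columns, as scClassify expects.
dim(logcounts)
```

A matrix shaped like `logcounts`, with gene names as row names, is the kind of object that would be passed as the query expression matrix.
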
```{r setup}
library(scClassify)
data("scClassify_example")
wang_cellTypes <- scClassify_example$wang_cellTypes
exprsMat_wang_subset <- scClassify_example$exprsMat_wang_subset
# store the query expression matrix in sparse format
exprsMat_wang_subset <- as(exprsMat_wang_subset, "dgCMatrix")
```

Here, we load our pretrained model, which was trained on a subset of the Xin et al. human pancreas dataset as the reference data.

First, let us check basic information relating to our pretrained model.

```{r}
data("trainClassExample_xin")
trainClassExample_xin
```

In this pretrained model, the genes were selected based on differential expression using limma. To check the genes that are available in the pretrained model:

```{r}
features(trainClassExample_xin)
```

We can also visualise the cell type tree of the reference data.

```{r}
plotCellTypeTree(cellTypeTree(trainClassExample_xin))
```

# Running scClassify

Next, we run `predict_scClassify` with our pretrained model `trainRes = trainClassExample_xin` to predict the cell types of our query data matrix `exprsMat_wang_subset`. Here, we use `pearson` and `spearman` as similarity metrics.

```{r}
pred_res <- predict_scClassify(exprsMat_test = exprsMat_wang_subset,
                               trainRes = trainClassExample_xin,
                               cellTypes_test = wang_cellTypes,
                               algorithm = "WKNN",
                               features = c("limma"),
                               similarity = c("pearson", "spearman"),
                               prob_threshold = 0.7,
                               verbose = TRUE)
```

Note that `cellTypes_test` is not a required input. For datasets with unknown labels, users can simply leave it as `cellTypes_test = NULL`.

Prediction results for pearson as the similarity metric:

```{r}
table(pred_res$pearson_WKNN_limma$predRes, wang_cellTypes)
```

Prediction results for spearman as the similarity metric:

```{r}
table(pred_res$spearman_WKNN_limma$predRes, wang_cellTypes)
```

# Session Info

```{r}
sessionInfo()
```