---
output: html_document
bibliography: ref.bib
---

# (PART) Basic usage {-}

# Introduction

## Motivation

The Bioconductor package *[SingleR](https://bioconductor.org/packages/3.20/SingleR)* implements an automatic annotation method for single-cell RNA sequencing (scRNA-seq) data [@aran2019reference].
Given a reference dataset of samples (single-cell or bulk) with known labels, it assigns those labels to new cells from a test dataset based on similarities in their expression profiles.
This provides a convenient way of transferring biological knowledge across datasets, allowing users to leverage the domain expertise implicit in the creation of each reference.

The most common application of *[SingleR](https://bioconductor.org/packages/3.20/SingleR)* involves predicting cell type (or "state", or "kind") in a new dataset, a process that is facilitated by the availability of curated references and compatibility with user-supplied datasets.
In this manner, the burden of manually interpreting clusters and defining marker genes only has to be done once, for the reference dataset, and this knowledge can be propagated to new datasets in an automated manner.

## Method description

*[SingleR](https://bioconductor.org/packages/3.20/SingleR)* can be considered a robust variant of nearest-neighbors classification, with some tweaks to improve resolution for closely related labels.
For each test cell:

1. We compute the Spearman correlation between its expression profile and that of each reference sample.
   The use of Spearman's correlation provides a measure of robustness to batch effects across datasets.
   The calculation only uses the union of marker genes identified by pairwise comparisons between labels in the reference data, so as to improve resolution of separation between labels.
2. We define the per-label score as a fixed quantile (by default, 0.8) of the correlations across all samples with that label.
   This accounts for differences in the number of reference samples for each label, which interferes with simpler flavors of nearest neighbor classification; it also avoids penalizing classifications to heterogeneous labels by only requiring a good match to a minority of samples.
3. We repeat the score calculation for all labels in the reference dataset.
   The label with the highest score is used as *[SingleR](https://bioconductor.org/packages/3.20/SingleR)*'s prediction for this cell.
4. We optionally perform a fine-tuning step to improve resolution between closely related labels.
   The reference dataset is subsetted to only include labels with scores close to the maximum; scores are recomputed using only marker genes for the subset of labels, thus focusing on the most relevant features; and this process is iterated until only one label remains.

## Quick start

We will demonstrate the use of `SingleR()` on a well-known 10X Genomics dataset [@zheng2017massively] with the Human Primary Cell Atlas dataset [@hpcaRef] as the reference.

``` r
# Loading test data.
library(TENxPBMCData)
new.data <- TENxPBMCData("pbmc4k")

# Loading reference data with Ensembl annotations.
library(celldex)
ref.data <- HumanPrimaryCellAtlasData(ensembl=TRUE)

# Performing predictions.
library(SingleR)
predictions <- SingleR(test=new.data, assay.type.test=1,
    ref=ref.data, labels=ref.data$label.main)

table(predictions$labels)
```

```
##
##           B_cell              CMP               DC              GMP
##              606                8                1                2
##         Monocyte          NK_cell        Platelets Pre-B_cell_CD34-
##             1164              217                3               46
##          T_cells
##             2293
```

And that's it, really.

## Where to get help

Questions on the general use of *[SingleR](https://bioconductor.org/packages/3.20/SingleR)* should be posted to the [Bioconductor support site](https://support.bioconductor.org).
Please send requests for general assistance and advice to the support site rather than to the individual authors.
Bug reports or feature requests should be made to the [GitHub repository](https://github.com/LTLA/SingleR); well-considered suggestions for improvements are always welcome.

## Session information {-}