---
title: "`scfind` package vignette"
author: "Vladimir Kiselev"
date: "`r Sys.Date()`"
output:
    BiocStyle::html_document:
        toc: true
vignette: >
  %\VignetteIndexEntry{`scfind` package vignette}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE}
library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
op <- options(gvis.plot.tag='chart')
```

# Introduction

# `SingleCellExperiment` class

`scfind` is built on top of Bioconductor's [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment) class. `scfind` operates on objects of class `SingleCellExperiment` and writes all of its results back to the object.

# `scfind` Input

If you already have a `SingleCellExperiment` object, then proceed to the next chapter.

If you have a matrix or a data frame containing expression data, then you first need to create a `SingleCellExperiment` object containing your data. For illustrative purposes we will use an example expression matrix provided with `scfind`. The dataset (`yan`) represents __FPKM__ gene expression of 90 cells derived from human embryos. The authors ([Yan et al.](http://dx.doi.org/10.1038/nsmb.2660)) defined the developmental stage of every cell in the original publication (`ann` data frame). We will use these stages as cell type labels in the searches later.

```{r , warning=FALSE, message=FALSE}
library(SingleCellExperiment)
library(scfind)

head(ann)
yan[1:3, 1:3]
```

Note that the cell type information has to be stored in the `cell_type1` column of the `colData` slot of the `SingleCellExperiment` object.
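
Before building the object, it can help to confirm that the annotation table really carries a `cell_type1` column and that its rows line up with the columns of the expression matrix. The sketch below is only an illustration in base R: `expr_toy` and `ann_toy` are small made-up stand-ins for `yan` and `ann`, and `check_annotation` is a hypothetical helper that is not part of `scfind`. It assumes the annotation rows are named by cell.

```{r}
# Toy stand-ins for `yan` and `ann` (made-up values, for illustration only)
expr_toy <- matrix(
    c(0.0, 5.2, 1.1,
      3.4, 0.0, 0.0,
      7.8, 2.0, 0.5),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("GeneA", "GeneB", "GeneC"),
                    c("cell1", "cell2", "cell3"))
)
ann_toy <- data.frame(
    cell_type1 = c("zygote", "2cell", "2cell"),
    row.names = c("cell1", "cell2", "cell3"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if `ann` can serve as per-cell metadata for `expr`
check_annotation <- function(expr, ann) {
    # the cell type labels must live in a column called `cell_type1`
    has_labels <- "cell_type1" %in% colnames(ann)
    # one annotation row per cell, in the same order as the matrix columns
    same_cells <- identical(rownames(ann), colnames(expr))
    has_labels && same_cells && !anyNA(ann$cell_type1)
}

check_annotation(expr_toy, ann_toy)
```

The same call on the real data would be `check_annotation(yan, ann)`, provided the helper above has been defined in the session.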

Now let's create a `SingleCellExperiment` object from the `yan` dataset:

```{r}
sce <- SingleCellExperiment(assays = list(normcounts = as.matrix(yan)), colData = ann)
# this is needed to calculate dropout rate for feature selection
# important: normcounts have the same zeros as raw counts (fpkm)
counts(sce) <- normcounts(sce)
logcounts(sce) <- log2(normcounts(sce) + 1)
# use gene names as feature symbols
rowData(sce)$feature_symbol <- rownames(sce)
isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))
# remove features with duplicated names
sce <- sce[!duplicated(rownames(sce)), ]
sce
```

# Cell Type Search

If you have a list of genes that you would like to check against your dataset, i.e. find the cell types that most likely represent your genes (highest expression), `scfind` lets you do that by first building a gene index and then searching the index very quickly:

```{r}
geneIndex <- buildCellTypeIndex(sce)
p_values <- -log10(findCellType(geneIndex, c("SOX6", "SNAI3")))
barplot(p_values, ylab = "-log10(pval)", las = 2)
```

The calculation above shows that a list of genes containing `SOX6` and `SNAI3` is specific for the `zygote` cell type.

# Cell Search

If you are more interested in finding the individual cells in which all the genes from your gene list are expressed, you can build a cell index instead of a cell type index. Use the `buildCellIndex` function to build the index and `findCell` to search it:

```{r}
geneIndex <- buildCellIndex(sce)
res <- findCell(geneIndex, c("SOX6", "SNAI3"))
res$common_exprs_cells
```

Cell search reports the p-values corresponding to cell types as well:

```{r}
barplot(-log10(res$p_values), ylab = "-log10(pval)", las = 2)
```

# sessionInfo()

```{r echo=FALSE}
sessionInfo()
```