---
title: "`scfind` package vignette"
author: "Vladimir Kiselev"
date: "`r Sys.Date()`"
output:
    BiocStyle::html_document:
        toc: true
vignette: >
  %\VignetteIndexEntry{`scfind` package vignette}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE}
library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
op <- options(gvis.plot.tag='chart')
```

# Introduction

# `SingleCellExperiment` class

`scfind` is built on top of the Bioconductor’s [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment) class. `scfind` operates on objects of class `SingleCellExperiment` and writes all of its results back to the the object.

# `scfind` Input

If you already have an `SCESet` object, then proceed to the next chapter.

If you have a matrix or a data frame containing expression data then you first need to create an `SingleCellExperiment` object containing your data. For illustrative purposes we will use an example expression matrix provided with `scfind`. The dataset (`yan`) represents __FPKM__ gene expression of 90 cells derived from human embryo. The authors ([Yan et al.](http://dx.doi.org/10.1038/nsmb.2660)) have defined developmental stages of all cells in the original publication (`ann` data frame). We will use these stages in projection later.

```{r , warning=FALSE, message=FALSE}
library(SingleCellExperiment)
library(scfind)

head(ann)
yan[1:3, 1:3]
```

Note that the cell type information has to be stored in the `cell_type1` column of the `rowData` slot of the `SingleCellExperiment` object.

Now let's create a `SingleCellExperiment` object of the `yan` dataset:
```{r}
sce <- SingleCellExperiment(assays = list(normcounts = as.matrix(yan)), colData = ann)
# this is needed to calculate dropout rate for feature selection
# important: normcounts have the same zeros as raw counts (fpkm)
counts(sce) <- normcounts(sce)
logcounts(sce) <- log2(normcounts(sce) + 1)
# use gene names as feature symbols
rowData(sce)$feature_symbol <- rownames(sce)
isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))
# remove features with duplicated names
sce <- sce[!duplicated(rownames(sce)), ]
sce
```

# Cell Type Search

If one has a list of genes that you would like to check against you dataset, i.e.
find the cell types that most likely represent your genes (highest expression), then `scfind` allows one to do that by first creating a gene index and then very quickly searching the index:
```{r}
geneIndex <- buildCellTypeIndex(sce)
p_values <- -log10(findCellType(geneIndex, c("SOX6", "SNAI3")))
barplot(p_values, ylab = "-log10(pval)", las = 2)
```

The calculation above shows that a list of genes containing `SOX6` and `SNAI3` is specific for the `zygote` cell type.

# Cell Search

If one is more interested in finding out in which cells all the genes from your
gene list are expressed than you can build a cell index instead of a
cell type index. `buildCellIndex` function should be used for building the index
and `findCell` for searching the index:
```{r}
geneIndex <- buildCellIndex(sce)
res <- findCell(geneIndex, c("SOX6", "SNAI3"))
res$common_exprs_cells
```

Cell search reports the p-values corresponding to cell types as well:
```{r}
barplot(-log10(res$p_values), ylab = "-log10(pval)", las = 2)
```

# sessionInfo()

```{r echo=FALSE}
sessionInfo()
```