---
title: "scds:**s**ingle **c**ell **d**oublet **s**coring: In-silico doublet annotation for single cell RNA sequencing data"
# shorttitle: "Introduction to scds"
# author:
# - name: Abha Bais
#   affiliation: &id Department of Developmental Biology, University of Pittsburgh School of Medicine
# - name: Dennis Kostka
#   affiliation: *id Department of Computational and Systems Biology, Center for Evolutionary Biology and Medicine, University of Pittsburgh School of Medicine
# date: "`r Sys.Date()`"
# package: scds
output:
  BiocStyle::html_document:
   toc_float: false
  BiocStyle::pdf_document: default
vignette: >
  %\VignetteIndexEntry{Introduction to the scds package}
  %\VignetteEngine{knitr::rmarkdown}
  %VignetteEncoding{UTF-8}
---

# Introduction
In this vignette, we provide an overview of the basic functionality and usage of the `scds` package, which interfaces with ``SingleCellExperiment`` objects.


# Installation

Install the `scds` package using Bioconductor:

```{r, eval = FALSE, echo = TRUE, message = FALSE, warning = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scds", version = "3.9")
```

Or from github:

```{r, eval = FALSE, echo = TRUE, message = FALSE, warning = FALSE}
library(devtools)
devtools::install_github('kostkalab/scds')

```

# Quick start


`scds` takes as input a `SingleCellExperiment` object (see here `r BiocStyle::Biocpkg("SingleCellExperiment")`), where raw counts are stored in a ```counts``` assay, i.e. ```assay(sce,"counts")```. An example dataset created by sub-sampling the cell-hashing cell-lines data set (see  https://satijalab.org/seurat/hashing_vignette.html) is included with the package and accessible via ```data("sce")```.Note that `scds` is designed to workd with larger datasets, but for the purposes of this vignette, we work with a smaller example dataset. We apply ```scds``` to this data and compare/visualize reasults:

## Example data set

Get example data set provided with the package.

```{r prelims, eval = TRUE, echo = TRUE, message = FALSE, warning = FALSE}
library(scds)
library(scater)
library(rsvd)
library(Rtsne)
library(cowplot)
set.seed(30519)
data("sce_chcl")
sce = sce_chcl #- less typing
dim(sce)

```

We see it contains 2,000 genes and 2,000 cells, 216 of which are identified as doublets:

```{r doublets, eval = TRUE, echo = TRUE, message = FALSE, warning = FALSE}
table(sce$hto_classification_global)
```

We can visualize cells/doublets after projecting into two dimensions:

```{r proj, eval = TRUE, echo = TRUE, message = FALSE, warning = FALSE}
logcounts(sce) = log1p(counts(sce))
vrs            = apply(logcounts(sce),1,var)
pc             = rpca(t(logcounts(sce)[order(vrs,decreasing=TRUE)[1:100],]))
ts             = Rtsne(pc$x[,1:10],verb=FALSE)

reducedDim(sce,"tsne") = ts$Y; rm(ts,vrs,pc)
plotReducedDim(sce,"tsne",color_by="hto_classification_global")
```

## Computational doublet annotation

We now run the ```scds``` doublet annotation approaches. Briefly, we identify doublets in two complementary ways: `cxds` is based on co-expression of gene pairs and works with absence/presence calls only, while `bcds` uses the full count information and a binary classification approach using artificially generated doublets. `cxds_bcds_hybrid` combines both approaches, for more details please consult [(this manuscript)](https://doi.org/10.1101/564021). Each of the three methods returns a doublet score, with higher scores indicating more "doublet-like" barcodes.

```{r scds, eval = TRUE, echo = TRUE, message = FALSE, warning = FALSE}
#- Annotate doublet using co-expression based doublet scoring:
sce = cxds(sce,retRes = TRUE)
sce = bcds(sce,retRes = TRUE,verb=TRUE)
sce = cxds_bcds_hybrid(sce)
par(mfcol=c(1,3))
boxplot(sce$cxds_score   ~ sce$doublet_true_labels, main="cxds")
boxplot(sce$bcds_score   ~ sce$doublet_true_labels, main="bcds")
boxplot(sce$hybrid_score ~ sce$doublet_true_labels, main="hybrid")

```

## Visualizing gene pairs


For ```cxds``` we can identify and visualize gene pairs driving doublet annoataions, with the expectation that the two genes in a pair might mark different types of cells ([see manuscript](https://doi.org/10.1101/564021)). In the following we look at the top three pairs, each gene pair is a row in the plot below:

```{r pairplot, eval = FALSE, echo = TRUE, message = FALSE, warning = FALSE}
scds =
top3 = metadata(sce)$cxds$topPairs[1:3,]
rs   = rownames(sce)
hb   = rowData(sce)$cxds_hvg_bool
ho   = rowData(sce)$cxds_hvg_ordr[hb]
hgs  = rs[ho]

l1 =  ggdraw() + draw_text("Pair 1", x = 0.5, y = 0.5)
p1 = plotReducedDim(sce,"tsne",color_by=hgs[top3[1,1]])
p2 = plotReducedDim(sce,"tsne",color_by=hgs[top3[1,2]])

l2 =  ggdraw() + draw_text("Pair 2", x = 0.5, y = 0.5)
p3 = plotReducedDim(sce,"tsne",color_by=hgs[top3[2,1]])
p4 = plotReducedDim(sce,"tsne",color_by=hgs[top3[2,2]])

l3 = ggdraw() + draw_text("Pair 3", x = 0.5, y = 0.5)
p5 = plotReducedDim(sce,"tsne",color_by=hgs[top3[3,1]])
p6 = plotReducedDim(sce,"tsne",color_by=hgs[top3[3,2]])

plot_grid(l1,p1,p2,l2,p3,p4,l3,p5,p6,ncol=3, rel_widths = c(1,2,2))
```

# Session Info

```{r sessionInfo}
sessionInfo()

```