---
title: "Decontamination of single cell protein expression data with DecontPro"
author:
- name: Yuan Yin
  affiliation: &id Boston University School of Medicine
- name: Joshua Campbell
  affiliation: *id
  email: camp@bu.edu
date: "`r Sys.Date()`"
output: 
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{decontPro}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


# Introduction

DecontPro assess and decontaminate single-cell protein expression data, such as 
those generated from CITE-seq or Total-seq. The count matrix is decomposed into 
three matrices, the native, the ambient and the background that represent the 
contribution from the true protein expression on cells, the ambient material and
other non-specific background contamination.

# Installation

DecontX Package can be installed from Bioconductor:

```{r install, eval= FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("decontX")
```

Then the package can be loaded in R using the following command:

```{r load, message=FALSE}
library(decontX)
```

To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/decontX.

# Importing data

Here we use an example dataset from `SingleCellMultiModal` package.

```{r message=FALSE}
library(SingleCellMultiModal)
dat <- CITEseq("cord_blood", dry.run = FALSE)
counts <- experiments(dat)$scADT
```

For this tutorial, we sample only 1000 droplets from the dataset to demonstrate the use of functions. When analyzing your dataset, sub-sampling should be done with caution, as `decontPro` approximates contamination profile using the dataset. A biased sampling may introduce bias to the contamination profile approximation.

```{r}
set.seed(42)
sample_id <- sample(dim(counts)[2], 1000, replace = FALSE)
counts_sample <- counts[, sample_id]
```

# Generate cell clusters

`decontPro` requires a vector indicating the cell types of each droplet. Here we use `Seurat` for clustering.

```{r message=FALSE}
library(Seurat)
library(dplyr)
adt_seurat <- CreateSeuratObject(counts_sample, assay = "ADT")
adt_seurat <- NormalizeData(adt_seurat, normalization.method = "CLR", margin = 2) %>%
  ScaleData(assay = "ADT") %>%
  RunPCA(assay = "ADT", features = rownames(adt_seurat), npcs = 10,
  reduction.name = "pca_adt") %>%
  FindNeighbors(dims = 1:10, assay = "ADT", reduction = "pca_adt") %>%
  FindClusters(resolution = 0.5)
```

```{r message=FALSE}
adt_seurat <- RunUMAP(adt_seurat,
                      dims = 1:10,
                      assay = "ADT",
                      reduction = "pca_adt",
                      reduction.name = "adtUMAP",
                      verbose = FALSE)
DimPlot(adt_seurat, reduction = "adtUMAP", label = TRUE)
FeaturePlot(adt_seurat, 
            features = c("CD3", "CD4", "CD8", "CD19", "CD14", "CD16", "CD56"))

clusters <- as.integer(Idents(adt_seurat))
```

# Run DecontPro

You can run `DecontPro` by simply:

``` {r eval=FALSE}
counts <- as.matrix(counts_sample)
out <- deconPro(counts,
                clusters)
```

Priors (`delta_sd` and `background_sd`) may need tuning with the help of plotting the decontaminated results. The two priors encode belief if a more spread-out count should be considered contamination vs. native. We tested the default values on datasets ranging 5k to 10k droplets and 10 to 30 ADTs and the results are reasonable. A more contaminated or a smaller dataset may need larger priors. In this tutorial, since we sampled only 1k droplets from the original 10k droplets, we use slightly larger priors:

```{r message=FALSE}
counts <- as.matrix(counts_sample)
out <- decontPro(counts,
                 clusters,
                 delta_sd = 2e-4,
                 background_sd = 2e-5)
```

The output contains three matrices, and model parameters after inference. 
`decontaminated_counts` represent the true protein expression on cells.

```{r}
decontaminated_counts <- out$decontaminated_counts
```


# Plot Results

Plot ADT density before and after decontamination. For bimodal ADTs, the background peak should be removed. Note CD4 is tri-modal with the intermediate mode corresponding to monocytes. This can be used as a QC metric for decontamination as only the lowest mode should be removed. 

```{r}
plotDensity(counts,
            decontaminated_counts,
            c("CD3", "CD4", "CD8", "CD14", "CD16", "CD19"))
```

We can also visualize the decontamination by each cell cluster.
```{r}
plotBoxByCluster(counts,
                 decontaminated_counts,
                 clusters,
                 c("CD3", "CD4", "CD8", "CD16"))
```

# Session Information

```{r}
sessionInfo()
```