---
title: "Comparing totalVI and OSCA book CITE-seq analyses"
author: "OSCA book authors and Vincent J. Carey, stvjc at channing.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Comparing totalVI and OSCA book CITE-seq analyses}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
---

# Overview

This vignette endeavors to put Bioconductor and scvi-tools
together to help understand how different data structures and methods
relevant to CITE-seq analysis contribute to interpretation
of CITE-seq exeperiments.

The scvi-tools tutorial (for version 0.20.0)
analyzes a pair of 10x PBMC CITE-seq experiments (5k and 10k cells).
Chapter 12 of the OSCA book analyzes only the 10k dataset.

# Technical steps to facilitate comparison

The following subsections are essentially "code-only".  We
exhibit steps necessary to assemble a SingleCellExperiment
instance with the subset of the totalVI quantifications
produced for the cells from the "10k" dataset.

## Acquire software
```{r getsoft, message=FALSE}
library(SingleCellExperiment)
library(scater)
library(scviR)
```

## Obtain key data representations

```{r getdat, message=FALSE}
ch12sce = getCh12Sce(clear_cache=TRUE)
ch12sce
```

```{r getdat2, message=FALSE}
fullvi = getTotalVI5k10kAdata()
fullvi
```

## Assemble a SingleCellExperiment with totalVI outputs

### Acquire cell identities and batch labels

```{r basicvecs}
totvi_cellids = rownames(fullvi$obs)
totvi_batch = fullvi$obs$batch
```

### Acquire quantifications and latent space positions

```{r latent}
totvi_latent = fullvi$obsm$get("X_totalVI")
totvi_umap = fullvi$obsm$get("X_umap")
totvi_denoised_rna = fullvi$layers$get("denoised_rna")
totvi_denoised_protein = fullvi$obsm$get("denoised_protein")
totvi_leiden = fullvi$obs$leiden_totalVI
```


### Drop 5k data from all

```{r remove5k}
is5k = which(totvi_batch == "PBMC5k")
totvi_cellids = totvi_cellids[-is5k]
totvi_latent = totvi_latent[-is5k,]
totvi_umap = totvi_umap[-is5k,]
totvi_denoised_rna = totvi_denoised_rna[-is5k,]
totvi_denoised_protein = totvi_denoised_protein[-is5k,]
totvi_leiden = totvi_leiden[-is5k]
```

### Label the rows of components

```{r rown}
rownames(totvi_latent) = totvi_cellids
rownames(totvi_umap) = totvi_cellids
rownames(totvi_denoised_rna) = totvi_cellids
rownames(totvi_denoised_protein) = totvi_cellids
names(totvi_leiden) = totvi_cellids
```

### Find common cell ids

In this section we reduce the cell collections
to cells common to the chapter 12 and totalVI
datasets.

```{r findcomm}
comm = intersect(totvi_cellids, ch12sce$Barcode)
```

### Build the totalVI SingleCellExperiment

```{r dosce}
# select and order
totvi_latent = totvi_latent[comm,]
totvi_umap = totvi_umap[comm,]
totvi_denoised_rna = totvi_denoised_rna[comm,]
totvi_denoised_protein = totvi_denoised_protein[comm,]
totvi_leiden = totvi_leiden[comm]
 
# organize the totalVI into SCE with altExp

totsce = SingleCellExperiment(SimpleList(logcounts=t(totvi_denoised_rna))) # FALSE name
rowData(totsce) = S4Vectors::DataFrame(fullvi$var)
rownames(totsce) = rownames(fullvi$var)
rowData(totsce)$Symbol = rownames(totsce)
nn = SingleCellExperiment(SimpleList(logcounts=t(totvi_denoised_protein))) # FALSE name
reducedDims(nn) = list(UMAP=totvi_umap)
altExp(totsce) = nn
altExpNames(totsce) = "denoised_protein"
totsce$leiden = totvi_leiden
altExp(totsce)$leiden = totvi_leiden
altExp(totsce)$ch12.clusters = altExp(ch12sce[,comm])$label

# add average ADT abundance to metadata, for adt_profiles

tot.se.averaged <- sumCountsAcrossCells(altExp(totsce), altExp(totsce)$leiden,
    exprs_values="logcounts", average=TRUE)
rownames(tot.se.averaged) = gsub("_TotalSeqB", "", rownames(tot.se.averaged))
metadata(totsce)$se.averaged = tot.se.averaged
```

### Reduce the chapter 12 dataset to the cells held in common

```{r trim12}
colnames(ch12sce) = ch12sce$Barcode
ch12sce_matched = ch12sce[, comm]
```

# Key outputs of the chapter 12 analysis

## Clustering and projection based on the ADT quantifications

The TSNE projection of the normalized ADT quantifications and
the [walktrap](https://arxiv.org/abs/physics/0512106) cluster assignments
are produced for the cells common to the two approaches.

```{r lkkey}
plotTSNE(altExp(ch12sce_matched), color_by="label", text_by="label")
```
 
## Cluster profiles based on averaging ADT quantities within clusters

This heatmap uses precomputed cluster averages that are
lodged in the metadata element of the SingleCellExperiment.
Colors represent the log2-fold change from the grand average across all clusters.
```{r lkadt}
adtProfiles(ch12sce_matched)
```

## Marker expression patterns in mRNA-based sub-clusters of ADT-based clusters

We enhance the annotation of the list of subclusters retrieved
using `getCh12AllSce` and then drill into mRNA-based
subclusters of ADT-based cluster 3 to compare expression levels
of three genes.
```{r dosubcl}
ch12_allsce = getCh12AllSce() 
ch12_allsce = lapply(ch12_allsce, function(x) {
   colnames(x)= x$Barcode; 
   cn = colnames(x); 
   x = x[,intersect(cn,comm)]; x})
of.interest <- "3"
markers <- c("GZMH", "IL7R", "KLRB1")
plotExpression(ch12_allsce[[of.interest]], x="subcluster",
    features=markers, swap_rownames="Symbol", ncol=3)
```

There is a suggestion of a boolean basis for subcluster
identity, depending on low or high expression of the selected genes.

## Graduated relationships between mRNA and surface protein expression

Following the exploration in OSCA chapter 12, cluster 3 is analyzed
for a regression association between expression
measures of three genes and the ADT-based abundance of CD127.
```{r lkgra}
plotExpression(ch12_allsce[["3"]], x="CD127", show_smooth=TRUE, show_se=FALSE, 
    features=c("IL7R", "TPT1", "KLRB1", "GZMH"), swap_rownames="Symbol")
```

# Analogs to the chapter 12 findings, based on totalVI quantifications

## The Leiden clustering in UMAP projection

```{r lktotumap}
plotUMAP(altExp(totsce), color_by="leiden", text_by="leiden")
```

## Cluster profiles based on average ADT abundance, using denoised protein quantifications

The approach to profiling the ADT abundances used in the totalVI
tutorial employs scaling to (0,1).

```{r domattot}
tav = S4Vectors::metadata(totsce)$se.averaged
ata = assay(tav)
uscale = function(x) (x-min(x))/max(x)
scmat = t(apply(ata,1,uscale))
pheatmap::pheatmap(scmat, cluster_rows=FALSE)
```

## Concordance in ADT-based clustering between OSCA and totalVI

A quick view of the concordance of the two clustering outcomes
is

```{r lkconc}
atot = altExp(totsce)
ach12 = altExp(ch12sce_matched)
tt = table(ch12=ach12$label, VI=atot$leiden)
pheatmap::pheatmap(log(tt+1))
```

With this we can pick out some
clusters with many cells in common:

```{r lkcomm}
lit = tt[c("9", "12", "5", "3"), c("0", "1", "2", "8", "6", "5")]
rownames(lit) = sQuote(rownames(lit))
colnames(lit) = sQuote(colnames(lit))
lit
```

## Subcluster assessment for OSCA cluster "3"

Let's examine the distributions of marker mRNAs in
the Leiden totalVI clusters  corresponding to OSCA cluster "3":

```{r lksubsss}
tsub = totsce[,which(altExp(totsce)$leiden %in% c("5", "6", "8"))]
markers <- c("GZMH", "IL7R", "KLRB1")
altExp(tsub)$leiden = factor(altExp(tsub)$leiden) # squelch unused levels
tsub$leiden = factor(tsub$leiden) # squelch unused levels
plotExpression(tsub, x="leiden",
    features=markers, swap_rownames="Symbol", ncol=3)
```

Note that the y axis label is incorrect -- we are plotting the denoised
expression values from totalVI.

The display seems roughly consistent with the "boolean basis" observed above with
the mRNA-based subclustering.  

## Graduated relationships between ADT and mRNA abundance as measured by totalVI

The same approach is taken as above.  We don't have TPT1 in the 4000 genes
retained in the totalVI exercise.

```{r lkgradvi}
rn = rownames(altExp(tsub))
rn = gsub("_TotalSeqB", "", rn)
rownames(altExp(tsub)) = rn
rowData(altExp(tsub)) = DataFrame(Symbol=rn)
plotExpression(tsub, x="CD127", show_smooth=TRUE, show_se=FALSE,
   features=c("IL7R", "KLRB1", "GZMH"), swap_rownames="Symbol")
```

# Conclusions

We have shown how rudimentary programming and data
organization can be used to make
outputs of OSCA and totalVI methods amenable to comparison
in the Bioconductor framework.

The scviR package includes a shiny app in the function
`explore_subcl` that should be expanded to facilitate
exploration of totalVI subclusters.  Much more work
remains to be done in the area of exploring

- additional approaches to integrative interpretation of
ADT and mRNA abundance patterns, such as intersection and
concatenation methods in the feature selection materials
in OSCA ch. 12

- effects of tuning and architecture details for the totalVI VAE

# Session information

```{r lksess}
sessionInfo()
```