--- title: "scvi-tools CITE-seq tutorial in R, using serialized tutorial components" author: "Follows scvi-tools doc, Gayoso, Steier et al. DOI 10.1038/s41592-020-01050-x" date: "`r format(Sys.time(), '%B %d, %Y')`" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{scvi-tools CITE-seq tutorial in R, using serialized tutorial components} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: highlight: pygments number_sections: yes theme: united toc: yes --- # A CITE-seq example The purpose of this vignette is to illustrate the feasibility of reflecting the material in the [online tutorial](https://colab.research.google.com/github/scverse/scvi-tutorials/blob/0.20.0/totalVI.ipynb) for scvi-tools 0.20.0 in Bioconductor. The authors of the tutorial describe it as producing > a joint latent representation of cells, denoised data for both protein and RNA Additional tasks include > integrat[ing] datasets, and comput[ing] differential expression of RNA and protein The integration concerns the simultaneous analysis of two datasets from 10x genomcs. In this vignette we carry out the bulk of the tutorial activities using R and Bioconductor, reaching to scvi-tools python code via basilisk. ## Retrieval of PBMC data The following chunk will acquire (and cache, using BiocFileCache) a preprocessed version of the 10k and 5k combined CITE-seq experiments from the scvi-tools data repository. ```{r doinit,message=FALSE} library(scviR) library(ggplot2) library(reshape2) adref = getCiteseq5k10kPbmcs() adref ``` ## Retrieval of fitted VAE The totalVI variational autoencoder was fit with these data. A fitted version is retrieved and cached using ```{r dogetv} vae = getCiteseqTutvae() ``` This is an instance of an S3 class, `python.builtin.object`, defined in the reticulate package. ```{r lkclv} class(vae) ``` Some fields of interest that are directly available from the instance include an indicator of the trained state, the general parameters used to train, and the "anndata" (annotated data) object that includes the input counts and various results of preprocessing: ```{r some} vae$is_trained cat(vae$`_model_summary_string`) vae$adata ``` The structure of the VAE is reported using ```{r lkmod, eval=FALSE} vae$module ``` This is quite voluminous and is provided in an appendix. ## Trace of (negative) ELBO values The negative "evidence lower bound" (ELBO) is a criterion that is minimized in order to produce a fitted autoencoder. The scvi-tools totalVAE elgorithm creates a nonlinear projection of the inputs to a 20-dimensional latent space, and a decoder that transforms object positions in the latent space to positions in the space of observations that are close to the original input positions. The negative ELBO values are computed for samples from the training data and for "left out" validation samples. Details on the validation assessment would seem to be part of pytorch lightning. More investigation of scvi-tools code and doc are in order. ```{r lkelb} h = vae$history npts = nrow(h$elbo_train) plot(seq_len(npts), as.numeric(h$elbo_train[[1]]), ylim=c(1200,1400), type="l", col="blue", main="Negative ELBO over training epochs", ylab="neg. ELBO", xlab="epoch") graphics::legend(300, 1360, lty=1, col=c("blue", "orange"), legend=c("training", "validation")) graphics::lines(seq_len(npts), as.numeric(h$elbo_validation[[1]]), type="l", col="orange") ``` ## Normalized quantities On a CPU, the following can take a long time. ```{r getn, eval=FALSE} NE = vae$get_normalized_expression(n_samples=25L, return_mean=TRUE, transform_batch=c("PBMC10k", "PBMC5k") ) ``` We provide the totalVI-based denoised quantities in ```{r getdenoise} denoised = getTotalVINormalized5k10k() vapply(denoised, dim, integer(2)) ``` Note that these have features as columns, samples (cells) as rows. ```{r lkn} utils::head(colnames(denoised$rna_nmlzd)) utils::head(colnames(denoised$prot_nmlzd)) ``` ## UMAP projection of Leiden clustering in the totalVI latent space We have stored a fully loaded anndata instance for retrieval to inspect the latent space and clustering produced by the tutorial notebook procedure. The images produced here do not agree exactly with what I see in the colab pages for 0.20.0. The process was run in Jetstream2, not in colab. ```{r getproj, fig.height=6} full = getTotalVI5k10kAdata() # class distribution cllabs = full$obs$leiden_totalVI blabs = full$obs$batch table(cllabs) um = full$obsm$get("X_umap") dd = data.frame(umap1=um[,1], umap2=um[,2], clust=factor(cllabs), batch=blabs) ggplot(dd, aes(x=umap1, y=umap2, colour=clust)) + geom_point(size=.05) + guides(color = guide_legend(override.aes = list(size = 4))) ``` Effectiveness at accommodating the two-batch design is suggested by the mixed representation of the batches in all the Leiden clusters. ```{r getba} ggplot(dd, aes(x=umap1, y=umap2, colour=factor(batch))) + geom_point(size=.05) ``` ## Protein abundances in projected clusters We focus on four of the ADT. Points (cells) in the UMAP projection given above are colored by the estimated abundance of the proteins quantified via (normalized) ADT abundance. Complementary expression of CD4 and CD8a is suggested by the configurations in the middle two panels. ```{r lknn,fig.width=8} pro4 = denoised$prot_nmlzd[,1:4] names(pro4) = gsub("_.*", "", names(pro4)) wprot = cbind(dd, pro4) mm = melt(wprot, id.vars=c("clust", "batch", "umap1", "umap2")) utils::head(mm,3) ggplot(mm, aes(x=umap1, y=umap2, colour=log1p(value))) + geom_point(size=.1) + facet_grid(.~variable) ``` # Conclusions We have shown that all the results of the totalVI application in the tutorial are readily accessible with utilities in scviR. Additional work on details of differential expression are present in the tutorial and can be explored by the interested reader/user. # Appendix: The VAE module The structure of the VAE is reported using ```{r lkmod2, eval=TRUE} vae$module ``` # Session information ```{r lksess} sessionInfo() ```