--- title: "Introduction to StatescopeR" author: - name: Mischa Steketee affiliation: - Cancer Centre Amsterdam, Amsterdam UMC email: m.f.b.steketee@amsterdamumc.nl output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" vignette: > %\VignetteIndexEntry{Introduction to StatescopeR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", ## see https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html crop = NULL ) ``` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Bib setup library("RefManageR") ## Write bibliography information bib <- c( R = citation(), BiocStyle = citation("BiocStyle")[1], knitr = citation("knitr")[1], RefManageR = citation("RefManageR")[1], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], StatescopeR = citation("StatescopeR")[1] ) ``` # Introduction StatescopeR is the R version of Statescope, a computational framework designed to discover cell states from cell type-specific gene expression profiles inferred from bulk RNA profiles. StatescopeR works in three steps: 1) Multi-omic single cell RNAseq reference-based deconvolution 2) Refinement of inferred cell type-specific gene expression profiles 3) Cell state discovery In addition to bulk RNA and a single cell RNAseq (scRNAseq) reference, StatescopeR can also integrate prior expectations of cell fractions in the bulk RNA. For example tumor fractions estimated by DNA copy number analysis or immune cell fractions by immunohistochemistry. Below is a manual for running StatescopeR with some pancreas scRNAseq data used for both the reference and pseudobulk which is deconvolved. # Running StatescopeR ## Installation StatescopeR is available through Bioconductor, install it using the following commands in R: ```{r "install", eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("StatescopeR") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` ## Prepare scRNAseq data After the package is installed, we will be loading it and the scRNAseq package, from which we will use the SegerstolpePancreas data set for our example. We will use it for two purposes: 1) Creating a scRNAseq reference, as you would do for your own application. 2) Creating pseudobulk to be deconvolved, which we can conveniently use for evaluation as well. In your own application this would likely be bulk mRNA sequencing. ```{r 'Prepare scRNAseq data'} ## Load StatescopeR & scRNAseq (for example data) suppressMessages(library(StatescopeR)) suppressMessages(library(scRNAseq)) suppressMessages(library(scuttle)) ## Load SegerstolpePancreas data set scRNAseq <- SegerstolpePancreasData() ## remove duplicate genes scRNAseq <- scRNAseq[!duplicated(rownames(scRNAseq)), ] ## Subset to 1 healthy and type 2 diabetes samples scRNAseq <- scRNAseq[, scRNAseq$individual %in% c("H2", "T2D2", "T2D3")] ## remove cells with no cell type label scRNAseq <- scRNAseq[, !is.na(scRNAseq$`cell type`)] ## remove very rare cell types (<100 cells in total data set) celltypes_to_remove <- names(table(scRNAseq$`cell type`)[(table(scRNAseq$`cell type`) < 100)]) scRNAseq <- scRNAseq[, !scRNAseq$`cell type` %in% celltypes_to_remove] ``` ## Prepare StatescopeR input StatescopeR takes the following inputs: 1) scRNAseq reference/signature 2) bulk mRNA to be deconvolved 3) selected genes for deconvolution, by default this uses `r BiocStyle::Githubpkg("theislab/AutoGeneS")` (optional) Prior expectation of fractions (optional) Hyperparameters (i.e. cores for parallelization, nrepfinal for maximum number of optimization iteration etc.) ```{r 'Prepare StatescopeR input'} ## Normalize (cp10k) and logtransform scRNAseq cpm(scRNAseq) <- calculateCPM(scRNAseq) logcounts(scRNAseq) <- log1p(cpm(scRNAseq) / 100) ## Create scRNAseq reference/signature with 5 hvg for quick example signature <- create_signature(scRNAseq, hvg_genes = TRUE, n_hvg_genes = 5L, labels = scRNAseq$`cell type` ) ## select subset of genes for deconvolution (3/5 hvg to make it quick) selected_genes <- select_genes(scRNAseq, 3L, 5L, labels = scRNAseq$`cell type` ) ## Create pseudobulk and normalize to cp10k (logging is done within Statescope) pseudobulk <- aggregateAcrossCells(scRNAseq, ids = scRNAseq$individual) normcounts(pseudobulk) <- calculateCPM(pseudobulk) / 100 pseudobulk <- as(pseudobulk, "SummarizedExperiment") rownames(pseudobulk) <- rownames(scRNAseq) ## (optional) Create prior expectation prior <- gather_true_fractions(scRNAseq, ids = scRNAseq$individual, label_col = "cell type") # Use True sc fractions for this prior[rownames(prior) != "ductal cell", ] <- NA # Keep only ductal cells prior <- t(prior) # Transpose it to nSample x nCelltype ``` ## Run StatescopeR Now that all inputs have been prepared you can deconvolve the (pseudo)bulk mRNA and discover transcriptional states. In short the modules work as follows: BLADE_deconvolution employs Bayesian variational inference, modelling the bulk gene expression level of each gene j in sample i by: \begin{equation} y_{ij} = log(\sum_{t} f_{i}^t exp(x_{ij}^t)) + \epsilon_{ij} \end{equation} with hidden variables $f_{i}^t$ and $x_{ij}^t$ denoting the fraction of cell type t for sample i and the expression of gene j for sample i in cell type j respectively. Refinement further optimizes cell type-specific gene expression profiles, $x_{ij}^t$, by fixing estimated cell fractions and putting more emphasis on capturing inter-sample heterogeneity over staying close to the single cell prior knowledge. StateDiscovery uses convex-NMF to cluster inferred cell type-specific gene expression profiles in an unsupervised manner. These modules are wrappers around the original Python version `r BiocStyle::Githubpkg("tgac-vumc/Statescope")`. ```{r 'Run StatescopeR'} ## Run Deconvolution module Statescope <- BLADE_deconvolution(signature, pseudobulk, selected_genes, prior, cores = 1L, Nrep = 1L) ## Run Refinement module Statescope <- Refinement(Statescope, signature, pseudobulk, cores = 2L) ## Run State Discovery module (pick k=2 for quick example) Statescope <- StateDiscovery(Statescope, k = 2L, Ncores = 2L) ``` ## Evaluate Statescope As we used pseudobulk from scRNAseq in this manual we are able to check if the output of StatescopeR is correct. ```{r 'Evaluate Statescope'} ## Count fractions of cells per sample in the pseudobulk true_fractions <- gather_true_fractions(scRNAseq, ids = scRNAseq$individual, label_col = "cell type") ## Plot correlation and RMSE with true fractions per celltype fraction_eval(Statescope, true_fractions) ``` ## Downstream analysis/visualizations After you have run Statescope you can do all kinds of analyses, here we show some visualizations to give you an idea, but feel free to try whatever comes to your mind. ```{r 'Downstream analysis/visualizations'} ## Show predicted fractions metadata(Statescope)$fractions ## Show predicted ductal cell type specific gene expression profiles assay(metadata(Statescope)$ct_specific_gep$`ductal cell`, "weighted_gep") ## Show ductal cell state scores per sample metadata(Statescope)$statescores$`ductal cell` ## Show ductal cell state loadings metadata(Statescope)$stateloadings$`ductal cell` ## Plot heatmap of fractions fraction_heatmap(Statescope) ## Plot barplot of top loadings barplot_stateloadings(Statescope) ``` ## Group expectations If you have an expected fraction for a group of cell types instead of for one specific cell (i.e. Lymphoid cells instead of T/B cells), StatescopeR also allows you to give a group prior. Here we show an example where we know the expected fraction of the pancreatic non-duct cells (acinar & alpha cells) together, but not their individual contributions. ```{r 'Group expectations'} ## define cell types in one group grouped_cts <- c("alpha cell", "acinar cell") ## initialize group assignment matrix (nCelltype x nGroup) with 0's celltype_names <- colnames(signature$mu) group_names <- c("Group", setdiff(celltype_names, grouped_cts)) nCelltype <- length(celltype_names) nGroup <- length(group_names) group <- matrix(0, nrow = nGroup, ncol = nCelltype, dimnames = list(group_names, celltype_names) ) ## initialize prior fraction matrix (nSamples x nCelltypes) with NA's nSamples <- ncol(pseudobulk) sample_names <- colnames(pseudobulk) prior <- matrix( nrow = nSamples, ncol = nGroup, dimnames = list(sample_names, group_names) ) ## Assign cell types to groups for (ct in celltype_names) { if (ct %in% grouped_cts) { ## assign cell type to group group["Group", ct] <- 1 } else { ## Assign cell type to its own group group[ct, ct] <- 1 } } ## Add grouped prior fractions to prior prior[names(true_fractions), "Group"] <- colSums(as.matrix( true_fractions[grouped_cts, ])) ## init group prior group_prior <- list("Group" = group, "Expectation" = prior) ## Now you can just run Statescope as with a normal prior. Note we do not give ## a ductal cell prior this time Statescope <- BLADE_deconvolution(signature, pseudobulk, selected_genes, group_prior, cores = 1L, Nrep = 1L ) ``` # Citing `StatescopeR` We hope that `r Biocpkg("StatescopeR")` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you! ```{r "citation"} ## Citation info citation("StatescopeR") ``` # Reproducibility The `r Biocpkg("StatescopeR")` package `r Citep(bib[["StatescopeR"]])` was made possible thanks to: - R `r Citep(bib[["R"]])` - `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` - `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` - `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])` - `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` - `r CRANpkg("sessioninfo")` `r Citep(bib[["sessioninfo"]])` - `r CRANpkg("testthat")` `r Citep(bib[["testthat"]])` This package was developed using `r BiocStyle::Biocpkg("biocthis")`. `R` session information. ```{r reproduce3, echo=FALSE} ## Session info library("sessioninfo") options(width = 80) session_info() ```