--- title: "Dimensionality reduction of single cell data with corral" author: - name: Lauren Hsu affiliation: Harvard TH Chan School of Public Health; Dana-Farber Cancer Institute - name: Aedin Culhane affiliation: Department of Data Sciences, Dana-Farber Cancer Institute, Department of Biostatistics, Harvard TH Chan School of Public Health date: "4/28/2020" output: BiocStyle::html_document: toc_float: true BiocStyle::pdf_document: default package: BiocStyle bibliography: ref.bib vignette: | %\VignetteIndexEntry{dim reduction with corral} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(gridExtra) ``` # Introduction Single-cell 'omics analysis enables high-resolution characterization of heterogeneous populations of cells by quantifying measurements in individual cells and thus provides a fuller, more nuanced picture into the complexity and heterogeneity between cells. However, the data also present new and significant challenges as compared to previous approaches, especially as single-cell data are much larger and sparser than data generated from bulk sequencing methods. Dimensionality reduction is a key step in the single-cell analysis to address the high dimensionality and sparsity of these data, and to enable the application of more complex, computationally expensive downstream pipelines. Correspondence analysis (CA) is a matrix factorization method, and is similar to principal components analysis (PCA). Whereas PCA is designed for application to continuous, approximately normally distributed data, CA is appropriate for non-negative, count-based data that are in the same additive scale. `corral` implements CA for dimensionality reduction of a single matrix of single-cell data. See the vignette for `corralm` for the multi-table adaptation of CA for single-cell batch alignment/integration. corral can be used with various types of input. When called on a matrix (or other matrix-like object), it returns a list with the SVD output, principal coordinates, and standard coordinates. When called on a `r Biocpkg('SingleCellExperiment')`, it returns the `r Biocpkg('SingleCellExperiment')` with the corral embeddings in the `reducedDim` slot named `corral`. To retrieve the full list output from a `SingleCellExperiment` input, the `fullout` argument can be set to `TRUE`. ![](corral_IO.png) # Loading packages and data We will use the `Zhengmix4eq` dataset from the `r Biocpkg('DuoClustering2018')` package. ```{r, message = FALSE} library(corral) library(SingleCellExperiment) library(ggplot2) library(DuoClustering2018) zm4eq.sce <- sce_full_Zhengmix4eq() ``` This dataset includes approximately 4,000 pre-sorted and annotated cells of 4 types mixed by Duo et al. in approximately equal proportions [@zmdata]. The cells were sampled from a "Massively parallel digital transcriptional profiling of single cells" [@zheng]. ```{r} zm4eq.sce table(colData(zm4eq.sce)$phenoid) ``` # `corral` on `r Biocpkg('SingleCellExperiment')` We will run `corral` directly on the raw count data: ```{r} zm4eq.sce <- corral(inp = zm4eq.sce, whichmat = 'counts') zm4eq.sce ``` We can use `plot_embedding` to visualize the output: ```{r} plot_embedding_sce(sce = zm4eq.sce, which_embedding = 'corral', plot_title = 'corral on Zhengmix4eq', color_attr = 'phenoid', color_title = 'cell type', saveplot = FALSE) ``` Using the `scater` package, we can also add and visualize `umap` and `tsne` embeddings based on the `corral` output: ```{r} library(scater) library(gridExtra) # so we can arrange the plots side by side zm4eq.sce <- runUMAP(zm4eq.sce, dimred = 'corral', name = 'corral_UMAP') zm4eq.sce <- runTSNE(zm4eq.sce, dimred = 'corral', name = 'corral_TSNE') ggplot_umap <- plot_embedding_sce(sce = zm4eq.sce, which_embedding = 'corral_UMAP', plot_title = 'Zhengmix4eq corral with UMAP', color_attr = 'phenoid', color_title = 'cell type', returngg = TRUE, showplot = FALSE, saveplot = FALSE) ggplot_tsne <- plot_embedding_sce(sce = zm4eq.sce, which_embedding = 'corral_TSNE', plot_title = 'Zhengmix4eq corral with tSNE', color_attr = 'phenoid', color_title = 'cell type', returngg = TRUE, showplot = FALSE, saveplot = FALSE) multiplot(ggplot_umap, ggplot_tsne, cols = 2) ``` The `corral` embeddings stored in the `reducedDim` slot can be used in downstream analysis, such as for clustering or trajectory analysis. `corral` can also be run on a `SummarizedExperiment` object. # `corral` on matrix `corral` can also be performed on a matrix (or matrix-like) input. ```{r} zm4eq.countmat <- assay(zm4eq.sce,'counts') zm4eq.countcorral <- corral(zm4eq.countmat) ``` The output is in a `list` format, including the SVD output (`u`,`d`,`v`), the standard coordinates (`SCu`,`SCv`), and the principal coordinates (`PCu`,`PCv`). ```{r} zm4eq.countcorral ``` We can use `plot_embedding` to visualize the output: (the embeddings are in the `v` matrix because these data are by genes in the rows and have cells in the columns; if this were reversed, with cells in the rows and genes/features in the column, then the cell embeddings would instead be in the `u` matrix.) ```{r} celltype_vec <- colData(zm4eq.sce)$phenoid plot_embedding(embedding = zm4eq.countcorral$v, plot_title = 'corral on Zhengmix4eq', color_vec = celltype_vec, color_title = 'cell type', saveplot = FALSE) ``` The output is the same as above with the `SingleCellExperiment`, and can be passed as the low-dimension embedding for downstream analysis. Similarly, UMAP and tSNE can be computed for visualization. (Note that in performing SVD, the direction of the axes doesn't matter so they may be flipped between runs, as `corral` and `corralm` use `irlba` to perform fast approximation.) # Session information ```{r} sessionInfo() ``` # References