%\VignetteEngine{knitr::knitr} \documentclass[a4paper, 9pt]{article} <>= BiocStyle::latex() @ %\VignetteIndexEntry{SIMLR and CIMLR Multi-kernel LeaRning methods} \usepackage[utf8]{inputenc} \usepackage{graphicx} \usepackage{placeins} \usepackage{url} \usepackage{tcolorbox} \usepackage{authblk} \begin{document} \title{\Biocpkg{SIMLR} and \Biocpkg{CIMLR} Multi-kernel LeaRning methods} \author[1]{Daniele Ramazzotti} \author[1]{Bo Wang} \author[2]{Luca De Sano} \author[1]{Serafim Batzoglou} \affil[1]{Department of Computer Science, Stanford University, Stanford, CA , USA.} \affil[2]{Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi Milano Bicocca Milano, Italy.} \date{\today} \maketitle \begin{tcolorbox}{\bf Overview.} In this package we provide implementations of both SIMLR and CIMLR. These methods were originally applied to single-cell and cancer genomic data, but they are in principle capable of effectively and efficiently learning similarities in all the contexts where diverse and heterogeneous statistical characteristics of the data make the problem harder for standard approaches. \\ \vspace{1.0cm} {\em In this vignette, we give an overview of the package by presenting some of its main functions.} \vspace{1.0cm} \renewcommand{\arraystretch}{1.5} \end{tcolorbox} <>= library(knitr) opts_chunk$set( concordance = TRUE, background = "#f3f3ff" ) @ \newpage \tableofcontents \section{Changelog} \begin{itemize} \item[1.0.0] implements SIMLR and SIMLR feature ranking algorithms. \item[1.0.2] implements SIMLR large scale algorithms. \item[1.5.1] implements CIMLR clustering method. \end{itemize} \section{Algorithms and useful links} \label{sec:stuff} \renewcommand{\arraystretch}{2} \begin{center} \begin{tabular}{| l | p{6.0cm} | l |} {\bf Acronym} & {\bf Extended name} & {\bf Reference}\\ \hline SIMLR & Single-cell Interpretation via Multi-kernel LeaRning & \href{https://www.nature.com/articles/nmeth.4207}{Paper}\\ \hline CIMLR & Cancer Integration via Multikernel LeaRning & \href{https://www.biorxiv.org/content/early/2018/02/16/267245}{Paper}\\ \hline \end{tabular} \end{center} \section{Using the SIMLR R package} We first load the data provided as examples in the package. The dataset BuettnerFlorian is used for an example of the standard SIMLR, while the dataset ZeiselAmit is used for an example of SIMLR large scale. Moreover, the dataset GliomasReduced is a reduced dataset of lower grade gliomas to be used as an example for CIMLR. <>= library(SIMLR) data(BuettnerFlorian) data(ZeiselAmit) data(GliomasReduced) @ The external R package igraph is required for the computation of the normalized mutual information to assess the results of the clustering. <>= library(igraph) @ We now run SIMLR as an example on an input dataset from Buettner, Florian, et al. "Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells." Nature biotechnology 33.2 (2015): 155-160. For this dataset we have a ground true of 3 cell populations, i.e., clusters. <>= set.seed(11111) example = SIMLR(X = BuettnerFlorian$in_X, c = BuettnerFlorian$n_clust, cores.ratio = 0) @ We now compute the normalized mutual information between the inferred clusters by SIMLR and the true ones. This measure with values in [0,1], allows us to assess the performance of the clustering with higher values reflecting better performance. <>= nmi_1 = compare(BuettnerFlorian$true_labs[,1], example$y$cluster, method="nmi") nmi_1 @ As a further understanding of the results, we now visualize the cell populations in a plot. <>= plot(example$ydata, col = c(topo.colors(BuettnerFlorian$n_clust))[BuettnerFlorian$true_labs[,1]], xlab = "SIMLR component 1", ylab = "SIMLR component 2", pch = 20, main="SIMILR 2D visualization for BuettnerFlorian") @ \begin{figure*}[ht] \begin{center} \includegraphics[width=0.5\textwidth]{figure/image-1} \end{center} \caption{Visualization of the 3 cell populations retrieved by SIMLR on the dataset by Florian, et al.} \end{figure*} We also run SIMLR feature ranking on the same inputs to get a rank of the key genes with the related pvalues. <>= set.seed(11111) ranks = SIMLR_Feature_Ranking(A=BuettnerFlorian$results$S,X=BuettnerFlorian$in_X) @ <>= head(ranks$pval) head(ranks$aggR) @ Similarly, We now show an example for CIMLR on a reduced version of the cohort of lower grade glioma cancer patients originally published in Cancer Genome Atlas Research Network. "Comprehensive, integrative genomic analysis of diffuse lower-grade gliomas." New England Journal of Medicine 372.26 (2015): 2481-2498. Also in this case, we ask for 3 clusters. <>= set.seed(11111) example_llg = CIMLR(X = GliomasReduced$in_X, c = 3, cores.ratio = 0) @ We finally show an example for SIMLR large scale on an input dataset being a reduced version of the dataset provided in Buettner, Zeisel, Amit, et al. "Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq." Science 347.6226 (2015): 1138-1142. For this dataset we have a ground true of 9 cell populations, i.e., clusters. <>= set.seed(11111) example_large_scale = SIMLR_Large_Scale(X = ZeiselAmit$in_X, c = ZeiselAmit$n_clust, kk = 10) @ We compute the normalized mutual information between the inferred clusters by SIMLR large scale and the true ones. <>= nmi_2 = compare(ZeiselAmit$true_labs[,1], example_large_scale$y$cluster, method="nmi") nmi_2 @ As a further understanding of the results, also in this case we visualize the cell populations in a plot. <>= plot(example_large_scale$ydata, col = c(topo.colors(ZeiselAmit$n_clust))[ZeiselAmit$true_labs[,1]], xlab = "SIMLR component 1", ylab = "SIMLR component 2", pch = 20, main="SIMILR 2D visualization for ZeiselAmit") @ \begin{figure*}[ht] \begin{center} \includegraphics[width=0.5\textwidth]{figure/image_large_scale-1} \end{center} \caption{Visualization of the 9 cell populations retrieved by SIMLR large scale on the dataset by Zeisel, Amit, et al.} \end{figure*} Now, as a final example, we also provide the results of two heuristics (see the original SIMLR paper) to estimate the number of clusters from data. <>= set.seed(53900) NUMC = 2:5 res_example = SIMLR_Estimate_Number_of_Clusters(BuettnerFlorian$in_X, NUMC = NUMC, cores.ratio = 0) @ Best number of clusters, K1 heuristic: <>= NUMC[which.min(res_example$K1)] @ K2 heuristic: <>= NUMC[which.min(res_example$K2)] @ Results of the two heuristics: <>= res_example @ \section{\Rcode{sessionInfo()}} <>= toLatex(sessionInfo()) @ \end{document}