%\VignetteEngine{knitr::knitr}

\documentclass[a4paper, 9pt]{article}

<<style, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

%% \VignetteIndexEntry{An R Package for todo}

\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{placeins}
\usepackage{url}
\usepackage{tcolorbox}

\begin{document}

\title{Using the \Biocpkg{SIMLR} package}

\author{
Bo Wang\footnote{Department of Computer Science, Stanford University, Stanford, CA, USA.} \and
Daniele Ramazzotti\footnote{Department of Pathology, Stanford University, Stanford, CA, USA.} \and
Luca De Sano\footnote{Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi Milano Bicocca, Milano, Italy.} \and
Junjie Zhu\footnote{Department of Electrical Engineering, Stanford University, Stanford, CA, USA.} \and
Emma Pierson\footnote{Department of Computer Science, Stanford University, Stanford, CA, USA.} \and
Serafim Batzoglou\footnote{Department of Computer Science, Stanford University, Stanford, CA, USA.}
}

\date{\today}
\maketitle

\begin{tcolorbox}{\bf Overview.} Single-cell RNA-seq technologies enable high-throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical to the identification, visualization and analysis of cell populations. However, single-cell data pose challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. We develop a novel similarity-learning framework, \Biocpkg{SIMLR} (Single-cell Interpretation via Multi-kernel LeaRning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization. \Biocpkg{SIMLR} separates known subpopulations in single-cell data sets more accurately than existing dimension reduction methods do.
Additionally, \Biocpkg{SIMLR} demonstrates high sensitivity and accuracy on high-throughput peripheral blood mononuclear cells (PBMC) data sets generated by the GemCode single-cell technology from 10x Genomics. \\

\vspace{1.0cm}

{\em In this vignette, we give an overview of the package by presenting some of its main functions.}

\vspace{1.0cm}

\renewcommand{\arraystretch}{1.5}

\end{tcolorbox}

<<>>=
library(knitr)
opts_chunk$set(
    concordance = TRUE,
    background = "#f3f3ff"
)
@

\newpage

\tableofcontents

\section{Changelog}

\begin{itemize}
\item[1.0] implements the SIMLR and SIMLR feature ranking algorithms.
\end{itemize}

\section{Algorithms and useful links} \label{sec:stuff}

\renewcommand{\arraystretch}{2}

\begin{center}
\begin{tabular}{| l | p{6.0cm} | l |}
{\bf Acronym} & {\bf Extended name} & {\bf Reference}\\ \hline
SIMLR & Single-cell Interpretation via Multi-kernel LeaRning & \href{http://biorxiv.org/content/early/2016/06/09/052225}{Paper}\\ \hline
\end{tabular}
\end{center}

\section{Using SIMLR}

We first load the data provided as an example in the package.

<<>>=
library(SIMLR)
data(BuettnerFlorian)
@

The external R package igraph is required to compute the normalized mutual information, which we use to assess the results of the clustering.

<<>>=
library(igraph)
@

We now run SIMLR as an example on an input dataset from Buettner, Florian, et al. ``Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells.'' Nature Biotechnology 33.2 (2015): 155--160. For this dataset we have a ground truth of 3 cell populations, i.e., clusters.

<<>>=
set.seed(11111)
example = SIMLR(X = BuettnerFlorian$in_X, c = BuettnerFlorian$n_clust, cores.ratio = 0)
@

We now compute the normalized mutual information between the clusters inferred by SIMLR and the true ones. This measure, with values in $[0,1]$, allows us to assess the performance of the clustering, with higher values reflecting better performance.
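
For reference, the measure can be written out explicitly. The form below is the symmetric normalization of Danon et al. (2005), which we assume is the one computed by \Rcode{compare} with \Rcode{method="nmi"} in igraph; the symbols $n$, $n_{ij}$, $a_i$ and $b_j$ are notation introduced here for illustration only. Let $n$ be the number of cells, $n_{ij}$ the number of cells that belong to true population $i$ and to inferred cluster $j$, and $a_i = \sum_j n_{ij}$, $b_j = \sum_i n_{ij}$ the sizes of the populations and of the clusters. With $U$ denoting the true labels and $V$ the inferred ones, the entropies and the mutual information are
\[
H(U) = -\sum_i \frac{a_i}{n} \log \frac{a_i}{n}, \qquad
H(V) = -\sum_j \frac{b_j}{n} \log \frac{b_j}{n}, \qquad
I(U;V) = \sum_{i,j} \frac{n_{ij}}{n} \log \frac{n \, n_{ij}}{a_i \, b_j},
\]
where terms with $n_{ij} = 0$ contribute zero, and the normalized mutual information is
\[
\mathrm{NMI}(U,V) = \frac{2 \, I(U;V)}{H(U) + H(V)}.
\]
Under this definition, two partitions that agree up to a relabeling of the clusters have $I(U;V) = H(U) = H(V)$ and therefore a value of $1$, while statistically independent partitions have $I(U;V) = 0$ and a value of $0$. The ratio is undefined when both partitions consist of a single group, since both entropies are then zero.
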
<<>>=
nmi_1 = compare(BuettnerFlorian$true_labs[,1], example$y$cluster, method="nmi")
print(nmi_1)
@

To better understand the results, we now visualize the cell populations in a plot.

<<image, fig.show='hide'>>=
plot(example$ydata,
    col = c(topo.colors(BuettnerFlorian$n_clust))[BuettnerFlorian$true_labs[,1]],
    xlab = "SIMLR component 1",
    ylab = "SIMLR component 2",
    pch = 20,
    main="SIMLR 2D visualization for Test_1_mECS")
@

\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.5\textwidth]{figure/image-1}
\end{center}
\caption{Visualization of the 3 cell populations retrieved by SIMLR.}
\end{figure*}

SIMLR supports SCESet objects. We now create an example object and then run SIMLR on it.

<<>>=
library(scran)
ncells = 100
ngenes = 50
mu <- 2^runif(ngenes, 3, 10)
gene.counts <- matrix(rnbinom(ngenes*ncells, mu=mu, size=2), nrow=ngenes)
rownames(gene.counts) = paste0("X", seq_len(ngenes))
sce = newSCESet(countData=data.frame(gene.counts))
output = SIMLR(X = sce, c = 8, cores.ratio = 0)
@

Finally, we run the SIMLR feature ranking on the same inputs to obtain a ranking of the key genes with the associated p-values.

<<>>=
ranks = SIMLR_Feature_Ranking(A=BuettnerFlorian$results$S,X=BuettnerFlorian$in_X)
@

<<>>=
head(ranks$pval)
head(ranks$aggR)
@

\section{\Rcode{sessionInfo()}}

<<sessioninfo, results='asis', echo=FALSE>>=
toLatex(sessionInfo())
@

\end{document}