%\VignetteIndexEntry{scDD Quickstart} %\VignettePackage{scDD} %\VignetteEngine{knitr::knitr} \documentclass{article} \usepackage{underscore} <>= BiocStyle::latex() @ \bioctitle[scDD]{Identifying differential distributions in single-cell RNA-seq experiments with scDD} \author{Keegan Korthauer\thanks{\email{keegan@jimmy.harvard.edu}}} \begin{document} \maketitle \begin{abstract} The \Rpackage{scDD} package models single-cell gene expression data (from single-cell RNA-seq) using flexible nonparamentric Bayesian mixture models in order to explicitly handle heterogeneity within cell populations. In bulk RNA-seq data, where each measurement is an average over thousands of cells, distributions of expression over samples are most often unimodal. In single-cell RNA-seq data, however, even when cells represent genetically homogeneous populations, multimodal distributions of gene expression values over samples are common \cite{korthauer2016}. This type of heterogeneity is often treated as a nuisance factor in studies of differential expression in single-cell RNA-seq experiments. Here, we explicitly accommodate it in order to improve power to detect differences in expression distributions that are more complicated than a mean shift. \end{abstract} \packageVersion{\Sexpr{BiocStyle::pkg_ver("scDD")}} Report issues on \url{https://github.com/kdkorthauer/scDD/issues} \newpage \tableofcontents \newpage \section{Background} Our aim is two-fold: (1) to detect which genes have different expression distributions across two biological conditions and (2) to classify those differences into informative patterns. Note that in (1) we explicitly say differences in 'distributions' rather than differences in 'average', which would correspond to traditional DE (differential expression) analysis in bulk RNA-seq. By examining the entire distribution, we are able to detect more subtle differences as well as describe complex patterns, such as the existence of subgroups of cells within and across condition that express a given gene at a different level. We start by assuming that the log-transformed nonzero expression values arise out of a Dirichlet Process Mixture of normals model. This allows us to characterize expression distributions in terms of the number of modes (or clusters). To detect differences in these distributions across conditions, an approximate Bayes Factor score is used which compares the conditional likelihood under the hypothesis of Equivalent distributions (ED) where one clustering process governs both conditions jointly, with the hypothesis of Differential distributions (DD) where each condition is generated from its own clustering process. In the full framework, significance of the scores for each gene are evaluated via empirical p-values after permutation. Optionally, a fast implementation obtains the p-values from the non-parametric Kolmogorov-Smirnov test. Zero values are considered by also implementing a $\chi^2$ test of whether the proportion of zero values differs by condition (adjusted for overall sample detection rate). More details are provided in \cite{korthauer2016}. After the detection step is carried out, the significantly DD genes are classified into four informative patterns based on the number of clusters detected and whether they overlap. These patterns, depicted in Figure~\ref{figure/patternPlot-1}, include (a) DE (differential expression of unimodal genes), (b) DP (differential proportion for multimodal genes), (c) DM (differential modality), and (d) DB (both differential modality and different component means). Genes where a differential proportion of zeroes were identified are classified as DZ (differential zero). Genes that are identified as significantly differentially distributed but do not fall into one of the above categories are abbreviated NC (for no call). This includes genes with the same number of components with similar component means, but differential variance. For reasons detailed in \cite{korthauer2016}, we do not aim to interpret this type of pattern. <>= ### Note that the following code can be ignored for the purposes of # analysis with scDD; it is simply used to generate the cartoon of # interesting DD patterns (illustration purposes only) par(mfrow=c(2,2), tcl=-0.5, mai=c(0.4,0.4,0.5,0.3)) x <- seq(0, 6, by=0.05) ## traditional de # mu1 is 2 # mu2 is 4 cord.x <- c(0,x,6) cord.y <- c(0,dnorm(x, 2, 0.75),0) curve(dnorm(x, 2 , 0.75),xlim=c(0,6),main="Traditional DE", xaxt="n", xlab="", ylab="", yaxt="n") polygon(cord.x,cord.y,col=rgb(0,0,1,1/4)) cord.x <- c(0,x,6) cord.y <- c(0,dnorm(x, 4, 0.75),0) lines(x, dnorm(x, 4 , 0.75)) polygon(cord.x,cord.y,col=rgb(1,0,0,1/4)) axis(side=1, at=c(2,4), labels=c(expression(mu[1]), expression(mu[2])), pos=0, cex.axis=1.5) mtext("(A)", side = 3, line=0.5, adj=-0.1, cex=1.2, font=2) x <- seq(0, 10, by=0.05) ## differential proportion cord.x <- c(0,x,10) cord.y <- c(0,0.3*dnorm(x, 7, 1) + 0.7*dnorm(x, 3, 1),0) curve(0.3*dnorm(x, 7, 1) + 0.7*dnorm(x, 3, 1),xlim=c(0,10),main="DP", xaxt="n", xlab="", ylab="", yaxt="n") polygon(cord.x,cord.y,col=rgb(0,0,1,1/4)) cord.x <- c(0,x,10) cord.y <- c(0,0.3*dnorm(x, 3, 1) + 0.7*dnorm(x, 7, 1),0) lines(x, 0.3*dnorm(x, 3, 1) + 0.7*dnorm(x, 7, 1)) polygon(cord.x,cord.y,col=rgb(1,0,0,1/4)) axis(side=1, at=c(3,7), labels=c(expression(mu[1]), expression(mu[2])), pos=0, cex.axis=1.5) mtext("(B)", side = 3, line=0.5, adj=-0.1, cex=1.2, font=2) ## differential modes (DM) cord.x <- c(0,x,6) cord.y <- c(0,dnorm(x, 2, 0.75),0) curve(dnorm(x, 2 , 0.75),xlim=c(0,6),main="DM", xaxt="n", xlab="", ylab="", yaxt="n") polygon(cord.x,cord.y,col=rgb(0,0,1,1/4)) cord.x <- c(0,x,6) cord.y <- c(0,0.3*dnorm(x, 2, 0.6) + 0.7*dnorm(x, 4, 0.6),0) lines(x, 0.3*dnorm(x, 2, 0.6) + 0.7*dnorm(x, 4, 0.6)) polygon(cord.x,cord.y,col=rgb(1,0,0,1/4)) axis(side=1, at=c(2,4), labels=c(expression(mu[1]), expression(mu[2])), pos=0, cex.axis=1.5) mtext("(C)", side = 3, line=0.5, adj=-0.1, cex=1.2, font=2) ## Both DM and DP cord.x <- c(0,x,10) cord.y <- c(0,0.5*dnorm(x, 2.5, 1) + 0.5*dnorm(x, 7.5, 1),0) curve(0.5*dnorm(x, 2.5, 1) + 0.5*dnorm(x, 7.5, 1), xlim=c(0,10),main="DB", xaxt="n", xlab="", ylab="", yaxt="n", ylim=c(0,max(0.5*dnorm(x, 2.5, 1) + 0.5*dnorm(x, 7.5, 1)))) polygon(cord.x,cord.y,col=rgb(0,0,1,1/4)) cord.x <- c(0,x,10) cord.y <- c(0,0.8*dnorm(x, 5, 2),0) lines(x, 0.8*dnorm(x, 5, 2)) polygon(cord.x,cord.y,col=rgb(1,0,0,1/4)) axis(side=1, at=c(2.5, 5, 7.5), labels=c(expression(mu[1]), expression(mu[3]), expression(mu[2])), pos=0, cex.axis=1.5) mtext("(D)", side = 3, line=0.5, adj=-0.1, cex=1.2, font=2) @ \incfig{figure/patternPlot-1}{0.65\textwidth} {Illustration of informative DD patterns}{} The rest of this vignette outlines the main functionality of the \Rpackage{scDD} package. This includes: \begin{itemize} \item Identifying genes that are expressed differently between two biological conditions and classifying them into informative patterns. \item Simulating single-cell RNA-seq data with differential expression that exhibits multimodal patterns. \item Preprocessing and formatting of single-cell RNA-seq data to facilitate analysis \item Visualizing the expression patterns using a violin plotting scheme \end{itemize} \section{Identify and Classify DD genes} In this section, we demonstrate how to use the main function \Rfunction{scDD} to find genes with differential distributions and classify them into the patterns of interest described in the previous section. First, we need to load the \Rpackage{scDD} package. For each of the following sections in this vignette, we assume this step has been carried out. <>= library(scDD) @ Next, we load the toy simulated example \Rclass{SingleCellExperiment} object that we will use for identifying and classifying DD genes. <>= data(scDatExSim) @ Verify that this object is a member of the \Rclass{SingleCellExperiment} class and that it contains 200 samples and 30 genes. The \Rcode{colData} slot (which contains a dataframe of metadata for the cells) should have a column that contains the biological condition or grouping of interest. In this example data, that variable is the 'condition' variable. Note that the input gene set needs to be in \Rclass{SingleCellExperiment} format, and should contain normalized counts. In practice, it is also advisable to filter the input gene set to remove genes that have an extremely high proportion of zeroes (see Section 6). More specifically, the test for differential distributions of the expressed measurements will not be carried out on genes where only one or fewer cells had a nonzero measurement (these genes will still be tested for differential proportion of zeroes (DZ) if the \Rcode{testZeroes} parameter is set to \Rcode{TRUE}, however). <>= class(scDatExSim) dim(scDatExSim) @ Next, specify the hyperparameter arguments that we'll pass to the \Rfunction{scDD} function. These values reflect heavy-tailed distributions over the paramaters and are robust to many different settings in simulation (see \cite{korthauer2016} for more details). << prior >>= prior_param=list(alpha=0.01, mu0=0, s0=0.01, a0=0.01, b0=0.01) @ Finally, call the \Rfunction{scDD} function to test for differential distributions, classify DD genes, and return the results. If the biological condition or grouping variable in the \Rcode{colData} slot is named something other than 'condition', you'll need to specify the name of the variable as an argument to the \Rfunction{scDD} function (set the \Rcode{condition} argument equal to the name of the relevant column). We won't perform the test for a difference in the proportion of zeroes since none exists in this simulated toy example data, but this option can be invoked by changing the \Rcode{testZeroes} option to \Rcode{TRUE}. Note that the default option is to use a fast test of differential distributions that involves the Kolmogorov-Smirnov test instead of the full permutation testing framework. This provides a fast implementation of the method at the cost of potentially slightly decreased power compared to the full scDD framework described in the manuscript (see Section 4 for more details). Note that if you are only interested in obtaining the results of the test for significance, and not in the classification of genes to the patterns mentioned above, you can achieve further computational speedup by setting \Rcode{categorize} to \Rcode{FALSE}. <
>= scDatExSim <- scDD(scDatExSim, prior_param=prior_param, testZeroes=FALSE) @ Four results objects are added to the \Robject{scDatExSim} SingleCellExperiment object in the \Rclass{metadata} slot. For convenience, the results objects can be extracted with the \Rfunction{results} function. The main results object is the \Rcode{"Genes"} object which is a \Rclass{data.frame} containing the following columns: \begin{itemize} \item \Rcode{gene}: gene name (matches rownames of SCdat) \item \Rcode{DDcategory}: name of the DD pattern (DE, DP, DM, DB, DZ), or NC (no call), or NS (not significant). \item \Rcode{Clusters.combined}: the number of clusters identified when pooling condition 1 and 2 together \item \Rcode{Clusters.c1}: the number of clusters identified in condition 1 alone \item \Rcode{Clusters.c2}: the number of clusters identified in condition 2 alone \item \Rcode{nonzero.pvalue}: p-value for KS test of differential distributions of expressed cells \item \Rcode{nonzero.pvalue.adj}: Benjamini-Hochberg adjusted p-value for KS test of differential distributions \item \Rcode{zero.pvalue}: p-value for test of difference in dropout rate (only if \Rcode{testZeroes==TRUE}) \item \Rcode{zero.pvalue.adj}: Benjamini-Hochberg adjusted p-value for test of difference in dropout rate (only if \Rcode{testZeroes==TRUE}) \item \Rcode{combined.pvalue}: Fisher's combined p-value for a difference in nonzero or zero values (only if \Rcode{testZeroes==TRUE}) \item \Rcode{combined.pvalue.adj}: Benjamini-Hochberg adjusted Fisher's combined p-value for a difference in nonzero or zero values (only if \Rcode{testZeroes==TRUE}) \end{itemize} This can be extracted using the following call to \Rfunction{results}: <
>= RES <- results(scDatExSim) head(RES) @ The remaining three results objects are matrices (first for condition 1 and 2 combined, then condition 1 alone, then condition 2 alone) that contain the cluster memberships (partition estimates) for each sample (for clusters 1,2,3,...) in columns and genes in rows. Zeroes, which are not involved in the clustering, are labeled as zero. These can be extracted by specifying an alternative \Robject{type} when calling the \Rfunction{results} function. For example, we can extract the partition estimates for condition 1 with the following: <>= PARTITION.C1 <- results(scDatExSim, type="Zhat.c1") PARTITION.C1[1:5,1:5] @ \section{Alternate test for Differential Distributions} The first step in the scDD framework that identifies Differential Distributions was designed to have optimal power to detect differences in expression distributions, but the utilization of a permutation test on the Bayes Factor can be computationally demanding. While this is not an issue when machines with multiple cores are available since the code takes advantage of parallel processing, we also provide the option to use an alternate test to detect distributional differences that avoides the use of a permutation test. This option (default) uses the Kolmogorov-Smirnov test, which examines the null hypothesis that two samples are generated from the same continuous distribution. While the use of this test yielded slighlty lower power in simulations than the full permutation testing framework at lower sample sizes (50-75 cells in each condition) and primarily affected the DB pattern genes, it does not require permutations and thus is orders of magnitude faster. The overall power to detect DD genes in simulation was still comparable or favorable to exisiting methods for differential expression analysis of scRNA-seq experiments. The remaining steps of the scDD framework remain unchanged if the alternate test is used. That is, the Dirichlet process mixture model is still fit to the observed expression measurements so that the significant DD genes can be categorized into patterns that represent the major distributional changes, and results can still be visualized with violin plots using the \Rfunction{sideViolin} function described in the Plotting section. The option to use the full permutation testing procedure instead of the Kolmogorov-Smirnov test is invoked by setting the number of permutations to something other than zero (the \Rcode{permutations} argument in \Rfunction{scDD}) when calling the main \Rfunction{scDD} function as follows: <
>= scDatExSim <- scDD(scDatExSim, prior_param=prior_param, testZeroes=FALSE, permutations=100) @ The line above will run 100 permutations of every gene. In practice, it is recommended that at least 1000 permutations are carried out if using the full permutation testing option. Note that this option will take significantly longer than the default option to use the alternate KS test, and computation time will increase with more genes and/or more permutations, but multiple cores will automatically be utilized (if available) via the \Rpackage{BiocParallel} package. By default, an OS appropriate back-end using the number of cores on the machine minus 2 is chosen automatically. Alternatively, you can specificy the number of cores to use by passing in a \Rcode{param} argument in the \Rcode{scDD} function call (where the \Rcode{param} argument is an object of class \Rcode{MulticoreParm} for Linux-like OS or \Rcode{SnowParam} for Windows). For example, to use 12 cores on a Linux-like OS, specify \Rcode{param=MulticoreParam(workers=12)}. The results returned by \Rfunction{scDD} remain exactly as described in the previous section, with the exception that the \Rcode{nonzero.pvalue} and \Rcode{nonzero.pvalue.adj} columns of the \Rfunction{Genes} data frame now contain the p-values and Benjamini-Hochberg adjusted p-values of the perumtation test of the Bayes Factor for independence of condition membership with clustering. \section{Simulation} Here we show how to generate a simulated single-cell RNA-seq dataset which contains multi-modal genes. The \Rfunction{simulateSet} function simulates data from a two-condition experiment with a specified number of genes that fall into each of the patterns of interest. For DD genes, these include DE (differential expression of unimodal genes), DP (differential proportion for multimodal genes), DM (differential modality), and DB (both differential modality and mean expression levels), and for ED genes these include EE (equivalent expression for unimodal genes) and EP (equivalent proportion for multimodal genes). The simulation parameters are based on observed data from two conditions, so the function requires an \Rclass{SingleCellExperiment} formatted dataset as input. The output of the function is also a \Rclass{SingleCellExperiment} object with information about the true category of each gene and its simulated fold change stored in the \Rcode{rowData} slot. First, we load the toy example \Rclass{SingleCellExperiment} to simulate from <>= data(scDatEx) @ We'll verify that this object is a member of the \Rclass{SingleCellExperiment} class and that it contains 142 samples and 500 genes << check class2>>= class(scDatEx) dim(scDatEx) @ Next we need to set the arguments that will be passed to the \Rfunction{simulateSet} function. In this example we will simulate 30 genes total, with 5 genes of each type and 100 samples in each of two conditions. We also set a random seed for reproducibility. << set num >>= nDE <- 5 nDP <- 5 nDM <- 5 nDB <- 5 nEE <- 5 nEP <- 5 numSamples <- 100 seed <- 816 @ Finally, we'll create the simulated set with specified numbers of DE, DP, DM, DM, EE, and EP genes and specified number of samples, where DE gene fold changes represent 2 standard deviations of the observed fold change distribution, and multimodal genes have cluster mean distance of 4 standard deviations. <>= SD <- simulateSet(scDatEx, numSamples=numSamples, nDE=nDE, nDP=nDP, nDM=nDM, nDB=nDB, nEE=nEE, nEP=nEP, sd.range=c(2,2), modeFC=4, plots=FALSE, random.seed=seed) # load the SingleCellExperiment package to use rowData method library(SingleCellExperiment) head(rowData(SD)) @ The \Rcode{normcounts} assay \Robject{SD} object (of class \Rclass{SingleCellExperiment}) contains simulated expression values. The \Rcode{rowData} slot stores the fold change/modal distance values and gene expression categories, which are useful in assessing performance of a differential expression method. \section{Formatting and Preprocessing} Before beginning an analysis using \Rpackage{scDD}, you will need to carry out a few preprocessing steps. This includes normalization, filtering of genes that are mostly zero, and getting the data into format that is expected by the \Rfunction{scDD} function. The following subsections will detail these steps. \subsection{Constructing a SingleCellExperiment object} In this subsection, we provide a quick example of how to construct an object of the \Rclass{SingleCellExperiment} class. For more detailed instructions, refer to the \Rpackage{SingleCellExperiment} package documentation. Here we will convert the toy example data object \Robject{scDatExList} into a \Rclass{SingleCellExperiment} object. This object is a list of two matrices (one for each condition) of normalized counts. Each matrix has genes in rows and cells in columns, and is named "C1" or "C2" (for condition 1 or 2). First, load the \Rpackage{SingleCellExperiment} package: <>= library(SingleCellExperiment) @ Next, load the toy example data list: <>= data(scDatExList) @ Next, create a vector of condition membership labels (these should be 1 or 2). In our example data list, we have 78 cells in condition 1, and 64 cells in condition 2. <>= condition <- c(rep(1, ncol(scDatExList$C1)), rep(2, ncol(scDatExList$C2))) @ The rows and columns of the expression matrix should have unique names. This is already the case for our toy example dataset. The names of the columns should also correspond to the names of the condition membership labels in \Robject{condition}. <>= # Example of row and column names head(rownames(scDatExList$C1)) head(colnames(scDatExList$C2)) names(condition) <- c(colnames(scDatExList$C1), colnames(scDatExList$C2)) @ Once our labeling is intact, we can call the \Rfunction{SingleCellExperiment} function and specify the two relevant pieces of information. The \Rcode{normcounts} assays slot should contain one matrix, so we use \Rcode{rbind} here to combine both conditions. Optionally, additional experiment information can be stored in additional slots; see the \Rpackage{SingleCellExperiment} package for more details. <>= sce <- SingleCellExperiment(assays=list(normcounts=cbind(scDatExList$C1, scDatExList$C2)), colData=data.frame(condition)) show(sce) @ \subsection{Filtering and Normalization} In this subsection, we demonstrate the utility of the \Rfunction{preprocess} function, which can be helpful if working with raw counts, or data which contains genes that are predominantly zero (common in single-cell RNA-seq experiments). This function takes as input a list of data matrices, one for each condition. First, load the toy example data and verify it is a \Rclass{SingleCellExperiment} object: <>= data(scDatEx) show(scDatEx) @ Now, apply the \Rfunction{preprocess} function with the \Robject{zero.thresh} argument set to 0.9 so that genes are filtered out if they are 90 (or more) percent zero. <>= scDatEx <- preprocess(scDatEx, zero.thresh=0.9) show(scDatEx) @ We can see that no genes were removed, since all have fewer than 90 percent of zeroes to begin with. Now, apply the preprocess function again, but this time use a more stringent threshold on the proportion of zeroes and apply normalization using size factors calculated using the \Rpackage{scran}. In this example, we set the \Robject{zero.thresh} argument to 0.50 so that genes with more than 50 percent zeroes are filtered out and we set the \Robject{scran_norm} argument to \Rcode{TRUE} to return \Rpackage{scran} normalized counts. <>= scDatEx.scran <- preprocess(scDatEx, zero.thresh=0.5, scran_norm=TRUE) show(scDatEx.scran) @ We can see that 38 genes were removed due to having more than 50 percent zeroes. The warning message here is letting us know that our SingleCellExperiment object already contained a \Rcode{normcounts} assay, and that by specifying \Rcode{scran_norm=TRUE}, we are replacing that with the scran normalized counts and moving the original values to a new slot called \Rcode{normcounts-orig}. Also included is the option to use median normalization, invoked by setting \Robject{median_norm} to \Rcode{TRUE}. \section{Plotting} Next we demonstrate the plotting routine that is implemented in the \Rfunction{sideViolin} function. This function produces side-by-side violin plots (where the curves represent a smoothed kernel density estimate) of the log-transformed data. A count of 1 is added before log-transformation so that zeroes can be displayed, but they are not included in the density estimation. Each condition is represented by one violin plot. Individual data points are plotted (with jitter) on top. We illustrate this function by displaying the six types of simulated genes using the toy example simulated dataset. First, load the toy simulated dataset: <>= data(scDatExSim) @ Next, load the \Rpackage{SingleCellExperiment} package to facilitate subset operations on \Rcode{SingleCellExperiment} class objects: <>= library(SingleCellExperiment) @ The following lines will produce the figures in Figure~\ref{figure/plotGrid-1}. Plot side by side violin plots for Gene 1 (DE): <>= de <- sideViolin(normcounts(scDatExSim)[1,], scDatExSim$condition, title.gene=rownames(scDatExSim)[1]) @ Plot side by side violin plots for Gene 6 (DP): <>= dp <- sideViolin(normcounts(scDatExSim)[6,], scDatExSim$condition, title.gene=rownames(scDatExSim)[6]) @ Plot side by side violin plots for Gene 11 (DM): <>= dm <- sideViolin(normcounts(scDatExSim)[11,], scDatExSim$condition, title.gene=rownames(scDatExSim)[11]) @ Plot side by side violin plots for Gene 16 (DB): <>= db <- sideViolin(normcounts(scDatExSim)[16,], scDatExSim$condition, title.gene=rownames(scDatExSim)[16]) @ Plot side by side violin plots for Gene 21 (EP): <>= ep <- sideViolin(normcounts(scDatExSim)[21,], scDatExSim$condition, title.gene=rownames(scDatExSim)[21]) @ Plot side by side violin plots for Gene 26 (EE): <>= ee <- sideViolin(normcounts(scDatExSim)[26,], scDatExSim$condition, title.gene=rownames(scDatExSim)[26]) @ The plot objects returned by \Rfunction{sideViolin} are standard \Rpackage{ggplot2} objects, and thus can be manipulated into multipanel figures with the help of the \Rpackage{gridExtra} or \Rpackage{cowplot} packages. Here we use \Rfunction{grid.arrange} from the \Rpackage{gridExtra} package to visualize all the plots generated above. The end result is shown in Figure~\ref{figure/plotGrid-1}. <>= library(gridExtra) grid.arrange(de, dp, dm, db, ep, ee, ncol=2) @ \incfig{figure/plotGrid-1}{0.92\textwidth}{Example Simulated DD genes}{} \section{Session Info} Here is the output of \Rfunction{sessionInfo} on the system where this document was compiled: <>= sessionInfo() @ \bibliography{vignette} \end{document}