--- title: "BEARscc: Using spike-ins to assess single cell cluster robustness" author: "David T. Severson" output: BiocStyle::pdf_document vignette: > %\VignetteIndexEntry{Vignette Title} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` \newpage ```{r prepare_libaries, eval=TRUE, include=FALSE} library("data.table") library("SingleCellExperiment") ``` # Introduction ## Scope Single-cell transcriptome sequencing data are subject to substantial technical variation and batch effects that can confound the classification of cellular sub-types. Unfortunately, current clustering algorithms do not account for this uncertainty. To address this shortcoming, we have developed a noise perturbation algorithm called **BEARscc** that is designed to determine the extent to which classifications by existing clustering algorithms are robust to observed technical variation. **BEARscc** makes use of spike-in measurements to model technical variance as a function of gene expression and technical dropout effects on lowly expressed genes. In our benchmarks, we found that BEARscc accurately models read count fluctuations and drop-out effects across transcripts with diverse expression levels. Applying our approach to publicly available single-cell transcriptome data of mouse brain and intestine, we have demonstrated that BEARscc identified cells that cluster consistently, irrespective of technical variation. For more details, see the [manuscript on bioRxiv](http://biorxiv.org/content/early/2017/03/21/118919). Importantly, **BEARscc** should not be considered another clustering algorithm. Specifically, this package is designed to supply users with an organic tool to evaluate and explore the impact of noise on their single cell cluster interpretations. The package provides users with a way to clarify the precision of a single cell experiment with respect to grouping cells into clusters that are biologically meaningful. In this way, **BEARscc** allows users to achieve confidence in clusters and relevant cells that consistently cluster together invariant to the noise inherent to a single cell experiment. Conversely, the algorithm provides a mechanism to identify cells and clusters which cannot be resolved given the precision of the experiment in conjuction with the clustering algorithm of choice. It is our hope that **BEARscc** will enable users to proceed with clusters, or biological groups, in which they are confident are robust to noise and in which they have an intimate understanding of those cells and clusters that may be less precisely assigned to the putative biological role. ## Installation BEARscc is now available on Bioconductor and can be installed using the syntax below. ```{r install_BEARscc, include=TRUE, eval=FALSE} source("http://bioconductor.org/biocLite.R") biocLite("BEARscc") ``` ## Citation BEARscc and its associated manuscript are currently under review for publication at a peer-reviewed journal. For now, please cite the bioRxiv pre-print: Severson, DT. Owen, RP. White, MJ. Lu, X. Schuster-Boeckler, B. BEARscc determines robustness of single-cell clusters using simulated technical replicates. 
doi: https://doi.org/10.1101/118919

\newpage

# Tutorial

## Overview

BEARscc relies upon spike-in count measurements in single-cell transcriptome experiments to estimate experimental noise and produce simulated technical replicates, providing a quantitative understanding of how robust proposed single cell cluster labels are to experimental noise. In principle, the algorithm is compatible with any clustering algorithm. The following should provide users with a comprehensive tutorial of the use and utility of BEARscc as a tool for vetting single cell clusters with respect to experimental noise.

Before getting started, we need to load some example single cell data. BEARscc is equipped with a set of sample data for the purpose of testing functions, examples in help files, and this nifty tutorial. The data may be loaded as follows:

```{r load_data, eval=TRUE}
library("BEARscc")
data("BEARscc_examples")
```

The loaded file `BEARscc_examples` is equipped with separate `data.frame` objects including ERCC spike-in observations (`ERCC.counts.df`), endogenous count observations (`data.counts.df`), and the expected or actual spike-in concentrations (`ERCC.meta.df`), as well as a `SingleCellExperiment` object that contains all of the above separate components (`BEAR_examples.sce`), as shown below:

```{r display_data, eval=TRUE}
head(ERCC.counts.df[,1:2])
head(data.counts.df[,1:2])
head(ERCC.meta.df)
BEAR_examples.sce
```

In the event we were working with a new set of data, the spike-in concentrations `data.frame` can be computed from industry reported concentrations and the relevant dilution protocol utilized in the experiment. Count tables would need to be mapped and counted with preferred software, and the spike-in control counts (ERCC or otherwise) would need to be identified among the endogenous counts.

Below is how one would create a `SingleCellExperiment` object from spike-in count, endogenous count, and spike-in concentration `data.frame` objects. Note how we create the `observed_expression` assay object in the following code. This object is essential in that all estimation and simulation of replicates occurs assuming these are the observed counts (normalized or otherwise); without it, BEARscc will throw an error indicating that `"observed_expression"` is not in `names(assays())`. Also, `data.counts.df` in this example includes both spike-in genes and endogenous genes. In general, we recommend spike-in genes be simulated with endogenous genes as a control, and this is done by default when the `SingleCellExperiment` object is used.

```{r create_SCEList, eval=TRUE}
BEAR.se <- SummarizedExperiment(list(counts=as.matrix(data.counts.df)))
BEAR_examples.sce <- as(BEAR.se, "SingleCellExperiment")
metadata(BEAR_examples.sce) <- list(spikeConcentrations=ERCC.meta.df)
assay(BEAR_examples.sce, "observed_expression") <- counts(BEAR_examples.sce)
isSpike(BEAR_examples.sce, "ERCC_spikes") <- grepl("^ERCC-",
    rownames(BEAR_examples.sce))
```

Running the above code should yield a `SingleCellExperiment` object identical to the one that loads with `data("BEARscc_examples")`.

## Building the noise model

We will now estimate the single-cell noise present in the experiment using spike-in controls. In this tutorial, we rely upon a subsample of artificial control data found in `BEARscc_examples`; however, users are encouraged to work through the tutorial with their own single cell data, provided some form of spike-ins were included in the experiment.
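Before fitting the model, it can be worth a quick, informal check that the spike-ins were annotated as intended. The following sketch uses only the object constructed above and is not a required BEARscc step:

```{r check_spikeins, include=TRUE, eval=TRUE}
## Informal sanity check: how many features are flagged as spike-ins,
## and do the flagged features carry non-trivial counts?
table(isSpike(BEAR_examples.sce, "ERCC_spikes"))
summary(rowMeans(assay(BEAR_examples.sce, "observed_expression")[
    isSpike(BEAR_examples.sce, "ERCC_spikes"), ]))
```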
Building the noise models with BEARscc is relatively straightforward with `estimate_noiseparameters()`. We simply provide the function with the now adequately annotated `SingleCellExperiment` object, `BEAR_examples.sce`. Here, the parameter `alpha_resolution` is set to 0.1 to speed things along, but we suggest values between 0.001 and 0.01 in real applications of BEARscc.

```{r estimate_noise, eval=TRUE}
BEAR_examples.sce <- estimate_noiseparameters(BEAR_examples.sce,
    max_cumprob=0.9999, alpha_resolution=0.1, bins=10,
    write.noise.model=FALSE, file="BEAR_examples")
```

Several options exist for `estimate_noiseparameters()`. These are fully documented in the help page `?estimate_noiseparameters`.

## Simulating technical replicates

Following estimation of noise, the computed parameters are used to generate simulated technical replicates. Here we will simulate replicates on our local computer, but frequently users will want to utilize the methods described in Section 2.4. Notably, the necessary parameters are conveniently stored in the `metadata` of our `SingleCellExperiment` object, `BEAR_examples.sce`, following noise estimation, and so we simply run:

```{r simulate_replicates, eval=TRUE}
BEAR_examples.sce <- simulate_replicates(BEAR_examples.sce,
    max_cumprob=0.9999, n=10)
```

Recall that `BEAR_examples.sce` is our `SingleCellExperiment` object that we recently annotated with model parameters describing experimental noise using the function `estimate_noiseparameters()`, and note that the variable `n` is the desired number of simulated technical replicates (e.g. 10). Finally, `max_cumprob` is identical to its use in the noise estimation; if the user deviated from the default parameter there, it is highly recommended that the same value be supplied as was utilized in `estimate_noiseparameters()`. The resulting object is a list, where each element is a simulated technical replicate and one element is the original counts matrix.

## Simulation of replicates for larger datasets

For larger datasets, we set `write.noise.model=TRUE` when running `estimate_noiseparameters()` and copy the written Bayesian drop-out and noise estimate files, together with the observed counts table, to a high performance computing environment. The following code provides an example:

```{r estimate_noise_HPC, eval=FALSE, include=TRUE}
BEAR_examples.sce <- estimate_noiseparameters(BEAR_examples.sce,
    write.noise.model=TRUE, file="tutorial_example",
    model_view=c("Observed", "Optimized"))
```

After running the above code, within the current working directory (if unsure, use `getwd()`) we should find the two tab-delimited files that together completely describe the BEARscc noise model. These are the parameters describing the mixed model of technical variation (`tutorial_example_parameters4randomize.xls`, see Section 3.1.1) and the parameters describing the drop-out model (`tutorial_example_bayesianestimates.xls`, see Section 3.1.2). In addition, we should find our `"observed_expression"` matrix in the form of `tutorial_example_counts4clusterperturbation.xls`. Note that the `xls` extension just allows users to quickly open these tab-delimited files in Microsoft Excel if desired; they can be readily viewed on the terminal or in simple text editors as well. With the original counts file and noise model prepared, we then copy these files to our high performance compute cluster.
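Before transferring, it can be reassuring to peek at the written files. For example (illustrative only, since the files exist only after the chunk above has been run):

```{r inspect_noise_files, eval=FALSE, include=TRUE}
## Optional peek at the noise model files before transfer; the file
## names follow the file="tutorial_example" stem used above.
head(read.delim("tutorial_example_parameters4randomize.xls"))
head(read.delim("tutorial_example_bayesianestimates.xls"))
```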
The following code provides a sense of how to proceed once these files have been copied to a high performance cluster; however, the job submission structure of each user's environment will dictate the precise syntax for the following procedure. The script `HPC_generate_noise_matrices` contains an analog of `simulate_replicates()` for use on a high performance computing node. To utilize these functions for simulating technical replicates on a cluster, please install BEARscc on the relevant cluster. The user should write an R script to load the BEARscc library and run the simulation. The following code provides a suggested format for both calling the R script with a bash job script and the relevant R invocation of BEARscc, and may also be found as a stand-alone script in `inst/example/HPC_run_example.R`.

Our cluster utilizes a job submission format that interacts seamlessly with bash code; the `$SGE_TASK_ID` variable represents an array id for jobs, which conveniently generates 100 simulated technical replicates in a single job array. In any case, this variable should be treated as the index of the simulated technical replicate; from experience, we recommend that users generate 50 to 100 such simulated technical replicates to reach a stable noise consensus matrix solution. The following bash code could be included in one such job script:

```{bash run_sim_rep_HPC, eval=FALSE, include=TRUE}
Rscript --vanilla HPC_run_example.R $SGE_TASK_ID
```

Note that the file `HPC_run_example.R` contains the following suggested code to run BEARscc:

```{r HPC_example, eval=FALSE, include=TRUE}
library("BEARscc")

#### Load data ####
ITERATION <- commandArgs(trailingOnly=TRUE)[1]
counts.df <- read.delim("tutorial_example_counts4clusterperturbation.xls")
## filter out zero-count genes to speed up the algorithm
counts.df <- counts.df[rowSums(counts.df) > 0, ]
probs4detection <- fread("tutorial_example_bayesianestimates.xls")
parameters <- fread("tutorial_example_parameters4randomize.xls")

#### Simulate replicates ####
counts.error <- HPC_simulate_replicates(counts_matrix=counts.df,
    dropout_parameters=probs4detection,
    spikein_parameters=parameters)
## assumes the simulated_replicates/ directory already exists
write.table(counts.error, file=paste("simulated_replicates/",
    paste(ITERATION, "sim_replicate_counts.txt", sep="_"), sep=""),
    quote=FALSE, row.names=TRUE)
#########
```

The script generates separate simulated technical replicate files, which can be loaded into R as a list for clustering or, in the case of more computationally intense clustering algorithms, re-clustered individually in a high performance compute environment.

## Forming a noise consensus

After generating simulated technical replicates, these should be re-clustered using the clustering method applied to the original dataset. For simplicity, here we use hierarchical clustering on a euclidean distance metric to identify two clusters. In our experience, some published clustering algorithms are sensitive to cell order, so we suggest scrambling the order of cells for each noise iteration, as we do below in the function `recluster()`. The function below will serve our purposes for this tutorial, but BEARscc may be used with any clustering algorithm (e.g. SC3, RaceID2, or BackSPIN).
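If the simulated technical replicates were instead generated on a cluster as in Section 2.4, they can first be read back into R as a list. A minimal sketch, assuming the `simulated_replicates/` output directory and file naming used in `HPC_run_example.R` above:

```{r load_sim_replicates, eval=FALSE, include=TRUE}
## Read the HPC-generated replicates back into a list; write.table()
## above used its default space separator, so read.table() is used here.
replicate.files <- list.files("simulated_replicates",
    pattern="sim_replicate_counts.txt$", full.names=TRUE)
simulated.list <- lapply(replicate.files, read.table, header=TRUE)
names(simulated.list) <- basename(replicate.files)
```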
To quickly recluster a list of simulated technical replicates, we define a reclustering function:

```{r recluster, include=TRUE, eval=TRUE}
recluster <- function(x) {
    x <- data.frame(x)
    scramble <- sample(colnames(x), size=length(colnames(x)), replace=FALSE)
    x <- x[, scramble]
    clust <- hclust(dist(t(x), method="euclidean"), method="complete")
    clust <- cutree(clust, 2)
    data.frame(clust)
}
```

First, we need to determine how the observed data cluster under our algorithm of choice (e.g. `recluster()`, a simple hierarchical clustering for illustrative purposes). These are the clusters that can be evaluated by `BEARscc` and compared to `BEARscc` meta-clustering:

```{r original_clusters, include=TRUE, eval=TRUE}
OG_clusters <- recluster(data.frame(assay(BEAR_examples.sce,
    "observed_expression")))
colnames(OG_clusters) <- "Original"
```

We then apply the function `recluster()` to all simulated technical replicates and the original counts matrix, and manipulate the list into a `data.frame`.

```{r recluster_go, include=TRUE, eval=TRUE}
cluster.list <- lapply(metadata(BEAR_examples.sce)$simulated_replicates,
    `recluster`)
clusters.df <- do.call("cbind", cluster.list)
colnames(clusters.df) <- names(cluster.list)
```

Note: if running clustering algorithms on a separate high performance cluster, the user should retrieve the labels and format them as a `data.frame` of cluster labels, where the last column must be the original cluster labels derived from the observed count data. As an example, examine the file [inst/example/example_clusters.tsv](inst/example/example_clusters.tsv).

Using the cluster labels file generated by clustering simulated technical replicates on a high performance compute environment, or with `recluster()` as described above, we can generate a noise consensus matrix as shown below. Three illustrative rows and columns of an example noise consensus matrix are displayed:

```{r compute_consensus, eval=TRUE, include=TRUE}
noise_consensus <- compute_consensus(clusters.df)
head(noise_consensus[,1:3], n=3)
```

Using the `aheatmap()` function in the `NMF` library, we can visualize the noise consensus matrix resulting from 10 simulated replicates of BEARscc. To reproduce the plot, run:

```{r plot_noise_consensus, eval=TRUE, include=TRUE}
library("NMF")
aheatmap(noise_consensus, breaks=0.5)
```

Although 10 simulated replicates is sparse (we recommend 50 to 100), we can already see that these samples likely belong to a single cluster. Indeed, these samples were prepared from a single ground truth of dilute whole tissue brain RNA-seq data from two batches of experimental data. If desired, we could annotate the above heatmap with relevant metadata concerning sample batch, origin, etc.

## Evaluating the noise consensus

In order to further interpret the noise consensus, we have defined three cluster (and analogous cell) metrics. Stability indicates the propensity for a putative cluster to contain the same cells across noise-injected counts matrices. Promiscuity indicates a tendency for cells in a putative cluster to associate with other clusters across noise-injected counts matrices. Score represents the promiscuity subtracted from the stability. We have found these metrics useful for informing the optimal number of clusters, in terms of resilience to noise, by cutting hierarchical clustering dendrograms of the noise consensus at various depths and comparing the results to the original clustering labels.
To do this, create a vector containing each number of clusters one wishes to examine (the function automatically determines the results for the dataset as a single cluster) and then cluster the consensus with the various cluster numbers using `cluster_consensus()`:

```{r cluster_consensus, include=TRUE, eval=TRUE}
vector <- seq(from=2, to=5, by=1)
BEARscc_clusts.df <- cluster_consensus(noise_consensus, vector)
```

We add the original clustering to the `data.frame` to evaluate its robustness as well as the suggested BEARscc clusters:

```{r add_original, include=TRUE, eval=TRUE}
BEARscc_clusts.df <- cbind(BEARscc_clusts.df, Original=OG_clusters)
```

### Understanding robustness at the cluster level

Now we can compute cluster metrics for each of the BEARscc cluster number scenarios and the original clustering; indeed, any cluster labels of the user's choosing could be supplied to vet against the information provided by the noise consensus. We accomplish this by running the command and displaying the first 5 rows of the resulting melted `data.frame`:

```{r compute_cluster_scores, include=TRUE, eval=TRUE}
cluster_scores.df <- report_cluster_metrics(BEARscc_clusts.df,
    noise_consensus, plot=FALSE)
head(cluster_scores.df, n=5)
```

Above is a melted `data.frame` that displays the name of each cluster, the size of each cluster, the metric (Score, Promiscuity, Stability), the value of each metric for the respective cluster and clustering, the clustering in question (1, 2, ..., Original), whether the cluster consists of only one cell, and finally the mean of each metric across all clusters in a clustering. The displayed rows show all metrics for the 1 cluster scenario, given the previously computed noise consensus from 10 simulated technical replicates, along with the score metric for the clusters of the 2 cluster scenario. Importantly, by definition, 1 cluster scenarios have a promiscuity value of 0. Observe that the score for the cluster in the 1 cluster scenario is much larger than that of either cluster in the 2 cluster scenario, and that this is reflected in the clustering mean column.

We can examine the BEARscc metrics for the original clustering using `ggplot2`, by subsetting the `data.frame` to the original clustering and the score metric:

```{r plot_clusterscores_original, include=TRUE, eval=TRUE}
library("ggplot2")
library("cowplot")
original_scores.df <- cluster_scores.df[
    cluster_scores.df$Clustering=="Original",]
ggplot(original_scores.df[original_scores.df$Metric=="Score",],
    aes(x=`Cluster.identity`, y=Value))+
    geom_bar(aes(fill=`Cluster.identity`), stat="identity")+
    xlab("Cluster identity")+ylab("Cluster score")+
    ggtitle("Original cluster scores")+guides(fill=FALSE)
```

We can see that this initial clustering is terrible, which makes sense given that the ground truth consists of a single biological entity. The stability and promiscuity metrics bear this out:

```{r plot_clusterstability_original, include=TRUE, eval=TRUE}
ggplot(original_scores.df[original_scores.df$Metric=="Stability",],
    aes(x=`Cluster.identity`, y=Value))+
    geom_bar(aes(fill=`Cluster.identity`), stat="identity")+
    xlab("Cluster identity")+ylab("Cluster stability")+
    ggtitle("Original cluster stability")+guides(fill=FALSE)
```

The high stability exhibited in the example above is not surprising, as samples in this example should have strong association with one another.
However, the promiscuity below reveals the reason for the low score:

```{r plot_clusterpromiscuity_original, include=TRUE, eval=TRUE}
ggplot(original_scores.df[original_scores.df$Metric=="Promiscuity",],
    aes(x=`Cluster.identity`, y=Value))+
    geom_bar(aes(fill=`Cluster.identity`), stat="identity")+
    xlab("Cluster identity")+ylab("Cluster promiscuity")+
    ggtitle("Original cluster promiscuity")+guides(fill=FALSE)
```

Despite the high stability, the samples within each cluster have high association with cells in the other cluster, which results in a high promiscuity reported from the noise consensus. As a net result, the scores in the original 2 cluster case are subpar for each cluster. Again, this is consistent with the ground truth of the example data.

### Understanding robustness at the sample level

Completely analogous to the cluster metrics, the extent to which cells belong within a given cluster may be evaluated with respect to the noise consensus. Below we demonstrate how to compute the cell metrics and display 4 cells for illustrative purposes.

```{r compute_cell_scores, include=TRUE, eval=TRUE}
cell_scores.df <- report_cell_metrics(BEARscc_clusts.df, noise_consensus)
head(cell_scores.df, n=4)
```

The output is a melted `data.frame` that displays the name of each cluster to which the cell belongs, the cell label, the size of each cluster, the metric (Score, Promiscuity, Stability), the value for each metric, and finally the clustering in question (1, 2, ..., Original). As with the cluster metrics, these results can be plotted to visualize cells in the context of the original clusters using `ggplot2`. The score metric is plotted below to illustrate this:

```{r plot_cellscore_original, include=TRUE, eval=TRUE}
original_cell_scores.df <- cell_scores.df[
    cell_scores.df$Clustering=="Original",]
ggplot(original_cell_scores.df[original_cell_scores.df$Metric=="Score",],
    aes(x=factor(`Cluster.identity`), y=Value))+
    geom_jitter(aes(color=factor(`Cluster.identity`)), stat="identity")+
    xlab("Cluster identity")+ylab("Cell scores")+
    ggtitle("Original clustering cell scores")+guides(color=FALSE)
```

### Using BEARscc to inform cluster number, $k$, choice

While BEARscc certainly does not claim to provide a definitive solution to the question of the number of clusters, we believe it offers a useful perspective on the matter. Specifically, we have found that, when examining the cluster metrics across various hierarchical clusterings of the BEARscc noise consensus, the cluster number $k$ with the highest score tends to resemble ground truth more closely than simple gene sampling or related algorithms alone, as evidenced by experiments in control samples, *C. elegans*, and murine brain data (see our manuscript on [bioRxiv](http://biorxiv.org/content/early/2017/03/21/118919)). Importantly, this only provides a heuristic for determining the cluster number $k$ that takes into account the inherent noise of the single cell experiment. The heuristic depends upon the clustering algorithm of choice (the better the utilized clustering algorithm, the better the BEARscc $k$ heuristic) and represents a form of "meta-clustering" rather than a new way to determine the number of clusters, $k$, per se.
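In practice, applying the heuristic can be as simple as ranking clusterings by their mean cluster score. A short sketch using the `cluster_scores.df` computed in Section 2.6.1 (a summary device, not a definitive model selection rule):

```{r rank_clusterings, include=TRUE, eval=TRUE}
## Mean cluster score per clustering; the highest-scoring k is the
## BEARscc-suggested cluster number under this heuristic.
score.rows <- cluster_scores.df[cluster_scores.df$Metric=="Score", ]
sort(tapply(score.rows$Value, score.rows$Clustering, mean),
    decreasing=TRUE)
```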
As an illustration of this heuristic, we plot the distribution of cluster score values for the various cluster number scenarios below:

```{r choosing_k, include=TRUE, eval=TRUE}
ggplot(cluster_scores.df[cluster_scores.df$Metric=="Score",],
    aes(x=`Clustering`, y=Value))+
    geom_boxplot(aes(fill=Clustering, color=`Clustering`))+
    xlab("Clustering")+
    ylab("Cluster score distribution")+
    ggtitle("Cluster scores for each clustering")+
    guides(fill=FALSE, color=FALSE)
```

As we can see, the 1 cluster scenario provides the best cluster score as determined by the noise consensus. As mentioned previously, the single cluster solution resembles the biological ground truth.

\newpage

# Algorithm and theory

In order to simulate technical replicates, BEARscc first builds a statistical model of expression variance and drop-out frequency, which is dependent only on observed gene expression. The parameters of this model are estimated from spike-in counts. Expression-dependent variance is approximated by fitting read counts of each spike-in transcript across cells to a mixture model comprised of a Poisson and negative binomial distribution (Section 3.1.1). The drop-out model (Section 3.1.2) in BEARscc has two distinct parts: the \emph{drop-out injection distribution} models the likelihood that a given transcript concentration will result in a drop-out, and the \emph{drop-out recovery distribution} models the likelihood that an observed drop-out resulted from a given transcript concentration. The drop-out injection distribution is taken to be the observed drop-out rate in spike-in controls as a function of actual spike-in transcript concentration. This distribution is then used to estimate the drop-out recovery distribution density via Bayes' theorem and an empirically informed set of priors and assumptions. Briefly, BEARscc utilizes the drop-out injection distribution and the number of observed zeroes for each endogenous gene to infer a gene-specific probability distribution describing the likelihood that an observed drop-out should in fact have been some non-zero value, given the drop-out rate of the endogenous gene. This entire process is facilitated by the function `estimate_noiseparameters()`.

In the second step, BEARscc generates simulated technical replicates by applying the models described in the first step (Section 3.2). For every observed count in the range of values where drop-outs occurred amongst the spike-in transcripts, BEARscc uses the drop-out injection distribution from Step 1 to determine whether to convert the count to zero. For observations where the count is zero, the drop-out recovery distribution is used to estimate a new value, given the overall drop-out frequency for the gene (Section 3.2). Next, BEARscc substitutes all values larger than zero with a value generated from the derived model of expression variability, parameterized to the observed count for that gene. This procedure can then be repeated any number of times to generate a collection of simulated technical replicates. This step is carried out by `simulate_replicates()`.

In the third step, the simulated technical replicates are then re-clustered, using exactly the same method as for the observed data; this re-clustering for each simulated technical replicate is described as an \emph{association matrix}, where each element indicates whether two cells share a cluster identity (1) or cluster apart from each other (0).
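For intuition, an association matrix can be constructed directly from a vector of cluster labels. The following sketch is purely illustrative and is not the package's internal code:

```{r association_sketch, include=TRUE, eval=TRUE}
## Association matrix for one clustering: 1 when two cells share a
## cluster label, 0 otherwise.
labels <- c(cell_1=1, cell_2=1, cell_3=2)
outer(labels, labels, FUN="==") * 1
```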
The association matrices for each simulated technical replicate are averaged to form a noise consensus matrix that can be easily interpreted (Section 3.3). This is accomplished with the function `compute_consensus()`. Each element of the noise consensus matrix represents the fraction of simulated technical replicates in which, upon applying the clustering method of choice, two cells cluster together (the \emph{association frequency}). Then, the functions `report_cell_metrics()` and `report_cluster_metrics()` may be used to explore and quantitate the noise consensus matrix at the cell and cluster levels, respectively.

## Noise estimation

As mentioned previously, BEARscc uses spike-ins to estimate the noise of the experiment for the purpose of producing simulated technical replicates. BEARscc models overall technical variation with a mixture model (Section 3.1.1) and inferred drop-out effects (Section 3.1.2) independently, using the spike-in observations. However, a single BEARscc function, `estimate_noiseparameters()`, accomplishes both tasks.

### Estimating transcript variation

Technical variance is modeled in BEARscc by fitting a single parameter mixture model, $Z(c)$, to the spike-ins' observed count distributions. The noise model is fit independently for each spike-in transcript and subsequently regressed onto spike-in mean expression to define a generalized noise model. This is accomplished in three steps:

\begin{enumerate}
\item Define a mixture model composed of \emph{Poisson} and \emph{negative binomial} random variables:
\begin{equation} Z \sim (1-\alpha)*Pois(\mu)+\alpha*NBin(\mu,\sigma)
\label{eq:mixture_model}\end{equation}
\item Empirically fit the parameter, $\alpha_{i}$, in a spike-in specific mixture model, $Z_{i}$, to the observed distribution of counts for each ERCC spike-in transcript, $i$, where $\mu_i$ and $\sigma_i$ are the observed mean and standard deviation of the given spike-in. The parameter, $\alpha_{i}$, is chosen such that the error between the observed and mixture-model distributions is minimized.
\item Generalize the mixture model by regressing the $\alpha_{i}$ parameters and the observed standard deviations, $\sigma_i$, onto the observed spike-in mean expression, $\mu_i$. Thus the mixture model describing the noise observed in ERCC transcripts is defined solely by $\mu$, which is treated as the count transformation parameter, $c$, in the generation of simulated technical replicates.
\end{enumerate}

In step 2, a mixture model distribution is defined for each spike-in, $i$:
\begin{equation}
Z_{i}(\alpha_{i},\mu_{i},\sigma_{i})\sim (1-\alpha_{i})*Pois(\mu_i )+ \alpha_{i}*NBin(\mu_i,\sigma_i ).
\end{equation}
The distribution, $Z_i$, is fit to the observed counts of the respective spike-in, where $\alpha_i$ is an empirically fitted parameter, such that $\alpha_i$ minimizes the difference between the observed count distribution of the spike-in and the respective fitted model, $Z_i$. Specifically, for each spike-in transcript, $\mu_i$ and $\sigma_i$ are taken to be the mean and standard deviation, respectively, of the observed counts for spike-in transcript, $i$. Then, $\alpha_i$ is computed by empirical parameter optimization; $\alpha_i$ is taken to be the $\alpha_{i,j}$ in the mixture model,
\begin{equation}
Z_{i,j}(\alpha_{i,j},\mu_i,\sigma_i)\sim (1-\alpha_{i,j})* Pois(\mu_i )+\alpha_{i,j}*NBin(\mu_i,\sigma_i ),
\end{equation}
found to have the least absolute total difference between the observed count density and the density of the fitted model, $Z_i$.
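Conceptually, this optimization is a simple grid search over $\alpha$. The sketch below, with a hypothetical `fit_alpha()` helper and a moment-based negative binomial parameterization, conveys the idea without reproducing the package's internal implementation:

```{r alpha_gridsearch, include=TRUE, eval=FALSE}
## Schematic grid search for alpha_i for one spike-in transcript.
## Assumes var(counts) > mean(counts) so the moment-based NB size is valid.
fit_alpha <- function(counts, alphas=seq(0, 1, by=0.01)) {
    mu <- mean(counts)
    size <- mu^2/(var(counts) - mu)   # negative binomial size, by moments
    support <- 0:max(counts)
    observed <- tabulate(counts + 1L,
        nbins=length(support))/length(counts)
    error <- vapply(alphas, function(a) {
        model <- (1 - a)*dpois(support, lambda=mu) +
            a*dnbinom(support, size=size, mu=mu)
        sum(abs(observed - model))
    }, numeric(1))
    alphas[which.min(error)]          # ties resolve to the smallest alpha
}
```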
In the case of ties, the minimum $\alpha_{i,j}$ is chosen.

In step 3, $\alpha(c)$ is defined by a linear fit, $\alpha_i=m_{\alpha}*log2(\mu_i)+b_{\alpha}$. $\sigma(c)$ is similarly defined, $log2(\sigma_i) = m_{\sigma}*log2(\mu_i)+b_{\sigma}$. In this way, the observed distribution of counts in spike-in transcripts defines the single parameter mixture model, $Z(c)$, used to transform counts during generation of simulated technical replicates:
\begin{equation}
Z(c) \sim (1-\alpha(c))*Pois(c)+\alpha(c)*NBin(c,\sigma(c))
\label{eq:noisefunction}\end{equation}
During technical replicate simulation, the parameter $c$ is set to the observed count value, $a$, and the transformed count in the simulated replicate is determined by sampling a single value from $Z(c=a)$.

### Defining the drop-out models

A model of the drop-outs is developed by BEARscc in order to inform the permutation of zeros during noise injection. The observed zeros in spike-in transcripts as a function of actual transcript concentration and Bayes' theorem are used to define two models: the \emph{drop-out injection distribution} and the \emph{drop-out recovery distribution}.

The drop-out injection distribution is described by $Prob(X=0 | Y=y)$, where $X$ is the distribution of observed counts and $Y$ is the distribution of actual transcript counts; the density is computed by regressing the fraction of zeros observed in each sample, $D_i$, for a given spike-in, $i$, onto the expected number of spike-in molecules in the sample, $y_i$, e.g. $D=a*y+b$. Then, $D$ describes the density of zero-observations conditioned on actual transcript number, $y$, or $Prob(X=0 | Y=y)$. Notably, each gene is treated with an identical density distribution for drop-out injection.

In contrast, the density of the drop-out recovery distribution, $Prob(Y_j=y | X_j=0)$, is specific to each gene, $j$, where $X_j$ is the distribution of the observed counts and $Y_j$ is the distribution of actual transcript counts for a given gene. The gene-specific drop-out recovery distribution is inferred from the drop-out injection distribution using Bayes' theorem and a prior. This is accomplished in 3 steps:

\begin{enumerate}
\item For the purpose of applying Bayes' theorem, the gene-specific distribution, $Prob(X_j=0 | Y_j=y)$, is taken to be the drop-out injection density for all genes, $j$.
\item The probability that a specific transcript count is present in the sample, $Prob(Y_j=y)$, is a necessary, but empirically unknowable prior. Therefore, the prior is defined using the law of total probability, an assumption of uniformity, and the probability that a zero was observed in a given gene, $Prob(X_j=0)$. The probability, $Prob(X_j=0)$, is taken to be the fraction of observations that are zero for a given gene. BEARscc does this in order to better inform the density estimation of the gene-specific drop-out recovery distribution.
\item The drop-out recovery distribution density is then computed by applying Bayes' theorem:
\begin{equation}
Prob(Y_j=y | X_j=0)=\frac{Prob(X_j=0 | Y_j=y)*Prob(Y_j=y)}{Prob(X_j=0)}.
\label{eq:bayesianeq}\end{equation}
\end{enumerate}

In the second step, the law of total probability, an assumption of uniformity, and the fraction of zero observations in a given gene are leveraged to define the prior, $Prob(Y_j=y)$. First, a threshold on the expected number of transcripts, $k$ in $Y$, is chosen such that $k$ is the maximum value for which the drop-out injection density is non-zero.
Next, uniformity is assumed for all expected transcript numbers, $y$, greater than zero and less than or equal to $k$; that is, $Prob(Y_j=y)$ is defined to be some constant probability, $n$. Furthermore, $Prob(Y_j=y)$ is defined to be 0 for all $y>k$. In order to inform $Prob(Y_j=y)$ empirically, $Prob(Y_j=0)$ and $n$ are derived by imposing the law of total probability (Equation \ref{eq:totalprob}) and unity (Equation \ref{eq:unity}), yielding a system of equations:

\begin{equation}
Prob(X_j=0)=\Sigma_{y=0}^{k-1}{ (Prob(X_j=0 | Y_j=y)*Prob(Y_j=y))}
\label{eq:totalprob}\end{equation}

\begin{equation}
\Sigma_{y=0}^{k-1}{ Prob(Y_j=y)}=Prob(Y_j=0)+(k-1)*{n}=1
\label{eq:unity}\end{equation}

The probability that a zero is observed given there are no transcripts in the sample, $Prob(X_j=0 | Y_j=0)$, is assumed to be 1. With the preceding assumption, solving for $Prob(Y_j=0)$ and $n$ gives:

\begin{equation}
n=\frac{1-Prob(Y_j=0)}{k-1}
\label{eq:nsolved}\end{equation}

\begin{equation}
Prob(Y_j=0)=\frac{Prob(X_j=0)-\frac{1}{k-1}* \Sigma_{y=1}^{k-1}( Prob(X_j=0 | Y_j=y)) }{(1-\frac{1}{k-1}* \Sigma_{y=1}^{k-1} (Prob(X_j=0 | Y_j=y)) }
\label{eq:py0solved}\end{equation}

In this way, the prior $Prob(Y_j=y)$ is defined by Equation \ref{eq:nsolved} for $y$ in $Y_j$ less than or equal to $k$ and greater than zero, and by Equation \ref{eq:py0solved} for $y$ in $Y_j$ equal to zero. For $y$ in $Y_j$ greater than $k$, the prior $Prob(Y_j=y)$ is defined to be zero.

In the third step, the previously computed prior, $Prob(Y_j=y)$, the fraction of zero observations in a given gene, $Prob(X_j=0)$, and the drop-out injection distribution, $Prob(X_j=0 | Y_j=y)$, are utilized to estimate, with Bayes' theorem, the density of the drop-out recovery distribution, $Prob(Y_j=y | X_j=0)$. During the generation of simulated technical replicates, for zero observations and count observations less than or equal to $k$, values are sampled from the drop-out recovery and injection distributions, as described in the pseudocode of the BEARscc algorithm for simulating technical replicates.

## Simulating technical replicates

Simulated technical replicates are generated from the noise mixture model and the two drop-out models. For each gene, the count value of each sample is systematically transformed using the mixture model, $Z(c)$, and the drop-out injection, $Prob(X=0 | Y=y)$, and recovery, $Prob(Y_j=y | X_j=0)$, distributions, in order to generate simulated technical replicates as indicated by the following pseudocode:

```
FOR EACH gene, j
    FOR EACH count, c
        IF c == 0
            n <- SAMPLE one count, y, from Prob(Y_j=y | X_j=0)
            IF n == 0
                c <- 0
            ELSE
                c <- SAMPLE one count from Z(n)
            ENDIF
        ELSE
            IF c <= k
                dropout <- TRUE with probability Prob(X=0 | Y=c)
                IF dropout == TRUE
                    c <- 0
                ELSE
                    c <- SAMPLE one count from Z(c)
                ENDIF
            ELSE
                c <- SAMPLE one count from Z(c)
            ENDIF
        ENDIF
        RETURN c
    DONE
DONE
```

\newpage

# List of functions

## `estimate_noiseparameters()`

### Description

Estimates the drop-out model and technical variance from the spike-ins present in the sample. For greater detail, please see the help file `?estimate_noiseparameters()`.

### Usage

```R
data(BEARscc_examples)
BEAR_examples.sce <- estimate_noiseparameters(BEAR_examples.sce,
    alpha_resolution=0.25, write.noise.model=FALSE)
BEAR_examples.sce
```

### Output

The resulting output of `estimate_noiseparameters()` is a long list, which is enumerated in the function's package help page.
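When the input is a `SingleCellExperiment`, the fitted components are stored in the object's `metadata()`, as noted in Section 2.3; a quick, generic way to see what was added:

```R
## Inspect the noise model components attached by estimate_noiseparameters()
names(metadata(BEAR_examples.sce))
```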
### Note

The above usage is for execution of `simulate_replicates()` on a local machine. To save results as files for use with `prepare_probabilities` on a high performance computing environment, use:

```R
estimate_noiseparameters(BEAR_examples.sce,
    write.noise.model=TRUE, alpha_resolution=0.25,
    file="noise_estimation", model_view=c("Observed","Optimized"))
```

## `simulate_replicates()`

### Description

Computes BEARscc simulated technical replicates from the noise parameters previously estimated with the function `estimate_noiseparameters()`. For greater detail, please see the help file `?simulate_replicates()`.

### Usage

```R
data(analysis_examples)
BEAR_examples.sce <- simulate_replicates(BEAR_examples.sce, n=3)
BEAR_examples.sce
```

### Output

The resulting object is a list of counts data, where each element of the list is a `data.frame` of the counts representing a BEARscc simulated technical replicate. For further details, refer to the function help page.

### Note

This function is the in-package analog of the high performance computing function, `prepare_probabilities`.

## `HPC_simulate_replicates()`

### Description

The high performance computing analog of `simulate_replicates()`.

### Usage

Please refer to Section 2.4.

### Output

The resulting objects would normally be output to tab-delimited files, where each file results from a `data.frame` of the counts representing a BEARscc simulated technical replicate.

### Note

This function has no help file, but it is described in Section 2.4 of this document on simulating replicates for larger datasets.

## `compute_consensus()`

### Description

Computes the consensus matrix using a `data.frame` of cluster labels across different BEARscc simulated technical replicates. The consensus matrix is a visual and quantitative representation of the clustering variation on a cell-by-cell level, created by using the cluster labels to compute the number of times any given pair of cells associates in the same cluster; this forms the 'noise consensus matrix'. Each element of this matrix represents the fraction of simulated technical replicates in which two cells cluster together (the 'association frequency'), after using a clustering method of the user's choice to generate a `data.frame` of clustering labels. This consensus matrix may be used to compute BEARscc metrics at both the cluster and cell level. For greater detail, please see the help file `?compute_consensus()`.

### Usage

```R
data("analysis_examples")
noise_consensus <- compute_consensus(clusters.df)
noise_consensus
```

### Output

When the number of samples is \emph{n}, the noise consensus resulting from this function is an \emph{n} x \emph{n} matrix describing the fraction of simulated technical replicates in which each cell of the experiment associates with every other cell.

## `cluster_consensus()`

### Description

This function performs hierarchical clustering on the noise consensus matrix, allowing the user to investigate the appropriate number of clusters, \emph{k}, considering the noise within the experiment. Frequently one will want to assess multiple possible cluster numbers at once; in this case it is recommended that one use `lapply` in conjunction with a vector of all biologically reasonable cluster numbers to identify the optimal cluster number. For greater detail, please see the help file `?cluster_consensus()`.
### Usage

```R
data(analysis_examples)
vector <- seq(from=2, to=5, by=1)
BEARscc_clusts.df <- cluster_consensus(consensus_matrix=noise_consensus,
    vector)
BEARscc_clusts.df
```

### Output

The output is a vector of cluster labels based on hierarchical clustering of the noise consensus. In the event that a vector of cluster numbers is supplied in conjunction with `lapply`, the output is a `data.frame` of the cluster labels for each of the various cluster numbers deemed biologically reasonable by the user.

## `report_cell_metrics()`

### Description

To quantitatively evaluate the results, three metrics are calculated from the noise consensus matrix: 'stability' is the average frequency with which a cell within a cluster associates with other cells within the same cluster across simulated replicates; 'promiscuity' measures the average association frequency of a cell within a cluster with the \emph{n} cells outside of the cluster with the strongest association with the cell in question; and 'score' is the difference between 'stability' and 'promiscuity'. Importantly, 'score' reflects the overall "robustness" of a given cell's assignment to a user-provided cluster label with respect to technical variance. Together these metrics provide a quantitative measure of the extent to which cluster labels provided by the user are invariant across simulated technical replicates. For greater detail, please see the help file `?report_cell_metrics()`.

### Usage

```R
data(analysis_examples)
cell_scores.dt <- report_cell_metrics(BEARscc_clusts.df, noise_consensus)
cell_scores.dt
```

### Output

A melted `data.frame` describing the BEARscc metrics for each cell, where the columns are enumerated in the help file.

## `report_cluster_metrics()`

### Description

To quantitatively evaluate the results, three metrics are calculated from the noise consensus matrix: 'stability' is the average frequency with which cells within a cluster associate with each other across simulated replicates; 'promiscuity' measures the association frequency between cells within a cluster and those outside of it; and 'score' is the difference between 'stability' and 'promiscuity'. Importantly, 'score' reflects the overall "robustness" of a cluster to technical variance. Together these metrics provide a quantitative measure of the extent to which cluster labels provided by the user are invariant across simulated technical replicates. For greater detail, please see the help file `?report_cluster_metrics()`.

### Usage

```R
data(analysis_examples)
cluster_scores.dt <- report_cluster_metrics(BEARscc_clusts.df,
    noise_consensus, plot=TRUE, file="example")
cluster_scores.dt
```

### Output

A melted `data.frame` describing the BEARscc metrics for each cluster, where the columns are enumerated in the help file.

\newpage

# Example data

Within the package there are data subsampled from a single cell sequencing protocol applied to water samples containing ERCC spike-ins (blanks) and dilute RNA from whole brain tissue (brain), discussed at length in a manuscript on [bioRxiv](http://biorxiv.org/content/early/2017/03/21/118919).

## `BEARscc_examples`

### Description

A toy dataset for applying BEARscc functions, as described in the README at https://bitbucket.org/bsblabludwig/bearscc.git. These data are a subset of observations made by Drs. Michael White and Richard Owen in the Xin Lu Lab. Samples were sequenced by the Wellcome Trust Center for Genomics, Oxford, UK. These data are available in full with GEO accession number GSE95155.
For greater detail, please see the help file `?BEARscc_examples`.

### Usage

```R
data("BEARscc_examples")
```

## `analysis_examples`

### Description

BEARscc downstream example objects: the `analysis_examples` RData object contains downstream data objects for use in various help pages for dynamic execution, resulting from running the tutorial in the README and this vignette on `BEARscc_examples`. The objects are a result of applying BEARscc functions as described in the README found at https://bitbucket.org/bsblabludwig/bearscc.git. For greater detail, please see the help file `?analysis_examples`.

### Usage

```R
data("analysis_examples")
```

\newpage

# License

This software is made available under the terms of the [GNU General Public License v3](http://www.gnu.org/licenses/gpl-3.0.html).

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.