---
title: "clusterExperiment Vignette"
author: "Elizabeth Purdom and Davide Risso"
date: "`r Sys.Date()`"
bibliography: bibFile.bib
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteEncoding{UTF-8}
---

```{r GlobalOptions, results="hide", include=FALSE, cache=FALSE}
knitr::opts_chunk$set(fig.align="center", cache=FALSE, cache.path = "clusterExperimentTutorial_cache/", fig.path="clusterExperimentTutorial_figure/",error=FALSE, #make it stop on error
fig.width=6,fig.height=6,autodep=TRUE,out.width="600px",out.height="600px", message=FALSE)
#knitr::opts_knit$set(stop_on_error = 2L) #really make it stop
#knitr::dep_auto()
```

# Introduction {#Intro}

The goal of this package is to encourage the user to try many different clustering algorithms in one package structure, and we provide strategies for creating a unified clustering from these many clustering results. We give tools for running many different clusterings and choices of parameters. We also provide visualization to compare many different clusterings and algorithmic tools to find common shared clustering patterns. We implement common post-processing steps unrelated to the specific clustering algorithm (e.g. subsampling the data for stability, finding cluster-specific markers via differential expression, etc).

The other main goal of this package is to implement strategies that we have developed in the RSEC algorithm (Resampling-based Sequential Ensemble Clustering) for finding a single robust clustering based on the many clusterings that the user might create by perturbing various parameters of a clustering algorithm. There are several steps to these strategies that we call our standard clustering workflow. The `RSEC` function is our preferred realization of this workflow; it depends on subsampling and other ensemble methods to provide robust clusterings, particularly for single-cell sequencing experiments and other large mRNA-Seq experiments.

We also provide a class `ClusterExperiment` that inherits from `SummarizedExperiment` to store the many clusterings and related information, and a class `ClusterFunction` that encodes a clustering routine so that users can create customized clustering routines in a standardized way that can interact with our clustering workflow algorithms.

All of our methods also have a barebones version that allows input of matrices and greater control. This comes at the expense of the user having to manage and keep track of the clusters, input data, transformation of the data, etc. We do not discuss these barebones versions in this tutorial. Instead, we focus on using the `SummarizedExperiment` object as the input and working with the resulting `ClusterExperiment` object. See the help pages of each method for more on how to allow for matrix input.

Although this package was developed with (single-cell) RNA-seq data in mind, its use is not limited to RNA-seq or even to gene expression data. Any dataset characterized by high dimensionality could benefit from the methods implemented here.

## The RSEC clustering workflow

The package encodes many common practices that are shared across clustering algorithms, like subsampling the data, computing silhouette width, sequential clustering procedures, and so forth. It also provides novel strategies that we developed as part of the RSEC algorithm (Resampling-based Sequential Ensemble Clustering).

As mentioned above, RSEC is a specific algorithm for creating a clustering that follows these basic steps:

* Implement many different clusterings using different choices of parameters using the function `clusterMany`. This results in a large collection of clusterings, where each clustering is based on different parameters.
* Find a unifying clustering across these many clusterings using the `combineMany` function.
* Determine whether some clusters should be merged together into larger clusters. This involves two steps:
    - Find a hierarchical clustering of the clusters found by `combineMany` using `makeDendrogram`
    - Merge together clusters of this hierarchy based on the percentage of differential expression, using `mergeClusters`.

The basic premise of RSEC is to find small, robust clusters of samples, and then merge them into larger clusters as relevant. We find that many algorithmic methods for choosing the appropriate number of clusters err on the side of too few clusters. In practice, we prefer to err on the side of finding many clusters and then merging them based on examining the data.

The `RSEC` function is a wrapper around these steps that makes many specific choices in this basic workflow. However, many steps of this workflow are useful for users separately, and for this reason the `clusterExperiment` package generalizes this workflow so that the user can follow it with their own choices. We call this generalization the clustering workflow, as opposed to the specific choices set in `RSEC`.

Users can also run or add their own clusters to this workflow at different stages. Additional functionality for creating a single clustering is also available in the `clusterSingle` function, and the user should see the documentation in the help page of that function.

# Quickstart {#quickstart}

In this section we give a quick introduction to the package and the `RSEC` wrapper which creates the clustering. We will also demonstrate how to find features (biomarkers) that go along with the clusters.

## The Data {#data}

We will make use of a single cell RNA sequencing experiment made available in the `scRNAseq` package.

```{r,cache=FALSE}
set.seed(14456) ## for reproducibility, just in case
library(scRNAseq)
data("fluidigm")
```

We will use the `fluidigm` dataset (see `help("fluidigm")`). This dataset is stored as a SummarizedExperiment object. We can access the data with `assay` and metadata on the samples with `colData`.

```{r}
assay(fluidigm)[1:5,1:10]
colData(fluidigm)[,1:5]
NCOL(fluidigm) #number of samples
```

## Filtering and normalization {#step0}

While there are `r NCOL(fluidigm)` samples, there are only 65 cells, because each cell is sequenced twice at different sequencing depth. We will limit the analysis to the samples corresponding to high sequencing depth.

```{r filter_high}
se <- fluidigm[,colData(fluidigm)[,"Coverage_Type"]=="High"]
```

We also filter out lowly expressed genes: we retain only those genes with at least 10 reads in at least 10 cells.

```{r filter}
wh_zero <- which(rowSums(assay(se))==0)
pass_filter <- apply(assay(se), 1, function(x) length(x[x >= 10]) >= 10)
se <- se[pass_filter,]
dim(se)
```

This removed `r sum(!pass_filter)` genes out of `r NROW(fluidigm)`. We now have `r NROW(se)` genes (or features) remaining. Notice that it is important to remove genes with zero counts in all samples (we had `r length(wh_zero)` genes which were zero in all samples here). Otherwise, PCA dimensionality reductions and other implementations may have a problem.

Normalization is an important step in any RNA-seq data analysis and many different normalization methods have been proposed in the literature. Comparing normalization methods or finding the best performing normalization in this dataset is outside of the scope of this vignette. Instead, we will use a simple quantile normalization that will at least make our clustering reflect the biology rather than the difference in sequencing depth among the different samples.

```{r normalization}
fq <- round(limma::normalizeQuantiles(assay(se)))
assays(se) <- list(normalized_counts=fq)
```

As one last step, we are going to change the name of the columns "Cluster1" and "Cluster2" that come in the dataset and refer to published clustering results from the paper; we will use the terms "Published1" and "Published2" to better distinguish them in later plots from other clusterings we will do.

```{r colnames}
wh<-which(colnames(colData(se)) %in% c("Cluster1","Cluster2"))
colnames(colData(se))[wh]<-c("Published1","Published2")
```

## Clustering with RSEC

We will now run `RSEC` to find clusters of the cells using the default settings. We set `isCount=TRUE` to indicate that the data in `se` is count data, so that the log-transform and other count methods should be applied. We also choose the number of cores on which we want to run the operation in parallel via the parameter `ncores`. This is a relatively small number of samples, compared to most single-cell sequencing experiments, so we choose to cluster on the top 10 principal components of the data by setting `reduceMethod="PCA"` and `nReducedDims=10` (the default is 50). Finally, we set the minimum cluster size in our ensemble clustering to be 3 cells (`combineMinSize=3`). We do this not for biological reasons, but for instructional purposes to allow us to get a larger number of clusters.

Because this procedure is slightly computationally intensive, depending on the user's machine, we have set this code to not run so that the vignette will compile quickly upon installation. However, it doesn't take very long (roughly 1-2 minutes) so we recommend users try it themselves by running the following code (or setting `eval=TRUE` in the .Rmd file) with the `ncores` option appropriately chosen for their machine.

```{r RSEC, eval=FALSE}
library(clusterExperiment)
system.time(rsecFluidigm<-RSEC(se, isCount = TRUE, reduceMethod="PCA", nReducedDims=10,combineMinSize=3, ncores=1, random.seed=176201))
```

Instead we have saved the results from this call as a data object in the package and will use the following code to load it into the vignette:

```{r RSECLoad}
#don't call this routine if you ran the above code.
#it will overwrite the rsecFluidigm you made
library(clusterExperiment)
data("rsecFluidigm")
```

### The output

We can look at the object that was created.

```{r RSECprint}
rsecFluidigm
```

The print out tells us about the clustering(s) that were created (namely `r nClusterings(rsecFluidigm)` clusterings) and which steps of the workflow have been called (all of them have, because we used the wrapper `RSEC` that does the whole thing). Recall from our brief description above that RSEC clusters the data many times using different parameters before finding a consensus clustering. All of these intermediate clusterings are saved. Each of these intermediate clusterings used subsampling of the data and sequential clustering of the data to find the clustering, while the different clusterings represent the different parameters that were adjusted.

We can see that `rsecFluidigm` has a built in (S4) class called a `ClusterExperiment` object. This is a class built for this package and explained in the section on [ClusterExperiment Objects](#ceobjects). In the object `rsecFluidigm` the clusterings are stored along with corresponding information for each clustering. Furthermore, all of the information in the original `SummarizedExperiment` is retained.

The print out also tells us information about the "primaryCluster" of `rsecFluidigm`. Each `ClusterExperiment` object has a "primaryCluster", which is the default clustering that many functions will use unless otherwise specified by the user. We are told that the "primaryCluster" for `rsecFluidigm` has the label "mergeClusters" -- which is the default label given to the last clustering of the `RSEC` function because the last call of the `RSEC` function is to `mergeClusters`.

There are many accessor functions that help you get at the information in a `ClusterExperiment` object and some of the most relevant are described in the section on [ClusterExperiment Objects](#ceobjects). (`ClusterExperiment` objects are S4 objects, and are not lists).

For right now we will only mention the most basic such function that retrieves the actual cluster assignments.
The final clustering created by `RSEC` is saved as the `primary` clustering of the object.

```{r primaryCluster}
head(primaryCluster(rsecFluidigm),20)
```

The clusters are encoded by consecutive integers. Notice that some of the samples are assigned the value of `-1`. `-1` is the value assigned in this package for samples that are not assigned to any cluster. Why certain samples are not clustered depends on the underlying choices of the clustering routine, and we won't get into it here until we learn a bit more about RSEC. Another special value is `-2`, discussed in the section on [ClusterExperiment objects](#ceobjects).

This final result of RSEC is the result of running many clusterings and finding the ensemble consensus between them. All of these intermediate clusterings are saved in the `rsecFluidigm` object. They can be accessed by the `clusterMatrix` function, which returns a matrix where the columns are the different clusterings and the rows are samples. We show a subset of this matrix here:

```{r clusterMatrix}
head(clusterMatrix(rsecFluidigm)[,1:4])
```

The "mergeClusters" clustering is the final clustering from RSEC and matches the primary clustering that we saw above. The "combineMany" clustering is the result of the initial ensemble consensus among all of the many clusterings that were run, while "mergeClusters" is the result of merging together smaller clusters that did not show enough signs of differences between them. The remaining clusterings are the result of changing the parameters, and a couple of such clusterings are shown in the above printout of the cluster matrix.

The column names are the `clusterLabels` for each clustering and can be accessed (and assigned new values!) via the `clusterLabels` function.

```{r clusterLabels}
head(clusterLabels(rsecFluidigm))
```

We can see the names of more clusterings, and see that the different parameter values tried in each clustering are saved in the names of the clustering.
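
As a small illustration of the `-1` convention, the following base R sketch summarizes a toy matrix of cluster assignments. The matrix `clMat` and its values are hypothetical and made up for this example; it only mimics the shape of what `clusterMatrix` returns (samples in rows, clusterings in columns) and does not use any package function.

```r
# Toy cluster assignments for 8 samples under two clusterings (hypothetical values).
# -1 marks samples that were left unassigned, following the convention above.
clMat <- cbind(mergeClusters = c(1, 1, -1, 2, 2, 2, -1, 3),
               combineMany   = c(1, 1, -1, 2, 2, 4, -1, 3))

# Number of unassigned samples in each clustering
nUnassigned <- colSums(clMat == -1)

# Cluster sizes in each clustering, ignoring the unassigned samples
sizes <- lapply(as.data.frame(clMat), function(x) table(x[x > 0]))

nUnassigned
sizes
```

The same kind of summary could presumably be applied to the matrix returned by `clusterMatrix(rsecFluidigm)` to see how many samples each clustering leaves unassigned.
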
We can also see the different parameter combinations that went into the consensus clustering by using `getClusterManyParams` (here only `r ncol(getClusterManyParams(rsecFluidigm))-1` different parameters).

```{r clusterParams}
head(getClusterManyParams(rsecFluidigm))
```

### Visualizing the output

`clusterExperiment` also provides many ways to visualize the output of RSEC (or any set of clusterings run in `clusterExperiment`, as we'll show below).

**Visualizing many clusterings**

The first such useful visualization is a plot of all of the clusterings together using the `plotClusters` command. For this visualization, it is useful to change the amount of space on the left of the plot to allow for the labels of the clusterings, so we will reset the `mar` option in `par`. We also decrease the `axisLine` argument that decides the amount of space between the axis and the labels to give more space to the labels (`axisLine` is passed internally to the `line` option in `axis`).

```{r plotClusterRSEC}
defaultMar<-par("mar")
plotCMar<-c(1.1,8.1,4.1,1.1)
par(mar=plotCMar)
plotClusters(rsecFluidigm,main="Clusters from RSEC", whichClusters="workflow", sampleData=c("Biological_Condition","Published2"), axisLine=-1)
```

This plot shows the samples in the columns, and different clusterings on the rows. Each sample is color coded based on its clustering for that row, where the colors have been chosen to try to match up clusters across different clusterings that show large overlap. Moreover, the samples have been ordered so that each subsequent clustering (starting at the top and going down) will try to order the samples to keep the clusters together, without rearranging the clustering blocks of the previous clustering/row.

We also added a `sampleData` argument in our call, indicating that we also want to visualize some information about the samples saved in the `colData` slot (inherited from our original `fluidigm` object).
We chose the columns "Biological_Condition" and "Published2" from `colData`, which correspond to the original biological condition of the experiment, and the clusters reported in the original paper, respectively. The data from `sampleData` (when requested) are always shown at the bottom of the plot.

Notice that some samples are white. This indicates that they have the value -1, meaning they were not clustered. In fact, for many clusterings, there is a large amount of white here. This is likely due to the fact that there are only `r ncol(rsecFluidigm)` cells here, and the default parameters of RSEC are better suited for a large number of cells, such as seen in more modern single-cell sequencing experiments. The sequential clustering may be problematic for small numbers of cells.

We can use an alternative version of `plotClusters` called `plotClustersWorkflow` that will better emphasize the more final clusterings from the ensemble/consensus step and merging steps (it currently does not allow for showing the `sampleData` as well, however -- only clustering results).

```{r plotClustersWorkflow}
par(mar=plotCMar)
plotClustersWorkflow(rsecFluidigm)
```

**Barplots**

We can examine the size distribution of a single clustering with the function `plotBarplot`. By default, the clustering picked will be the primary clustering (for the result of `clusterMany` alone this is, rather arbitrarily, just the first clustering; here it is the final "mergeClusters" clustering).

```{r plotBarplotRSEC}
plotBarplot(rsecFluidigm,main="Final Clustering")
```

We can also pick a particular intermediate clustering, say our initial ensemble clustering before merging.

```{r plotBarplotRSEC.2}
plotBarplot(rsecFluidigm,whichClusters=c("combineMany" ))
```

We can also compare two specific clusterings with a simple barplot using `plotBarplot`. Here we compare the "combineMany" and the "mergeClusters" clusterings.

```{r plotBarplot2}
plotBarplot(rsecFluidigm,whichClusters=c("mergeClusters" ,"combineMany"))
```

Since "combineMany" is a finer partition of "mergeClusters", each "combineMany" cluster is nested perfectly within a single cluster of "mergeClusters".

**Co-Clustering**

We can also visualize the proportion of times samples were together in the individual clusterings (i.e. before the consensus clustering):

```{r plotCoClustering_rsec}
plotCoClustering(rsecFluidigm,whichClusters=c("mergeClusters","combineMany"))
```

Note that this is not the result from any particular subsampling (which was done repeatedly for each clustering), but the proportion of times across the clusterings we ran. The initial consensus clustering in `combineMany` was made based on these proportions and a particular cutoff of the required proportion of times the samples needed to be together.

**Plot of Dendrogram**

We can visualize how the initial ensemble clustering in `combineMany` was clustered into a hierarchy and merged to give us the final clustering in `mergeClusters`:

```{r plotDendroRSEC}
plotDendrogram(rsecFluidigm,whichClusters=c("combineMany","mergeClusters"),plotType="colorblock",leafType="sample")
```

As shown in this plot, the individual clusters of the `combineMany` ensemble clustering were hierarchically clustered (hence the note that the dendrogram was made from the `combineMany` clustering), and similar sister clusters were merged if there were not enough gene differences between them.

**2D plot of clusters**

Finally, we can plot a 2-dimensional representation of the data with PCA and color code the samples to get a sense of how the data cluster.

```{r plotReducedDims}
plotReducedDims(rsecFluidigm)
```

We can also look at a higher number of dimensions (or different dimensions) by changing the parameter `whichDims`.

```{r plotReducedDimsMany}
plotReducedDims(rsecFluidigm,whichDims=c(1:4))
```

In this case we can see that higher dimensions show us a greater amount of separation between the clusters than in just 2 dimensions.

## Rerunning RSEC with different parameters

In the next section, we will describe more about the options we could adjust in `RSEC`. As an example of a few options, we might want to change the proportion of co-clustering we required in making our `combineMany` clustering (which used the default of 0.7), or change the proportion of genes that must show differences in order to not merge clusters, or the method of deciding.

We can call `RSEC` again on our object `rsecFluidigm` and it will not redo the many individual clustering steps, which are time intensive (unless we request it to rerun them by including the argument `rerunClusterMany=TRUE`). We demonstrate this in our next command, where we change these choices in the following ways:

* set the proportion of co-clustering required by the argument `combineProportion=0.6`,
* make the merge cutoff `mergeCutoff=0.05`,
* use a different method of estimating the proportion differential for merging by setting `mergeMethod="JC"` instead of the default ("adjP"),
* no longer adjust the minimum cluster size and use the default (`combineMinSize=5`).

These are the main parameters we might frequently want to tweak in `RSEC`.

```{r recallRSEC}
rsecFluidigm<-RSEC(rsecFluidigm,isCount=TRUE,combineProportion=0.6,mergeMethod="JC",mergeCutoff=0.05)
```

Notice that we save the output over our original object. This is the standard way to work with the `ClusterExperiment` objects, since the package's commands just continue to add clusterings, without deleting anything from before. In this way, we do not duplicate the actual data in our workspace (which is often large).

We can compare the results of our changes with the `whichClusters` argument to explicitly choose the clusterings we want to plot:

```{r plotClusterRSECRecall}
defaultMar<-par("mar")
plotCMar<-c(1.1,8.1,4.1,1.1)
par(mar=plotCMar)
plotClusters(rsecFluidigm,main="Clusters from RSEC", whichClusters=c("mergeClusters.1","combineMany.1","mergeClusters","combineMany"), sampleData=c("Biological_Condition","Published2"), axisLine=-1)
```

The clusterings with the `.1` appended to the labels are the previous `combineMany` and `mergeClusters` clusterings from the default setting (see [Rerunning](#rerun) to see how different versions are labeled and stored internally). We can see that we lost several clusters with these options.

In what follows, we'll go back to the original (default) RSEC settings by rerunning RSEC (the original clusters are saved in the `rsecFluidigm` object, but there is useful information about the merging that is overwritten by our latest call so we will just rerun it to recreate the clustering):

```{r plotClusterRSECReset}
rsecFluidigm<-RSEC(rsecFluidigm,combineMinSize=3)
```

In practice, it can be useful to interactively make choices about these parameters by rerunning each of the individual steps of the workflow separately and visualizing the changes before moving to the next step, as we do below during our [overview](#workover) of the steps.

## Finding Features related to the clusters {#step4}

A common practice after determining a set of clusters is to perform differential gene expression analysis in order to find genes that show the greatest differences amongst the clusters. We would stress that this is purely an exploratory technique, and any p-values that result from this analysis are not valid, in the sense that they are likely to be inflated. This is because the same data was used to define the clusters and to perform differential expression analysis.
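
The following base R sketch illustrates why such p-values are inflated. It is a toy simulation, not part of the package workflow: the data are pure noise, and `kmeans` and `t.test` stand in for the clustering and differential expression steps purely as an assumption for illustration.

```r
set.seed(1)

# Pure noise: 200 samples and 5 features, with no true group structure
x <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)

# Cluster the noise into two groups
cl <- kmeans(x, centers = 2, nstart = 5)$cluster

# Test each feature for a difference between the two "clusters",
# reusing the same data that defined the clusters
pvals <- apply(x, 2, function(feature) t.test(feature ~ cl)$p.value)
pvals
```

Even though no real groups exist, the clustering splits the samples along directions in which they happen to differ, so at least some features appear strongly "significant" when tested on the same data.
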

Since this is a common task, we provide the function `getBestFeatures` to perform various kinds of differential expression analysis between the clusters. A common F-statistic between groups can be chosen. However, we find that it is far more informative to do pairwise comparisons between clusters, or one cluster against all, in order to find genes that are specific to a particular cluster. An option for all of these choices is provided in the `getBestFeatures` function.

The `getBestFeatures` function uses the DE analysis provided by the `limma` package [@Smyth:2004gh; @Ritchie:2015fa]. In addition, the `getBestFeatures` function provides an option to use the "voom" correction in the `limma` package [@Law:2014ff] to account for the mean-variance relationship that is common in count data. The tests performed by `getBestFeatures` are specific contrasts between clustering groups; these contrasts can be retrieved without performing the tests using `clusterContrasts`, including in a format appropriate for the `MAST` algorithm.

As mentioned above, there are several types of tests that can be performed to identify features that are different between the clusters, which we describe in the section entitled [Finding Features related to a Clustering](#getBestFeatures). Here we simply perform all pairwise tests between the clusters.

```{r getBestFeatures}
pairsAllRSEC<-getBestFeatures(rsecFluidigm,contrastType="Pairs",p.value=0.05, number=nrow(rsecFluidigm))
head(pairsAllRSEC)
```

We can visualize only these significantly different pair-wise features with `plotHeatmap` by using the column "IndexInOriginal" in the result of `getBestFeatures` to quickly identify the genes to be used in the heatmap.
Notice that the same genes can be replicated across different contrasts, so we will not always have unique genes:

```{r getBestFeatures_size}
length(pairsAllRSEC$Feature)==length(unique(pairsAllRSEC$Feature))
```

In this case they are not unique because the same gene can be significant for different pairs tests. Hence, we will make sure we take only unique gene values so that they are not plotted multiple times in our heatmap. (This is a good practice even if in a particular case the genes are unique).

```{r getBestFeatures_heatmap}
plotHeatmap(rsecFluidigm, whichClusters=c("combineMany","mergeClusters"),clusterSamplesData="dendrogramValue", clusterFeaturesData=unique(pairsAllRSEC[,"IndexInOriginal"]), main="Heatmap of features w/ significant pairwise differences", breaks=.99)
```

Notice that the samples clustered into the `-1` cluster (i.e. not assigned) are clustered as an outgroup. This is a choice that is made when the dendrogram is created (described below). These samples can also be mixed into the dendrogram (see [makeDendrogram](#makeDendrogram)).

We can identify the genes corresponding to the different contrasts with the `plotContrastHeatmap` function, where the genes (rows) are organized by the contrast for which they are significant. The option `nBlankLines` controls the space between the groups of genes from each contrast. We also give the argument `whichCluster="primary"` to indicate that the contrasts were created with the primary clustering -- this means that the legend will put in the names of the clusters rather than their (internal) numeric id.

```{r plotContrastHeatmap}
plotContrastHeatmap(rsecFluidigm, signif=pairsAllRSEC,nBlankLines=40,whichCluster="primary")
```

# Overview of the clustering workflow {#workover}

We give an overview here of the steps used by the `RSEC` wrapper and more generally in the `clusterExperiment` package. The section [The clustering workflow](#workflow) goes over these steps and the possible arguments in greater detail.

The standard clustering workflow steps are the following:

* `clusterMany` -- run desired clusterings
* `combineMany` -- get a unified clustering
* `makeDendrogram` -- get a hierarchical relationship between the clusters
* `mergeClusters` -- merge together clusters with little DE between their genes.

These clustering steps are done in one function call by `RSEC`, described above, which is the most straightforward usage. However, to understand the parameters available in `RSEC` it is useful to go through the steps individually. Furthermore, `RSEC` has streamlined the workflow to concentrate on using the workflow with subsampling and sequential strategies, but going through the individual steps demonstrates how the user can make different choices.

Therefore in this section, we will go through these steps, but instead of using the parameters of `RSEC` that involve subsampling and are more computationally intensive, we will run through the same steps using just basic PAM clustering with no subsampling or sequential clustering. This is simply for the purpose of briefly understanding the intermediate steps that `RSEC` follows. Later sections will go through these steps in more detail and discuss the particular choices embedded in `RSEC`.

## Step 1: Clustering with `clusterMany` {#step1}

`clusterMany` lets the user quickly pick between many clustering options and run all of the clusterings in one single command. In the quick start we pick a simple set of clusterings based on varying the dimensionality reduction options. The way to designate which options to vary is to give multiple values to an argument.
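
To see how giving several values to several arguments quickly multiplies the number of clusterings, here is a minimal base R sketch. The option names `k` and `nDims` and their values are hypothetical and chosen only for illustration; the sketch takes a full cross of the options with `expand.grid`, which is an assumption and not necessarily how `clusterMany` enumerates combinations internally (it may, for instance, skip combinations that do not apply).

```r
# Hypothetical option values, chosen only to illustrate the idea
opts <- list(k = 5:7, nDims = c(5, 15))

# Full cross of the options: one row per parameter combination
grid <- expand.grid(opts, stringsAsFactors = FALSE)

grid
nrow(grid)
```

Each row of `grid` corresponds to one clustering that would be run, which is why adding a single extra value to one argument can add many clusterings.
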

Here is our call to `clusterMany`:

```{r clusterMany}
ce<-clusterMany(se, clusterFunction="pam",ks=5:10, minSizes=5, isCount=TRUE,reduceMethod=c("PCA","var"),nFilterDims=c(100,500,1000), nReducedDims=c(5,15,50),run=TRUE)
```

In this call to `clusterMany` we made the following choices about what to vary:

* set `reduceMethod=c("PCA", "var")`, meaning run the clustering algorithm trying two different methods for dimensionality reduction: the top principal components and filtering to the most variable genes
* For PCA reduction, choose 5, 15, and 50 principal components for the reduced data set (set `nReducedDims=c(5,15,50)`)
* For most variable genes reduction, we choose the 100, 500, and 1000 most variable genes (set `nFilterDims=c(100,500,1000)`)
* For the number of clusters, vary from $k=5,\ldots,10$ (set `ks=5:10`)

By giving only a single value to the relevant argument, we keep the other possible options fixed, for example:

* we used 'pam' for all clustering (`clusterFunction="pam"`)
* we set `minSizes=5`. This is an argument that allows the user to set a minimum required size; clusters of size less than that value will be ignored and the samples assigned to them given the unassigned value of -1. The default of `1` would mean that this option is not used.

We also set `isCount=TRUE` to indicate that our input data are counts. This means that operations for clustering and visualizations will internally transform the data as $\log_2(x+1)$ (we could have alternatively explicitly set a transformation by giving a function to the `transFun` argument, for example if we wanted $\sqrt{x}$ or $\log(x+\epsilon)$ or just `identity`).

We can visualize the resulting clusterings using the `plotClusters` command as we did for the `RSEC` results.

```{r plotClusterEx1}
defaultMar<-par("mar")
par(mar=plotCMar)
plotClusters(ce,main="Clusters from clusterMany", whichClusters="workflow", sampleData=c("Biological_Condition","Published2"), axisLine=-1)
# same clusterings, ordered by k and then by the number of dimensions
test<-getClusterManyParams(ce)
ord<-order(test[,"k"],test[,"nFilterDims"],test[,"nReducedDims"])
plotClusters(ce,main="Clusters from clusterMany", whichClusters=test$clusteringIndex[ord], sampleData=c("Biological_Condition","Published2"), axisLine=-1)
```

We can see that some clusters are fairly stable across different choices of dimensions while others can vary dramatically. Notice that again some samples are white (i.e. the value -1, meaning they were not clustered). This is from our choice to require at least 5 samples to make a cluster.

We have set `whichClusters="workflow"` to only plot clusters created from the workflow. Right now that's all there are anyway, but as commands get rerun with different options, other clusterings can build up in the object (see discussion in [this section](#rerun) about how multiple calls to workflow are stored). So setting `whichClusters="workflow"` means that we are only going to see our most recent calls to the workflow (so far we only have the 1 step, which is the `clusterMany` step).

We have already seen that `whichClusters` can be set to limit the plot to specific clusterings or specific steps in the workflow. We can also give the `whichClusters` argument indices of clusterings stored in the `ClusterExperiment` object, which allows us to show the clusterings in a different order. Here we'll pick an order which corresponds to varying the number of dimensions, rather than k.

We can find the parameters for each clustering using `getClusterManyParams`:

```{r plotCluster_params}
cmParams<-getClusterManyParams(ce)
head(cmParams)
```

`getClusterManyParams` returns the parameter values, as well as the index of the corresponding clustering (i.e. what column it is in the matrix of clusterings). It is important to note that the index will change if we add additional clusterings, as we will do later.

We will set an order with first the PCA, ordered by number of dimensions, then the Var, ordered by number of dimensions.

```{r plotCluster_newOrder}
ord<-order(cmParams[,"reduceMethod"],cmParams[,"nReducedDims"])
ind<-cmParams[ord,"clusteringIndex"]
par(mar=plotCMar)
plotClusters(ce,main="Clusters from clusterMany", whichClusters=ind, sampleData=c("Biological_Condition","Published2"), axisLine=-1)
```

We see that the order in which the clusters are given to `plotClusters` changes the plot greatly.

The labels shown are those given automatically by `clusterMany` but can be a bit much for plotting. We choose to remove "Features" as being too wordy:

```{r plotCluster_newLabels}
clOrig<-clusterLabels(ce)
clOrig<-gsub("Features","",clOrig)
plotClusters(ce,main="Clusters from clusterMany", whichClusters=ind, clusterLabels=clOrig[ind], sampleData=c("Biological_Condition","Published2"), axisLine=-1)
```

We could also permanently assign new labels in our `ClusterExperiment` object if we prefer, for example to be more succinct, by changing the `clusterLabels` of the object.

There are many different options for how to run `plotClusters`, discussed in the detailed section on [plotClusters](#plotClusters), but for now, this plot is good enough for a quick visualization.

## Step 2: Find a consensus with `combineMany` {#step2}

To find a consensus clustering across the many different clusterings created by `clusterMany`, the function `combineMany` can be used next.

```{r}
ce<-combineMany(ce,proportion=1)
```

The `proportion` argument indicates the minimum proportion of times samples should be with other samples in the cluster they are assigned to in order to be clustered together in the final assignment. Notice we get a warning that we did not specify any clusters to combine, so it is using the default -- those from the `clusterMany` call.

If we look at the `clusterMatrix` of the returned `ce` object, we see that the new cluster from `combineMany` has been added to the existing clusterings. This is the basic strategy of all of the functions in this package. Any clustering function that is applied to an existing `ClusterExperiment` object adds the new clustering to the set of existing clusterings, so the user does not need to keep track of past clusterings and can easily compare what has changed.

We can again run `plotClusters`, which will now also show the result of `combineMany`. Instead, we'll use `plotClustersWorkflow`, which is nicer for looking specifically at the results of `combineMany`.

```{r lookAtCombineMany}
head(clusterMatrix(ce)[,1:3])
par(mar=plotCMar)
plotClustersWorkflow(ce)
```

The choice of `proportion=1` in `combineMany` is not usually a great choice, and certainly isn't helpful here. The clustering from the default `combineMany` leaves most samples unassigned (white in the above plot). This is because we require samples to be in the same cluster in *every clustering* in order to be assigned to a cluster together. This is quite stringent. We can vary this by setting the `proportion` argument to be lower. Explicit details on how `combineMany` makes these clusters are discussed in the section on [combineMany](#combineMany).

So let's label the one we found as "combineMany,1" and then create a new one. (Making or changing the label to an informative label will make it easier to keep track of this particular clustering later, particularly if we make multiple calls to the workflow).
```{r combineMany_changeLabel}
wh<-which(clusterLabels(ce)=="combineMany")
if(length(wh)!=1) stop() else clusterLabels(ce)[wh]<-"combineMany,1"
```

Now we'll rerun `combineMany` with `proportion=0.7`. This time, we will give it an informative label upfront in our call to `combineMany`.

```{r combineMany_newCombineMany}
ce<-combineMany(ce,proportion=0.7,clusterLabel="combineMany,0.7")
par(mar=plotCMar)
plotClustersWorkflow(ce)
```

We see that more clusters are detected. Those that are still not assigned a cluster from `combineMany` clearly vary across the clusterings as to whether the samples are clustered together or not. Varying the `proportion` argument will adjust whether some of the unclustered samples get added to a cluster. There is also a `minSize` parameter for `combineMany`, with the default of `minSize=5`. We could reduce that requirement as well and more of the unclustered samples would be grouped into a cluster. Here, we reduce it to `minSize=3` (we'll call this "combineMany,final"). We'll also choose to show all of the different combineMany results in our call to `plotClustersWorkflow`:

```{r combineMany_changeMinSize}
ce<-combineMany(ce,proportion=0.7,minSize=3,clusterLabel="combineMany,final")
par(mar=plotCMar)
plotClustersWorkflow(ce,whichClusters=c("combineMany,final","combineMany,0.7","combineMany,1"),main="Min. Size=3")
```

As we did before for `RSEC` results, we can also visualize the proportion of times these clusters were together across these clusterings (this information was made and stored in the `ClusterExperiment` object when we called `combineMany`, provided that the `proportion` argument is <1):

```{r plotCoClustering_quickstart}
plotCoClustering(ce)
```

This visualization can help in determining whether to change the value of `proportion` (though see [combineMany](#combineMany) for how -1 assignments affect `combineMany`).
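To make the role of `proportion` more concrete, the following base-R sketch computes, for a small made-up matrix of cluster assignments, the proportion of clusterings in which each pair of samples shares a cluster. The matrix `clMat` and the helper `coClusterProportion` are hypothetical illustrations and not part of `clusterExperiment`; the sketch also ignores details such as how `combineMany` treats -1 assignments.

```r
# Hypothetical assignments: 5 samples (rows) x 4 clusterings (columns)
clMat <- cbind(c(1, 1, 2, 2, 2),
               c(1, 1, 1, 2, 2),
               c(1, 1, 2, 2, 2),
               c(2, 2, 1, 1, 1))
rownames(clMat) <- paste0("S", 1:5)

# Proportion of clusterings in which each pair of samples is in the same cluster
coClusterProportion <- function(m) {
  n <- nrow(m)
  out <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (j in seq_len(ncol(m))) {
    # logical "same cluster" matrix for clustering j, added as 0/1
    out <- out + outer(m[, j], m[, j], "==")
  }
  out / ncol(m)
}

prop <- coClusterProportion(clMat)
print(prop)

# Pairs that pass a threshold of 1 versus a threshold of 0.7
together1  <- prop >= 1
together07 <- prop >= 0.7
```

In this toy example, sample S3 is with S4 and S5 in three of the four clusterings, so it is grouped with them only under the more lenient threshold of 0.7, which mirrors why lowering `proportion` leaves fewer samples unassigned.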
## Step 3: Merge clusters together with `makeDendrogram` and `mergeClusters` {#step3}

Once you start varying the parameters, it is not uncommon in practice to create forty or more clusterings with `clusterMany`. In that case, `combineMany` can often produce too many small clusters. We might wonder if they are necessary or could be logically combined together. We could change the value of `proportion` in our call to `combineMany`. But we have found that we best make this determination after looking at the clusters, for example with a heatmap, and at how different they look on individual genes, rather than from the proportion of times they are together in different clustering routines.

For this reason, we often find the need for an additional clustering step that merges clusters together that are not different, based on running tests of differential expression between the clusters found in `combineMany`. This is done by the function `mergeClusters`. We often display and use both sets of clusters side-by-side (that from `combineMany` and that from `mergeClusters`).

`mergeClusters` needs a hierarchical clustering of the clusters in order to merge clusters; it then goes progressively up that hierarchy, deciding whether two adjacent clusters can be merged. The function `makeDendrogram` makes such a hierarchy between clusters (by applying `hclust` to the medoids of the clusters). Because the results of `mergeClusters` are so dependent on that hierarchy, we require the user to call `makeDendrogram` rather than calling it automatically internally. This is because different options in `makeDendrogram` can affect how the clusters are hierarchically ordered, and we want to encourage the user to make these choices.

As an example, here we use the 500 most variable genes to make the cluster hierarchy (note we can make different choices here than we did in the clustering).
```{r makeDendrogram}
ce<-makeDendrogram(ce,reduceMethod="var",nDims=500)
plotDendrogram(ce)
```

We can see that clusters 1 and 3 are most closely related, at least in the top 500 most variable genes.

We can see the relative size of the clusters by setting some options in `plotDendrogram`:

```{r makeDendrogram2}
plotDendrogram(ce,plotType="colorblock",leafType="sample")
```

Notice I don't need to make the dendrogram again, because it's saved in `ce`. If we look at the summary of `ce`, it now has 'makeDendrogram' marked as 'Yes'.

```{r makeDendrogram_show}
ce
```

Now we are ready to actually merge clusters together. We run `mergeClusters`, which will go up this hierarchy and compare the level of differential expression (DE) in each pair. In other words, if we focus on the left side of the tree, DE tests are run between 1 and 3, and between 6 and 8. If there is not enough DE between each of these (based on a cutoff that can be set by the user), then clusters 1 and 3 and/or 6 and 8 will be merged. And so on up the tree.

If the dataset is not too large, it can be useful to first run `mergeClusters` without actually saving the results so as to preview what the final clustering will be (and perhaps to help in setting the cutoff).

```{r mergeClustersPlot}
mergeClusters(ce,mergeMethod="adjP",plotInfo="mergeMethod")
```

The default cutoff is `cutoff=0.1`, meaning those nodes with less than 10% of genes estimated to be differentially expressed between their two children groupings of samples are merged. This is pretty stringent, and as we see it results in no clusterings being kept. However, the plot tells us the estimate of that proportion for each node. We can decide on a better cutoff using that information.
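As a rough illustration of how the cutoff interacts with the per-node estimates, the base-R sketch below compares made-up proportions of DE genes against two cutoffs. The node names, the numbers, and the helper `belowCutoff` are hypothetical; the sketch also treats nodes independently, whereas `mergeClusters` works its way up the hierarchy.

```r
# Hypothetical per-node estimates of the proportion of DE genes
# (made-up numbers, not output from this vignette's data)
nodeProp <- c(Node1 = 0.08, Node2 = 0.005, Node3 = 0.03, Node4 = 0.002)

# Nodes whose estimated proportion falls below the cutoff are merge candidates
belowCutoff <- function(prop, cutoff) names(prop)[prop < cutoff]

at10 <- belowCutoff(nodeProp, cutoff = 0.1)   # all four nodes
at01 <- belowCutoff(nodeProp, cutoff = 0.01)  # only Node2 and Node4
print(at10)
print(at01)
```

With numbers like these, a cutoff of 0.1 would flag every node for merging, while a smaller cutoff keeps the nodes with larger estimated differences separate.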
We choose instead `cutoff=0.01`:

```{r mergeClusters}
ce<-mergeClusters(ce,mergeMethod="adjP",cutoff=0.01)
par(mar=plotCMar)
```

Notice that now the plot has given the estimates from *all* of the methods (the default set by the `plotInfo` argument), not just the `adjP` method. But the dotted lines of the dendrogram indicate the merging performed by the actual choices in merging made in the command (`mergeMethod="adjP"` and `cutoff=0.01`).

It can be interesting to visualize the clusterings both with `plotClustersWorkflow` and with the co-clustering plot (a heatmap of the proportion of times the samples co-clustered):

```{r mergeClusters_coClustering}
plotClustersWorkflow(ce,whichClusters="workflow") #, sampleData=c("Biological_Condition","Published2")
plotCoClustering(ce,whichClusters=c("mergeClusters","combineMany"),
                 sampleData=c("Biological_Condition","Published2"),annLegend=FALSE)
```

Notice that `mergeClusters` combines clusters based on the actual values of the features, while the `coClustering` plot shows how often the samples clustered together. It is not uncommon that `mergeClusters` will merge clusters that don't look "close" on the `coClustering` plot. This can be due to just the choices of the hierarchical clustering between the clusters, but can also be because the two merged clusters are not often confused for each other across the clustering algorithms, yet still don't have strong differences on individual genes. This can be the case especially when the clustering is done on reduced PCA space, where an accumulation of small differences might consistently separate the samples (so that comparatively few clusterings are "confused" as to the samples). But because the differences are not strong on individual genes, `mergeClusters` combines them. These are ultimately different criteria.

Finally, we can do a heatmap visualizing this final step of clustering.
```{r plotHeatmap}
plotHeatmap(ce,clusterSamplesData="dendrogramValue",breaks=.99,
            sampleData=c("Biological_Condition", "Published1", "Published2"))
```

By choosing "dendrogramValue" for the clustering of the samples, we will be showing the clusters according to the hierarchical ordering of the clusters found by `makeDendrogram`. The argument `breaks=0.99` means that the last color of the heatmap spectrum will be forced to be the top 1% of the data (rather than evenly spaced through the entire range of values). This can be helpful in making sure that rare extreme values in the upper range do not absorb too much space in the color spectrum. There are many more options for `plotHeatmap`, some of which are discussed in the section on [plotHeatmap](#plotHeatmap).

## RSEC

The above explanation follows the simple example of PAM. The original RSEC result came from calling the `RSEC` function, which runs these steps internally. Many of the options described above can be set through a call to `RSEC`, but some are restricted for simplicity. A detailed explanation of the differences can be found in the section [RSEC](#RSEC) below.
But briefly, the following RSEC command, which uses most of the arguments of `RSEC`:

```{r quasiRsecCode, eval=FALSE}
rsecOut<-RSEC(se, reduceMethod="PCA", nReducedDims=c(50,10), k0s=4:15,
              alphas=c(0.1,0.2,0.3),betas=c(0.8,0.9), minSizes=c(1,5),
              clusterFunction="hierarchical01",
              combineProportion=0.7, combineMinSize=5,
              dendroReduce="mad",dendroNDims=500,
              mergeMethod="adjP",mergeCutoff=0.05
)
```

would be equivalent to the following individual steps:

```{r stepCode,eval=FALSE}
ce<-clusterMany(se,ks=4:15,alphas=c(0.1,0.2,0.3),betas=c(0.8,0.9),minSizes=c(1,5),
                clusterFunction="hierarchical01", sequential=TRUE,subsample=TRUE,
                reduceMethod="PCA",nReducedDims=c(50,10))
ce<-combineMany(ce, proportion=0.7, minSize=5)
ce<-makeDendrogram(ce,reduceMethod="mad",nDims=500)
ce<-mergeClusters(ce,mergeMethod="adjP",cutoff=0.05,plot=FALSE)
```

Note that this means the `RSEC` function always uses sequential clustering and subsampling.

# ClusterExperiment Objects {#ceobjects}

The `ce` object that we created by calling `clusterMany` is a `ClusterExperiment` object. The `ClusterExperiment` class is used by this package to keep the data and the clusterings together. It inherits from `SummarizedExperiment`, which means the data and `colData` and other information originally in the `fluidigm` object are retained and can be accessed with the same functions as before. The `ClusterExperiment` object additionally stores clusterings and information about the clusterings alongside the data. This helps keep everything together, and like the `SummarizedExperiment` object, allows for simple things like subsetting to a reduced set of subjects and being confident that the corresponding clusterings, colData, and so forth are similarly subset.

Typing the name at the control prompt results in a quick summary of the object.
```{r show}
ce
```

This summary tells us the total number of clusterings (`r nClusterings(ce)`), and gives some indication as to what parts of the standard workflow have been completed and stored in this object. It also gives information regarding the `primaryCluster` of the object. The `primaryCluster` is just one of the clusterings that has been chosen to be the "primary" clustering, meaning that by default various functions will turn to this clustering as the desired clustering to use. `clusterMany` arbitrarily sets the `primaryCluster` to the first one, and each later step of the workflow sets the primary index to the most recent, but the user can set a specific clustering to be the `primaryCluster` with `primaryClusterIndex`. Often, if a function is not given a specific clustering (usually via an option `whichCluster` or `whichClusters`) the "primary" cluster is taken by default.

There are also additional commands to access the clusterings and their related information (type `help("ClusterExperiment-methods")` for more).

The cluster assignments are stored in the `clusterMatrix` slot of `ce`, with samples on the rows and different clusterings on the columns. We saw that we can look at the cluster matrix and the primary cluster with the commands `clusterMatrix` and `primaryCluster`:

```{r CEHelperCommands1}
head(clusterMatrix(ce))[,1:5]
primaryCluster(ce)
```

Remember that we made multiple calls to `combineMany`: only the last such call will be shown when we use `whichClusters="workflow"` in our plotting (see this [section](#rerun) for a discussion of how these repeated calls are handled.)

**Negative Valued Cluster Assignments** The different clusters are stored as consecutive integers, with '-1' and '-2' having special meaning. '-1' refers to samples that were not clustered by the clustering algorithm. In our example, we removed clusters that didn't meet a specific size criterion, so they were assigned '-1'.
'-2' is for samples that were not included in the original input to the clustering. This is useful if, for example, you cluster on a subset of the samples, and then want to store this clustering with the clusterings done on all the data. You can create a vector of clusterings that give '-2' to the samples not originally used and then add these clusterings to the `ce` object manually with `addClusters`.

`clusterLabels` gives the column names of the `clusterMatrix`; `clusterMany` has given column names based on the parameter choices, and later steps in the workflow also give a name (or allow the user to set them).

```{r CEHelperCommands2}
head(clusterLabels(ce),10)
```

As we've seen, the user can also change these labels.

`clusterTypes` on the other hand indicates what call made the clustering. Unlike the labels, it is wise to not change the values of `clusterTypes` unless you are sure of what you are doing because these values are used to identify clusterings from different steps of the workflow.

```{r CEHelperCommands3}
head(clusterTypes(ce),10)
```

The information that was in the original `fluidigm` object has also been preserved, like `colData` that contains information on each sample.

```{r SECommandsOnCE}
colData(ce)[,1:5]
```

Another important slot in the `ClusterExperiment` object is the `clusterLegend` slot. This consists of a list, one element per column (or clustering) of `clusterMatrix`, that gives colors and names to each cluster within a clustering.

```{r CEClusterLengend}
length(clusterLegend(ce))
clusterLegend(ce)[1:2]
```

We can see that each element of `clusterLegend` consists of a matrix, with number of rows equal to the number of clusters in the clustering. The columns store information about that cluster. `clusterIds` is the internal id (integer) used in `clusterMatrix` to identify the cluster, `name` is a name for the cluster, and `color` is a color for that cluster.
`color` is used in plotting and visualizing the clusters, and `name` is an arbitrary character string for a cluster. They are automatically given default values when the `ClusterExperiment` object is created, but we will see under the description of visualization methods how the user might want to manipulate these for better plotting results.

We can assign new values with a simple assignment operator. Here we change the internal cluster names of the first clustering from lowercase "m" to uppercase "M":

```{r CEClusterLengendAssign}
existingColors<-clusterLegend(ce)[[1]]
existingColors[,"name"]<-gsub("m","M",existingColors[,"name"])
clusterLegend(ce)[[1]]<-existingColors
print(ce)
```

However, the user should not assign new values to `clusterIds`, which must correspond to the internal ids of the clustering stored in the clustering matrix.

# Visualizing the data {#visual}

## Cluster Alignment plot with `plotClusters` {#plotClusters}

We demonstrated during the quick start that we can visualize the alignment of multiple clusterings via a Cluster Alignment plot implemented in the `plotClusters` or `plotClustersWorkflow` command. Since `plotClustersWorkflow` calls the `plotClusters` command and passes the arguments of `plotClusters` onward, we will focus mainly on the main `plotClusters` command, with the understanding that most of these arguments work for both.

Here is our basic call to `plotClusters`:

```{r plotClusterEx1_redo}
par(mar=plotCMar)
plotClusters(ce,main="Clusters from clusterMany", whichClusters="workflow", axisLine=-1)
```

We have seen that we can get very different plots depending on how we order the clusterings, and what clusterings are included. The argument `whichClusters` allows the user to choose different clusterings or provide an explicit ordering of the clusterings. For `plotClustersWorkflow`, `whichClusters` indicates the clusterings that are "highlighted" by being drawn separately from the results of `clusterMany`.
`whichClusters` can take either a single character value, or a vector of either characters or indices. If `whichClusters` matches either "all" or "workflow", then the clusterings chosen are either all, or only those from the most recent calls to the workflow functions. Choosing "workflow" removes from the visualization both user-defined clusterings and also previous calls to the workflow that have since been rerun. Setting `whichClusters="workflow"` can be useful if you have called a method like `combineMany` several times, as we did, only with different parameters. All of those runs are saved (unless `eraseOld=TRUE`), but you may not want to plot them.

If `whichClusters` is a character that is not one of these designated values, the entries should match a "clusterType" value (like "clusterMany") or a "clusterLabel" value (with exact matching). Alternatively, the user can specify numeric indices corresponding to the columns of `clusterMatrix` that provide the order of the clusters.

```{r plotClusterEx1_newOrder}
par(mar=plotCMar)
plotClusters(ce,whichClusters="clusterMany",
             main="Only Clusters from clusterMany",axisLine=-1)
```

We can also add to our plot (categorical) information on each of our subjects from the `colData` of our SummarizedExperiment object (which is also retained in our ClusterExperiment object). This can be helpful to see if the clusters correspond to other features of the samples, such as sample batches. Here we add the values from the columns "Biological_Condition" and "Published2" that were present in the `fluidigm` object and given with the published data.

```{r plotClusterEx1_addData}
par(mar=plotCMar)
plotClusters(ce,whichClusters="workflow",
             sampleData=c("Biological_Condition","Published2"),
             main="Workflow clusters plus other data",axisLine=-1)
```

This option of plotting the data in the `colData`, however, is not currently available with `plotClustersWorkflow`, which only plots clusterings.
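A numeric complement to this visual comparison is a simple cross-tabulation of cluster assignments against a sample annotation. The base-R sketch below uses made-up vectors `cl` and `batch` as stand-ins for one column of `clusterMatrix(ce)` and one column of `colData(ce)`; the values are purely illustrative.

```r
# Made-up cluster assignments (-1 = unassigned) and a made-up batch label
cl    <- c(1, 1, 2, 2, -1, 1, 2, -1)
batch <- c("A", "A", "B", "B", "A", "A", "B", "B")

# Counts of samples in each cluster/batch combination
tab <- table(cluster = cl, batch = batch)
print(tab)

# Row-wise proportions: how each cluster is distributed over the batches
print(round(prop.table(tab, margin = 1), 2))
```

A cluster whose row is concentrated in a single column, as for clusters 1 and 2 here, is one that lines up closely with that sample feature.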
### Manipulations of colors {#plotClustersAlign}

In this section we will talk about a number of related ways to manipulate the colors assigned to clusters in conjunction with the `plotClusters` command. We will turn off the plotting of the cluster labels to make the plot less cluttered using the option `clusterLabels=FALSE`.

**Saving Assignment of colors from `plotClusters`:** `plotClusters` invisibly returns a `ClusterExperiment` object. In our earlier calls to `plotClusters`, this would be the same as the input object and so there is no reason to save it. However, the alignment and color assignments created by `plotClusters` can be requested to be saved *into* the appropriate slots of the `ClusterExperiment` object in order to save the color and alignments of samples. This is done via the `resetNames`, `resetColors` and `resetOrderSamples` arguments. If any of these are set to TRUE, then the object returned will be different from the input. Specifically, if `resetColors=TRUE` the `clusterLegend` of the returned object will be changed so that the colors assigned to each cluster will be as were shown in the plot (and indeed this is done automatically by `mergeClusters` so that the `combineMany` and `mergeClusters` steps will have aligned color assignments). Similarly, if `resetNames=TRUE` the *names* of the clusters will be changed to be integer values, but now those integers will be aligned to try to be the same across clusters (and therefore will not be consecutive integers, which is why these are saved as *names* for the clusters and not the internal `clusterIds`). If `resetOrderSamples=TRUE`, then the order of the samples shown in the plot will be similarly saved in the slot `orderSamples`.

As an example, we will make a call to `plotClusters`, but now ask to reset everything to match this clusterAlignment.
First let's look at what the object's default colors are for the first two clusterings, accessed by the `clusterLegend` function:

```{r plotClusterEx1_origColors}
clusterLegend(ce)[c("mergeClusters","combineMany,final")]
```

The `plotClusterLegend` function creates a quick legend plot of this information for a single clustering:

```{r plotClusterLegend_before,out.width="300px",out.height="300px"}
plotClusterLegend(ce,whichCluster="combineMany,final")
```

Now, we'll run `plotClusters` and save the new colors. We'll save this as a different object so that this is not a permanent change for the rest of the vignette, though in practice we would usually overwrite the existing `ce` to save memory on our computer and not have many versions floating around.

```{r plotClusterEx1_setColors}
ce_temp<-plotClusters(ce,whichClusters="workflow",
                      sampleData=c("Biological_Condition","Published2"),
                      main="Cluster Alignment of Workflow Clusters",clusterLabels=FALSE,
                      axisLine=-1, resetNames=TRUE,resetColors=TRUE,resetOrderSamples=TRUE)
```

Now, the `clusterLegend` slot of the object no longer has the default color/name assignments, but it has names and colors that match across the clusters. Notice that this means the prefix "m" or "c" that was previously given to distinguish the `combineMany` result from the `mergeClusters` result is now gone (the user could manually add them by changing the values in the clusterLegend). Instead, there are names that "match up" the clusters across the different clusterings.

```{r plotClusterEx1_newColors}
clusterLegend(ce_temp)[c("mergeClusters","combineMany,final")]
```

**Manual Assignment of colors** A similar setting can be that we have colors we want to manually set for a particular cluster, but we want to have the other clusterings get aligned to the colors of that clustering. We can use `plotClusters` to assign colors to the other clusterings so that the other clusterings inherit the colors of the cluster of interest.
Let's manually set the colors for the "mergeClusters" clustering. We'll again create a (new) `ce_temp` object so again we don't overwrite the previous colors for the rest of the vignette. Again, the color information is accessed with the `clusterLegend` command:

```{r plotClusterEx1_clusterLegend}
ce_temp<-ce
clusterLegend(ce_temp)[["mergeClusters"]]
```

We will just assign new colors to the `color` column. We can also give them new names too.

```{r plotClusterEx1_assignColors}
clusterLegend(ce_temp)[["mergeClusters"]][,"color"]<-c("white","blue","green","cyan","purple")
clusterLegend(ce_temp)[["mergeClusters"]][,"name"]<-c("Not assigned","Cluster1","Cluster2","Cluster3","Cluster4")
```

We use `plotClusterLegend` to check that we assigned them as we expected:

```{r plotClusterLegend,out.width="300px",out.height="300px"}
plotClusterLegend(ce_temp,whichCluster="mergeClusters")
```

Now we will run the `plotClusters` alignment plot, but we will direct the alignment to use the cluster colors we just gave for the "mergeClusters" cluster. We do this by using the argument `existingColors="firstOnly"` -- but as the argument option indicates it only works if the "mergeClusters" clustering is the first clustering in the plot.

```{r plotClusterEx1_firstOnlyNoSave}
plotClusters(ce_temp,whichClusters="workflow",
             sampleData=c("Biological_Condition","Published2"), clusterLabels=FALSE,
             main="Clusters from clusterMany, different order",axisLine=-1,existingColors="firstOnly")
```

This just created the visualization. We can also save the results of this as we did before with `resetColors=TRUE`, so that now all of our future clusters make use of this color information. We won't reset the names, however. Note we can avoid making the plot again with the argument `plot=FALSE`.
```{r plotClusterEx1_firstOnly}
ce_temp<-plotClusters(ce_temp,whichClusters="workflow",
                      sampleData=c("Biological_Condition","Published2"), resetColors=TRUE,
                      main="Clusters from clusterMany, different order",axisLine=-1,
                      clusterLabels=FALSE, existingColors="firstOnly",plot=FALSE)
```

**Using only the assigned colors** Once we start manually changing the colors, we will sometimes want to align our clusters, but *use the existing cluster colors* that are saved in the object. We can force `plotClusters` to use all the existing color assignments, rather than create its own, with the argument `existingColors="all"`. This makes particular sense if you want to have continuity between plots -- i.e. be sure that a particular cluster always has a certain color -- but would like to do different variations of plotClusters to get a sense of how similar the clusters are.

For example, we set the colors above based on the cluster alignment produced in the above `plotClusters` where the clusterings were ordered according to the workflow and making use of the colors we manually assigned to "mergeClusters". But now we want to plot only the clusters from `clusterMany`, yet keep the same colors that we just saved so we can compare them. We do this by setting the argument `existingColors="all"`, meaning use all of the existing colors.
```{r plotClusterEx1_forceColors,fig.width=18,fig.height=9}
par(mfrow=c(1,2))
plotClusters(ce_temp, sampleData=c("Biological_Condition","Published2"),
             whichClusters="workflow", existingColors="all", clusterLabels=FALSE,
             main="All Workflow Clusters, use existing colors",axisLine=-1)
plotClusters(ce_temp, sampleData=c("Biological_Condition","Published2"),
             existingColors="all", whichClusters="clusterMany", clusterLabels=FALSE,
             main="clusterMany Clusters, use existing colors", axisLine=-1)
```

We see that while the order of the samples has changed (because I have a different set of clusters to align) the *colors* assigned to each cluster have stayed the same, so I can more easily compare the plots.

Note that the use of `existingColors="firstOnly"` and `existingColors="all"` will not give the same color assignments, *even if the colors have been previously aligned to be the same*, unless it's the exact same set of clusterings in the same order. This is because with each set of ordered clusterings, the realignment between clusters changes and so the assignment of colors changes. Here we will make a ClusterAlignment plot for each of the three options of the argument to demonstrate the difference.
```{r plotClusterEx1_compareForceColors,fig.width=18,fig.height=18}
par(mfrow=c(2,2))
plotClusters(ce_temp, sampleData=c("Biological_Condition","Published2"),
             existingColors="all", whichClusters="clusterMany", clusterLabels=FALSE,
             main="clusterMany Clusters, use existing colors", axisLine=-1)
plotClusters(ce_temp, sampleData=c("Biological_Condition","Published2"),
             existingColors="firstOnly", whichClusters="clusterMany", clusterLabels=FALSE,
             main="clusterMany Clusters, use existing of first row only", axisLine=-1)
plotClusters(ce_temp, sampleData=c("Biological_Condition","Published2"),
             existingColors="ignore", whichClusters="clusterMany", clusterLabels=FALSE,
             main="clusterMany Clusters, default\n(ignoring assigned colors)", axisLine=-1)
```

**Choosing a different set of colors:** `plotClusters` uses the set of colors defined in the variable `massivePalette`. This is a set of `r length(massivePalette)` colors. `plotClusters` draws from this set of colors sequentially as it goes down the clusterings each time it needs a new color. This is obviously a very large list of colors; the reason for such a long list is that `plotClusters` will error out if there are not enough colors, so we want to have a large list. If we are running RSEC with many clusterings, and each clustering has many clusters, we can quickly run through many colors.

The first `r length(bigPalette)` colors are a smaller subset of colors saved as a variable `bigPalette`, and these have been hand-chosen so that the colors are vibrant and not too similar to each other; the order has also been chosen so that more similar colors are not next to each other. These are the colors that are most frequently seen.
`massivePalette` consists of these colors, plus all of the non-grey colors in `colors()` that are not already contained in `bigPalette`.

We can examine the colors in `bigPalette` with the command `showPalette`:

```{r showPalette}
showPalette()
```

This plot shows us both the name of the color (on the top) and the index of the color in `bigPalette`.

We can show a smaller subset or give an arbitrary set of colors to show with `showPalette`:

```{r showPaletteOptions}
showPalette(which=1:10)
showPalette(palette())
```

We can even show the entire `massivePalette` (notice that for a list of more than 100 colors, the index is no longer plotted):

```{r showPaletteMassive}
showPalette(massivePalette,cex=0.5)
```

We can use this information to help change our color choices. For example, suppose I want the unassigned samples to be given the color black instead of white. Then I would like to not use the black in `bigPalette` to assign to a clustering. I will use the `colPalette` argument to give a new set of colors that does not include "black". And I will set the unassigned samples using the option `unassignedColor="black"` (there is a similar argument `missingColor` to set the color assigned to those clusters with the identification of "-2", meaning they were not included in the clustering at all).

```{r removeBlack}
plotClusters(ce,whichClusters="workflow",
             sampleData=c("Biological_Condition","Published2"),
             unassignedColor="black",colPalette=bigPalette[-grep("black",bigPalette)],
             main="Setting unassigned color",axisLine=-1)
```

### Variant version: `plotClustersWorkflow`

`plotClusters` is a generic function for showing any clusterings. In following our workflow, however, there can often be many clusterings from `clusterMany`, and the important clusters from `combineMany` and `mergeClusters` can get lost from view. The function `plotClustersWorkflow` implements `plotClusters` but pulls out specific clusterings designated by the user and puts them apart from the results of `clusterMany`.
```{r plotWorkflow}
plotClustersWorkflow(ce, axisLine= -1)
```

By default, the "combineMany" and "mergeClusters" clusterings are highlighted. We can choose other clusterings to be highlighted, rather than the default ones, using the argument `whichClusters`. We can also sort the clusterings differently so that we sort based on the results of `clusterMany` first, rather than the highlighted results first, and/or choose to have the highlighted clusters on the bottom or the top (independently of the sorting choices).

In the following code, we choose to show the different versions of `combineMany` as our highlighted clusters.

```{r plotWorkflow_resort}
par(mar=plotCMar)
plotClustersWorkflow(ce,whichClusters=c("combineMany,final","combineMany,0.7","combineMany,1"),
                     main="Different choices of proportion in combineMany",
                     sortBy="clusterMany",highlightOnTop=FALSE, axisLine= -1)
```

There are no limits as to which clusterings can be shown with the `whichClusters` argument, but the non-highlighted clusterings must always be of `clusterType` "clusterMany". The user can select a subset or different ordering of the "clusterMany" clusterings with the argument `whichClusterMany`, which must be a numeric vector giving the indices of the clusters to be chosen.

## Heatmap including the clusters with `plotHeatmap` {#plotHeatmap}

There is also a default heatmap command for a `ClusterExperiment` object that we used in the Quick Start. By default it clusters on the most variable features (after transforming the data) and shows the `primaryCluster` alongside the data. The `primaryCluster`, now that we've run the workflow, has been set as that from the last `mergeClusters` step.

```{r plotHeatmap_Ex1}
par(mfrow=c(1,1))
par(mar=defaultMar)
plotHeatmap(ce,main="Heatmap with clusterMany")
```

The `plotHeatmap` command has numerous options, in addition to those of `aheatmap`.
`plotHeatmap` mainly provides additional functionality in the following areas:

* Easy inclusion of clustering information or sample information, based on the ClusterExperiment object.
* Additional methods for ordering/clustering the samples that make use of the clustering information.
* Use of separate input data for clustering and for visualization.
* Setting the breaks for better visualization

### Displaying clustering or sample information

Like `plotClusters`, `plotHeatmap` has a `whichClusters` option that behaves similarly to that of `plotClusters`. In addition to the options "all" and "workflow" that we saw with `plotClusters`, `plotHeatmap` also takes the options "none" (no clusters shown) and "primary" (only the primaryCluster). The user can also request a subset of the clusters by giving specific indices to `whichClusters` like in `plotClusters`.

Here we create a heatmap that shows the clusters from the workflow. Notice that we choose only the last 2 -- from `combineMany` and `mergeClusters`. If we chose all "workflow" clusters it would be too many.

```{r plotHeatmap_Ex1.1}
whClusterPlot<-1:2
plotHeatmap(ce,whichClusters=whClusterPlot, annLegend=FALSE)
```

Notice we also passed the option `annLegend=FALSE` to the underlying `aheatmap` command (with many clusterings shown, it is often not useful to have a legend for all the clusters because the legend doesn't fit on the page!). The many detailed commands of `aheatmap` that are not set internally by `plotHeatmap` can be passed along as well.

Like `plotClusters`, `plotHeatmap` takes an argument `sampleData`, which refers to columns of the `colData` of that object and can be included.

### Additional options for clustering/ordering samples

We can choose to not cluster the samples, but order the samples by cluster.
This time we'll just show the primary cluster (the `mergeClusters` result) by setting `whichClusters="primaryCluster"`:

```{r plotHeatmap_primaryCluster}
plotHeatmap(ce,clusterSamplesData="primaryCluster",
	whichClusters="primaryCluster",
	main="Heatmap with clusterMany",annLegend=FALSE)
```

As an improvement upon this, we can cluster the clusters into a dendrogram so that the most similar clusters will be near each other. We already did this before with our call to `makeDendrogram`. We haven't done anything to change that, so the dendrogram from that call is still stored in the object. We can check this in the information shown in our object:

```{r showCE_dendrogram}
show(ce)
```

We can see, as we expected, that the dendrogram we made from "combineMany,final" is still the active dendrogram saved in the `ce` object. Now we will call `plotHeatmap` choosing to display the "mergeClusters" and "combineMany,final" clustering, and ordering the samples by the dendrogram in `ce`. We will also display some data about the samples from the `colData` slot using the `sampleData` argument. Notice that instead of giving the index, we can also give the clusterLabels of the clusters we want to show.

```{r plotHeatmap_dendro}
plotHeatmap(ce,clusterSamplesData="dendrogramValue",
	whichClusters=c("mergeClusters","combineMany"),
	main="Heatmap with clusterMany",
	sampleData=c("Biological_Condition","Published2"),annLegend=FALSE)
```

If there is not a dendrogram stored, `plotHeatmap` will call `makeDendrogram` based on the primary cluster (with the default settings of `makeDendrogram`); calling `makeDendrogram` on `ce` is preferred so that the user can control the choices in how it is done (which we will discuss below). For visualization purposes, the dendrogram for the `combineMany` cluster is preferred to that of the `mergeClusters` cluster, since "combineMany,final" is just a finer partition of the "mergeClusters" clustering.
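
To make the idea of "clustering the clusters" concrete, here is a minimal base R sketch on simulated data. It is only an illustration under our own assumptions (per-cluster medians as the cluster summary, Euclidean distance, default `hclust` linkage) and is not the internal code of `makeDendrogram`; the objects `x`, `cl`, `centers`, and `sampleOrder` are hypothetical names used only here.

```r
# Illustration only: simulated data, base R, not the internals of makeDendrogram.
set.seed(1)
# 5 features (rows) x 30 samples (columns), in three groups of 10 samples
x <- cbind(matrix(rnorm(50, mean = 0),   nrow = 5),
           matrix(rnorm(50, mean = 0.5), nrow = 5),
           matrix(rnorm(50, mean = 5),   nrow = 5))
cl <- rep(c("c1", "c2", "c3"), each = 10)  # hypothetical cluster assignments

# One row per cluster: the median of each feature within that cluster
centers <- t(sapply(split(seq_len(ncol(x)), cl),
                    function(idx) apply(x[, idx, drop = FALSE], 1, median)))

# Hierarchical clustering of the cluster summaries gives a dendrogram of clusters
clusterDendro <- as.dendrogram(hclust(dist(centers)))

# Order the samples so that similar clusters sit next to each other
orderedClusters <- rownames(centers)[order.dendrogram(clusterDendro)]
sampleOrder <- order(match(cl, orderedClusters))
```

Samples within a cluster stay together in `sampleOrder`, and clusters that are close in the dendrogram end up adjacent, which is the ordering a heatmap can then use.
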

### Using separate input data for clustering and for visualization

While count data is a common type of data, it is also common that the input data in the SummarizedExperiment object might be normalized data from a normalization package such as `RUVSeq`. In this case, the clustering and all numerical calculations should be done on the normalized data (which may or may not need a log transform). However, these normalized data might not be on a logical count scale (for example, in `RUVSeq`, the normalized data are residuals after subtracting out gene-specific batch effects).

In this case, it can be convenient to have the *visualization* of the data (i.e. the color scale), be based on a count scale that is interpretable, even while the clustering is done based on the normalized data. This is possible by giving a new matrix of values to the argument `visualizeData`. In this case, the color scale (and clustering of the features) is based on the input `visualizeData` matrix, but all clustering of the samples is done on the internal data in the `ClusterExperiment` object.

### Setting the breaks

Usually, the breaks that determine the colors of the heatmap are evenly spaced across the range of the data in the entire matrix. When there are a few outlier samples or genes, they can dominate the color and make it impossible to visualize the bulk of the data.

For this reason, the argument `breaks` in `plotHeatmap` allows for a value between 0 and 1, to indicate that the range of colors should be chosen as equally spaced between certain quantiles of the data. For example, if `breaks=0.99`, the range of equally spaced breaks will stop at the 0.99 quantile of the data and anything above that value gets assigned the single extreme color.
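
The effect of stopping the breaks at a quantile can be sketched in base R. This is a rough illustration on simulated non-negative values, not the implementation of `setBreaks`; `vals`, `nBreaks`, and the two break vectors are hypothetical names.

```r
# Illustration only: simulated non-negative data with two extreme outliers.
set.seed(1)
vals <- c(rexp(1000), 50, 80)
nBreaks <- 10

# Evenly spaced over the full range: the outliers stretch the scale
evenBreaks <- seq(min(vals), max(vals), length.out = nBreaks)

# Evenly spaced only up to the 0.99 quantile; one final bin catches the rest
upper <- quantile(vals, probs = 0.99, names = FALSE)
quantBreaks <- c(seq(min(vals), upper, length.out = nBreaks), max(vals))

# Compare how the values spread over the color bins under each choice
table(cut(vals, evenBreaks, include.lowest = TRUE))
table(cut(vals, quantBreaks, include.lowest = TRUE))
```

With the evenly spaced breaks nearly all values fall in the first bin, while the quantile-based breaks spread the bulk of the data over the available colors.
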
If there is negative data in the matrix, then `plotHeatmap` will also use the lower quantile of the data to stop the range of equally spaced breaks (see `?setBreaks`). Here we compare `breaks=0.99` and `breaks=0.95`:

```{r plotHeatmap_break99}
plotHeatmap(ce,clusterSamplesData="primaryCluster",
	whichClusters="primaryCluster", breaks=0.99,
	main="Heatmap with clusterMany, breaks=0.99",annLegend=FALSE)
```

```{r plotHeatmap_break95}
plotHeatmap(ce,clusterSamplesData="primaryCluster",
	whichClusters="primaryCluster", breaks=0.95,
	main="Heatmap with clusterMany, breaks=0.95",annLegend=FALSE)
```

The function `setBreaks`, which is called internally by `plotHeatmap`, is also a stand-alone function that the user can call directly to have greater flexibility in getting breaks for the heatmap. For example, it allows the user to specify that the breaks should be symmetric around 0. We also provide some default color scales that can be better for different settings or for data symmetric around 0 -- see `?showHeatmapPalettes`.

# The clustering workflow {#workflow}

We will now go into more detail about important options for the main parts of the clustering workflow.

## clusterMany {#clusterMany}

In the quick start section we picked some simple and familiar clustering options that would run quickly and needed little explanation. However, our workflow generally assumes more complex options and more parameter variations are tried. Before getting into the specific options of `clusterMany`, let us first describe some of these more complicated setups, since many of the arguments of `clusterMany` depend on understanding them.

### Base clustering algorithms and the `ClusterFunction` class

This package is meant to be able to use and compare different clustering routines. However, the required input, arguments, etc. of different clustering algorithms vary greatly.
We create the `ClusterFunction` class so that we can ensure that the necessary information to fit into our workflow is well defined, and otherwise the other details of the algorithm can be ignored. In general, the user will not need to know the details of this class, since they will use built-in functions provided by the package which can be accessed by character values. To see the set of character values that correspond to built-in functions:

```{r builtInFunctions}
listBuiltInFunctions()
```

If you are interested in implementing your own `ClusterFunction` object see the documentation of the `ClusterFunction` class.

There are some important features of any clustering algorithm that are encoded in the `ClusterFunction` object and that are important to understand because they affect which algorithms can be used when.

**`inputType`** The type of input the algorithm expects, which can be either a $p \times n$ matrix of features, in which case the argument `x` gives that data, or an $n \times n$ matrix of dissimilarities, in which case the argument `diss` gives that data. Some algorithms can accept either type. To determine the `inputType` of one or more algorithms:

```{r getInputType}
inputType(c("kmeans","pam","hierarchicalK"))
```

**`algorithmType`** We group together algorithms that cluster based on common strategies that affect how we can use them in our workflow. Currently there are two "types" of algorithms we consider, which we call type "K" and "01". We can determine the type of a built-in function by the following:

```{r getAlgorithmType}
algorithmType(c("kmeans","hierarchicalK","hierarchical01"))
```

The "K" algorithms are so called because their main parameter requirement is that the user specifies the number of clusters ($K$) to be created; they require an input of `k` to the clustering function.
Built-in "K" algorithms are:

```{r builtInKFunctions}
listBuiltInTypeK()
```

The "01" algorithms are so named because the algorithm assumes that the input is a matrix of *dissimilarities* between samples and that the dissimilarities encoded in $D$ are on a scale of 0-1. The clustering functions should use this fact to make the primary user-specified parameter be not the number of final clusters, but a measure $\alpha$ of how dissimilar samples in the same cluster can be (on a scale of 0-1). Given $\alpha$, the algorithm then implements a method to determine the clusters (so $\alpha$ implicitly determines $K$). These methods rely on the assumption that because the 0-1 scale has special significance, the user will be able to make a determination more easily as to the level of dissimilarity allowed in a true cluster, rather than predetermine the number of clusters $K$. The current 01 methods are:

```{r builtIn01Functions}
listBuiltInType01()
```

**`requiredArgs`** The different algorithm types correspond to requiring different input types (`k` versus `alpha`). This is usually sorted out by `clusterMany`, which will only dispatch the appropriate one. Clustering functions can also have additional required arguments. See below for more discussion about how these arguments can be passed along to `clusterMany` or `RSEC`.

To see all of the required arguments of a function:

```{r requiredArgs}
requiredArgs(c("hierarchical01","hierarchicalK"))
```

### Internal clustering procedures

`clusterMany` iteratively calls a function `clusterSingle` over the collection of parameters. `clusterSingle` is the clustering workhorse, and may be used by the user who wants more fine-grained control; see the documentation of `clusterSingle`.

Within each call of `clusterSingle`, there are three possible steps, depending on the value of `subsample` and `sequential`. If these are both false, then just a basic clustering routine is done on the input data (called the "main" clustering).
If `subsample=TRUE`, there is first a step that subsamples and clusters the subsamples to calculate a co-occurrence matrix, and that is used as the input for the main clustering step. If `sequential=TRUE` this process is iterated over and over again to iteratively select the best clusters (see `?seqCluster` for a detailed description). Each of these steps has a function that goes with it, but that should not generally be called by the user. However, the documentation of these functions can be useful.

In particular, arguments to these functions that are not set by `clusterMany` can be passed via *named* lists: `subsampleArgs`, `mainClusterArgs`, and `seqArgs`. Some of the arguments to these steps can be varied in `clusterMany`, but more esoteric ones should be sent to these arguments of `clusterMany` (and they will be fixed for parameter combinations tried in `clusterMany`).

**Main Clustering Step** (`mainClustering`) The main clustering step described above is done by the function `mainClustering`. In addition to the basic clustering algorithms on the input data, we also implement many other common cluster processing steps that are relevant to the result of the clustering. We have already seen such an example with dimensionality reduction, where the input $D$ is determined based on different input data. Many of the arguments to `mainClustering` are arguments to `clusterMany` as well so that `mainClusterArgs` is usually not needed. The main exception would be to send more esoteric arguments to the underlying clustering function called in the main clustering step. The syntax for this would be to give a nested list to the argument `mainClusterArgs`:

```{r mainClusterArgsSyntax,eval=FALSE}
clusterMany(x,clusterFunction="hierarchicalK", ... ,
	mainClusterArgs=list(clusterArgs=list(method="single") ))
```

Here we change the argument `method` in the clustering function `hclust` called by the `hierarchicalK` function to `single`.
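
The nested-list syntax can be read as two layers of argument forwarding. The following base R sketch mimics that pattern with two hypothetical functions, `mainStep` and `hierK`, which we define only for illustration; they are stand-ins and are not the package's `mainClustering` or `hierarchicalK`.

```r
# Illustration only: hypothetical stand-ins, not functions from the package.

# Inner layer: a "K"-type routine that forwards extra arguments to hclust
hierK <- function(diss, k, ...) {
  fit <- hclust(as.dist(diss), ...)
  cutree(fit, k = k)
}

# Outer layer: receives a list of arguments meant for the clustering routine
mainStep <- function(diss, clusterFunction, clusterArgs = list()) {
  do.call(clusterFunction, c(list(diss = diss), clusterArgs))
}

set.seed(1)
d <- as.matrix(dist(matrix(rnorm(40), nrow = 10)))  # 10 simulated samples

# The nested list: the outer list is for mainStep, the inner one for hierK/hclust
mainClusterArgs <- list(clusterFunction = hierK,
                        clusterArgs = list(k = 3, method = "single"))
cl <- do.call(mainStep, c(list(diss = d), mainClusterArgs))
table(cl)
```

The inner list reaches `hclust` untouched by the outer function, which is why arguments such as `method` can be set without the outer function needing to know about them.
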

**Subsampling** (`subsampleClustering`) A more significant processing step that can be coupled with any clustering algorithm is to repeatedly subsample the data and cluster the subsampled data. This creates an $n \times n$ matrix $S$ that is a matrix of co-clustering percentages -- how many times two samples co-clustered together over the subsamples (there are slight variations as to how this can be calculated, see the help pages of `subsampleClustering`). This does not itself give a clustering, but the resulting $S$ matrix can then form the basis for clustering the samples. Specifically, the matrix $D=1-S$ is then given as input to the main clustering step described above. The subsampling option is computationally expensive, and when coupled with comparing many parameters, does result in a lengthy evaluation of `clusterMany`. However, we recommend it as one of the most useful methods for getting stable clustering results.

Note that the juxtaposition of these two steps (the subsampling and then feeding the results to the main clustering function) implies there are actually two different possible clustering algorithms (and sets of corresponding parameters) -- one for the clustering on the subsampled data, and one for the clustering of the resulting $D$ based on the percentage of co-clustering of samples. This brings up a restriction on the clustering function in the main clustering step -- it needs to be able to handle input that is a dissimilarity (`inputType` is either `diss` or `either`).

Furthermore, the user might want to set the clustering function and corresponding parameters separately for the two steps. The way that `clusterMany` handles this is that the main arguments of `clusterMany` focus on varying the parameters related to the main clustering step (the clustering of $D$ after subsampling). For this reason, the argument `clusterFunction` varies the clustering function used by the main clustering step, not the subsampling step.
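
As a rough illustration of how $S$ and $D=1-S$ arise, the following base R sketch subsamples simulated data, clusters each subsample with `kmeans`, and counts how often pairs of samples land in the same cluster. It is a simplified sketch under our own assumptions (pairs are only counted over subsamples that contain both samples, and samples are stored in rows for `kmeans`); `resampNum` and `sampP` are hypothetical names, and this is not the code of `subsampleClustering`.

```r
# Illustration only: simulated data, base R, not the code of subsampleClustering.
set.seed(1)
n <- 30
x <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),   # 15 samples around 0
           matrix(rnorm(n, mean = 6), ncol = 2))   # 15 samples around 6
resampNum <- 50   # number of subsamples (hypothetical name)
sampP <- 0.7      # proportion of samples per subsample (hypothetical name)
k <- 2

together <- matrix(0, n, n)  # times a pair was in the same cluster
sampled  <- matrix(0, n, n)  # times a pair was in the same subsample
for (b in seq_len(resampNum)) {
  idx <- sample(n, size = floor(sampP * n))
  cl <- kmeans(x[idx, , drop = FALSE], centers = k)$cluster
  sampled[idx, idx]  <- sampled[idx, idx] + 1
  together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
}

# Proportion of co-clustering (a pair never sampled together would give NaN)
S <- together / sampled
D <- 1 - S   # dissimilarity on a 0-1 scale, input to the main clustering step
```

Because $D$ lies between 0 and 1 by construction, it is a natural input for the "01" algorithms described above.
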
The clustering function of the subsampling step can be specified by the user via `subsampleArgs`, but in this case it is set for *all* calls of `clusterMany` and does not vary. Alternatively, if the user doesn't specify the `clusterFunction` in `subsampleArgs` then the default is to use `clusterFunction` of the main clustering step along with any required arguments given by the user for that function (there are some cases where using the `clusterFunction` of the main step is not possible for the subsampling step, in which case the default is to use "pam").

More generally, since few of the arguments to `subsampleClustering` are allowed to be varied by the direct arguments to `clusterMany`, it is also more common to want to change these arguments via the argument `subsampleArgs`. Examples might be `resamp.num` (the number of subsamples to draw) or `samp.p` (the proportion of samples to draw in each subsample) -- see `?subsampleClustering` for full documentation of the possible arguments. In addition, there are arguments to be passed to the underlying clustering function; like for `mainClustering`, these arguments would be a nested list to the argument `subsampleArgs`.

An example of a syntax that sets the arguments for `subsampleClustering` would be:

```{r subsampleArgsSyntax,eval=FALSE}
clusterMany(x,...,
	subsampleArgs=list(resamp.num=100,samp.p=0.5,clusterFunction="hierarchicalK",
		clusterArgs=list(method="single") ))
```

**Sequential Detection of Clusters** Another complicated addition that can be added to the main clustering step is the implementation of sequential clustering. This refers to clustering of the data, then removing the "best" cluster, and then re-clustering the remaining samples, and then continuing this iteration until all samples are clustered (or the algorithm in some other way calls a stop). Such sequential clustering can often be convenient when there is a very dominant cluster, for example, that is far away from the other mass of data.
Removing samples in these clusters and resampling can sometimes be more productive and result in a clustering more robust to the choice of samples. A particular implementation of such a sequential method, based upon [@tseng2005], is implemented in the `clusterExperiment` package when the option `sequential=TRUE` is chosen (see `?seqCluster` for documentation of how the iteration is done). Sequential clustering can also be quite computationally expensive, particularly when paired with subsampling to determine $D$ at each step of the iteration.

Because of the iterative nature of the sequential step, there are many possible parameters (see `?seqCluster`). Like subsample clustering, `clusterMany` does not allow variation of very many of these parameters, but they can be set via passing arguments in a named list to `seqArgs`. An example of a syntax that sets the arguments for `seqCluster` would be:

```{r seqArgsSyntax,eval=FALSE}
clusterMany(x,..., seqArgs=list( remain.n=10))
```

This code changes the `remain.n` option of the sequential step, which governs when the sequential step stops because there are not enough samples remaining.

### Arguments of `clusterMany`

Now that we've explained the underlying architecture of the clustering provided in the package, and how to set the arguments that can't be varied, we discuss the parameters that *can* be varied in `clusterMany`. (There are a few additional arguments available for `clusterMany` that govern how `clusterMany` works, but right now we focus on only the ones that can be given multiple options).

Recall that arguments in `clusterMany` that take on multiple values mean that the combinations of all the multiple valued arguments will be given as input for a clustering routine.

* `sequential` This parameter consists of logical values, TRUE and/or FALSE, indicating whether the sequential strategy should be implemented or not.
* `subsample` This parameter consists of logical values, TRUE and/or FALSE, indicating whether the subsampling strategy for determining $D$ should be implemented or not.
* `clusterFunction` The clustering functions to be tried in the *main clustering step*. Recall if `subsample=TRUE` is part of the combination, then `clusterFunction` is the method that will be used on the matrix $D$ created from subsampling the data. Otherwise, `clusterFunction` is the clustering method that will be used directly on the data.
* `ks` The argument `ks` is interpreted differently for different choices of the other parameters *and can differ between parameter combinations!* If `sequential=TRUE` is part of the parameter combination, `ks` defines the argument `k0` of sequential clustering (see `?seqCluster`), which is approximately like the initial starting point for the number of clusters in the sequential process. Otherwise, `ks` is passed to set `k` of the main clustering step (and by default that of the subsampled data), and is only relevant if `clusterFunction` is of type "K". When/if `findBestK=TRUE` is part of the combination, `ks` also defines the range of values to search for the best k (see the details in the documentation of `clusterMany` for more).
* `reduceMethod` These are character strings indicating what choices of dimensionality reduction should be tried. The choices are "PCA", "var", "mad", "abscv", and/or "none". "PCA" indicates clustering on the top principal components. "var", "mad", and "abscv" indicate clustering on the top most variable features, as determined by either "var", "mad" or "abscv" per gene. And "none" indicates the whole data set should be used (which is usually going to be computationally intractable).
  If either "PCA" or "var" is chosen, the following parameters indicate the number of such features to be used (and can be a vector of values to try as we have seen):
    * `nFilterDims`
    * `nReducedDims`
* `distFunction` These are character values giving functions that provide a distance matrix between the samples, when applied to the data. These functions should be accessible in the global environment (`clusterMany` applies `get` to the global environment to access these functions). To make them compatible with the standard R function `dist`, these functions should assume the samples are in the rows, i.e. they should work when applied to `t(assay(ce))`. We give an example in the next subsection below.
* `minSizes` These are integer values determining the minimum size required for a cluster (passed to the `mainClustering` part of clustering).
* `alphas` These are the $\alpha$ parameters for "01" clustering techniques; these values are only relevant if one of the `clusterFunction` values is a "01" clustering algorithm. The values given to `alphas` should be between 0 and 1, with smaller values indicating greater similarity required between samples in the same cluster.
* `betas` These are the $\beta$ parameters for sequential clustering; these values are only relevant if `sequential=TRUE` and determine the level of stability required between changes in the parameters to determine that a cluster is stable.
* `findBestK` This option is for "K" clustering techniques, and indicates that $K$ should be chosen automatically as the $K$ that gives the largest silhouette distance between clusters.
* `removeSil` A logical value as to whether samples with small silhouette distance to their assigned cluster are "removed", in the sense that they are not given their original cluster assignment but instead assigned -1. This option is for "K" clustering techniques as a method of removing poorly clustered samples.
* `silCutoff` If `removeSil` is TRUE, then `silCutoff` determines the cutoff on silhouette distance for unassigning the sample.

`clusterMany` tries to have a generally simple interface, and for this reason makes choices about what is meant by certain combinations of parameters. For example, in combinations where `findBestK=TRUE`, `ks=2:10` is taken to mean that the clustering should find the best $k$ out of the range of 2-10. However, in other parameter combinations where `findBestK=FALSE` the same `ks` might indicate the specific number of clusters, $K$, that should be found. To see the parameter choices that will be run, the user can set `run=FALSE` and the output will be a matrix of the parameter values indicated by the choices of the user. For parameter combinations that are not what is desired, the user should consider making direct calls to `clusterSingle` where all of these option combinations (and many more) can be explicitly called.

Other parameters for the clustering are kept fixed. As described above, there are many more possible parameters in play than are considered in `clusterMany`. These parameters can be set via the arguments `mainClusterArgs`, `subsampleArgs` and `seqArgs`. These arguments correspond to the different processes described above (the main clustering step, the creation of $D$ to be clustered via subsampling, and the sequential clustering process, respectively). These arguments take a list of arguments that are sent directly to `clusterSingle`. However, these arguments may be overridden by the interpretation of `clusterMany` of how different combinations interact; again for complete control direct calls to `clusterSingle` are necessary.
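
To get a feel for how quickly the number of combinations grows, the crossing of multi-valued arguments can be sketched with base R's `expand.grid`. This is only an illustration with hypothetical values: `nDims` is a made-up stand-in for a single dimension argument, and `clusterMany` does its own bookkeeping (for example, pairing dimension arguments with the matching `reduceMethod` and dropping duplicates), so its count for similar choices may differ.

```r
# Illustration only: a naive full crossing of hypothetical parameter values.
params <- expand.grid(ks = 2:10,
                      removeSil = c(TRUE, FALSE),
                      reduceMethod = c("PCA", "var"),
                      nDims = c(5, 15, 50),   # hypothetical stand-in argument
                      stringsAsFactors = FALSE)
nrow(params)   # 9 * 2 * 2 * 3 = 108 combinations
head(params)
```

Each row corresponds to one requested clustering, which is the same reading as the parameter matrix returned when `run=FALSE`.
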

```{r tableArguments, echo=FALSE, message=FALSE, warning=FALSE, results='asis'}
# simple table creation here
tabl <- "
Argument| Dependencies | Passed to | Argument passed to
---------------|-----------------|:-------------:|------:
ks | sequential=TRUE | seqCluster | k0
- | sequential=FALSE, findBestK=FALSE, clusterFunction of type 'K' | mainClustering | k
- | sequential=FALSE, findBestK=FALSE, subsample=TRUE | subsampleClustering | k
- | sequential=FALSE, findBestK=TRUE, clusterFunction of type 'K' | mainClustering | kRange
reduceMethod | none | transform | reduceMethod
nFilterDims | reduceMethod in 'mad','cv','var' | transform | nFilterDims
nReducedDims | reduceMethod='PCA' | transform | nReducedDims
clusterFunction| none | mainClustering | clusterFunction
minSizes | none | mainClustering | minSize
distFunction | subsample=FALSE | mainClustering | distFunction
alphas | clusterFunction of type '01'| mainClustering | alpha
findBestK | clusterFunction of type 'K' | mainClustering | findBestK
removeSil | clusterFunction of type 'K' | mainClustering | removeSil
silCutoff | clusterFunction of type 'K' | mainClustering | silCutoff
betas | sequential=TRUE | seqCluster | beta
"
cat(tabl) # output the table in a format good for HTML/PDF/docx conversion
```

### Example changing the distance function and clustering algorithm

Providing different distance functions is slightly more involved than the other parameters, so we give an example here.

First we define distances that we would like to compare. We are going to define two distances that take values between 0-1 based on different choices of correlation.

```{r defineDist}
corDist<-function(x){(1-cor(t(x),method="pearson"))/2}
spearDist<-function(x){(1-cor(t(x),method="spearman"))/2}
```

These distances are defined so as to give distance of 0 between samples with correlation 1, and distance of 1 for correlation -1.

We will also compare using different algorithms for clustering.
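
As a quick sanity check of the 0-1 scaling of the correlation distance, we can apply it to a tiny hand-made matrix with samples in the rows. The matrix `xToy` is hypothetical, and `corDist` is repeated here (same definition as in the text) only so that the sketch is self-contained.

```r
# Same definition as in the text, repeated so this sketch runs on its own
corDist <- function(x){(1-cor(t(x),method="pearson"))/2}

# Hypothetical toy data: three samples (rows) measured on four features
xToy <- rbind(a = c(1, 2, 3, 4),
              b = c(2, 4, 6, 8),   # perfectly correlated with a
              c = c(4, 3, 2, 1))   # perfectly anti-correlated with a
dToy <- corDist(xToy)
dToy["a", "b"]   # correlation 1 gives distance 0 (up to floating point)
dToy["a", "c"]   # correlation -1 gives distance 1 (up to floating point)
```
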
Currently, `clusterMany` requires that the distances work with all of the `clusterFunction` choices given. Since some of the `clusterFunction` algorithms require a distance matrix between 0-1, this means we can only compare all of the algorithms when the distance is a 0-1 distance. (Future versions may try to create a workaround so that the algorithm just skips algorithms that don't match the distance). Since the distances we defined are between 0-1, however, we can use any algorithm that takes dissimilarities as input.

**Note on 0-1 clustering when `subsample=FALSE`** We would note that the default values of $\alpha$ in `clusterMany` and `RSEC` for the 0-1 clustering were set with the distance $D$ the result of subsampling or other consensus summary in mind. In general, subsampling creates a $D$ matrix with high similarity for many samples who share a cluster (the proportion of times samples are seen together for well clustered samples can easily be in the .8-.95 range, or even exactly 1). For this reason the default $\alpha$ is 0.1, which requires distances between samples in the 0.1 range or less (i.e. a similarity in the range of 0.9 or more).

To illustrate this point, we show an example of the $D$ matrix from subsampling. To do this we make use of `clusterSingle`, which is the workhorse mentioned above that runs a single clustering command directly; it gives the output $D$ from the sampling in the "coClustering" slot of `ce` when we set `replaceCoClustering=TRUE` (and therefore we save it as a separate object, so that it doesn't write over the existing "coClustering" slot in `ce`). Note that the result is $1-p_{ij}$ where $p_{ij}$ is the proportion of times sample $i$ and $j$ clustered together.

```{r visualizeSubsamplingD}
ceSub<-clusterSingle(ce,reduceMethod="mad",nDims=1000,subsample=TRUE,
	subsampleArgs=list(clusterFunction="pam",clusterArgs=list(k=8)),
	clusterLabel="subsamplingCluster",
	mainClusterArgs=list(clusterFunction="hierarchical01",clusterArgs=list(alpha=0.1),minSize=5),
	replaceCoClustering=TRUE)
plotCoClustering(ceSub,colorScale=rev(seqPal5))
```

We see even here, the default of $\alpha=0.1$ was perhaps too conservative since only two clusters came out (at least with size greater than 5).

However, the distances based on correlation calculated directly on the data, such as we created above, are also often used for clustering expression data directly (i.e. without the subsampling step). But they are unlikely to have dissimilarities as low as seen in subsampling, even for well clustered samples. Here's a visualization of the correlation distance matrix we defined above (using Spearman's correlation) on the top 1000 most variable features:

```{r visualizeSpearmanDist}
dSp<-spearDist(t(transformData(ce,reduceMethod="mad",nFilterDims=1000)))
plotHeatmap(dSp,isSymmetric=TRUE,colorScale=rev(seqPal5))
```

We can see that the choice of $\alpha$ must be much higher (and we are likely to be more sensitive to it).

Notice that to calculate the distance in the above plot, we made use of the `transformData` function applied to our `ce` object to get the results of dimensionality reduction. The `transformData` function gave us a data matrix back that has been transformed, and also reduced in dimensions, like would be done in our clustering routines. `transformData` has similar parameters as seen in `clusterMany`, `makeDendrogram` or `clusterSingle` and is useful when you want to manually apply something to transformed and/or dimensionality reduced data; and you can be sure you are getting the same matrix of data back that the clustering algorithms are using.
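
For readers who want to see what "the top most variable features based on MAD" amounts to, here is a base R sketch on simulated data. It is only an illustration of the idea (compute a per-feature statistic, keep the top features, then compute a sample-by-sample distance) and is not the code used by the package; `xSim`, `nKeep`, and the other object names are hypothetical.

```r
# Illustration only: simulated data, base R, not the package's filtering code.
set.seed(1)
xSim <- matrix(rnorm(200 * 20), nrow = 200)  # 200 features (rows) x 20 samples
xSim[1:10, ] <- xSim[1:10, ] * 5             # make 10 features more variable
nKeep <- 10

# Per-feature median absolute deviation, then keep the nKeep largest
featureMad <- apply(xSim, 1, mad)
top <- order(featureMad, decreasing = TRUE)[seq_len(nKeep)]
xFiltered <- xSim[top, , drop = FALSE]

# Spearman correlation distance between samples (columns), scaled to 0-1
dSim <- (1 - cor(xFiltered, method = "spearman")) / 2
range(dSim)
```

Note that `cor` is applied here directly to the features-by-samples matrix, so no transpose is needed, whereas the distance functions given to `clusterMany` expect samples in the rows.
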

**Comparing distance functions with `clusterMany`** Now that we have defined the distances we want to compare in our global environment, we can give these to the argument "distFunction" in `clusterMany`. They should be given as character strings giving the names of the functions. For computational ease for this vignette, we will just choose the dimensionality reduction to be the top 1000 features based on MAD and set K=8 or $\alpha=0.45$.

We haven't yet calculated "mad" on this object, so it is not stored. `clusterMany` does not let you mix and match between uncalculated and stored filters (or dimensionality reductions), so our first step is to store the mad results. We will save these results as a separate object so as to not disrupt the earlier workflow.

```{r clusterManyDiffDist_calculateMad,fig.width=15,fig.height=6}
ceDist<-makeFilterStats(ce,filterStats="mad")
ceDist
```

```{r clusterManyDiffDist,fig.width=15,fig.height=6}
ceDist<-clusterMany(ceDist, k=7:9, alpha=c(0.35,0.4,0.45),
	clusterFunction=c("tight","hierarchical01","pam","hierarchicalK"),
	findBestK=FALSE,removeSil=c(FALSE),dist=c("corDist","spearDist"),
	reduceMethod=c("mad"),nFilterDims=1000,run=TRUE)
clusterLabels(ceDist)<-gsub("clusterFunction","alg",clusterLabels(ceDist))
clusterLabels(ceDist)<-gsub("Dist","",clusterLabels(ceDist))
clusterLabels(ceDist)<-gsub("distFunction","dist",clusterLabels(ceDist))
clusterLabels(ceDist)<-gsub("hierarchical","hier",clusterLabels(ceDist))
par(mar=c(1.1,15.1,1.1,1.1))
plotClusters(ceDist,axisLine=-2,sampleData=c("Biological_Condition"))
```

Notice that using the "tight" methods did not give relevant results (no samples were clustered).

### Dealing with large numbers of clusterings

A good first check before running `clusterMany` is to determine how many clusterings you are asking for. `clusterMany` has some limited internal checks to not do unnecessary duplicates (e.g.
`removeSil` only works with some clusterFunctions so `clusterMany` would detect that), but generally takes all combinations. This can take a while for more complicated clustering techniques, so it is a good idea to check what you are getting into. You can do this by running `clusterMany` with `run=FALSE`.

In the following we consider expanding our original clustering choices to consider individual choices of $K$ (rather than just `findBestK=TRUE`).

```{r clusterManyCheckParam}
checkParam<-clusterMany(se, clusterFunction="pam", ks=2:10,
	removeSil=c(TRUE,FALSE), isCount=TRUE,
	reduceMethod=c("PCA","var"),
	nFilterDims=c(100,500,1000),nReducedDims=c(5,15,50),run=FALSE)
dim(checkParam$paramMatrix) #number of rows is the number of clusterings
```

Each row of the matrix `checkParam$paramMatrix` is a requested clustering (the columns indicate the value of a possible parameter). Our selections indicate `r nrow(checkParam$paramMatrix)` different clusterings (!).

We can set the `ncores` argument to have these clusterings done in parallel. If `ncores>1`, the parallelization is done via `mclapply` and should not be done in the Rgui interface (see help pages for `mclapply`).

## Create a unified cluster from many clusters with `combineMany` {#combineMany}

After creating many clusterings, `combineMany` finds a single clustering based on what samples were in the same clusters throughout the many clusterings found by `clusterMany`. While subsampling the data helps to be robust to outlying samples, combining across many clustering parameters can help to be robust to the choice of parameters, particularly when the parameters give roughly similar numbers of clusters.

As mentioned in the Quick Start section, the default option for `combineMany` is to only define a cluster when *all* of the samples are in the same clusters across *all* clusterings. However, this is generally too conservative and just results in most samples not being assigned to a cluster.
Instead, `combineMany` has a parameter `proportion` that governs in what proportion of clusterings the samples should be together. Internally, `combineMany` makes a coClustering matrix $D$. Like the $D$ created by subsampling in `clusterMany`, the coClustering matrix takes on values 0-1 for the proportion of times the samples are together in the clustering. This $D$ matrix is saved in the `ce` object and can be visualized with `plotCoClustering` (which is just a call to `plotHeatmap`). Recall the one we last made in the QuickStart, with our last call to `combineMany` (`proportion=0.7` and `minSize=3`).

```{r combineMany_detailed}
plotCoClustering(ce)
```

`combineMany` performs the clustering by running a "01" clustering algorithm on the $D$ matrix of percentage co-clustering (the default being "hierarchical01"). The `alpha` argument to the 01 clustering is `1-proportion`. Also passed to the clustering algorithm is the parameter `minSize`, which sets the minimum size of a cluster.

We can also manually choose the set of clusterings to use in `combineMany` with the argument `whichClusters`. Here we choose only the clusterings that correspond to dimensionality reduction using the most variable features. We also set `minSize` to be lower than the default of 5 to allow for smaller clusters.

```{r combineMany_chooseClusters}
wh<-getClusterManyParams(ce)$clusteringIndex[getClusterManyParams(ce)$reduceMethod=="var"]
ce<-combineMany(ce,whichCluster=wh,proportion=0.7,minSize=3,
    clusterLabel="combineMany,nVAR")
plotCoClustering(ce)
```

We can compare to all of our other versions of `combineMany`. While they do not all have `clusterTypes` equal to "combineMany" (only the most recent call has clusterType exactly equal to "combineMany"), they all have "combineMany" as part of their clusterType, even though they have different clusterLabels (and now we'll see that it was useful to give them different labels!).

```{r combineMany_showDifferent}
wh<-grep("combineMany",clusterTypes(ce))
par(mar=plotCMar)
plotClusters(ce,whichClusters=rev(wh),axisLine=-1)
```

**Treatment of Unclustered assignments**

Values of -1 are treated separately in the calculation. In particular, they are not considered in the calculation of percentage co-clustering -- the percent co-clustering is taken only with respect to those clusterings where both samples were assigned. However, a post-processing step is applied to the clusters found from running the clustering on the $D$ matrix. For each sample, the percentage of clusterings in which it was marked -1 is calculated. If this percentage is greater than the argument `propUnassigned`, then the sample is marked as -1 in the clustering returned by `combineMany`.

**Good scenarios for running `combineMany`**

Varying certain parameters results in clusterings better suited to `combineMany` than varying other sets of parameters. In particular, if there are huge discrepancies in the set of clusterings given to `combineMany`, the result will be a shattering of the samples into many small clusters. Similarly, if the number of clusters $K$ is very different, the end result will likely be like that of the large $K$, and how much value that has (rather than just picking the clustering with the largest $K$) is debatable. However, for "01" clustering algorithms or clusterings using the sequential algorithm, varying the underlying parameters $\alpha$ or $k_0$ often results in roughly similar clusterings across the parameters, so that creating a consensus across them is highly informative.

## Creating a Hierarchy of Clusters and Merging clusters {#hierarchy}

As mentioned above, we find merging clusters together based on the extent of differential expression between the features to be a useful method for combining many small clusters. We provide a method for doing this that consists of two steps.
The first is making a hierarchy between the clusters; the second is estimating the amount of differential expression at each branch of the hierarchy.

### makeDendrogram and plotDendrogram {#makeDendrogram}

`makeDendrogram` creates a hierarchical clustering of the clusters as determined by the primaryCluster of the `ClusterExperiment` object. In addition to being used for merging clusters, the dendrograms created by `makeDendrogram` are also useful for ordering the clusters in `plotHeatmap`, as has been shown above.

`makeDendrogram` performs hierarchical clustering of the cluster medoids (after transformation of the data) and provides a dendrogram that will order the samples according to this clustering of the clusters. The hierarchical ordering of the dendrograms is saved internally in the `ClusterExperiment` object.

Like clustering, the dendrogram can depend on what features are included from the data. The same options as for clustering are available for the hierarchical clustering of the clusters, namely choices of dimensionality reduction via `reduceMethod` and the number of dimensions via `nDims`.

```{r makeDendrogram_reducedFeatures}
ce<-makeDendrogram(ce,reduceMethod="var",nDims=500)
plotDendrogram(ce)
```

Notice that the plot of the dendrogram shows the hierarchy of the clusters (and color codes them according to the colors stored in the `colorLegend` slot).

Recall that the most recent clustering made is from our call to `combineMany`, where we experimented with using only some of the clusterings from `clusterMany`, so that is our current primaryCluster:

```{r showCe}
show(ce)
```

This is the clustering from combining only the clusterings from `clusterMany` that use the top most variable genes. Because it is the primaryCluster, it was the clustering that was used by default to make the dendrogram.

We might prefer to get back to the dendrogram based on our `combineMany` in quick start (the "combineMany,final" clustering). We've lost that dendrogram when we called `makeDendrogram` again.
However, we can rerun `makeDendrogram` and choose a different clustering from which to make the dendrogram.

```{r remakeMakeDendrogram}
ce<-makeDendrogram(ce,reduceMethod="var",nDims=500,
    whichCluster="combineMany,final")
```

We will visualize the dendrogram with `plotDendrogram`. The default setting plots the dendrogram with color blocks whose sizes equal the sizes of the clusters (i.e., the number of samples in each cluster).

```{r plotRemadeDendrogram}
plotDendrogram(ce,leafType="sample",plotType="colorblock")
```

We can actually use `plotDendrogram` to compare clusterings too, like `plotClusters`, using the `whichClusters` argument to identify which clusterings to show. For example, let's compare our different `combineMany` results:

```{r plotRemadeDendrogram_compare}
whCM<-grep("combineMany",clusterTypes(ce))
plotDendrogram(ce,whichClusters=whCM,leafType="sample",plotType="colorblock")
```

Unlike `plotClusters`, however, there is no aligning of samples to make samples with the same cluster group together.

**Making this clustering current**

Note that if we look at the clusterType of the "combineMany,final" cluster, it is not of type "combineMany", but "combineMany.x". This is because we've run additional `combineMany` steps on this data, and the clustering "combineMany,final" is not the most current one. So to distinguish old iterations from the most recent run, the ".x" is added to the "clusterType", where "x" indicates what iteration it was (it would have added ".x" to the clusterLabel as well if we had not assigned it our own label).

```{r clusterTypeOfCombineMany}
clusterTypes(ce)[which(clusterLabels(ce)=="combineMany,final")]
```

We can choose to reset this past call to `combineMany` to be the current 'combineMany' output (which will also set this clustering to be the primaryCluster).

```{r getBackCombineMany}
ce<-setToCurrent(ce,whichCluster="combineMany,final")
show(ce)
```

We don't need to call `makeDendrogram` again, since we already used this clustering to make the dendrogram by our explicit use of the argument `whichCluster`.

### Merging clusters with little differential expression {#mergeClusters}

We then can use this hierarchy of clusters to merge clusters that show little difference in expression. We do this by testing, for each node of the dendrogram and each feature, whether the mean of the set of clusters on the right split of the node is equal to the mean on the left split. This is done via `getBestFeatures` (see section on [getBestFeatures](#getBestFeatures)), where the `contrastType` argument is set to "Dendro".

Starting at the bottom of the tree, those clusters that have the percentage of features with differential expression below a certain value (determined by the argument `cutoff`) are merged into a larger cluster. This testing of differences and merging continues until the estimated percentage of non-null DE features is above `cutoff`. This means lower values of `cutoff` result in less merging of clusters. There are multiple methods implemented for estimating the percentage of non-null features. The option `mergeMethod="adjP"`, which we showed earlier, is the simplest: it is the proportion of genes found significant at a False Discovery Rate threshold of 0.05 (using the Benjamini-Hochberg procedure). However, other more sophisticated methods are also implemented (see the help of `mergeClusters`).

Notice that `mergeClusters` will always run based on the clustering that made the currently existing dendrogram. So it is always good to check that it is what we expect.

```{r checkWhatDendro}
ce
```

We see in the summary "Dendrogram run on 'combineMany,final'", showing us that this is the clustering that will be used (and also showing us the value of giving our own labels to the results of `combineMany` if we are going to try different strategies).

We will run `mergeClusters` with the option `mergeMethod="adjP"`. We will also set `plotInfo="adjP"`, meaning that we would like the `mergeClusters` command to also produce a plot showing the dendrogram and the estimates from the `adjP` method for each node. We also set `calculateAll=FALSE` for illustration purposes, meaning the function will only calculate the estimates for the methods we request, but as we explain below, that is not necessarily the best option if you are going to be trying out different cutoffs.

```{r runMergeDetail}
ce<-mergeClusters(ce,mergeMethod="adjP",plotInfo=c("adjP"),calculateAll=FALSE)
```

The info about the merge is saved in the `ce` object.

```{r getMergeInfo}
mergeMethod(ce)
mergeCutoff(ce)
nodeMergeInfo(ce)
```

Notice that `nodeMergeInfo` gives, for each node, the proportion estimated to be differentially expressed (as displayed in the plot that we requested), as well as whether that node was merged together in the `mergeClusters` call (the `isMerged` column). Because we set `calculateAll=FALSE`, only the methods needed for our command were calculated (`adjP`). The others have `NA` values. The column `mergeClusterId` tells us which nodes in the tree are now equivalent to a cluster; this is different from the `isMerged` column, since some nodes can be merged but, if their parent nodes were also merged, then that node will not be equivalent to a cluster in the "mergeClusters" clustering.

`mergeClusters` can also be run without merging any clusters, simply drawing a plot showing the dendrogram along with the estimates of the percentage of non-null features to aid in deciding a cutoff and method.
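
For intuition about the per-node estimates reported by `nodeMergeInfo`, the `adjP` estimate described above can be sketched in base R. This is a minimal illustration that assumes a made-up vector of per-feature p-values for a single node (`toyPvals` and `toyCutoff` are hypothetical); the package's own computation may differ in its details.

```{r sketchAdjPProportion}
# Hypothetical per-feature p-values for the contrast at one node
toyPvals <- c(0.0001, 0.0004, 0.002, 0.03, 0.2, 0.5, 0.7, 0.9)

# Benjamini-Hochberg adjustment, then the share of features with FDR < 0.05
toyAdjP <- p.adjust(toyPvals, method = "BH")
toyPropDE <- mean(toyAdjP < 0.05)
toyPropDE

# Following the rule described above, a node is a candidate for merging
# when this estimate falls below the chosen cutoff (value assumed here)
toyCutoff <- 0.5
toyPropDE < toyCutoff
```

Here three of the eight adjusted p-values fall below 0.05, giving an estimate of 0.375, which is below the assumed cutoff of 0.5.
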
By setting `plotInfo="all"`, all of the estimates of the different methods are displayed simultaneously, while before we only showed the values for the specific `mergeMethod` we requested.

```{r mergeClusters_plot,fig.width=12}
ce<-mergeClusters(ce,mergeMethod="none",plotInfo="all")
```

Notice that if we now call `nodeMergeInfo`, all of the methods have estimates.

```{r showMergeAll}
nodeMergeInfo(ce)
```

This means in any future calls to `mergeClusters` there will be no more need for calculations of per-gene significance, which will speed up the calls if you just want to change the cutoff (all of the methods use the same input of per-gene p-values, so recalculating them each time is computationally inefficient). In practice, the default is `calculateAll=TRUE`, meaning all methods are calculated unless the user specifically requests otherwise.

Now we can pick a cutoff and rerun `mergeClusters`. We'll give it a label to keep it separate from the previous merge clusters run we had made. Note, we can turn off plotting completely by setting `plot=FALSE`.

```{r mergeClusters_ex}
ce<-mergeClusters(ce,cutoff=0.05,mergeMethod="adjP",clusterLabel="mergeClusters,v2",plot=FALSE)
ce
```

Notice that the `nodeMergeInfo` has changed, since different nodes were merged, but the estimates per node stay the same.

```{r mergeClusters_reexamineNode}
nodeMergeInfo(ce)
```

If we want to rerun `mergeClusters` with a different method, we can do that instead.

```{r mergeClusters_redo}
ce<-mergeClusters(ce,cutoff=0.15,mergeMethod="MB",
    clusterLabel="mergeClusters,v3",plot=FALSE)
ce
```

We can use `plotDendrogram` to compare the results. Notice that `plotDendrogram` can recreate the above plots that were created in the calls to `mergeClusters` via the argument `mergeInfo` (of course, this only works *after* `mergeClusters` has actually been called so that the information is saved in the `ce` object).

```{r mergeClusters_compareMerges}
par(mar=c(1.1,1.1,6.1,2.1))
plotDendrogram(ce,whichClusters=c("mergeClusters,v3","mergeClusters,v2"),mergeInfo="mergeMethod")
```

## Keeping track of and rerunning elements of the workflow {#rerun}

The commands shown above follow a workflow which continually saves the results over the previous object, so that additional information just gets added to the existing object.

What happens if some parts of the clustering workflow are re-run? For example, in the above we reran parts of the workflow when we talked about them in more detail, or to experiment with parameter settings.

The workflow commands check for existing clusters of the workflow (based on the `clusterTypes` of the clusterings). If there exist clusterings from previous runs *and* such clusterings came from calls that are "downstream" of the requested clustering, then the method will change their clusterTypes value by adding a ".i", where $i$ is a numerical index keeping track of replicate calls.

For example, suppose we rerun `combineMany`, say with a different parameter choice of the proportion similarity to require. Then `combineMany` searches the existing clusterings in the input object. Any existing `combineMany` results will have their `clusterTypes` changed from `combineMany` to `combineMany.x`, where $x$ is a number chosen to be greater than that of any existing `combineMany.x` (after all, you might do this many times!). Their labels will also be updated if they just have the default label, but if the user has given different labels to the clusters those will be preserved.

Moreover, since `mergeClusters` is downstream of `combineMany` in the workflow, currently existing `mergeClusters` will also get bumped to `mergeClusters.x`. However, `clusterMany` is upstream of `combineMany` (i.e. you expect there to be existing `clusterMany` before you run `combineMany`) so nothing will happen to `clusterMany`.
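
The renaming rule just described can be mimicked with a small toy function. `bumpTypes` below is a hypothetical helper written only for this illustration, not a function of the package, and the assumption that a single counter is shared across the affected workflow steps is ours.

```{r sketchBumpIteration}
# Toy illustration only (hypothetical helper, not part of the package):
# retire the current results of the given workflow steps by appending ".x",
# where x is one more than the largest iteration already used by those steps.
bumpTypes <- function(types, steps) {
  pattern <- paste0("^(", paste(steps, collapse = "|"), ")\\.([0-9]+)$")
  retired <- grepl(pattern, types)
  oldIter <- as.integer(sub(pattern, "\\2", types[retired]))
  nextIter <- max(c(0L, oldIter)) + 1L
  current <- types %in% steps
  types[current] <- paste0(types[current], ".", nextIter)
  types
}

toyTypes <- c("clusterMany", "clusterMany", "combineMany.1",
              "combineMany", "mergeClusters")
# Rerunning combineMany retires combineMany and the downstream mergeClusters,
# while the upstream clusterMany entries are left alone
toyBumped <- bumpTypes(toyTypes, steps = c("combineMany", "mergeClusters"))
toyBumped
```

In this toy run the old `combineMany` and `mergeClusters` entries both become iteration 2, which also shows how a `mergeClusters.2` can exist without there ever having been a `mergeClusters.1`.
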
This is handled internally, and may never be apparent to the user unless they choose `whichClusters="all"` in a plotting command. Indeed this is one reason to always pick `whichClusters="workflow"`, so that these saved previous versions are not displayed.

However, if the user wants to "go back" to previous versions and make them the current iteration, we have seen that the `setToCurrent` command will do this. `setToCurrent` follows the same process as described above, only with an existing cluster set to the current part of the pipeline.

Note that nothing governs or protects the `clusterTypes` values. This means that if the user decides to give the clusterType of a clustering one of these protected names, that is allowed. However, it could clearly create some havoc if done poorly.

**Erasing old clusters**

You can also choose to have **all** old versions erased by choosing the option `eraseOld=TRUE` in the call to `clusterMany`, `combineMany`, `mergeClusters` and/or `setToCurrent`. `eraseOld=TRUE` in any of these functions will delete ALL past workflow results except for those that are both in the current workflow *and* "upstream" of the requested command. You can also manually remove clusters with `removeClusters`.

**Finding workflow iterations**

Sometimes which numbered iteration a particular call is in will not be obvious if there are many calls to the workflow. You may have a `mergeClusters.2` cluster but no `mergeClusters.1` because of an upstream workflow call in the middle that bumped the iteration value up to 2 without ever making a `mergeClusters.1`. If you really want to, you can see more about the existing iterations and where they are in the `clusterMatrix`. "0" refers to the current iteration; otherwise the smaller the iteration number, the earlier it was run.

```{r workflowTable}
workflowClusterTable(ce)
```

Explicit details about every workflow cluster and their index in `clusterMatrix` are given by `workflowClusterDetails`:

```{r workflowDetails}
head(workflowClusterDetails(ce),8)
```

**A note on the `whichCluster` argument**

Many functions take the `whichCluster` argument for identifying a clustering or clusterings on which to perform an action. These arguments all act similarly across functions, and allow the user to give character arguments. As described above, these can be shortcuts like "workflow", or they can match either clusterTypes or clusterLabels of the object. It is important to note that matching is first done to clusterTypes, and then, if not successful, to clusterLabels. Since neither clusterTypes nor clusterLabels is guaranteed to be unique, the user should be careful in how they make the call. And, of course, `whichCluster` arguments can also take explicit integers that identify the column(s) of the clusterMatrix that should be used.

### Designate a Final Clustering

A final protected clusterType is "final". This is not created by any method, but can be set to be the clusterType of a clustering by the user (via the `clusterTypes` command). Any clustering marked `final` will be considered one of the workflow values for commands like `whichClusters="workflow"`. However, such clusterings will NOT be renamed with ".x" or removed if `eraseOld=TRUE`. This is a way for a user to 'save' a clustering as important/final so it will not be changed internally by any method, yet still have it show up with the "workflow" clustering results. There is no limit to the number of clusterings that are so marked, but the utility of doing so will drop if too many are chosen.

For best functionality, particularly if a user has determined a single final clustering after completing clustering, a user will probably want to set the primaryClusterIndex to be that of the final cluster and rerun makeDendrogram.
This will help in plotting and visualizing. The `setToFinal` command does this.

Here we will demonstrate marking a cluster as final. We go back to our previous mergeClusters that we found with `cutoff=0.05` and mark it as our final clustering. First we need to find which cluster it is. We see from our call to the workflow functions above that it has clusterType equal to "mergeClusters.4" and label equal to "mergeClusters,v2". In our call to `setToFinal` we will decide to change its label as well.

```{r markFinal}
ce<-setToFinal(ce,whichCluster="mergeClusters,v2",
    clusterLabel="Final Clustering")
par(mar=plotCMar)
plotClusters(ce,whichClusters="workflow")
```

Note that because it is labeled as "final" it shows up automatically in "workflow" clusters in our `plotClusters` plot. It has also been set as our primaryCluster and has the new clusterLabel we gave it in the call to `setToFinal`.

This didn't get rid of our undesired `mergeClusters` result that is most recent. It still shows up as "the" mergeClusters result. This might be undesired. We could remove that "mergeClusters" result with `removeClusters`. Alternatively, we could manually change the clusterTypes to `mergeClusters.x` so that it doesn't show up as current. A cleaner way to do this would have been to first set the desired cluster ("mergeClusters.4") to the most current iteration with `setToCurrent`, which would have bumped up the existing `mergeClusters` result to be no longer current.

## RSEC {#RSEC}

`RSEC` is a single function that follows the entire workflow described above, but makes the choices to set `subsample=TRUE` and `sequential=TRUE` to provide more robust clusterings. This removes a number of options from clusterMany, making for a slightly reduced set of arguments. `RSEC` also implements the `combineMany`, `makeDendrogram` and `mergeClusters` steps, again with not all the arguments available to those functions to be set by the user, only the most common.
Furthermore, the defaults set in `RSEC` are those we choose for our algorithm, and occasionally differ from those of the stand-alone methods. The final output is a `ClusterExperiment` object as you would get from following the workflow.

We give the following correspondence to help see what arguments of each component are fixed by RSEC, and which are allowed to be set by the user (as well as their correspondence to arguments in the workflow functions).

```{r rsecTable, echo=FALSE, message=FALSE, warnings=FALSE, results='asis'}
# simple table creation here
tabl <- "
|                  | Arguments in original function internally fixed | Arguments in RSEC for the user |  |  |
|:-----------------|:-----------------|:-------------|:------|:------|
|                  |                  |              | *Name of Argument in original function (if different)* | *Notes* |
| *clusterMany*    | sequential=TRUE  | k0s          | ks    | RSEC only sets 'k0', no other k |
| -                | distFunction=NA  | clusterFunction |    |  |
| -                | removeSil=FALSE  | reduceMethod |       |  |
| -                | subsample=TRUE   | nFilterDims  |       |  |
| -                | silCutoff=0      | nReducedDims |       |  |
| -                |                  | alphas       |       |  |
| -                |                  | betas        |       |  |
| -                |                  | minSizes     |       |  |
| -                |                  | mainClusterArgs |    |  |
| -                |                  | subsampleArgs |      |  |
| -                |                  | seqArgs      |       |  |
| -                |                  | run          |       |  |
| -                |                  | ncores       |       |  |
| -                |                  | random.seed  |       |  |
| -                |                  | isCount      |       |  |
| -                |                  | transFun     |       |  |
| *combineMany*    | propUnassigned = *(default)* | combineProportion | proportion |  |
| -                |                  | combineMinSize | minSize |  |
| *makeDendrogram* | ignoreUnassignedVar=TRUE | dendroReduce | reduceMethod |  |
| -                | unassignedSamples= *(default)* | dendroNDims | nDims |  |
| *mergeClusters*  | plot=FALSE       | mergeMethod  |       |  |
| -                |                  | mergeCutoff  | cutoff |  |
| -                |                  | isCount      |       | argument used for both mergeClusters and clusterMany |
"
cat(tabl) # output the table in a format good for HTML/PDF/docx conversion
```

# Finding Features related to a Clustering {#getBestFeatures}

The function `getBestFeatures` finds features in the data that are strongly differentiated between the clusters of a given clustering.
Finding the best features is generally the last step in the workflow, once a final clustering has been decided upon, though as we have seen it is also called internally in `mergeClusters` to decide which clusters to merge together.

The function `getBestFeatures` calls `limma` [@Smyth:2004gh; @Ritchie:2015fa] on input data to determine the gene features most associated with a particular clustering. `getBestFeatures` picks the `primaryCluster` of a `ClusterExperiment` object as the clustering to use to find features. If the standard workflow is followed, this will be the last completed step (usually the result of `mergeClusters` or manually choosing a final cluster via `setToFinal`). The primaryCluster can of course be changed by setting `primaryClusterIndex` to point to a different clustering.

The basic implementation is that of `limma`, which fits a linear model per feature and tests for the significance of parameters of that linear model. The main contribution of `getBestFeatures` is to interface with `limma` so as to pick appropriate parameters or tests for comparing clusters. Naturally, `getBestFeatures` also seamlessly works with `ClusterExperiment` objects to minimize the burden on the user. The output is in the form of `topTable` in `limma`, i.e. a data.frame giving the relevant features, the p-value, etc.

## Types of Significance Tests

There are several choices of what is the most appropriate test to determine whether a feature is differentially expressed across the clusters. All of these methods first fit a linear model where the cluster categories of the clustering are the explanatory factor in the model (samples with -1 or -2 are ignored). The methods differ only in what significance tests they then perform, which is controlled by the argument `contrastType`. By default, `getBestFeatures` finds significant genes based on an F-test between the clusters (`contrastType="F"`).
This is a very standard test to compare clusters, which is why it is the default; however, it may not be the one that gives the best or most specific results. Indeed, in our "Quick Start", we did not use the $F$ test, but rather all pair-wise comparisons between the clusters.

The $F$ test is a test for whether there are *any* differences in expression between the clusters for a feature. Three other options are available that try to detect instead specific kinds of differences between clusters that might be of greater interest. Specifically, these differences are encoded as "contrasts", meaning specific types of differences between the means of clusters.

Note that for all of these contrasts, we are making use of all of the data, not just the samples in the particular cluster pairs being compared. This means the variance is estimated with all the samples. Indeed, the same linear model is being used for all of these comparisons.

### All Pairwise

The option `contrastType="Pairs"`, which we saw earlier, performs all pair-wise tests between the clusters for each feature, testing for each pair of clusters whether the mean of the feature is different between the two clusters. Here is the example from above using all pairwise comparisons on the results of `RSEC`:

```{r getBestFeatures_onlyTopPairs}
pairsAllTop<-getBestFeatures(rsecFluidigm,contrastType="Pairs",p.value=0.05)
dim(pairsAllTop)
head(pairsAllTop)
```

Notice that compared to the quick start guide, we didn't set the parameter `number`, which is passed to `topTable`, so we can get out *at most* 10 significant features for each contrast/comparison (because the default value of `number` in `topTable` is 10). Similarly, if we didn't set a value for `p.value`, `topTable` would return the top `number` genes per contrast, regardless of whether they were all significant or not.
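
The interaction between `number` and `p.value` can be illustrated with a toy table. The data frame `toyRes` and the helper `topPerContrast` below are hypothetical and written only to mimic the per-contrast behavior described above; they are not output of `limma`, nor are they how `topTable` is implemented.

```{r sketchTopPerContrast}
# Hypothetical results: 2 contrasts with 4 features each (not limma output)
toyRes <- data.frame(
  Contrast = rep(c("Cl01-Cl02", "Cl01-Cl03"), each = 4),
  Feature = paste0("gene", 1:8),
  adj.P.Val = c(0.001, 0.01, 0.04, 0.30, 0.002, 0.20, 0.60, 0.90),
  stringsAsFactors = FALSE
)

# Within each contrast, keep at most `number` features that pass `p.value`
topPerContrast <- function(res, number, p.value = 1) {
  keep <- lapply(split(res, res$Contrast), function(x) {
    x <- x[order(x$adj.P.Val), ]
    x <- x[x$adj.P.Val <= p.value, ]
    head(x, number)
  })
  do.call(rbind, keep)
}

# Capped at 2 per contrast, significant only: 2 + 1 rows
nCapSig <- nrow(topPerContrast(toyRes, number = 2, p.value = 0.05))
# Capped at 2 per contrast, no significance filter: 2 + 2 rows
nCapOnly <- nrow(topPerContrast(toyRes, number = 2))
# No effective cap, significant only: 3 + 1 rows
nAllSig <- nrow(topPerContrast(toyRes, number = nrow(toyRes), p.value = 0.05))
c(nCapSig, nCapOnly, nAllSig)
```

The last call corresponds to setting `number` to the total number of features, which is the only one of the three settings that returns every significant feature for every contrast.
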
These are the defaults of `topTable`, which we purposefully do not modify, but we urge the user to read the documentation of `topTable` carefully to understand what is being asked for. In the QuickStart, we set `number=NROW(rsecFluidigm)` to make sure we got *all* significant genes.

In addition to the columns provided by `topTable`, the column "Contrast" tells us what pairwise contrast the result is from. "Cl01-Cl02" means a comparison of cluster 1 and cluster 2 (note that these refer to the cluster ids, not any name they might have). The column "IndexInOriginal" gives the index of the gene in the original input data matrix, namely `assay(ce)`. The other columns are given by `topTable` (with the column "Feature" renamed -- it is usually "ProbeID" in `limma`).

### One Against All

The choice `contrastType="OneAgainstAll"` performs a comparison of a cluster against the mean of all of the other clusters.

```{r getBestFeatures_oneAgainstAll}
best1vsAll<-getBestFeatures(rsecFluidigm,contrastType="OneAgainstAll",p.value=0.05,number=NROW(rsecFluidigm))
head(best1vsAll)
```

Notice that now there is both a "Contrast" and a "ContrastName" column, unlike with the pairs comparison. Like before, "Contrast" gives an explicit definition of what the comparison is, in the form of "(Cl02+Cl03+Cl04+Cl05+Cl06)/5-Cl01", meaning the mean of the means of clusters 2-6 is compared to the mean of cluster 1. Note that the contrasts here are always written in terms of the internal (numeric) cluster id, with a "Cl" in front of the number and a '0' to make the number 2 digits. "ContrastName" interprets this into a more usable name, namely that this contrast can be easily identified as a test of "Cl01" (cluster 1).

We can plot the contrasts with a heatmap for these results. Here we notice that the color next to the gene group matches the cluster that the contrast matches.

```{r getBestFeatures_oneHeatmap}
plotContrastHeatmap(rsecFluidigm,signifTable=best1vsAll,nBlankLines=10,
    whichCluster="primary")
```

### Dendrogram

The option `contrastType="Dendro"` is more complex; it assumes that there is a hierarchy of the clusters (created by `makeDendrogram` and stored in the `ClusterExperiment` object). Then for each *node* of the dendrogram, `getBestFeatures` defines a contrast or comparison of the mean expression between the daughter nodes.

```{r getBestFeatures_dendro}
bestDendro<-getBestFeatures(rsecFluidigm,contrastType="Dendro",p.value=0.05,number=NROW(rsecFluidigm))
head(bestDendro)
```

Again, there is both a "ContrastName" and "Contrast" column. The "Contrast" column identifies which cluster ids were on each side of the node (and hence compared) and "ContrastName" is the name of the node, determined internally during `makeDendrogram`.

```{r dendroContrastLevels}
levels((bestDendro)$Contrast)
```

We can look at the results again with `plotContrastHeatmap`.

```{r getBestFeatures_dendroHeatmap}
plotContrastHeatmap(rsecFluidigm,signifTable=bestDendro,nBlankLines=10)
```

We can plot the dendrogram to help make sense of which contrasts go with which nodes and choose to show the node names with `show.node.label=TRUE` (plotDendrogram calls `plot.phylo` from the `ape` package and can take as input arguments like `show.node.label`).

```{r dendroWithNodeNames}
plotDendrogram(rsecFluidigm,show.node.label=TRUE,whichClusters=c("combineMany","mergeClusters"),leaf="samples",plotType="colorblock")
```

## Analysis for count and other RNASeq data

The `getBestFeatures` method for `ClusterExperiment` objects has an argument `isCount`. If this is marked `TRUE` then the data in `assay(x)` is assumed to be counts, and the call to `limma` uses the `voom` [@Law:2014ff] correction. This correction deals with the mean-variance relationship that is found with count data. This means that the differential expression analysis is done on $\log_2(x+0.5)$.
This is *regardless of what transformation is stored in the `ClusterExperiment` object*! The `voom` call within `getBestFeatures`, however, is by default set to `normalize.method = "none"` (though the user can set `normalize.method` in the call to `getBestFeatures`).

If instead `isCount=FALSE`, then `limma` is performed on `transformData(x)`, i.e. the data after transformation with the transformation stored in the `ClusterExperiment` object. In this case, there is no `voom` correction.

Unlike edgeR or DESeq, the voom correction does not explicitly require a count matrix, and therefore it has been proposed that it can be used on FPKM or TPM entries, or data normalized via RUV. Setting `isCount=TRUE` even if the data in the assay slot is not count data will have this effect. However, the authors of the package do not recommend using voom on anything other than counts, see e.g. [this discussion](https://support.bioconductor.org/p/45749/).

## Piping into other DE routines

Ultimately, for many settings, the user may prefer to use other techniques for differential expression analysis or have more control over certain aspects of it. The function `clusterContrasts` may be called by the user to get the contrasts that are defined within `getBestFeatures` (e.g. dendrogram contrasts or pairwise contrasts). These contrasts, which are in the format needed for `limma`, can be piped into programs that allow for contrasts in their linear models, like edgeR [@Robinson:2010cw] for mRNA-Seq; they can also be returned in the format needed by MAST [@Finak:2015id] for single-cell sequencing by setting `outputType="MAST"`.

Similarly, more complicated normalizations, like RUV [@GagnonBartsch:2011jv], adjust each gene individually for unwanted batch or other variation within the linear model. In this case, a matrix $W$ that describes this variation should be included in the linear model.
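
As a rough sketch of what including $W$ in the linear model means, the following base R code builds a design matrix with one column per cluster plus a column for the unwanted variation. The cluster labels, the values in `toyW`, and the column names are invented for illustration; in practice $W$ would come from a method such as RUV, and the contrast would come from `clusterContrasts` rather than being written by hand.

```{r sketchDesignWithW}
set.seed(1)
# Hypothetical cluster assignments for 8 samples and one unwanted factor W
toyCl <- factor(c("Cl01", "Cl01", "Cl01", "Cl02", "Cl02", "Cl02", "Cl03", "Cl03"))
toyW <- matrix(rnorm(8), ncol = 1, dimnames = list(NULL, "W1"))

# One indicator column per cluster (no intercept), plus the column for W
toyDesign <- cbind(model.matrix(~ 0 + toyCl), toyW)
colnames(toyDesign) <- c(levels(toyCl), "W1")

# A pairwise contrast between two clusters puts zero weight on W1, so W is
# adjusted for in the fit but is not itself part of the comparison
toyContrast <- c(Cl01 = 1, Cl02 = -1, Cl03 = 0, W1 = 0)
dim(toyDesign)
```

The design has one row per sample and four columns, and the contrast vector has one entry per column of the design.
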
Again, this can be done in other programs, using the contrasts provided by `clusterContrasts`.

## Additional considerations

The user should be careful about questions of multiple comparisons when all of these multiple contrasts are being performed on each feature; the default is to correct across all of these tests (see the help of `getBestFeatures` and the argument `contrastAdj` for more). As noted in the introduction, p-values created in this way are reusing the data (since the data was also used for creating the clusters) and hence should not be considered valid p-values regardless.

As mentioned, `getBestFeatures` accepts arguments to `limma`'s function `topTable` to decide which genes should be returned (and in what order). In particular, we can set an adjusted p-value cutoff for each contrast, and set `number` to control the number of genes returned *for each contrast*. By setting `number` to be the total number of genes, and `p.value=0.05`, we can return all genes for each contrast that have adjusted p-values less than 0.05. All of the arguments to `topTable` regarding what results are returned and in what order can be given by the user at the call to `getBestFeatures`.

# Session Information

This vignette was compiled under:

```{r sessionInfo}
sessionInfo()
```

# References