--- title: "ClusterJudge" author: "Adrian Pasculescu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Vignette Title} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} references: - author: - family: Gibbons given: F.D - family: Roth given: F.P. container-title: Genome Research id: GibRoth2002 issued: month: October volume: 12 year: 2002 publisher: Genome Research title: 'Judging the Quality of Gene Expression-Based Clustering Methods Using Gene Annotation' - author: - family: Cho given: R.C. et all container-title: Molecular Cell id: Cho1998 issued: month: July volume: 2 year: 1998 publisher: Science Direct title: 'A genome-wide transcriptional analysis of the mitotic cell cycle' - author: - family: Tavazoie given: S. et all container-title: Nature Genetics id: Taivasoie1999 issued: month: July volume: 22 year: 1999 publisher: Nature Publishing Group title: 'Systematic determination of genetic network architecture' - author: - family: Cover given: T.M. et all container-title: book id: Cover1991 issued: month: volume: (ed. D.L. Schilling) year: 1991 publisher: Wiley-Interscience title: 'Elements of information theory' --- ```{r echo=FALSE, warning=FALSE, message=FALSE} devtools::load_all('.') ``` ## Introduction The quality of a cluster analysis can be judged based on external information. For example when the clustered entities possess attributes, the total mutual information between the clustering and each of the important and non-redundant attributes of entities might be helpful. Gene Ontology is an example of a system of attributes that can help in assessing gene clustering resulting from genomics or proteomics experiments. A clustering that `agrees well` with a system of attributes should have a pronounced decrease of the total mutual information when the number of random swaps of entities between clusters is increased [@GibRoth2002]. The following simple steps must be performed to evaluate a clustering using ClusterJudge: 1. Obtain the clusters as a structure containing entities and their cluster IDs; 2. Obtain the entity-attribute associations and verify the entity-attribute associations for consistency; 3. Often, the number of attributes can be consolidated (reduced) by using a set of _uncertainty_ levels; 4. 'Judge' the original clustering versus clustering of random changes that are applied to the original data. ## Obtaining the clustering In most of the cases the clustering is obtained from experimental data after applying some statistical methods such as k-means or cutting a hierarchical clustering at a predefined 'height'. The following example extracts the Yeast cell cycle clustering of genes related to the Yeast cell cycle [@Cho1998] [@Taivasoie1999]: ```{r warning=FALSE, message=FALSE} library('yeastExpData') ``` ```{r echo=TRUE} data(ccyclered) head(ccyclered) ``` The clustering should not contain any ambiguities ( i.e. one entity should be assigned to just one cluster) and no NA should be present. For this package, the clusters must be organized as a named vector: ```{r} clusters <- ccyclered$Cluster ### convert from Gene names to the new standard of Saccharomyces Genome Database (SGD) gene ids ccyclered$SGDID <- sub('^S','S00',ccyclered$SGDID) names(clusters) <- ccyclered$SGDID str(clusters) ``` Note that, in this example, we modified the names of the genes to match the names of entities in the Gene Ontology data (see next section). ## Obtaining Entity Attribute associations For genomics or proteomics data, one source of attributes is the Gene Ontology (http://www.geneontology.org/). For Yeast Gene Ontology attributes can also be downloaded from the [Saccharomyces Genome Database (SGD)] (http://www.yeastgenome.org/) In the ClusterJudge R package there is already a downloaded mapping of Yeast genes to attributes: ```{r, fig.show='hold', echo=TRUE} data(Yeast.GO.assocs); str(Yeast.GO.assocs); head(Yeast.GO.assocs); validate_association(Yeast.GO.assocs) ``` The *download_Yeast_GO_mapping()* function is also available to download a fresh copy of the mapping. For other species one can use specialized R and Bioconductor packages such as biomaRt as in the following `toy` example. ```{r, eval=FALSE, echo=TRUE} library(biomaRt) rn <- useDataset("rnorvegicus_gene_ensembl", mart=useMart("ensembl")) rgd.symbol=c("As3mt", "Borcs7", "Cyp17a1", "Wbp1l", "Sfxn2", "Arl3") ### exemplify for a limited set of genes entity.attr <- getBM(attributes=c('rgd_symbol','go_id'), filters='rgd_symbol', values=rgd.symbol, mart=rn) ``` The _validate_association()_ function quickly verifies if the input structure has two columns, if there are no NAs or NULLs and if there are no duplicated associations. ## Consolidating Entity-Attribute associations In some cases, the attributes may not bring additional useful information. For example, an attribute may be too specific as it is only assigned to one or two genes (e.g. it may simply be a new name of a gene). In other cases, two or more attributes may be very correlated and appear on almost the same set of genes. For the first case a consolidated association will ignore those attributes that belong to very few genes (*min.entities.per.attr*) ```{r, fig.width=6, fig.height=4} entities_attribute_stats(Yeast.GO.assocs) ### shows number of entities per attribute distribution Yeast.GO.assocs.cons1 <- consolidate_entity_attribute( entity.attribute = Yeast.GO.assocs , min.entities.per.attr =3 ### keep only attributes associated to 3 or more entities , mut.inf=FALSE ) dim(Yeast.GO.assocs) dim(Yeast.GO.assocs.cons1) ### shows reduction in the number of associations ``` For the second case we use the pre-calculated mutual information [@Cover1991] between attributes and impose a maximum _uncertainty_ level where uncertainty is defined as mutual information divided by the maximum of mutual information. ```{r, fig.width=6, fig.height=4} data(mi.GO.Yeast) Yeast.GO.assocs.cons <- consolidate_entity_attribute( entity.attribute = Yeast.GO.assocs , min.entities.per.attr =3 , mut.inf=mi.GO.Yeast ### use precalculated mutual information , U.limit = c(0.1, 0.001) ### calculate consolidated association for these uncertainty levels ) ### shows distribution of the number of pairs of attributes by Uncertainty str(Yeast.GO.assocs.cons) ``` The lower the Uncertainty limit the more attributes are ignored and the faster the judging of clusters will be performed. ## Calculating the mutual information In the case where the entity-attributes associations are downloaded from other organisms or other sources, mutual information can be calculated using this function: *attribute_mut_inf* : ```{r} data(Yeast.GO.assocs) ### because it takes time, we use a small sampled subset of associations entity.attribute.sampled <- Yeast.GO.assocs[sample(1:nrow(Yeast.GO.assocs),100),] mi.GO.Yeast.sampled <- attribute_mut_inf( entity.attribute = entity.attribute.sampled , show.progress = FALSE ## for this small example do not print progress ) str(mi.GO.Yeast.sampled) ``` The mutual information of the pair of attributes A,B is defined as mi(A,b) = H(A)+H(B) - H(cbind(A,B)) where H is the entropy. ## Judging the clustering for selected `uncertainty` levels The _clusters_ named vector prepared above and one of the consolidated entity-attribute associations (for example the one at 0.001 level of Uncertainty) are needed at this step. The _ClusterJudge_ will swap at random more and more entities between clusters and will calculate and plot the new total mutual information obtained at each iteration. If the total mutual information decreases by increasing the number of random swaps we can conclude that there is a good agreement between the clustering of experimental data and the ensemble of information provided by the entity-attribute associations. ```{r, fig.width=6, fig.height=4} mi.by.swaps<-clusterJudge( clusters = clusters , entity.attribute=Yeast.GO.assocs.cons[["0.001"]] , plot.notes='Yeast clusters judged at uncertainty level 0.001 - Ref: Tavazoie S,& all `Systematic determination of genetic network architecture. Nat Genet. 1999`' , plot.saveRDS.file= 'cj.rds') ### save the plot for later use p <- readRDS('cj.rds') ### retrieve the previous plot pdf('cj.pdf'); plot(p); dev.off() ### plot on another device ``` In most of the cases mutual information should have a decreasing trend when the number of shuffles increases. However, in rare cases, a consistent increasing trend might suggest novel discoveries; only when both experimental data and clustering are *very reliable*. In these rare cases a *high confident* clustering might complement or even contradict information accumulated in time into entity-attribute associations. *ClusterJudge* can also be seen as a complement to many of the Bioconductor packages related to evaluation and comparison of Clustering (such as [clustComp](http://www.bioconductor.org/packages/devel/bioc/html/clustComp.html ), [clusterExperiment](http://www.bioconductor.org/packages/devel/bioc/html/clusterExperiment.html ), [clusterSignificance](http://www.bioconductor.org/packages/devel/bioc/html/ClusterSignificance.html ), [GOpro](http://www.bioconductor.org/packages/devel/bioc/html/GOpro.html ), [multiClust](http://www.bioconductor.org/packages/devel/bioc/html/multiClust.html ) ). ## References