%%(13/1/10 Rich Savage) %%Sweave vignette for the R package BHC %% % \VignetteIndexEntry{BHC Overview} % \VignetteKeywords{cluster} % \VignetteDepends{graphics, affydata, gcrma} \documentclass{article} \usepackage{amsmath} \usepackage{amscd} \usepackage[tableposition=top]{caption} \usepackage{ifthen} \usepackage[utf8]{inputenc} \begin{document} \title{BHC: Bayesian Hierarchical Clustering} \author{Richard S. Savage} \maketitle The BHC method performs bottom-up hierarchical clustering, using a Dirichlet Process (infinite mixture) to model uncertainty in the data and Bayesian model selection to decide at each step which clusters to merge. This avoids several limitations of traditional methods, for example how many clusters there should be and how to choose a principled distance metric. This implementation accepts multinomial (i.e. discrete, with 2+ categories) data. The paper `R/BHC: fast Bayesian hierarchical clustering for microarray data' (Savage et al. 2009, BMC Bioinformatics) contains a quantitative comparison of the performance of the BHC algorithm with that of a conventional agglomerative hierarchical clustering method (using an uncentred correlation coefficient as a distance metric and complete linkage). Gene Ontology (GO) annotations are used to show that BHC produces more biologically-relevant gene clustering results. Here is an example of how to use the BHC package. <>= require(graphics) require(BHC) require(affydata) require(gcrma) data(Dilution) ai <- compute.affinities(cdfName(Dilution)) Dil.expr <- gcrma(Dilution,affinity.info=ai,type="affinities") testData <- exprs(Dil.expr) keep <- sd(t(testData))>0 testData <- testData[keep,] testData <- testData[1:100,] geneNames <- row.names(testData) nGenes <- (dim(testData))[1]; nFeatures <- (dim(testData))[2]; nFeatureValues <- 4 ##NORMALISE EACH EXPERIMENT TO ZERO MEAN, UNIT VARIANCE for (i in 1:nFeatures){ newData <- testData[,i] newData <- (newData - mean(newData)) / sd(newData) testData[,i] <- newData } ##DISCRETISE THE DATA ON A GENE-BY-GENE BASIS ##(defining the bins by equal quartiles) for (i in 1:nGenes){ newData <- testData[i,] newData <- rank(newData) - 1 testData[i,] <- newData } ##PERFORM THE CLUSTERING hc <- bhc(testData, geneNames, nFeatureValues=nFeatureValues) @ %%NOW GENERATE THE ACTUAL PLOT <>= plot(hc, axes=FALSE) @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{BHC example plot} \label{fig:one} \end{figure} <>= ##OUTPUT CLUSTER LABELS TO FILE WriteOutClusterLabels(hc, "labels.txt", verbose=FALSE) @ \end{document}