% \VignetteIndexEntry{Introduction to Mfuzz} % \VignetteDepends{Mfuzz} % \VignetteKeywords{Expression Analysis, Clustering} % \VignettePackage{Mfuzz} \documentclass[a4paper,11pt]{article} \usepackage{a4wide} \title{Short introduction to Mfuzz package} \author{Matthias E. Futschik\\ Institute for Theoretical Biology, Humboldt-University\\ URL: \textit{http://itb.biologie.hu-berlin.de/$\sim$futschik/software/R/Mfuzz}} \begin{document} \maketitle \tableofcontents \section{Overview} Clustering is an important tool in microarray data analysis. This unsupervised learning technique is commonly used to reveal structures hidden in large gene expression data sets. The vast majority of clustering algorithms applied so far produce \textit{hard partitions} of the data, i.e. each gene is assigned exactly to one cluster. \textit{Hard} clustering is favourable if clusters are well separated. However, this is generally not the case for microarray time-course data, where gene clusters frequently overlap. Additionally, hard clustering algorithms are often highly sensitive to noise. To overcome the limitations of hard clustering, \textit{soft} clustering can be applied offering several advantages to researchers \cite{bronwyn}. First, it generates accessible internal cluster structures, i.e. it indicates how well corresponding clusters represent genes. This additional information can be used for a refined search for regulatory elements. Second, the overall relation between clusters, and thus a global clustering structure, can be defined. Additionally, soft clustering is more noise robust and \textit{a priori} pre-filtering of genes can be avoided. This prevents the exclusion of biologically relevant genes from the data analysis. This vignette gives a short introduction to soft clustering using the \textit{Mfuzz} package. It misses some features (such as parameter selection and cluster stability) due to the size restrictions for Bioconductor vignettes. More information and a enlarged introduction can be found at the \textit{Mfuzz} webpage: \\ \hfill \fbox{\textit{http://itb.biologie.hu-berlin.de/$\sim$futschik/software/R/Mfuzz}}\hfill %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%% \section{Installation requirements} Following software is required to run the Mfuzz package: \begin{itemize} \item R (> 2.0.0). For installation of R, refer to \textit{http://www.r-project.org}. \item R-package: e1071. For installation of these add-on packages, refer to \textit{http://cran.r-project.org}. \item Bioconductor package: Biobase. Refer to \textit{http://www.bioconductor.org} for installation. \end{itemize} If these requirements are fulfilled, the \textit{Mfuzz} add-on R-package can be installed. To see how to install add-on R-packages on your computer system, start \textit{R} and type in \textit{help(INSTALL)}. Optionally, you may use the R-function \textit{install.packages()}. Once the \textit{Mfuzz} package is installed, you can load the package by <<>>= library(Mfuzz) @ %%%%%% DATA PREPARATION \section{Data pre-processing} To illustrate our approach, we apply soft clustering to yeast cell cycle expression data by Cho \textit{et al.} \cite{cho}. 6178 genes were monitored at 17 time points over a span of 160 minutes using Affymetrix chips. Note that we do not exclude here the array corresponding to the time-point of 90 mins which displays irregularities in the expression values measured. Additionally, the data set was modified by restricting the number of genes to 3000. Thus, results here might differ from those reported in reference~\cite{bronwyn}. <<>>= data(yeast) @ \subsection{Missing values} As a first step, we exclude genes with more than 25\% of the measurements missing. Note that missing values should be denoted by \texttt{NA} in the gene expression matrix. <<>>= yeast.r <- filter.NA(yeast, thres=0.25) @ Fuzzy c-means like many other cluster algorithms, does not allow for missing values. Thus, we replace remaining missing values by the average values expression value of the corresponding gene. <<>>= yeast.f <- fill.NA(yeast.r,mode="mean") @ Alternatively (and recommended), the (weighted) k-nearest neighbour method can be used (\texttt{mode=`knn'/`wknn'}). These methods perform usually favourable compared to the simple method above, but are computationally intensive. \subsection{Filtering} Most cluster analyses published include a filtering step to remove genes which are expressed at low levels or show only small changes in expression. Different filtering procedures have been proposed. A popular procedure is the setting of a minimum threshold for variation \\ Calculation the standard deviation shows, however, that the transition between low and high values for variation in gene expression is smooth and no particular cut-off point is indicated (figure~\ref{std}). <>= tmp <- filter.std(yeast.f,min.std=0) @ \begin{figure} \centering \resizebox{0.8\textwidth}{!}{\includegraphics{sd.png}} %<>= % tmp <- filter.std(yeast.f,min.std=0) %@ \caption{Standard deviation of gene expression vectors before standardisation. The genes were ordered by the standard deviation of the corresponding expression vector. A unique cut-off value for filtering is not prominent.} \label{std} \end{figure} Thus, the value of a filtering threshold remains arbitrary. As no stringent filtering procedure currently exists, we avoided any prior filtering of gene data. This prevents the loss of genes that may be biologically important. \subsection{Standardisation} Since the clustering is performed in Euclidian space, the expression values of genes were standardised to have a mean value of zero and a standard deviation of one. This step ensures that vectors of genes with similar changes in expression are close in Euclidean space: <<>>= yeast.s <- standardise(yeast.f) @ %%%%%%% SOFT CLUSTERING \section{Soft clustering of gene expression data} Clustering is often applied to reveal regulatory mechanisms underlying gene expression. It is well known that regulation of genes is generally not in an `on-off', but gradual manner which allows a finer control of the genes' functions. A cluster algorithm should reflect this finding by differentiating how closely a gene follows the dominant cluster patterns. Soft clustering appears as a good candidate for this task since it can assign a gene \textit{i} gradual degrees of membership $\mu_{ij}$ to a cluster $j$. The membership values can vary continuously between zero and one. This feature enables soft clustering to provide more information about the structure of gene expression data. Soft clustering is implemented in the function \texttt{mfuzz} using the fuzzy $c$-means algorithm (of the \textit{e1071} package) based on minimization of a weighted square error function \cite{bezdak}. For soft clustering, the cluster centroids $\textbf{c}_j$ result from the weighted sum of all cluster members and show the overall expression patterns of clusters. The membership values $\mu_{ij}$ indicate how well the gene $i$ is represented by cluster $\textbf{c}_{j}$. Low values $\mu_{ij}$ point to a poor representation of gene $i$ by $\textbf{c}_{j}$. Large values $\mu_{ij}$ point to a high correlation of the expression of gene $i$ with the cluster centroid $\textbf{c}_{j}$. The membership values are color-encoded in the plots generated by \texttt{mfuzz.plot}. This can facilitates the identification of temporal patterns in gene cluster (figure~\ref{fp1}). <>= cl <- mfuzz(yeast.s,c=16,m=1.25) mfuzz.plot(yeast.s,cl=cl,mfrow=c(4,4),time.labels=seq(0,160,10)) @ %\begin{figure}[ht!] %\centering %<>= % cl <- mfuzz(yeast.s,c=16,m=1.25) % mfuzz.plot(yeast.s,cl=cl,mfrow=c(4,4),time.labels=seq(0,160,10),new.window=FALSE) %@ \begin{figure}[t] \centering \resizebox{0.8\textwidth}{!}{\includegraphics{cluster.png}} \caption{Soft clusters of yeast cell cycle expression data. Yellow or green colored lines correspond to genes with low membership value; red and purple colored lines correspond to genes with high membership value. Note the peaks in negative and positive direction for some clusters at time-point 90 min which may correspond to an experimental artifact.} \label{fp1} \end{figure} %%%%% \subsection{Cluster cores} Membership values can also indicate the similarity of vectors to each other. If two gene expression vectors have a high membership value for a specific cluster, they are generally similar to each other. This is the basis for the definition of the core of a cluster. We define that genes with membership values larger than a chosen threshold $\alpha$ belong to the \textit{$\alpha$-core} of the cluster. This allows us to define relationships between genes within a cluster. Similarly to hierarchical clustering, the internal structures of clusters become accessible. % As an example, the $\alpha$-cores for the previous clustering can be displayed: %<>= % mfuzz.plot(yeast.s,cl=cl,mfrow=c(4,4),min.mem = 0.7,time.labels=seq(0,160,10)) %@ %\begin{figure}[t!] %\centering %<>= % mfuzz.plot(yeast.s,cl=cl,mfrow=c(4,4),min.mem = 0.7,time.labels=seq(0,160,10)) %@ %\caption{The same clustering is shown as in figure~\ref{fp1}, but only genes with membership values larger than 0.7 are displayed.} %\label{fp2} %\end{figure} The average within-cluster variation is considerably reduced setting $\alpha =0.7$. The use of the $\alpha$-threshold can therefore act as \textit{a posteriori} filtering of genes. This contrasts with previously discussed procedures which demand the problematic setting of a threshold \textit{a priori} to the cluster analysis To extract list of genes belonging to the cluster cores, the \texttt{acore} function can be used. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Cluster stability} Variation of the FCM parameter $m$ also allows investigation of the stability of clusters. We define here stable clusters as clusters that show only minor changes in their structure with variation of the parameter $m$. Stable clusters are generally well distinct of other clusters and compact. This is contrasted by weak clusters that lose their internal structure or disappear if $m$ was increased. <>= cl2 <- mfuzz(yeast.s,c=16,m=1.35) mfuzz.plot(yeast.s,cl=cl2,mfrow=c(4,4),time.labels=seq(0,160,10)) @ %\begin{figure}[ht!] %\centering %<>= % tmp <- mfuzz(yeast.s,c=16,m=1.35) % mfuzz.plot(yeast.s,cl=cl2,mfrow=c(4,4),time.labels=seq(0,160,10)) %@ % %\caption{Cluster resulting with increased $m=1.35$. Stable clusters contain their cluster cores, whereas weak cluster %show only diffused cores. Note that the numbering of the clusters might have changed compared to the previous %clustering, because of random initiation of the cluster centroids at the start of the clustering process.} %\label{fp3} %\end{figure} %\clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Global clustering structures} \noindent An interesting feature of soft clustering is the overlap or coupling between clusters. The coupling coefficient $V_{kl}$ between cluster $k$ and cluster $l$ can be defined by \begin{equation} V_{kl} = \frac{1}{N}\sum_{i=1}^{N} \mu_{ik} \mu_{il} \label{eq3} \end{equation} where $N$ is the total number of gene expression vectors. The coupling indicates how many genes are shared by two clusters. Clusters which have a low coupling show distinct overall patterns. If the coupling is large, clusters patterns are more similar. Hence, the coupling defines a similarity measure for pairs of clusters. <>= O <- overlap(cl) Ptmp <- overlap.plot(cl,over=O,thres=0.05) @ This allows the analysis of global clustering structures obtained by soft clustering, since relationships between clusters are defined. Similarly to hierarchical clustering, the global clustering structure can be examined at different resolutions determined by the cluster number $c$. For a small $c$, only the major clusters present in the data are obtained.%(figure~\ref{fp4} \& \ref{O4}). <>= cl3 <- mfuzz(yeast.s,c=10,m=1.25) mfuzz.plot(yeast.s,cl=cl3,mfrow=c(3,4)) O3 <- overlap(cl3) overlap.plot(cl3,over=O3,P=Ptmp,thres=0.05) @ \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% If $c$ is increased, sub-clusters with distinct patterns emerge %(figure~\ref{fp1}). Sub-clusters derived from a major cluster are generally strongly coupled, since they share the overall expression pattern. Finally, soft clustering produces empty clusters for further increase of $c$.% (figures~\ref{fp5} \& \ref{O5}). <>= cl4 <- mfuzz(yeast.s,c=25,m=1.25) mfuzz.plot(yeast.s,cl=cl4,mfrow=c(5,5)) O4 <- overlap(cl4) overlap.plot(cl4,over=O4,P=Ptmp,thres=0.05) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%% %BIBLIOGRAPHY \begin{thebibliography}{99} \bibitem{bronwyn} M.E. Futschik and B. Charlisle, Noise robust clustering of gene expression time-course data, Journal of Bioinformatics and Computational Biology, Vol. 3, No. 4, 965-988, 2005 \bibitem{cho} Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, Davis RW, A genome-wide transcriptional analysis of the mitotic cell cycle, \textit{Mol Cell}, \textbf{2}:65--73, 1998 \bibitem{bezdak} Bezdak JC, \textit{Pattern Recognition with Fuzzy Objective Function Algorithms}, Plenum Press, New York, 1981 \end{thebibliography} \end{document}