%\VignetteIndexEntry{GOsim} %\VignetteKeyword{GOsim} %\VignettePackage{GOsim} \documentclass[12pt,a4paper]{article} \usepackage{natbib} \usepackage{amsmath} \usepackage{amsfonts} \usepackage{hyperref} \usepackage[T1]{fontenc} \usepackage[latin1]{inputenc} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \begin{document} \SweaveOpts{concordance=TRUE} \author{Holger Fr\"ohlich} \title{The \emph{GOSim} package} \date{\today} \maketitle \section{Introduction} The Gene Ontology (GO) has become one of the most widespread systems for systematically annotating gene products within the bioinformatics community and is developed by the Gene Ontology Consortium \citep{GOConsortium04}. It is specifically intended for describing gene products with a controlled and structured vocabulary. GO terms are part of a Directed Acyclic Graph (DAG), covering three orthogonal taxonomies or "aspects": \emph{molecular function, biological process} and \emph{cellular component}. Two different kinds of relationship between GO terms exist: the "is-a" relationship and the "part-of" relationship. Providing a standard vocabulary across any biological resources, the GO enables researchers to use this information for automated data analysis. The \emph{GOSim} package \citep{Froehlich2007GOSim} provides the researcher with various information theoretic similarity concepts for GO terms \citep{Resnik95, Resnik99, Lin98, Jiang98, Lord03, Couto2003FuSSiMeg, Couto2005GraSM}. Moreover, since version 1.1.5 \emph{GOSim} contains several new similarity concepts, which are based on so-called diffusion kernel techniques \citep{Lerman2007ManifoldEmbedding}. Additionally \emph{GOSim} implements different methods for computing functional similarities between gene products based on the similarties between the associated GO terms \citep{FroeGO05, FroeSpeerGOKer06, Schlicker2006GOFuncSim, Lerman2007ManifoldEmbedding, Pozo2008GOFuncSim}. This can, for instances, be used for clustering genes according to their biological function \citep{FroeGO05, FroeSpeerGOKer06} and thus may help to get a better understanding of the biological aspects covered by a set of genes. Since version 1.1 \emph{GOSim} additionally offers the possibility of a GO enrichment analysis using the topGO package \citep{Alexa2006topGO}. Hence, \emph{GOSim} acts now as an umbrella for different analysis methods employing the GO structure. \section{Usage of \emph{GOSim}} To elucidate the usage of \emph{GOSim} we show an example workflow and explain the employed similarity concepts. We create a character vector of Entrez gene IDs, which we assume to be from human: <>= library(GOSim) genes=c("207","208","596","901","780","3169","9518","2852","26353","8614","7494") @ Next we investigate the GO annotation within the current ontology (which is \emph{biological process} by default): <>= getGOInfo(genes) @ \subsection{Term Similarities} Let us examine the similarity of the GO terms for genes "8614" and "2852" in greater detail: <>= getTermSim(c("GO:0007166","GO:0007267","GO:0007584","GO:0007165","GO:0007186"),method="Resnik",verbose=FALSE) @ This calculates Resnik's pairwise similarity between GO terms \citep{Resnik95,Resnik99}: \begin{equation} sim(t,t') = IC_{ms}(t,t') := \max_{\hat{t}\in Pa(t,t')} IC(\hat{t})\label{eq:Resnik} \end{equation} Here $Pa(t,t')$ denotes the set of all common ancestors of GO terms $t$ and $t'$, while $IC(t)$ denotes the information content of term $t$. It is defined as (e.g. \cite{Lord03}) \begin{equation} IC(\hat{t}) = -\log P(\hat{t}) \end{equation} i.e. as the negative logarithm of the probability of observing $\hat{t}$. The information content of each GO term is already precomputed for each ontology based on the empirical observation, how many times a specific GO term or any of its direct or indirect offsprings appear in the annotation of the GO with gene products. GOSim provides a normalized version of Resnik's similarity measure, which divides the information content of the minimum subsumber by the maximum information content of all GO terms, hence obtaining a number between 0 and 1. <>= data("ICsBPhumanall") IC[c("GO:0007166","GO:0007267","GO:0007584","GO:0007165","GO:0007186")] @ This loads the information contents of all GO terms within "biological process". Likewise, the data files {\tt ICsMFhumanall} and {\tt ICsCChumanall} contain the information contents of all GO terms within "molecular function" and "cellular component" for human. Since GOSim version 1.1.4.0 the information content of GO terms relies on the mapping of primary gene IDs (mainly Entrez) to GO terms provided by the libraries org.Dm.eg.db (fly), org.Hs.eg.db (human), org.Mm.eg.db (mouse), etc. Additionally, it is possible to pass a user provided mapping via the function {\tt setEvidenceLevel}. Please refer to the manual pages for details. If only GO terms having certain evidence codes should be considered, one must explicitely calculate the corresponding information contents in the function {\tt calcICs}. Again, more information on this function can be found in the manual pages. To continue our example from above, let us also calculate Jiang and Conrath's pairwise similarity between GO terms, which is the default, for compairson reasons \citep{Jiang98}: <>= getTermSim(c("GO:0007166","GO:0007267","GO:0007584","GO:0007165","GO:0007186"),verbose=FALSE) @ Jiang and Conrath's similarity measure is defined as \begin{equation} sim(t,t') = 1 - \min(1, IC(t) - 2IC_{ms}(t,t') + IC(t')) \end{equation} i.e. the similarity between $t$ and $t'$ is 0, if their normalized distance is at least 1. Likewise, we can also compute Lin's pairwise similarity between GO terms \citep{Lin98}: <>= getTermSim(c("GO:0007166","GO:0007267","GO:0007584","GO:0007165","GO:0007186"),method="Lin",verbose=FALSE) @ It is defined as: \begin{equation} sim(t,t') = \frac{2IC_{ms}(t,t')}{IC(t) + IC(t')}\label{eq:Lin} \end{equation} Resnik's, Jiang-Conraths's and Lin's term similarities all refer to $IC_{ms}(t,t')$, the information content of the minimum subsumer of $t$ and $t'$, i.e. of the lowest common ancestor in the hierarchy. For illustration let us plot the GO graph with leaves GO:0007166 and GO:0007267 and let us compute their minimum subsumer (see Fig. \ref{Fig:GOExample}): <>= library(igraph) G = getGOGraph(c("GO:0007166","GO:0007267")) G2 = igraph.from.graphNEL(G) plot(G2, vertex.label=V(G2)$name) @ <>= getMinimumSubsumer("GO:0007166","GO:0007267") @ %\begin{figure} %\begin{center} %\includegraphics[width=0.7\textwidth]{GOExample.pdf} %\caption{\label{Fig:GOExample}Example of a GO graph starting with leaves GO:0007166 and GO:0007267.} %\end{center} %\end{figure} In contrast to the above defined similarity measures Couto et al. \citep{Couto2005GraSM} introduced a concept, which is not based on the minimum subsumer, but on the set of all disjunctive common ancestors. Roughly speaking, the idea is not to consider the common ancestor having the highest information content only, but also others, if they are somehow "separate" from each other, i.e. there exists a path to $t$ or to $t'$ not passing any other of the disjunctive common ancestors. <>= getDisjCommAnc("GO:0007166","GO:0007267") @ In this case the set of disjunctive common ancestors consists of the minimum subsumer, GO:0007154, and its parent, GO:0009987, because from both there exists a path to GO:0007166 not passing any other disjunctive common ancestor(see Fig. \ref{Fig:GOPlot}). Based on the notion of disjunctive common ancestors Resnik's similarity concept can be extended by defining: \begin{equation} sim(t,t') = IC_{share}(t,t') = \frac{1}{|DisjCommAnc|}\sum_{t\in DisjCommAnc} IC(t) \end{equation} Likewise, Jiang-Conraths's and Lin's measures can be extended as well by replacing $IC_{ms}(t,t')$ by $IC_{share}(t,t')$. <>= getTermSim(c("GO:0007166","GO:0007267"),method="CoutoResnik",verbose=FALSE) @ Finally, it should be mentioned that also the depth and density enriched term similarity by Couto et al. \citep{Couto2003FuSSiMeg} has been integrated into \emph{GOSim}: <>= setEnrichmentFactors(alpha=0.5,beta=0.3) getTermSim(c("GO:0007166","GO:0007267"),method="CoutoEnriched",verbose=FALSE) @ Since version 1.1.5 \emph{GOSim} contains several new similarity concepts, which are based on so-called diffusion kernel techniques \citep{Lerman2007ManifoldEmbedding} rather than on the information theoretic ideas presented before. For using these similarity measures it is necessary to pre-compute a diffusion kernel on the Gene Ontology graph via \texttt{calc.diffusion.kernel}. This will take some time and result in a kernel/similarity matrix that is stored in a file called e.g. 'diffKernelpowerBPhumanall.rda' (meaning matrix power diffusion kernel for ontology BP in human using all evidence codes) in the current working directory. Once the kernel is created, it has to be loaded into the environment first \texttt{load.diffusion.kernel}. Afterwards GO term similarities can be computed via function \texttt{getTermSim}. Please check the manual pages for details. Since version 1.2 \emph{GOSim} also contains Schlicker et al.'s GO term similarity measure \citep{Schlicker2006GOFuncSim}, which is an adaption of Lin's similarity measure. Moreover, the graph information content similarity by Pesquita et al. has been implemented \citep{Pesquita2007GOSim}. <>= getTermSim(c("GO:0007166","GO:0007267","GO:0007584","GO:0007165","GO:0007186"),method="relevance",verbose=FALSE) @ \subsection{Functional Gene Similarities} The special strength of \emph{GOSim} lies in the possibility not only to calculate similarities for individual GO terms, but also for genes based on their complete GO anntation. Since \emph{GOSim} version 1.1.5 for this purpose the following ideas have been implemented: \begin{enumerate} \item Maximum \citep{Couto2003FuSSiMeg} and average pairwise GO term similarity \item Average of best matching GO term similarities \citep{Schlicker2006GOFuncSim}. \item Computation of a so-called \emph{optimal assignment} of terms from one gene to those of another one \citep{FroeSpeerGOKer06}. \item Similarity derived from Hausdorff distances between sets \citep{Pozo2008GOFuncSim}. \item Embedding of each gene into a feature space: \citep{FroeGO05,FroeSpeerGOKer06} proposed to define feature vectors by a gene's maximum GO term similarity to certain prototype genes. More simple (but probably also less accurate), \citep{Mistry2008GOFuncSim} recently proposed to represent each gene by a feature vector describing the presence/absence of all GO terms. The absence of each GO term is additionally weighted by its information content. Within a feature space gene functional similarities naturally arise as dot products between feature vectors. These dot products can be understood as so-called \emph{kernel functions} \citep{SchSmo02}, as used in e.g. Support Vector Machines \citep{CorVap95}. Depending on the choice of later normalization (see below) one can arrive at the cosine similarity (Eq.~\ref{eq:normalize}), at the Tanimoto coefficient (Eq.~\ref{eq:normalize2}) or at a measure similiar to Lin's one (Eq.~\ref{eq:normalize3}, Eq.~\ref{eq:Lin}). \end{enumerate} \subsubsection{Normalization of Similarities} Often, people want to normalize similarities, e.g. on the interval $[0, 1]$, for better interpretation. To do so, we can perform the transformation \begin{equation} sim_{gene}(g,g')\leftarrow\frac{sim_{gene}(g,g')}{\sqrt{sim_{gene}(g,g)sim_{gene}(g',g')}}\label{eq:normalize} \end{equation} Provided $sim_{gene} \geq 0$, the consequence will be a similarity of 1 for $g$ with itself and between 0 and 1 for $g$ with any other gene. In case of a feature space embedding this transformation is equivalent to computing the cosine similarity between two feature vectors. Another possibility is to use Lin's normalization (see Eq. \ref{eq:Lin}): \begin{equation} sim_{gene}(g,g')\leftarrow\frac{2sim_{gene}(g,g')}{sim_{gene}(g,g)+sim_{gene}(g',g')}\label{eq:normalize2} \end{equation} Furthermore, one can use a normalization in the spirit of the Tanimoto coefficient: \begin{equation} sim_{gene}(g,g')\leftarrow\frac{sim_{gene}(g,g')}{sim_{gene}(g,g)+sim_{gene}(g',g')-sim_{gene}(g,g')}\label{eq:normalize3} \end{equation} In case of a feature space embedding the transformation corresponds exactly to the Tanimoto coefficient betweem two feature vectors. We now give a more detailed overview over the different similarity concepts mentioned above. \subsubsection{Maximum and Average Pairwise GO Term Similarity} The idea of the maximum pairwise GO term similarity is straight forward. Given two genes $g$ and $g'$ annotated with GO terms $t_{1},...,t_{n}$ and $t'_{1},...,t'_{m}$ we define the functional similarity between between $g$ and $g'$ as \begin{eqnarray} sim_{gene}(g,g') & = & \max_{\begin{array}{c} i=1,,...,n\\ j=1,...,m\end{array}}sim(t_{i},t'_{j}) \end{eqnarray} where $sim$ is some similarity measure to compare GO terms $t_i$ and $t'_j$. This idea is, for instance, realized in FuSSiMeg \citep{Couto2003FuSSiMeg}. Instead of computing the maximum pairwise GO term similarity one may also take the average here. \subsubsection{Average of Best Matching GO Term Similarities } The idea of this approach \citep{Schlicker2006GOFuncSim} is to assign each GO term $t_i$ occuring in gene $g$ to its best matching partner $t'_{\pi{i}}$ in gene $g'$. Hence multiple GO terms from gene $g$ can be assigned to one GO term from gene $g'$. A similarity score is computed by taking the average similarity of assigned GO terms. Since, however, genes can have an unequal number of GO terms the result depends on whether GO terms of gene $g$ are assigned to those of gene $g'$ or vice versa. Hence, in \cite{Schlicker2006GOFuncSim} it was proposed to either take the maximum or the average of both similarity scores. Both strategies are implemented in \emph{GOSim}. \subsubsection{Optimal Assignment Gene Similarities} To elucidate the idea of the optimal assignment \citep{FroeSpeerGOKer06}, consider the GO terms associated with gene "8614" on one hand and gene "2852" on the other hand: <>= getGOInfo(c("8614","2852")) @ Given a similarity concept $sim$ to compare individual GO terms, the idea is now to assign each term of the gene having fewer annotation to exactly one term of the other gene such that the overall similarity is maximized. More formally the optimal assignment problem can be stated as follows: Let $\pi$ be some permutation of either an $n$-subset of natural numbers $\{1,...,m\}$ or an $m$-subset of natural numbers $\{1,...,n\}$ (this will be clear from context). Then we are looking for the quantity \begin{equation} sim_{gene}(g,g')=\left\{ \begin{array}{cc} \max_{\pi}\sum_{i=1}^{n}sim(t_{i},t'_{\pi(i)}) & \textrm{if }m>n\\ \max_{\pi}\sum_{j=1}^{m}sim(t_{\pi(j)},t'_{j}) & \textrm{ otherwise} \end{array}\right.\label{eq:OAKernel} \end{equation} The computation of (\ref{eq:OAKernel}) corresponds to the solution of the classical maximum weighted bipartite matching (optimal assignment) problem in graph theory and can be carried out in $O(\max(n,m)^{3})$ time \citep{LEDABook99}. To prevent that larger lists of terms automatically achieve a higher similarity we may further $sim_{gene}$ divide \ref{eq:OAKernel} by $\max(m,n)$. In our example, using Lin's GO term similarity measure the following assignments yielding a corresponding similarity matrix are found: <>= getGeneSim(c("8614","2852"),similarity="OA",similarityTerm="Lin",avg=FALSE, verbose=FALSE) @ Note the difference to a gene similarity that is just based on the maximum GO term similarity and to a gene similarity that is based on the average of best matching GO terms: <>= getGeneSim(c("8614","2852"),similarity="max",similarityTerm="Lin",verbose=FALSE) getGeneSim(c("8614","2852"),similarity="funSimMax",similarityTerm="Lin",verbose=FALSE) @ \subsubsection{Gene Similarities In the Spirit of Hausdorff Metrics} Hausdorff metrics are a general concept for measuring distances between compact subsets of a metric space. Let $X$ and $Y$ be the two sets of GO terms associated to genes $g$ and $g'$, and let $d(t,t')$ denote the distanc between GO terms $t$ and $t'$. Then the Hausdorff distance $X$ and $Y$ is defined as \begin{equation} d_{Hausdorff}(X,Y) = \max\{\sup_{t \in X} \inf_{t' \in Y} d(t,t'), \sup_{t' \in Y} \inf_{t \in X} d(t,t') \} \end{equation} Using Hausdorff metrics for measuring gene functional distances was proposed in \cite{Pozo2008GOFuncSim}. We translate the idea to define a similarity measure between $g$ and $g'$ (see the difference to previous GOSim versions): \begin{equation} sim_{gene}(g,g') = \exp(-d_{Hausdorff}(g,g')) \end{equation} <>= getGeneSim(c("8614","2852"),similarity="hausdorff",similarityTerm="Lin",verbose=FALSE) @ \subsubsection{Feature Space Embedding of Gene Products} \paragraph{The Simple Approach} \cite{Mistry2008GOFuncSim} proposed to represent each gene by a feature vector describing the presence/absence of all GO terms. The absence of each GO term is additionally weighted by its information content. In the feature space similarities arise as dot products. Hence, the similarity between two GO terms $t$ and $t'$ is implicitly defined as the product of their information content values, hence igoring the exact DAG structure of the Gene Ontology as employed by the GO term similarity measures explained in the beginning of this document. <>= getGeneSim(c("8614","2852"),similarity="dot",method="Tanimoto", verbose=FALSE) @ This will calculate the Tanimoto coefficient between feature vectors as a similarity measure. It is possible to retrieve the feature vectors via: <>= features = getGeneFeatures(c("8614","2852")) @ \paragraph{Embeddings via GO Term Similarities to Prototype Genes} This approach is due to \cite{FroeGO05,FroeSpeerGOKer06}. The idea is to define a feature vector for each gene by its pairwise GO term similarity to certain prototype genes, i.e. the prototype genes form a (nonorthogonal) basis, and each gene is defined relative to this basis. The prototype genes can eithed be defined a priori or one can use one of the heuristics implemented in the function {\tt selectPrototypes}. The default behavior is to select the 250 best annotated genes, i.e. which have been annotated with GO terms most often, but here we just use 3 for computational reasons: <>= proto = selectPrototypes(n=3,verbose=FALSE) @ We now calculate for each gene $g$ feature vectors $\phi(g)$ by using their similarity to all prototypes $p_1,...,p_n$: \begin{equation} \phi(g) = (sim'(g,p_1),...,sim'(g,p_n))^T \end{equation} Here $sim'$ by default is the maximum pairwise GO term similarity. Alternatively, one can use other similarity measures for $sim'$ as well. These similarity measures can by itself again be combined with arbitrary GO term similarity concepts. The default is the Jiang-Conrath term similarity. Because the feature vectors are very high-dimensional we usually perform a principal component analysis (PCA) to project the data into a lower dimensional subspace. The results are not shown here due to long computation time. <>= PHI = getGeneFeaturesPrototypes(genes,prototypes=proto, verbose=FALSE) @ This uses the above define prototypes to calculate feature vectors and performs a PCA afterwards. The number of principal components is chosen such that at least 95\% of the total variance in feature space can be explained (this is a relatively conservatve criterion). We can now plot our genes in the space spanned by the first 2 principal components to get an impression of the relative "position" of the genes to each other in the feature space (see Fig. \ref{Fig:GOPCAExample}). The feature vectors are normalized to Euclidian norm 1 by default: <>= x=seq(min(PHI$features[,1]),max(PHI$features[,1]),length.out=100) y=seq(min(PHI$features[,2]),max(PHI$features[,2]),length.out=100) plot(x,y,xlab="principal component 1",ylab="principal component 2",type="n") text(PHI$features[,1],PHI$features[,2],labels=genes) @ \begin{figure} \begin{center} \includegraphics[width=0.5\textwidth]{GOPCAExample.pdf} \caption{\label{Fig:PCAPlot}Embedding of genes into feature space spanned by the first 2 principal components} \end{center} \end{figure} Finally, we can directly calculate the similarities of the genes to each other, this time using the Resnik's GO term similarity concept. These similarities may then be used to cluster genes with respect to their function: <>= sim = getGeneSimPrototypes(genes[1:3],prototypes=proto,similarityTerm="Resnik",verbose=FALSE) h=hclust(as.dist(1-sim$similarity),"average") plot(h,xlab="") @ This produces a hierarchical clustering of all genes using average linkage clustering (see Fig. \ref{Fig:GOCluster}). \begin{figure} \begin{center} \includegraphics[width=0.5\textwidth]{GOClusterExample.pdf} \caption{\label{Fig:GOCluster}Possible functional clustering of the genes using Ward's method.} \end{center} \end{figure} \subsubsection{Combination of Similarities from Different Ontologies} It should be mentioned that up to now all similarity computations were performed within the ontology "biological process". One could imagine to combine functional similarities between gene products with regard to different taxonomies. An obvious way for doing so would be to consider the sum of the respective similarities: \begin{equation} sim_{total}(g,g') = sim_{Ontology1}(g,g') + sim_{Ontology2}(g,g') \end{equation} Of course, one could also use a weighted averaging scheme here, if desired. \subsection{Cluster Evaluations} \emph{GOSim} has the possibility to evaluate a given clustering of genes or terms by means of their GO similarities. Supposed, based on other experiments (e.g. microarry), we have decided to put genes "8614", "9518", "780", "2852" in one group, genes "3169", "207", "7494", "596" in a second and the rest in a third group. Then we can ask ourselves, how similar these groups are with respect to their GO annotations: <>= ev = evaluateClustering(c(2,3,2,3,1,2,1,1,3,1,2), sim$similarity) plot(ev$clustersil,main="") @ \begin{figure}[t] \begin{center} \includegraphics[width=0.5\textwidth]{GOClustersil.pdf} \caption{\label{Fig:GOClustersil}Silhouette plot of a possible given grouping of genes.} \end{center} \end{figure} A good indiciation of the clustering qualitiy can be obtained by looking at the cluster silhouettes \citep{rousseeuw87} (see Fig. \ref{Fig:GOClustersil}). This shows that clusters 1 and 2 are relatively homogenous with respect to the functional similarity of the genes contained in it, while the genes in cluster 3 are more dissimilar. \subsection{GO Enrichment Analysis} Since version 1.1 \emph{GOSim} also offers the possibility of a GO enrichment analysis. Suppose, we may now want to get a clearer picture of the genes involved in cluster 1. For this purpose we use the topGO tool \citep{Alexa2006topGO}. <>= library(org.Hs.eg.db) library(topGO) allgenes = union(c("8614", "9518", "780", "2852"), sample(keys(org.Hs.egGO), 1000)) # suppose these are all genes GOenrichment(c("8614", "9518", "780", "2852"), allgenes) # print out what cluster 1 is about @ \bibliographystyle{apalike} \bibliography{references} \end{document}