% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Yeast PPI Example} %\VignetteDepends{Biobase, graph, GO.db, RBGL, org.Sc.sgd.db} %\VignetteSuggests{Rgraphviz} %\VignetteKeywords{EDA, graphs, protein protein interaction} %\VignettePackage{yeastExpData} \documentclass[12pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{times} \usepackage{comment} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \bibliographystyle{plainnat} \title{Using graphs to relate expression data and protein-protein interaction data} \author{R. Gentleman and D. Scholtens} \begin{document} \maketitle \section*{Introduction} <>= library("yeastExpData") library("RBGL") require("Rgraphviz", quietly=TRUE) data(ccyclered) @ In \cite{Ge} the authors consider an interesting question. They assemble gene expression data from a yeast cell-cycle experiment \citep{rc:agwtamcc}, literature protein-protein interaction (PPI) data and yeast two-hybrid data. We have curated the data slightly to make it simpler to carry out the analyses reported in \cite{Ge}. In particular we reduced the data to the \Sexpr{nrow(ccyclered)} genes that were common to all experiments. We have also represented most of the data in terms of graphs (nodes and edges) since that is the form we will use for most of our analyses. The results reported in this document are based on version \Sexpr{packageDescription("yeastExpData")[["Version"]]} of the \Rpackage{yeastExpData} data package. \subsection{Graphs} The cell-cycle data is stored in \Robject{ccyclered} and from this it is easy to create a \textit{cluster}-graph. A cluster graph is simply a graph that is created from clustered data. In this graph all genes in the same cluster have edges between them and there are no between cluster edges. <>= clusts = split(ccyclered[["Y.name"]], ccyclered[["Cluster"]]) cg1 = new("clusterGraph", clusters = lapply(clusts, as.character)) ccClust = connComp(cg1) @ %%FIXME: is there any information on what the different clusters are? Next we are interested in looking for associations between the clusters in this graph (genes that show similar patterns of expression) and the literature curated set of protein-protein interactions. We note that these are not really literature purported protein complexes (readers might want to use the output of a procedure like that in \Rpackage{apComplex} applied to TAP data to explore protein complex data). We also note that the data used to create these graphs is already somewhat out of date and that there is new data available from MIPS. We have chosen to continue with these data since they align closely with the report in \cite{Ge}. <>= data(litG) ccLit = connectedComp(litG) cclens = sapply(ccLit, length) table(cclens) ccL2 = ccLit[cclens>1] ccl2 = cclens[cclens>1] @ We see that there are most of the proteins do not have edges to others and that there are a few, rather large sets of connected proteins. We can plot a few of those. <>= sG1 = subGraph(ccL2[[5]], litG) sG2 = subGraph(ccL2[[1]], litG) @ \begin{figure}[htbp] \begin{center} <>= if (require("Rgraphviz")) plot(sG1, "neato") @ \caption{One of the larger PPI graphs \label{fig:sg1}} \end{center} \end{figure} \begin{figure}[htbp] \begin{center} <>= if (require("Rgraphviz")) plot(sG2, "neato") @ \caption{Another of the larger PPI connected components. \label{fig:sg2}} \end{center} \end{figure} It is worth noting that the structure in Figures~\ref{fig:sg1} and \ref{fig:sg2} is suggestive of collections of proteins that form cohesive subgroups that work to achieve particular objectives. An open problem is the development of algorithms (and ultimately software and statistical models) capable of reliably identifying the constituent components. Among the important aspects of such a decomposition is the notion that some proteins will be involved in multiple complexes. \section{Testing Associations} The hypothesis presented in \cite{GePPI} was to investigate a potential correlation between expression clusters and interaction clusters. The devised a strategy called the \textit{transcriptome-interactome correlation mapping strategy} to assess the level of dependence. Here we reformulate their ideas (a more extensive discussion with some extensions is provided in \cite{Raji}). The basic idea is to represent the different sets of interactions (relationships) in terms of graphs. Each graph is defined on the same set of nodes (the genes/proteins that are common to all experiments) and the specific set of relationships are represented as edges in the appropriate graph. Above we created one graph where the edges represent known (literature based) interactions, \Robject{litG} and the second is based on clustered gene expression data, \Robject{cg1}. We can now determine how many edges are in common between the two graphs by simply computing their intersection. <>= commonG = intersection(litG, cg1) commonG @ And we can see that there are 42 edges in common. Since the literature graph has \Sexpr{numEdges(litG)} and the gene expression graph has \Sexpr{numEdges(cg1)} we might wonder whether 42 is remarkably large or remarkably small. One way to test this assumption is to generate an appropriate null distribution and to compare the observed value (42) to the values from this distribution. While there are some reasons to consider a random edge model (as proposed by Erdos and Renyi) there are more compelling reasons to condition on the structure in the graph and to use a node label permutation distribution. This is demonstrated below. Note that this does take quite some time to run though. <>= ePerm = function (g1, g2, B=500) { ans = rep(NA, B) n1 = nodes(g1) for(i in 1:B) { nodes(g1) = sample(n1) ans[i] = numEdges(intersection(g1, g2)) } return(ans) } set.seed(123) ##takes a long time #nPdist = ePerm(litG, cg1) data(nPdist) max(nPdist) @ \section{Data Analysis} Now that we have satisfied our testing curiosity we might want to start to carry out a little exploratory data analysis. There are clearly some questions that are of interest. They include the following: \begin{itemize} \item Which of the expression clusters have intersections and with which of the literature clusters? \item Are there expression clusters that have a number of literature cluster edges going between them (and hence suggesting that the expression clustering was too fine, or that the genes involved in the literature cluster are not cell-cycle regulated). \item Are there known cell-cycle regulated protein complexes and do the genes involved tend to cluster together? \item Is the expression behavior of genes that are involved in multiple literature clusters (or at least that we suspect of being so involved) different from that of genes that are involved in only one cluster? \end{itemize} Many of these questions require access to more information. For example, we need to know more about the pattern of expression related to each of the gene expression clusters so that we can try to interpret them better. We need to have more information about the likely protein complexes from the literature data so that we can better identify reasonably complete protein complexes (and given them, then identify those genes that are involved in more than one complex). But, the most important fact to notice is that all of the substantial calculations and computations (given the meta-data) are very simple to put in graph theoretic terms. \bibliography{GOstats} \end{document}