% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteDepends{graph,Rgraphviz} %\VignetteKeywords{Protein Complex, Graphs} %\VignettePackage{apComplex} %\VignetteIndexEntry{apComplex} \documentclass[11pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \usepackage{times} \usepackage{comment} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \bibliographystyle{plainnat} \begin{document} \title{simulatorAPMS} \author{Tony Chiang \\ Denise Scholtens \\ Robert Gentleman} \date{} \maketitle \Rpackage{simulatorAPMS} is a tool that uses computational tools to simulate affinity purification (co-precipitation) technologies as well as the mass spectrometry used to ascertain protein-protein co-membership data. Simulation of the affinity purification - mass spectrometry (AP-MS) technologies is based on the bait to prey model. <>= library(simulatorAPMS) @ \section*{AP-MS Data} There are various bait to prey models and the fundamental principals differ slightly. For each cell or tissue under some conditions, certain portions of the genome is expressed, and the resulting proteins interact, forming multi-protein structures to perform some biological process. Based on these underlying biological principals, affinity purification employs a bait to prey technology where certain set of expressed proteins are tagged or labelled and infused into some environment (e.g. in vivo) where another set of proteins (usually not dis-joint from the bait set) reside. It is the affinity between proteins in the latter set with each protein in the bait set which is the data the technology is designed to collect. Purification of the protein to protein affiliation is possible because of the modification to the bait proteins. The most common method is physically inserting a tag or labell which can be recovered. One technique forces affiliated sets of proteins through a layer of beads which have a strong physical attraction to the tags inserted into the baits. Once the baits (along with it affiliated proteins called prey) have been successfully extracted, mass spectrometry is used to decipher and identify which proteins comprise of each clustering. We can represent this model of AP-MS bait to prey technology with an adjacency matrix $A$. This matrix is a $\{0,1\}$-matrix which documents affiliation between bait proteins and prey proteins. \vspace{.2in} \begin{tabular}{cc|cccccc} & \multicolumn{1}{c}{} & \multicolumn{6}{c}{Hits} \\ & \multicolumn{1}{c}{} & $P_{1}$ & $P_{2}$ & $P_{3}$ & $P_{4}$ & $P_{5}$ & $P_{6}$ \\ \cline{3-8} & $P_{1}$ & 1 & 1 & 0 & 1 & 0 & 1 \\ Baits & $P_{2}$ & 1 & 1 & 0 & 1 & 0 & 1 \\ & $P_{3}$ & 0 & 0 & 1 & 1 & 1 & 0 \\ \end{tabular} \vspace{.2in} \noindent The rows of the matrix are baits, the columns are hits, an entry of 1 in the $i$th row and $j$th column indicates that bait protein $i$ finds protein $j$ as a hit, and an entry of 0 in the $i$th row and $j$th column indicates that bait protein $i$ does not find protein $j$ as a hit. The diagonal entries are 1 since a protein is always a complex comember with itself. Note that bait proteins can be found as hits by other bait proteins. Also note that some proteins are never used as baits. A graph of the data is useful for understanding which comembership relationships are tested in AP-MS experiments and which are not. In the graph in Figure \ref{fig:trueData}, nodes represent proteins and directed edges from baits to hits represent complex comembership. Bait proteins are yellow and hit-only proteins (i.e. proteins that are found as hits but never used as baits) are white. Directed edges always originate at yellow bait proteins and point to the set of hits detected by that bait. The red reciprocated edge connecting $P_{1}$ and $P_{2}$ represents a bait-bait relationship that is tested twice, once in each purification. The gray unreciprocated edges represent bait-hit-only edges that are only tested once. Missing edges between baits and other baits or hit-only proteins represent comemberships that are tested, but not observed. Edges between hit-only proteins are always missing, not because the comemberships are known not to exist, but because they are never tested. %%<>= %%library(Rgraphviz) %%FromT <- c(rep("P1",3),rep("P2",3),rep("P3",2)) %%ToT <- c("P2","P4","P6","P1","P4","P6","P4","P5") %%LT <- cbind(FromT,ToT) %%M1T <- ftM2adjM(LT) %%apEXGT <- as(M1T,"graphNEL") %%fcT <- c(rep("yellow",3),rep("white",3)) %%names(fcT) <- c("P1","P2","P3","P4","P5","P6") %%nAT <- list(fillcolor=fcT) %%ecT <- c("red","red",rep("gray",6)) %%names(ecT) <- %%c("P1~P2","P2~P1","P1~P6","P1~P4","P2~P4","P2~P6","P3~P4","P3~P5") %%eAT <- list(color=ecT) %%plot(apEXGT,nodeAttrs=nAT,edgeAttrs=eAT) %%@ %%\begin{figure}[htbp] %%\begin{center} %%\includegraphics[width=0.5\textwidth]{simulatorAPMS-trueData} %%\caption{\label{fig:trueData} True complex comemberships that would be %%detected with perfectly sensitive and specific AP-MS technology using $P_{1}$, %%$P_{2}$, and $P_{3}$ as baits.} %%\end{center} %%\end{figure} In reality, AP-MS technology is neither perfectly sensitive nor specific, resulting in false positive (FP) and false negative (FN) observations of the complex comemberships under investigation. Suppose in this experiment, we make a FN observation between $P_{2}$ and $P_{4}$, i.e. $P_{4}$ is not found as a hit when we use $P_{2}$ as a bait. Also suppose we make two FP observations: 1) when we use $P_{3}$ as a bait, we find an extraneous hit-only protein $P_{7}$, and 2) when performing a purification using $P_{8}$ as a bait, we find $P_{3}$ as a hit. Data for this example are contained in the matrix \Robject{apEX}. In this matrix, rows again represent baits and columns represent hits. <>= data(apEX) apEX @ The graph of the data in Figure \ref{fig:observedData} demonstrates the observations recorded in \Robject{apEX}. Note the missing edge from $P_{2}$ to $P_{4}$ and the new edge from $P_{3}$ to $P_{7}$. Also note the blue unreciprocated edge between $P_{3}$ and $P_{8}$ -- this is a complex comembership that was tested twice when $P_{3}$ and $P_{8}$ were used as baits, but only detected once in the purification using $P_{8}$ as a bait. %%<>= %%From <- c(rep("P1",3),rep("P2",2),rep("P3",3),"P8") %%To <- c("P2","P4","P6","P1","P6","P4","P5","P7","P3") %%L <- cbind(From,To) %%M1 <- ftM2adjM(L) %%apEXG <- as(M1,"graphNEL") %%fc <- c(rep("yellow",4),rep("white",4)) %%names(fc) <- c("P1","P2","P3","P8","P4","P5","P6","P7") %%nA <- list(fillcolor=fc) %%ec <- c("red","red","blue",rep("gray",6)) %%names(ec) <- %%c("P1~P2","P2~P1","P8~P3","P1~P6","P1~P4","P2~P6","P3~P4","P3~P5","P3~P7") %%eA <- list(color=ec) %%plot(apEXG,nodeAttrs=nA,edgeAttrs=eA) %%@ %%\begin{figure}[htbp] %%\begin{center} %%\includegraphics[width=0.5\textwidth]{simulatorAPMS-observedData} %%\caption{\label{fig:observedData} Hypothetical data from an AP-MS experiment %%with a FN observation between $P_{2}$ and $P_{6}$ and FP observations %%between $P_{3}$ and $P_{7}$ and $P_{8}$ and $P_{3}$.} %%\end{center} %%\end{figure} \section{Preliminaries} Before we proceed onto the package itself, it is necessary to acquaint ourselves with the language we will use. We can consider any experimental technology as a black box; the input to the black box is some true state of nature (TSN) which is basically an in silico interactome (ISI), the collection of protein complexes for some cell under some conditions. The output of the black box is an output from the simulated experiments. It is clear that the experiment begins with some representation of the truth. For complex co-membership, we begin with an ISI described by its bipartite graph: one set of nodes represents proteins; the other, protein complexes. The input is the incidence matrix representation of the bipartite graph: the rows index the proteins; the columns, protein complexes. The matrix has a one in row $i$ and column $j$ if protein $i$ is part of complex $j$; otherwise that entry is zero. Here is an example of such a matrix: <>= data(simEX) simEX @ Each organism or cell has an unique interactome for some specified conditions and time. For our in silico interactome, we have taken the estimation of apComplex on the HMSPCI yeast data set by Ho, et al. We note that any estimate can serve as the ISI. Once we have the bipartite graph representation of the ISI, we may begin the simulation of the APMS technology. \section{Simulator} In this section we discuss how to use the simulator in the \Rpackage{simulatorAPMS} package. In describing how to use the main function, \Rfunction{runSimulators()}, we will work with an example: >runSimulators(TSNMatEX, vBaitsEX, vDeformed, vSticky, rateFP, rateFN, rateD, rateS, missedProtHMSPCI, Seed) First we describe each of the inputs. Each input parameter has a wet-lab correspondent, and we describe the how each affects the output of the experiment and, thus, the simulation. \\ \\ TSNMatEX - The first parameter is the incidence matrix of the bipartite graph representation of our ISI. It is our model true state of nature. In the wet-lab experiment, this maybe some cell line or tissue under some specified conditions. The goal of AP-MS technology is to estimate $TSNMatEX$, so too then does our simulation technology. \\ \\ vBaitsEX - Just as AP-MS technology uses a subset of proteins from the cell or tissue of interests as bait, so too does our simulation use a subset of proteins as baits. This parameter is given as a character vector composed of the protein names used as baits which is identical to names used for the rows of the $TSNMatEx$. \\ \\ vDeformed - In AP-MS, baits need to be tagged so that they can be identified at the end of the affinity process. Certain baits have been experimentally verified to lose normal functionality when tagged. Thus deformed baits are known to interact with very few proteins in the experimentation giving rise false negative observations. The vDeformed parameter is also a character vector of protein names; they are a subset of the vBaitsEX vector. \\ \\ vSticky - In the course of the experiment, some baits are found to falsely interact with a large number other proteins. Proteins that have high occurences of interactions with a large number of other proteins are called $sticky$, and as this name suggest, these proteins deliver a high number of false positive interactions. The vSticky parameter is also a subset of vBaitsEX. \\ \\ rateFP - Through any experiment, there are a number of stochastic elements by which errors can occur. How sensitive the technology, how specific the technology, etc all contribute to error in the observed data. The rateFP parameter is a scalar in the unit interval dictating the probability of a false positive interaction between bait $b$ and protein $p$. \\ \\ rateFN - Analogous to rateFP, rateFN is the probability that the technology (simulation) will record a false negative interaction. \\ \\ rateD - For each deformed bait, we need to describe how deformed the protein has become, i.e. what proportion of prey will each deformed bait miss. rateD is a named vector of scalars (of the same size as vDeformed) estimating this proportion for each deformed bait. \\ \\ rateS - Analogous to rateD, rateS is a vector of scalars estimating how the proportion of proteins to which each proteins interacts. \\ \\ missedProtHMSPCI - The ISI's bipartite graph might not include all the proteins, since proteins not involved in any non-trivial complexes need not be recorded in the bipartite graph. These proteins need to be re-inserted to the in silico model organism's bipartite graph since they would be present in the wet-lab organism. In the simulation, these prpteins will contribute to any false positive interactions within the simulation. \\ \\ seed - The $seed$ is not used in the wet-lab experiments. It is a computational parameter that sets the seed for the random integer generator for the purpose of reproducible data sets. seed is any three digit positive integer. \\ \\ The output from \Rfunction{runSimulators()} does not represent a bipartite graph. The simulation derives the protein-protein interaction graph by $TSNMatEX \otimes TSNMatEX^{T}$. This interaction graph does not represent direct binary protein-protein interactions, but rather protein co-membership interaction. It is from this matrix that the simulated output is derived. This output reflects the data output of an actual AP-MS experiment, and it is given as the incidence matrix representation where each row indexed by the bait proteins and the columns represents all proteins found in the cell. <>= data(TSNMatrix) data(vBaitsEX) data(missedProtEX) vDeformed <- vBaitsEX[2] vSticky <- vBaitsEX[5] rateFP <- 0.10 rateFN <- 0.25 rateD <- 0.50 rateS <- 0.66 seeed <- 237 runSimulators(TSNMatrix, vBaitsEX, vDeformed, vSticky, rateFP, rateFN, rateD, rateS, missedProtEX, seeed) @ \section{Complex Estimation Algorithm} From the simulated data, the \Rpackage{simulatorAPMS} package calls the \Rfunction{findComplexes()} function in the \Rpackage{apComplex} package, a complex co-membership algorithm. For detailed instructions concerning the \Rpackage{apComplex} package, the user can refer to its own vignette. One of the functionality of the \Rpackage{simulatorAPMS} is its ability to test the effectiveness of the complex estimation algorithms. Though the \Rpackage{simulatorAPMS} defaults to \Rpackage{apComplex} as the complex estimator, the user can chose to use any other complex estimator. <>= data(runSimEX) data(vBaitsEX) runAPComplex(runSimEX, vBaitsEX) @ The output from \Rpackage{apComplex} is the incidence matrix representation of the bipartite graph of proteins to protein complexes. This output is an estimate for the in silico model organism $A^*$. After calculating this esimate, it is necessary to ascertain how accurate this estimate is - either with respect to the in silico model organism or compared to other estimation algorithms. \section{Statistical Tools} The last component of \Rpackage{simulatorAPMS} is its statistical tools. It is necessary to compare the simulated test results with the in silico model organism's true state of nature, $A$. When we have calculated an estimate, $\hat{A}$ using \Rpackage{apComplex}, the first step in the comparison between $A$ and $\hat{A}$ is calculating three statistics between the complexes $C_i \in A$ and $K_j \in \hat{A}$: (1) $C_i \cap K_j$; (2) $C_i \setminus K_j$; and (3) $K_j \setminus C_i$. To calculate these three statistics, we call the \Rfunction{runCompareComplex()} function. The return value of \Rfunction{runCompareComplex} is a list containing three matrices. The first component of the list is a matrix of intersection values between the complexes of $A$ and $\hat{A}$. The second and third components are a matrix of the difference values between the complexes of $A$ and $\hat{A}$ and a matrix of the difference values between the complexes of $\hat{A}$ and $A$ respectively. Once we have caluculated these statistics, we can compute similarity measures between the complexes of $A$ and $\hat{A}$. The first method involves using the Jaccard similarity coefficient. This coefficient is a simple measure of simularity between two complexes by taking the ratio of overlapping sections of the complexes to combining the two complexes. For complex $C_i$ and $K_j$, we define the Jaccard Matrix as $JC$ with the $\{i,j\}th$ entry as \begin{equation} JC_{ij} = \frac{|C_i \cap K_j|}{|C_i \cup K_j|} \end{equation} To call the \Rfunction{JaccardCoef} function, we only need the return value of the \Rfunction{runCompareComplex} function: The Jaccard index has its benefits. The main benefit of the Jaccard index is that it is a very intuitive statistic. It is clear that if two complexes have almost indentical composition, the similarity measure ought to reflect this. It is pretty easy to see that if the two complexes are identical, the Jaccard index is one, and if they are completely different, the index is zero. Because of its relatively easy structure, the index is a good measure of similarity. The matrix of Jaccard coefficients above has its rows indexed by the complexes of $A$ and the columns indexed by the estimate $\hat{A}$. The second method uses a variant of the Kullbach-Liebler formula to calculate the independence of two probability distributions. For a random protein $p$, we can calculate three probabilities: 1. the probability that $p$ is in the complex $C_i$; 2. the probability that $p$ is in the complex $K_j$; and 3. the probability that $p$ is in both the complexes $C_i$ and in $K_j$. With these three probabilities, we can calculate the degree of independence of (1) and (2) with respect to (3). Independence implies that the estimate is not be closely aligned to the truth. To call the \Rfunction{compIndep()}, we need three parameters, the matrix of the bipartite graph of $A$, the matrix of the bipartite graph of $\hat{A}$, and the intersection matrix of the function \Rfunction{runCompareComplex()}. After we have calculated a similarity measures, we need to align complexes of $A$ to complexes of $\hat{A}$. To find an optimal alignment, we employ a version of a greedy algorithm we call \Rfunction{compBijection}. We find the largest entry in the similarity matrix (either Jaccard or the Kullbach-Liebler); its row and column correspond to particular complexes in $A$ and in the estimate $\hat{A}$, and so we align these complexes. We then delete this row and column from the matrix and recursively call \Rfunction{compBijection()} on the smaller matrix. If there is a tie in one row (or one column), we chose larger complexes to align first, since the relative size of the complexes is an important indicator of how well the estimation algorithm works. To call the alignment function, we need three parameters: the matrix of $A$, the matrix of $\hat{A}$, and either the Jaccard Matrix or the Kullbach-Liebler Matrix: The output of the \Rfunction{runAlignment} function is a matrix where the row contains the alignment of the complexes of $A$ and $\hat{A}$ with the similarity coefficient (whichever one the user decided to use) is given in the third column. Thus is it easy to see how well the estimation algoithm performed the complexes of the estimate are aligned with those of the ISI. There is one particular caveat for this function; the greedy algorithm is known not to return optimal alignments is some pathelogical cases though in most cases it does. \section*{Where do we go} The \Rpackage{simulatorAPMS} is developed for many reasons, one of which is to test the significance of interactome estimation algorithms. The second part of the package uses the estimation algorithm \Rpackage{apComplex} as the tool to predict the ISI, but \Rpackage{simulatorAPMS} can just as easily test significance of any estimation algorithm. In fact, \Rpackage{simulatorAPMS} could compare and contrast the estimations of different estimation algorithms under variable conditions, i.e. different rates of FP/FN, using different baits, etc. One of the most substantial elements to \Rpackage{simulatorAPMS} is that if an estimation algorithm can be shown to be highly accurate and statistically significant, we would then have a reasonably clear description of the interactome. Bait selection for AP-MS techonology is highly related to how much information one can discern from the interactome. The more we understand about the interactome, the more intelligent we can make our bait selections, and thus we gan garner more detail from the actual wet-lab experiment. \section*{References} \noindent Gavin, A.-C., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes, \textit{Nature}, \textbf{415}, 141-147. \vspace{.2in} \noindent Ho, Y., et al. (2002) Systematic identification of protein complexes in \textit{Saccharomyces cerevisiae} by mass spectrometry, \textit{Nature}. \textbf{415}, 180-183. \vspace{.2in} \noindent Jiang, Y., Broach, J. (1999) Tor proteins and protein phosphatase 2A reciprocally regulate Tap42 in controlling cell growth in yeast, \textit{EMBO Journal}, \textbf{18}, 2782-2792. \vspace{.2in} \noindent Scholtens, D., Gentleman, R. (2004) Making sense of high-throughput protein-protein interaction data, \textit{Statistical Aplications in Genetics and Molecular Biology}, \textbf{3}, Article 39. \vspace{.2in} \noindent Scholtens, D., Vidal, M., Gentleman, R. Local modeling of global interacome networks, \textit{Submitted}. \end{document}