%\VignetteIndexEntry{Application Examples of RpsiXML package}
%\VignetteDepends{RBGL, Rgraphviz, ppiStats}
%\VignettePackage{RpsiXML}
\documentclass[11pt]{article}
\usepackage{times}
\usepackage{hyperref}
\usepackage{geometry}
\usepackage{longtable}
\usepackage{subfigure}
\usepackage[pdftex]{graphicx}
\SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=TRUE,prefix=FALSE}

% R part
\newcommand{\todo}[2]{\textit{\textbf{To do:} (#1) #2}}
\newcommand{\fixme}[2]{\textit{\textbf{FIXME:} (#1) #2}}
\newcommand{\R}[1]{{\textsf{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Metas}[1]{{\texttt{#1}}}
\newcommand{\myincfig}[3]{%
  \begin{figure}[htbp]
    \begin{center}
      \includegraphics[width=#2]{#1}
      \caption{\label{#1}#3}
    \end{center}
  \end{figure}
}
\newcommand{\mysubfig}[8]{%
  \begin{figure}[htbp]
    \begin{center}
      \subfigure[#3]{
        \label{#1}
        \includegraphics[width=#2]{#1}
      }
      \subfigure[#6]{
        \label{#4}
        \includegraphics[width=#5]{#4}
      }
      \caption{#7}
      \label{#8}
    \end{center}
  \end{figure}
}

\begin{document}
\setkeys{Gin}{width=0.9\textwidth}
\title{RpsiXML: Application Examples}
\author{Jitao David Zhang and Tony Chiang}
\maketitle

\begin{abstract}
% 1. Why do we care about the problem?
% 2. What is the problem?
% 3. How do we solve it?
% 4. What is the result?
% 5. What are the implications of the result?
\Rpackage{RpsiXML} provides an interface between protein interaction
data stored in the PSI-MI XML format and the statistical and
computational environment of \R{R} and \R{Bioconductor}. In the
vignette \emph{RpsiXML}, we described how to read in PSI-MI XML 2.5
files with \Rpackage{RpsiXML}. In this vignette, we illustrate the use
of \Rpackage{RpsiXML} through examples. These applications demonstrate
how the package can be used to analyze protein-protein interaction
networks (PPIN).
\end{abstract}

\section{Introduction}

The systematic mapping of protein interactions by bait-prey techniques
contributes a unique and novel perspective on the global picture of
cellular machines. We introduce the \R{R} and \R{Bioconductor} package
\Rpackage{RpsiXML} as an interface between standardized protein-protein
interaction data and the statistical and computational environment,
together with a collection of tools for the statistical analysis of
graph representations of these data.

In this vignette we use a small toy protein-protein interaction network
to illustrate some aspects of the package usage. To this end we first
load the necessary packages.

<>=
library(RpsiXML)
library(ppiStats)
library(Rgraphviz)
library(RBGL)
@

\section{Statistics of Protein-Protein Interaction Networks}

Once the protein-protein interaction data have been read into the
\R{R} environment, one intuitive way to study them is to query the
statistics of the network. In this section we use tools implemented in
\Rpackage{RpsiXML}, \Rpackage{graph}, \Rpackage{ppiStats} and
\Rpackage{RBGL} to study some statistical characteristics of a sample
dataset provided by \textit{IntAct}. The \Rpackage{Rgraphviz} package
is used to visualize some of the results.
<>=
## locate the sample IntAct file shipped with the package and read it as a graph
xmlDir <- system.file("/extdata/psi25files",package="RpsiXML")
intactxml <- file.path(xmlDir, "intact_2008_test.xml")
x <- psimi25XML2Graph(intactxml, INTACT.PSIMI25, verbose=FALSE)
@

<>=
nA <- makeNodeAttrs(x, label="", fillcolor="lightblue",
                    width=0.4, height=0.4)
plot(x, "neato", nodeAttrs=nA)
@

<>=
## the same graph with self-loops removed
xn <- removeSelfLoops(x)
nA <- makeNodeAttrs(xn, label="", fillcolor="lightblue",
                    width=0.4, height=0.4)
plot(xn, "neato", nodeAttrs=nA)
@

\mysubfig{visGraph}{0.45\textwidth}{Original graph}{removeSelfLoop}{0.45\textwidth}{Self-loops removed}{Sample dataset from \textit{IntAct} data repository, before and after removing self-loops.}{visGraphWaWOloop}

The graph contains \Sexpr{numNodes(x)} proteins (nodes) and
\Sexpr{numEdges(x)} interactions (edges), as visualized in Figure
\ref{visGraph}. The figure suggests that there are quite a few
self-loops, namely edges that connect a vertex to itself. Proteins with
such a self-loop interact with themselves, for example by forming
dimers (or polymers). We count how many proteins have a self-loop.

<>=
## TRUE for every node that appears in its own adjacency list
isSelf <- function(g) {
  ns <- nodes(g)
  sapply(ns, function(x) x %in% adj(g, x)[[1]])
}
isSelfLoop <- isSelf(x)
selfCount <- sum(isSelfLoop)
print(selfCount)
@

The results show that \Sexpr{selfCount} proteins interact with
themselves. In the next steps these self-loops are removed, since they
complicate certain graph statistics such as degree, an issue that we do
not discuss in detail here. The graph after removing the self-loops is
visualized in Figure \ref{removeSelfLoop}.
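To see why self-loops matter for degree statistics, consider the
following minimal sketch. It uses a hypothetical three-node toy graph
(the names \Robject{toyLoop} and \Robject{toyNoLoop} are ours and not
part of the sample dataset) and assumes that \Rfunction{degree}
returns the components \Robject{inDegree} and \Robject{outDegree} for
a directed graph.

\begin{verbatim}
library(graph)
library(RBGL)

## Hypothetical toy graph (not part of the IntAct sample):
## A interacts with itself and with B; B interacts with C.
toyLoop <- new("graphNEL", nodes = c("A", "B", "C"),
               edgemode = "directed")
toyLoop <- addEdge(from = c("A", "A", "B"),
                   to   = c("A", "B", "C"), toyLoop)

## In- and out-degrees with the self-loop on A still present
degWithLoop <- degree(toyLoop)

## The same statistics after dropping the self-loop
toyNoLoop <- removeSelfLoops(toyLoop)
degNoLoop <- degree(toyNoLoop)

## Per-node change in out-degree caused by the self-loop
degWithLoop$outDegree - degNoLoop$outDegree
\end{verbatim}

In this sketch only node A changes: its self-loop is counted both as an
outgoing and as an incoming edge, which is the kind of double counting
we avoid by working with \Robject{xn}.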
<>=
opar <- par(mar=c(4,4,0,1))
ds <- degree(xn)
## second list element: out-degree
hist(ds[[2]], xlab="", main="")
@

<>=
opar <- par(mar=c(4,4,0,1))
## first list element: in-degree
hist(ds[[1]], xlab="", main="")
@

\mysubfig{outdHist}{0.45\textwidth}{Out-degree}{indHist}{0.45\textwidth}{In-degree}{Histograms of degree distribution in the graph}{degreeHist}

Even though it is small, the sample protein-protein interaction
network resembles a scale-free network, whose degree distribution
follows a power law; at the very least its degree distribution is
skewed, as seen in Figure \ref{degreeHist}. Studies on larger
protein-protein interaction networks suggest that the human interactome
may be a scale-free network, although it has also been argued that the
interactome is rather geometric. We refer interested readers to the
related literature.

One problem of special interest is to find cliques in protein
interaction networks. A \emph{clique} is a complete subgraph, i.e., a
subgraph in which there is an edge between every pair of vertices. The
\emph{Maximum Clique} problem is to find the largest clique in a
graph. We use the \Rfunction{maxClique} function implemented in
\Rpackage{RBGL} to find all cliques in the example graph.
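As a small illustration of the definition, the following sketch applies
\Rfunction{maxClique} to a hypothetical four-node toy graph consisting
of a triangle with one pendant edge. The object names
(\Robject{toyClique}, \Robject{toyCls}, \Robject{toySizes}) are chosen
for this example only.

\begin{verbatim}
library(graph)
library(RBGL)

## Hypothetical undirected toy graph:
## a triangle a-b-c plus a pendant edge c-d
toyClique <- new("graphNEL", nodes = c("a", "b", "c", "d"),
                 edgemode = "undirected")
toyClique <- addEdge(from = c("a", "a", "b", "c"),
                     to   = c("b", "c", "c", "d"), toyClique)

## Enumerate the cliques and keep only the largest one(s)
toyCls   <- maxClique(toyClique)$maxCliques
toySizes <- sapply(toyCls, length)
toyCls[toySizes == max(toySizes)]
\end{verbatim}

Since \Rfunction{maxClique} expects an undirected graph, the directed
sample graph is first converted with \Rfunction{ugraph} before the same
steps are applied to it.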
<>=
## maxClique needs an undirected graph
xu <- ugraph(xn)
cls <- maxClique(xu)$maxCliques
cs <- sapply(cls,length)
cls[cs==max(cs)]
@

<>=
## number of cliques of each size
cc <- table(cs)
c4 <- cc[["4"]]
c3 <- cc[["3"]]
@

<>=
## highlight the two largest cliques in the whole graph
c4ns <- cls[cs==max(cs)]
c4a <- c4ns[[1]]
c4b <- c4ns[[2]]
ns <- nodes(xn)
ncols <- rep("lightblue", length(ns))
ncols[ns %in% c4a] <- "#FF0033"
ncols[ns %in% c4b] <- "#FFFF33"
ncols[ns %in% intersect(c4a,c4b)] <- "#FF8033"
nA <- makeNodeAttrs(xn, fillcolor=ncols, label="",width=0.4, height=0.4)
plot(xn, "neato", nodeAttrs=nA)
@

<>=
## plot only the subgraph spanned by the two cliques
c4nodes <- unique(c(c4a, c4b))
c4sub <- subGraph(c4nodes, xn)
ns <- nodes(c4sub)
ncols <- rep("lightblue", length(ns))
ncols[ns %in% c4a] <- "#FF0033"
ncols[ns %in% c4b] <- "#FFFF33"
ncols[ns %in% intersect(c4a,c4b)] <- "#FF8033"
nA <- makeNodeAttrs(c4sub, fillcolor=ncols)
plot(c4sub, "neato", nodeAttrs=nA)
@

\mysubfig{visClique}{0.45\textwidth}{sample graph}{visCliqueAlone}{0.45\textwidth}{two cliques}{Two cliques of size 4 found in the example network, colored yellow (A4YJD4) and red (Q6NW92), respectively. The two cliques share three nodes, which are colored orange due to blending.}{maxClique}

In graph \Robject{xn}, \Sexpr{length(cls)} cliques are found:
\Sexpr{c4} of them are of size four and \Sexpr{c3} of size three.
Figure \ref{maxClique} illustrates the two 4-cliques in the graph. The
cliques may correspond to functional protein complexes, and the nodes
shared between the cliques could form the core of such a complex.

\section{Assess reciprocity of interactions}

Bait-prey systems allow an interaction between a pair of proteins to
be tested in two directions. If a pair is tested bi-directionally, we
expect both results to be positive or both to be negative. Failure to
attain reciprocity indicates some form of error.

Here we assess the reciprocity of the interactions in the example
network. We use the \Rfunction{assessSymmetry} function implemented in
\Rpackage{ppiStats} to assess the symmetry.
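To make the notion of reciprocity concrete before turning to the
sample data, the following sketch counts reciprocated and
unreciprocated edges by hand on a hypothetical three-protein toy graph.
It works directly on the adjacency matrix and does not call
\Rfunction{assessSymmetry}; the object and column names
(\Robject{toyBP}, \Robject{toyDeg}, \texttt{nr}, \texttt{nunrOut},
\texttt{nunrIn}) are our own and need not match those used by
\Rpackage{ppiStats}.

\begin{verbatim}
library(graph)

## Hypothetical bait-prey toy graph:
## an edge from bait to prey means "bait detected prey"
toyBP <- new("graphNEL", nodes = c("P1", "P2", "P3"),
             edgemode = "directed")
toyBP <- addEdge(from = c("P1", "P2", "P1"),
                 to   = c("P2", "P1", "P3"), toyBP)

## adjM[i, j] is TRUE if the edge i -> j was observed
adjM <- as(toyBP, "matrix") > 0

recip      <- adjM & t(adjM)    # observed in both directions
unrecipOut <- adjM & !t(adjM)   # i -> j observed, j -> i not
unrecipIn  <- t(adjM) & !adjM   # j -> i observed, i -> j not

## One row per protein, one column per edge type
toyDeg <- cbind(nr      = rowSums(recip),
                nunrOut = rowSums(unrecipOut),
                nunrIn  = rowSums(unrecipIn))
toyDeg
\end{verbatim}

Here the pair P1 and P2 is reciprocated, whereas the edge from P1 to
P3 is unreciprocated: it contributes an out-edge to P1 and an in-edge
to P3.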
<>=
sym <- assessSymmetry(xn, bpGraph=TRUE)
head(sym$deg)
@

The \Robject{deg} element in the returned list is an $n \times 3$
matrix, where $n$ is the number of proteins. The rows are indexed by
protein. The three columns give the number of reciprocated edges,
unreciprocated out-edges and unreciprocated in-edges, respectively.

<>=
deg <- sym[[1]]
outR <- deg[,2]==0
inR <- deg[,3]==0
## number of nodes without any unreciprocated edge
nrCount <- sum(outR & inR)
@

Among the \Sexpr{numNodes(xn)} nodes, \Sexpr{nrCount} nodes have only
reciprocated edges. \Sexpr{sum(deg[,3])} edges out of
\Sexpr{numEdges(xn)} are uni-directional, which may indicate some form
of error.

For demonstration, let us now assume that all the proteins in the
network have been tested twice, i.e.\ both as viable bait and as viable
prey. Then one could use the method of moments to estimate the false
positive and false negative error probabilities. The model is
described in the vignette \textit{Stochastic and systematic errors in
PPI data, by looking at unreciprocated in- or out-edges} by W. Huber,
T. Chiang and R. Gentleman. The estimated false positive and false
negative rates are visualized in Figure \ref{fpfn}.

\begin{figure}[htbp]
\centering
<>=
nint <- 49:56
nrec <- sum(deg[,1])
nunr <- sum(deg[,2])
ntot <- nrow(deg)
est <- estErrProbMethodOfMoments(nint=nint, nrec=nrec, nunr=nunr, ntot=ntot)
plot(est[, c("pfp2", "pfn2")], type="l", col="orange", lwd=2,
     xlab=expression(p[FP]), ylab=expression(p[FN]),
     xlim=c(-0.001, 0.005), ylim=c(-0.001, 0.045))
abline(h=0, v=0, lty=2)
@
\caption{False positive ($p_{FP}$) and false negative ($p_{FN}$) rate of the sample graph estimated by the method of moments. The relatively low false positive and false negative rates can be explained by the fact that most edges in the example graph are reciprocated.}\label{fpfn}
\end{figure}

The estimated values of the false positive ($p_{FP}$) and false
negative ($p_{FN}$) rates are very small; this can be explained by the
observation that most edges in the example graph are reciprocated.
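The idea behind such a method-of-moments estimate can be sketched as
follows. The equations below are our own illustration under simplifying
assumptions and may differ in detail from the model in the cited
vignette: every unordered pair of the $n_{tot}$ proteins is tested once
in each direction, the two tests err independently, and $n_{int}$ of
the $N = n_{tot}(n_{tot}-1)/2$ pairs truly interact. A truly
interacting pair is then reciprocated if neither test is a false
negative, and a non-interacting pair is reciprocated if both tests are
false positives, so that the expected numbers of reciprocated and
unreciprocated pairs are
\begin{eqnarray*}
E[n_{rec}] &=& n_{int}\,(1-p_{FN})^2 + (N-n_{int})\,p_{FP}^2, \\
E[n_{unr}] &=& 2\,n_{int}\,p_{FN}\,(1-p_{FN}) + 2\,(N-n_{int})\,p_{FP}\,(1-p_{FP}).
\end{eqnarray*}
Equating these expectations with the observed counts gives two
equations in the two unknowns $p_{FP}$ and $p_{FN}$ for each assumed
value of $n_{int}$. Since $n_{int}$ itself is unknown, the estimate is
computed over a range of values, which yields a curve rather than a
single point.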
<>=
library(RpsiXML)
@

%\section{Build computational interaction network with orthologues}

\section{Discussion}

Using several examples, we have shown how to use the
\Rpackage{RpsiXML} package to study protein-protein interaction data
with statistical and mathematical tools implemented in \R{R} and
\R{Bioconductor}.

This collection of application examples is dynamic: depending on the
feedback of users, we will add new examples or revise old ones from
time to time to demonstrate the use of the package.

\section{Acknowledgements}

We would like to thank Wolfgang Huber, Robert Gentleman, Denise
Scholtens, Sandra Orchard, Nick Luscombe, and Li Wang for their very
helpful and insightful comments on the software. We would also like to
thank the curators of \textit{IntAct}, \textit{MINT}, \textit{DIP},
\textit{HPRD}, \textit{BioGRID}, \textit{MatrixDB}, \textit{CORUM}, and
\textit{MPact} for working with us and for guiding us through their
molecular interaction repositories.

\section{Session Info}

The script runs within the following session:

<>=
sessionInfo()
@

\end{document}