% rm(list=ls());library("weaver");Sweave("SuppMat.Rnw", driver=weaver)
%\VignetteIndexEntry{Reading PSI-25 XML file from IntAct}
%\VignetteDepends{}
%\VignettePackage{Rintact}
\documentclass[11pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage{geometry}
\usepackage{longtable}

\SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=TRUE,include=FALSE,prefix=FALSE,width=4,height=4}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\title{Reading PSI-25 XML file from IntAct with the \Rpackage{Rintact} package}
\author{Tony Chiang and Nianhua Li}

\begin{document}

\maketitle

\begin{abstract}
This document serves as a user's vignette to the R package \Rpackage{Rintact}.
We present examples of how to use the two main functions of \Rpackage{Rintact},
and also how to take the output data from these two functions and create the
input data for statistical methods in proteomic analysis provided by other
Bioconductor packages.
\end{abstract}

\section{Introduction}

\Rpackage{Rintact} is an R package mainly used to parse the PSI-25 files
generated by the \textit{IntAct} data repository, which collects, curates and
stores thousands of protein interactions. Currently, there are two main
functions within \Rpackage{Rintact}:
\begin{enumerate}
\item \Rfunction{psi25interaction}
\item \Rfunction{psi25complex}
\end{enumerate}

The first function, \Rfunction{psi25interaction}, takes either a PSI-25 XML
file from IntAct or a URL giving the web address from which such an XML file
can be obtained. The XML file must contain \emph{binary} protein-protein
interaction data. Examples of such data are direct physical interactions,
complex co-membership, and synthetic genetic interactions.

The second function, \Rfunction{psi25complex}, also takes a PSI-25 XML file or
URL as an input parameter, but the file must contain protein complex
membership information.
In principle, these two functions can take any XML file which adheres to the
PSI-25 standards. We have constructed these functions, however, to work
primarily with the \textit{IntAct} PSI-25 XML files, as there are subtle
implementation differences between repositories such as \textit{IntAct} and
\textit{DIP}, although both use the PSI-25 standards.

In this vignette, we shall demonstrate the use of these functions on the data
generated by \cite{Ewing2007} and on the manually curated protein complexes
derived by the \textit{IntAct} curators.

\subsection{Loading R Libraries}

We begin by loading the R libraries that we shall use. Our primary focus will
be on the \Rpackage{Rintact} package, but we will also examine and exploit
statistical methods found in various Bioconductor packages for the analysis of
the interaction data obtained from \textit{IntAct}.

<<>>=
library("Rintact")
library("graph")
library("Rgraphviz")
#library("ppiStats")
library("RBGL")
library("apComplex")
library("xtable")
@

\section{Obtaining the Interaction Information}

\subsection{\Rfunction{psi25interaction}}

We first demonstrate the use of the function \Rfunction{psi25interaction}. We
can either download the \textit{IntAct} PSI-25 XML file to a local directory
or simply pass the URL from which the file can be obtained as the input
parameter. Here we use a sample file shipped with the package, whose local
path is returned by \Rfunction{system.file}; a URL would be passed in the same
way:

<<>>=
url <- system.file("PSI25XML", "interactionSample.xml", package="Rintact")
ewing <- psi25interaction(url)
@

Once the XML file has been parsed by \Rfunction{psi25interaction}, we can look
at its overall structure.

<<>>=
class(ewing)
@

We can see that the output of \Rfunction{psi25interaction} is an instance of
the class \Rclass{interactionEntry}.
This class has \Sexpr{length(slotNames(ewing))} slots:
%
<<>>=
slotNames(ewing)
@

Three of them contain simple character vectors:

<<>>=
ewing@organismName
ewing@taxId
ewing@releaseDate
@

%FIXME: We should get rid of organismName and taxId or make sure that they will get all of
%the organisms tested in the XML file.
\Rclass{organismName} records all the organisms for which interactions were
assayed. For each organism, we have also included its taxonomy identification
code. Because \textit{IntAct} does not currently version its weekly release,
we have added the \Rclass{releaseDate} as a time stamp to act as a surrogate
for the version number.

Let us investigate the structure of the \Rclass{interactions} slot. This slot
contains a list which holds all the binary interactions given within the XML
file (along with information about each particular interaction). Each element
of the list is an instance of the class \Rclass{intactInteraction}. This class
has \Sexpr{length(slotNames(interactions(ewing)[[1]]))} slots:
%
<<>>=
length(interactions(ewing))
class(interactions(ewing)[[1]])
interactions(ewing)[[1]]
slotNames(interactions(ewing)[[1]])
@

The various slots contain information relevant to each individual
interaction. The \Rclass{interactionType} slot details what manner of
interaction was found between the bait protein and the prey protein, which are
specified in the \Rclass{bait} and \Rclass{prey} slots. Another important
attribute is the experimental confidence value given in the
\Rclass{confidenceValue} slot. This value is the one reported by the
experimenters; scores derived by third parties are not included.
We can extract the names of the bait and prey proteins for all of the
interactions in the \Robject{ewing} dataset:
%
<<>>=
ewbait <- sapply(interactions(ewing), bait)
ewprey <- sapply(interactions(ewing), prey)
@
%
We now have two character vectors, \Robject{ewbait} and \Robject{ewprey}, that
are aligned to each other: the $i$th protein in \Robject{ewprey} was detected
by the $i$th protein in \Robject{ewbait}.
%
<<>>=
ewbait
ewprey
@

The \textit{IntAct} accession codes are useful as unique and uniform
identifiers in the \textit{IntAct} repository, but we will usually want to
translate them to other identifier schemes such as HUGO gene names or Ensembl
gene identifiers. The PSI-25 XML files from \textit{IntAct} contain a look-up
table for this purpose. This look-up table is stored in the
\Rclass{interactors} slot of the \Rclass{interactionEntry} object
\Robject{ewing}, in the form of a character matrix. Its rows are indexed by
the \textit{IntAct} accession numbers of the molecules in the data structure,
and it has \Sexpr{ncol(interactors(ewing))} columns.
%
<<>>=
interactors(ewing)
@
%
The \textit{IntAct} accession codes can be translated into any of the
associated identifier schemes. Two further properties are given for each
molecule: the organism to which the molecule is native and the corresponding
taxonomy ID.

Most of the interactions found in \textit{IntAct} are protein-protein
interactions; other types, however, are also stored, such as small molecule to
protein interactions and gene-gene interactions. As a result, there will be
times when a molecule cannot be mapped to a locus name, an ORF, etc. We also
remark that interactions have been tested between proteins of different
organisms (e.g.\ a human protein against a mouse protein). Thus the organism
attribute is vital to keep such interactions in the proper context.

Using the look-up table is quick and efficient because of the subsetting
functionality of R.
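As a minimal sketch of this kind of subsetting, consider a toy character
matrix. The accession codes, the identifiers and the \texttt{uniprotId}
column below are invented for illustration and are not taken from
\textit{IntAct}; only the column name \texttt{geneName} mirrors the real
table. Indexing the matrix by row names returns the entries in the order
requested, with \texttt{NA} where no mapping is recorded.

<<>>=
## Toy look-up table; all identifiers here are hypothetical.
lut <- matrix(c("P00001", "GENEA",
                "P00002", "GENEB",
                "P00003", NA),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("EBI-0001", "EBI-0002", "EBI-0003"),
                              c("uniprotId", "geneName")))
## Row names select molecules, the column name selects the identifier scheme.
ids <- c("EBI-0003", "EBI-0001")
lut[ids, "geneName"]
@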
For instance, say we would like to translate the following \textit{IntAct}
accession codes
%
<<>>=
wh <- ewbait[3:4]
@
into gene names:
<<>>=
interactors(ewing)[wh, "geneName"]
@

\section{Obtaining Protein Complex Composition Information}

Now we will demonstrate the parser function \Rfunction{psi25complex}. We
remark here that the protein complexes which this function obtains have been
manually curated from literature sources by \textit{IntAct} curators.
%FIXME: Why?
%Those protein complexes estimated by the experimenters (e.g. \cite{Gavin2006}) should not be
%culled from the XML files.

The parameters of \Rfunction{psi25complex} are identical to those of
\Rfunction{psi25interaction}, while its output contains only three slots:
%
<<>>=
url2 <- system.file("PSI25XML/complexSample.xml", package="Rintact")
comps <- psi25complex(url2)
slotNames(comps)
@
%
Again the \Rclass{releaseDate} slot serves as a surrogate version number. The
\Rclass{interactors} slot again holds a look-up table that can be used to
translate the \textit{IntAct} accession codes. The \Rclass{complexes} slot is
a list in which each entry is an instance of the class
\Rclass{intactComplex}, which itself has
\Sexpr{length(slotNames(comps@complexes[[1]]))} slots.

<<>>=
length(complexes(comps))
class(complexes(comps)[[1]])
slotNames(complexes(comps)[[1]])
@

These slots describe the multi-protein complex. The three most important ones
are the \Rclass{fullName}, \Rclass{attributes} and \Rclass{members} slots. The
\Rclass{fullName} slot gives the exact name of the multi-protein complex,
while the \Rclass{attributes} slot gives a short description of the known
functionality of the complex. The \Rclass{members} slot gives the members of
the complex and their multiplicity.
<<>>=
stopifnot(all(c("fullName", "attributes", "members") %in%
              slotNames(complexes(comps)[[1]])))
@

<<>>=
complexes(comps)[[1]]
@

%---------------------------------------------------------
% SessionInfo
%---------------------------------------------------------
\begin{table*}[tbp]
\begin{minipage}{\textwidth}
<<>>=
toLatex(sessionInfo())
@
\end{minipage}
\caption{\label{tab:sessioninfo}%
The output of \Rfunction{sessionInfo} on the build system after running this vignette.}
\end{table*}

\begin{thebibliography}{12}

\bibitem[Gavin et~al.(2006)]{Gavin2006}
Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ,
Bastuck S, D{\"u}mpelfeld B, et~al.
\newblock {Proteome survey reveals modularity of the yeast cell machinery}.
\newblock \emph{Nature} 2006, \textbf{440}:631--636.

\bibitem[Ewing et~al.(2007)]{Ewing2007}
Ewing EM et~al.
\newblock {Large-scale Mapping of Protein-Protein Interactions by Mass Spectrometry}.
\newblock \emph{Molecular Systems Biology} 2007, 3.

\end{thebibliography}

\end{document}