%\VignetteIndexEntry{GRAPH Interaction from pathway Topological Environment} %\VignetteDepends{graph} %\VignetteDepends{graphite} %\VignetteDepends{SPIA} %\VignetteDepends{hgu133plus2.db} %\VignetteDepends{DEGraph} %\VignetteDepends{topologyGSA} \documentclass{article} \usepackage[utf8]{inputenc} <>= BiocStyle::latex() @ \title{GRAPH Interaction from pathway Topological Environment} \author{ Gabriele Sales (gabriele.sales@unipd.it) \\ Enrica Calura (enrica.calura@gmail.com) \\ Chiara Romualdi (chiara.romualdi@unipd.it) } \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents <>= library(graph) library(graphite) @ \section{Introduction} {\it graphite} (GRAPH Interaction from pathway Topological Environment) an R package built to i) provide networks from six databases (KEGG, \cite{KEGG}; Biocarta, www.biocarta.com; Reactome, \cite{Reactome}; NCI/Nature Pathway Interaction Database, \cite{NCI}; HumanCyc, \cite{humancyc}; Panther, \cite{panther}) and, since the version 1.14, from 14 species; ii) discriminate between different types of gene groups; iii) propagate pathway connections through chemical compounds; iv) allow the selection of edge attributes and the conversion of nodes identifiers to EntrezGene IDs and HUGO Symbols; finally, v) to run {\it SPIA}, {\it DEGraph}, {\it CliPPER} and {\it topologyGSA} analyses directly on {\it graphite} networks. \section{Pathway topology conversion to gene network} In order to gather curated information about human pathways, we have collected data from the four public databases that have emerged as reference points for the systems biology community. The KEGG database has been in development by Kanehisa Laboratories since 1995, and is now a prominent reference knowledge base for integration and interpretation of large-scale molecular data sets generated by genome sequencing and other high-throughput experimental technologies. KEGG is the only pathway database not in biopax format, they use the KGML format. Reactome, backed by the EBI, is one of the most complete repository; it is frequently updated and provides a semantically rich description of each pathway \cite{Reactome}. KEGG Pathways (KGML format) provides maps for both signaling and metabolic pathways \cite{KEGG}. BioCarta (www.biocarta.com) is a developer, supplier and distributor company of reagents and assays for biopharmaceutical and academic research. Through an "open source" approach, this community-fed forum constantly integrates emerging proteomic information from the scientific community. It also catalogs and summarizes important resources providing information for over 120,000 genes.BioCarta pathway data in biopax format are available through NCI website. NCI (NCI/Nature Pathway Interaction Database \cite{NCI}) is a highly-structured, curated collection of information about known biomolecular interactions and key cellular processes assembled into signaling pathways. This was a collaborative project between the NCI and Nature Publishing Group (NPG). Finally, Panther \cite{panther} data are a comprehensive, curated database of pathways, protein families, trees, subfamilies and functions available at http://pantherdb.org backed by the University of Southern California. HumanCyc is part of the BioCyc database collection of pathways \cite{humancyc}. \subsection{Pathway definition} KEGG database provides separate {\tt xml} files, one for each pathway, thus, a pathway is defined by all the reactions defined within each file. For the other databases, we identify a pathway every time in the BioPax format is defined the {\tt xml} tag {\tt pathway}. \subsection{Nodes with multiple elements} Within a pathway nodes often correspond to multiple gene products. These can be divided into protein complexes (proteins linked by protein-protein interactions) and the groups containing alternative members (genes generally with similar biochemical functions). Thus, when considering signal propagation these groups should be considered differently; the first (hereafter group AND) should be expanded into a clique (all proteins connected to the others), while the second (hereafter group OR) should be expanded without connection among the contained elements. In the KGML format there are two ways of defining nodes with multiple elements: protein complexes (group AND defined by entry {\tt type="group"}) and group with alternative members (group OR defined by entry {\tt type="gene"}). In the BioPax format only one type of group is allowed: protein complexes (group AND) with the {\tt owl} tag {\tt complex}. However, it often happened that {\tt protein} tag contains multiple {\tt xref} pointing to alternative elements of the process (group OR). \subsection{Compound mediated interactions} Compound mediated interactions are interactions for which a compound acts as a bridge between two elements. Since chemical compounds are generally not measured with high-throughput technology, they should be removed from the network. However, the trivial elimination of the compounds, without signal propagation, will strongly bias the topology interrupting the signals that pass through them. If element $A$ is linked to compound $c$ and compound $c$ is linked to element $B$, thus elements $A$ should be linked to elements $B$. Within the KGML format there are two different ways of describing a compound mediated interaction: i) direct interaction {\tt type="PPrel"} ($A$ interacts whit $B$ through compound $c$ ) and ii) indirect one {\tt type="PCrel"} ($A$ interacts whit compound $c$ and $c$ interacts whit $B$). In the BioPax {\tt owl} language, one the other hand, only a indirect way of defining compound mediated signals is available. Since signal propagation reconstruction is crucial for topological gene set analysis we decide to include additional rules for the propagation reconstructing a connection between two genes connected through a series of chemical compounds. Not all compound are considered for the propagation because some of them, such as Hydrogen, H2O, ATP, ADP etc., are highly frequent in map descriptions and the signal propagation through them would lead to degenerate too long chains of compounds. The compounds not considered for propagation are not characteristic of a specific reaction but secondary substrates/products widely shared among different processes. \subsection{Relation attributes} {\it graphite} allows the user to see the single/multiple relation types that characterized an edge. The type of edges have been conserved as much as possible to those annotated in the data formats. Some new types have been introduced due to topological conversion needs. \subsection{Pathway functions recovery} Human pathway data are natively distribuited with the package. But, since the version 1.14.0, non-human pathway data are also available and can be dowloaded using \Rfunction{pathways} \Rfunction{pathways} requires the name of the specie of interest and the name of the pathway database name as follow: <>= humanReactome <- pathways("hsapiens", "reactome") names(humanReactome)[1:10] p <- humanReactome[["ABC-family proteins mediated transport"]] p @ A pathway database is a list of pathways. We can access to a pathway through its name, as above, or through its position in the list, as follow: <>= p <- humanReactome[[21]] p@title @ In the pathway, nodes represent genes/proteins and the edges represent the functional relationships: <>= head(nodes(p)) @ Edges can be characterized by multiple functional relationships: <>= head(edges(p)) @ This same steps can be used to access the KEGG, Biocarta, Panther, HumanCyc and NCI databases for Homo sapiens pathways (through the \Robject{kegg}, \Robject{biocarta}, \Robject{panther}, \Robject{humancyc} and \Robject{nci} lists, respectively), but not all the databases are present for all the species. To know the pathway data available the user can use the \Rfunction{pathwayDatabases}. <>= pathwayDatabases() @ \section{Graph} The function \Rfunction{pathwayGraph} builds a \Rclass{graphNEL} object from a pathway \Robject{p}: <>= g <- pathwayGraph(p) g @ <>= edgeData(g)[1] @ \section{Identifiers} Gene annotations databases are widely used as public repositories of biological information. Our current knowledge on biological elements is spread out over a number of databases (such as: Entrez Gene , RefSeq, backed by the NCBI http://www.ncbi.nlm.nih.gov/, UniProt, ENSEMBL backed by the EBI http://www.ebi.ac.uk/ to name just a few), specialised on a subset of specific biological entities (for instance, UniProt focuses on proteins while Entrez Gene focuses on genes). Key identifiers (IDs) in the internal structure of each database uniquely represent biological entities, thus biological entities can be identified by homogeneous IDs according to the selected database they refer to. Due to their different origins and specificity, switching from an ID to another is possible but not trivial: there could be either no correspondence between them or many-to-many relations. For detailed information about IDs, their structures and differences please consult those resources. The function \Rfunction{converterIdentifiers} allows the user to convert the pathway IDs into different types, to fit the user needs. This mapping process, however, may lead to the loss of some nodes (not all identifiers may be recognized) and has an impact on the topology of the network (one ID may correspond to multiple IDs in another annotation or vice versa). We based the function of ID conversion through the species-specific Bioconductor \Biocpkg{AnnotationDbi}, such as \Biocpkg{org.Hs.eg.db}. Since the version 1.14, all the conversion supported in the annotation packages can be used. \Rfunction{converterIdentifiers} needs a list of pathways or a single pathway and a string describing the type of the identifier as provided by an Annotation package (for example,‘"ENTREZID"’), the values ‘"entrez"’, ‘"symbol"’ remains for backward compatibility. <>= pSymbol <- convertIdentifiers(p, "SYMBOL") pSymbol head(nodes(pSymbol)) @ <>= reactomeSymbol <- convertIdentifiers(humanReactome[1:5], "SYMBOL") @ \clearpage \section{Cytoscape Plot} Several pathways have a huge number of nodes and edges, thus there is the need of an efficient system of visualization. To this end \Biocpkg{graphite} uses the \Biocpkg{Rcytoscape} package to export the network to Cytoscape. Cytoscape is a Java based software specifically built to manage biological network complexity and for this reason it is widely used by the biological community. Unfortunately, since \Biocpkg{Rcytoscape} use CytoscapeRCP plugin, which is not available for the new Cytoscape 3.0 version, the use of \Rfunction{cytoscapePlot} is currently limited to Cytoscape 2.8 and lower versions. \begin{Schunk} \begin{Sinput} > cytoscapePlot(convertIdentifiers(reactome$`Unwinding of DNA`, "symbol")) \end{Sinput} \end{Schunk} \begin{center} \includegraphics[width=10cm, height=9.12cm]{Cytoscape.png} \end{center} \section{Topological pathway analysis} \Biocpkg{graphite} gives access to three types of topological pathway analyses recently proposed. For more details on the results obtained by these analyses see the corresponding R packages. \subsection{SPIA} The analysis with \Biocpkg{SPIA} requires the conversion of the networks in a suitable format. This conversion is performed by the function \Rfunction{prepareSPIA} that must be executed before the analysis command \Rfunction{runSPIA}. The \Biocpkg{SPIA} data will be saved in the current working directory; every time you change it, you should also re-run \Rfunction{prepareSPIA}. Edges not included in SPIA have been coerced into the admitted SPIA types. Compound mediated interactions annotated in graphite with "indirect" type are mapped into the SPIA edge type "indirect effect" by default set to zero. To use the signal propagated through compounds type 1 in "indirect effect". <>= library(SPIA) data(colorectalcancer) library(hgu133plus2.db) x <- hgu133plus2ENTREZID top$ENTREZ <- unlist(as.list(x[top$ID])) top <- top[!is.na(top$ENTREZ), ] top <- top[!duplicated(top$ENTREZ), ] tg1 <- top[top$adj.P.Val < 0.05, ] DE_Colorectal = tg1$logFC names(DE_Colorectal) <- as.vector(tg1$ENTREZ) ALL_Colorectal <- top$ENTREZ prepareSPIA(biocarta[1:2], "biocartaEx") res <- runSPIA(de=DE_Colorectal, all=ALL_Colorectal, "biocartaEx") res @ For more details see the \Biocpkg{SPIA} package \cite{Tarca2009,Adi2009,Draghici2007}. \subsection{DEGraph} \Biocpkg{DEGraph} implements recent hypothesis testing methods which directly assess whether a particular gene network is differentially expressed between two conditions. In \Biocpkg{graphite}, \Biocpkg{DEGraph} has a dedicated function, called \Rfunction{runDEGraph}, which performs the analysis on a single pathway or on a pathway list. \begin{Schunk} \begin{Sinput} > library(DEGraph) > data("Loi2008_DEGraphVignette") > biocarta <- pathways("hsapiens", "biocarta") > p <- convertIdentifiers(biocarta[["actions of nitric oxide in the heart"]], "ENTREZID") > res <- runDEGraph(p, exprLoi2008, classLoi2008) > res$`1` > cytoscapePlot(convertIdentifiers(reactome$`Unwinding of DNA`, "symbol")) \end{Sinput} \end{Schunk} The function \Rfunction{runDEGraph}, which easily performs the analysis on the entire pathway database, provides as result a list of two elements: a list with the results of the pathway analyses and the list of generated errors. For more details see the \Biocpkg{DEGraph} package \cite{Jacob2010}. \subsection{topologyGSA} \Biocpkg{topologyGSA} uses graphical models to test the pathway components and to highlight those involved in its deregulation. In \Biocpkg{graphite}, \Biocpkg{topologyGSA} has a dedicated function, \Rfunction{runTopologyGSA}, which performs the analysis on a single pathway or on a pathway list. <>= library(topologyGSA) data(examples) kegg <- pathways("hsapiens", "kegg") p <- convertIdentifiers(kegg[["Fc epsilon RI signaling pathway"]], "SYMBOL") runTopologyGSA(p, "var", y1, y2, 0.05) @ The function \Rfunction{runTopologyGSA}, which easily performs the analysis on the entire pathway database, provides as result a list of two elements: a list with the results of the pathway analyses and the list of generated errors. For more details see the \Biocpkg{topologyGSA} package \cite{Massa2010}. \subsection{clipper} \Biocpkg{clipper} is a package for topological gene set analysis. It implements a two-step empirical approach based on the exploitation of graph decomposition into a junction tree to reconstruct the most relevant signal path. In the first step clipper selects significant pathways according to statistical tests on the means and the concentration matrices of the graphs derived from pathway topologies. Then, it "clips" the whole pathway identifying the signal paths having the greatest association with a specific phenotype. In \Biocpkg{graphite}, \Biocpkg{clipper} has a dedicated function, \Rfunction{runClipper}, which performs the analysis on a single pathway or on a pathway list. <>= library(clipper) library(ALL) kegg <- pathways("hsapiens", "kegg") paths <-convertIdentifiers(kegg[1:5], "ENTREZID") genes <- unlist(lapply(paths, nodes)) data(ALL) all <- as.matrix(exprs(ALL[1:length(genes),1:20])) classes <- c(rep(1,10), rep(2,10)) rownames(all) <- genes clipped <- runClipper(paths, all, classes, "mean", pathThr=0.1) resClip <- do.call(rbind,clipped$results) resClip[,1:5] @ The function \Rfunction{runClipper}, which easily performs the analysis on the entire pathway database, provides as result a list of two elements: the list with the results of the pathway analyses and the list of generated errors. For more details see the \Biocpkg{clipper} package \cite{Martini2013}. \begin{thebibliography}{9} \bibitem{KEGG} Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999 Jan 1;27(1):29-34. \bibitem{Reactome} Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B, Kanapin A, Lewis S, Mahajan S, May B, Schmidt E, Vastrik I, Wu G, Birney E, Stein L, D'Eustachio P. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 2009 Jan;37(Database issue):D619-22. Epub 2008 Nov 3. \bibitem{NCI} Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH. PID: the Pathway Interaction Database. Nucleic Acids Res. 2009 Jan;37(Database issue):D674-9. Epub 2008 Oct 2. \bibitem{Draghici2007} Draghici, S., Khatri, P., Tarca, A.L., Amin, K., Done, A., Voichita, C., Georgescu, C., Romero, R. A systems biology approach for pathway level analysis. Genome Research, 17, 2007. \bibitem{Tarca2009} Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim JS, Kim CJ, Kusanovic JP, Romero R. A novel signaling pathway impact analysis. Bioinformatics. 2009 Jan 1;25(1):75-82. \bibitem{Adi2009} Adi L. Tarca, Sorin Draghici, Purvesh Khatri, et. al. A Signaling Pathway Impact Analysis for Microarray Experiments. Bioinformatics, 2009, 25(1):75-82. \bibitem{Jacob2010} L. Jacob, P. Neuvial, and S. Dudoit. Gains in power from structured two-sample tests of means on graphs. Technical Report arXiv:q-bio/1009.5173v1, arXiv, 2010. \bibitem{Massa2010} Massa MS, Chiogna M, Romualdi C. Gene set analysis exploiting the topology of a pathway. BMC System Biol. 2010 Sep 1;4:121. \bibitem{Martini2013} Martini P, Sales G, Massa MS, Chiogna M, Romualdi C. Along signal paths: an empirical gene set approach exploiting pathway topology. Nucleic Acids Res. 2013 Jan 7;41(1):e19. doi: 10.1093/nar/gks866. Epub 2012 Sep 21. \bibitem{humancyc} Caspi R, Altman T, Dale JM, Dreher K, Fulcher CA, Gilham F, Kaipa P, Karthikeyan AS, Kothari A, Krummenacker M, Latendresse M, Mueller LA, Paley S, Popescu L, Pujar A, Shearer AG, Zhang P, Karp PD. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research 38:D473-9 2010. \bibitem{panther} PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Huaiyu Mi, Anushya Muruganujan and Paul D. Thomas Nucl. Acids Res. (2012) doi: 10.1093/nar/gks1118 \end{thebibliography} \end{document}