%\VignetteIndexEntry{RefNet}
%\VignettePackage{RefNet}
%\VignetteEngine{utils::Sweave}

\documentclass{article}

<<>>=
BiocStyle::latex()
@

\newcommand{\exitem}[3]{\item \texttt{\textbackslash#1\{#2\}} #3 \csname#1\endcsname{#2}.}

\title{RefNet}
\author{Paul Shannon}

\begin{document}

\maketitle

\tableofcontents

\section{Introduction}

\Biocpkg{RefNet} allows you to query a large and growing collection of data sources to obtain annotated molecular interactions. Many of these sources are well-known, and many of them are from the \href{http://code.google.com/p/psicquic/}{PSICQUIC} collaboration. Other sources, \emph{native} to this package, are culled from recent publications.

We emphasize that \Biocpkg{RefNet} is a query tool, not a download tool. Molecular interactions are often transient, and frequently dependent upon cell type and biological context. The rich diversity of the interactions returned by RefNet queries should always be examined closely for relevance to the actual biological topic being studied. To assist in this, RefNet interactions include annotations which describe

\begin{itemize}
  \item detection method
  \item interaction type
  \item publication identifiers
\end{itemize}

RefNet's query interface (the \Rcode{interactions} method) supports numerous filtering parameters. Combined with the post-processing tools the package offers, \Biocpkg{RefNet} provides a curatorial tool for constructing context-specific molecular networks.

\section{Providers: interaction data sources}

What are the currently available interaction data sources (hereafter called \textbf{providers})?

{\scriptsize
<<>>=
library(RefNet)
refnet <- RefNet()
providers(refnet)
@
}

The structure of this list reveals the two classes of providers currently offered: those which are directly contained in \Biocpkg{RefNet} and those which are obtained via the \Bioconductor{} package \Biocpkg{PSICQUIC}. The former group (``native'') will in general hold smaller, special-purpose collections.
New \Biocpkg{PSICQUIC} providers, and new interactions from existing providers, become available automatically. Other classes of providers in addition to these two may be added as well.

\section{Quick Start: Interactions for \textbf{E2F3}}

To introduce \Biocpkg{RefNet}'s principal function, \Rfunction{interactions}, we will query two providers for interactions with the \emph{E2F3} transcription factor:

\begin{itemize}
  \item \textbf{gerstein-2012}: transcription factors (TFs) and their targets, from \emph{Architecture of the human regulatory network derived from ENCODE data}~\cite{gerstein:2012}, obtained by chromatin immunoprecipitation assay.
  \item \textbf{BioGrid}: ``The Biological General Repository for Interaction Datasets (BioGRID) is a public database that archives and disseminates genetic and protein interaction data from model organisms and humans (thebiogrid.org). BioGRID currently holds over 720,000 interactions curated from both high-throughput datasets and individual focused studies, as derived from over 41,000 publications in the primary literature. Complete coverage of the entire literature is maintained for budding yeast (S. cerevisiae), fission yeast (S. pombe) and thale cress (A. thaliana), and efforts to expand curation across multiple metazoan species are underway. Current curation drives are focused on particular areas of biology to enable insights into conserved networks and pathways that are relevant to human health.''
\end{itemize}

%\vspace{1em}

The \Rcode{interactions} method has nine arguments, eight of which are optional. In practice, one or more (typically three: \Rcode{id}, \Rcode{species}, \Rcode{provider}) are always used.
For example, to obtain interactions for the transcription factor \textbf{E2F3}:

{\scriptsize
<<>>=
# Run the query only if the BioGrid provider is currently available.
if("BioGrid" %in% unlist(providers(refnet), use.names=FALSE)){
    tbl.1 <- interactions(refnet, species="9606", id="E2F3",
                          provider=c("gerstein-2012", "BioGrid"))
    dim(tbl.1)
    }
@
}

The full set of arguments, of which all but the first are optional:

\begin{itemize}
  \item \emph{object} a RefNet instance
  \item \emph{id=NA} a list of one or more identifiers
  \item \emph{species=NA} limit interactions to organisms, described with the NCBI taxonomy codes
  \item \emph{speciesExclusive=TRUE} force all interactions to be within the specified species
  \item \emph{type=NA} limit interactions to interaction types
  \item \emph{provider=NA} limit interactions by providers
  \item \emph{detectionMethod=NA} limit interactions by detection methods
  \item \emph{publicationID=NA}
  \item \emph{quiet=TRUE}
\end{itemize}

Thus \Biocpkg{RefNet}'s \Rfunction{interactions} method is designed for focused use in a curatorial mode, in which one limits a query by providing values to some or all of these arguments, iteratively creating a biologically relevant network of interactions, as we will demonstrate below.

One \emph{could} retrieve all interactions from all providers by calling \Rfunction{interactions} with every optional argument left at its default. This would take a very long time, would be a disservice to the PSICQUIC providers, and would be of little benefit. It may be reasonable in some circumstances to retrieve full data sets from the \emph{native} providers. This is demonstrated in their respective man pages.

\section{Query Results: understanding the data.frame}

A \Rcode{data.frame} is returned by the \Rfunction{interactions} method.
Because PSICQUIC provides many of the data sources used by \Biocpkg{RefNet}, and because the PSI community provides a common results format which was the result of much deliberation, \Biocpkg{RefNet} returns a \Rcode{data.frame} with all of the standard PSICQUIC columns, with several (sometimes many) columns added.

Some of these additional columns are copied directly from the provider source data. \emph{recon2}, for instance, provides metabolic reaction interactions, and characterizes each reaction as reversible or not. If your query providers include \emph{recon2} then you will see this column added to your results. Other providers report other non-PSICQUIC data columns. \Biocpkg{RefNet} constructs a \Rcode{data.frame} which includes the union of all data columns reported by all of the providers, necessarily including many missing values (currently represented in PSICQUIC style, with a ``-'').

Four ``entity name'' columns are also added to every results \Rcode{data.frame}, in an attempt to solve -- or at least, to ameliorate -- the ``identifier problem'', in which different providers prefer different naming schemes for the interactions they report. PSICQUIC providers, most of whom are interested primarily in protein-protein interactions, tend to report interaction pairs as interacting proteins, using a variety of naming schemes (UniProt.kb, RefSeq, Ensembl, STRING).

Current bioinformatic practice, however, commonly describes protein interactions in terms of interactions between the genes which code for the interacting proteins. This practice is reflected in the PSICQUIC query convention: gene symbol names are used in PSICQUIC queries. We support that practice in order to get good results from PSICQUIC providers. For \Biocpkg{RefNet} native sources, we go further, and look for query matches against \emph{any} identifier provided by the native source: a reaction name, a small molecule metabolite, a protein, a gene symbol or an Entrez geneID.
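
The column-union behavior described earlier in this section can be illustrated with a small base-\R{} sketch. It is not \Biocpkg{RefNet}'s own code: the two toy provider tables, their column names, and the helper \Rcode{pad} are hypothetical, chosen only to show how a union of columns leads to ``-'' placeholders.

{\scriptsize
<<>>=
# Hypothetical results from two providers with different column sets.
toy.a <- data.frame(A="E2F3", B="FZR1", type="direct interaction",
                    stringsAsFactors=FALSE)
toy.b <- data.frame(A="ACO2", B="citrate", reversible="TRUE",
                    stringsAsFactors=FALSE)

# The combined table carries every column seen in either source.
all.cols <- union(colnames(toy.a), colnames(toy.b))

# Hypothetical helper: add each missing column, filled with "-",
# then put the columns into a common order.
pad <- function(tbl, cols) {
    for(col in setdiff(cols, colnames(tbl)))
        tbl[[col]] <- "-"
    tbl[, cols, drop=FALSE]
    }

toy.union <- rbind(pad(toy.a, all.cols), pad(toy.b, all.cols))
toy.union

# Count the placeholder values in each column.
colSums(toy.union == "-")
@
}

Counting the ``-'' entries per column, as in the last line, is a quick way to see which columns a given set of providers actually populated.
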
Furthermore, and crucially, every interaction in the results \Rcode{data.frame} includes these four extra name columns:

\begin{itemize}
  \item \textbf{A.common} a familiar, readable name, e.g. ``E2F3'', ``acetyl-coa transport''
  \item \textbf{B.common}
  \item \textbf{A.canonical} a more formal identifier, e.g. ``1871'', ``R\_ACCOAgt''
  \item \textbf{B.canonical}
\end{itemize}

These columns are added to the RefNet native sources as they are parsed into the package. You must add them to interactions obtained from RefNet PSICQUIC sources by invoking the IDMapper class, from the PSICQUIC package.

{\scriptsize
<<>>=
if("IntAct" %in% unlist(providers(refnet), use.names=FALSE)){
    tbl.2 <- interactions(refnet, id="E2F3", provider="IntAct", species="9606")
    dim(tbl.2)
    idMapper <- IDMapper("9606")
    tbl.3 <- addStandardNames(idMapper, tbl.2)
    dim(tbl.3)
    tbl.3[, c("A.name", "B.name", "A.id", "B.id", "type", "provider")]
    }
@
}

Mixed queries produce many columns.

{\scriptsize
<<>>=
# Run the query only if the BioGrid provider is currently available.
if("BioGrid" %in% unlist(providers(refnet), use.names=FALSE)){
    tbl.4 <- interactions(refnet, id="ACO2", provider=c("gerstein-2012", "BioGrid"))
    tbl.5 <- addStandardNames(idMapper, tbl.4)
    sort(colnames(tbl.5))
    }
@
}

\section{Curation}

Let us now examine more closely the interactions returned from the E2F3 query above, to demonstrate the curation process \Biocpkg{RefNet} is designed to support.

\subsection{detectionMethod and interaction type}

Of the 54 interactions returned by that query, 10 come from \emph{gerstein-2012} and 44 from \emph{BioGrid}.

{\scriptsize
<<>>=
if(exists("tbl.5")){
    dim(tbl.5)
    table(tbl.5$provider)
    }
@
}

With what methods were these interactions detected? What interaction types were reported? Note that twelve of the BioGrid interactions were identified in a high-throughput ``two hybrid'' experiment, and may deserve less weight than interactions from small-scale experiments such as ``western blotting'' and ``enzymatic study''.
{\scriptsize
<<>>=
if(exists("tbl.5")){
    options(width=180)
    tbl.info <- with(tbl.5, as.data.frame(table(detectionMethod, type, provider)))
    tbl.info <- tbl.info[tbl.info$Freq>0,]
    tbl.info
    options(width=80)
    }
@
}

\subsection{Detect and Examine Duplicates}

A query will often return duplicate interactions, either redundant reports of the same interaction from the same experiment and published paper, or essentially identical interactions between two entities discovered and reported more than once. You will often want to eliminate these duplicates as you build out a network. And, in general, you will want to keep the interactions which are most reliably reported, which are most specifically observed, and which come from well-regarded experiments. You may wish to select only those interactions which come from small-scale experiments involving a cell type identical with, or similar to, the one you are modeling.

To help with this, \Biocpkg{RefNet} offers two related functions: \Rcode{detectDuplicateInteractions} and \Rcode{pickBestFromDupGroup}.

{\scriptsize
<<>>=
# Run the query only if the BioGrid provider is currently available.
if("BioGrid" %in% unlist(providers(refnet), use.names=FALSE)){
    tbl.6 <- interactions(refnet, species="9606", id="E2F3",
                          provider=c("gerstein-2012", "BioGrid"))
    tbl.7 <- addStandardNames(idMapper, tbl.6)
    tbl.withDups <- detectDuplicateInteractions(tbl.7)
    }
@
}

The last function call adds a ``dupGroup'' column, identifying ten groups, each of which has the same two interacting molecules. The ``0'' group has special status: it contains unique interactions, of which 28 were found. dupGroup number 1 has three interactions:

{\scriptsize
<<>>=
if(exists("tbl.withDups")){
    options(width=180)
    table(tbl.withDups$dupGroup)
    subset(tbl.withDups, dupGroup==1)[, c("A.name", "B.name", "type",
                                          "detectionMethod", "publicationID")]
    options(width=80)
    }
@
}

We see three interactions in this dupGroup.
Because the PubMed ID is the same, and the interacting proteins are the same, albeit ordered differently, we surmise that this may just be one interaction between \textbf{FZR1} and \textbf{E2F3}. We prefer the ``enzymatic study'', ``direct interaction'' version, for the extra specificity it implies, but an examination of the abstract of the source publication is often helpful:

{\scriptsize
<<>>=
noquote(pubmedAbstract("22580460", split=TRUE))
@
}

\textbf{FZR1} is not mentioned in the abstract. Perhaps an alternate name for this gene has been used? To explore that possibility, first obtain the Entrez geneID, then see what aliases are known for it.

{\scriptsize
<<>>=
library(org.Hs.eg.db)
if(exists("tbl.withDups")){
    geneID <- unique(subset(tbl.withDups, A.common=="FZR1")$A.canonical)
    suppressWarnings(select(org.Hs.eg.db, keys=geneID, columns="ALIAS", keytype="ENTREZID"))
    }
@
}

Interpreting \textbf{Cdh1} as \textbf{FZR1}, and based on the text of the abstract, we can with high confidence claim that \textbf{FZR1} interacts directly with \textbf{E2F3}, resulting in its proteasome-dependent degradation: a specific, attested molecular interaction likely to be of strong interest.

In the next section we demonstrate some \Biocpkg{RefNet} function calls which speed up this process of curation.

\subsection{Programmatically Eliminate Duplicates}

A common \Biocpkg{RefNet} scenario is to query all providers for interactions with a gene or protein of interest, and then -- the total number of reported interactions being quite large -- programmatically eliminate all but the most interesting non-redundant interactions.
{\scriptsize
<<>>=
providers <- intersect(unlist(providers(refnet), use.names=FALSE),
                       c("BIND", "BioGrid", "IntAct", "MINT", "gerstein-2012"))
tbl.8 <- interactions(refnet, species="9606", id="E2F3", provider=providers)
tbl.9 <- addStandardNames(idMapper, tbl.8)
dim(tbl.9)
@
}

Duplicate interactions can only be detected if both participating entities have canonical names assigned to them. Some PSICQUIC providers return identifiers which \Rcode{addStandardNames} cannot (at the present time) map to standard identifiers. We eliminate interactions involving those few identifiers. A case-by-case ``manual'' study of these interactions will sometimes be warranted.

{\scriptsize
<<>>=
# Rows in which either participant could not be mapped ("-").
removers <- with(tbl.9, unique(c(grep("^-$", A.id), grep("^-$", B.id))))
# Start from the full table so tbl.10 exists even when nothing is removed.
tbl.10 <- tbl.9
if(length(removers) > 0)
    tbl.10 <- tbl.9[-removers,]
dim(tbl.10)
@
}

In order to distinguish better interactions from worse, an ordered list of interaction types must be provided. (For now, this is the only ranking criterion we support; detectionMethod and provider ranking will be added in the future.) To begin, we must first find out the interaction types present in the current set:

{\scriptsize
<<>>=
options(width=120)
table(tbl.10$type)
@
}

We are, for now, not interested in interactions of unassigned type (``-''). We shall ignore ``colocalization'' as well.

{\scriptsize
<<>>=
tbl.11 <- detectDuplicateInteractions(tbl.10)
dupGroups <- sort(unique(tbl.11$dupGroup))
# Interaction types in order of preference; unlisted types are dropped.
preferred.types <- c("direct interaction", "physical association",
                     "transcription factor binding")
bestOfDups <- unlist(lapply(dupGroups, function(dupGroup)
                       pickBestFromDupGroup(dupGroup, tbl.11, preferred.types)))
deleters <- which(is.na(bestOfDups))
if(length(deleters) > 0)
    bestOfDups <- bestOfDups[-deleters]
length(bestOfDups)
tbl.12 <- tbl.11[bestOfDups,]
tbl.12[, c("A.name", "B.name", "type", "provider", "publicationID")]
@
}

We thus obtain a high-confidence annotated list of E2F3 interactions.
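
To make the ranking idea concrete, here is a minimal base-\R{} sketch of choosing one interaction per pair of entities according to an ordered list of preferred types. It is an illustration under stated assumptions, not the implementation of \Rcode{detectDuplicateInteractions} or \Rcode{pickBestFromDupGroup}: the toy table, the identifiers \Rcode{g1} to \Rcode{g4}, and the helper \Rcode{pickRow} are all hypothetical, and pairs are grouped here by a simple order-independent key rather than by a dupGroup column.

{\scriptsize
<<>>=
# Hypothetical toy interactions; rows 1 and 2 report the same pair,
# with the participants listed in opposite order.
toy <- data.frame(A.canonical=c("g1", "g2", "g1", "g1"),
                  B.canonical=c("g2", "g1", "g3", "g4"),
                  type=c("physical association", "direct interaction",
                         "colocalization", "physical association"),
                  stringsAsFactors=FALSE)

# Order-independent key, so that (g1, g2) and (g2, g1) share a group.
pair.key <- with(toy, paste(pmin(A.canonical, B.canonical),
                            pmax(A.canonical, B.canonical), sep=":"))

# Lower rank means more preferred; types not listed get NA.
toy.preferred <- c("direct interaction", "physical association")
type.rank <- match(toy$type, toy.preferred)

# Hypothetical helper: the best-ranked row of a group, or NA if
# no row in the group has a preferred type.
pickRow <- function(rows) {
    ranks <- type.rank[rows]
    if(all(is.na(ranks)))
        return(NA_integer_)
    rows[which.min(ranks)]
    }

best <- unlist(lapply(split(seq_len(nrow(toy)), pair.key), pickRow),
               use.names=FALSE)
best <- best[!is.na(best)]
toy[best,]
@
}

In this sketch the ``direct interaction'' report wins over the ``physical association'' report for the duplicated pair, and the pair whose only report is ``colocalization'' is dropped, mirroring the choices made with \Rcode{preferred.types} above.
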
%---------------------------------------------------------
\section{Session info}
%---------------------------------------------------------

Here is the output of \Rfunction{sessionInfo} on the system on which this document was compiled:

<<>>=
toLatex(sessionInfo())
@

\bibliography{RefNet}

\end{document}