%% LyX 1.6.9 created this file. For more info, see http://www.lyx.org/. %% Do not edit unless you really know what you are doing. \documentclass[english]{article} \usepackage[T1]{fontenc} \usepackage[latin9]{inputenc} \usepackage{geometry} \geometry{verbose,tmargin=2cm,bmargin=3cm,lmargin=3cm,rmargin=3cm,headheight=1cm,headsep=1cm,footskip=2cm} \usepackage{color} \usepackage{babel} \usepackage{amstext} \usepackage[unicode=true] {hyperref} \usepackage{breakurl} \makeatletter %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands. %% Because html converters don't know tabularnewline \providecommand{\tabularnewline}{\\} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands. \newcommand{\lyxaddress}[1]{ \par {\raggedright #1 \vspace{1.4em} \noindent\par} } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands. \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rcommand}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} % Meta information - fill between {} and do not remove % % \VignetteIndexEntry{An R Package for retrieving data from EnVision into R objects. } % \VignetteDepends{rJava, XML, utils} % \VignetteKeywords{} % \VignettePackage{ENVISIONQuery} \makeatother \begin{document} \title{The ENVISIONQuery package in BioConductor: Retrieving data through the Enfin-Encore annotation portal.} \author{Alex Lisovich$^{\text{1}}$, Roger Day$^{\text{1,2}}$} \date{April 19, 2011} \maketitle \lyxaddress{$^{\text{1}}$Department of Biomedical Informatics, $^{\text{2}}$Department of Biostatistics \\ University of Pittsburgh} \section{Overview} The Enfin-Encore (EnCore) is the integration platform for the ENFIN European Network of Excellence {[}1{]}, which provides a portal to various database resources with a special focus on systems biology ( \href{http://code.google.com/p/enfin-co}{http://code.google.com/p/enfin-co}re/). EnCore is appealing to developers of bioinformatics applications, because of the scope and variety of EnCore's annotation resources, and because its collection of web services$^{*}$ utilizes a common standard format (EnXML: \href{http://code.google.com/p/enfin-core/wiki/wp1_encore_enxml}{http://code.google.com/p/enfin-core/wiki/wp1\_{}encore\_{}enxml}). Web services can communicate client applications using a variety of programming languages. Many bioinformaticians work in R, so an R-based solution is desirable. The ENVISIONQuery package provides programmatic access to the EnCore web services in R. EnCore's capabilities evolve rapidly, so the architecture of ENVISIONQuery enables rapid integration of the new services when they appear.\\ \\ $^{*}$\textbf{IMPORTANT NOTE: The Java Web Services utilized by the package require Java 1.6 or higher to be installed, otherwise the package installation will fail giving the warning abot the wrong Java version.} \section{Design considerations} \subsection{Motivating setting} Our group studied differential expression between endometrial cancer tissues with normal tissues. We received an Orbitrap proteomic mass spectrometry experiment that generated over 12,000 protein UNIPROT identifiers, as well as an Affymetrix U133 Plus 2 microarray experiment on the same samples. The analysis required mapping UNIPROT identifiers to Affymetrix probe-sets, intending to produce an integrated view of expression at protein and transcript levels tied to the same gene. We examined several strategies for accomplishing this mapping. We found evidence that utilizing Enfin-Encore services is at minimum competitive, and possibly superior to other approaches {[}2{]}. Initially, the motivation ENVISIONQuery was speifically to automate the retrieval of ID mapping information. Later, it became evident that a broad approach would make a wide range of Enfin-Encore services available to R users. Moreover, the Enfin-Encore architecture allows one to combine queries into the pipelines, using the given query results as an input for the next query utilizing the unified EnXML data format. Consequently, the scope of the ENVISIONQuery package expanded to cover additional services provided by the Enfin-Encore portal, the final goal being to cover the whole range of services which the Enfin-Encore online front ends, EnVision and EnVision2 provide. \textcolor{black}{The ENVISIONQuery is a second (the first being the DAVIDQuery) in a series of packages devoted to ID mapping related information retrieval and performance analysis. At the moment our group is preparing two more packages for submission to BioConductor: the IdMappingAnalysis package for ID mapping performance analysis and the IdMappingRetrieval package providing a unified framework for ID mapping related information retrieval from various online services for further analysis by the IdMappingAnalysis package. DAVIDQuery and ENVISIONQuery are utilized by the IdMappingRetrieval package. } \subsection{Comparison to biomaRt} The biomaRt package available from BioConductor is an extremely powerful tool covering a wide range of data types and sources. On the other hand, the Enfin Encore (and ENVISIONQuery), though at an early stage of development, provides access to some types of data that not covered by biomaRt, the IntAct and PICR being the examples (see section 5). Thus, we believe ENVISIONQuery would nicely complement biomaRt. When we began the integration study described above, biomaRt did not actively support UNIPROT. For this reason, we explored other solutions, leading us to use the Enfin Encore online interactive front end (EnVisio\textcolor{black}{n). Recently, we returned to biomaRt, comparing its results with Enfin Encore in mapping Affymetrix probeset IDs to UNIPROT Accessions. }Despite the fact that both systems access presumably the same Biomart (\href{http://www.biomart.org}{www.biomart.org}) data source through the Ensembl (\href{http://www.ensembl.org}{www.ensembl.org}), the results were somewhat different. For the HG U133 Plus 2 micro array, either biomaRt or ENVISIONQuery returned UNIPROT Accession matches for 32438 probesets. Of those probesets, 70\% of the returned UNIPROT lists were identical, for 28\% the lists overlapped but were not identical, and for 2\% only one of the two methods returned a UNIPROT accession. The Enfin Encore team informed us that there is more than one instance of the Biomart data source underlying database, and that Ensembl and Enfin Encore ID mapping service access different on\textcolor{black}{es. We are not certain which service is more accurate.} It should be noted that EnXML file format imposes a significant performance penalty on the query system due to high percent of redundant information contained within the xml text file. In our experience, the EnXML file is 10 to 20 times larger than the comparable csv-format file which biomaRt returns, with comparable ratio for time expended. Therefore, large queries should probably use biomaRt, if the discrepancies described above are of little concern. Otherwise, our recommendation is to use the ENVISIONQuery when the query size is small enough (up to a few hundred input IDs), or when the service desired is not supported by biomaRt. \subsection{ENVISIONQuery and Java Web Services} As mentioned above, the Encore platform is implemented in Java as a collection of Web Services, and therefore, the most robust way to integrate it into the custom application would be to use the language supporting the web service paradigm. In fact, the EnCore team recommends using Java in client applications and provides extensive support to facilitate this style of development. From the other hand, R does not have such capabilities. To overcome this limitation, the package is subdivided in two distinctive subsystems: the Java library providing the set of classes serving as wrappers for EnCore web services and the R counterpart communicating with Java using the rJava package available from CRAN and providing the package high level functionality including user interaction, preparing data for online submission, formatting the raw EnXML data, filtering data on a set of constraints etc. Each web service implemented within Java is registered in R during the package initialization in a special data structure which contains the information on how to access the Java methods, what services are available, what tools and options can be used for a particular service, as well as additional information allowing to interactively define the processing details as necessary. As a result, adding an additional EnCore service is a straightforward process involving adding a Web Service into the Java library (any major Java IDE like Eclipse or NetBeans has a built-in support for this), implementing a wrapper suitable for call from R through rJava and updating the package initialization function to include the newly developed service. Provided the conversion of the EnXML file into the R data frame is straightforward (being the case for majority of services), the effort of adding the new functionality is reduced and the whole process the package updating and maintenance gets simplified. \section{Types of identifiers, services and reports} As of this version, the ENVISIONQuery there are following important attributes in the package query syntax. The \Robject{"ids"} attribute determines the set of identifiers about which information is to be retrieved. It could be either a character vector of IDs (UNIPROT or Affymetrix probeset IDs as an example) or EnXML compliant text in the form of either character string or file. The \Robject{"typeName"}determines the type of input identifier set. The \Robject{"serviceName"}defines the type of service from which information should be retrieved, and \Robject{"toolName"}defines the tool to be used within the particular service. The \Robject{"options"} attribute defines any additional query parameters (maximum number of pathways for Reactome service being an example) while the \Robject{"filter"} attribute defines the additional filtering to be applied to the formatted results ( organism species, micro array platform etc.). A summary of ENVISIONQuery currently supported functionality is given in a Table 1.\\ \\ \\ % \begin{table}[h] \caption{Services Summary} \begin{tabular}{|c|c|c|c|} \hline \textbf{Service} & \textbf{Tools} & \textbf{Description} & \textbf{Source}\tabularnewline \hline ID & Affy2Uniprot, & Convert Affy probeset to UNIPROT & Ensembl \tabularnewline Conversion & Uniprot2Affy$^{\text{1}}$ & Convert UNIPROT to Affy probeset & \href{http://www.ensembl.org}{www.ensembl.org}\tabularnewline \hline Intact & FindPartners & Find protein-protein interaction & EMBL-EBI \tabularnewline & & (UNIPROT) & \href{http://www.ebi.ac.uk/intact}{www.ebi.ac.uk/intact}\tabularnewline \hline Picr & mapProteinsAdv & Map protein identifiers between & EMBL-EBI\tabularnewline & & various DBs$^{\text{2}}$ & \href{http://www.ebi.ac.uk/Tools/picr}{www.ebi.ac.uk/Tools/picr}\tabularnewline \hline Reactome & FindPathAdv & Finds pathways for specified proteins & Reactome \tabularnewline & & & \href{http://www.reactome.org}{www.reactome.org}\tabularnewline \hline \end{tabular}% \end{table} $^{\text{1}}$As of the end of January 2011 this service is temporarily unavailable. The development team has assured that it will be operational by the end of February 2011. $^{\text{2}}$The list of supported databases and their identifiers can be found at the bottom of the following page: \href{http://www.ebi.ac.uk/Tools/picr/WSDLDocumentation.do}{http://www.ebi.ac.uk/Tools/picr/WSDLDocumentation.do}.\\ \\ For detailed description of services as well as options supported for a particular service please use the links provided in Table 2. % \begin{table}[h] \caption{Service references} \begin{tabular}{|c|c|} \hline Service & Documentation URL\tabularnewline \hline ID Conversion & \href{http://code.google.com/p/enfin-core/wiki/wp1_encore_webservices_affy2uniprot}{http://code.google.com/p/enfin-core/wiki/wp1\_{}encore\_{}webservices\_{}affy2uniprot}\tabularnewline \hline Intact & \href{http://www.enfin.org/encore/wsdl/enfin-intact.wsdl}{http://www.enfin.org/encore/wsdl/enfin-intact.wsdl} $^{\text{1}}$\tabularnewline \hline Picr & \href{http://code.google.com/p/enfin-core/wiki/wp1_encore_webservices_picr}{http://code.google.com/p/enfin-core/wiki/wp1\_{}encore\_{}webservices\_{}picr}\tabularnewline \hline Reactome & \href{http://code.google.com/p/enfin-core/wiki/wp1_encore_webservices_reactome}{http://code.google.com/p/enfin-core/wiki/wp1\_{}encore\_{}webservices\_{}reactome}\tabularnewline \hline \end{tabular}% \end{table} $^{\text{1}}$At the moment, only formal service description is available. For informal description, please refer to \href{http://www.ebi.ac.uk/intact}{www.ebi.ac.uk/intact}. \section{Getting started} \subsection{Exploring package capabilities} The code snippet below shows how to check what services as well as tools and options for a given service are supported by \Robject{"ENVISIONQuery"} package. <>= library(ENVISIONQuery); #check available services getServiceNames(); #check available tools for a given service service<-getService("ID Conversion"); getToolNames(service); #check available input type for a given tool service<-getService("Reactome"); tool<-getTool(service,"FindPathAdv"); getInputTypes(tool); #getAvailable options for a given service getServiceOptions("Picr"); @ \subsection{Using basic ENVISIONQuery functionality} The set of queries below retrieves the formatted information from all 4 services supported in current version of the package using the default service options and providing the the set of request specific IDs as well as service, tool and input type attributes. <>= #convert the Affy probeset IDs to Uniprot IDs res<-ENVISIONQuery(ids=c("1553619_a_at","1552276_a_at","202795_x_at"), serviceName="ID Conversion",toolName="Affy2Uniprot",typeName="Affymetrix ID"); print(res); #retrieve the pathways for given Uniprot ID(s) res<-ENVISIONQuery(ids="P38398",serviceName="Reactome",typeName="Uniprot ID"); print(res[1:5,]); #retrieve protein-protein interactions res<-ENVISIONQuery(ids="P38398",serviceName="Intact",typeName="Uniprot ID"); print(res[1:5,]); #convert EnSembl IDs to Uniprot IDs res<-ENVISIONQuery(ids=c("ENSP00000397145","ENSP00000269554"), serviceName="Picr",typeName="Protein ID"); print(res); @ \subsection{Using Options and Filters} In order to alter the default service behavior, one needs to supply the options and or filter set in the form of the list where item names and values define to options/filters types and values. The code snippet below illustrates how to: \begin{enumerate} \item Retrieve the data from PICR service specifying the databases which the Uniprot IDs should be mapped to. \item Retrieve data from Reactome service adding coverage and total protein count for a given pathway to the output. \item Convert Affymetrix probeset IDs to Uniprot IDs filtering output on micro array type and organism species. \end{enumerate} <>= #match Uniprot IDs to EnSembl and TrEMBL options<-list("enfin-picr-search-database"=c("ENSEMBL_HUMAN","TREMBL")); res<-ENVISIONQuery(ids="P38398",options=options, serviceName="Picr",typeName="Protein ID"); print(res); #retrieve the pathways for given Uniprot ID(s)sorting them by coverage #and calculating the total protein count options<-list("enfin-reactome-add-coverage"="true", "enfin-reactome-sort-by-coverage"="true"); res<-ENVISIONQuery(ids="P38398",options=options, serviceName="Reactome",typeName="Uniprot ID"); print(res[1:5,]); #convert the Affy probeset IDs to UniProt IDs restricting #output by micro array type and organism species filter<-list(Microarray.Platform="affy_hg_u133_plus_2", organism.species="Homo sapiens"); res<-ENVISIONQuery(ids=c("1553619_a_at","1552276_a_at","202795_x_at"), filter=filter, serviceName="ID Conversion", toolName="Affy2Uniprot",typeName="Affymetrix ID"); print(res); @ \subsection{Cascading query requests.} The Enfin Encore architecture allows to use the EnXML output file as an input for another service, the resulting EnXML file being superposition of all intermediate results. The code snippet below shows how the ENVISIONQuery can be used to combine Intact and Reactome service requests into a single pipeline. Note, that as of the recent \Robject{"ENVISIONQuery"} does not support the formatted output for a cascaded call but we plan to provide it in the nearest future. In the meantime, the \Robject{"XML"} can be used to explore the resulting EnXML file. <>= #Intact-Reactome cascaded request for UniProt ID(s). IntactReactomeXML<-ENVISIONQuery(ids="P38398", serviceName=c("Intact","Reactome"),typeName="Uniprot ID",verbose=TRUE); #convert xml text into XMLDocument #using XML package for further exploring if(!is.null(IntactReactomeXML)){ xmlDoc<-xmlTreeParse(IntactReactomeXML,useInternalNodes = TRUE, asText=TRUE); class(xmlDoc); } @ \subsection{Defining ENVISIONQuery request attributes interactively} Sometimes it's desirable to specify the service, tool and/or input format interactively at the run time. This can be accomplished by specifying the {}``menu'' (or omitting the parameter) instead of providing a concrete service, tool or input type name as illustrated below. <>= filter<-list(Microarray.Platform="affy_hg_u133_plus_2", organism.species="Homo sapiens"); res<-ENVISIONQuery(ids=c("1553619_a_at","1552276_a_at","202795_x_at"), filter=filter, serviceName="menu",toolName="menu",typeName="menu"); print(res); @ \section{Running large queries} The current version of Enfin Encore services imposes a limitation on a number of input identifiers in the query request. To overcome this limitation, the ENVISIONQuery call processes data in chunks (using ENVISIONQuery.loop and ENVISIONQuery.chunks internally) returning the list of EnXML documents in case of unformatted output and merging the partial data frames into a single one in case the formatted output is requested. Unfortunately, the Enfin Encore documentation does not specify the limit on a number of input identifiers and if the limit is exceeded, just silently returns the input EnXML document or partial (and varying) resulting set. For Affy probeset to Uniprot conversion (Affy2Uniprot service) we have determined that the fail-safe chunk size is 1000 (default). For other services, the size could be different. In case the instability is observed, the chunk size can be altered by user. \section{Session information } This version of ENVISIONQuery has been developed with R 2.11.0. R session information: <>= toLatex(sessionInfo()) @ \section{Acknowledgments} We would like to thank Rafael Jimenez from the Enfin EnCore team for an enjoyable discussion and the great effort he has put into expanding the ID mapping related EnCore web services to better suite our needs. We also would like to thank Himanshu Grover, a graduate student in Biomedical Informatics at the University of Pittsburgh, for the great effort and useful suggestions he has made during the package preparation and testing. \section{References} {[}1{]} Florian Reisinger, Manuel Corpas1, John Hancock, Henning Hermjakob, Ewan Birney, and Pascal Kahlem1. ENFIN - An Integrative Structure for Systems Biology. Data Integration in the Life Sciences Lecture Notes in Computer Science, 2008, Volume 5109/2008, 132-143, DOI: 10.1007/978-3-540-69828-9\_13\\ {[}2{]} Identifier mapping performance for integrating transcriptomics and proteomics experimental results Roger S. Day1, Kevin K. McDade1, Uma Chandran1, Alex Lisovich, Thomas Conrads, Brian Hood, V.S.Kumar Kolli, David Kirchner, Traci Litzi, G. Larry Maxwell, in submission to BMC Bioinformatics \end{document}