%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!
% The .tex file is likely to be overwritten.
%
% \VignetteIndexEntry{HOWTO: Use the online query tools}
% \VignetteDepends{annotate, hgu95av2.db}
% \VignetteKeywords{Expression Analysis, Annotation}
% \VignettePackage{annotate}
\documentclass{article}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage{times}

\bibliographystyle{plainnat}

\author{Jeff Gentry and Robert Gentleman}

\begin{document}

\title{HowTo: Querying Online Data}
\maketitle{}

\section{Overview}

This article demonstrates how to use the tools that are provided for
on-line querying of data resources. These tools rely on others (such as
the NLM and NCBI) providing and documenting appropriate web interfaces.
The tools described here let you either retrieve the data (usually in
XML) or have it rendered in a browser on the local machine. To use them
you will need the \Rpackage{Biobase}, \Rpackage{XML}, and
\Rpackage{annotate} packages. The functionality in this article was
first described in \citet{PubMedRnews}, although some enhancements have
been made since that article was written.

Assembling and using meta-data annotation is a non-trivial task. In the
Bioconductor Project we have developed tools to support two different
methods of accessing meta-data. The first is based on obtaining data
from a variety of sources, curating it, and packaging it in a form that
is suitable for analysing microarray data. The second is to make use of
on-line resources such as those provided by NLM and NCBI. The functions
described in this vignette provide infrastructure for the second type of
meta-data usage.

We first describe the functions that allow users to specify queries and
open the appropriate web page on their local machine. Then we
investigate the much richer set of tools provided by NLM for accessing
and working with PubMed data.

\section{Using the Browser}

There are currently four functions that provide access to online data
resources. They are:
\begin{description}
\item[genbank] Users specify GenBank identifiers and can request that
  they be rendered in the browser or returned in XML.
\item[pubmed] Users specify PubMed identifiers and can request that
  they be rendered in the browser or returned in XML. More details on
  parsing and manipulating the XML are given below.
\item[entrezGeneByID] Users specify Entrez Gene identifiers and the
  appropriate links are opened in the browser. Entrez Gene does not
  provide XML, so there is currently no download option. The user can
  request that the URL be rendered or returned.
\item[entrezGeneQuery] Users specify a string that will be used as the
  Entrez Gene query, together with the species of interest (there can
  be several). The user can request that the URL be rendered or
  returned.
\end{description}

Both \Rfunction{genbank} and \Rfunction{pubmed} can return XML versions
of the data. These returned values can subsequently be processed using
functionality provided by the \Rpackage{XML} package \citep{XML}.
Specific details and examples for PubMed are given in
Section~\ref{sec:pmq}.

The function \Rfunction{entrezGeneByID} takes a set of known Entrez
Gene identifiers and constructs a URL that will have these rendered.
The user can either have the URL opened in the browser or returned,
perhaps to send to someone else or to embed in an HTML page (see the
vignette on creating HTML output for more details).
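Since \Rfunction{entrezGeneByID} is not demonstrated elsewhere in this
vignette, the following sketch illustrates one way it might be called.
The identifiers are arbitrary placeholders, and the exact interface (in
particular, how the choice between rendering and returning the URL is
expressed) has varied between versions of \Rpackage{annotate}; see
\texttt{?entrezGeneByID} for the definitive argument list.

\begin{verbatim}
## A sketch only: "100" and "1000" are arbitrary placeholder
## Entrez Gene identifiers.
urls <- entrezGeneByID(c("100", "1000"))
urls

## If the URLs are returned rather than rendered, they can be
## opened manually, e.g. with browseURL() from the utils package.
browseURL(urls[1])
\end{verbatim}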
The function \Rfunction{entrezGeneQuery} takes a character string to be
used as an Entrez Gene query, together with the species of interest.
For example, the call
\begin{verbatim}
 entrezGeneQuery("leukemia", "Homo sapiens")
\end{verbatim}
will find all human genes whose Entrez Gene records are associated with
the word leukemia. Note that the R code is merely an interface to the
services provided by NLM and NCBI; users are referred to those sites
for complete descriptions of the algorithms they use for searching and
so on.

\section{Accessing PubMed information}
\label{sec:pmq}

In this section we demonstrate how to query PubMed and how to operate
on the data that are returned. As noted above, these queries generate
XML, which must then be parsed to extract the specific data items of
interest.

Our example is based on the \Robject{sample.ExpressionSet} data from
the \Rpackage{Biobase} package. Users should be able to easily replace
these data with their own.

<<>>=
library("annotate")
data(sample.ExpressionSet)
affys <- featureNames(sample.ExpressionSet)[490:500]
affys
@

Here we have selected an arbitrary set of 11 genes of interest from our
sample data. However, \verb+sample.ExpressionSet+ provides Affymetrix
identifiers, and the \Rfunction{pubmed} function requires PubMed IDs.
To obtain these, we can use the annotation tools in
\Rpackage{annotate}.

<<>>=
library("hgu95av2.db")
ids <- getPMID(affys, "hgu95av2")
ids <- unlist(ids, use.names=FALSE)
ids <- unique(ids[!is.na(as.numeric(ids))])
length(ids)
ids[1:10]
@

We use \Rfunction{getPMID} to obtain the PubMed identifiers related to
our probes of interest. We then drop any probes that have no PMIDs and
remove duplicates. The mapping to PMIDs is actually based on Entrez
Gene identifiers, and since the mapping from Affymetrix IDs to Entrez
Gene is many to one, there is some chance of duplication. From our
initial \Sexpr{length(affys)} Affymetrix identifiers we obtain
\Sexpr{length(ids)} unique PubMed identifiers (i.e. papers).

For each of these papers we can obtain information such as the title,
the authors, the abstract, and the Entrez Gene identifiers for genes
referred to in the paper, among many other pieces of information.
Again, for a complete listing and description the reader is referred to
the NLM website.

We next generate the query and store the results in a variable named
\Robject{x}. This object is of class \Robject{XMLDocument}, and to
manipulate it we will use functions provided by the \Rpackage{XML}
package.

<<>>=
x <- pubmed(ids)
a <- xmlRoot(x)
numAbst <- length(xmlChildren(a))
numAbst
@

Our search of the \Sexpr{length(ids)} PubMed IDs (from the
\Sexpr{length(affys)} Affymetrix IDs) has resulted in \Sexpr{numAbst}
abstracts from PubMed, stored in R as XML.

The \Rpackage{annotate} package also provides a \Robject{pubMedAbst}
class, which takes the raw XML returned by a call to \Rfunction{pubmed}
and extracts the interesting sections for easy review.

<<>>=
arts <- vector("list", length=numAbst)
absts <- rep(NA, numAbst)
for (i in 1:numAbst) {
   ## Generate the pubMedAbst object for this abstract
   arts[[i]] <- buildPubMedAbst(a[[i]])
   ## Retrieve the abstract text for this abstract
   absts[i] <- abstText(arts[[i]])
}
arts[[7]]
@

In the S language we say that the \Robject{pubMedAbst} class has a
number of different slots. They are:
\begin{description}
\item[authors] The vector of authors.
\item[pmid] The PubMed record number.
\item[abstText] The actual text of the abstract.
\item[articleTitle] The title of the article.
\item[journal] The journal in which it was published.
\item[pubDate] The publication date.
\end{description}
These can all be extracted individually using the provided accessor
methods, such as \Rfunction{abstText} in the example above. As you can
see, the \Robject{pubMedAbst} class captures several key pieces of
information: the authors, abstract text, article title, journal, and
publication date.
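To pull a particular slot out of every abstract at once, an accessor
can be applied across the list of \Robject{pubMedAbst} objects. The
sketch below collects the article titles, assuming an
\Rfunction{articleTitle} accessor analogous to the \Rfunction{abstText}
accessor used above; check the class documentation
(\texttt{?pubMedAbst}) if the accessor names differ in your version of
\Rpackage{annotate}.

\begin{verbatim}
## A sketch: gather one slot across all abstracts, assuming an
## articleTitle() accessor analogous to abstText() above.
titles <- sapply(arts, articleTitle)
titles[1:3]
\end{verbatim}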
Once the abstracts have been assembled you can search them using any of
the standard text-searching techniques. Suppose, for example, that we
want to know which abstracts contain the term \texttt{cDNA}; the
following code chunk identifies them.

<<>>=
found <- grep("cDNA", absts)
goodAbsts <- arts[found]
length(goodAbsts)
@

So \Sexpr{length(goodAbsts)} of the articles relating to our genes of
interest mention the term \texttt{cDNA} in their abstracts.

Lastly, as a demonstration of how the \verb+query+ toolset can be used
to cross reference several databases, we can reuse the same set of
PubMed IDs with another function. In this example, the
\Rfunction{genbank} function is called with the argument
\Robject{type="uid"}. By default, \Rfunction{genbank} assumes that the
supplied IDs are GenBank accession numbers, but here we have PubMed
IDs; the \Robject{type="uid"} argument specifies that we are using
PubMed IDs (also known as NCBI UID numbers).

<<>>=
y <- genbank(ids[1:10], type="uid")
b <- xmlRoot(y)
@

At this point the object \Robject{b} can be manipulated in the same way
as \Robject{a} from the PubMed example. Also note that both
\Rfunction{pubmed} and \Rfunction{genbank} can display the data
directly in the browser instead of returning XML, by specifying
\Robject{disp="browser"} in the call.

\section{Generating HTML output for your abstracts}

Many users find it convenient to have a web page created with links for
all of their abstracts, leading to the actual PubMed pages online.
These pages can then be distributed to other people who are interested
in the abstracts that you have found. Two formats are available: the
first produces a simple HTML page with a link for every abstract, and
the second produces a framed HTML page with the links on the left and
the corresponding PubMed page in the main frame. For these examples we
write to temporary files:

<<>>=
fname <- tempfile()
pmAbst2HTML(goodAbsts, filename=fname)

fnameBase <- tempfile()
pmAbst2HTML(goodAbsts, filename=fnameBase, frames=TRUE)
@

\begin{figure}[htb]
\begin{center}
\includegraphics{noframes}
\caption{pmAbst2HTML without frames}
\includegraphics{frames}
\caption{pmAbst2HTML with frames}
\end{center}
\end{figure}

\section{Session Information}

The version number of R and the packages loaded when generating this
vignette were:

<<>>=
sessionInfo()
@

\bibliography{annotate}

\end{document}