\documentclass[nogin,a4paper,11pt]{article}
\usepackage{bm} %needed for bold greek letters and math symbols
\usepackage{hyperref} % use for hypertext links, including those to external documents and URLs
\usepackage{graphicx} % need for PS figures
\usepackage{natbib} %number and author-year style referencing
\usepackage{fullpage} %full page support
\usepackage{bookmark}
\usepackage[small]{caption}
\parskip 8pt
\setlength{\abovecaptionskip}{-1ex}
% \setlength{\belowcaptionskip}{2ex}
\pagestyle{plain}
%\VignetteIndexEntry{Programmatic retrieval of information from DAS servers}

\begin{document}
\title{Programmatic retrieval of information from DAS \\ servers using the \textbf{DASiR} package}
\author{Oscar Flores Guri and Anna Mantsoki \\ \small{Institute for Research in Biomedicine \& Barcelona Supercomputing Center} \\ \small{Joint Program on Computational Biology}}
%\date{} %comment to include current date
\maketitle

\section{Introduction}
\label{sec:intro}

The Distributed Annotation System (DAS) is a protocol for information exchange between a server and a client. It is widely used in bioinformatics, and the most important repositories run a DAS server alongside their main front-end. A few examples are \href{http://genome.ucsc.edu/cgi-bin/das/dsn}{``UCSC''}, \href{http://www.ensembl.org/das/dsn}{``Ensembl''} and \href{http://www.ebi.ac.uk/das-srv/uniprot/das}{``UniProt''}.

The \textsf{DASiR} package provides a convenient R-DAS interface to programmatically access the DAS servers available from your network. It supports the main features of the DAS 1.6 protocol, giving R users access to a large amount of biological information.

DAS relies on XML and HTTP, so server deployment and client requirements are significantly lighter than for MySQL- or BioMart-based alternatives. A browsable list of more than 1500 online DAS servers is available at \url{http://www.dasregistry.org}.
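
Because DAS is plain XML over HTTP, every query is ultimately just a URL. As a rough illustration only, the sketch below composes a DAS 1.x style request of the form \texttt{server/source/command?segment=ref:start,stop}. The helper name \texttt{buildDasUrl} is hypothetical and is not part of \textsf{DASiR}; the package builds and parses such requests for you, and its internals may differ.

<<>>=
# Hypothetical helper, for illustration only (not a DASiR function):
# compose a DAS request URL from its parts using base R.
buildDasUrl <- function(server, source, command,
                        ref = NULL, start = NULL, end = NULL) {
  # Drop any trailing slash, then join server, source and command
  url <- paste(sub("/+$", "", server), source, command, sep = "/")
  if (!is.null(ref)) {
    segment <- ref
    # Add the coordinate range only when both ends are given
    if (!is.null(start) && !is.null(end)) {
      segment <- sprintf("%s:%d,%d", ref, as.integer(start), as.integer(end))
    }
    url <- paste0(url, "?segment=", segment)
  }
  url
}

buildDasUrl("http://genome.ucsc.edu/cgi-bin/das", "sacCer3", "features",
            ref = "I", start = 22000, end = 30000)
@

Pasting the resulting URL into a web browser is a quick way to inspect the raw XML that a server sends back.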

Although the DAS protocol supports querying different kinds of data, \textsf{DASiR} has been designed with range-based features in mind. Querying genomic sequences and protein structures is also supported, but you will probably find better ways to access such data in R if you require intensive use of them (\textsf{Biostrings} genomes or the \textsf{bio3d} package for PDB structures, for example).

Here is a brief summary of the main functions of \textsf{DASiR}:
\begin{itemize}
\item Set/get the DAS server: \texttt{setDasServer}, \texttt{getDasServer}
\item Retrieve the data sources: \texttt{getDasSource}, \texttt{getDasDsn}
\item Retrieve the entry points and types: \texttt{getDasEntries}, \texttt{getDasTypes}
\item Retrieve information about the features: \texttt{getDasFeature}
\item Nucleotide/amino acid sequence retrieval: \texttt{getDasSequence}
\item Protein 3D structure retrieval: \texttt{getDasStructure}
\end{itemize}

For detailed information about these functions and how to use them, refer to the \textsf{DASiR} manual, or see the following sections for an overview.

<>=
library(DASiR)
?DASiR
@

\section{DAS metadata information handling}
\label{sec:DAS}

An important part of the dialog between client and server consists of finding out which information each server holds, so \textsf{DASiR} provides the following functions to query metadata from a DAS server.

The first step before querying and retrieving information is, of course, to set the DAS server we will use during this session. Notice that \textsf{DASiR} only supports one active DAS session per R instance. To set the server we use the function \texttt{setDasServer}:

<>=
setDasServer(server="http://genome.ucsc.edu/cgi-bin/das")
getDasServer()
@

The function \texttt{getDasSource()} returns the ids, titles and capabilities of the data sources available in the server as a data frame.
This function is important because the other functions of the package require the exact name (``id'' or ``title'', depending on the server) as the reference name.

<>=
#sources = getDasSource() #This will fail for UCSC (but not for ENSEMBL)
sources = getDasDsn() #This will fail for ENSEMBL (but not for UCSC)
head(sources)
@

We should also take into account the capabilities that this function reports for each data source. If we query an unknown name, or ask for a capability that is not implemented in the server (for example, an atomic structure from a sequence server), \textsf{DASiR} will return a NULL value. Due to the nature of HTTP-based queries, we cannot detect the origin of this error without overloading the server with cross-calls.

Although ``sources'' is the common way to recover server features, some servers (such as UCSC) still use the deprecated ``dsn'' option to get the ids of the different datasets. The rule of thumb is that if the first doesn't work, try the second.

Once we have the name of the database we want to query, we need to know where we can look. This is called an ``entry point'' and can be anything from a protein ID to a chromosome number (depending on the type of server, of course). Notice that the output will be a GRanges object by default (if the server supplies start, stop and id values). If this is not possible, or if \texttt{as.GRanges=FALSE}, the output will be a data.frame.

<>=
source = "sacCer3"
entries = getDasEntries(source, as.GRanges=TRUE)
head(entries)
@

Finally, the different attributes we can ask for, called ``types'', can be obtained with the function \texttt{getDasTypes}.

<>=
types = getDasTypes(source)
head(types)
@

Think of sources as the name of the database, entries as the tables of this database and types as the columns of these tables.
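
Returning to the ``sources'' versus ``dsn'' rule of thumb described earlier in this section, the fallback can be written as a small helper. This is only a sketch: \texttt{firstWorking} is a hypothetical name, not a \textsf{DASiR} function, and it assumes that an unsupported command either raises an R error or returns NULL. The stand-in functions replace real server calls so that the chunk runs without a network connection.

<<>>=
# Hypothetical helper (not part of DASiR): call each function in turn and
# return the first result that is neither an error, NULL, nor empty.
firstWorking <- function(...) {
  for (f in list(...)) {
    res <- tryCatch(f(), error = function(e) NULL)
    if (!is.null(res) && length(res) > 0) return(res)
  }
  NULL  # every attempt failed
}

# With DASiR loaded and a server set, one might write:
# sources <- firstWorking(getDasSource, getDasDsn)

# Stand-ins for illustration only; the returned names are placeholders.
failing <- function() stop("command not supported by this server")
working <- function() c("sourceA", "sourceB")
firstWorking(failing, working)
@

Passing the functions as arguments keeps the order of the attempts explicit, so the preferred command is always tried first.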

In general, most of the functions return a \texttt{character} vector, except for \texttt{getDasSource} and \texttt{getDasEntries}, which return a data.frame with the name of the entry, the ranges of the elements and possibly other attributes, depending on the server.

\section{Querying features}
\label{sec:DAS features}

The \texttt{getDasFeature} function queries the DAS server and returns the available information for the type(s) at the given range(s).

<>=
ranges=entries[c(1,2)]
types=c("sgdGene","mrna")
features = getDasFeature(source, ranges, types)
head(features)
@

A \texttt{data.frame} is returned with the contents of the specific annotation for the given ranges and types in the server (again, note that servers may provide different or additional information for types with the same name).

\section{Querying nucleotide or amino acid sequence}
\label{sec:DAS sequence}

The \texttt{getDasSequence} function queries the DAS server and retrieves the nucleotide or amino acid sequence for the given ranges.

<>=
#Now we will retrieve sequences from the VEGA server
setDasServer("http://vega.sanger.ac.uk/das")
source = "Homo_sapiens.VEGA51.reference"
ranges = GRanges(c("1","2"), IRanges(start=10e6, width=1000))
#Returning a character vector; we only ask for the first 50 bases in each range
sequences = getDasSequence(source, resize(ranges, fix="start", 50))
print(sequences)
@

A \texttt{character} vector is returned (the default class) containing the nucleotide or amino acid sequences that match the content of the annotation for the given ranges in the server.
Automatic conversion to \textsf{Biostrings} classes is supported through the \texttt{class} argument:

<>=
#Now we specify that we want an AAStringSet (Biostrings class for amino acid strings)
#and query for the whole sequence length
sequences = getDasSequence(source, ranges, class="AAStringSet")
print(sequences)
@

\section{Query a protein 3D structure}
\label{sec:DAS structure}

The \texttt{getDasStructure} function queries the DAS server and retrieves the 3D structure, including metadata and coordinates, for the given query (the ID of the reference structure) using the data source id (or title).

<>=
#On 2013-03-05 there are 2 structure servers in dasregistry.org...
setDasServer(server="http://das.sanger.ac.uk/das")
#Get the sources with "structure" capability...
sources = getDasSource()
sources[grep("structure", sources$capabilities),]
source="structure" #...whose name is also "structure"
query="1HCK" #PDB code
structure=getDasStructure(source,query)
head(structure)
@

A \texttt{data.frame} is returned with the information for every atom of the reference structure of the given query. Notice that a single residue (identified by groupID and name) contains several atoms. Depending on your purposes, you may find this way of representing structural information inconvenient (for example, if you are used to working with the PDB format).

Structure queries seem to be a secondary feature of the DAS protocol (only 2 servers out of more than 1500 support them), but we wanted to support as many features of the DAS protocol as possible. We realize that there are better options for working with PDB files, but this one may help you with quick structural analyses.

\section{Plotting of obtained features}
\label{sec:plot features}

The \texttt{plotFeatures} function creates a basic plot with the features retrieved by \texttt{getDasFeature} or adds them to an existing plot.

<>=
#Let's retrieve some genes from the UCSC Genome Browser DAS server now
setDasServer(server="http://genome.ucsc.edu/cgi-bin/das")
#Official yeast genes and other annotated features in the range I:22k-30k
source = "sacCer3" #Saccharomyces cerevisiae
range = GRanges(c("I"), IRanges(start=22000, end=30000))
type = c("sgdGene", "sgdOther") #These are also the names of the UCSC tracks
features = getDasFeature(source, range, type)
#Only the main columns
head(features[,c("id", "label", "type", "start", "end")])
plotFeatures(features, box.height=10, box.sep=15, pos.label="top", xlim=c(22000,30000))
@

<>=
par(mar=c(6,3,0,3), cex=0.75)
plotFeatures(features, box.height=10, box.sep=15, pos.label="top", xlim=c(22000,30000))
@

\begin{figure}[h]
\centering
<>=
<>
@
\caption{\label{fig:plotFeatures} Basic feature plot generated with \texttt{getDasFeature} and \texttt{plotFeatures}}
\end{figure}

The \texttt{plotFeatures} function is a simple way of drawing text boxes for the retrieved features in a new or an existing plot. Notice that when overplotting on an already open graphical device, the x-coordinates must match.

That is all. You can find more detailed information and examples for each function in the R manual pages for \textsf{DASiR}. We hope this package helps you in your data mining.

\end{document}