%
%\VignetteIndexEntry{Rredland}
%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat}

\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{RDF processing for Bioconductor: \Rpackage{Rredland}}
\author{\copyright 2005 VJ Carey \texttt{}}
\maketitle

\tableofcontents

\section{Introduction}

Resource Description Framework (RDF) is a graph-based model for information.
RDF statements are ordered triples of the form (subject, predicate, object).
Subjects and objects are viewed as nodes in a directed graph, and
predicates are viewed as arcs in the graph.
RDF is a key component of current developments towards a semantic web,
with considerable work completed on web resource metadata representation
and exchange using RDF. A richer metadata model is provided by
OWL (Web Ontology Language), but most OWL models are serialized
using RDF/XML. Thus, as we will illustrate, various public OWL resources
can be processed by this package.

Redland is an open source software project
available from \url{librdf.org}. Redland is a C language library
with bindings provided to a variety of other languages.
Redland is highly modular, and allows developers to drop in components
to substitute for base functionalities. Because metadata
resources can be very voluminous, such flexibility is important.
A solution to the problem of persistent storage of indexed metadata
is provided through the use of BerkeleyDB serializations of Redland models.

\Rpackage{Rredland} is an R package that provides interfaces to facilities of Redland.
Configuration support is currently limited. You will be able to use
\Rpackage{Rredland} if you do a stock installation of librdf and BerkeleyDB.
If you have these resources in nonstandard locations,
you can set the Makevars variables in \texttt{src} to reflect your configuration.
You may also need to set \verb+LD_LIBRARY_PATH+.

\section{Illustration}

\subsection{Simple manipulations with a fragment of GO}

Eric Jain of ISB-CH has provided an RDF serialization of the
UniProt database and associated annotation resources, including
an RDF serialization of GO. A fragment of this serialization is
distributed with the \Rpackage{Rredland} package.

<<>>=
library(Rredland)
gofrag <- system.file("RDF/gopart.rdf", package="Rredland")
@

Here we dump the first 10 lines of this document as text:
<<>>=
readLines(gofrag,n=10)
@

This could be processed as an XML document, but let's use
Redland's modeling facilities. First we need to set up a
URI object for the model source document.
<<>>=
gouri <- makeRedlURI( paste("file:", gofrag, sep="") )
@

Now we read from this document. We will set the \texttt{useCore}
option to use in-memory storage.
<<>>=
gof <- readRDF( gouri )
gof
@

We are handed back an S4 object of class \Rclass{redlModel}.
<<>>=
getClass("redlModel")
@

We need to use the \Rfunction{model} accessor to get to the model reference.
We can easily compute the number of statements (also reported by \Rfunction{show}):
<<>>=
#getRedlModelSize(model(gof))
size(gof)
@

We can also transform the model to a data frame:
<<>>=
godf <- as(gof, "data.frame")
godf[1:4,]
@
We see that long text strings can cause a problem for rendering.
<<>>=
as.character(godf[1:4,3])
@

The data frame representation is useful for splitting up the statement set.
<<>>=
bypred <- split(godf, as.character(godf$predicate))
names(bypred)
sapply(bypred, nrow)
@
The \texttt{subClassOf} predicate helps determine the DAG structure:
<<>>=
bypred$"http://www.w3.org/2000/01/rdf-schema#subClassOf"[,-2]
@

\subsection{BioPAX Level 1}

The BioPAX pathway ontologies are distributed with the package.

<<>>=
bp1 <- makeRedlURI(paste("file:",system.file("RDF/biopax-level1.owl",
    package="Rredland"),sep=""))
bp1m <- readRDF( bp1 )
size(bp1m)
@

This is a manageable object, so we convert it to a data frame:
<<>>=
bp1df <- as(bp1m, "data.frame")
sapply(bp1df[1:5,], substring, 1, 70)
@

The namespace qualifications make the strings difficult to
render. A simple approach is to first eliminate any trailing XSD
information and then delete everything up to and including the pound sign.
<<>>=
strip2pound <- function(x) gsub(".*#","",cleanXSDT(as.character(x)))
sapply(bp1df[1:5,], strip2pound)
@

Working with a data frame, it is easy to filter statements of interest.
Suppose we wish to determine all the instances of \texttt{owl\#Class}
in the model.
<<>>=
isTypeOwlClass <- grep("owl#Class", as.character(bp1df[,3]))
strip2pound( bp1df[isTypeOwlClass,1] )
@

We see a number of decipherable terms, and some tokens of the
form (rnnn...). The latter are called blank nodes. These
are created to define classes that have no names, but that
are implicitly defined in the model. For example, a class that
is the union of entity and physicalEntity is a blank node in this model.
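
The separation of named classes from blank nodes described above can be
sketched in base R on a toy data frame. This is only an illustration under
stated assumptions: the column names, the three toy statements, and the
blank-node label \texttt{(r1)} are invented for the example, and the rule
that a subject lacking a pound sign is a blank node is a convenience for
this toy data, not a property guaranteed by \Rpackage{Rredland}.

\begin{verbatim}
# Toy triples; column names and the blank-node label "(r1)" are
# illustrative assumptions, not output from Rredland.
ns <- "http://www.biopax.org/release/biopax-level1.owl#"
toy <- data.frame(
  subject   = c(paste(ns, "entity", sep=""), "(r1)",
                paste(ns, "pathway", sep="")),
  predicate = rep("http://www.w3.org/1999/02/22-rdf-syntax-ns#type", 3),
  object    = rep("http://www.w3.org/2002/07/owl#Class", 3),
  stringsAsFactors = FALSE)

# Keep the statements whose object is owl#Class
isClass <- grepl("owl#Class", toy$object, fixed=TRUE)
subs <- toy$subject[isClass]

# Assumption for this toy data: blank nodes carry no namespace,
# so they contain no pound sign
isBlank <- !grepl("#", subs, fixed=TRUE)

named <- sub(".*#", "", subs[!isBlank])  # strip namespace prefix
blank <- subs[isBlank]

named   # "entity" "pathway"
blank   # "(r1)"
\end{verbatim}

Keeping the two groups in separate vectors makes it easy to report the
named classes while still counting how many anonymous classes the model defines.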
To get the detailed commentary on a class definition, the following functions
can be used:
<<>>=
chopLong <- function(x,nword=12) {
 # split the text into words and lay them out nword per line
 tvec <- strsplit(x," ")[[1]]
 ltvec <- length(tvec)
 if (ltvec %% nword != 0) {
   # pad with blanks up to the next multiple of nword
   pad <- rep(" ", ceiling(ltvec/nword)*nword)
   pad[1:ltvec] <- tvec
 } else pad <- tvec
 ss <- matrix(pad,nrow=nword)
 ss <- rbind(ss,"\n")
 paste(ss,collapse=" ")
}
<<>>=
getClassComment <- function(term, df,
   nsPref="http://www.biopax.org/release/biopax-level1.owl#",
   commPred= "http://www.w3.org/2000/01/rdf-schema#comment",
   doChop=TRUE, nword=12 ) {
 ind <- which( as.character(df[,1]) == paste(nsPref,term,sep="") &
       as.character(df[,2]) == commPred )
 # look up the comment in the df argument, not in the global bp1df
 txt <- cleanXSDT(as.character(df[ind,3]))
 if (doChop) chopLong(txt, nword=nword) else txt
}
cat(getClassComment("chemicalStructure", bp1df ))
cat(getClassComment("biochemicalReaction", bp1df ))
@

\subsection{BioPAX Level 2}

Here we check the classes available in BioPAX level 2.

<<>>=
bp2 <- makeRedlURI(paste("file:",system.file("RDF/biopax-level2.owl",
    package="Rredland"),sep=""))
bp2m <- readRDF( bp2 )
size(bp2m)
bp2df <- as(bp2m, "data.frame")
isTypeOwlClass <- grep("owl#Class", as.character(bp2df[,3]))
strip2pound( bp2df[isTypeOwlClass,1] )
@

%\subsection{HumanCyc}
%
%The BioCyc project (\url{www.biocyc.org}) is a collection of
%pathway/genome databases in a variety of structures. The data resources
%are available to academic researchers, and a registration/download process
%must be completed for access. We illustrate use of \Rpackage{Rredland}
%to work with the BioPAX encoding of HumanCyc. This is 19MB of RDF
%and an in-core storage model is not likely to be satisfactory.
%We will use the default BerkeleyDB storage approach.

<<eval=FALSE,echo=FALSE>>=
humu <- makeRedlURI(paste("file:","humancyc.owl",sep=""))
humm <- readRDF( humu, storageType="bdb", storageName="hucyc")
@
%Note that the vignette cannot assume that you have this OWL file.
%After the above commands, we have
%
%\begin{verbatim}
%-rw-r--r--  1 stvjc  stvjc  59723776 Jul 28 13:09 test-sp2o.db
%-rw-r--r--  1 stvjc  stvjc  39538688 Jul 28 13:07 test-po2s.db
%-rw-r--r--  1 stvjc  stvjc  57499648 Jul 28 13:07 test-so2p.db
%\end{verbatim}
%These are the BerkeleyDB hashes representing aspects of the graph.
%
%It is not too difficult to transform into a data frame.
<<eval=FALSE,echo=FALSE>>=
hudf <- as(humm, "data.frame")
husubs <- as.character(hudf[,1])
hupreds <- as.character(hudf[,2])
huobs <- as.character(hudf[,3])
table(hupreds)
@
%To find the named pathways,
<<eval=FALSE,echo=FALSE>>=
isPw <- grep("pathway", husubs)
isNa <- grep("NAME", hupreds)
isnp <- intersect(isPw, isNa)
cleanXSDT(huobs[isnp][1:10])
@
%So we see in the predicate set what kinds of relationships are
%described, and we get a glimpse of the pathway names addressed
%in this resource.
%
%Note that there is no need to parse the data once the Berkeley
%DB hashes are made available. The BDBSexists option on readRedlModel
%can be used to revive a model-hash association.

\subsection{HumanCyc}

The BioCyc project (\url{www.biocyc.org}) is a collection of
pathway/genome databases in a variety of structures. The data resources
are available to academic researchers, and a registration/download process
must be completed for access. We illustrate use of \Rpackage{Rredland}
to work with the BioPAX encoding of HumanCyc. This is 19MB of RDF,
and an in-core storage model is not likely to be satisfactory.
We will use the default BerkeleyDB storage approach.

\begin{Schunk}
\begin{Sinput}
> humu <- makeRedlURI(paste("file:", "humancyc.owl", sep = ""))
> humm <- readRDF(humu, storageType = "bdb", storageName = "hucyc")
\end{Sinput}
\end{Schunk}
Note that the vignette cannot assume that you have this OWL file.
After the above commands, we have

\begin{verbatim}
-rw-r--r--  1 stvjc  stvjc  59723776 Jul 28 13:09 test-sp2o.db
-rw-r--r--  1 stvjc  stvjc  39538688 Jul 28 13:07 test-po2s.db
-rw-r--r--  1 stvjc  stvjc  57499648 Jul 28 13:07 test-so2p.db
\end{verbatim}
These are the BerkeleyDB hashes representing aspects of the graph.

It is not too difficult to transform into a data frame.
\begin{Schunk}
\begin{Sinput}
> hudf <- as(humm, "data.frame")
> husubs <- as.character(hudf[, 1])
> hupreds <- as.character(hudf[, 2])
> huobs <- as.character(hudf[, 3])
> table(hupreds)
\end{Sinput}
\begin{Soutput}
hupreds
http://www.biopax.org/release/biopax-level1.owl#AUTHORS                     31432
http://www.biopax.org/release/biopax-level1.owl#CELLULAR-LOCATION            2800
http://www.biopax.org/release/biopax-level1.owl#COFACTOR                       11
http://www.biopax.org/release/biopax-level1.owl#COMMENT                      1231
http://www.biopax.org/release/biopax-level1.owl#COMPONENTS                     36
http://www.biopax.org/release/biopax-level1.owl#CONTROL-TYPE                   36
http://www.biopax.org/release/biopax-level1.owl#CONTROLLED                   2216
http://www.biopax.org/release/biopax-level1.owl#CONTROLLER                   2216
http://www.biopax.org/release/biopax-level1.owl#DATA-SOURCE                   167
http://www.biopax.org/release/biopax-level1.owl#DB                          12251
http://www.biopax.org/release/biopax-level1.owl#DELTA-G                        23
http://www.biopax.org/release/biopax-level1.owl#EC-NUMBER                     872
http://www.biopax.org/release/biopax-level1.owl#ID                          12251
http://www.biopax.org/release/biopax-level1.owl#LEFT                         1968
http://www.biopax.org/release/biopax-level1.owl#MOLECULAR-WEIGHT              666
http://www.biopax.org/release/biopax-level1.owl#NAME                         6046
http://www.biopax.org/release/biopax-level1.owl#NEXT-STEP                     895
http://www.biopax.org/release/biopax-level1.owl#ORGANISM                     1730
http://www.biopax.org/release/biopax-level1.owl#PATHWAY-COMPONENTS           1049
http://www.biopax.org/release/biopax-level1.owl#PHYSICAL-ENTITY              2800
http://www.biopax.org/release/biopax-level1.owl#RIGHT                        2020
http://www.biopax.org/release/biopax-level1.owl#SEQUENCE                       12
http://www.biopax.org/release/biopax-level1.owl#SOURCE                       5534
http://www.biopax.org/release/biopax-level1.owl#SPONTANEOUS                     3
http://www.biopax.org/release/biopax-level1.owl#STEP-INTERACTIONS            2869
http://www.biopax.org/release/biopax-level1.owl#STOICHIOMETRIC-COEFFICIENT   2783
http://www.biopax.org/release/biopax-level1.owl#STRUCTURE                     776
http://www.biopax.org/release/biopax-level1.owl#STRUCTURE-DATA                776
http://www.biopax.org/release/biopax-level1.owl#STRUCTURE-FORMAT              776
http://www.biopax.org/release/biopax-level1.owl#SYNONYMS                    10032
http://www.biopax.org/release/biopax-level1.owl#TAXON-XREF                      1
http://www.biopax.org/release/biopax-level1.owl#TERM                           10
http://www.biopax.org/release/biopax-level1.owl#TITLE                        5534
http://www.biopax.org/release/biopax-level1.owl#XREF                        13605
http://www.biopax.org/release/biopax-level1.owl#YEAR                         5460
http://www.w3.org/1999/02/22-rdf-syntax-ns#type                             22984
http://www.w3.org/2000/01/rdf-schema#comment                                    1
\end{Soutput}
\end{Schunk}
To find the named pathways,
\begin{Schunk}
\begin{Sinput}
> isPw <- grep("pathway", husubs)
> isNa <- grep("NAME", hupreds)
> isnp <- intersect(isPw, isNa)
> cleanXSDT(huobs[isnp][1:10])
\end{Sinput}
\begin{Soutput}
 [1] "\"biosynthesis of aspartate and asparagine; interconversion of aspartate and asparagine.\""
 [2] "\"serine and glycine biosynthesis\""
 [3] "\"alanine biosynthesis II\""
 [4] "\"alanine biosynthesis I\""
 [5] "\"alanine biosynthesis III\""
 [6] "\"superpathway of alanine biosynthesis\""
 [7] "\"arginine biosynthesis III\""
 [8] "\"citrulline biosynthesis\""
 [9] "\"asparagine biosynthesis I\""
[10] "\"aspartate biosynthesis and degradation\""
\end{Soutput}
\end{Schunk}
So we see in the predicate set what kinds of relationships are
described, and we get a glimpse of the pathway names addressed
in this resource.

Note that there is no need to parse the data once the Berkeley
DB hashes are made available. The \Rfunarg{BDBSexists} option on
\Rfunction{readRedlModel} can be used to revive a model-hash association.
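
Before deciding whether a document must be parsed again, one can check
whether the hash files are already on disk. The following is a minimal
base R sketch, not part of \Rpackage{Rredland}: the helper names
\texttt{hashFiles} and \texttt{haveHashes} are hypothetical, and the file
naming pattern (storage name followed by \texttt{-sp2o.db},
\texttt{-po2s.db}, \texttt{-so2p.db}) is an assumption read off the
directory listing shown earlier.

\begin{verbatim}
# Hypothetical helper (not part of Rredland): build the three hash
# file names for a storage name, assuming the "<name>-sp2o.db"
# pattern suggested by the directory listing.
hashFiles <- function(storageName, dir=".") {
  file.path(dir, paste(storageName, "-",
      c("sp2o", "po2s", "so2p"), ".db", sep=""))
}

# TRUE only if all three hash files are already on disk
haveHashes <- function(storageName, dir=".") {
  all(file.exists(hashFiles(storageName, dir)))
}

# Demonstration in a scratch directory with empty stand-in files
d <- tempfile("bdbdemo")
dir.create(d)
haveHashes("test", d)              # FALSE: nothing written yet
file.create(hashFiles("test", d))
haveHashes("test", d)              # TRUE: parsing could be skipped
unlink(d, recursive=TRUE)
\end{verbatim}

A check of this kind only confirms that the files exist; it says nothing
about whether they were built from the current version of the source document.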

\section{Future work}

We will need to take unions of RDF models, and C code will be
required for that. We need R interfaces to Redland approaches to model filtering.
Some graph/set-theoretic operations can be introduced to
bring in some RDF/RDFS inferencing.

\end{document}