%
% NOTE -- ONLY EDIT .Rnw!!!
% .tex file will get overwritten.
%
%\VignetteIndexEntry{ontoTools: sgdiOntology}
%\VignetteDepends{}
%\VignetteKeywords{Genomics, Ontology}
%\VignettePackage{ontoTools}
%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\newcommand{\bi}{\begin{itemize}}
\newcommand{\ei}{\end{itemize}}

\textwidth=6.2in

\bibliographystyle{plainnat}

\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{SGDI ontology development}
\author{VJ Carey \url{stvjc@channing.harvard.edu}}
\maketitle

\section{Introduction}

This document describes tasks related to ontology development
for the SGDI (software for genomic data integration) project.
Principal concerns include
\begin{itemize}
\item establishment of conventions for describing designs of
and samples from microarray experiments;
\item establishment of software tools that help implement
these conventions;
\item maximizing reuse of programmatically accessible vocabulary
resources, such as those provided by caCORE/EVS;
\item employing appropriate standards for metadata structure
design and deployment, such as RDF/OWL models and associated
XML serializations.
\end{itemize}

\section{Implementation issues}

We will distinguish three basic information structures:
\begin{itemize}
\item {\bf provenanceStruct:} a container for information regarding
the source and maintenance of ontology-related information;
\item {\bf nomenclature:} a set of tokens representing terms or
concepts, with specified provenance and definitions;
\item {\bf ontology:} a nomenclature with a hierarchical structure
reflecting semantic relations among the terms.
\end{itemize}

\subsection{Nomenclature example}

An example of a nomenclature structure is given here:
<<>>=
options(width=70)
require(ontoTools, quietly=TRUE)
library(KEGG.db)
#require(KEGG, quietly=TRUE)
<<>>=
# collect the KEGG pathway id -> pathway name map as a named list
KPL <- eapply(KEGGPATHID2NAME, function(x)x)
# terms are the pathway ids; definitions are the pathway names
GDI_KEGGPATH <- new("nomenclature", name="KEGGPATH",
  provenance=new("provStruct",
    URI="ftp://ftp.genome.ad.jp/pub/kegg/pathways/map_title.tab",
    captureDate="June 30 2004", comment="Rel 30.0"),
  inMappings=c("LL2KEGGmap.hsa", "LL2KEGGmap.rno"),
  terms=names(KPL), definitions=as.character(unlist(KPL)))
GDI_KEGGPATH
@

The information on ``GDI maps'' pertains to usage of the
nomenclature in other information structures. In general it will
be important to catalog not only the terms and vocabularies in
which they are used, but also the substantive data resources in
which these vocabulary resources are deployed. For example, in
cross-organism inference, it will be useful to be able to identify
which data resources use KEGG identifiers to characterize genomic
components or gene products. One can then iterate over only these
resources, searching for the presence of a given set of KEGG
identifiers.

\subsection{Ontology example}

An example of an ontology is given here.
We use the class \Rclass{taggedHierNomenclature} to emphasize
\begin{enumerate}
\item the existence of tags, which are short, typically semantically
opaque tokens used for abbreviated reference to terms of interest;
\item the existence of hierarchical semantic relationships among terms;
\item the extension of the formal nomenclature class.
\end{enumerate}

The data resource in this example is the NCI \textit{Thesaurus},
as opposed to the \textit{Metathesaurus}. The thesaurus is made
available to support ontological inference in ways that are not
straightforward for the metathesaurus at this time.
<<>>=
data(GDI_NCIThesaurus)
GDI_NCIThesaurus
@

There are helper functions that navigate the ontology; at present
a true graph is not employed.
<<>>=
# parents of the term "Mesna", then the children of those parents
mpar <- parents("Mesna", GDI_NCIThesaurus)
mpar
children( mpar, GDI_NCIThesaurus )
@

General regular expression matching can be used:
<<>>=
# show the first 70 characters of each match
substring(grep("HER-2", GDI_NCIThesaurus),1,70)
@

When definitions are present, they can be obtained:
<<>>=
getDefs("Mesna", GDI_NCIThesaurus)
@

\section{Building a new ontology}

A workflow for building a new ontology is not clearly established
at present, but the basic tasks appear to be
\begin{enumerate}
\item determine a set of concepts and associated terms;
\item examine existing ontologies for coverage of the terms of interest;
\item if the application can proceed by harvesting a single
pre-existing ontology, it may suffice to build a
\Rclass{taggedHierNomenclature} instance based on this ontology;
\item if the application requires a separate ontology or desired
concepts are not adequately covered,
\begin{enumerate}
\item construct a new OWL model for the ontology using Protege;
\item deserialize the OWL to an \Rclass{ontModel} using Rswub;
\item create a \Rclass{taggedHierNomenclature} instance on the basis
of lists derivable from the \Rclass{ontModel} instance.
\end{enumerate}
\end{enumerate}

We'll illustrate this process using some terms related to breast
cancer identified by Sridhar. We'll work backwards from a data
structure, \Robject{SGDIvocab}, currently in \Rpackage{ontoTools}.
Sridhar looked the terms up in the NCI Metathesaurus and determined
that they were covered in some sense, but he did not provide the
exact entry matching the intended concept. The terms are
<<>>=
data(SGDIvocab)
SGDIvocab@terms
@

The exact meanings of these terms are not completely clear. The use
of the \verb+_array+ suffix has no conventional interpretation that
I am aware of; likewise the \verb+_clinical+ suffix. We may need to
invent new terms and definitions to clarify the intended meaning of
these tokens. We proceed provisionally; the resulting tools may be
modified at any time in the future as usage patterns emerge.

\subsection{Protege-based management of terms and structure}

Figure \ref{protfig} shows the Protege ontology editor in use to
define the BCTerms class. There are 13 instances. Note, to the right
of the display, the variety of fields to be defined, including an
RDFS comment field and an \verb+NCI_Meta_tag+ field.
Figure \ref{prot2} shows the editor focused on the formal tags
provided by NCI Metathesaurus for terms semantically similar (by
informal matching) to the BCTerms of interest.

\begin{figure}
\setkeys{Gin}{width=1.0\textwidth}
\includegraphics{protlk}
\caption{View of the Protege ontology editor for SGDI vocabulary.
Focus on BCTerms.}
\label{protfig}
\end{figure}

\begin{figure}
\setkeys{Gin}{width=1.0\textwidth}
\includegraphics{prot2}
\caption{View of the Protege ontology editor for SGDI vocabulary.
Focus on NCI metatags.}
\label{prot2}
\end{figure}

\subsection{Importation of OWL model}

The Rswub package is still not ready for distribution, but I will
convert OWL to R structures as needed. The ontology model is
prescribed by the Jena system from HP.
\Rfunction{readSWModel} returns an anonymous Omegahat reference to a
Java class instance.
<<>>=
library(Rswub)
<<>>=
omod <- readSWModel("http://www.biostat.harvard.edu/~carey/SGDI.owl",
    asURL=TRUE)
<<>>=
omod <- readSWModel("/home/stvjc/Protege_3.0_beta/SGDI.owl")
omod@documentName <- "http://www.biostat.harvard.edu/~carey/SGDI.owl"
@
<<>>=
omod
@
\begin{Soutput}
Ontology model (instance of  com.hp.hpl.jena.ontology.impl.OntModelImpl )
 source: http://www.biostat.harvard.edu/~carey/SGDI.owl
There are 4 named classes.
Base namespace: http://www.owl-ontologies.com/unnamed.owl .
\end{Soutput}

The Java-based ontology model object here is completely independent
of the tagged hierarchical nomenclature structures, which are
implemented purely in R. For our purposes, the \Robject{omod} object
is just a bridge from OWL to R.

\Robject{omod} can be interrogated for the underlying RDF model.
This is just a set of triples (subject, predicate, object), and
\Rfunction{getSplits} returns a list of two elements, \Robject{bypred}
and \Robject{bysub}, which hold the triples grouped by predicate and
by subject respectively.
<<>>=
somod <- getSplits(omod)
names(somod)
@

We have defined \verb+NCI_Meta_tag+ as a property with domain BCTerms
and range \verb+NCI_Meta_Termset+.
<<>>=
somod$bysub$NCI_Meta_tag
@

The value of the property for each BCTerm instance is:
<<>>=
somod$bypred$NCI_Meta_tag
@

This table is the basis of the tagged nomenclature that we need in
the SGDI ontology. The ``subj'' values are the tokens we are
interested in. The ``obj'' values are the formal identifiers of the
NCI Metathesaurus entries that we want to map our tokens to. When
caCORE is working, we will use these formal identifiers to retrieve
detailed definitions, focused keyword matches, etc.

\subsection{The nomenclature}

<<>>=
data(SGDIvocab)
SGDIvocab
@

\section{Summary and future work}

We have defined a few classes representing nomenclatures and
hierarchical nomenclatures. Populating instances of these classes
with existing genomic information was illustrated with KEGG and
the NCI Thesaurus (not the Metathesaurus!).
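
The essential content of a tagged hierarchical nomenclature can be
mimicked in a few lines of base R. The following is only a sketch:
the terms, the tags, and the helpers \Rfunction{toyParents} and
\Rfunction{toyChildren} are invented for illustration and do not
reflect how \Rpackage{ontoTools} stores or traverses its classes.
<<>>=
## Hypothetical toy data: each term has an opaque tag and one parent.
toyTags <- c(animal="T001", mammal="T002", dog="T003", cat="T004")
## child -> parent lookup; the root term has no parent (NA)
toyParent <- c(animal=NA, mammal="animal", dog="mammal", cat="mammal")

## parent of a given term (hypothetical helper)
toyParents <- function(term) unname(toyParent[term])

## all terms whose parent is the given term (hypothetical helper)
toyChildren <- function(term)
    names(toyParent)[!is.na(toyParent) & toyParent == term]

toyParents("dog")
toyChildren("mammal")
## regular expression search over the term names
grep("^ma", names(toyTags), value=TRUE)
@
Parallel named vectors are adequate for a toy with one parent per
term; a vocabulary in which terms have several parents would need a
list or graph representation instead.
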
Search and navigation of these structures is supported, but
additional facilities will be required as the workflow clarifies.

When a specific collection of clinical and technical terms is
identified, we propose to formally manage the collection using
Protege to define an OWL/RDF model. The model can be deserialized to
Java and R structures using Rswub, which will be placed in
Bioconductor development in the near future.

The OWL/RDF model has a regimented graphical structure. It is a
collection of linked ordered triples with interpretation (subject,
predicate, object). When OWL objectProperty status is conferred on
an entity (property term), a domain and range can be defined for
that property. A benefit of formal specification of domain and range
is that misuse of the property can be programmatically forbidden. In
our example, the \verb+NCI_Meta_tag+ property maps from the BCTerms
set to an instance of the \verb+NCI_Meta_termset+. If we ask for the
\verb+NCI_Meta_tag+ property of any BCTerm instance, we are
guaranteed to receive an instance of \verb+NCI_Meta_termset+. An
instance of the \verb+NCI_Meta_termset+ is composed of a formal NCI
Metathesaurus alphanumeric tag and a brief verbal definition.
Ultimately the tag will be used in programmatic interrogation of the
metathesaurus for ancillary information such as detailed definitions,
links to defining and illustrative documents, and so forth.

Use of the OWL technology provides access to techniques for
vocabulary harmonization. Schemas may be defined that establish
equivalent properties with different names, or set-theoretically
defined combinations of formal classes, and models may be revised
using formal inference based on the schema. Logical inference within
the OWL model is also supported.

@
\end{document}