% %\VignetteIndexEntry{ontoTools Primer} %\VignetteDepends{ontoTools, org.Hs.eg.db, graph, GO.db} %\VignetteKeywords{ontology} %\VignettePackage{ontoTools} % % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % \documentclass[12pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref,epsfig} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \bibliographystyle{plainnat} \begin{document} \title{{\it ontoTools}: software for working with ontologies in Bioconductor} \author{VJ Carey {\tt stvjc@channing.harvard.edu}} \maketitle \tableofcontents \section{Introduction} An ontology is a vocabulary arranged as a rooted directed acyclic graph (rDAG, or just DAG). This package provides tools for working with such data structures. A key concern is the interpretation of sets of objects mapped to an ontology. %This package is in development in the absence %of a truly mature graph data structure system in R. %We use the {\it graph} package of Gentleman and Whalen, %and provide some ad hoc extensions such as a {\tt compoundGraph} %class that can be much more straightforwardly implemented. This package relies heavily on the \Rpackage{graph} package from Bioconductor, and the {\it SparseM} package from UIUC, available from CRAN. %That package is also %not fully mature, and in particular seems to have some %bugs (in version 0.24) when used with Windows R. \section{Basic structures} \subsection{Rooted DAG} The {\tt rootedDAG} class is built using a {\tt graph::graphNEL} and specifying a root. This package comes with {\tt litOnto}, which is a {\tt graphNEL}. <>= library(graph) library(ontoTools) #library(Rgraphviz) data(litOnto) print(litOnto) print(class(litOnto)) @ Children point to parents by default. When Rgraphviz is available, the graphical structure can be plotted using {\tt plot}. %# should be fig=TRUE %#plot(litOnto) @ %\epsfig{file=demo1.ps,width=3.9in} It is possible to reverse the directions of arcs in such a graph. The {\tt rootedDAG} object is constructed by hand. <>= g1 <- new("rootedDAG", DAG=litOnto, root="A") show(DAG(g1)) root(g1) @ \subsection{ {\tt compoundGraph}} This structure was devised to allow presentations like the following: %%%% ajr: BAD VINCE! \epsfig{file=demoComp.ps,width=4.5in} \includegraphics[width=4.5in]{demoComp} The R code that generated the dot code from which this image was created, and the dot language code for the image, are as follows: <>= data(litObj) com <- new("compoundGraph", grList=list(litOnto,litObj), between= list(c("W","E"), c("X", "K"), c("Y","B"), c("Z","D"), c("Z","G"))) compRendList<-list( list( prenodes="node [fontsize=28 color=orange fontcolor=orange];", preedges="edge [color=black];"), list( prenodes="node [fontsize=28 color=green fontcolor=green];", preedges="edge [color=black];"), betweenRend = list( preedges = "edge [color = red]")) ff <- "demoComp.dot" toDot(com, ff, compRendList) cat(readLines(ff),sep="\n") @ \subsection{{ \tt ontology}} This is just a lightly annotated {\tt rootedDAG}. <>= g1 <- new("rootedDAG", DAG=litOnto, root="A") o1 <- new("ontology", name="demo", version="0.1", rDAG=g1) show(o1) @ \subsection{ Mapped key-value list} At present this is not a class. @ <>= kvlist <- list(W="E", X="K", Y="B", Z=c("D","G")) litMap <- otkvList2namedSparse( names(kvlist), LETTERS[1:12], kvlist ) print(litMap) @ \subsection{ OOC: object-ontology complex } This combines an ontology and a mapped key-value list that specifies how objects map to terms. <>= ooc1 <- makeOOC( o1, litMap ) show(ooc1) #coverageMat(ooc1) @ \section{Basic methods} \subsection{Matrix conversion of DAG; namedSparse extension} This should be a generic. There are two basic sorts of matrices available: {\tt child2parent} which has 1 in element (i,j) if node i is the child of node j, and an accessibility matrix that has 1 in element (i,j) if node i may be reached from node j. \subsubsection{child2parent} <>= g1 <- new("rootedDAG", DAG=litOnto, root="A") mg1d <- getMatrix(g1, "child2parent", "dense") print(mg1d) @ Typically a sparse representation will be needed. <>= ng1 <- getMatrix(g1, "child2parent", "sparse") ng1 <- new("namedSparse", mat=ng1) dimnames(ng1) <- list(as.character(1:dim(ng1@mat)[1]), as.character(1:dim(ng1@mat)[2])) print(ng1@mat) print(as.matrix(ng1)) @ We have not implemented a full set of element subsetting operations as we expect the SparseM authors to do so. However, efficient operations using names of rows and columns are essential. These are implemented using hashed environments. <>= dimnames(ng1) <- list(letters[1:12], LETTERS[1:12]) print(class(ng1)) print(getSlots("namedSparse")) print(as.matrix(ng1)) @ \subsubsection{Accessibility} Currently this is only handled for an ontology. <>= show(accessMat(o1)) @ \subsection{Coverage matrix for an OOC} This matrix has element 1 in row r, column c if object corresponding to row r is covered by the term corresponding to column c. This means that the term for column c is accessible from some term to which the object for row r was explicitly mapped. <>= print(coverageMat(ooc1)) @ \subsection{Depth of terms in an ontology} We need to be able to get the set of terms at a certain depth (root is depth 0) and the depth of any given term. <>= print(ontoDepth(g1)) ds1 <- depthStruct(g1) print(ds1$tag2depth("B")) print(ds1$depth2tags(3)) @ \section{Applications to large data objects} \subsection{Key-value list for LocusLink to GO} The {\tt org.Hs.eg.db} metadata package from Bioconductor includes org.Hs.egGO, to regarded as a key-value environment mapping. The otkvEnv2namedSparse function can be used with this environment and the GOMF terms to create the object to term mapping. See ooMapLL2GOMFdemo.rda. @ \subsection{GO graph and ontology for molecular function} @ The buildGOGraph function was used to create goMFgraph.1.15.rda on the basis of the GO package. With this graph in hand, we construct the rooted DAG <>= data(goMFgraph.1.15) gomfrDAG <- new("rootedDAG", root="GO:0003674", DAG=goMFgraph.1.15) @ and then the ontology: <>= GOMFonto <- new("ontology", name="GOMF", version="bioc 1.3.1", rDAG=gomfrDAG) @ The term accessibility matrix is important for semantic calculations. This is a slow calculation and its result is archived. \begin{verbatim} gomfAmat <- accessMat(GOMFonto) save(gomfAmat,file="gomfAmat.rda", compress=TRUE) \end{verbatim} @ \subsection{Mapping from LocusLink to GO} The LocusLink database is a collection of one-to-many mappings from genes or gene-associated sequence to GO tags. This object-to-term (`ot') association is provided as an environment-like resource by bioconductor in the org.Hs.egGO package. We will use a named sparse matrix structure to encode the object to term mapping. For this example, the mapping is archived after computation by the following code: \begin{verbatim} library(org.Hs.eg.db) data(goMFgraph1.15) obs <- ls(org.Hs.egGO) tms <- nodes(goMFgraph.1.15) ooMapLL2GOMFdemo <- otkvEnv2namedSparse( obs, tms, org.Hs.egGO ) save(ooMapLL2GOMFdemo,file="ooMapLL2GOMFdemo.rda") \end{verbatim} \section{Applications} \subsection{Lord's {\it Bioinformatics} 2003 paper} Lord's paper on semantic similarity measures defines a probability $p(c)$ for each concept term $c$. Semantic similarity between terms $c_1$ and $c_2$ can be measured as $-\log p_{ms}(c_1, c_2)$, where $p_{ms}$ is the so-called ``probability of the minimum subsumer'', defined as \[ p_{ms}(c_1, c_2) = \min\limits_{c \in S(c_1,c_2)} \{p(c)\}, \] where $S(c_1, c_2)$ is the set of all terms that are refined by both $c_1$ and $c_2$. We indicate how these computations can be performed using test data in {\it ontoTools}. Most of the functions defined here as x.vig are included in the package (sometimes with slight modifications), lacking the vig suffix. \subsubsection{Setup of the data} We begin by attaching the package and convering the {\tt litOnto} information (encoding the 12-term ontology) into a {\tt rootedDAG}. <>= library(ontoTools) data(litOnto) g1 <- new("rootedDAG", DAG=litOnto, root="A") @ Now we create an ontology object based on this rooted DAG: <>= o1 <- new("ontology", name="demo", version="0.1", rDAG=g1) @ The mapping from objects to the concepts in the ontology uses a key-value list, with multimappings represented by vectors. <>= kvlist <- list(W="E", X="K", Y="B", Z=c("D", "G")) @ A special function compute the object to term map: <>= demomap <- otkvList2namedSparse( names(kvlist), LETTERS[1:12], kvlist ) @ We obtain the object-ontology complex. <>= demoooc <- makeOOC( o1, demomap ) @ The map from objects to terms is: <>= print(OOmap(demoooc)) @ The coverage matrix can be directly computed: <>= cov1 <- coverageMat(demoooc) print(cov1) @ \subsubsection{Concept probabilities} We begin with the concept probabilities. The counts of usages per term is <>= #print(pc <- colSums(coverageMat(demoooc))) ACC <- accessMat(ontology(demoooc)) acctms <- dimnames(ACC)[[2]] print(acctms) MAP <- OOmap(demoooc) nterms <- function(x) length(nodes(DAG(rDAG(x)))) ontoTerms <- function(x) nodes(DAG(rDAG(x))) print(ontoTerms) usageCount.vig <- function (MAP, ACC) { maptms <- dimnames(MAP)[[2]] acctms <- dimnames(ACC)[[2]] usages <- rep(0,length(maptms)) names(usages) <- maptms for (i in 1:nrow(MAP)) { hits <- maptms[as.matrix.ok(MAP[i, ]) == 1] usages[hits] <- usages[hits] + 1 for (j in 1:length(hits)) { anctags <- acctms[as.matrix.ok(ACC[hits[j], ]) == 1] usages[anctags] <- usages[anctags] + 1 } } usages } print(usages <- usageCount.vig(MAP,ACC)) @ because whenever a term is used, all its ancestors are implicitly used as well (Lord section 2). This inherited usage does not recurse, however. A basic usage event fires up the graph once; the fact that a descendent is used transitively does not increment the usage of its ancestors. The total number of basic usage events is <>= print(N <- max(usages)) @ The vector of concept probabilities is <>= print(usages/N) @ So we have <>= conceptProbs.vig <- function(ooc) { if (is(occ, "OOC")) stop("arg must have class OOC") pc <- usageCount.vig(OOmap(ooc), accessMat(ontology(ooc))) pc/max(pc, na.rm=TRUE) } @ \subsubsection{Identifying the most informative subsumer} This task involves the ontology (to identify subsumers) and then the concept probabilities. <>= subsumers.vig <- function(c1, c2, ont) { if (! is(ont, "ontology")) stop("ont must have class ontology") tmp <- colSums(accessMat(ont)[c(c1,c2),]) names(tmp[tmp==2]) } print(subsumers.vig("I", "K", ontology(demoooc))) @ Thus we can use <>= pms.vig <- function(c1, c2, ooc) { if (! is(ooc, "OOC")) stop("arg must have class OOC") if (any(!(c(c1,c2) %in% nodes(DAG(rDAG(ontology(ooc))))))) stop("some term not found in ontology DAG nodes") S <- subsumers.vig(c1,c2,ontology(ooc)) pc <- conceptProbs.vig(ooc) min(pc[S]) } print(pms("I", "K", demoooc)) @ Now we can compute semantic similarity relative to an OOC: <>= semsim.vig <- function(c1, c2, ooc) -log(pms.vig(c1,c2,ooc)) @ \subsection{Iyer's time-course experiment} \subsubsection{Setup} We attach the Iyer517 library and the IyerAnnotated data structure. IyerAnnotated is a data.frame that gives the mapping from probes in Iyer517 to up to 5 GO MF tags. <>= library(ontoTools) library(Iyer517) data(IyerAnnotated) print(IyerAnnotated[1:3,]) @ We also attach the concept probabilities for GO MF, relative to the LocusLink usages. <>= data(LL2GOMFcp.1.15) print(LL2GOMFcp.1.15[1:3]) @ We also define a function that extracts the most informative of a set of tags: <>= getMostInf <- function (tags,pc=LL2GOMFcp.1.15) { if (length(tags) == 1) return(tags) if (all(is.na(tags))) return(NA) tags[pc[tags] == min(pc[tags])][1] } @ and apply this to the probe-specific tag sets: <>= GOmost <- apply(IyerAnnotated[, 5:9], 1, function(x) getMostInf(as.character(x[!is.na(x)]))) @ We also need an OOC for GOMF relative to LocusLink: <>= data(goMFgraph.1.15) data(LL2GOMFooMap.1.15) data(goMFamat.1.15) # # build the rooted DAG, the ontology, and the OOC objects # gomfrDAG <- new("rootedDAG", root="GO:0003674", DAG=goMFgraph.1.15) GOMFonto <- new("ontology", name="GOMF", version="bioc 1.3.1", rDAG=gomfrDAG) LLGOMFOOC <- makeOOC(GOMFonto, LL2GOMFooMap.1.15) # @ \subsubsection{Filtering the IyerAnnotated tags} The Iyer data have been clustered. <>= print(table(IyerAnnotated$Iclust)) @ We will work with the four largest clusters, extracting the most informative tag for each cluster. <>= getMostInfTags <- function(let) { grp.GO <- GOmost[ IyerAnnotated$Iclust==let ] grp.GO[!is.na(grp.GO)] } A.GO <- getMostInfTags("A") B.GO <- getMostInfTags("B") D.GO <- getMostInfTags("D") H.GO <- getMostInfTags("H") @ \subsubsection{Computing pairwise semantic similarity} The next function computes the pairwise semantic similarities in a group of tags. However, to save time, the loop is truncated at 30 pairs. <>= getSim <- function(x,Nchk=30) { library(combinat) prs <- combn(x,2) sim <- rep(NA,ncol(prs)) for (i in 1:ncol(prs)) { if (i > Nchk) break cat(i) if (any(!(c(prs[1,i], prs[2,i]) %in% goMFamat.1.15@Dimnames[[1]]))) sim[i] = NA else sim[i] <- semsim( prs[1,i], prs[2,i], acc=goMFamat.1.15, pc=LL2GOMFcp.1.15, ooc=LLGOMFOOC ) } sim[1:Nchk] } @ We compute the pairwise similarities for the 4 groups. <>= Asim <- getSim(A.GO) save(Asim,file="Asim.rda") Bsim <- getSim(B.GO) save(Bsim,file="Bsim.rda") Dsim <- getSim(D.GO) save(Dsim,file="Dsim.rda") Hsim <- getSim(H.GO) save(Hsim,file="Hsim.rda") @ Display the similarity statistics: <>= boxplot(Asim,Bsim,Dsim,Hsim) @ \end{document}