--- title: "ontoProc: RDF ontology processing for Bioconductor" author: "Vincent J. Carey, stvjc at channing.harvard.edu" date: "`r format(Sys.time(), '%B %d, %Y')`" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{ontoProc: RDF ontology processing} %\VignetteEncoding{UTF-8} output: BiocStyle::pdf_document: toc: yes number_sections: yes BiocStyle::html_document: highlight: pygments number_sections: yes theme: united toc: yes --- ```{r setupp,echo=FALSE,results="hide"} suppressWarnings({ suppressPackageStartupMessages({ library(ontoProc) library(BiocStyle) library(org.Mm.eg.db) library(org.Hs.eg.db) }) }) ``` # Introduction The `r Biocpkg("ontoProc")` package includes tools for - programming with ontology snapshots that are distributed with the package - annotating free text with ontology tags Our primary objective is facilitating use of ontological metadata to simplify construction of formally annotated hierarchies of samples or features that should be traversed in analysis of complex genomic experiments. The ontoProc package was developed to facilitate the coding of an ontology-driven visualizer of transcriptomic patterns in single-cell RNA-seq studies ([tenXplore](http://github.com/vjcitn/tenXplore)). ![dashsnap](dashboard.png) # Application to cell type hierarchy ## An enumeration of cell types We used the [Experimental Factor Ontology](https://www.ebi.ac.uk/efo/) 'cell type' class ([EFO_0000324](https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0000324)) to obtain an enumeration of cell types. As of August 22 2017 it is an open question whether [Cell Ontology](http://obofoundry.org/ontology/cl.html) or [Cellosaurus](https://web.expasy.org/cellosaurus/) should be used for this purpose. The author's subjective impression is that EFO has a simpler collection of terms for cell types, while Cell Ontology has a better collection of terms for types of neurons. ## Basic operations using ontologyIndex facilities This package ships with an R serialization of an OBO representation of the [Cell Ontology](http://obofoundry.org/ontology/cl.html). This is created using `get_OBO` in `r CRANpkg("ontologyIndex")`. (For ontologies only available in OWL format, the python pronto package was used to convert to OBO.) ```{r useCO} library(ontoProc) cellOnto = getCellOnto() cellOnto ``` At this time, elementary manipulations of the ontology involve collecting the children, siblings, or labels for given URIs. ```{r useCO2} cochil = children_TAG("CL:0000540", cellOnto) cochil label_TAG("CL:0000540", cellOnto) siblings_TAG("CL:0000540", cellOnto) ``` # Application: finding genes annotated to neuron subtypes We focus on mouse. The neuron subtypes identified as OWL subclasses of "neuron" have names ```{r getcl} cleanNames = function(tset) { slot(tset,"cleanFrame")$clean } cleanNames(cochil) ``` We would like to see if the expression data would allow us to discriminate neurons of these different types. ## Bridging from Cell Ontology to mouse genes There is no formal linkage at present between terms of Cell Ontology and those of Gene Ontology. Research on inference of tissue of origin from expression signatures has led to accurate classifiers (Lee, Krishnan, Troyanskaya) and applications in cell mixture deconvolution (Houseman). Formal work in ontology bridging has been described but the specific task of mapping from Cell Ontology terms to Gene Ontology terms has not culminated in any programmatically available resource. We apply approximate pattern matching (agrep in R) to find gene ontology terms that are apparently relevant to cell type vocabulary terms of interest. These are then mapped to gene annotation. Simple (non-vectorized) functions that accomplish this in an organism-specific are straightforward using the OrgDb packages. We serialized all GO terms for convenience with this package, in the data object `allGOterms`. ```{r lkfuns} data(allGOterms) cellTypeToGO("serotonergic neuron", gotab=allGOterms) cellTypeToGenes("serotonergic neuron", orgDb=org.Mm.eg.db, gotab=allGOterms) cellTypeToGenes("serotonergic neuron", orgDb=org.Hs.eg.db, gotab=allGOterms) ``` ## Discrimination of neuron types: exploratory multivariate analysis At this point the API for selecting cell types, bridging to gene sets, and acquiring expression data, is not well-modularized. Thus the best ways to get a feel for it are to use tenXplore() function, and to read the source code. In brief, we often fail to find GO terms that approximately match, as strings, Cell Ontology terms corresponding to cell subtypes. On the other hand, if we match on cell types, we get very large numbers of matches, which, at this time, will need to be filtered to get manageable feature sets. We will introduce tools for generating additional RDF to improve gene harvesting in real time. But the associated statements will need to be curated. The EBI Webulous system should be useful for introducing new terms that facilitate better connections between anatomic structures and sets of genes or other genomic features. # Annotation of free text The `humrna` data.frame supplied with the package is a small sample of metadata from NCBI Sequence Read Archive (SRA). The `study title` field has been serialized as `minicorpus`. ```{r lkmc} data(minicorpus) head(minicorpus) ``` There is a convention in text analysis of identifying _stop words_ that are unlikely to be very useful for interpretation. The `dropStop` function tokenizes the study titles and eliminates stop words. ```{r lksto} dropStop(head(minicorpus)) ``` My hope is that EMBL BioSolr will help index strings of this type with formal ontology terms. However, as a step in the general direction, we have the following examples. ```{r lk1} library(ontoProc) cs = getCellosaurusOnto() ch = getChebiOnto() minicorpus[1] grep("P493", cs$name, value=TRUE, ignore.case=TRUE) grep("doxycyline", ch$name, value=TRUE, ignore.case=TRUE) ``` Based on PMID 10956386, P493-6 is an EBV-EBNA1 positive cell line, but that is not revealed in our image of the ontology. Will available tools help us to automate the systematic mapping of study concepts? Or will manual curation be necessary?