---
title: "ontoProc: RDF ontology processing for Bioconductor"
author: "Vincent J. Carey, stvjc at channing.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{ontoProc: RDF ontology processing}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::pdf_document:
    toc: yes
    number_sections: yes
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
---

```{r setupp,echo=FALSE,results="hide"}
suppressWarnings({
suppressPackageStartupMessages({
library(ontoProc)
library(BiocStyle)
library(org.Mm.eg.db)
library(org.Hs.eg.db)
})
})
```

# Introduction

The `r Biocpkg("ontoProc")` package
includes tools for

- programming with ontology snapshots that are distributed with the package
- annotating free text with ontology tags

Our primary objective is facilitating use of ontological
metadata to simplify construction of formally
annotated hierarchies of samples or
features that should be traversed in analysis of complex
genomic experiments.

The ontoProc package was developed to facilitate
the coding of an ontology-driven visualizer of transcriptomic
patterns in single-cell RNA-seq studies ([tenXplore](http://github.com/vjcitn/tenXplore)).

![dashsnap](dashboard.png)

# Application to cell type hierarchy

## An enumeration of cell types

We used the [Experimental Factor Ontology](https://www.ebi.ac.uk/efo/) 'cell type' class ([EFO_0000324](https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0000324)) to obtain
an enumeration of cell types.  As of August 22, 2017, it is an open question
whether [Cell Ontology](http://obofoundry.org/ontology/cl.html) or
[Cellosaurus](https://web.expasy.org/cellosaurus/) should be used for this purpose.  The author's
subjective impression is that EFO has a simpler collection of terms for cell types,
while Cell Ontology has a better collection of terms for types of neurons.

## Basic operations using ontologyIndex facilities

This package ships with an R serialization
of an OBO representation of the [Cell Ontology](http://obofoundry.org/ontology/cl.html).
It was created using `get_OBO` in `r CRANpkg("ontologyIndex")`.
(For ontologies available only in OWL format, the Python `pronto`
package was used to convert to OBO.)
```{r useCO}
library(ontoProc)
cellOnto = getCellOnto()
cellOnto
```
At this time, elementary manipulations of
the ontology involve collecting the children, siblings, or labels for given
term identifiers (tags such as `CL:0000540`).
```{r useCO2}
cochil = children_TAG("CL:0000540", cellOnto) 
cochil
label_TAG("CL:0000540", cellOnto)
siblings_TAG("CL:0000540", cellOnto) 
```
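To make the three relations concrete, here is a minimal base R sketch on a toy ontology stored as a list mapping each term to its parents. The `TOY:` identifiers and the helpers `toyChildren` and `toySiblings` are invented for illustration; they are not part of ontoProc or ontologyIndex, and the packaged `*_TAG` functions may be implemented differently.

```{r toyRelations}
# Toy ontology: each term maps to its parent terms (IDs are invented).
parents = list(
  "TOY:0" = character(0),
  "TOY:1" = "TOY:0",
  "TOY:2" = "TOY:0",
  "TOY:3" = "TOY:1",
  "TOY:4" = "TOY:1"
)
# Labels for the toy terms, keyed by ID.
labels = c("TOY:0" = "cell", "TOY:1" = "neuron", "TOY:2" = "glial cell",
           "TOY:3" = "motor neuron", "TOY:4" = "sensory neuron")

# Children: terms that list `id` among their parents.
toyChildren = function(id, parents) {
  names(parents)[vapply(parents, function(p) id %in% p, logical(1))]
}

# Siblings: other children of the parents of `id`.
toySiblings = function(id, parents) {
  sibs = unlist(lapply(parents[[id]], toyChildren, parents = parents))
  setdiff(unique(sibs), id)
}

toyChildren("TOY:1", parents)
labels[toyChildren("TOY:1", parents)]
toySiblings("TOY:1", parents)
```

Storing only the parent relation and deriving children and siblings from it keeps the toy structure small; a real ontology index would typically precompute these relations.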

# Application: finding genes annotated to neuron subtypes

We focus on mouse.  The neuron subtypes identified as
OWL subclasses of "neuron" have the following names:
```{r getcl}
cleanNames = function(tset) {
 slot(tset,"cleanFrame")$clean
}
cleanNames(cochil)
```

We would like to see whether expression data allow us to discriminate
neurons of these different types.

## Bridging from Cell Ontology to mouse genes

There is no formal linkage at present between terms
of Cell Ontology and those of Gene Ontology.  Research
on inference of tissue of origin from expression 
signatures has led to accurate classifiers (Lee, Krishnan, Troyanskaya) and
applications in cell mixture deconvolution (Houseman).  
Formal work in ontology bridging has been described but the
specific task of mapping from Cell Ontology terms
to Gene Ontology terms has not culminated in any
programmatically available resource.

We apply approximate pattern matching (`agrep` in R) to
find Gene Ontology terms that are apparently relevant to
cell type vocabulary terms of interest.  These are then
mapped to gene annotation.  Simple (non-vectorized)
functions that accomplish this in an organism-specific
manner are straightforward to write using the OrgDb packages.  We
serialized all GO terms for convenience with this package,
in the data object `allGOterms`.

```{r lkfuns}
data(allGOterms)
cellTypeToGO("serotonergic neuron", gotab=allGOterms)
cellTypeToGenes("serotonergic neuron", orgDb=org.Mm.eg.db, gotab=allGOterms)
cellTypeToGenes("serotonergic neuron", orgDb=org.Hs.eg.db, gotab=allGOterms)
```
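The approximate matching step can be illustrated in isolation with base R. The sketch below assumes a term table with a `TERM` column; the table `toyGO`, its `TOY:` identifiers, its term strings, and the helper `toyCellTypeToTerms` are all invented for illustration, and the packaged `cellTypeToGO` may use different column names or `agrep` settings.

```{r toyAgrep}
# Toy stand-in for a GO term table; IDs and term strings are invented.
toyGO = data.frame(
  GOID = c("TOY:0001", "TOY:0002", "TOY:0003"),
  TERM = c("serotonergic neuron differentiation",
           "dopaminergic neuron differentiation",
           "serotonin transport"),
  stringsAsFactors = FALSE
)

# Keep the rows whose term approximately contains the cell type string.
# max.distance is passed to agrep as a fraction of the pattern length.
toyCellTypeToTerms = function(celltype, gotab, max.distance = 0.1) {
  hits = agrep(celltype, gotab$TERM, max.distance = max.distance,
               ignore.case = TRUE)
  gotab[hits, , drop = FALSE]
}

toyCellTypeToTerms("serotonergic neuron", toyGO)
```

Raising `max.distance` admits looser matches and therefore more candidate terms, which is the trade-off between missed subtypes and unmanageably large gene sets discussed in the next section.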

## Discrimination of neuron types: exploratory multivariate analysis

At this point the API for selecting cell types, bridging to gene
sets, and acquiring expression data is not well modularized.  Thus
the best ways to get a feel for it are to use the `tenXplore()` function
and to read the source code.  In brief, we often fail to find
GO terms that approximately match, as strings, Cell Ontology
terms corresponding to cell subtypes.  On the other hand, if we
match on cell types, we get very large numbers of matches, which,
at this time,
will need to be filtered to get manageable feature sets.  We
will introduce tools for generating
additional RDF to improve gene harvesting in real time, but the
associated statements will need to be curated.  The EBI Webulous
system should be useful for introducing new terms that
facilitate better connections between anatomic structures and
sets of genes or other genomic features.

# Annotation of free text

The `humrna` data.frame supplied with the package is a small
sample of metadata from NCBI Sequence Read Archive (SRA).  The
`study title` field has been serialized as `minicorpus`.

```{r lkmc}
data(minicorpus)
head(minicorpus)
```

There is a convention in text analysis of identifying _stop words_ that
are unlikely to be very useful for interpretation.  The `dropStop`
function tokenizes the study titles and eliminates stop words.

```{r lksto}
dropStop(head(minicorpus))
```
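The idea behind stop word removal can be sketched in a few lines of base R. The stop word list, the helper `toyDropStop`, and the example title below are invented for illustration; `dropStop` may tokenize differently and use a much larger stop word list.

```{r toyStop}
# Hypothetical, deliberately tiny stop word list.
toyStopWords = c("a", "an", "and", "by", "in", "of", "the", "to")

toyDropStop = function(titles, stopWords = toyStopWords) {
  # Lower-case, then split on runs of non-alphanumeric characters.
  tokens = strsplit(tolower(titles), "[^[:alnum:]]+")
  # Drop empty tokens and stop words, one character vector per title.
  lapply(tokens, function(tk) tk[nzchar(tk) & !(tk %in% stopWords)])
}

toyDropStop("Regulation of the cell cycle by MYC in B cells")
```

Note that splitting on every non-alphanumeric character would break a hyphenated identifier such as a cell line name into separate tokens, so a tokenizer intended for ontology tagging needs more care than this sketch takes.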

My hope is that EMBL BioSolr will help index strings of this
type with formal ontology terms.  However, as a step in the
general direction, we have the following examples.

```{r lk1}
library(ontoProc)
cs = getCellosaurusOnto()
ch = getChebiOnto()
minicorpus[1]
grep("P493", cs$name, value=TRUE, ignore.case=TRUE)
grep("doxycycline", ch$name, value=TRUE, ignore.case=TRUE)
```
Based on PMID 10956386, P493-6 is an EBV-EBNA1 positive
cell line, but that is not revealed in our image of
the ontology.  Will available tools help us to automate the
systematic mapping of study concepts?  Or will manual curation
be necessary?