---
title: "ontology_DAG: a class for ontology data"
author: "Zuguang Gu ( z.gu@dkfz.de )"
date: '`r Sys.Date()`'
output:
html_vignette:
css: main.css
toc: true
vignette: >
%\VignetteIndexEntry{01. ontology_DAG: a class for ontology data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, echo = FALSE, message = FALSE}
library(knitr)
knitr::opts_chunk$set(
error = FALSE,
tidy = FALSE,
message = FALSE,
warning = FALSE,
fig.width = 6, fig.height = 6,
fig.align = "center")
```
Ontologies are represented in a form of directed acyclic diagram (DAG). A DAG
is a generalized form of tree where a parent term can have multiple child
terms, and also a child term can have multiple parent terms. The DAG is
directed and a link connects from a child term to a parent term, representing
"the child is a sub-class of the parent." (left panel). In some other
cases, the direction can also be reversed to represent "a parent includes the
child" (right panel).
In this vignette, I will introduce the `ontology_DAG` class as well as related
functions with a tiny example. As shown in the following diagram, there are
six terms in the DAG where term "a" is the root, term "e" and "f" are two leaf
terms. Note term "c" has two parents.
## Construct the object
The DAG object is constructed via a list of parent-child pairs. The following
code constructs a DAG in the diagram above.
```{r}
library(simona)
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)
```
Typing `dag` prints the basic information of the DAG.
```{r}
dag
```
Aspect ratio is calculated as `width/height`, where `width` is the largest
number of terms on a specific depth (i.e. `max(table(depth(dag)))`).
Definition of the height of a term in the DAG depends on whether using the
longest or the shortest distance from root. The aspect ratio gives an
impression of how the shape of the DAG looks like (fat or slim).
The `ontology_DAG` object can also be constructed by a vector of parent-child links:
```r
dag = create_ontology_DAG(c("a-b", "a-c", "b-c", "b-d", "c-e", "d-f"))
```
Following functions return the root term, leaf terms and test whether
terms are leaves.
```{r}
dag_root(dag)
dag_leaves(dag)
dag_is_leaf(dag, letters[1:6])
```
`dag_all_terms()` returns a vector of all terms. `dag_n_terms()` simply returns
the number of all terms in the DAG.
```{r}
dag_all_terms(dag)
dag_n_terms(dag)
```
## DAG traverse
Numbers of child/parent/offspring/ancestor terms.
```{r}
n_children(dag)
n_parents(dag)
n_offspring(dag)
n_ancestors(dag)
```
`n_connected_leaves()` returns numbers of
leaves that every term can reach (or has a finite directed distance to). Leaf
terms have value of 1.
```{r}
n_connected_leaves(dag)
```
Parents/children of a single term or union of parents/children of a group
of terms. The term argument can be a single term name or a vector of term
names.
```{r}
dag_parents(dag, "c")
dag_parents(dag, c("d", "e")) # union of d's parents and e's parents
dag_children(dag, "b")
```
It is similar to get offspring/ancestor terms of a single term or a group of
terms.
```{r}
dag_offspring(dag, "b")
dag_ancestors(dag, "e")
```
In many methods, when constructing the set of a term's ancestors of offsprings,
the set itself is also included. In `dag_offspring()` and `dag_ancestors()`,
an argument `include_self` can be set to `TRUE` to include the term itself.
```{r}
dag_offspring(dag, "b", include_self = TRUE)
dag_ancestors(dag, "e", include_self = TRUE)
```
## Reorder
DAG can be reordered by adjusting the order of child terms under every term.
There are two ways to reorder the DAG. 1. Reorder the DAG by a vector of
numeric values that correspond to every term in the DAG; 2. Reorder by
a vector of numeric values that correspond to every leaf term in the DAG.
The two ways are very similar. If the vector corresponding to the complete set
of terms is provided, each term in the DAG is associated to a score which is
the mean value of all its offspring terms (including itself) in the DAG; while
if the a vector of leaf-level value is provided, each term in the DAG is
associated to a score which is the mean of all its connectable leaf terms in
the DAG.
Taking `dag` as an example, term `c` is on the left of term `d`
```{r}
dag_children(dag, "b")
```
First, a numeric vector for all terms in the DAG is provides to reorder `dag`.
Note that the order of `value` should correspond to terms in `dag_all_terms(dag)`.
```{r}
dag2 = dag_reorder(dag, value = c(1, 1, 10, 1, 10, 1))
dag_children(dag2, "b")
```
Or a numeric vector for all leaf terms.
Note that the order of `value` here should correspond to terms in `dag_leaves(dag)`.
```{r}
dag3 = dag_reorder(dag, value = c(10, 1))
dag_children(dag3, "b")
```
`dag_permutate_children()` randomly permutes child terms under every term.
```{r}
dag2 = create_ontology_DAG(c("a-b", "a-c", "a-d", "a-e", "a-f"))
dag3 = dag_permutate_children(dag2)
dag_children(dag3, "a")
dag3 = dag_permutate_children(dag2)
dag_children(dag3, "a")
```
## Distance
The depth of a term is defined as the longest distance from root and is
calculated by `dag_depth()`. If you want the shortest distance from root,
use `dag_shortest_dist_from_root()`.
```{r}
dag_depth(dag)
dag_shortest_dist_from_root(dag)
```
Similarly, the "height" of a term is defined as the longest distance to its
reachable leaf terms. `dag_shortest_dist_to_leaves()` returns the shortest
distance to leaves.
```{r}
dag_height(dag)
dag_shortest_dist_to_leaves(dag)
```
These four functions introduced so far calculate distance from root or to
leaves for all terms in the DAG. The following functions are more general
which calculates distance from a self-defined group of terms or until a group
of terms.
`dag_depth()` is identical to `dag_longest_dist_to_offspring(dag, root)`, and
`dag_height()` is identical to `dag_longest_dist_from_ancestors(dag, leaves)`
where `root` is a single root term and `leaves` is a vector of all leaf terms
in the DAG.
```
dag_longest_dist_to_offspring(dag, from)
dag_shortest_dist_to_offspring(dag, from)
dag_longest_dist_from_ancestors(dag, to)
dag_shortest_dist_from_ancestors(dag, to)
```
Please note, these four functions are applied to all terms. They return a
vector with the same length as the total terms in the DAG. If a term is not
included in the calculation, e.g. the root term when `from` does not include
it, `dag_longest_dist_to_offspring()` and `dag_shortest_dist_to_offspring()`
will assign -1 to the root term.
Given any two terms in the DAG, the following four functions calculate their
pair-wise distance. There are different ways to calculate the distance:
For two terms $a$ and $b$, `shortest_distances_via_NCA()` calculates the
distance as:
$$ \min_{t \in \mathrm{CA}(a, b)}(D_{sp}(t, a) + D_{sp}(t, b)) $$
where $\mathrm{CA}(a, b)$ is the set of common ancestors (CA) of $a$ and $b$.
$D_{sp}(x, y)$ is the shortest distance between $x$ and $y$. In this way,
common ancestor $t$ which returns the minimal distance between $a$ and $b$ is
called the "nearest common ancestor (NCA)" of $a$ and $b$. The usage of the function is
as follows. It returns a symmetric matrix of pair-wise distance of `terms`.
```
shortest_distances_via_NCA(dag, terms)
```
`longest_distances_via_LCA()` calculates the distance as:
$$ \mathrm{len}(t, a) + \mathrm{len}(t, b) $$
where $\mathrm{len}(x, y)$ is the longest distance between $x$ and $y$. Common ancestor
$t$ is the one with the largest depth in DAG and it is called the lowest
common ancestor (LCA) of $a$ and $b$:
$$ t = \operatorname*{argmax}_{t \in \mathrm{CA}(a, b)} \delta(t)$$
where $\delta(t)$ is the depth (maximal distance from root) of term $t$. The usage of the function is:
```
longest_distances_via_LCA(dag, terms)
```
The next two functions treat the DAG relations as directional. There is a
positive distance value only if a term is an ancestor of the other, or else
the distance is set to -1 in the returned distance matrix.
```
shortest_distances_directed(dag, terms)
longest_distances_directed(dag, terms)
```
## Convert to other formats
### To an igraph object
DAG is a graph. `dag_as_igraph()` converts the DAG to an `igraph` object.
```{r}
g = dag_as_igraph(dag)
g
```
Draw the graph with a hierarchical graph layout:
```{r}
library(igraph)
plot(g, layout = layout.sugiyama)
```
### To a dendrogram object
In a DAG, a term can have multiple parents. A tree is a reduced form of a DAG
where a term only has one parent. `dag_treelize()` simplifies a DAG to a tree.
The reducing is applied in a breadth-first manner:
Starting from root and on a certain depth, for every term $a$ on this depth,
its child term $c$ and parent-child relation are kept only when $\delta_c =
\delta_a + 1$. If $c$ is selected, it is marked as visited and will not be
checked again. In this way, depths of all terms in the orignal DAG are still
identical to the depths in the tree.
```{r}
tree = dag_treelize(dag)
dag_depth(dag)
dag_depth(tree)
```
Note, although we are talking about "a tree", `tree` in the example is still
an `ontology_DAG` object where only multi-parent relations are removed.
If `n_relations + 1 == n_terms`, there is special mark of being a tree in the object.
```{r}
tree
```
When the DAG is a tree, it can be converted to a `dendrogram` object.
Since some nodes may only have one child, I add labels on nodes to mark the nodes are there. You can see the link `a->c` is removed from
the DAG.
```{r}
dend = dag_as_dendrogram(tree)
dend = dendrapply(dend, function(d) {
attr(d, "nodePar") = list(pch = attr(d, "label"))
d
})
plot(dend, leaflab = "none")
```
Note reducing DAG is mainly for visualization purpose.
## Add relation types
If `relations` argument is not set, all relations are treated as in "is_a"
such that a child is a sub-class of a parent class. It is common in
ontologies, besides the "is_a" relation, there are many other self-defined
relation types. One typical relation type is "part_of". A vector of relation
types can be set via the `relations` argument. It might be used later when
calculating semantic similarities where different relation types are assigned
with different weights.
```{r}
relations = c("is_a", "is_a", "part_of", "part_of", "is_a", "is_a")
dag = create_ontology_DAG(parents, children, relations = relations)
dag
```
## Add annotations
Terms can have external items annotated. One typical example is that GO terms can
have genes annotated. Later the annotated items can be used for calculating semantic
similarities.
The annotation should be set as a list of character vectors, where names of
the element vectors should correspond to term names in the DAG so that
annotations can be mapped to terms.
```{r}
annotation = list(
"a" = c("t1", "t2", "t3"),
"b" = c("t3", "t4"),
"c" = c("t5"),
"d" = c("t7"),
"e" = c("t4", "t5", "t6", "t7"),
"f" = c("t8")
)
dag = create_ontology_DAG(parents, children, annotation = annotation)
dag
```
Due to the nature of DAG, if a child term is annotated to an item, all its
ancestor terms are also associated with that item. The calculation of annotated items
is applied in a recursive way.
For a term $x$, denote $\mathcal{C}_x$ is the set of its child terms, the
items "directly" annotated to $x$ denoted as set $G^*_x$, then the final
set of items annotated to $x$ denoted as $G_x$ is the union of all items annotated
to its child terms as well as $x$ itself.
$$ G_x = \left( \bigcup_{z \in \mathcal{C}_x} G_z \right) \bigcup G^*_x $$
Note above equation is applied resursively. If denoting $\mathcal{D}^+_x$
as the set of $x$'s offspring terms plus $x$ itself, $G^*_x$ can also be written as:
$$ G_x = \bigcup_{z \in \mathcal{D}^+_x} G_z^* $$
The numbers of annotated items of DAG terms can be obtained via the function
`n_annotations()`. The attribute `attr(,"N")` is the maximal number of items
annotated to the DAG (the same as `max(n_annotations(dag))`), which normally
corresponds to the root term.
```{r}
n_annotations(dag)
```
The next two functions return the associations between terms and items.
```{r}
term_annotations(dag, letters[1:6])
annotated_terms(dag, c("t1", "t2", "t3"))
```
Or return a binary matrix:
```{r}
term_annotations(dag, letters[1:6], return = "matrix")
annotated_terms(dag, c("t1", "t2", "t3"), return = "matrix")
```
## Pseudo root
The DAG should be lead by a single root term. If in an ontology there are
multiple root terms, a pseudo root named `"~all~"` is automatically added.
You can see this process from the message of the function call.
```{r, message = TRUE}
parents = c("a", "a", "b", "x", "x", "y")
children = c("b", "c", "c", "z", "y", "z")
create_ontology_DAG(parents, children)
```
## Sub-DAG
The following code returns a sub-DAG where the input term is picked
as the root of the sub-DAG.
```{r}
# or with the double bracket: dag[["b"]]
dag["b"]
```
Two indicies can be provided in the brackets where the first one corresponds
to root terms and the second one corresponds to leaf terms:
# the same as dag["b"], a sub-DAG where b is the root
dag["b", ]
# a sub-DAG where b is the root and e is the only leaf
dag["b", "e"]
# a sub-DAG that contains all e's ancestors and e itself
dag[, "e"]
With the more general function `dag_filter()`, you can obtain a sub-DAG by providing
a group of terms, a subset of relations, a group of root terms or a leaf terms. Note
if there are multiple root after the fitlering, a pseudo node will be automatically added.
dag_filter(dag, terms, relations, root, leaves)
## Meta data frame
The DAG object can have a meta data frame attached, which provides more information
on the terms in the DAG. Examples can be found from [the "**2. Gene ontology**" vignette](v2_GO.html).
`dag_filter()` can also be applied based on columns in the meta data frame.
## Other aspects
When creating the `ontology_DAG` object, `create_ontology_DAG()` checks
whether there exist cyclic links. Denote a cyclic link as `a->b->c->d->b`. The
last link `d->b` is automatically removed to get rid of cyclic links.
A second, even more extreme scenario are isolated sub-graphs represented as
rings where the root can not be identified there. For example `a->b->c->d->a`
is a ring where there is no root term, thus, it cannot be attached to the
main DAG by adding a pseudo root. Rings with the form of `a->b->a` is more
observed in public ontology datasets. `create_ontology_DAG()` automatically
removes rings.
## Session info
```{r}
sessionInfo()
```