--- title: "Similarity between two groups of terms" author: "Zuguang Gu ( z.gu@dkfz.de )" date: '`r Sys.Date()`' output: html_vignette: css: main.css toc: true vignette: > %\VignetteIndexEntry{06. Similarity between two groups of terms} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message = FALSE} library(knitr) knitr::opts_chunk$set( error = FALSE, tidy = FALSE, message = FALSE, warning = FALSE, fig.align = "center") ``` The methods of group similarity implemented in **simona** are mainly from the [supplementary file](https://academic.oup.com/bib/article/18/5/886/2562801#supplementary-data) of the paper ["Mazandu et al., Gene Ontology semantic similarity tools: survey on features and challenges for biological knowledge discovery. Briefings in Bioinformatics 2017"](https://doi.org/10.1093/bib/bbw067). Original denotations have been slightly modified to make them more consistent. Also more explanations have been added in this vignette. There are two groups of terms denoted as $T_p$ and $T_q$ represented as two sets: $$ T_p = \{ a_1, a_2, ...\} \\ T_q = \{ b_1, b_2, ... \} $$ where $a_i$ is a term in set $T_p$ and $b_j$ is a term in set $T_q$. The wrapper function `group_sim()` calculates semantic similarities between two groups of terms with a specific method. Note the method name can be partially matched.
group_sim(dag, group1, group2, method = ..., control = list(...))Some of the group similarity methods have no assumption of which similarity measure between single terms to use. If there are annotation already provided in the DAG object, by default *Sim_Lin_1998* is used, or else *Sim_WP_1994* is used. The term similarity method can be set via the `term_sim_method` parameter in `control`. Additionally parameters for a specific `term_sim_method` can also be set in `control`.
group_sim(dag, group1, group2, method = ..., control = list(term_sim_method = ...))All supported group similarity methods are: ```{r} library(simona) all_group_sim_methods() ``` ## Pairwise term similarity-based methods ### GroupSim_pairwise_avg Denote $S(a, b)$ as the semantic similarity between term $a$ and $b$ where $a$ is from group $p$ and $b$ is from group $q$, The similarity between group $p$ and group $q$ is the average similarity of every pair of individual terms in the two groups: $$ \mathrm{GroupSim}(p, q) = \frac{1}{|T_p|*|T_q|} \sum_{a \in T_p, b \in T_q}S(a, b) $$ The term semantic similarity method and the IC method can be set via `control` argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_avg" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.Other parameters for the `term_sim_method` can also be set in the `control` list. Paper link: https://doi.org/10.1093/bioinformatics/btg153. ### GroupSim_pairwise_max The similarity is defined as the maximal $S(a, b)$ among all pairs of terms in group $p$ and $q$: $$ \mathrm{GroupSim}(p, q) = \max_{a \in T_p, b \in T_q}S(a, b) $$ The term semantic similarity method and the IC method can be set via `control` argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_max" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.Other parameters for the `term_sim_method` can also be set in the `control` list. Paper link: https://doi.org/10.1109/TCBB.2005.50. ### GroupSim_pairwise_BMA BMA stands for "best-match average". First define similarity of a term $x$ to a group of terms $T$ as $$ S(x, T) = \max_{y \in T} S(x, y) $$ which corresponds to the most similar term in $T$ to $x$. Then the BMA similarity is calculated as: $$ \mathrm{GroupSim}(p, q) = \frac{1}{2}\left( \frac{1}{|T_p|}\sum_{a \in T_p} S(a, T_q) + \frac{1}{|T_q|}\sum_{b \in T_q} S(b, T_p) \right) $$ The term semantic similarity method and the IC method can be set via `control` argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMA" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.Other parameters for the `term_sim_method` can also be set in the `control` list. Paper link: https://doi.org/10.1155/2012/975783. ### GroupSim_pairwise_BMM BMM stands for "best-match max". It is defined as: $$ \mathrm{GroupSim}(p, q) = \max \left \{ \frac{1}{|T_p|}\sum_{a \in T_p} S(a, T_q), \frac{1}{|T_q|}\sum_{b \in T_q} S(b, T_p) \right \} $$ The term semantic similarity method and the IC method can be set via `control` argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMM" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.Other parameters for the `term_sim_method` can also be set in the `control` list. Paper link: https://doi.org/10.1186/1471-2105-7-302. ### GroupSim_pairwise_ABM ABM stands for "average best-match". It is defined as: $$ \mathrm{GroupSim}(p, q) = \frac{1}{|T_q| + |T_q|} \left( \sum_{a \in T_p} S(a, T_q) + \sum_{b \in T_q} S(b, T_p) \right) $$ The term semantic similarity method and the IC method can be set via `control` argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_ABM" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.Other parameters for the `term_sim_method` can also be set in the `control` list. Paper link: https://doi.org/10.1186/1471-2105-14-284. ### GroupSim_pairwise_HDF First define the distance of a term $x$ to a group of terms $T$: $$D(x, T) = 1 - S(x, T)$$ Then the Hausdorff distance between two groups are: $$ \mathrm{HDF}(p, q) = \max \left\{ \max_{a \in T_p} D(a, T_q), \max_{b \in T_q} D(b, T_q) \right\} $$ This final similarity is: $$ \mathrm{GroupSim}(p, q) = 1 - \mathrm{HDF}(p, q) $$ The term semantic similarity method and the IC method can be set via `control` argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_HDF" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.Other parameters for the `term_sim_method` can also be set in the `control` list. ### GroupSim_pairwise_MHDF Instead of using the maximal distance from a group to the other group, MHDF uses mean distance: $$ \mathrm{MHDF}(p, q) = \max \left\{ \frac{1}{|T_p|} \sum_{a \in T_p} D(a, T_q), \frac{1}{|T_q|} \sum_{b \in T_q} D(b, T_q) \right\} $$ This final similarity is: $$ \mathrm{GroupSim}(p, q) = 1 - \mathrm{MHDF}(p, q) $$ The term semantic similarity method and the IC method can be set via `control` argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_MHDF" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.Other parameters for the `term_sim_method` can also be set in the `control` list. Paper link: https://doi.org/10.1109/ICPR.1994.576361. ### GroupSim_pairwise_VHDF It is defined as: $$ \mathrm{VHDF}(p, q) = \frac{1}{2} \left( \sqrt{\frac{1}{|T_p|} \sum_{a \in T_p} D^2(a, T_q)} + \sqrt{\frac{1}{|T_q|} \sum_{b \in T_q} D^2(b, T_q)} \right) $$ This final similarity is: $$ \mathrm{GroupSim}(p, q) = 1 - \mathrm{VHDF}(p, q) $$ The term semantic similarity method and the IC method can be set via `control` argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_VHDF" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.Other parameters for the `term_sim_method` can also be set in the `control` list. Paper link: https://doi.org/10.1073/pnas.0702965104. ### GroupSim_pairwise_Froehlich_2007 The similarity is: $$ \mathrm{GroupSim}(p, q) = \exp(-\mathrm{HDF}(p, q)) $$ The term semantic similarity method and the IC method can be set via `control` argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_Froehlich_2007" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.Other parameters for the `term_sim_method` can also be set in the `control` list. Paper link: https://doi.org/10.1186/1471-2105-8-166. ### GroupSim_pairwise_Joeng_2014 Similar to VHDF, but it directly uses the similarity: $$ \mathrm{GroupSim}(p, q) = \frac{1}{2} \left( \sqrt{\frac{1}{|T_p|} \sum_{a \in T_p} S^2(a, T_q)} + \sqrt{\frac{1}{|T_q|} \sum_{b \in T_q} S^2(b, T_q)} \right) $$ The term semantic similarity method and the IC method can be set via `control` argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_Joeng_2014" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.Other parameters for the `term_sim_method` can also be set in the `control` list. Paper link: https://doi.org/10.1109/tcbb.2014.2343963. ## Pairwise edge-based methods ### GroupSim_SimALN It is based on the average distance between every pair of terms in the two groups: $$ \mathrm{GroupSim}(p, q) = \exp\left(-\frac{1}{|T_p|*|T_q|} \sum_{a \in T_p, b \in T_q} D_\mathrm{sp}(a, b)\right) $$ Or use the longest distance between two terms: $$ \mathrm{GroupSim}(p, q) = \exp\left(-\frac{1}{|T_p|*|T_q|} \sum_{a \in T_p, b \in T_q} \mathrm{len}(a, b)\right) $$ There is a parameter distance which takes value of `"longest_distances_via_LCA"` (the default) or `"shortest_distances_via_NCA"`:
group_sim(dag, group1, group2, method = "GroupSim_SimALN", control = list(distance = "shortest_distances_via_NCA"))Paper link: https://doi.org/10.1109/CBMS.2008.27. ## Groupwise IC-based methods This category of methods depend on the IC of terms in the two groups as well as their ancestor terms. ### GroupSim_SimGIC, GroupSim_SimDIC and GroupSim_SimUIC, Denote $A$ and $B$ as the two sets of ancestors of terms in group $p$ and $q$ respectively: $$ \begin{align*} \mathcal{A}_p &= \bigcup_{a \in T_p} \mathcal{A}_a \\ \mathcal{A}_q &= \bigcup_{b \in T_q} \mathcal{A}_b \\ \end{align*} $$ The _GroupSim_SimGIC_, _GroupSim_SimDIC_ and _GroupSim_SimUIC_ are very similar. They are based on the IC of the ancestor terms, defined as: $$ \begin{align*} \mathrm{GroupSim}_\mathrm{SimGIC}(p, q) &= \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q} \mathrm{IC}(x)}{\sum\limits_{x \in \mathcal{A}_p \cup \mathcal{A}_q} \mathrm{IC}(x)} \\ \mathrm{GroupSim}_\mathrm{SimDIC}(p, q) &= \frac{2 * \sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q} \mathrm{IC}(x)}{\sum\limits_{x \in \mathcal{A}_p} \mathrm{IC}(x) + \sum\limits_{x \in \mathcal{A}_q} \mathrm{IC}(x)} \\ \mathrm{GroupSim}_\mathrm{SimUIC}(p, q) &= \frac{2 * \sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q} \mathrm{IC}(x)}{\max\left\{\sum\limits_{x \in \mathcal{A}_p} \mathrm{IC}(x), \sum\limits_{x \in \mathcal{A}_q} \mathrm{IC}(x) \right\}} \\ \end{align*} $$ IC method can be set via the `control` argument. By default if there is annotation associated, _IC_annotation_ is used, or else _IC_offspring_ is used.
group_sim(dag, group1, group2, method = "GroupSim_SimGIC", control = list(IC_method = ...))### GroupSim_SimUI, GroupSim_SimDB, GroupSim_SimUB and GroupSim_SimNTO These four methods are based on the counts of ancestor terms: $$ \begin{align*} \mathrm{GroupSim}_\mathrm{SimUI}(p, q) &= \frac{|\mathcal{A}_p \cap \mathcal{A}_q|}{|\mathcal{A}_p \cup \mathcal{A}_q|} \\ \mathrm{GroupSim}_\mathrm{SimDB}(p, q) &= \frac{2*|\mathcal{A}_p \cap \mathcal{A}_q|}{|\mathcal{A}_p| + |\mathcal{A}_q|} \\ \mathrm{GroupSim}_\mathrm{SimUB}(p, q) &= \frac{|\mathcal{A}_p \cap \mathcal{A}_q|}{\max\{|\mathcal{A}_p|, |\mathcal{A}_q|\}} \\ \mathrm{GroupSim}_\mathrm{SimNTO}(p, q) &= \frac{|\mathcal{A}_p \cap \mathcal{A}_q|}{\min\{|\mathcal{A}_p|, |\mathcal{A}_q|\}} \end{align*} $$
group_sim(dag, group1, group2, method = "GroupSim_SimUI")### GroupSim_SimCOU Let's write $\mathcal{A}_p$ and $\mathcal{A}_q$ as two vectors $\mathbf{v_p}$ and $\mathbf{v_q}$. Taking $\mathbf{v_p}$ as an example, it is $\mathbf{v_p} = (w_1, ..., w_n)$ where $n$ is the number of total terms in the DAG. The value $w_i$ is assigned to the corresponding term $t_i$ and is defined as: $$ \mathcal{w}_{i} = \left\{ \begin{array}{ll} \mathrm{IC}(t_i) & \textrm{if} t_i \in \mathcal{A}_p \\ 0 & \textrm{otherwise} \end{array} \right. $$ The semantic similarity is defined as the cosine similarity between the two vectors: $$ \mathrm{GroupSim}(a, b) = \frac{ \mathbf{v_p} \cdot \mathbf{v_q} }{\left \| \mathbf{v_p} \right \| \cdot \left \| \mathbf{v_q} \right \|} $$ It can also be written as: $$ \mathrm{GroupSim}(a, b) = \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q}\mathrm{IC}(x)^2}{\sqrt{\sum\limits_{x \in \mathcal{A}_p}\mathrm{IC}(x)^2} \cdot \sqrt{\sum\limits_{x \in \mathcal{A}_q}\mathrm{IC}(x)^2}} $$ IC method can be set via the `control` argument. By default if there is annotation associated, _IC_annotation_ is used, or else _IC_offspring_ is used.
group_sim(dag, group1, group2, method = "GroupSim_SimCOU", control = list(IC_method = ...))### GroupSim_SimCOT The semantic similarity is defined as: $$ \begin{align*} \mathrm{GroupSim}(a, b) &= \frac{ \mathbf{v_p} \cdot \mathbf{v_q} }{\left \| \mathbf{v_p} \right \|^2 + \left \| \mathbf{v_q} \right \|^2 - \mathbf{v_p} \cdot \mathbf{v_q}} \\ &= \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q}\mathrm{IC}(x)^2}{\sum\limits_{x \in \mathcal{A}_p \cup \mathcal{A}_q}\mathrm{IC}(x)^2} \end{align*} $$ IC method can be set via the `control` argument. By default if there is annotation associated, _IC_annotation_ is used, or else _IC_offspring_ is used.
group_sim(dag, group1, group2, method = "GroupSim_SimCOT", control = list(IC_method = ...))## Groupwise edge-based methods ### GroupSim_SimLP It is the largest depth of terms in $\mathcal{A}_p \cap \mathcal{A}_q$. $$ \mathrm{GroupSim}(p, q) = \max\{\delta(t): t \in \mathcal{A}_p \cap \mathcal{A}_q\} $$
group_sim(dag, group1, group2, method = "GroupSim_SimLP")Link: https://bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/GOvis.html#go-induced-distances. ### GroupSim_Ye_2005 It is a normalized version of *GroupSim_SimLP*: $$ \begin{align*} \mathrm{GroupSim}(p, q) &= \max\left\{\frac{\delta(t) - \delta_\mathrm{min}}{\delta_\mathrm{max} - \delta_\mathrm{min}}: t \in \mathcal{A}_p \cap \mathcal{A}_q\right\} \\ &= \max\left\{\frac{\delta(t) }{\delta_\mathrm{max}}: t \in \mathcal{A}_p \cap \mathcal{A}_q\right\} \end{align*} $$ Since the minimal depth is zero for root.
group_sim(dag, group1, group2, method = "GroupSim_Ye_2005")Paper link: https://doi.org/10.1038/msb4100034. ## Annotated items-based methods This category of methods consider the items annotated to the two groups of terms. ### GroupSim_SimCHO It is based on the annotated items. Denote $\sigma(t)$ as the total number of annotated items of $t$ (after merging all its offspring terms). The similarity is calculated as: $$ \mathrm{GroupSim}(p, q) = \frac{\log(C_{pq})}{\log(C_\mathrm{min}/C_\mathrm{max})} $$ where $C_{pq} = \min\{\sigma(t): t \in T_p \cap T_q \}$, $C_\mathrm{min}$ is the minimal number of annotated items in the DAG which in most cases is 1, $C_\mathrm{max}$ is the maximal number of annotated items, which is the total number of items annotated to the complete DAG. The similarity can also be written in form of $\mathrm{IC}_\mathrm{anno}$: $$ \mathrm{GroupSim}(p, q) = \frac{\max\limits_{x \in T_p \cup T_q}\mathrm{IC}(x)}{\mathrm{IC}_\mathrm{max}} $$
group_sim(dag, group1, group2, method = "GroupSim_SimCHO")### GroupSim_SimALD The similarity is calculated as: $$ \mathrm{GroupSim}(p, q) = \max\left\{ 1 - \frac{\sigma(x)}{C_\mathrm{max}}: x \in T_p \cap T_q \right\} $$
group_sim(dag, group1, group2, method = "GroupSim_SimALD")## Set-based methods Since $T_p$ and $T_q$ are two sets, the Kappa coeffcient, Jaccard coeffcient, Dice coeffcient and overlap coeffcient can be naturally used.
group_sim(dag, group1, group2, method = "GroupSim_Jaccard", control = list(universe = ...)) group_sim(dag, group1, group2, method = "GroupSim_Dice", control = list(universe = ...)) group_sim(dag, group1, group2, method = "GroupSim_Overlap", control = list(universe = ...)) group_sim(dag, group1, group2, method = "GroupSim_Kappa", control = list(universe = ...))## Session info ```{r} sessionInfo() ```