---
title: "GO Semantic Similarity Analysis"
author: "\\

	Guangchuang Yu (<guangchuangyu@gmail.com>)\\

        School of Public Health, The University of Hong Kong"
date: "`r Sys.Date()`"
bibliography: GOSemSim.bib
csl: nature.csl
output:
  html_document:
    toc: true
  pdf_document:
    toc: true
vignette: >
  % \VignetteIndexEntry{An introduction to GOSemSim}
  % \VignetteEngine{knitr::rmarkdown}
  % \usepackage[utf8]{inputenc}
---

```{r style, echo=FALSE, results="asis", message=FALSE}
knitr::opts_chunk$set(tidy = FALSE,
                      warning = FALSE,
                      message = FALSE)
```

```{r echo=FALSE, results="hide", message=FALSE}
library(org.Hs.eg.db)
library(GO.db)
library(GOSemSim)
```

# Citation

Please cite the following article when using [GOSemSim](https://www.bioconductor.org/packages/GOSemSim):

__G Yu__, F Li, Y Qin, X Bo, Y Wu, S Wang.
GOSemSim: an R package for measuring semantic similarity among GO terms and gene products.
**_Bioinformatics_** 2010, 26(7):976-978.
doi: [10.1093/bioinformatics/btq064](http://dx.doi.org/10.1093/bioinformatics/btq064).


# Introduction

Functional similarity of gene products can be estimated by controlled
biological vocabularies, such as Gene Ontology (GO) and Disease Ontology (DO). GO comprises of
three orthogonal ontologies, i.e. molecular function (MF), biological
process (BP), and cellular component (CC).


Four methods including Resnik[@philip_semantic_1999], Jiang[@jiang_semantic_1997],
Lin[@lin_information-theoretic_1998] and Schlicker[@schlicker_new_2006] have been presented to determine the semantic similarity
of two GO terms based on the annotation statistics of their common
ancestor terms. Wang[@wang_new_2007]
proposed a method to measure the similarity based on the graph structure of
GO. Each of these methods has its own advantages and
weaknesses.

[GOSemSim](https://www.bioconductor.org/packages/GOSemSim) package[@yu2010] is developed to compute
semantic similarity among GO terms, sets of GO terms, gene products,
and gene clusters, providing five methods mentioned above. We have
developed another package, [DOSE](https://www.bioconductor.org/packages/DOSE)[@yu_dose_2015], for measuring semantic
similarity among Disease Ontology (DO) terms and gene products at disease perspective.


## Homepage

Please visit [GOSemSim homepage](https://guangchuangyu.github.io/GOSemSim/) for more information.


# Semantic Similarity Measurement Based on GO

## Information content-based methods

Four methods proposed by Resnik[@philip_semantic_1999],
Jiang[@jiang_semantic_1997], Lin[@lin_information-theoretic_1998]
and Schlicker[@schlicker_new_2006] are information content (IC) based, which
depend on the frequencies of two GO terms involved and that of their
closest common ancestor term in a specific corpus of GO
annotations. The information content of a GO term is computed by the
negative log probability of the term occurring in GO corpus. A rarely used term contains a
greater amount of information.

The frequency of a term t is defined as:
$p(t) = \frac{n_{t'}}{N} | t' \in \left\{t, \; children\: of\: t \right\}$

where $n_{t'}$ is the number of term $t'$, and $N$ is the total number of terms in GO corpus.

Thus the information content is defined as:
$IC(t) = -\log(p(t))$

As GO allow multiple parents for each concept, two terms can share
parents by multiple paths. IC-based methods calculate similarity of two GO terms based on the information content of their closest common ancestor term, which was also called most informative information ancestor (MICA).

### Resnik method

The Resnik method is defined as:
$sim_{Resnik}(t_1,t_2) = IC(MICA)$

### Lin method

The Lin method is defined as:
$sim_{Lin}(t_1,t_2) = \frac{2IC(MICA)}{IC(t_1)+IC(t_2)}$

### Rel method

The Relevance method, which was proposed by Schlicker, combine Resnik's and Lin's method
and is defined as:
$sim_{Rel}(t_1,t_2) = \frac{2IC(MICA)(1-p(MICA))}{IC(t_1)+IC(t_2)}$

### Jiang method

The Jiang and Conrath's method is defined as:
$sim_{Jiang}(t_1,t_2) = 1-\min(1, IC(t_1) + IC(t_2) - 2IC(MICA))$


## Graph-based method
Graph-based methods using the topology of GO graph structure to
compute semantic similarity. Formally, a GO term A can be represented
as $DAG_{A}=(A,T_{A},E_{A})$ where $T_{A}$ is the set of GO terms in
$DAG_{A}$, including term A and all of its ancestor terms in the GO
graph, and $E_{A}$ is the set of edges connecting the GO terms in
$DAG_{A}$.

### Wang method
To encode the semantic of a GO term in a measurable format to enable a quantitative comparison, Wang[@wang_new_2007] firstly defined the semantic value of term A as the aggregate contribution of all terms in $DAG_{A}$ to the semantics of term A, terms closer to term A in $DAG_{A}$ contribute more to its semantics. Thus, defined the contribution of a GO term $t$ to the semantic of GO term $A$ as the S-value of GO term $t$ related to term $A$.
For any of term $t$ in $DAG_{A}$, its S-value related to term $A$, $S_{A}(\textit{t})$ is defined as:


$\left\{\begin{array}{l} S_{A}(A)=1 \\ S_{A}(\textit{t})=\max\{w_{e} \times S_{A}(\textit{t}') | \textit{t}' \in children \: of(\textit{t}) \} \; if \: \textit{t} \ne A \end{array} \right.$

where $w_{e}$ is the semantic contribution factor for edge $e \in E_{A}$ linking term $t$ with its child term $t'$.
Term $A$ contributes to its own is defined as 1. After obtaining the S-values for all terms in $DAG_{A}$,
the semantic value of DO term A, $SV(A)$, is calculated as:

$SV(A)=\displaystyle\sum_{t \in T_{A}} S_{A}(t)$

Thus given two GO terms A and B, the semantic similarity between these two terms is defined as:

$sim_{Wang}(A, B) = \frac{\displaystyle\sum_{t \in T_{A} \cap T_{B}}{S_{A}(t) + S_{B}(t)}}{SV(A) + SV(B)}$

where $S_{A}(\textit{t})$ is the S-value of GO term $t$ related to term $A$
and $S_{B}(\textit{t})$ is the S-value of GO term $t$ related to term $B$.

This method proposed by Wang[@wang_new_2007] determines the semantic
similarity of two GO terms based on both the locations of these terms
in the GO graph and their relations with their ancestor terms.

## Supported organisms

For IC-based methods, information of GO term is species specific. We
need to calculate `IC` for all GO terms of a species before we measure
semantic
similarity. [GOSemSim](https://www.bioconductor.org/packages/GOSemSim)
support all organisms that have an `OrgDb` object available.

Bioconductor have already provided `OrgDb` for about 20 species, see <http://bioconductor.org/packages/release/BiocViews.html#___OrgDb>.

We can build query `OrgDb` online via [AnnotationHub](https://www.bioconductor.org/packages/AnnotationHub). For example:

```{r eval=FALSE}
library(AnnotationHub)
hub <- AnnotationHub()
q <- query(hub, "Cricetulus")
id <- q$ah_id[length(q)]
Cgriseus <- hub[[id]]
```


If organism is not supported by [AnnotationHub](https://www.bioconductor.org/packages/AnnotationHub), user can use [AnnotationForge](https://www.bioconductor.org/packages/AnnotationForge) to build `OrgDb`.

Once we have `OrgDb`, we can build annotation data needed by [GOSemSim](https://www.bioconductor.org/packages/GOSemSim) via `godata` function.

```{r}
library(GOSemSim)
hsGO <- godata('org.Hs.eg.db', ont="MF")
```

User can set `computeIC=FALSE` if they only want to use Wang's method.


## goSim and mgoSim function

In [GOSemSim](https://www.bioconductor.org/packages/GOSemSim), we implemented all these IC-based and graph-based methods. __*goSim*__ function calculates semantic similarity between two GO terms, while __*mgoSim*__ function calculates semantic similarity between two sets of GO terms.


```{r}
goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang")
goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Wang")
go1 = c("GO:0004022","GO:0004024","GO:0004174")
go2 = c("GO:0009055","GO:0005515")
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")
```

# Gene Semantic Similarity Measurement

On the basis of semantic similarity between GO terms, [GOSemSim](https://www.bioconductor.org/packages/GOSemSim) can
also compute semantic similarity among sets of GO terms, gene products, and gene clusters.

Suppose we have gene $g_1$ annotated by GO terms sets $GO_{1}=\{go_{11},go_{12} \cdots go_{1m}\}$
and $g_2$ annotated by $GO_{2}=\{go_{21},go_{22} \cdots go_{2n}\}$, [GOSemSim](https://www.bioconductor.org/packages/GOSemSim) implemented four methods which called __*max*__, __*avg*__, __*rcmax*__, and __*BMA*__ to combine
semantic similarity scores of multiple GO terms. The similarities
among gene products and gene clusters which annotated by multiple GO
terms were also calculated by the same combine methods mentioned
above.


## Combine methods

### max

The __*max*__ method calculates the maximum semantic similarity score over all pairs of GO terms between these two GO term sets.


$sim_{max}(g_1, g_2) = \displaystyle\max_{1 \le i \le m, 1 \le j \le n} sim(go_{1i}, go_{2j})$

### avg

The __*avg*__ calculates the average semantic similarity score over all pairs of GO terms.


$sim_{avg}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m\sum_{j=1}^nsim(go_{1i}, go_{2j})}{m \times n}$

### rcmax

Similarities among two sets of GO terms form a matrix, the __*rcmax*__ method uses the maximum of _RowScore_ and _ColumnScore_, where _RowScore_ (or _ColumnScore_) is the average of maximum similarity on each row (or column).


$sim_{rcmax}(g_1, g_2) = \max(\frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n} sim(go_{1i}, go_{2j})}{m},\frac{\displaystyle\sum_{j=1}^n \max_{1 \le i \le m} sim(go_{1i},go_{2j})}{n})$

### BMA

The __*BMA*__ method, used the **B**est-**M**atch **A**verage strategy, calculates the average of all maximum similarities on each row and column, and is defined as:


$sim_{BMA}(g_1, g_2) = \frac{\displaystyle\sum_{1=i}^m \max_{1 \le j \le n}sim(go_{1i}, go_{2j}) + \displaystyle\sum_{1=j}^n \max_{1 \le i \le m}sim(go_{1i}, go_{2j})} {m+n}$


## geneSim and mgeneSim

In [GOSemSim](https://www.bioconductor.org/packages/GOSemSim), we implemented __*geneSim*__ to calculate semantic similarity between two gene products, and __*mgeneSim*__ to calculate semantic similarity among multiple gene products.

```{r}
geneSim("241", "251", semData=hsGO, measure="Wang", combine="BMA")
mgeneSim(genes=c("835", "5261","241", "994"),
         semData=hsGO, measure="Wang",verbose=FALSE)
mgeneSim(genes=c("835", "5261","241", "994"),
       semData=hsGO, measure="Rel",verbose=FALSE)
```

By default, `godata` function use `ENTREZID` as keytype, and the input ID type is `ENTREZID`. User can use other ID types such as `ENSEMBL`, `UNIPROT`, `REFSEQ`, `ACCNUM`, `SYMBOL` _et al_.

Here as an example, we use `SYMBOL` as `keytype` and calculate semantic similarities among several genes by using their gene symbol as input.

```{r}
hsGO2 <- godata('org.Hs.eg.db', keytype = "SYMBOL", ont="MF", computeIC=FALSE)
genes <- c("CDC45", "MCM10", "CDC20", "NMU", "MMP1")
mgeneSim(genes, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE)
```

Users can also use [`clusterProfiler::bitr`](https://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html#bitr-biological-id-translator) to translate biological IDs.


## clusterSim and mclusterSim

We also implemented __*clusterSim*__ for calculating semantic similarity between two gene clusters
and __*mclusterSim*__ for calculating semantic similarities among multiple gene clusters.

```{r}
gs1 <- c("835", "5261","241", "994", "514", "533")
gs2 <- c("578","582", "400", "409", "411")
clusterSim(gs1, gs2, semData=hsGO, measure="Wang", combine="BMA")

library(org.Hs.eg.db)
x <- org.Hs.egGO
hsEG <- mappedkeys(x)
set.seed <- 123
clusters <- list(a=sample(hsEG, 20), b=sample(hsEG, 20), c=sample(hsEG, 20))
mclusterSim(clusters, semData=hsGO, measure="Wang", combine="BMA")
```

# Applications

[GOSemSim](https://www.bioconductor.org/packages/GOSemSim) was cited by more than [200 papers](https://scholar.google.com.hk/scholar?oi=bibs&hl=en&cites=9484177541993722322,17633835198940746971,18126401808149291947) and had been applied to many research domains, including:

+ [Disease or Drug analysis](https://guangchuangyu.github.io/GOSemSim/featuredArticles/#diease-or-drug-analysis)
+ [Gene/Protein functional analysis](https://guangchuangyu.github.io/GOSemSim/featuredArticles/#geneprotein-functional-analysis)
+ [Protein-Protein interaction](https://guangchuangyu.github.io/GOSemSim/featuredArticles/#protein-protein-interaction)
+ [miRNA-mRNA interaction](https://guangchuangyu.github.io/GOSemSim/featuredArticles/#mirna-mrna-interaction)
+ [sRNA regulation](https://guangchuangyu.github.io/GOSemSim/featuredArticles/#srna-regulation)
+ [Evolution](https://guangchuangyu.github.io/GOSemSim/featuredArticles/#evolution)

Find out more on <https://guangchuangyu.github.io/GOSemSim/featuredArticles/>.


# GO enrichment analysis

GO enrichment analysis can be supported by our package [clusterProfiler](https://www.bioconductor.org/packages/clusterProfiler)[@yu2012], which supports hypergeometric test and Gene Set Enrichment Analysis (GSEA). Enrichment results across different gene clusters can be compared using __*compareCluster*__ function.

# Disease Ontology Semantic and Enrichment analysis

Disease Ontology (DO) annotates human genes in the context of disease. DO is an important annotation in translating molecular findings from high-throughput data to clinical relevance.
[DOSE](https://www.bioconductor.org/packages/DOSE)[@yu_dose_2015] supports semantic similarity computation among DO terms and genes.
Enrichment analysis including hypergeometric model and GSEA are also implemented to support discovering disease associations of high-throughput biological data.

# MeSH enrichment and semantic analyses

MeSH (Medical Subject Headings) is the NLM controlled vocabulary used to manually index articles for MEDLINE/PubMed. [meshes](https://www.bioconductor.org/packages/meshes) supports enrichment (hypergeometric test and GSEA) and semantic similarity analyses for more than 70 species.


# Session Information

Here is the output of `sessionInfo()` on the system on which this document was compiled:

```{r echo=FALSE}
sessionInfo()
```

# References