---
title: "Retrieve GO Gene Sets from BioMart"
author: "Zuguang Gu (z.gu@dkfz.de)"
date: '`r Sys.Date()`'
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Retrieve GO Gene Sets from BioMart}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, eval = TRUE, echo = FALSE}
library(knitr)
knitr::opts_chunk$set(
    error = FALSE,
    tidy  = FALSE,
    message = FALSE
)
```


This vignette demonstrates how to manually collect GO gene sets from [BioMart](https://www.ensembl.org/info/data/biomart/index.html). The source
code for retrieving GO gene sets for all supported organisms on BioMart can be found from the following location:

```{r}
system.file("scripts", "biomart_genesets.R", package = "BioMartGOGeneSets")
```

Retrieving GO gene sets from BioMart is a two-step process.

### Step1. Retrieve the relations between GO terms and genes for a specific organism

The relations can be obtained from BioMart with the R package **biomaRt**.
First let's create a "Mart" object which is like a connection to the BioMart web-service. 
Because this is a gene-related relation, we set argument `biomart` to `"genes"`.

```{r}
library(biomaRt)
ensembl = useEnsembl(biomart = "genes")
ensembl
```

```{r, echo = FALSE}
ensembl = useEnsembl(biomart = "genes")
```

Please note, if later when you use **biomaRt** and see errors of "timeout", you can select a different mirror, such as:

```{r, eval = FALSE}
useEnsembl(biomart = "genes", mirror = "uswest")
```

Next we need to select an organism. In BioMart, it is called a "dataset". To get a proper value of an organism dataset,
we can use `listDatasets()` to get a list of supported datasets.

```{r}
datasets = listDatasets(ensembl)
dim(datasets)
head(datasets)
```

You can see there are a huge number of organisms supported in BioMart. The first column in `datasets` contains valid values
for `dataset` which will be used in later functions.

Sometimes it is not easy to find the dataset of your organism, especially when the organism is not a model organism. The second
column in `datasets` contains dataset descriptions, which sometimes can be helpful to find the right one.

In this example, we use the dataset for human `"hsapiens_gene_ensembl"`.

To use the dataset, we specify `dataset` in the function `useDateset()`, which is like adding a flag of which dataset to use.

```{r}
ensembl = useDataset(dataset = "hsapiens_gene_ensembl", mart = ensembl)
ensembl
```

The dataset is like a giant table with a huge number of columns which provide massive additional information for genes. Here we are only 
interested in GO-related information. In the dataset, the table columns are called "attributes". There are a huge number
of supported attributes in a dataset. The complete list of attributes can be obtained by the function `listAttributes()`.

```{r}
all_at = listAttributes(mart = ensembl)
dim(all_at)
head(all_at)
```

To get proper values for the attributes, we need to go through the long table and sometimes this is not an easy task. 
The three attributes of GO-gene relations are `c("ensembl_gene_id", "go_id", "namespace_1003")`. 
Now we can use the function `getBM()` to obtain the GO-gene relation table.


```{r}
at = c("ensembl_gene_id", "go_id", "namespace_1003")
go = getBM(attributes = at, mart = ensembl)
```

Check the first several rows in `go`:


```{r}
head(go)
```

For some organisms, the returned table might be huge, or due to the bad internet connection, the data retrieving may exceed
the 5 min limit on BioMart, and you might see the following error: 

```
Error in curl::curl_fetch_memory(url, handle = handle) :
  Timeout was reached: [uswest.ensembl.org:443] Operation timed out after 300005 milliseconds with 2966434 bytes received
```

In this case, you might need to split the query into blocks and make sure each job query is small.

```{r, eval = FALSE}
genes = getBM(attributes = "ensembl_gene_id", mart = ensembl)[, 1]
go1 = getBM(attributes = at, mart = ensembl, filter = "ensembl_gene_id", value = genes[1:1000])
go2 = getBM(attributes = at, mart = ensembl, filter = "ensembl_gene_id", value = genes[1001:2000])
...
rbind(go1, go2, ...)
```


In this example, we will only demonstrate the Biological Process Ontology, and we convert the data frame `go` to a list of genes.

```{r}
go = go[go$namespace_1003 == "biological_process", , drop = FALSE]
gs = split(go$ensembl_gene_id, go$go_id)
```

In `gs`, there are a list of vectors where each vector can be thought as a gene set for the corresponding GO term. 

### Step 2. Complete genes in GO gene sets

GO has a tree structure where a parent term includes all genes annotated to its child terms. For every GO terms in `gs`, we need
to merge genes from all its child or offspring terms to form a complete gene set for this GO term.

The hierarchical structure of GO terms is stored in the **GO.db** package. In the following code, `bp_terms` contains a vector of
GO terms only in the Biological Process ontology. The variable `GOBPOFFSPRING` is simply a list where each element vector contains all offspring
terms (child and remote downstream terms) of a GO term.

```{r}
library(GO.db)
bp_terms = GOID(GOTERM)[Ontology(GOTERM) == "BP"]
GOBPOFFSPRING = as.list(GOBPOFFSPRING)
```

Now it is quite easy to merge genes from offspring terms. Just note as the final step, empty GO gene sets should be removed.

```{r}
gs2 = lapply(bp_terms, function(nm) {
  go_id = c(nm, GOBPOFFSPRING[[nm]]) # self + offspring
  unique(unlist(gs[go_id]))
})
names(gs2) = bp_terms
gs2 = gs2[sapply(gs2, length) > 0]
```


## Session info

```{r, echo = FALSE}
library(biomaRt)
library(GO.db)
```

```{r}
sessionInfo()
```