---
title: "Pre-compiled GO Gene Sets"
author: "Zuguang Gu (z.gu@dkfz.de)"
date: '`r Sys.Date()`'
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Pre-compiled GO Gene Sets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r, eval = TRUE, echo = FALSE}
library(knitr)
knitr::opts_chunk$set(
    error = FALSE,
    tidy  = FALSE,
    message = FALSE
)
```

The **BioMartGOGeneSets** contains pre-compiled GO gene sets for a huge number of
organisms supported in [BioMart](https://www.ensembl.org/info/data/biomart/index.html).
There are two types of data: 1. genes and 2 gene sets.

## Retrieve genes

To obtain the genes, use the function `getBioMartGenes()`. You need to provide
a proper "dataset", which can be found with the function `supportedOrganisms()` (A complete list can
be also found from "[**BioMart Gene Ontology Gene Sets Collections**](supported_organisms.html))". Here
we use the dataset `"hsapiens_gene_ensembl"` as an example which is for human.

```{r}
library(BioMartGOGeneSets)
gr = getBioMartGenes("hsapiens_gene_ensembl")
gr
```

The returned value is a `GRanges` object which contains the coordinates of genes. 
The meta columns contain additional information of genes such as different gene IDs
and type of genes.

You can also provide a "short name" for dataset and the function will perform a partial matching.

```{r, eval = FALSE}
gr = getBioMartGenes("hsapiens")
```

You can try the following command and see what will be printed:

```{r, eval = FALSE}
gr = getBioMartGenes("human")
```

You can also provide the taxon id for the organism:

```{r, eval = FALSE}
gr = getBioMartGenes(9606)
```

Chromosome names from Ensembl have no "chr" prefix. You can set `add_chr_prefix = TURE` to add `"chr"` prefix to some
of the chromosome names. Internally it uses `GenomeInfoDb::seqlevelsStyle(gr) = "UCSC"` to check and to add `"chr"` prefix.

```{r}
gr = getBioMartGenes("hsapiens_gene_ensembl", add_chr_prefix = TRUE)
gr
```

Note `add_chr_prefix` is just a helper argument. You can basically do the same as:

```{r}
gr = getBioMartGenes("hsapiens_gene_ensembl")
GenomeInfoDb::seqlevelsStyle(gr) = "UCSC"
gr
```

For some not-well-studied organisms, there might be no "official chromosome name". For example, `"cporcellus_gene_ensembl"` for the guinea Pig:

```{r}
gr = getBioMartGenes("cporcellus_gene_ensembl")
gr
```

The sequence names are in a special format of `DS\d+`. The source of the format can be obtained by `getBioMartGenomeInfo()`. In the `seqname_style` element of the returned list, there are several examples that you can compare to.

```{r}
getBioMartGenomeInfo("cporcellus_gene_ensembl")
```

Now we know they are the GenBank accession IDs. Next we might want to change them to the `"Sequence-Name"` style. Simply 
use `changeSeqnameStyle()`.

```{r}
gr2 = changeSeqnameStyle(gr, "cporcellus_gene_ensembl", 
    seqname_style_from = "GenBank-Accn", 
    seqname_style_to = "Sequence-Name")
gr2
```

Sometimes the internal sequence names need to be reformatted to fit the input `gr`. In the second example ( `"apercula_gene_ensembl"` for the orange clownfish), the sequence names are in format of `1, 2, 3, ...`, while internally they are represented as `chr1, chr2, ...`.

```{r}
gr = getBioMartGenes("apercula_gene_ensembl")
gr
getBioMartGenomeInfo("apercula_gene_ensembl")
```

In this case, we need to set the argument `reformat_from` as a function to reformat the internal format to fit the sequence names in `gr`.
Also you can set `reformat_to` as a function to reformat the converted sequence names.

```{r}
gr2 = changeSeqnameStyle(gr, "apercula_gene_ensembl", 
    seqname_style_from = "Sequence-Name", 
    seqname_style_to = "GenBank-Accn", 
    reformat_from =function(x) gsub("chr", "", x),
    reformat_to = function(x) gsub("\\.\\d+$", "", x)
)
gr2
```


## Retrieve gene sets

To obtain the gene sets, use the function `getBioMartGOGeneSets()`. Also you need to provide
the "dataset". Here we use a different dataset: `"mmusculus_gene_ensembl"` (mouse).

```{r}
lt = getBioMartGOGeneSets("mmusculus_gene_ensembl")
length(lt)
lt[1]
```

The variable `lt` is a list of vectors where each vector corresponds to a GO gene set with Ensembl
IDs as gene identifiers.

You can try the following command and see what will be printed:

```{r, eval = FALSE}
lt = getBioMartGOGeneSets("mouse")
```

Remember you can also set the taxon ID:

```{r, eval = FALSE}
lt = getBioMartGOGeneSets(10090)
```

In `getBioMartGOGeneSets()`, argument `as_table` can be set to `TRUE`, then the function returns
a data frame.

```{r}
tb = getBioMartGOGeneSets("mmusculus_gene_ensembl", as_table = TRUE)
head(tb)
```

Argument `ontology` controls which category of GO gene sets. Possible values should be `"BP"`, `"CC"`
and `"MF"`.

```{r, eval = FALSE}
getBioMartGOGeneSets("mmusculus_gene_ensembl", ontology = "BP") # the default one
getBioMartGOGeneSets("mmusculus_gene_ensembl", ontology = "CC")
getBioMartGOGeneSets("mmusculus_gene_ensembl", ontology = "MF")
```

Last, argument `gene_id_type` can be set to `"entrez_gene"` or `"gene_symbol"`, then genes in the gene sets
are in Entrez IDs or gene symbols. Note this depends on specific organisms, that not every organism supports 
Entrez IDs or gene symbols.

```{r}
lt = getBioMartGOGeneSets("mmusculus_gene_ensembl", gene_id_type = "entrez_gene")
lt[1]
```

## Version of the data

The object `BioMartGOGeneSets` contains the version and source of data.

```{r}
BioMartGOGeneSets
```

## Session Info

```{r}
sessionInfo()
```