--- title: "Retrieve GO Gene Sets from BioMart" author: "Zuguang Gu (z.gu@dkfz.de)" date: '`r Sys.Date()`' output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Retrieve GO Gene Sets from BioMart} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, eval = TRUE, echo = FALSE} library(knitr) knitr::opts_chunk$set( error = FALSE, tidy = FALSE, message = FALSE ) ``` This vignette demonstrates how to manually collect GO gene sets from [BioMart](https://www.ensembl.org/info/data/biomart/index.html). The source code for retrieving GO gene sets for all supported organisms on BioMart can be found from the following location: ```{r} system.file("scripts", "biomart_genesets.R", package = "BioMartGOGeneSets") ``` Retrieving GO gene sets from BioMart is a two-step process. ### Step1. Retrieve the relations between GO terms and genes for a specific organism The relations can be obtained from BioMart with the R package **biomaRt**. First let's create a "Mart" object which is like a connection to the BioMart web-service. Because this is a gene-related relation, we set argument `biomart` to `"genes"`. ```{r} library(biomaRt) ensembl = useEnsembl(biomart = "genes") ensembl ``` ```{r, echo = FALSE} ensembl = useEnsembl(biomart = "genes") ``` Please note, if later when you use **biomaRt** and see errors of "timeout", you can select a different mirror, such as: ```{r, eval = FALSE} useEnsembl(biomart = "genes", mirror = "uswest") ``` Next we need to select an organism. In BioMart, it is called a "dataset". To get a proper value of an organism dataset, we can use `listDatasets()` to get a list of supported datasets. ```{r} datasets = listDatasets(ensembl) dim(datasets) head(datasets) ``` You can see there are a huge number of organisms supported in BioMart. The first column in `datasets` contains valid values for `dataset` which will be used in later functions. Sometimes it is not easy to find the dataset of your organism, especially when the organism is not a model organism. The second column in `datasets` contains dataset descriptions, which sometimes can be helpful to find the right one. In this example, we use the dataset for human `"hsapiens_gene_ensembl"`. To use the dataset, we specify `dataset` in the function `useDateset()`, which is like adding a flag of which dataset to use. ```{r} ensembl = useDataset(dataset = "hsapiens_gene_ensembl", mart = ensembl) ensembl ``` The dataset is like a giant table with a huge number of columns which provide massive additional information for genes. Here we are only interested in GO-related information. In the dataset, the table columns are called "attributes". There are a huge number of supported attributes in a dataset. The complete list of attributes can be obtained by the function `listAttributes()`. ```{r} all_at = listAttributes(mart = ensembl) dim(all_at) head(all_at) ``` To get proper values for the attributes, we need to go through the long table and sometimes this is not an easy task. The three attributes of GO-gene relations are `c("ensembl_gene_id", "go_id", "namespace_1003")`. Now we can use the function `getBM()` to obtain the GO-gene relation table. ```{r} at = c("ensembl_gene_id", "go_id", "namespace_1003") go = getBM(attributes = at, mart = ensembl) ``` Check the first several rows in `go`: ```{r} head(go) ``` For some organisms, the returned table might be huge, or due to the bad internet connection, the data retrieving may exceed the 5 min limit on BioMart, and you might see the following error: ``` Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: [uswest.ensembl.org:443] Operation timed out after 300005 milliseconds with 2966434 bytes received ``` In this case, you might need to split the query into blocks and make sure each job query is small. ```{r, eval = FALSE} genes = getBM(attributes = "ensembl_gene_id", mart = ensembl)[, 1] go1 = getBM(attributes = at, mart = ensembl, filter = "ensembl_gene_id", value = genes[1:1000]) go2 = getBM(attributes = at, mart = ensembl, filter = "ensembl_gene_id", value = genes[1001:2000]) ... rbind(go1, go2, ...) ``` In this example, we will only demonstrate the Biological Process Ontology, and we convert the data frame `go` to a list of genes. ```{r} go = go[go$namespace_1003 == "biological_process", , drop = FALSE] gs = split(go$ensembl_gene_id, go$go_id) ``` In `gs`, there are a list of vectors where each vector can be thought as a gene set for the corresponding GO term. ### Step 2. Complete genes in GO gene sets GO has a tree structure where a parent term includes all genes annotated to its child terms. For every GO terms in `gs`, we need to merge genes from all its child or offspring terms to form a complete gene set for this GO term. The hierarchical structure of GO terms is stored in the **GO.db** package. In the following code, `bp_terms` contains a vector of GO terms only in the Biological Process ontology. The variable `GOBPOFFSPRING` is simply a list where each element vector contains all offspring terms (child and remote downstream terms) of a GO term. ```{r} library(GO.db) bp_terms = GOID(GOTERM)[Ontology(GOTERM) == "BP"] GOBPOFFSPRING = as.list(GOBPOFFSPRING) ``` Now it is quite easy to merge genes from offspring terms. Just note as the final step, empty GO gene sets should be removed. ```{r} gs2 = lapply(bp_terms, function(nm) { go_id = c(nm, GOBPOFFSPRING[[nm]]) # self + offspring unique(unlist(gs[go_id])) }) names(gs2) = bp_terms gs2 = gs2[sapply(gs2, length) > 0] ``` ## Session info ```{r, echo = FALSE} library(biomaRt) library(GO.db) ``` ```{r} sessionInfo() ```