---
title: "Pre-compiled GO Gene Sets"
author: "Zuguang Gu (z.gu@dkfz.de)"
date: '`r Sys.Date()`'
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Pre-compiled GO Gene Sets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, eval = TRUE, echo = FALSE}
library(knitr)
knitr::opts_chunk$set(
    error = FALSE,
    tidy = FALSE,
    message = FALSE
)
```

The **BioMartGOGeneSets** package contains pre-compiled GO gene sets for a large number of organisms supported in [BioMart](https://www.ensembl.org/info/data/biomart/index.html). There are two types of data: 1. genes and 2. gene sets.

## Retrieve genes

To obtain the genes, use the function `getBioMartGenes()`. You need to provide a proper "dataset", which can be found with the function `supportedOrganisms()` (a complete list can also be found in "[**BioMart Gene Ontology Gene Sets Collections**](supported_organisms.html)"). Here we use the dataset `"hsapiens_gene_ensembl"`, which is for human, as an example.

```{r}
library(BioMartGOGeneSets)
gr = getBioMartGenes("hsapiens_gene_ensembl")
gr
```

The returned value is a `GRanges` object which contains the coordinates of genes. The meta columns contain additional information about the genes, such as the different gene IDs and the gene type.

You can also provide a "short name" for the dataset, and the function will perform partial matching.

```{r, eval = FALSE}
gr = getBioMartGenes("hsapiens")
```

You can try the following command and see what will be printed:

```{r, eval = FALSE}
gr = getBioMartGenes("human")
```

You can also provide the taxon ID of the organism:

```{r, eval = FALSE}
gr = getBioMartGenes(9606)
```

Chromosome names from Ensembl have no "chr" prefix. You can set `add_chr_prefix = TRUE` to add the `"chr"` prefix to some of the chromosome names. Internally it uses `GenomeInfoDb::seqlevelsStyle(gr) = "UCSC"` to check and to add the `"chr"` prefix.

```{r}
gr = getBioMartGenes("hsapiens_gene_ensembl", add_chr_prefix = TRUE)
gr
```

Note that `add_chr_prefix` is just a helper argument. You can achieve the same with:

```{r}
gr = getBioMartGenes("hsapiens_gene_ensembl")
GenomeInfoDb::seqlevelsStyle(gr) = "UCSC"
gr
```

For some less-studied organisms, there might be no "official chromosome name". For example, `"cporcellus_gene_ensembl"` for the guinea pig:

```{r}
gr = getBioMartGenes("cporcellus_gene_ensembl")
gr
```

The sequence names are in a special format of `DS\d+`. The source of the format can be obtained with `getBioMartGenomeInfo()`. In the `seqname_style` element of the returned list, there are several examples that you can compare to.

```{r}
getBioMartGenomeInfo("cporcellus_gene_ensembl")
```

Now we know they are GenBank accession IDs. Next we might want to change them to the `"Sequence-Name"` style. Simply use `changeSeqnameStyle()`.

```{r}
gr2 = changeSeqnameStyle(gr, "cporcellus_gene_ensembl",
    seqname_style_from = "GenBank-Accn",
    seqname_style_to = "Sequence-Name")
gr2
```

Sometimes the internal sequence names need to be reformatted to fit the input `gr`. In the second example (`"apercula_gene_ensembl"` for the orange clownfish), the sequence names are in the format `1, 2, 3, ...`, while internally they are represented as `chr1, chr2, ...`.

```{r}
gr = getBioMartGenes("apercula_gene_ensembl")
gr
getBioMartGenomeInfo("apercula_gene_ensembl")
```

In this case, we need to set the argument `reformat_from` to a function that reformats the internal format to fit the sequence names in `gr`. You can also set `reformat_to` to a function that reformats the converted sequence names.

```{r}
gr2 = changeSeqnameStyle(gr, "apercula_gene_ensembl",
    seqname_style_from = "Sequence-Name",
    seqname_style_to = "GenBank-Accn",
    # remove "chr" from the internal names so that they match `gr`
    reformat_from = function(x) gsub("chr", "", x),
    # remove a trailing ".<digits>" suffix from the converted names
    reformat_to = function(x) gsub("\\.\\d+$", "", x)
)
gr2
```

## Retrieve gene sets

To obtain the gene sets, use the function `getBioMartGOGeneSets()`.
You also need to provide the "dataset". Here we use a different dataset: `"mmusculus_gene_ensembl"` (mouse).

```{r}
lt = getBioMartGOGeneSets("mmusculus_gene_ensembl")
length(lt)
lt[1]
```

The variable `lt` is a list of vectors, where each vector corresponds to a GO gene set with Ensembl IDs as gene identifiers.

You can try the following command and see what will be printed:

```{r, eval = FALSE}
lt = getBioMartGOGeneSets("mouse")
```

Remember that you can also provide the taxon ID:

```{r, eval = FALSE}
lt = getBioMartGOGeneSets(10090)
```

In `getBioMartGOGeneSets()`, the argument `as_table` can be set to `TRUE`, in which case the function returns a data frame.

```{r}
tb = getBioMartGOGeneSets("mmusculus_gene_ensembl", as_table = TRUE)
head(tb)
```

The argument `ontology` controls which category of GO gene sets is returned. Possible values are `"BP"`, `"CC"` and `"MF"`.

```{r, eval = FALSE}
getBioMartGOGeneSets("mmusculus_gene_ensembl", ontology = "BP") # the default one
getBioMartGOGeneSets("mmusculus_gene_ensembl", ontology = "CC")
getBioMartGOGeneSets("mmusculus_gene_ensembl", ontology = "MF")
```

Last, the argument `gene_id_type` can be set to `"entrez_gene"` or `"gene_symbol"`, so that genes in the gene sets are given as Entrez IDs or gene symbols. Note that this depends on the organism: not every organism supports Entrez IDs or gene symbols.

```{r}
lt = getBioMartGOGeneSets("mmusculus_gene_ensembl", gene_id_type = "entrez_gene")
lt[1]
```

## Version of the data

The object `BioMartGOGeneSets` contains the version and source of the data.

```{r}
BioMartGOGeneSets
```

## Session Info

```{r}
sessionInfo()
```