--- title: "Analyze with online GREAT" author: "Zuguang Gu ( z.gu@dkfz.de )" date: '`r Sys.Date()`' output: html_document: toc: true toc_depth: 3 toc_collapsed: false toc_float: true vignette: > %\VignetteIndexEntry{1. Analyze with online GREAT} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message = FALSE} library(knitr) knitr::opts_chunk$set( error = FALSE, tidy = FALSE, message = FALSE, warning = FALSE, fig.align = "center") ``` **Note: [On Aug 19 2019 GREAT released version 4](http://great.stanford.edu/help/display/GREAT/Version+History) which supports `hg38` genome and removes some ontologies such pathways. `submitGreatJob()` still takes `hg19` as default. `hg38` can be specified by argument `species = "hg38"`. To use the older versions such as 3.0.0, specify as `submitGreatJob(..., version = "3")`.** **GREAT** ([Genomic Regions Enrichment of Annotations Tool](http://great.stanford.edu)) is a popular web-based tool to associate biological functions to genomic regions. The **rGREAT** package makes GREAT anlaysis automatic by first constructing a HTTP POST request according to user's input and retrieving results from **GREAT** web server afterwards. ## Submit the job Load the package: ```{r} library(rGREAT) ``` The input data is either a `GRanges` object or a _BED_-format data frame, no matter it is sorted or not. In following example, we use a `GRanges` object which is randomly generated. ```{r} set.seed(123) gr = randomRegions(nr = 1000, genome = "hg19") head(gr) ``` Submit genomic regions by `submitGreatJob()`. The returned variable `job` is a `GreatJob` class instance which can be used to retrieve results from **GREAT** server and store results which are already downloaded. ```{r, eval = FALSE} job = submitGreatJob(gr) ``` ```{r, echo = FALSE} job = readRDS(system.file("extdata", "GreatJob.rds", package = "rGREAT")) ``` You can get the summary of your job by directly printing `job`. ```{r} job ``` More parameters can be set for the job: ```{r, eval = FALSE} job = submitGreatJob(gr, species = "mm9") # of course, gr should be from mm9 job = submitGreatJob(gr, adv_upstream = 10, adv_downstream = 2, adv_span = 2000) job = submitGreatJob(gr, rule = "twoClosest", adv_twoDistance = 2000) job = submitGreatJob(gr, rule = "oneClosest", adv_oneDistance = 2000) ``` Also you can choose different versions of GREAT for the analysis. ```{r, eval = FALSE} job = submitGreatJob(gr, version = "3.0") job = submitGreatJob(gr, version = "2.0") ```

Note: from rGREAT package 1.99.0, background by `bg` argument is not supported any more (currently you can still use it, but you will see a warning message), because GREAT requires a special format for gr and bg if both are set, and it uses a different method for the enrichment analysis and returns enrichment tables in a different format. But still, you can use local GREAT to integrate background regions. Seel the rGREAT paper for more details.

Available parameters are (following content is copied from **GREAT** website): - `species`: "hg38", "hg19", "mm10", "mm9" are supported in GREAT version 4.x.x, "hg19", "mm10", "mm9", "danRer7" are supported in GREAT version 3.x.x and "hg19", "hg18", "mm9", "danRer7" are supported in GREAT version 2.x.x. - `includeCuratedRegDoms`: Whether to include curated regulatory domains. - `rule`: How to associate genomic regions to genes. * `basalPlusExt`: mode 'Basal plus extension'. Gene regulatory domain definition: Each gene is assigned a basal regulatory domain of a minimum distance upstream and downstream of the TSS (regardless of other nearby genes). The gene regulatory domain is extended in both directions to the nearest gene's basal domain but no more than the maximum extension in one direction. + `adv_upstream`: proximal extension to upstream (unit: kb) + `adv_downstream`: proximal extension to downstream (unit: kb) + `adv_span`: maximum extension (unit: kb) * `twoClosest`: mode 'Two nearest genes'. Gene regulatory domain definition: Each gene is assigned a regulatory domain that extends in both directions to the nearest gene's TSS but no more than the maximum extension in one direction. + `adv_twoDistance`: maximum extension (unit: kb) * `oneClosest`: mode 'Single nearest gene'. Gene regulatory domain definition: Each gene is assigned a regulatory domain that extends in both directions to the midpoint between the gene's TSS and the nearest gene's TSS but no more than the maximum extension in one direction. + `adv_oneDistance`: maximum extension (unit: kb) **GREAT** uses [the UCSC bed-format](https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655452/File+Formats#FileFormats-WhatisBEDformat?) where genomic coordinates are 0-based. Many R packages generate genomic regions as 1-based. Thus by default, the start positions of regions are subtracted by 1. If your regions are already 0-based, you can specify `gr_is_zero_based = TRUE` in `submitGreatJob()`. Anyway in most cases, this will only slightly affect the enrichment results. ## Get enrichment tables With `job`, we can now retrieve results from **GREAT**. The first and the primary results are the tables which contain enrichment statistics for the analysis. By default it will retrieve results from three GO Ontologies. All tables contains statistics for all terms no matter they are significant or not. Users can then make filtering with self-defined cutoff. There is a column for adjusted p-values by "BH" method. Other p-value adjustment methods can be applied by `p.adjust()`. The returned value of `getEnrichmentTables()` is a list of data frames in which each one corresponds to the table for a single ontology. The structure of data frames are same as the tables on **GREAT** website. ```{r, message = TRUE} tbl = getEnrichmentTables(job) names(tbl) tbl[[1]][1:2, ] ``` Information stored in `job` will be updated after retrieving enrichment tables. ```{r} job ``` You can get results by either specifying the ontologies or by the pre-defined categories (categories already contains pre-defined sets of ontologies): ```{r, eval = FALSE} tbl = getEnrichmentTables(job, ontology = c("GO Molecular Function", "Human Phenotype")) tbl = getEnrichmentTables(job, category = c("GO")) ``` As you have seen in the previous messages and results, The enrichment tables contain no associated genes. However, you can set `download_by = 'tsv'` in `getEnrichmentTables()` to download the complete tables, but due to [the restriction from GREAT web server](https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655401/Export#Export-GlobalExport), only the top 500 regions can be retreived (check the last two columns of `tbl2[["GO Molecular Function"]]` in the following example). ```{r, eval = FALSE} tbl2 = getEnrichmentTables(job, download_by = "tsv") ``` All available ontology names for a given species can be get by `availableOntologies()` and all available ontology categories can be get by `availableCategories()`. Here you do not need to provide species information because `job` already contains it. ```{r} availableOntologies(job) availableCategories(job) availableOntologies(job, category = "GO") ``` ## Make volcano plot In differential gene expression analysis, volcano plot is used to visualize relations between log2 fold change and (adjusted) p-values. Similarly, we can also use volcano plot to visualize relations between fold enrichment and (adjusted) p-values for the enrichment analysis. The plot is made by the function `plotVolcano()`: ```{r, fig.width = 6, fig.height = 6} plotVolcano(job, ontology = "GO Biological Process") ``` As the enrichment analysis basically only looks for over-representations, it is actually half volcano. ## Get region-gene associations Association between genomic regions and genes can be plotted by `plotRegionGeneAssociations()`. The function will make the three plots which are same as on **GREAT** website. ```{r, fig.width = 10, fig.height = 10/3, fig.align = 'center'} plotRegionGeneAssociations(job) ``` `getRegionGeneAssociations()` returns a `GRanges` object which contains the gene-region associations. Note the column `dist_to_TSS` is based on the middle points of the input regions to TSS. ```{r} getRegionGeneAssociations(job) ``` Please note the two meta columns are in formats of `CharacterList` and `IntegerList` because a region may associate to multiple genes. You can also choose only plotting one of the three figures. ```{r, eval = FALSE} plotRegionGeneAssociations(job, which_plot = 1) ``` By specifying ontology and term ID, you can get the associations in a certain term. Here the term ID is from the first column of the data frame from `getEnrichmentTables()`. ```{r, fig.width = 10, fig.height = 10/3} plotRegionGeneAssociations(job, ontology = "GO Molecular Function", term_id = "GO:0004984") getRegionGeneAssociations(job, ontology = "GO Molecular Function", term_id = "GO:0004984") ``` ## The Shiny application `shinyReport()` creates a Shiny application to view the complete results: ```{r, eval = FALSE} shinyReport(job) ``` ## Session info ```{r} sessionInfo() ```