---
title: "methylGSA: Gene Set Analysis for DNA Methylation Datasets"
author: "Xu Ren and Pei Fen Kuan"
date: "`r Sys.Date()`"
output: 
    rmarkdown::html_document:
        highlight: pygments
        toc: true
vignette: >
    %\VignetteIndexEntry{methylGSA: Gene Set Analysis for DNA Methylation Datasets}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---


```{r, echo = FALSE}
knitr::opts_chunk$set(comment = "", message=FALSE, warning = FALSE)
```

## Installation

To install and load methylGSA

```{r eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("methylGSA")
```

```{r}
library(methylGSA)
```

Depending on the DNA methylation array type, other packages may be needed 
before running the analysis.    

If analyzing 450K array, `IlluminaHumanMethylation450kanno.ilmn12.hg19` needs 
to be loaded.
```{r}
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
```

If analyzing EPIC array, `IlluminaHumanMethylationEPICanno.ilm10b4.hg19` needs 
to be loaded.
```{r eval = FALSE}
library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
```

If analyzing user-supplied mapping between CpG ID and gene name, neither
`IlluminaHumanMethylation450kanno.ilmn12.hg19` nor 
`IlluminaHumanMethylationEPICanno.ilm10b4.hg19` needs to be loaded.


## Introduction

The methylGSA package contains functions to carry out gene set analysis 
adjusting for the number of CpGs per gene. It has been shown by 
Geeleher et al [1] that gene set analysis is extremely biased for DNA 
methylation data. This package contains three main functions to adjust for the 
bias in gene set analysis. 

* methylglm: Incorporating number of CpGs as a covariate in 
logistic regression. 

* methylRRA: Adjusting for multiple p-values of each gene by Robust 
Rank Aggregation [2], and then apply either 
over-representation analysis (ORA) or Preranked version of Gene Set Enrichment 
Analysis (GSEAPreranked) [3] in gene set testing. 

* methylgometh: Adjusting the number of CpGs for each gene by 
weighted resampling and Wallenius non-central hypergeometric approximation 
(via missMethyl [4]).

## Supported gene sets and gene ID types

* Gene Ontology (via org.Hs.eg.db [5])
* KEGG (via org.Hs.eg.db [5])
* Reactome (via reactome.db [6])
* User-supplied gene sets. Supported gene input ID types:
    + "SYMBOL"
    + "ENSEMBL"
    + "ENTREZID"
    + "REFSEQ"
    
## Supported array types

* Infinium Human Methylation 450K BeadChip 
(via IlluminaHumanMethylation450kanno.ilmn12.hg19 [7])
* Infinium Methylation EPIC BeadChip 
(via IlluminaHumanMethylationEPICanno.ilm10b4.hg19 [8])
* User supplied mapping between CpG ID and gene name

## Description of methylglm

methylglm is an extention of GOglm [9]. GOglm adjusts 
length bias for RNA-Seq data by incorporating gene length as a covariate 
in logistic regression model. methylglm adjusts length bias in DNA methylation 
by the number of CpGs instead of gene length. For each gene set, we fit a 
logistic regression model:

$$ \text{logit} (\pi_{i}) = \beta_{0} + \beta_{1}x_{i} + \beta_{2}c_{i}$$
For each gene $i$, $\pi_{i}$ = Pr(gene $i$ is in gene set), $x_{i}$ represents 
negative logarithmic transform of its minimum p-value in differential 
methylation analysis, and $c_{i}$ is logarithmic transform of its number 
of CpGs. 

methylglm requires only a simple named vector. This vector contains p-values of 
each CpG. Names should be their corresponding CpG IDs. 

### Example

Here is what the input vector looks like:

```{r}
data(cpgtoy)
head(cpg.pval, 20)
```

Please note that the p-values presented here in `cpg.pval` is for illustration 
purpose only. They are used to show how to utilize the functions in methylGSA. 
It does not represent p-values from any real data analysis. 
The actual p-values in differential methylation analysis may be quite 
different from the p-values in `cpg.pval` in terms of the magnitude.    

Run methylglm:

```{r}
res1 = methylglm(cpg.pval = cpg.pval, minsize = 200, 
                 maxsize = 500, GS.type = "KEGG")
head(res1, 15)
```

Result is a data frame ranked by p-values of gene sets. The meaning of each
column is given below.

```{r echo=FALSE}
glm_res = data.frame(
    "Column" = c("ID", 
                 "Description (N/A for user supplied gene set)",
                 "Size", "pvalue", "padj"),
    "Explanation" = c("Gene set ID", "Gene set description", 
                "Number of genes in gene set", 
                "p-value in logistic regression", "Adjusted p-value")
)
knitr::kable(glm_res)
```

Bioconductor package
[org.Hs.eg.db](
https://bioconductor.org/packages/data/annotation/html/org.Hs.eg.db.html)
can be used to pull out the genes corresponding a specific pathway. 
For instance, one of the pathways in the methylglm output above is 
"Neuroactive ligand-receptor interaction" with KEGG ID 04080. The following 
code can be used to obtain its genes.

```{r}
library(org.Hs.eg.db)
genes_04080 = select(org.Hs.eg.db, "04080", "SYMBOL", keytype = "PATH")
head(genes_04080)
```

The following code can be used to get the genes in all the pathways in 
methylglm output.

```{r eval=FALSE}
# include all the IDs as the 2nd argument in select function
genes_all_pathway = select(org.Hs.eg.db, as.character(res1$ID), 
                     "SYMBOL", keytype = "PATH")
head(genes_all_pathway)
```


## Description of methylRRA

Robust rank aggregation [2] is a parameter free model that 
aggregates several ranked gene lists into a single gene list. The aggregation 
assumes random order of input lists and assign each gene a p-value based on 
order statistics. We apply this order statistics idea to adjust for number 
of CpGs. 

For gene $i$, let $P_{1}, P_{2}, ... P_{n}$ be the p-values of individual CpGs 
in differential methylation analysis. 
Under the null hypothesis, 
$P_{1}, P_{2}, ... P_{n} ~ \overset{i.i.d}{\sim} Unif[0, 1]$. 
Let $P_{(1)}, P_{(2)}, ... P_{(n)}$ be the order statistics. Define: 
$$\rho = 
\text{min}\{\text{Pr}(P_{(1)}<P_{(1)\text{obs}}), 
\text{Pr}(P_{(2)}<P_{(2)\text{obs}})..., 
\text{Pr}(P_{(n)}<P_{(n)\text{obs}}) \} $$

methylRRA supports two approaches to adjust for number of CpGs, ORA and 
GSEAPreranked [3]. In ORA approach, for gene $i$, conversion from 
$\rho$ score into p-value is done by Bonferroni correction [2]. 
We get a p-value for each gene and these p-values are then corrected for 
multiple testing use Benjamini & Hochberg procedure 
[10]. By default, genes with False Discovery Rate (FDR) below 0.05 are 
considered differentially expressed (DE) genes. If there are
no DE genes under FDR 0.05, users are able to use `sig.cut` option to specify
a higher FDR cut-off or `topDE` option to declare top genes to be 
differentially expressed. We then apply ORA based on these DE genes. 

In GSEAPreranked approach, for gene $i$, we also convert $\rho$ score into 
p-value by Bonferroni correction. p-values are converted into z-scores. We then 
apply Preranked version of Gene Set Enrichment Analysis (GSEAPreranked) on 
the gene list ranked by the z-scores.  

### Example

To apply ORA approach, we use argument `method = "ORA"` (default) in methylRRA

```{r eval=FALSE}
res2 = methylRRA(cpg.pval = cpg.pval, method = "ORA", 
                    minsize = 200, maxsize = 210)
head(res2, 15)
```

The meaning of each column in the output is given below.

```{r echo=FALSE}
ora_res = data.frame(
    "Column" = c("ID", 
                 "Description (N/A for user supplied gene set)", "Count",
                 "overlap", "Size", "pvalue", "padj"),
    "Explanation" = c("Gene set ID", "Gene set description", 
                "Number of significant genes in the gene set",
                "Names of significant genes in the gene set",
                "Number of genes in gene set", 
                "p-value in ORA", "Adjusted p-value")
)
knitr::kable(ora_res)
```


To apply GSEAPreranked approach, we use argument `method = "GSEA"` in methylRRA

```{r eval=FALSE}
res3 = methylRRA(cpg.pval = cpg.pval, method = "GSEA", 
                    minsize = 200, maxsize = 210)
head(res3, 10)
```

The meaning of each column in the output is given below.

```{r echo=FALSE}
gsea_res = data.frame(
    "Column" = c("ID", 
                 "Description (N/A for user supplied gene set)", 
                 "Size", "enrichmentScore", "NES",
                 "pvalue", "padj"),
    "Explanation" = c("Gene set ID", "Gene set description", 
                "Number of genes in gene set",
                "Enrichment score (see [3] for details)", 
                "Normalized enrichment score (see [3] for details)",
                "p-value in GSEA", "Adjusted p-value")
)
knitr::kable(gsea_res)
```


## Description of methylgometh

methylgometh calls `gometh` or `gsameth` function in missMethyl 
package [4] to adjust number of CpGs in gene set 
testing. `gometh` modifies goseq method [11] by fitting a 
probability weighting function and resampling from Wallenius non-central 
hypergeometric distribution. 

methylgometh requires two inputs, `cpg.pval` and `sig.cut`. `sig.cut` specifies
the cut-off point to declare a CpG as differentially methylated. By default, 
`sig.cut` is 0.001. Similar to methylRRA, if no CpG is significant, users are 
able to specify a higher cut-off or use `topDE` option to declare 
top CpGs to be differentially methylated.

### Example

```{r eval=FALSE}
res4 = methylgometh(cpg.pval = cpg.pval, sig.cut = 0.001, 
                        minsize = 200, maxsize = 210)
head(res4, 15)
```


## Other options

methylGSA provides many other options for users to customize the analysis. 

* `array.type` is to specify which array type to use. It is either "450K" or 
"EPIC". Default is "450K". This argument will be ignored if `FullAnnot` is 
provided.
* `FullAnnot` is preprocessed mapping between CpG ID and gene name provided by
prepareAnnot function. Default is NULL. Check example below for details. 
* `group` is the type of CpG to be considered in methylRRA or methylglm. By 
default, `group` is "all", which means all CpGs are considered regardless 
of their gene group. If `group` is "body", only CpGs on gene body will be 
considered. If `group` is "promoter1" or "promoter2", only CpGs on promoters 
will be considered. Based on the annotation in 
IlluminaHumanMethylation450kanno.ilmn12.hg19 and 
IlluminaHumanMethylationEPICanno.ilm10b4.hg19, "body", "promoter1" and 
"promoter2" are defined as: 
    + body: CpGs whose gene group correspond to "Body" or "1stExon" 
    + promoter1: CpGs whose gene group correspond to "TSS1500" or "TSS200"
    + promoter2: CpGs whose gene group correspond to "TSS1500", "TSS200", 
      "1stExon", or "5'UTR"    
* `GS.list` is user supplied gene sets to be tested. It should be a list with 
entry names gene set IDs and elements correpond to genes that gene set 
contain. If there is no input list, Gene Ontology is used by default.
* `GS.idtype` is the type of gene ID in user supplied gene sets. If `GS.list` 
is not empty, then the user is expected to provide gene ID type. Supported ID 
types are "SYMBOL", "ENSEMBL", "ENTREZID", "REFSEQ". Default is "SYMBOL". 
* `GS.type` is the published gene sets/pathways to be tested if `GS.list` is 
empty. Supported pathways are "GO" (Gene Ontology), "KEGG", and "Reactome". 
Default is "GO".
* `minsize` is an integer. If the number of genes in a gene set is less than 
this integer, this gene set is not tested. Default is 100.
* `maxsize` is also an integer. If the number of genes in a gene set is 
greater than this integer, this gene set is not tested. Default is 500.
* In methylRRA, `method` is to specify gene set test method. It is either 
"ORA" or "GSEA" as described in the previous section. Default is "ORA".
* In methylglm, `parallel` is either `TRUE` or `FALSE` indicating whether 
parallel should be used. Default is `FALSE`.

### Examples

An example of user supplied gene sets is given below. The gene ID type is 
gene symbol.
```{r}
data(GSlisttoy)
## to make the display compact, only a proportion of each gene set is shown
head(lapply(GS.list, function(x) x[1:30]), 3)   
```

Below is an example of running methylglm with parallel option
```{r eval=FALSE}
library(BiocParallel)
res_p = methylglm(cpg.pval = cpg.pval, minsize = 200, 
                  maxsize = 500, GS.type = "KEGG", parallel = TRUE)
```

methylglm and methylRRA support user supplied CpG ID to gene mapping. The 
mapping is expected to be a matrix, or a data frame or a list. For a 
matrix or data frame, 1st column should be CpG ID and 2nd column should be gene 
name. For a list, entry names should be gene names and elements correpond to 
CpG IDs. Below an example of user supplied CpG to gene mapping.
```{r}
data(CpG2Genetoy)
head(CpG2Gene)   
```

To use user supplied mapping in methylglm or methylRRA, first preprocess the
mapping by prepareAnnot function
```{r}
FullAnnot = prepareAnnot(CpG2Gene) 
```

Test the gene sets using "ORA" in methylRRA, use `FullAnnot` argument to 
provide the preprocessed CpG ID to gene mapping
```{r}
GS.list = GS.list[1:10]
res5 = methylRRA(cpg.pval = cpg.pval, FullAnnot = FullAnnot, method = "ORA", 
                    GS.list = GS.list, GS.idtype = "SYMBOL", 
                    minsize = 100, maxsize = 300)
head(res5, 10)
```

Below is another example testing Reactome pathways using methylglm.

```{r eval=FALSE}
res6 = methylglm(cpg.pval = cpg.pval, array.type = "450K", 
                    GS.type = "Reactome", minsize = 100, maxsize = 110)
head(res6, 10)
```

## Visualization

Following bar plot implemented in enrichplot [12], we also provide bar plot to 
visualize the gene set analysis results. The input of `barplot` function can 
be any result returned by methylglm, methylRRA, or methylgometh. Various 
options are provided for users to customize the plot.

* `xaxis` is to specify the label in x-axis. It is either "Count" 
(number of significant genes in gene set) or "Size" (total number of genes 
in gene set). "Count" option is not available for methylglm and 
methylRRA(GSEA). Default is "Size".
* `num` is to specify the number of genes sets to display on the bar plot.
Default is 5.
* `colorby` is a string. Either "pvalue" or "padj". Default is "padj".
* `title` is a string. The title to display on the bar plot. Default is NULL. 

### Example

Below is an example of using barplot to visualize the result of methylglm.
```{r}
barplot(res1, num = 8, colorby = "pvalue")
```


## Session info

```{r sessionInfo}
sessionInfo()
```


## References
[1] Geeleher, Paul, Lori Hartnett, Laurance J Egan, Aaron Golden, 
Raja Affendi Raja Ali, and Cathal Seoighe. 2013. Gene-Set Analysis Is 
Severely Biased When Applied to Genome-Wide Methylation Data.
Bioinformatics 29 (15). Oxford University Press: 1851â€“7.

[2] Kolde, Raivo, Sven Laur, Priit Adler, and Jaak Vilo. 2012. 
Robust Rank Aggregation for Gene List Integration and Meta-Analysis. 
Bioinformatics 28 (4). Oxford University Press: 573â€“80.

[3] Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, 
Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. 
Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting 
Genome-Wide Expression Profiles. Proceedings of the National Academy of 
Sciences 102 (43). National Acad Sciences: 15545â€“50.

[4] Phipson, Belinda, Jovana Maksimovic, and Alicia Oshlack. 2015. 
MissMethyl: An R Package for Analyzing Data from Illumina's Humanmethylation450 
Platform. Bioinformatics 32 (2). Oxford University Press: 286â€“88.

[5] Carlson M (2018). org.Hs.eg.db: Genome wide annotation for Human. 
R package version 3.6.0.

[6] Ligtenberg W (2018). reactome.db: A set of annotation maps for reactome. 
R package version 1.64.0.

[7] Hansen, KD. (2016). IlluminaHumanMethylation450kanno.ilmn12.hg19: 
Annotation for Illuminaâ€™s 450k Methylation Arrays. R Package, Version 0.6.0 1.

[8] Hansen, KD. (2017). IlluminaHumanMethylationEPICanno.ilm10b4.hg19: 
Annotation for Illuminaâ€™s Epic Methylation Arrays. R Package, Version 0.6.0 1.

[9] Mi, Gu, Yanming Di, Sarah Emerson, Jason S Cumbie, and Jeff H Chang. 2012. 
Length Bias Correction in Gene Ontology Enrichment Analysis Using 
Logistic Regression. PloS One 7 (10). Public Library of Science: e46128.

[10] Benjamini, Yoav, and Yosef Hochberg. 1995. Controlling the False 
Discovery Rate: A Practical and Powerful Approach to Multiple Testing.
Journal of the Royal Statistical Society. Series B (Methodological). 
JSTOR, 289â€“300.

[11] Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia 
Oshlack. 2012. Goseq: Gene Ontology Testing for Rna-Seq Datasets.
R Bioconductor.

[12] Yu G (2018). enrichplot: Visualization of Functional Enrichment Result. 
R package version 1.0.2, https://github.com/GuangchuangYu/enrichplot.