--- title: "methylGSA: Gene Set Analysis for DNA Methylation Datasets" author: "Xu Ren and Pei Fen Kuan" date: "`r Sys.Date()`" output: rmarkdown::html_document: highlight: pygments toc: true vignette: > %\VignetteIndexEntry{methylGSA: Gene Set Analysis for DNA Methylation Datasets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} knitr::opts_chunk$set(comment = "", message=FALSE, warning = FALSE) ``` ## Installation To install and load methylGSA ```{r eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("methylGSA") ``` ```{r} library(methylGSA) ``` Depending on the DNA methylation array type, other packages may be needed before running the analysis. If analyzing 450K array, `IlluminaHumanMethylation450kanno.ilmn12.hg19` needs to be loaded. ```{r} library(IlluminaHumanMethylation450kanno.ilmn12.hg19) ``` If analyzing EPIC array, `IlluminaHumanMethylationEPICanno.ilm10b4.hg19` needs to be loaded. ```{r eval = FALSE} library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) ``` If analyzing user-supplied mapping between CpG ID and gene name, neither `IlluminaHumanMethylation450kanno.ilmn12.hg19` nor `IlluminaHumanMethylationEPICanno.ilm10b4.hg19` needs to be loaded. ## Introduction The methylGSA package contains functions to carry out gene set analysis adjusting for the number of CpGs per gene. It has been shown by Geeleher et al [1] that gene set analysis is extremely biased for DNA methylation data. This package contains three main functions to adjust for the bias in gene set analysis. * methylglm: Incorporating number of CpGs as a covariate in logistic regression. * methylRRA: Adjusting for multiple p-values of each gene by Robust Rank Aggregation [2], and then apply either over-representation analysis (ORA) or Preranked version of Gene Set Enrichment Analysis (GSEAPreranked) [3] in gene set testing. * methylgometh: Adjusting the number of CpGs for each gene by weighted resampling and Wallenius non-central hypergeometric approximation (via missMethyl [4]). ## Supported gene sets and gene ID types * Gene Ontology (via org.Hs.eg.db [5]) * KEGG (via org.Hs.eg.db [5]) * Reactome (via reactome.db [6]) * User-supplied gene sets. Supported gene input ID types: + "SYMBOL" + "ENSEMBL" + "ENTREZID" + "REFSEQ" ## Supported array types * Infinium Human Methylation 450K BeadChip (via IlluminaHumanMethylation450kanno.ilmn12.hg19 [7]) * Infinium Methylation EPIC BeadChip (via IlluminaHumanMethylationEPICanno.ilm10b4.hg19 [8]) * User supplied mapping between CpG ID and gene name ## Description of methylglm methylglm is an extention of GOglm [9]. GOglm adjusts length bias for RNA-Seq data by incorporating gene length as a covariate in logistic regression model. methylglm adjusts length bias in DNA methylation by the number of CpGs instead of gene length. For each gene set, we fit a logistic regression model: $$ \text{logit} (\pi_{i}) = \beta_{0} + \beta_{1}x_{i} + \beta_{2}c_{i}$$ For each gene $i$, $\pi_{i}$ = Pr(gene $i$ is in gene set), $x_{i}$ represents negative logarithmic transform of its minimum p-value in differential methylation analysis, and $c_{i}$ is logarithmic transform of its number of CpGs. methylglm requires only a simple named vector. This vector contains p-values of each CpG. Names should be their corresponding CpG IDs. ### Example Here is what the input vector looks like: ```{r} data(cpgtoy) head(cpg.pval, 20) ``` Please note that the p-values presented here in `cpg.pval` is for illustration purpose only. They are used to show how to utilize the functions in methylGSA. It does not represent p-values from any real data analysis. The actual p-values in differential methylation analysis may be quite different from the p-values in `cpg.pval` in terms of the magnitude. Run methylglm: ```{r} res1 = methylglm(cpg.pval = cpg.pval, minsize = 200, maxsize = 500, GS.type = "KEGG") head(res1, 15) ``` Result is a data frame ranked by p-values of gene sets. The meaning of each column is given below. ```{r echo=FALSE} glm_res = data.frame( "Column" = c("ID", "Description (N/A for user supplied gene set)", "Size", "pvalue", "padj"), "Explanation" = c("Gene set ID", "Gene set description", "Number of genes in gene set", "p-value in logistic regression", "Adjusted p-value") ) knitr::kable(glm_res) ``` Bioconductor package [org.Hs.eg.db]( https://bioconductor.org/packages/data/annotation/html/org.Hs.eg.db.html) can be used to pull out the genes corresponding a specific pathway. For instance, one of the pathways in the methylglm output above is "Neuroactive ligand-receptor interaction" with KEGG ID 04080. The following code can be used to obtain its genes. ```{r} library(org.Hs.eg.db) genes_04080 = select(org.Hs.eg.db, "04080", "SYMBOL", keytype = "PATH") head(genes_04080) ``` The following code can be used to get the genes in all the pathways in methylglm output. ```{r eval=FALSE} # include all the IDs as the 2nd argument in select function genes_all_pathway = select(org.Hs.eg.db, as.character(res1$ID), "SYMBOL", keytype = "PATH") head(genes_all_pathway) ``` ## Description of methylRRA Robust rank aggregation [2] is a parameter free model that aggregates several ranked gene lists into a single gene list. The aggregation assumes random order of input lists and assign each gene a p-value based on order statistics. We apply this order statistics idea to adjust for number of CpGs. For gene $i$, let $P_{1}, P_{2}, ... P_{n}$ be the p-values of individual CpGs in differential methylation analysis. Under the null hypothesis, $P_{1}, P_{2}, ... P_{n} ~ \overset{i.i.d}{\sim} Unif[0, 1]$. Let $P_{(1)}, P_{(2)}, ... P_{(n)}$ be the order statistics. Define: $$\rho = \text{min}\{\text{Pr}(P_{(1)}<P_{(1)\text{obs}}), \text{Pr}(P_{(2)}<P_{(2)\text{obs}})..., \text{Pr}(P_{(n)}<P_{(n)\text{obs}}) \} $$ methylRRA supports two approaches to adjust for number of CpGs, ORA and GSEAPreranked [3]. In ORA approach, for gene $i$, conversion from $\rho$ score into p-value is done by Bonferroni correction [2]. We get a p-value for each gene and these p-values are then corrected for multiple testing use Benjamini & Hochberg procedure [10]. By default, genes with False Discovery Rate (FDR) below 0.05 are considered differentially expressed (DE) genes. If there are no DE genes under FDR 0.05, users are able to use `sig.cut` option to specify a higher FDR cut-off or `topDE` option to declare top genes to be differentially expressed. We then apply ORA based on these DE genes. In GSEAPreranked approach, for gene $i$, we also convert $\rho$ score into p-value by Bonferroni correction. p-values are converted into z-scores. We then apply Preranked version of Gene Set Enrichment Analysis (GSEAPreranked) on the gene list ranked by the z-scores. ### Example To apply ORA approach, we use argument `method = "ORA"` (default) in methylRRA ```{r eval=FALSE} res2 = methylRRA(cpg.pval = cpg.pval, method = "ORA", minsize = 200, maxsize = 210) head(res2, 15) ``` The meaning of each column in the output is given below. ```{r echo=FALSE} ora_res = data.frame( "Column" = c("ID", "Description (N/A for user supplied gene set)", "Count", "overlap", "Size", "pvalue", "padj"), "Explanation" = c("Gene set ID", "Gene set description", "Number of significant genes in the gene set", "Names of significant genes in the gene set", "Number of genes in gene set", "p-value in ORA", "Adjusted p-value") ) knitr::kable(ora_res) ``` To apply GSEAPreranked approach, we use argument `method = "GSEA"` in methylRRA ```{r eval=FALSE} res3 = methylRRA(cpg.pval = cpg.pval, method = "GSEA", minsize = 200, maxsize = 210) head(res3, 10) ``` The meaning of each column in the output is given below. ```{r echo=FALSE} gsea_res = data.frame( "Column" = c("ID", "Description (N/A for user supplied gene set)", "Size", "enrichmentScore", "NES", "pvalue", "padj"), "Explanation" = c("Gene set ID", "Gene set description", "Number of genes in gene set", "Enrichment score (see [3] for details)", "Normalized enrichment score (see [3] for details)", "p-value in GSEA", "Adjusted p-value") ) knitr::kable(gsea_res) ``` ## Description of methylgometh methylgometh calls `gometh` or `gsameth` function in missMethyl package [4] to adjust number of CpGs in gene set testing. `gometh` modifies goseq method [11] by fitting a probability weighting function and resampling from Wallenius non-central hypergeometric distribution. methylgometh requires two inputs, `cpg.pval` and `sig.cut`. `sig.cut` specifies the cut-off point to declare a CpG as differentially methylated. By default, `sig.cut` is 0.001. Similar to methylRRA, if no CpG is significant, users are able to specify a higher cut-off or use `topDE` option to declare top CpGs to be differentially methylated. ### Example ```{r eval=FALSE} res4 = methylgometh(cpg.pval = cpg.pval, sig.cut = 0.001, minsize = 200, maxsize = 210) head(res4, 15) ``` ## Other options methylGSA provides many other options for users to customize the analysis. * `array.type` is to specify which array type to use. It is either "450K" or "EPIC". Default is "450K". This argument will be ignored if `FullAnnot` is provided. * `FullAnnot` is preprocessed mapping between CpG ID and gene name provided by prepareAnnot function. Default is NULL. Check example below for details. * `group` is the type of CpG to be considered in methylRRA or methylglm. By default, `group` is "all", which means all CpGs are considered regardless of their gene group. If `group` is "body", only CpGs on gene body will be considered. If `group` is "promoter1" or "promoter2", only CpGs on promoters will be considered. Based on the annotation in IlluminaHumanMethylation450kanno.ilmn12.hg19 and IlluminaHumanMethylationEPICanno.ilm10b4.hg19, "body", "promoter1" and "promoter2" are defined as: + body: CpGs whose gene group correspond to "Body" or "1stExon" + promoter1: CpGs whose gene group correspond to "TSS1500" or "TSS200" + promoter2: CpGs whose gene group correspond to "TSS1500", "TSS200", "1stExon", or "5'UTR" * `GS.list` is user supplied gene sets to be tested. It should be a list with entry names gene set IDs and elements correpond to genes that gene set contain. If there is no input list, Gene Ontology is used by default. * `GS.idtype` is the type of gene ID in user supplied gene sets. If `GS.list` is not empty, then the user is expected to provide gene ID type. Supported ID types are "SYMBOL", "ENSEMBL", "ENTREZID", "REFSEQ". Default is "SYMBOL". * `GS.type` is the published gene sets/pathways to be tested if `GS.list` is empty. Supported pathways are "GO" (Gene Ontology), "KEGG", and "Reactome". Default is "GO". * `minsize` is an integer. If the number of genes in a gene set is less than this integer, this gene set is not tested. Default is 100. * `maxsize` is also an integer. If the number of genes in a gene set is greater than this integer, this gene set is not tested. Default is 500. * In methylRRA, `method` is to specify gene set test method. It is either "ORA" or "GSEA" as described in the previous section. Default is "ORA". * In methylglm, `parallel` is either `TRUE` or `FALSE` indicating whether parallel should be used. Default is `FALSE`. ### Examples An example of user supplied gene sets is given below. The gene ID type is gene symbol. ```{r} data(GSlisttoy) ## to make the display compact, only a proportion of each gene set is shown head(lapply(GS.list, function(x) x[1:30]), 3) ``` Below is an example of running methylglm with parallel option ```{r eval=FALSE} library(BiocParallel) res_p = methylglm(cpg.pval = cpg.pval, minsize = 200, maxsize = 500, GS.type = "KEGG", parallel = TRUE) ``` methylglm and methylRRA support user supplied CpG ID to gene mapping. The mapping is expected to be a matrix, or a data frame or a list. For a matrix or data frame, 1st column should be CpG ID and 2nd column should be gene name. For a list, entry names should be gene names and elements correpond to CpG IDs. Below an example of user supplied CpG to gene mapping. ```{r} data(CpG2Genetoy) head(CpG2Gene) ``` To use user supplied mapping in methylglm or methylRRA, first preprocess the mapping by prepareAnnot function ```{r} FullAnnot = prepareAnnot(CpG2Gene) ``` Test the gene sets using "ORA" in methylRRA, use `FullAnnot` argument to provide the preprocessed CpG ID to gene mapping ```{r} GS.list = GS.list[1:10] res5 = methylRRA(cpg.pval = cpg.pval, FullAnnot = FullAnnot, method = "ORA", GS.list = GS.list, GS.idtype = "SYMBOL", minsize = 100, maxsize = 300) head(res5, 10) ``` Below is another example testing Reactome pathways using methylglm. ```{r eval=FALSE} res6 = methylglm(cpg.pval = cpg.pval, array.type = "450K", GS.type = "Reactome", minsize = 100, maxsize = 110) head(res6, 10) ``` ## Visualization Following bar plot implemented in enrichplot [12], we also provide bar plot to visualize the gene set analysis results. The input of `barplot` function can be any result returned by methylglm, methylRRA, or methylgometh. Various options are provided for users to customize the plot. * `xaxis` is to specify the label in x-axis. It is either "Count" (number of significant genes in gene set) or "Size" (total number of genes in gene set). "Count" option is not available for methylglm and methylRRA(GSEA). Default is "Size". * `num` is to specify the number of genes sets to display on the bar plot. Default is 5. * `colorby` is a string. Either "pvalue" or "padj". Default is "padj". * `title` is a string. The title to display on the bar plot. Default is NULL. ### Example Below is an example of using barplot to visualize the result of methylglm. ```{r} barplot(res1, num = 8, colorby = "pvalue") ``` ## Session info ```{r sessionInfo} sessionInfo() ``` ## References [1] Geeleher, Paul, Lori Hartnett, Laurance J Egan, Aaron Golden, Raja Affendi Raja Ali, and Cathal Seoighe. 2013. Gene-Set Analysis Is Severely Biased When Applied to Genome-Wide Methylation Data. Bioinformatics 29 (15). Oxford University Press: 1851–7. [2] Kolde, Raivo, Sven Laur, Priit Adler, and Jaak Vilo. 2012. Robust Rank Aggregation for Gene List Integration and Meta-Analysis. Bioinformatics 28 (4). Oxford University Press: 573–80. [3] Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles. Proceedings of the National Academy of Sciences 102 (43). National Acad Sciences: 15545–50. [4] Phipson, Belinda, Jovana Maksimovic, and Alicia Oshlack. 2015. MissMethyl: An R Package for Analyzing Data from Illumina's Humanmethylation450 Platform. Bioinformatics 32 (2). Oxford University Press: 286–88. [5] Carlson M (2018). org.Hs.eg.db: Genome wide annotation for Human. R package version 3.6.0. [6] Ligtenberg W (2018). reactome.db: A set of annotation maps for reactome. R package version 1.64.0. [7] Hansen, KD. (2016). IlluminaHumanMethylation450kanno.ilmn12.hg19: Annotation for Illumina’s 450k Methylation Arrays. R Package, Version 0.6.0 1. [8] Hansen, KD. (2017). IlluminaHumanMethylationEPICanno.ilm10b4.hg19: Annotation for Illumina’s Epic Methylation Arrays. R Package, Version 0.6.0 1. [9] Mi, Gu, Yanming Di, Sarah Emerson, Jason S Cumbie, and Jeff H Chang. 2012. Length Bias Correction in Gene Ontology Enrichment Analysis Using Logistic Regression. PloS One 7 (10). Public Library of Science: e46128. [10] Benjamini, Yoav, and Yosef Hochberg. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 289–300. [11] Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia Oshlack. 2012. Goseq: Gene Ontology Testing for Rna-Seq Datasets. R Bioconductor. [12] Yu G (2018). enrichplot: Visualization of Functional Enrichment Result. R package version 1.0.2, https://github.com/GuangchuangYu/enrichplot.