--- title: "EnrichDO: a Global Weighted Model for Disease Ontology Enrichment Analysis" author: - name: Liang Cheng affiliation: College of Bioinformatics Science and Technology, Harbin Medical University - name: Haixiu Yang email: yanghaixiu@ems.hrbmu.edu.cn affiliation: College of Bioinformatics Science and Technology, Harbin Medical University - name: Hongyu Fu affiliation: College of Bioinformatics Science and Technology, Harbin Medical University date: "`r Sys.Date()`" package: EnrichDO output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{EnrichDO: Disease Ontology Enrichment Analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Installation `EnrichDO` can be installed from Bioconductor: ```{r install_chunk, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("EnrichDO") ``` or github page ```{r install_chunk2, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") library(devtools) devtools::install_github("liangcheng-hrbmu/EnrichDO") ``` # Introduction Disease Ontology (DO) enrichment analysis is an effective means to discover the associations between genes and diseases. However, most current DO-based enrichment methods were unable to solve the over enriched problem caused by the “true-path” rule. To address this problem, we presents EnrichDO, a double weighted iterative model, which is based on the latest annotations of the human genome with DO terms and integrates the DO graph topology on a global scale. On one hand, to reinforce the saliency of direct gene-DO annotations, different initial weights are assigned to directly annotated genes and indirectly annotated genes, respectively. On the other hand, to detect locally most significant node between the parent and its children, less significant nodes are dynamically down-weighted. EnrichDO exhibits high accuracy that it can identify more specific DO terms, which alleviates the over enriched problem. EnrichDO encompasses various statistical models and visualization schemes for discovering the associations between genes and diseases from biological big data. Currently uploaded to Bioconductor, EnrichDO aims to provide a more convenient and effective DO enrichment analysis tool. ```{r setup,results='hide'} library(EnrichDO) ``` # Disease Ontology Enrichment Analysis EnrichDO presents a double weighted iterative model for DO enrichment analysis. Based on the latest annotations of the human genome with DO terms, EnrichDO can identify locally significant enriched terms by applying different initial weights and dynamic weights for annotated genes and integrating the DO graph topology on a global scale. EnrichDO presents an effective and flexible model, which supplies various statistical testing models and multiple testing correction methods. ## Data Preparation The input data for EnrichDO is a predefined gene set, such as differentially expressed gene sets, significant genes from GWAS, gene sets from high-throughput biological data, etc. The interest gene sets should be protein-coding genes, in the ENTREZID format from NCBI. Taking the *input_example* as an example, it stores validated protein-coding genes associated with Alzheimer’s disease from the DisGeNET database. ```{r} Alzheimer<-read.delim(file.path(system.file("extdata", package="EnrichDO"), "Alzheimer_curated.csv"), header = FALSE) input_example<-Alzheimer[,1] ``` In order to improve the speed of the case, in the following example, several genes (*demo.data*) are randomly selected from the protein-coding genes (NCBI Entrez ID) for analysis. ```{r} demo.data<-c(1636,351,102,2932,3077,348,4137,54209,5663,5328,23621,3416,3553) ``` ## doEnrich Function EnrichDO provides disease ontology enrichment analysis through the ***doEnrich*** function. Depending on the setting of the *traditional* parameter, you can choose between weighted enrichment analysis or classic enrichment analysis. The following is a simple running example using weighted enrichment analysis by default: ```{r eval=TRUE,results='hide',cache=TRUE,message=FALSE} demo_result<-doEnrich(interestGenes=demo.data) ``` ### Weighted Enrichment Function In addition to performing weighted enrichment analysis with default parameters, the following parameters can also be modified to better meet the analysis requirements. The specific meanings of these parameters can be accessed by using the ***?doEnrich*** method. ```{r,eval=FALSE} weighted_demo<-doEnrich(interestGenes=demo.data, test="fisherTest", method="holm", m=1, minGsize=10, maxGsize=2000, delta=0.05, penalize=TRUE) ``` ```{r,echo=FALSE, results='hold'} weighted_demo<-doEnrich(interestGenes=demo.data,test="fisherTest",method="holm",m=1,minGsize=10, maxGsize=2000,delta=0.05,penalize=TRUE) ``` ### Classic Enrichment Function The ***doEnrich*** function can control the parameter *traditional* to perform traditional overexpression enrichment analysis (ORA), which means that it will not down-weight based on the topological structure of the disease ontology. Additionally, Parameters *test, method, m, maxGsize* and *minGsize* can be used flexibly. ```{r } Tradition_demo<-doEnrich(demo.data, traditional=TRUE) ``` ## Result description and Written ### Result description Running ***doEnrich*** will output the terms and total genes involved in each layer of Directed acyclic graph (DAG) to the user. The enrichment results can be summarized using the ***show*** function. ```{r,eval=FALSE} show(demo_result) ``` ```{r,echo=FALSE, results='hold'} show(demo_result) ``` The result of ***doEnrich*** is *demo_result* which contains *enrich, interestGenes, test, method, m, maxGsize, minGsize, delta, traditional, penalize*. There are 16 columns of *enrich*, including: - The standard ID corresponding to the DO Term (*DOID*). - the standard name of the DO Term (*DOTerm*), each DO Term has a unique DOID. - We constructed a directed acyclic graph according to the is_a relationship between each node in the DO database, and each DO Term has a corresponding level (*level*). - The DO database stores the parent node of each DO Term (*parent.arr*) and its number (*parent.len*). For example, “B-cell acute lymphoblastic leukemia” (DOID:0080638) is_a “acute lymphoblastic leukemia” (DOID:9952) and “lymphoma” (DOID:0060058), then the node “B-cell acute lymphoblastic leukemia” is a child of “acute lymphoblastic leukemia” and “lymphoma”, and the child is a more specific biological classification than its parent. - child nodes of the DO Term (*child.arr*) and their number (*child.len*). - the latest GeneRIF information is used to annotate DO Terms, each DO Term has its corresponding disease-associated genes (*gene.arr*), and its number (*gene.len*). - Assigning a weight to each gene helps assess the contribution of different genes to DO Terms (*weight.arr*). - The smaller the weights of indirectly annotated genes, the less contribution of these genes in the enrichment analysis.(*gene.w*). - the P-value of the DO Term (*p*), which arranges the order of enrich, and the value of P-value correction (*p.adjust*). - the genes of interest annotated to this DO Term (*cg.arr*) and its number (*cg.len*). - the number of genes in the interest gene set (*ig.len*), this represents the number of genes that are actually used for enrichment analysis. Generally, a significant P value of the enrichment results should be less than 0.05 or 0.01, indicating a significant association between the interesting gene set and the disease node. In the *enrich*, the node with the most significant enrichment is DOID:0080832, and the DO Term is "mild cognitive impairment", with its P-value being 9.22e-16. These results suggests that there is statistical significance between the interesting gene set and the DO Term of mild cognitive impairment. ### Result writing The ***writeResult*** function can output *DOID, DOTerm, p, p.adjust, geneRatio, bgRatio* and *cg* in the data frame *enrich* as text. The default file name is "result.txt". ```{r} writeResult(EnrichResult = demo_result,file=file.path(tempdir(), "result.txt"), Q=1, P=1) ``` ***writeResult*** has four parameters. *EnrichResult* indicates the enrichment result of ***doEnrich***, *file* indicates the write address of a file. The parameter *Q* (and *P*) indicates that a DO term is output only when *p.adjust* (and *p* value) is less than or equal to *Q* (and *P*). The default values for *P* and *Q* are 1. In the output file *result.txt*, *geneRatio* represents the intersection of the DO term annotated genes with the interest gene set divided by the interest gene set, and *bgRatio* represents all genes of the DO term divided by the background gene set. # Visualization of enrichment results EnrichDO provides four methods to visualize enrichment results, including bar plot (***drawBarGraph***), bubble plot (***drawPointGraph***), tree plot (***drawGraphviz***) and heatmap (***drawHeatmap***), which can show the research results more concisely and intuitively. Pay attention to the threshold setting for each visual method, if the threshold is too low, the display is insufficient. ## drawBarGraph function ***drawBarGraph*** can draw the top *n* nodes with the most significant p-value as bar chart, and the node's p-value is less than *delta* (By default, *n* is 10 and *delta* is 1e-15). ```{r fig.cap="bar plot",fig.align='center',fig.width=8,fig.height=6} drawBarGraph(EnrichResult=demo_result, n=10, delta=0.05) ``` ## drawPointGraph function ***drawPointGraph*** can draw the top *n* nodes with the most significant p-value as bubble plot, and the node's p-value is less than *delta* (By default, *n* is 10 and *delta* is 1e-15). ```{r fig.cap="point plot",fig.align='center',fig.width=8,fig.height=6} drawPointGraph(EnrichResult=demo_result, n=10, delta=0.05) ``` ## drawGraphViz function ***drawGraphViz*** draws the DAG structure of the most significant *n* nodes, and *labelfontsize* can set the font size of labels in nodes (By default, *n* is 10 and *labelfontsize* is 14). Each node has a corresponding disease name displayed. In addition, the ***drawGraphViz*** function can also display the P-value of each node in the enrichment analysis (*pview*=TRUE), and the number of overlapping genes of each DO term and interest gene set (*numview*=TRUE). ```{r fig.cap="tree plot",fig.align='center',fig.width=8,fig.height=6} drawGraphViz(EnrichResult=demo_result, n=10, numview=FALSE, pview=FALSE, labelfontsize=17) ``` ## drawHeatmap function ***drawHeatmap*** function visualizes the strength of the relationship between the top *DOID_n* nodes from enrichment results and the genes whose weight sum ranks the top *gene_n* in these nodes. And the gene must be included in the gene of interest. *readable* indicates whether the gene is displayed as its symbol. ***drawHeatmap*** also provides additional parameters from the pheatmap function, which you can set according to your needs. Default *DOID_n* is 10, *gene_n* is 50, *fontsize_row* is 10, *readable* is TRUE. ```{r fig.cap="heatmap",fig.align='center',fig.width=8,fig.height=6} drawHeatmap(interestGenes=demo.data, EnrichResult=demo_result, gene_n=10, fontsize_row=8, readable=TRUE) ``` ## convenient drawing Draw(***drawBarGraph ,drawPointGraph ,drawGraphViz***) from ***writeResult*** output files, so you don't have to wait for the algorithm to run. ```{r fig.width=8,fig.height=6} #Firstly, read the wrireResult output file,using the following two lines data <- read.delim(file.path(system.file("examples", package="EnrichDO"), "result.txt")) enrich <- convDraw(resultDO=data) #then, Use the drawing function you need drawGraphViz(enrich=enrich) #Tree diagram drawPointGraph(enrich=enrich, delta = 0.05) #Bubble diagram drawBarGraph(enrich=enrich, delta = 0.05) #Bar plot ``` # Session information ```{r session-info,cache = FALSE,echo=TRUE,message=TRUE,warning=FALSE} sessionInfo() ```