--- title: "KnowYourCG" shorttitle: "KYCG" package: sesame output: rmarkdown::html_vignette fig_width: 6 fig_height: 5 vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{"5. knowYourCG"} %\VignetteEncoding{UTF-8} --- An automated discovery tool for discovering hidden biological and technical links # Quick Start knowYourCG is a tool for evaluating CpG feature enrichment using Illumina probe IDs. Tthis tool automates the hypothesis testing by asking whether a set of CpGs (indexed by Illumina methylation chip IDs, hence a sparse representation of the methylome) is enriched in certain categories or features. These categories or features can be categorical (e.g., CpGs located at specific tissue-specific transcription factors) or continuous (e.g., the local CpG density of CpGs). Additionally, the set of CpGs to which the test will be applied can be categorical or continuous as well. The set of CpGs that will be tested for enrichment is called the query set, and the set of CpGs that will be used to determine enrichment is called the database set. A query set, for example, might be the results of a differential methylation analysis or from an epigenome-wide association study. We have taken the time to curate our own database sets from a variety of sources that describe different categorical and continuous features such as technical characterization of the probes, CpGs associated with certain chromatin states, gene association, transcription factor binding sites, CpG density, etc. Additionally, knowYourCG has support for feature selection and feature engineering, which is currently in development. The following commands prepares the use of KnowYourCG: ```{r ky1, load-depenencies, results="hide", message=FALSE, warning=FALSE} library(sesame) sesameDataCache() ``` Our example uses a specific mouse design group as input (PGCMeth, methylated in primoridal germ cells). First get the CG list using the following code: ```{r ky2, message=FALSE} query <- KYCG_getDBs("MM285.designGroup")[["PGCMeth"]] head(query) ``` Now test the enrichment over database groups. By default, KYCG will select all the categorical groups and overlapping genes (CpGs associated with a gene). ```{r ky3, fig.width=8, fig.height=5, message=FALSE} results_pgc <- testEnrichment(query, platform="MM285") head(results_pgc) ``` We can visualize the result of this test using the `KYCG_plotEnrichAll` function: ```{r ky4} KYCG_plotEnrichAll(results_pgc) ``` This plot groups different database sets along the x-axis and plot -log10(FDR) on the y-axis. As expected, the PGCMeth group itself appear on the top of the list. But one can also find histone H3K9me3, chromHMM `Het` and transcription factor `Trim28` binding enriched in this CG group. # 4 Testing Scenarios There are four testing scenarios depending on the type format of the query set and database sets. They are shown with the respective testing scenario in the table below. `testEnrichment`, `testEnrichmentSEA` are for Fisher's exact test and Set Enrichment Analysis respectively. ```{r ky9, echo = FALSE, results="asis"} library(knitr) df = data.frame( "Continuous DB"=c("Correlation-based","Set Enrichment Analysis"), "Discrete DB"=c("Set Enrichment Analysis","Fisher's Exact Test")) rownames(df) = c("Continuous Query", "Discrete Query") kable(df, caption="Four KnowYourCG Testing Scenarios") ``` # Set Enrichment Analysis The main work horse function for test enrichment of a categorical query against categorical databases is the `testEnrichment` function. This function calculates the extent of overlap and apply different statistics for enrichment testing. The ```testEnrichment()``` will perform Fisher's exact test (one-tailed by default, but two-tailed optionally) and report metrics about each of the the loaded database sets. > **Choice of universal set:** Universal set is the set of all probes for a given platform. It can either be passed in as an argument called ```universeSet``` or the platform name can be passed with argument ```platform```. If neither of these are supplied, the universe set will be implied from the probes. ```{r ky10, run-test-single, echo=TRUE, eval=TRUE, message=FALSE} library(SummarizedExperiment) ## prepare a query df <- rowData(sesameDataGet('MM285.tissueSignature')) query <- df$Probe_ID[df$branch == "fetal_brain" & df$type == "Hypo"] results <- testEnrichment(query, "TFBS", platform="MM285") results %>% dplyr::filter(overlap>10) %>% head ## prepare another query query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"] results <- testEnrichment(query, "TFBS", platform="MM285") results %>% dplyr::filter(overlap>10) %>% dplyr::select(dbname, estimate, test, FDR) %>% head ``` The output of each test contains at least four variables: the estimate (fold enrichment, not the test statistics), p-value, type of test, and whether meta data is included in the tested database set (hasMeta), as well as the name of the database set and the database group. By default, the estimate column is sorted. It should be noted that the estimate (or test statistic) is test dependent and comparison between p-values should be limited to within the same type of test. For instance, the test statistics for Fisher's exact test and SEA are log fold change and the test statistic for Spearman's test is simply the rank order correlation coefficient. For simplicity, we report all of the test types in one data frame. The ```nQ``` and ```nD``` columns identify the length of the query set and the database set, respectively. Often, it's important to examine the extent of overlap between the two sets, so that metric is reported as well in the ```overlap``` column. See [Supplemental Vignette](https://zhou-lab.github.io/sesame/v1.16/supplemental.html#kycg_vis) for other ways of visualizing enrichment results. # Database Sets The success of enrichment testing depends critically on the availability of biologically-relevant databases. To reflect the biological meaning of databases and facilitate selective testing, we have organized our database sets into different groups. Each group contains one or multiple databases. Here is how to find the names of pre-built database groups: ``` {r ky5, list-data, eval=TRUE, echo=TRUE} KYCG_listDBGroups("MM285") ``` The `KYCG_listDBGroups()` function returns a data frame containing information of these databases. The Title column is the accession key one needs for the `testEnrichment` function. With the accessions, one can either directly use them in the `testEnrichment` function or explicitly call the ```KYCG_getDBs()``` function to retrieve databases themselves. Caching these databases on the local machine is important, for two reasons: it limits the number of requests sent to the Bioconductor server, and secondly it limits the amount of time the user needs to wait when re-downloading database sets. For this reason, one should run ```sesameDataCache()``` before loading in any database sets. This will take some time to download all of the database sets but this only needs to be done once per installation. During the analysis the database sets can be identified using these accessions. Sesame also does some guessing when a unique substring is given. For example, the following "MM285.designGroup" retrieves the "KYCG.MM285.designGroup.20210210" database. Let's look at the database group which we had used as the query (query and database are reciprocal) in our first example: ``` {r ky6, cache-data, eval=TRUE, warning=FALSE} dbs <- KYCG_getDBs("MM285.design") ``` In total, 32 datasets have been loaded for this group. We can get the "PGCMeth" as an element of the list: ``` {r ky7, view-data1, eval=TRUE, warning=FALSE} str(dbs[["PGCMeth"]]) ``` On subsequent runs of the ```KYCG_getDBs()``` function, the database loading can be faster thanks to the sesameData [in-memory caching](https://tinyurl.com/2wh9tyzk), if the corresponding database has been loaded. # Query Set(s) A query set represents probes of interest. It may either be in the form of a character vector where the values correspond to probe IDs or a named numeric vector where the names correspond to probe IDs. The query and database definition is rather arbitrary. One can regard a database as a query and turn a query into a database, like in our first example. In real world scenario, query can come from differential methylation testing, unsupervised clustering, correlation with a phenotypic trait, and many others. For example, we could consider CpGs that show tissue-specific methylation as the query. We are getting the B-cell-specific hypomethylation. ```{r ky8, message=FALSE} df <- rowData(sesameDataGet('MM285.tissueSignature')) query <- df$Probe_ID[df$branch == "B_cell"] head(query) ``` This query set represents hypomethylated probes in Mouse B-cells from the MM285 platform. This specific query set has 168 probes. # Gene Enrichment A special case of set enrichment is to test whether CpGs are associated with specific genes. Automating the enrichment test process only works when the number of database sets is small. This is important when targeting all genes as there are tens of thousands of genes on each platform. By testing only those genes that overlap with the query set, we can greatly reduce the number of tests. For this reason, the gene enrichment analysis is a special case of these enrichment tests. We can perform this analysis using the ```KYCG_buildGeneDBs()``` function. ```{r ky16, fig.width=7, fig.height=6, echo=TRUE, warning=FALSE, message=FALSE} query <- names(sesameData_getProbesByGene("Dnmt3a", "MM285")) results <- testEnrichment(query, KYCG_buildGeneDBs(query, max_distance=100000, platform="MM285"), platform="MM285") results[,c("dbname","estimate","gene_name","FDR", "nQ", "nD", "overlap")] ``` Using these sample results, we can plot a volcano plot and lollipop plot. ```{r ky17, fig.width=5, fig.height=4, echo=TRUE} KYCG_plotLollipop(results, label="gene_name") ``` As expected, we recover our targeted gene (Dnmt3a). # Get more database Sets Some commonly used database sets are stored in ExperimentHub/sesameData package. But more database sets can be found [here](https://github.com/zhou-lab/KYCG_knowledgebase_EPICv2). You can use a convenience function to download these database sets to a local folder. ```{r eval=FALSE} KYCG_listDBGroups(path = "~/Downloads/KYCG_knowledgebase_EPICv2") ## [1] "ABCompartment.20220911.gz" "Blacklist.20220304.gz" ## [3] "CGI.20220904.gz" "ChromHMM.20220303.gz" ## [5] "CTCFbind.20220911.gz" "HM.20221013.gz" ## [7] "ImprintingDMR.20220818.gz" "MetagenePC.20220911.gz" ## [9] "nFlankCG.20220321.gz" "PMD.20220911.gz" ## [11] "ProbeType.gz" "REMCChromHMM.20220911.gz" ## [13] "rmsk1.20220307.gz" "rmsk2.20220321.gz" ## [15] "Tetranuc2.20220321.gz" "TFBS.20220921.gz" ## [17] "TFBSrm.20221005.gz" ## load all database files in the folder dbs <- KYCG_loadDBs("~/Downloads/KYCG_knowledgebase_EPICv2/") ## or one database file dbs <- KYCG_loadDBs("~/Downloads/KYCG_knowledgebase_EPICv2/hg38/CGI.20220904.gz") ``` # GO/Pathway Enrichment One can get all the genes associated with a probe set by ```{r ky18, message=FALSE} df <- rowData(sesameDataGet('MM285.tissueSignature')) query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"] genes <- sesameData_getGenesByProbes(query, platform="MM285") genes ``` Here we demonstrate the use of [g:Profiler2](https://biit.cs.ut.ee/gprofiler/gost) to perform Gene ontology enrichment analysis: ```{r ky19, eval = FALSE} library(gprofiler2) ## use gene name gostres <- gost(genes$gene_name, organism = "mmusculus") gostres$result[order(gostres$result$p_value),] gostplot(gostres) ## use Ensembl gene ID, note we need to remove the version suffix gene_ids <- sapply(strsplit(names(genes),"\\."), function(x) x[1]) gostres <- gost(gene_ids, organism = "mmusculus") gostres$result[order(gostres$result$p_value),] gostplot(gostres) ``` # Set Enrichment Analysis The query may be a named continuous vector. In that case, either a gene enrichment score will be calculated (if the database is discrete) or a Spearman correlation will be calculated (if the database is continuous as well). The three other cases are shown below using biologically relevant examples. To display this functionality, let's load two numeric database sets individually. One is a database set for CpG density and the other is a database set corresponding to the distance of the nearest transcriptional start site (TSS) to each probe. ```{r ky21, run-test-data, echo=TRUE, eval=TRUE, message=FALSE} query <- KYCG_getDBs("KYCG.MM285.designGroup")[["TSS"]] ``` ```{r ky22, echo=TRUE, eval=TRUE, message=FALSE} res <- testEnrichmentSEA(query, "MM285.seqContextN") res[, c("dbname", "test", "estimate", "FDR", "nQ", "nD", "overlap")] ``` The estimate here is enrichment score. > **NOTE:** Negative enrichment score suggests enrichment of the categorical database with the higher values (in the numerical database). Positive enrichment score represent enrichment with the smaller values. As expected, the designed TSS CpGs are significantly enriched in smaller TSS distance and higher CpG density. One can plot the set enrichment analysis result by `prepPlot=TRUE` command followed by calling the `KYCG_plotSetEnrichment()` function. ```{r ky24, GSEA, fig.width=6, fig.height=6, message=FALSE} query <- KYCG_getDBs("KYCG.MM285.designGroup")[["TSS"]] db <- KYCG_getDBs("MM285.seqContextN", "distToTSS") res <- testEnrichmentSEA(query, db, prepPlot = TRUE) KYCG_plotSetEnrichment(res[[1]]) ``` Alternatively one can test the enrichment of a continuous query with discrete databases. Here we will use the methylation level from a sample as the query and test it against the chromHMM chromatin states. ```{r ky23, warning=FALSE, eval=FALSE} beta_values <- getBetas(sesameDataGet("MM285.1.SigDF")) res <- testEnrichmentSEA(beta_values, "MM285.chromHMM") res[, c("dbname", "test", "estimate", "FDR", "nQ", "nD", "overlap")] ``` As expected, chromatin states `Tss`, `Enh` has negative enrichment score, meaning these databases are associated with small values of the query (DNA methylation level). On the contrary, `Quies` states are associated with high methylation level. # Correlation Network Analysis Methylation Correlation Network Analysis is motivated by gene expression methods that seek to identify modules and networks from expression data. Genes that are co-expressed behave similarly over different environments and thus may share similar biological function and participate in functional networks. Similarly, CpGs that have highly correlated methylation fractions across different environments may be involved in common epigenetic and biological pathways. Two highly correlated CpGs belonging to the same module. Across 256 samples from different tissues, ages and sexes, methylation changes at one CpG is accompanied by similar changes in a separate co-methylated CpG ```{r ky25, echo=FALSE} se <- sesameDataGet("MM285.10.SE.tissue") library(tibble) df_plot <- as.data.frame(t(assay(se)[ c("cg30910045_BC21","cg34861418_TC21"),])) %>% rownames_to_column("IDAT") library(ggplot2) ggplot(df_plot) + geom_line(mapping=aes(IDAT, y=cg30910045_BC21, group=1), color="blue") + geom_line(mapping=aes(IDAT, y=cg34861418_TC21, group=1), color="red") + labs(y="Methylation fraction", x="Sample") + theme_bw() + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) ``` # Session Info ```{r} sessionInfo() ```