--- title: "DMRs identification with mCSEA package" author: - name: Jordi Martorell-Marugán affiliation: - Bioinformatics Unit. GENYO, Centre for Genomics and Oncological Research - name: Pedro Carmona-Sáez affiliation: - Bioinformatics Unit. GENYO, Centre for Genomics and Oncological Research email: pedro.carmona@genyo.es package: mCSEA date: "`r BiocStyle::doc_date()`" abstract: > mCSEA (methylathed CpGs Set Enrichment Analysis) searches Differentially Methylated Regions (DMRs) between conditions using methylation data from Illumina's 450k or EPIC microarrays. The evaluated DMRs are predefined regions (promoters, gene bodies, CpG Islands and user-defined regions). This package contains functions to rank the CpG probes, to apply a GSEA analysis for DMRs identification, to plot the results and to integrate them with expression data. output: BiocStyle::pdf_document #output: word_document vignette: > %\VignetteIndexEntry{Predefined DMRs identification with mCSEA package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} references: - id: du10 title: "Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis" author: - family: Du given: P - family: Zhang given: X - family: Huang given: C - family: Jafari given: N - family: Kibbe given: WA - family: Hou given: L - family: Lin given: SM container-title: BMC Bioinformatics issued: year: 2010 volume: 11 issue: 587 - id: jones12 title: "Functions of DNA Methylation: islands, start sites, gene bodies and beyond" author: - family: Jones given: Peter A. container-title: Nature Reviews Genetics issued: year: 2012 volume: 13 issue: 7 - id: mcgregor16 title: "An evaluation of methods correcting for cell-type heterogeneity in DNA methylation studies" author: - family: McGregor given: K - family: Bernatsky given: S - family: Colmegna given: I - family: Pastinen given: T - family: Labbe given: A - family: Greenwood given: CMT container-title: Genome Biology issued: year: 2016 volume: 17 issue: 84 - id: jones12 title: "Functions of DNA Methylation: islands, start sites, gene bodies and beyond" author: - family: Jones given: Peter A. container-title: Nature Reviews Genetics issued: year: 2012 volume: 13 issue: 7 --- # Previous steps The input of mCSEA is tipically a matrix with the processed $\beta$-values for each probe and sample. If you start from the raw methylation files (like .idat files) you should first preprocess the data with any of the available packages for that purpose (e.g. `r BiocStyle::Biocpkg("minfi")` or `r BiocStyle::Biocpkg("ChAMP")`). Minfi includes functions to get a matrix of $\beta$-values (_getBeta()_) or M-values (_getM()_). ChAMP output class depends on the type of analysis performed. For instance, _champ.norm()_ function returns a matrix, while _champ.load()_ returns a list of results, and one of them is a $\beta$-values matrix. So mCSEA is totally compatible with minfi and ChAMP outputs as long as a matrix with the methylation values is obtained. ## Reading .idat files Here we provide a minimal example to read .idat files with `r BiocStyle::Biocpkg("minfi")` package. A samplesheet must be provided in order to read the raw files and generate a _RGChannelSet_ object. For more information about this step, read minfi's [vignette](https://bioconductor.org/packages/release/bioc/vignettes/minfi/inst/doc/minfi.html#3_reading_data). ```{r, message = FALSE} library(minfi) minfiDataDir <- system.file("extdata", package = "minfiData") targets <- read.metharray.sheet(minfiDataDir, verbose = FALSE) RGset <- read.metharray.exp(targets = targets) ``` ## Cell type heterogeneity correction Different cell types proportions across samples are one of the major sources of variability in methylation data from tissues like blood or saliva (@mcgregor16). There are a lot of packages which can be used to estimate cell types proportions in each sample in order to correct for this bias (reviewed in @mcgregor16). Here we supply an example where blood reference data is used to estimate cell types proportions of each blood sample. ```{r, message=FALSE} library(FlowSorted.Blood.450k) library(mCSEA) data(mcseadata) cellCounts = estimateCellCounts(RGset) print(cellCounts) ``` These proportions could be introduced as covariates in the linear models. # Step 1: Ranking CpGs probes To run a mCSEA analysis, you must rank all the evaluated CpGs probes with some metric (e.g. t-statistic, Fold-Change...). You can use _rankProbes()_ function for that aim, or prepare a ranked list with the same structure as the _rankProbes()_ output. We load sample data to show how _rankProbes()_ works: ```{r, message = FALSE, results='hide'} library(mCSEA) data(mcseadata) ``` We loaded to our R environment **betaTest** and **phenoTest** objects, in addition to **exprTest**, annotation objects and association objects (we will talk about these after). **betaTest** is a matrix with the $\beta$-values of 10000 EPIC probes for 20 samples. **phenoTest** is a dataframe with the explanatory variable and covariates associated to the samples. When you load your own data, the structure of your objects should be similar. ```{r} head(betaTest, 3) print(phenoTest) ``` _rankProbes()_ function uses these two objects as input and apply a linear model with `r BiocStyle::Biocpkg("limma")` package. By default, _rankProbes()_ considers the first column of the phenotypes table as the explanatory variable in the model (e.g. cases and controls) and does not take into account any covariate to adjust the models. You can change this behaviour modifying **explanatory** and **covariates** options. By default, _rankProbes()_ assumes that the methylation data object contains $\beta$-values and transform them to M-values before calculating the linear models. If your methylation data object contains M-values, you must specify it (typeInput = "M"). You can also use $\beta$-values for models calculation (typeAnalysis = "beta"), although we do not recommend it due to it has been proven that M-values better accomplish the statistical assumptions of limma analysis (@du10). ```{r} myRank <- rankProbes(betaTest, phenoTest, refGroup = "Control") ``` **myRank** is a named vector with the t-values for each CpG probe. ```{r} head(myRank) ``` You can also supply _rankProbes()_ function with a SummarizedExperiment object. In that case, if you don't specify a **pheno** object, phenotypes will be extracted from the SummarizedExperiment object with _colData()_ function. ## Paired analysis It is possible to take into account paired samples in _rankProbes()_ analysis. For that aim, you should use paired = TRUE parameter and specify the column in pheno containing pairing information (pairColumn parameter). # Step 2: Searching DMRs in predefined regions Once you calculated a score for each CpG, you can perform the mCSEA analysis. For that purpose, you should use _mCSEATest()_ function. This function takes as input the vector generated in the previous step, the methylation data and the phenotype information. By default, it searches for differentially methylated promoters, gene bodies and CpG Islands. You can specify the regions you want to test with _regionsTypes_ option. _minCpGs_ option specifies the minimum amount of CpGs in a region to be considered in the analysis (5 by default). You can increase the number of processors to use with _nproc_ option (recommended if you have enough computational resources). By default, mCSEA performs 10000 permutations to calculate P-values, but you can change that with _nperm_ option. Finally, you should specify if the array platform is 450k or EPIC with the _platform_ option. Note that _mCSEATest()_ performs permutations to get the P-values, so each time it is executed, the results are not exactly the same. To avoid that, we recommend to include a _set.seed()_ in order to get reproducible results. ```{r, warning=FALSE} set.seed(123) myResults <- mCSEATest(myRank, betaTest, phenoTest, regionsTypes = "promoters", platform = "EPIC") ``` _mCSEATest()_ returns a list with the GSEA results and the association objects for each region type analyzed, in addition to the input data (methylation, phenotype and platform). ```{r} ls(myResults) ``` **promoters** is a data frame with the following columns (partially extracted from `r BiocStyle::Biocpkg("fgsea")` help): * *pval:* Estimated P-value. * *padj:* P-value adjusted by BH method. * *ES:* Enrichment score. * *NES:* Normalized enrichment score by number of CpGs associated to the feature. * *nMoreExtreme:* Number of times a random gene set had a more extreme enrichment score value. * *size:* Number of CpGs associated to the feature. * *leadingEdge:* Leading edge CpGs which drive the enrichment. ```{r} head(myResults[["promoters"]][,-7]) ``` On the other hand, **promoters_association** is a list with the CpG probes associated to each feature: ```{r} head(myResults[["promoters_association"]], 3) ``` You can also provide a custom association object between CpG probes and regions (_customAnnotation_ option). This object should be a list with a structure similar to this: ```{r} head(assocGenes450k, 3) ``` # Step 3: Plotting the results Once you found some DMRs, you can make a plot with the genomic context of the interesting ones. For that, you must provide _mCSEAPlot()_ function with the _mCSEATest()_ results, and you must specify which type of region you want to plot and the name of the DMR to be plotted (e.g. gene name). There are some graphical parameters you can adjust (see _mCSEAPlot()_ help). Take into account that this function connects to some online servers in order to get genomic information. For that reason, this function could take some minutes to finish the plot, specially the first time it is executed. ```{r, message = FALSE, results='hide'} mCSEAPlot(myResults, regionType = "promoters", dmrName = "CLIC6", transcriptAnnotation = "symbol", makePDF = FALSE) ``` You can also plot the GSEA results for a DMR with _mCSEAPlotGSEA()_ function. ```{r} mCSEAPlotGSEA(myRank, myResults, regionType = "promoters", dmrName = "CLIC6") ``` # Integrating methylation and expression data If you have both methylation and expression data for the same samples, you can integrate them in order to discover significant associations between methylation changes in a DMR and an expression alterations in a close gene. _mCSEAIntegrate()_ considers the DMRs identified by _mCSEATest()_ passing a P-value threshold (0.05 by default). It calculates the mean methylation for each condition using the leading edge CpGs and performs a correlation test between this mean DMR methylation and the expression of close genes. This function automatically finds the genes located within a determined distance (1.5 kb) from the DMR. Only correlations passing thresholds (0.5 for correlation value and 0.05 por P-value by default) are returned. For promoters, only negative correlations are returned due to this kind of relationship between promoters methylation and gene expression has been largely observed (@jones12). On the contrary, only positive correlations between gene bodies methylation and gene expression are returned, due to this is a common relationship observed (@jones12). For CpG islands and custom regions, both positive and negative correlations are returned, due to they can be located in both promoters and gene bodies. To test this function, we extracted a subset of 100 genes expression from bone marrows of 10 healthy and 10 leukemia patients (**exprTest**). Data was extracted from `r BiocStyle::Biocpkg("leukemiasEset")` package. ```{r} # Explore expression data head(exprTest, 3) # Run mCSEAIntegrate function resultsInt <- mCSEAIntegrate(myResults, exprTest, "promoters", "ENSEMBL") resultsInt ``` It is very important to specify the correct gene identifiers used in the expression data (_geneIDs_ parameter). _mCSEAIntegrate()_ automatically generates correlation plots for the significant results and save them in the directory specified by _folder_ parameter (current directory by default). ![Integration plot for GATA2 promoter methylation and ENSG00000179348 expression. Note that, actually, both names refers to the same gene, but SYMBOL was used to analyze promoters methylation and ENSEMBL ID was used as gene identifiers in the expression data.](GATA2_ENSG00000179348_promoters.pdf) # Session info ```{r sessionInfo, echo=FALSE} sessionInfo() ``` # References