---
title: "DMRs identification with mCSEA package"
author:
- name: Jordi Martorell-Marugán
  affiliation:
  - Bioinformatics Unit. GENYO, Centre for Genomics and Oncological Research
- name: Pedro Carmona-Sáez
  affiliation:
  - Bioinformatics Unit. GENYO, Centre for Genomics and Oncological Research
  email: pedro.carmona@genyo.es
package: mCSEA
date: "`r BiocStyle::doc_date()`"
abstract: >
  mCSEA (methylathed CpGs Set Enrichment Analysis) searches Differentially
  Methylated Regions (DMRs) between conditions using methylation data from 
  Illumina's 450k or EPIC microarrays. The evaluated DMRs are predefined 
  regions (promoters, gene bodies, CpG Islands and user-defined regions).
  This package contains functions to rank the CpG probes, to apply a 
  GSEA analysis for DMRs identification, to plot the results and to integrate 
  them with expression data.
output:
  BiocStyle::pdf_document
#output: word_document

vignette: >
  %\VignetteIndexEntry{Predefined DMRs identification with mCSEA package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
references:
- id: du10
  title: "Comparison of Beta-value and M-value methods for quantifying 
methylation levels by microarray analysis"
  author:
  - family: Du
    given: P
  - family: Zhang
    given: X
  - family: Huang
    given: C
  - family: Jafari
    given: N
  - family: Kibbe
    given: WA
  - family: Hou
    given: L
  - family: Lin
    given: SM
  container-title: BMC Bioinformatics
  issued:
    year: 2010
  volume: 11
  issue: 587
- id: jones12
  title: "Functions of DNA Methylation: islands, start sites, 
gene bodies and beyond"
  author:
  - family: Jones
    given: Peter A.
  container-title: Nature Reviews Genetics
  issued:
    year: 2012
  volume: 13
  issue: 7
  
  
- id: mcgregor16
  title: "An evaluation of methods correcting for cell-type heterogeneity in DNA methylation studies"
  author:
  - family: McGregor
    given: K
  - family: Bernatsky
    given: S
  - family: Colmegna
    given: I
  - family: Pastinen
    given: T
  - family: Labbe
    given: A
  - family: Greenwood
    given: CMT
  container-title: Genome Biology
  issued:
    year: 2016
  volume: 17
  issue: 84
  
- id: jones12
  title: "Functions of DNA Methylation: islands, start sites, 
gene bodies and beyond"
  author:
  - family: Jones
    given: Peter A.
  container-title: Nature Reviews Genetics
  issued:
    year: 2012
  volume: 13
  issue: 7
---

# Previous steps

The input of 
mCSEA is tipically a matrix with the processed $\beta$-values for each probe 
and sample. If you start from the raw methylation files (like .idat files) you 
should first preprocess the data with any of the available packages for 
that purpose (e.g. `r BiocStyle::Biocpkg("minfi")` or 
`r BiocStyle::Biocpkg("ChAMP")`). Minfi 
includes functions to get a matrix of $\beta$-values (_getBeta()_) or 
M-values (_getM()_). ChAMP output class depends on the type of analysis 
performed. For instance, _champ.norm()_ function returns a matrix, while 
_champ.load()_ returns a list of results, and one of them is a $\beta$-values
matrix. So mCSEA is totally compatible with minfi and ChAMP outputs as long as 
a matrix with the methylation values is obtained.

## Reading .idat files

Here we provide a minimal example to read .idat files with 
`r BiocStyle::Biocpkg("minfi")` package. A samplesheet must 
be provided in order to read the raw files and generate a 
_RGChannelSet_ object. For more information about this step, 
read minfi's [vignette](https://bioconductor.org/packages/release/bioc/vignettes/minfi/inst/doc/minfi.html#3_reading_data).

```{r, message = FALSE}
library(minfi)
minfiDataDir <- system.file("extdata", package = "minfiData")
targets <- read.metharray.sheet(minfiDataDir, verbose = FALSE)
RGset <- read.metharray.exp(targets = targets)
```


## Cell type heterogeneity correction

Different cell types proportions across samples are one of the major 
sources of variability in methylation data from tissues like blood or 
saliva (@mcgregor16). There are a lot of packages which can be used to 
estimate cell types proportions in each sample in order to correct for 
this bias (reviewed in @mcgregor16). Here we supply an example where blood 
reference data is used to estimate cell types proportions of each blood sample.

```{r, message=FALSE}
library(FlowSorted.Blood.450k)
library(mCSEA)
data(mcseadata)

cellCounts = estimateCellCounts(RGset)
print(cellCounts)
```

These proportions could be introduced as covariates in the linear models.

# Step 1: Ranking CpGs probes

To run a mCSEA analysis, you must rank all the evaluated CpGs probes with some 
metric (e.g. t-statistic, Fold-Change...). You can use _rankProbes()_ function 
for that aim, or prepare a ranked list with the same structure as the 
_rankProbes()_ output.

We load sample data to show how _rankProbes()_ works: 

```{r, message = FALSE, results='hide'}
library(mCSEA)
data(mcseadata)
```

We loaded to our R environment **betaTest** and **phenoTest** objects, in 
addition to **exprTest**, annotation objects and association objects 
(we will talk about these after). **betaTest** 
is a matrix with the $\beta$-values of 10000 EPIC probes for 20 samples. 
**phenoTest** is a dataframe with the explanatory variable and covariates 
associated to the samples. When you load your own data, the structure of 
your objects should be similar.

```{r}
head(betaTest, 3)
print(phenoTest)
```

_rankProbes()_ function uses these two objects as input and apply a linear 
model with `r BiocStyle::Biocpkg("limma")` package. By default, _rankProbes()_ 
considers the first column of the phenotypes table as the explanatory variable 
in the model (e.g. cases and controls) and does not take into account any 
covariate to adjust the models. You can change this behaviour modifying 
**explanatory** and **covariates** options.

By default, _rankProbes()_ assumes that the methylation data object contains 
$\beta$-values and transform them to M-values before calculating the linear 
models. If your methylation data object contains M-values, you must specify 
it (typeInput = "M"). You can also use $\beta$-values for models calculation 
(typeAnalysis = "beta"), although we do not recommend it due to it has been 
proven that M-values better accomplish the statistical assumptions of limma 
analysis (@du10).

```{r}
myRank <- rankProbes(betaTest, phenoTest, refGroup = "Control")
```
**myRank** is a named vector with the t-values for each CpG probe.

```{r}
head(myRank)
```

You can also supply _rankProbes()_ function with a SummarizedExperiment 
object. In that case, if you don't specify a **pheno** object, phenotypes 
will be extracted from the SummarizedExperiment object with _colData()_ 
function.

## Paired analysis

It is possible to take into account paired samples in _rankProbes()_
analysis. For that aim, you should use paired = TRUE parameter and 
specify the column in pheno containing pairing information 
(pairColumn parameter).

# Step 2: Searching DMRs in predefined regions

Once you calculated a score for each CpG, you can perform the mCSEA analysis. 
For that purpose, you should use _mCSEATest()_ function. This function takes 
as input the vector generated in the previous step, the methylation data and 
the phenotype information. By default, it searches 
for differentially methylated promoters, gene bodies and CpG Islands. You can 
specify the regions you want to test with _regionsTypes_ option. _minCpGs_ 
option specifies the minimum amount of CpGs in a region to be considered in 
the analysis (5 by default). You can increase the number of processors to use 
with _nproc_ option (recommended if you have enough computational resources). 
By default, mCSEA performs 10000 
permutations to calculate P-values, but you can change that with _nperm_ 
option. Finally, you should specify if the array platform is 450k or EPIC with 
the _platform_ option. 
Note that _mCSEATest()_ performs permutations to get the P-values, so each 
time it is executed, the results are not exactly the same. To avoid that, we 
recommend to include a _set.seed()_ in order to get reproducible results. 

```{r, warning=FALSE}
set.seed(123)
myResults <- mCSEATest(myRank, betaTest, phenoTest, 
                        regionsTypes = "promoters", platform = "EPIC")
```

_mCSEATest()_ returns a list with the GSEA results and the association objects 
for each region type analyzed, in addition to the input data (methylation, 
phenotype and platform).

```{r}
ls(myResults)
```

**promoters** is a data frame with the following columns (partially extracted 
from `r BiocStyle::Biocpkg("fgsea")` help):

* *pval:* Estimated P-value.
* *padj:* P-value adjusted by BH method.
* *ES:* Enrichment score.
* *NES:* Normalized enrichment score by number of CpGs associated to the 
feature.
* *nMoreExtreme:* Number of times a random gene set had a more extreme 
enrichment score value.
* *size:* Number of CpGs associated to the feature.
* *leadingEdge:* Leading edge CpGs which drive the enrichment.

```{r}
head(myResults[["promoters"]][,-7])
```

On the other hand, **promoters_association** is a list with the CpG probes 
associated to each feature:

```{r}
head(myResults[["promoters_association"]], 3)
```

You can also provide a custom association object between CpG 
probes and regions (_customAnnotation_ option). This object should be a list 
with a structure similar to this:

```{r}
head(assocGenes450k, 3)
```

# Step 3: Plotting the results

Once you found some DMRs, you can make a plot with the genomic context of the 
interesting ones. For that, you must provide _mCSEAPlot()_ function with the 
_mCSEATest()_ results, and 
you must specify which type of region you want to plot and the name of the 
DMR to be plotted (e.g. gene name). There are some graphical parameters you can
adjust (see _mCSEAPlot()_ help). Take into account that this function connects 
to some online servers in order to get genomic information. For that reason, 
this function could take some minutes to finish the plot, specially the first 
time it is executed.

```{r, message = FALSE, results='hide'}
mCSEAPlot(myResults, regionType = "promoters", 
           dmrName = "CLIC6",
           transcriptAnnotation = "symbol", makePDF = FALSE)
```

You can also plot the GSEA results for a DMR with _mCSEAPlotGSEA()_ 
function.

```{r}
mCSEAPlotGSEA(myRank, myResults, regionType = "promoters", dmrName = "CLIC6")
```

# Integrating methylation and expression data

If you have both methylation and expression data for the same samples, you can 
integrate them in order to discover significant associations between 
methylation changes in a DMR and an expression alterations in a close gene.
_mCSEAIntegrate()_ considers the DMRs identified by _mCSEATest()_ passing a 
P-value threshold (0.05 by default). It calculates the mean methylation for 
each condition using the leading edge CpGs and performs a correlation test 
between this mean DMR methylation and the expression of close genes. This 
function automatically finds the genes located within a determined distance 
(1.5 kb) from the DMR. Only correlations passing thresholds (0.5 for 
correlation value and 0.05 por P-value by default) are returned. For 
promoters, only negative correlations are returned due to this kind of 
relationship between promoters methylation and gene expression has been 
largely observed (@jones12). On the contrary, only positive correlations 
between gene bodies methylation and gene expression are returned, due to this 
is a common relationship observed (@jones12). For CpG islands and custom 
regions, both positive and negative correlations are returned, due to they can 
be located in both promoters and gene bodies.

To test this function, we extracted a subset of 100 genes expression from bone 
marrows of 10 healthy and 10 leukemia patients (**exprTest**). Data was 
extracted from `r BiocStyle::Biocpkg("leukemiasEset")` package.

```{r}
# Explore expression data
head(exprTest, 3)

# Run mCSEAIntegrate function
resultsInt <- mCSEAIntegrate(myResults, exprTest, "promoters", "ENSEMBL")

resultsInt
```

It is very important to specify the correct gene identifiers used in the 
expression data (_geneIDs_ parameter). _mCSEAIntegrate()_ automatically 
generates correlation plots for the significant results and save them in the 
directory specified by _folder_ parameter (current directory by default).

![Integration plot for GATA2 promoter methylation and ENSG00000179348 
expression. Note that, actually, both names refers to the same gene, but 
SYMBOL was used to analyze promoters methylation and ENSEMBL ID was used as 
gene identifiers in the expression data.](GATA2_ENSG00000179348_promoters.pdf)

# Session info
```{r sessionInfo, echo=FALSE}
sessionInfo()
```

# References