---
title: "Modeling"
date: "`r BiocStyle::doc_date()`"
package: sesame
output: BiocStyle::html_document
fig_width: 8
fig_height: 6
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{3. Modeling}
  %\VignetteEncoding{UTF-8}
---

# Linear Modeling

Here I will try to address the analysis workflow of studying DNA methylation
with one or more independent variable, be it tissue, cell type, sex, age,
tumor/normal, treatment/control or a combination of these factors.
The `DML` (short for Differential Methylation Locus) function models beta
values (DNA methylation levels) on known co-variates (e.g., case vs control)
using linear modeling. This general supervised learning framework
identifies CpG loci whose differential methylation is associated with known
co-variates. Let's first load needed packages
```{r message=FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
library(sesame)
sesameDataCacheAll()
library(dplyr)
library(tidyr)
library(SummarizedExperiment)
library(ggplot2)
```

In the following, we will use 10 mouse array samples as our example dataset.
This dataset contains mouse samples
from different tissues and individuals of different ages and sexes.
The dataset is stored
in a `SummarizedExperiment` object, which contains a data matrix combined
with column-wise metadata accessible with `colData`:
```{r message=FALSE}
se = sesameDataGet("MM285.10.tissues")[1:1000,] # use a random 1000 probes
colData(se)
```

> **CRITICAL:** If your data is NA-masked (see the Basic Usage tutorial), it is
critical to preclude CpGs with missing levels. For example, if a CpG does
not contain any non-NA values for all male mice, one cannot assess
sex-specific DNA methylation for this CpG. One needs to preclude
such CpGs from the modeling of sex-specific differential methylation.
This can be done using the `checkLevels` function.

In the following, we will test for both sex and
tissue as co-variates using the `checkLevels` function:
```{r}
se_ok = (checkLevels(assay(se), colData(se)$sex) &
    checkLevels(assay(se), colData(se)$tissue))
sum(se_ok)
se = se[se_ok,]
```

> **NOTE:** If your model include discrete contrast variables like tissue and
sex as in the current example, you should consider explicitly turning it into a
factor variable with a reference level (we use `treatment coding`, see
[different coding systems](
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-
for-categorical-variables/)).

For example, to use `Colon` as the
reference tissue and `Female` as the reference sex, one can do the following
```{r}
colData(se)$tissue <- relevel(factor(colData(se)$tissue), "Colon")
colData(se)$sex <- relevel(factor(colData(se)$sex), "Female")
```

Then we will model DNA methylation variation
treating tissue and sex as covariates. To do that we will call the `DML`
function and specify the R formula
`~tissue + sex`. This function fits DNA methylation reading to a linear
model and perform the
corresponding slope test and goodness-of-fit test (F-test holding out each
contrast variable). All these results are returned in an
object of class `DMLSummary`;
```{r}
smry = DML(se, ~tissue + sex)
smry
```
You can use `mc.cores` argument to parallelize the computing.

## Test Interpretation

The `DMLSummary` object is a list of expanded `summary.lm` objects from the
linear model (as would be returned by `summary(lm())`. The
`summaryExtractTest` function abstracts a `DMLSummary` object
into a data frame with rows corresponding to CpGs/loci and columns containing
the slopes and p-values of each variable.
```{r}
test_result = summaryExtractTest(smry)
colnames(test_result) # the column names, show four groups of statistics
head(test_result)
```

With the exception of the `Intercept`, there are four groups of columns, each
starting with "Est_", "Pval_", "FPval_", and "Eff_" as prefix. Here are what
they represent:

* **Est_**
: The slope estimate (aka the beta coefficient) for continuous
variable. DNA methylation difference of the current level
with respect to the reference level for nominal contrast variables. Each suffix
is concatenated from the contrast variable name (e.g., tissue, sex) and the
level name if the contrast variable is discrete (e.g, Cecum, Esophagus, Fat).
For example, `Est_tissueFat` should be interpreted as the
estimated beta value difference of Fat compared to the reference tissue (which
is `Colon`, as set above. If reference is not set, the first level in
alphabetic order is used as the reference level).
There are is a special column named ``Est_`(Intercept)` ``.
They correspond to the beta value of the reference level (in this case a Female
Colon sample).

* **Pval_**
: The p-value of t-testing the slope (statistical
significance of the difference). For example, `Pval_tissueFat` tests whether
`Fat` is significantly different from `Colon` (the reference level) in DNA
methylation. The ``Pval_`(Intercept)` `` tests whether the reference level is
significantly different from zero.

* **FPval_**
: The p-value of the F-test contrasting the full model against a
reduced model with the labeled contrasting held out. Note that "Pval_" and
"FPval_" are equivalent when the contrast variable is a 2-level factor, i.e.,
a pairwise comparison.

* **Eff_**
: The effect size of each normial contrast variable. This is
equivalent to the maximum slope subtracted by the minimum level including the
reference level (0).

## Goodness of fit

One may want to ask a question like _whether the methylation of the probe is
tissue-specific_ rather than whether the methylation of the probe is
differentially methylated in fat.
The former question ask about the contrast variable
as a whole instead of a specific level of the contrast variable. To answer
this question, we can use an F-test contasting the full model with a reduced
model with the target contrast held out. By default, this analysis is already
completed in the `DML` function. The test result is recorded
in the **FPval_** columns. For example, to get all CpGs that are methylated
specific to sex,

```{r}
test_result %>% dplyr::filter(FPval_sex < 0.05, Eff_sex > 0.1) %>%
    select(FPval_sex, Eff_sex)
```
Here we used 0.1 as the effect size threshold. This means DNA methylation
difference under 0.1 (10%) is not considered a biologically meaningful
difference. This is a valid assumption for homogenous cell composition as
most cells would be either biallelically methylated, unmethylated or
monoallelically methylated.

We can define CpG methylation as sex-specific, tissue-specific or both
```{r}
test_result %>%
    mutate(sex_specific =
        ifelse(FPval_sex < 0.05 & Eff_sex > 0.1, TRUE, FALSE)) %>%
    mutate(tissue_specific =
        ifelse(FPval_tissue < 0.05 & Eff_tissue > 0.1, TRUE, FALSE)) %>%
    select(sex_specific, tissue_specific) %>% table
```

As you can see from the result, two probes are sex-specific and one probe
13 probes are tissue-specific. There is no overlap between probes whose
methylation reading is differential across both contrasts.

## Pairwise Comparison

From the test result, we can also ask whether the DNA methylation might
be different between two sexes or two tissues. For example, `Est_sexMale`
compares male from females. The following code creates a volcano plot.

```{r}
ggplot(test_result) + geom_point(aes(Est_sexMale, -log10(Pval_sexMale)))
```

Likewise, we can ask whether DNA methylation might be different between fat
and colon. We can do

```{r}
ggplot(test_result) + geom_point(aes(Est_tissueFat, -log10(Pval_tissueFat)))
```

## Continuous Variable

The variable tested in the `DML` function can be continuous.
Suppose we are interested in `age` besides `sex`. We will call the
same function but with the following formula:
```{r}
smry2 = DML(se, ~ age + sex)
test_result2 = summaryExtractTest(smry2) %>% arrange(Est_age)
```

Let's verify the CpGs positively associated with age.
```{r}
test_result2 %>% dplyr::select(Est_age, Pval_age) %>% tail # positive assoc.
df = data.frame(Age = colData(se)$age,
    BetaValue = assay(se)[rownames(test_result2)[nrow(test_result2)],])
ggplot(df, aes(Age, BetaValue)) + geom_smooth(method="lm") + geom_point()
```

# DMR

For a given contrast, one can merge neighboring CpGs that show consistent
methylation variation into differentially methylated regions (DMRs).
For example, we can merge sex-specific differential methylation identified
above to chromosome X regions that show X-inactivation-related methylation
difference. To do this, we need to pick a contrast:

```{r}
dmContrasts(smry)                       # pick a contrast
merged = DMR(se, smry, "sexMale")       # merge CpGs to regions / segments
```

# Track View

SeSAMe provide utilities to view methylation reading in a track view.
Next, we will demonstrate how to create track view with transcript
position marked focusing on a genomic region, a gene and specific probes.
Let's first load example HM450 data
```{r}
betas <- sesameDataGet('HM450.10.TCGA.PAAD.normal')
```

## By Genomic Region
To visualize probes from arbitrary region, we will call `visualizeRegion`:
```{r, message=FALSE, fig.width=6, fig.height=5}
visualizeRegion(
    'chr19',10260000,10380000, betas, platform='HM450',
    show.probeNames = FALSE)
```
Zero to full methylation are displayed using the jet color scheme with blue
representing no methylation and red full methylation.

## By Gene Name
To visualize all probes from a gene, we will call `visualizeGene`
```{r, message=FALSE, fig.width=6, fig.height=5}
visualizeGene('DNMT1', betas, platform='HM450')
```

## By Probe ID
To visualize genome neighborhood using probe names, we will call
`visualizeProbes`:
```{r, message=FALSE, fig.width=6}
visualizeProbes(c("cg02382400", "cg03738669"), betas, platform='HM450')
```

# Session Info

```{r}
sessionInfo()
```