Here I will try to address the analysis workflow of studying DNA methylation with one or more independent variable, be it tissue, cell type, sex, age, tumor/normal, treatment/control or a combination of these factors. The DML
(short for Differential Methylation Locus) function models beta values (DNA methylation levels) on known co-variates (e.g., case vs control) using linear modeling. This general supervised learning framework identifies CpG loci whose differential methylation is associated with known co-variates. Let’s first load needed packages
## [1] TRUE
In the following, we will use 10 mouse array samples as our example dataset. This dataset contains mouse samples from different tissues and individuals of different ages and sexes. The dataset is stored in a SummarizedExperiment
object, which contains a data matrix combined with column-wise metadata accessible with colData
:
## DataFrame with 10 rows and 4 columns
## IDAT tissue sex age
## <character> <character> <character> <numeric>
## 204875570008_R06C01 204875570008_R06C01 Cecum Female 1
## 204875570011_R02C02 204875570011_R02C02 Cecum Male 1
## 204875570008_R02C02 204875570008_R02C02 Colon Female 1
## 204875570008_R03C02 204875570008_R03C02 Colon Female 1
## 204875570008_R05C01 204875570008_R05C01 Esophagus Female 16
## 204875570011_R03C01 204875570011_R03C01 Esophagus Male 16
## 204875570011_R01C01 204875570011_R01C01 Fat Male 16
## 204875570022_R03C02 204875570022_R03C02 Fat Female 16
## 204875570011_R04C01 204875570011_R04C01 Heart Male 31
## 204875570014_R04C02 204875570014_R04C02 Heart Male 31
CRITICAL: If your data is NA-masked (see the Basic Usage tutorial), it is critical to preclude CpGs with missing levels. For example, if a CpG does not contain any non-NA values for all male mice, one cannot assess sex-specific DNA methylation for this CpG. One needs to preclude such CpGs from the modeling of sex-specific differential methylation. This can be done using the
checkLevels
function.
In the following, we will test for both sex and tissue as co-variates using the checkLevels
function:
se_ok = (checkLevels(assay(se), colData(se)$sex) &
checkLevels(assay(se), colData(se)$tissue))
sum(se_ok)
## [1] 995
NOTE: If your model include discrete contrast variables like tissue and sex as in the current example, you should consider explicitly turning it into a factor variable with a reference level (we use
treatment coding
, see different coding systems).
For example, to use Colon
as the reference tissue and Female
as the reference sex, one can do the following
colData(se)$tissue <- relevel(factor(colData(se)$tissue), "Colon")
colData(se)$sex <- relevel(factor(colData(se)$sex), "Female")
Then we will model DNA methylation variation treating tissue and sex as covariates. To do that we will call the DML
function and specify the R formula ~tissue + sex
. This function fits DNA methylation reading to a linear model and perform the corresponding slope test and goodness-of-fit test (F-test holding out each contrast variable). All these results are returned in an object of class DMLSummary
;
## DMLSummary Object with 995 Loci, 10 samples.
## Contrasts: tissue sex
You can use mc.cores
argument to parallelize the computing.
The DMLSummary
object is a list of expanded summary.lm
objects from the linear model (as would be returned by summary(lm())
. The summaryExtractTest
function abstracts a DMLSummary
object into a data frame with rows corresponding to CpGs/loci and columns containing the slopes and p-values of each variable.
test_result = summaryExtractTest(smry)
colnames(test_result) # the column names, show four groups of statistics
## [1] "Est_X.Intercept." "Est_tissueCecum" "Est_tissueEsophagus"
## [4] "Est_tissueFat" "Est_tissueHeart" "Est_sexMale"
## [7] "Pval_X.Intercept." "Pval_tissueCecum" "Pval_tissueEsophagus"
## [10] "Pval_tissueFat" "Pval_tissueHeart" "Pval_sexMale"
## [13] "FPval_tissue" "FPval_sex" "Eff_tissue"
## [16] "Eff_sex"
With the exception of the Intercept
, there are four groups of columns, each starting with “Est_”, “Pval_”, “FPval_”, and “Eff_” as prefix. Here are what they represent:
Est_tissueFat
should be interpreted as the estimated beta value difference of Fat compared to the reference tissue (which is Colon
, as set above. If reference is not set, the first level in alphabetic order is used as the reference level). There are is a special column named Est_`(Intercept)`
. They correspond to the beta value of the reference level (in this case a Female Colon sample).
Pval_tissueFat
tests whether Fat
is significantly different from Colon
(the reference level) in DNA methylation. The Pval_`(Intercept)`
tests whether the reference level is significantly different from zero.
One may want to ask a question like whether the methylation of the probe is tissue-specific rather than whether the methylation of the probe is differentially methylated in fat. The former question ask about the contrast variable as a whole instead of a specific level of the contrast variable. To answer this question, we can use an F-test contasting the full model with a reduced model with the target contrast held out. By default, this analysis is already completed in the DML
function. The test result is recorded in the FPval_ columns. For example, to get all CpGs that are methylated specific to sex,
Here we used 0.1 as the effect size threshold. This means DNA methylation difference under 0.1 (10%) is not considered a biologically meaningful difference. This is a valid assumption for homogenous cell composition as most cells would be either biallelically methylated, unmethylated or monoallelically methylated.
We can define CpG methylation as sex-specific, tissue-specific or both
test_result %>%
mutate(sex_specific =
ifelse(FPval_sex < 0.05 & Eff_sex > 0.1, TRUE, FALSE)) %>%
mutate(tissue_specific =
ifelse(FPval_tissue < 0.05 & Eff_tissue > 0.1, TRUE, FALSE)) %>%
select(sex_specific, tissue_specific) %>% table
## tissue_specific
## sex_specific FALSE TRUE
## FALSE 891 99
## TRUE 5 0
As you can see from the result, two probes are sex-specific and one probe 13 probes are tissue-specific. There is no overlap between probes whose methylation reading is differential across both contrasts.
From the test result, we can also ask whether the DNA methylation might be different between two sexes or two tissues. For example, Est_sexMale
compares male from females. The following code creates a volcano plot.
Likewise, we can ask whether DNA methylation might be different between fat and colon. We can do
The variable tested in the DML
function can be continuous. Suppose we are interested in age
besides sex
. We will call the same function but with the following formula:
Let’s verify the CpGs positively associated with age.
df = data.frame(Age = colData(se)$age,
BetaValue = assay(se)[rownames(test_result2)[nrow(test_result2)],])
ggplot(df, aes(Age, BetaValue)) + geom_smooth(method="lm") + geom_point()
## `geom_smooth()` using formula 'y ~ x'
For a given contrast, one can merge neighboring CpGs that show consistent methylation variation into differentially methylated regions (DMRs). For example, we can merge sex-specific differential methylation identified above to chromosome X regions that show X-inactivation-related methylation difference. To do this, we need to pick a contrast:
## [1] "X.Intercept." "tissueCecum" "tissueEsophagus" "tissueFat"
## [5] "tissueHeart" "sexMale"
## Merging correlated CpGs ... Done.
## Generated 508 segments.
## Combine p-values ...
## - 29 significant segments.
## - 3 significant segments (after BH).
## Done.
SeSAMe provide utilities to view methylation reading in a track view. Next, we will demonstrate how to create track view with transcript position marked focusing on a genomic region, a gene and specific probes. Let’s first load example HM450 data
To visualize probes from arbitrary region, we will call visualizeRegion
:
Zero to full methylation are displayed using the jet color scheme with blue representing no methylation and red full methylation.
To visualize all probes from a gene, we will call visualizeGene
To visualize genome neighborhood using probe names, we will call visualizeProbes
:
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] tidyr_1.2.0 dplyr_1.0.7
## [3] knitr_1.37 SummarizedExperiment_1.24.0
## [5] Biobase_2.54.0 MatrixGenerics_1.6.0
## [7] matrixStats_0.61.0 scales_1.1.1
## [9] DNAcopy_1.68.0 GenomicRanges_1.46.1
## [11] GenomeInfoDb_1.30.1 IRanges_2.28.0
## [13] S4Vectors_0.32.3 wheatmap_0.1.0
## [15] ggplot2_3.3.5 sesame_1.12.9
## [17] sesameData_1.12.0 rmarkdown_2.11
## [19] ExperimentHub_2.2.1 AnnotationHub_3.2.1
## [21] BiocFileCache_2.2.1 dbplyr_2.1.1
## [23] BiocGenerics_0.40.0
##
## loaded via a namespace (and not attached):
## [1] fgsea_1.20.0 colorspace_2.0-2
## [3] ellipsis_0.3.2 class_7.3-20
## [5] XVector_0.34.0 proxy_0.4-26
## [7] farver_2.1.0 ggrepel_0.9.1
## [9] bit64_4.0.5 interactiveDisplayBase_1.32.0
## [11] AnnotationDbi_1.56.2 fansi_1.0.2
## [13] splines_4.1.2 cachem_1.0.6
## [15] jsonlite_1.7.3 png_0.1-7
## [17] shiny_1.7.1 BiocManager_1.30.16
## [19] compiler_4.1.2 httr_1.4.2
## [21] assertthat_0.2.1 Matrix_1.4-0
## [23] fastmap_1.1.0 cli_3.1.1
## [25] later_1.3.0 htmltools_0.5.2
## [27] tools_4.1.2 gtable_0.3.0
## [29] glue_1.6.1 GenomeInfoDbData_1.2.7
## [31] reshape2_1.4.4 rappdirs_0.3.3
## [33] fastmatch_1.1-3 Rcpp_1.0.8
## [35] jquerylib_0.1.4 vctrs_0.3.8
## [37] Biostrings_2.62.0 nlme_3.1-155
## [39] preprocessCore_1.56.0 xfun_0.29
## [41] stringr_1.4.0 mime_0.12
## [43] lifecycle_1.0.1 zlibbioc_1.40.0
## [45] MASS_7.3-55 BiocStyle_2.22.0
## [47] promises_1.2.0.1 parallel_4.1.2
## [49] RColorBrewer_1.1-2 yaml_2.2.2
## [51] curl_4.3.2 memoise_2.0.1
## [53] gridExtra_2.3 sass_0.4.0
## [55] stringi_1.7.6 RSQLite_2.2.9
## [57] BiocVersion_3.14.0 highr_0.9
## [59] randomForest_4.7-1 e1071_1.7-9
## [61] filelock_1.0.2 BiocParallel_1.28.3
## [63] rlang_1.0.1 pkgconfig_2.0.3
## [65] bitops_1.0-7 evaluate_0.14
## [67] lattice_0.20-45 purrr_0.3.4
## [69] labeling_0.4.2 bit_4.0.4
## [71] tidyselect_1.1.1 plyr_1.8.6
## [73] magrittr_2.0.2 R6_2.5.1
## [75] generics_0.1.2 DelayedArray_0.20.0
## [77] DBI_1.1.2 mgcv_1.8-38
## [79] pillar_1.7.0 withr_2.4.3
## [81] KEGGREST_1.34.0 RCurl_1.98-1.5
## [83] tibble_3.1.6 crayon_1.4.2
## [85] KernSmooth_2.23-20 utf8_1.2.2
## [87] grid_4.1.2 data.table_1.14.2
## [89] blob_1.2.2 digest_0.6.29
## [91] xtable_1.8-4 httpuv_1.6.5
## [93] munsell_0.5.0 bslib_0.3.1