---
title: "Simulating Whole-Genome Inherited Bisulphite Sequencing Data"
author: Pascal Belleau, Astrid Deschênes and Arnaud Droit
output:
BiocStyle::html_document:
toc: true
vignette: >
%\VignetteIndexEntry{Simulating Whole-Genome Inherited Bisulphite Sequencing Data}
%\VignettePackage{methInheritSim}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r style, echo = FALSE, warning=FALSE, message=FALSE, results = 'asis'}
BiocStyle::markdown()
library(knitr)
```
**Package**: `r Rpackage("methInheritSim")`
**Authors**: `r packageDescription("methInheritSim")[["Author"]]`
**Version**: `r packageDescription("methInheritSim")$Version`
**Compiled date**: `r Sys.Date()`
**License**: `r packageDescription("methInheritSim")[["License"]]`
# Licensing
The `r Rpackage("methInheritSim")` package and the underlying
`r Rpackage("methInheritSim")` code
are distributed under the Artistic license 2.0. You are free to use and
redistribute this software.
# Introduction
DNA methylation plays an important role in the biology of tissue development
and diseases. High-throughput sequencing techniques enable genome-wide
detection of differentially methylated elements (DME), commonly sites (DMS) or
regions (DMR). The analysis of treatment effects on DNA methylation, from
one generation to the next (inter-generational) and across generations that
were not exposed to the initial environment (trans-generational) represent
complex designs. There are two main approaches to study the methylation
inheritance, the first
is based on segregation in pedigree while the second uses the intersection
between the DME of each generation (useful when pedigree is unknown). The
power and the false positve rate of those types of design are relatively
hard to evaluate.
We present a package that simulates the methylation inheritance. Using real
datasets, the package generates a synthetic chromosome by sampling regions.
Two different distributions are used to simulate the methylation level at
each CpG site: one for the DMS and one for all the other sites. The second
distribution takes advantage of parameters estimated using the control
datasets. The package also
offers the option to select the proportion of sites randomly fixed as DMS,
as well as, the fraction of the cases that inherited the DMS in the
subsequent generations.
The `r Rpackage("methInheritSim")` package generates simulated
multigenerational DMS datasets that are useful to evaluate the power and the
false discovery rate of experiment design analysis, such as the
`r Rpackage("methylInheritance")` package does.The multigenerational DMS
datasets can also be used to compare the efficiency of different inheritance
detection software.
# Loading methInheritSim package
As with any R package, the `r Rpackage("methInheritSim")` package should
first be loaded with the following command:
```{r loadingPackage, warning=FALSE, message=FALSE}
library(methInheritSim)
```
# Description of the simulation process
The first step of the simulation process is to create a synthetic chromosome
made up of methylated sites. The synthetic methylated sites (or CpG sites) are
generated
using a real dataset (**methData** parameter). The read dataset only needs to
contain methylation for controls on one generation; a real multigenerational
dataset is not needed.
Two parameters are critical during this process:
* **nbBlock**: The number of blocks randomly selected in the
real dataset genome.
* **nbCpG**: The number of consecutive methylated sites that must contain
each selected block.
Those two parameters unable to reproduce CpG islands of customizable size.
It also reproduces the relation between the methylation level and the
distance associated to adjacent methylated sites.
![Figure 1. Creation of a synthetic chromosome](syntheticChr.png "Synthetic Chr")
For each methylated site of the synthetic chromosome, the
alpha and beta parameters of a Beta distribution are estimated
from the mean and variance of
the proportion of C/T at the site of the real control dataset.
## Simulated control dataset
A Beta distribution is
used to simulate the proportion of C/T in the methylated sites of the
simulated control dataset.
Using the synthetic chromosome, DMS are randomly selected from the methylated
sites. The **rateDiff** parameter fixes the mean of the proportion sites that
are differentially methylated (DMS).
To recreate differentially methylated regions (DMR), the successors site
of a DMS, located within 1000 base pairs, has a
higher probability to be selected as a DMS.
The inheritance is done through the DMR. This means that when the following
generation inherits of a DMR region, it inherits all of the DMS present in the
region. The **propInherite** parameter fixes the proportion of DMR
that are inherited.
## Simulated case dataset
For the methylated sites in the F1 generation of the
simulated case dataset, a Beta distribution is used to simulate the
proportion of C/T. This is the exact same distribution as for the control
dataset.
A proportion of cases, fixed by the **vpDiff** parameter, are
selected to be have DMS. Those DMS are assigned an updated proportion of C/T
that follows a shifted Beta distribution with parameters estimated using
the mean of control $\pm$ **vDiff**. The **vpDiff** parameter is similar
to penetrance. Not all sites of the selected cases will have DMS, only a
proportion of those sites, as fixed by **rateDiff** than represent the mean
proportion of sites selected as DMS.
In the subsequent generation, only a proportion of the DMS present in the
initial simulated case dataset are selected to be
inherited. The proposition of inherited DMS is calculated as:
$$ \mathbf{vpDiff\ \times\ {vInheritance}^{number\ of\ generations\ after\ F2} }$$
The proportion of C/T of those selected inherited sites follows a shifted
Beta distribution with parameters estimated using mean of
control $\pm$ (**vDiff** x**propHetero**).
The **propHetero** is 0.5 if one of the parent is a control.
# Case study
## The simulated dataset
A dataset containing methylation data (6 cases and 6 controls) has been
generated using the `r Rpackage("methInheritSim")` package using a real
dataset from Rat experiment (the real dataset is not public yet, so we used a
simulation based on it). The data have been formated, using
the `r Rpackage("methylkit")` package, into a *methylBase* object
(using the `r Rpackage("methylkit")` functions: *filterByCoverage*,
*normalizeCoverage* and *unite*).
```{r caseStudy01, warning=FALSE, message=FALSE, collapse=TRUE}
## Load read DMS dataset (not in this case but normaly)
data(samplesForChrSynthetic)
## Print the first three rows of the object
head(samplesForChrSynthetic, n = 3)
```
## The simulation
The simulation is run using the **runSim** function.
The **outputDir** parameter fixes the directory where the results are stored.
```{r runSim01, warning=FALSE, message=FALSE, collapse=TRUE, cache=TRUE}
## Directory where the files related to the simulation will be saved
temp_dir <- "test_runSim"
## Run the simulation
runSim(methData = samplesForChrSynthetic, # The dataset use for generate
# the synthetic chr.
nbSynCHR = 1, # The number of synthetic chromosome
nbSimulation = 2, # The number of simulation for each parameter
nbBlock = 10, nbCpG = 20, # The number of site in the
# synthetic chr is nbBLock * nbCpG
nbGeneration = 3, # At least 2 generations must be present
vNbSample = c(3, 6), # The number of controls (= number of cases) in
# each simulation
vpDiff = c(0.9), # Mean proportion of samples with
# differentially methylated values
vpDiffsd = c(0.1), # Standard deviation associated to vpDiff
vDiff = c(0.8), # The shift of the mean of the C/T ratio in
# the differentially methylated sites
vInheritance = c(0.5), # The proportion of cases that inherit
# differentially methylated sites
propInherite = 0.3, # The proportion of diffementially methylated
# regions that are inherited
rateDiff = 0.3, # The mean frequency of the differentially
# methylated regions
minRate = 0.2, # The minimum rate for differentially
# methylated sites
propHetero = 0.5, # The reduction of vDiff for the following
# generations
keepDiff = FALSE, # When FALSE, the differentially methylated
# sites are the same in all simulations
outputDir = temp_dir, # Directory where files are saved
fileID = "S1",
runAnalysis = TRUE,
nbCores = 1,
vSeed = 32) # Fix seed to unable reproductive results
# The files generated
dir(temp_dir)
```
```{r removeFiles, warning=FALSE, message=FALSE, collapse=TRUE, echo=FALSE}
if (dir.exists(temp_dir)) {
unlink(temp_dir, recursive = TRUE, force = FALSE)
}
```
# Files generated by the simulation
Three types of files are generated by default:
1. Synthetic chromosome in **GRanges** format ("syntheticChr" prefix)
2. Simulation information in **GRanges** format ("simData" prefix)
3. Information about DMS state ("stateDiff" prefix)
## Synthetic chromosome in GRanges format ("syntheticChr" prefix)
The first type of files contains information about the synthetic chromosome.
This information is stored as a **GRanges** that contains the CpG
(or methylated sites).The **GRanges** has four metadata inherited from the
real dataset:
* **chrOri**, the chromosome from the real dataset
* **startOri**, the position of the site in the real dataset
* **meanCTRL**, the mean of the C/T proportion of the control in the real dataset
* **varCTRL**, the variance of the C/T proportion of the control in the real dataset
The file name is composed of those elements, separated by "_":
1. The string "syntheticChr"
2. The code of the simulation (the **fileID** parameter, ex: "S1"")
3. The chromosome number
4. The file extension ".rds"
An example of a valid file name: syntheticChr_S1_1.rds
```{r syntheticChr, warning=FALSE, message=FALSE, collapse=TRUE, cache=TRUE}
## The synthetic chromosome
syntheticChr <- readRDS("demo_runSim/syntheticChr_S1_1.rds")
## In GRanges format, only Cpg present
head(syntheticChr, n=3)
```
## Simulation information in GRanges format ("simData" prefix)
The second type of files contains information about the simulation
stored in a **GRanges** format. The **GRanges** object has four metadata
related to real dataset:
* **meanDiff**, the mean of the C/T proportion for the shifted distribution
* **meanCTRL.meanCTRL**, the mean of the C/T proportion for the control distribution
* **partitionCase**, the number of cases simulated with the shifted distribution
* **partitionCtrl**, the number of cases simulated with the control distribution
Plus a metadata for each sample (case or control):
* **case.V[number]** or **ctrl.V[number]** the simulated proportion of C/T
The file name is composed of those elements, separated by "_":
1. The string "simData"
2. The code of the simulation (the **fileID** parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the **nbSynCHR**)
4. The number of controls as specified by the **vNBSample** parameter
5. The value of the **vpDiff** parameter
6. The value of the **vDiff** parameter
7. The value of the **vInheritance** parameter
8. The ID of the simulation (between 1 and the value of the **nbSimulation**)
9. The file extension ".rds"
An example of a valid file name: simData_S1_1_3_0.9_0.8_0.5_1.rds
```{r simData, warning=FALSE, message=FALSE, collapse=TRUE, cache=TRUE}
#### The simulation dataset
simData <- readRDS("demo_runSim/simData_S1_1_3_0.9_0.8_0.5_1.rds")
#### Information for the first generation F1
head(simData[[1]], n=3)
#### Information for the second generation F2
head(simData[[2]], n=3)
```
## Information about DMS state ("stateDiff" prefix)
The third type of files contains a **list** with 2 entries. The first entry is
called **stateDiff** and contains a **vector** of **integer** (0 and 1) with
a length corresponding the length of **stateInfo** object. The **statDiff**
object indicates, using a 1, the positions where the CpG sites are
differentially methylated. The second entry is called **statInherite** and
contains a **vector** of **integer** (0 and 1) with a length corresponding
the length of **stateInfo**. The **statInherite** indicates, using a 1, the
positions where the CpG values are inherited.
The file name is composed of those elements, separated by "_":
1. The string "stateDiff"
2. The code of the simulation (the **fileID** parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the **nbSynCHR**)
4. The number of controls as specified by the **vNBSample** parameter
5. The value of the **vpDiff** parameter
6. The value of the **vDiff** parameter
7. The value of the **vInheritance** parameter
8. The ID of the simulation (between 1 and the value of the **nbSimulation**)
9. The file extension ".rds"
An example of a valid file name: stateDiff_S1_1_3_0.9_0.8_0.5_1.rds
```{r stateDiff, warning=FALSE, message=FALSE, collapse=TRUE, cache=TRUE}
#### The DMS state information
stateDiff <- readRDS("demo_runSim/stateDiff_S1_1_3_0.9_0.8_0.5_1.rds")
#### In stateDiff, the position of DMS is indicated by 1
#### in stateInherite, the position of inherited DMS is indicated by 1
head(stateDiff)
```
## Files related to **saveGRanges** parameter
When **saveGRanges** parameter is **TRUE**, the package saves two extra types
of files:
1. Raw methylation data for all samples in **GRanges** format ("methylGR" prefix)
2. Information about controls and cases ("treatment" prefix)
### Raw methylation data for all samples in **GRanges** format ("methylGR" prefix)
The first type of files is generated for each simulation and contains
a **list** of **GRangesList**. The length of the **list** corresponds to
the number of generations (as specified by the **nbGeneration** paramater).
The generations are stored in order (first entry = first generation, second
entry = second generation, etc..). All samples related to one generations are
stored in a **GRangesList** object. The **GRangesList** object contains
a **list** of **GRanges**. Each **GRanges** stores the raw methylation data
of one sample.
There is one file per simulation. The file name is composed of those
elements, separated by "_":
1. The string "methylGR"
2. The code of the simulation (the **fileID** parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the **nbSynCHR**)
4. The number of controls as specified by the **vNBSample** parameter
5. The value of the **vpDiff** parameter
6. The value of the **vDiff** parameter
7. The value of the **vInheritance** parameter
8. The ID of the simulation (between 1 and the value of the **nbSimulation**)
9. The file extension ".rds"
An example of a valid file name: methylGR_S1_1_3_0.9_0.8_0.5_1.rds
```{r methylGR, warning=FALSE, message=FALSE, collapse=TRUE, cache=TRUE}
#### The raw methylation data in GRanges
methylGR <- readRDS("demo_runSim/methylGR_S1_1_3_0.9_0.8_0.5_1.rds")
#### The third sample of the first generation
head(methylGR[[1]][[3]], n = 3)
#### The fourth sample of the third generation
head(methylGR[[3]][[4]], n = 3)
```
### Information about controls and cases ("treatment" prefix)
The second type of files contains a numeric **vector** denoting controls and
cases (controls = 0 and cases = 1). One file is generated for each entry
in the **vNbSample** vector parameter.
The file name is composed of those elements, separated by "_":
1. The string "treatment"
2. The code of the simulation (the **fileID** parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the **nbSynCHR**)
4. The number of controls as specified by the **vNBSample** parameter
5. The file extension ".rds"
An example of a valid file name: treatment_S1_1_3.rds
```{r treatment, warning=FALSE, message=FALSE, collapse=TRUE, cache=TRUE}
#### The information about controls and cases
treatment <- readRDS("demo_runSim/treatment_S1_1_3.rds")
#### 0 = control, 1 = case, length = number of samples
head(treatment)
```
## Files related to **saveMethylKit** parameter
When **saveMethylKit** is **TRUE**, one extra file is saved for each
generation:
1. Raw methylation data in **methylRaw** format ("methylObj" prefix)
### Raw methylation data in **methylRaw** format ("methylObj" prefix)
The file contains the raw methylation information from the simulated dataset
formated into **S4 methylRaw** objects using `r Biocpkg("methylKit")` package.
All samples related to the same generation are contained in a
**S4 methylRawList** object that is present inside a **list**. The length of
the **list** corresponds to the number of generations. The generations
are stored in order (first entry = first generation, second entry = second
generation, etc..). The **S4 methylRawList** object contains two Slots:
1. **treatment**: A numeric vector denoting controls and cases.
2. **.Data**: A list of **methylRaw** objects. Each object stores the raw
methylation data of one sample.
There is one file per simulation. The file name is composed of those
elements, separated by "_":
1. The string "methylObj"
2. The code of the simulation (the **fileID** parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the **nbSynCHR**)
4. The number of controls as specified by the **vNBSample** parameter
5. The value of the **vpDiff** parameter
6. The value of the **vDiff** parameter
7. The value of the **vInheritance** parameter
8. The ID of the simulation (between 1 and the value of the **nbSimulation**)
9. The file extension ".rds"
An example of a valid file name: methylObj_S1_1_3_0.9_0.8_0.5_1.rds
```{r methylObj, warning=FALSE, message=FALSE, collapse=TRUE, cache=TRUE}
## The raw methylation data
methylObj <- readRDS("demo_runSim/methylObj_S1_1_3_0.9_0.8_0.5_1.rds")
#### The third sample of the first generation
head(methylObj[[1]][[3]], n = 3)
#### The fourth sample of the third generation
head(methylObj[[3]][[4]], n = 3)
```
## Files related to **runAnalysis** parameter
When **runAnalysis** is **TRUE**, two extra files are saved for each
simulation:
1. Methylation events present in multiple samples in **methylBase** format
("meth" prefix)
2. Differential methylation statistics in **methylDiff** format
("methDiff" prefix)
### Methylation events present in multiple samples in **methylBase** format ("meth" prefix)
The first file contains the simulated dataset formated with
the `r Biocpkg("methylKit")` package into a **S4 methylBase** object. The
transformation is made using the `r Biocpkg("methylKit")` functions:
filterByCoverage(), normalizeCoverage() and unite(). Each simulation
has it own file. Only sites having minimum reads alignment in all
samples are present in the file.
The file name is composed of those elements, separated by "_":
1. The string "meth"
2. The code of the simulation (the **fileID** parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the **nbSynCHR**)
4. The number of controls as specified by the **vNBSample** parameter
5. The value of the **vpDiff** parameter
6. The value of the **vDiff** parameter
7. The value of the **vInheritance** parameter
8. The ID of the simulation (between 1 and the value of the **nbSimulation**)
9. The file extension ".rds"
An example of a valid file name: meth_S1_1_3_0.9_0.8_0.5_1.rds
```{r meth, warning=FALSE, message=FALSE, collapse=TRUE, cache=TRUE}
#### The methylation events present in multiple samples
meth <- readRDS("demo_runSim/meth_S1_1_3_0.9_0.8_0.5_1.rds")
#### Information for all samples in the first generation
head(meth[[1]], n = 3)
```
### Differential methylation statistics in **methylDiff** format ("methDiff" prefix)
The second file contains the result of the differential methylation calculation
done on the simulated dataset. Each generation of the dataset is analysed
separately using the calculateDiffMeth() function of the
`r Biocpkg("methylKit")` package. A **S4 methylDiff** object is created for
each generation and is stored in the file inside a **list** (first entry =
first generation, second entry = second generation, etc...).
The file name is composed of those elements, separated by "_":
1. The string "methDiff"
2. The code of the simulation (the **fileID** parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the **nbSynCHR**)
4. The number of controls as specified by the **vNBSample** parameter
5. The value of the **vpDiff** parameter
6. The value of the **vDiff** parameter
7. The value of the **vInheritance** parameter
8. The ID of the simulation (between 1 and the value of the **nbSimulation**)
9. The file extension ".rds"
An example of a valid file name: methDiff_S1_1_3_0.9_0.8_0.5_1.rds
```{r methDiff, warning=FALSE, message=FALSE, collapse=TRUE, cache=TRUE}
#### The differential methylation statistics
methDiff <- readRDS("demo_runSim/methDiff_S1_1_3_0.9_0.8_0.5_1.rds")
#### Information for the first generation
head(methDiff[[1]], n = 3)
```
# Conclusion
The `r Rpackage("methInheritSim")` package generates simulated
multigenerational DMS datasets. Several simulator parameters can be derived
from real dataset provided by the user in order to replicate realistic
case-control scenarios.
The results of a simulation could be analysed, using the
`r Rpackage("methylInheritance")` package, to evaluate the power and the
false discovery rate of an experiment design.