---
title: "SIMD Tutorial"
author: "Yan Zhou"
package: SIMD
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{SIMD Tutorial}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---


# 1 Introduction
This package provides a inferential analysis method 
for detecting differentially expressed CpG sites in MeDIP-seq data. It uses 
statistical framework and EM algorithm, to identify differentially expressed 
CpG sites. The methods on this package are described in the article 
*Methylation-level Inferences and Detection of Differential Methylation 
with Medip-seq Data* by Yan Zhou, Jiadi Zhu, Mingtao Zhao, Baoxue Zhang, 
Chunfu Jiang and Xiyan Yang (2018, pending publication).

SIMD method is developed for jointly analyzing MeDIP-seq and MRE-seq data, 
which are derived from methylated DNA immunoprecipitation (MeDIP) experiments
followed by sequencing (MeDIP-seq) and methyl-sensitive restriction enzymes
experiments for unmethylated CpGs (MRE-seq). We have implemented the
SIMD method via a set of R functions with the computational intensive parts
written in C. We make a R package named SIMD, which is the abbreviation of 
Statistical Inferences with MeDIP-seq Data, and give a tutorial for
the package. The method consist three steps.

Step 1: Data Pre-processing;  

* Calculate the CpG count.   
* Calculate the MRE CpG count.  
* Calculate MeDIP-seq tag count of each site of control and treatment samples.  
* Calculate MRE-seq tag count of each site of control and treatment samples.

Step 2: Calculating p-values of each CpG site by the EM algorithm;

Step 3: Select Significants.

We use a part of a real dataset to illustrate the usage of the SIMD package.
The programs can run under the Linux system and windows 10 system. The
R versions should larger than 3.5.0.

# 2 Preparations
Before installing SIMD package, the user must install three other R pack-
ages, which can be done using the following commands:
```{r, eval=FALSE}
install.packages("BiocManager")
BiocManager::install(c('edgeR', 'statmod','methylMnM'))

library(edgeR)
library(statmod)
library(methylMnM)
```
Next, to install the SIMD package into your R environment, start R and
enter:
```{r, eval=FALSE}
BiocManager::install("SIMD")
```
Then, the SIMD package is ready to load.
```{r}
library(SIMD)
```

# 3 Data format
In order to reproduce the presented SIMD workflow, the package includes the 
example data sets, which is a part of a real data sets, including:  

**all_CpGsite_bin_chr18**,  
**three_mre_cpg**,  
**EM2_H1ESB1_MeDIP_sigleCpG**,  
**EM2_H1ESB2_MeDIP_sigleCpG**,  
**H1ESB1_MRE_sigleCpG**,  
and **H1ESB2_MRE_sigleCpG**, which is included in file example_data.RDdata in 
the data subdirectory of the SIMD package.

The files contain genomic regions from chromosome 18 only, as covered by
short reads obtained from a MeDIP and MRE experiment of human H1ESB1
cells.

All output files are in a .bed format. The input file contain the
following columns.  

* the first coulumn is of type character and contains the chromosome of the
region (e.g. chr1)  
* the second column is of type numeric and contains the start position of
the mapped read  
* the third column is of type numeric and contains the stop position of the
mapped read  
* the fourth column is of type character and contains the strand information
of the mapped read  

For MRE-seq data, we need "+" and "-" strand information in mapping
process, which is general located at the sixth column of the input file. Each 
row represents a mapped read. These information can be extracted from the output
file(s) of common mapping tools.

# 4 Data Pre-processing
The SIMD program requires users to provide the genome-wide MeDIP-seq
and MRE-seq reads and the reference genome. The pre-processing step involves
calculation of MeDIP-seq and MRE-seq count and CpG and MRE-specific count in 
each CpG site. The mainly steps please refer the data pre-processing steps of 
R package MethylMnM. 

In this manual, we use example data located in the data subdirectory of the 
SIMD package, and the example data is processed by MethylMnM package in 
advance. The path is: 
```{r}
datafile <- system.file("extdata", package="methylMnM")
filepath <- datafile[1]
```

The CpG count, MRE-CpG count, MeDIP-seq count and MRE-seq count of each site 
are stored at: 
```{r}
dirwrite <- paste(setwd(getwd()), "/", sep="")
```

Then we compute the number of actual short reads in each CpG site by the 
function of *EMalgorithm()*, which use EM algorithm to infer the actual reads 
by the observation fragments. The results are saved in file *writefile* and 
*reportfile*. We give an example as follows, the data is from data subdirectory 
of the SIMD package:
```{r}
data(example_data)
allcpgfile <- EM_H1ESB1_MeDIP_sigleCpG
readshort <- paste(filepath, "/H1ESB1_MeDIP_18.extended.txt", sep="")
writefile <- paste(dirwrite, "EM2_H1ESB1_MeDIP_sigleCpG.bed", sep="")
reportfile <- paste(dirwrite, "EM2_H1ESB1_MeDIP_sigleCpG_report.bed", sep="")
EMalgorithm(cpgsitefile=readshort, allcpgfile=allcpgfile, category="1",
            writefile=writefile, reportfile=reportfile)
```

# 5 p-values of SIMD test
In order to detect different methylated CpG sites, we should calculate p-value 
of each site. The below codes are calculate p-value of each site with function 
*EMtest()*. The input files which include "datafile", "cpgfile" and 
"mrecpgfile" are should have been generated by Step 1. The output file 
"writefile" will owneleven columns, that is, "*chr*", "*chrSt*","*chrEnd*",
"*Medip1*","*Medip2*", "*MRE1*", "*MRE2*", "*cg*","*mrecg*","*pvalue*","*Ts*". 
We also output a reportfile which will include parameters "*s1/s2*"; "*s3/s4*"; 
"*N1*"; "*N2*"; "*N3*"; "*N4*"; "*c1*"; "*c2*" and "*Spend time*".

```{r}
data(example_data)
data1 <- EM2_H1ESB1_MeDIP_sigleCpG
data2 <- EM2_H1ESB2_MeDIP_sigleCpG
data3 <- H1ESB1_MRE_sigleCpG
data4 <- H1ESB2_MRE_sigleCpG
datafile <- cbind(data1,data2,data3,data4)
allcpg <- all_CpGsite_bin_chr18
mrecpg <- three_mre_cpg
dirwrite <-paste(setwd(getwd()), "/", sep="")

writefile <- paste(dirwrite, "pval_EM_H1ESB1_H1ESB21.bed", sep="")
reportfile <- paste(dirwrite, "report_pvalH1ESB1_H1ESB21.bed", sep="")

EMtest(datafile=datafile, chrstring=NULL, cpgfile=allcpg,mrecpgfile=mrecpg, 
       writefile=writefile, reportfile=reportfile,mreratio=3/7, psd=2, 
       mkadded=1, f=1)
```

# 6 Select Significants
After getting p-values, we choose differentially expressed methylated CpG sites
with p-values. The input file is the file which have been generated by Step 2. 
Then we assume the sites to be differentially expressed methylated CpGs when
p-values less than pre-setting cutoffs, such as $10^{-3}$ or $10^{-5}$.