%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{ELMER: Inferring Regulatory Element Landscapes and Transcription Factor Networks Using Methylomes}
% !Rnw weave = knitr
\documentclass{article}
<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{grffile}
\usepackage{float}
\usepackage{amsmath,epsfig,fullpage}
\usepackage{hyperref}
\usepackage{url}
\usepackage[authoryear,round]{natbib}
%\usepackage[OT1]{fonitenc}
\usepackage{textcomp}
%\\newcommand{\Rpackage}[1]{{\textit{#1}}}


\title{ELMER: An R/Bioconductor Tool \\
Inferring Regulatory Element Landscapes and Transcription Factor Networks Using Methylomes}
\author {Lijing Yao [cre, aut], 
Ben Berman [aut], 
Peggy Farnham [aut]\\ 
Hui Shen [ctb], 
Peter Laird [ctb], 
Simon Coetzee [ctb]}

\begin{document}
%\SweaveOpts{concordance=TRUE}
% #' <<setup, include=FALSE, cache=FALSE>>=
% #' library(knitr)
% #' opts_chunk$set(fig.align='center', par=TRUE)
% #' knit_hooks$set(par=function(before, options, envir){
% #' if (before && options$fig.show!='none') par(mar=c(1,1,.1,.1),cex.lab=1,cex.axis=1,mgp=c(2,.7,0),tcl=-.3)
% #' }, crop=hook_pdfcrop)
% #' @
\maketitle

\tableofcontents
\section{Introduction}
This document provides an introduction of the \Rpackage{ELMER}, which is designed 
to combine DNA methylation and gene expression data from human tissues to infer 
multi-level cis-regulatory networks. \Rpackage{ELMER} uses DNA methylation to 
identify enhancers, and correlates enhancer state with expression of nearby genes 
to identify one or more transcriptional targets. Transcription factor (TF) binding 
site analysis of enhancers is coupled with expression analysis of all TFs to 
infer upstream regulators. This package can be easily applied to TCGA public 
available cancer data sets and custom DNA methylation and gene expression data sets.

ELMER analyses have 5 main steps: 

1. Identify distal enhancer probes on HM450K.\\
2. Identify distal enhancer probes with significantly different DNA methyaltion level
in control group and experiment group.\\
3. Identify putative target genes for differentially methylated distal enhancer probes.\\
4. Identify enriched motifs for the distal enhancer probes which are significantly 
differentially methylated and linked to putative target gene.\\
5. Identify regulatory TFs whose expression associate with DNA methylation at motifs.\\


\subsection{Installing and loading ELMER}
To install this package, start R and enter
<<installing, eval=FALSE,hide=TRUE, message=FALSE,include=FALSE>>=
install.packages(devtools)
library(devtools);
devtools::install_github("lijingya/ELMER");
@
<<install, eval=FALSE>>=
source("http://bioconductor.org/biocLite.R")
biocLite("ELMER")
@

\section{Download example data}
The following steps can be used to download the example data set for \Rpackage{ELMER}.
<<example.data.run, echo=FALSE, hide=TRUE, message=FALSE, include=FALSE>>=
if(!file_test("-d", "./ELMER.example")) {
  if(Sys.info()["sysname"] == "Windows") mode <- "wb" else  mode <- "w"
  URL <- "https://www.dropbox.com/s/mxf2u1wcauenx8m/ELMER.example.zip?raw=1"
  downloader::download(URL,destfile = "ELMER.example.zip",mode= mode)
  utils::unzip("ELMER.example.zip")
}
library(ELMER, quietly = TRUE)
@

<<example.data, eval=FALSE>>=
#Example file download from URL: https://dl.dropboxusercontent.com/u/61961845/ELMER.example.tar.gz
URL <- "https://dl.dropboxusercontent.com/u/61961845/ELMER.example.tar.gz"
download.file(URL,destfile = "ELMER.example.tar.gz",method= "curl")
untar("./ELMER.example.tar.gz")
library(ELMER)
@


\section{Quick start: running TCGA example}

TCGA.pipe is a function for easily downloading TCGA data and performing all 
the analyses in ELMER. For illustration purpose, we skip the downloading step. 
The user can use the getTCGA function to download TCGA data or
use TCGA.pipe by including "download" in the analysis option.

The following command will do distal enhancer DNA methylation analysis and predict putative
target genes, motif analysis and idnetify regulatory transcription factors.
<<tcga.pipe,eval=FALSE>>=
TCGA.pipe("LUSC",wd="./ELMER.example",cores=detectCores()/2,permu.size=300,Pe=0.01,
          analysis = c("distal.probes","diffMeth","pair","motif","TF.search"),
          diff.dir="hypo",rm.chr=paste0("chr",c("X","Y")))
@


\section{Input data}

MEE.data object is the input for all main functions of \Rpackage{ELMER}. MEE.data 
object can be generated by fetch.mee function. In order to perform all analyses
in \Rpackage{ELMER}. MEE.data needs at least 4 files: a matrix of DNA methylation 
from HM450K platform for multiple samples; a matrix of gene expression for the same samples; 
a GRanges object containing the information for probes on HM450K such as names 
and coordinates; a gene annotation which is also a GRanges object. When TCGA 
data are used, the sample information will be automatically generated by fetch.mee 
function. However sample information should be provided when using custom data.

\subsection{DNA methylation data}

DNA methyaltion data feeding to \Rpackage{ELMER} should be a matrix of DNA methylation
beta ($\beta$) value for samples (column) and probes (row) processed from row HM450K 
array data. If TCGA data were used, level 3 processed data from TCGA website will be downloaded  
and automatically transformed to the matrix by \Rpackage{ELMER}. The level 3 TCGA 
DNA methylation data were calculated as (M/(M+U)), where M represents the methylated 
allele intensity and U the unmethylated allele intensity. Beta values range from 0 to 1, 
reflecting the fraction of methylated alleles at each CpG in the each tumor; beta values 
close to 0 indicates low levels of DNA methylation and beta values close to 1 
indicates high levels of DNA methylation.

If user have raw HM450K data, these data can be processed by \Rpackage{Methylumi} 
or \Rpackage{minfi} generating DNA methylation beta ($\beta$) value for each CpG site 
and multiple samples. The getBeta function in \Rpackage{minfi} can be used to generate a matrix 
of DNA methylation beta ($\beta$) value to feed in \Rpackage{ELMER}. And we recommend to 
save this matrix as meth.rda since fetch.mee can read in files by specifying files' 
path which will help to reduce memory usage.

<<dna.methylation.data>>=
load("./ELMER.example/Result/LUSC/LUSC_meth_refined.rda")
Meth[1:10, 1:2]
@

\subsection{Gene expression data}

Gene expresion data feeding to \Rpackage{ELMER} should be a matrix of gene expression
values for samples (column) and genes (row). Gene expression value can be generated
from different platforms: array or RNA-seq. The row data should be processed by other
software to get gene or transcript level gene expression calls such as mapping by tophat, 
calling expression value by cufflink, RSEM or GenomeStudio for expression array. 
It is recommended to normalize expression data making gene expression 
comparable across samples such as quantile normalization. User can refer TCGA RNA-seq
analysis pipeline to do generate comparable gene expression data. Then transform the 
gene expression values from each sample to the matrix for feeding into \Rpackage{ELMER}.
If users want to use TCGA data, \Rpackage{ELMER} has functions to download the level 3 
RNA-seq Hiseq data from TCGA website and transform the data to the matrix for feeding 
into \Rpackage{ELMER}. It is recommended to save this matrix as RNA.rda since fetch.mee
can read in files by specifying the path of files which will help to reduce memory usage.

<<gene.expression.data>>=
load("./ELMER.example/Result/LUSC/LUSC_RNA_refined.rda")
GeneExp[1:10, 1:2]
@


\subsection{Sample information}

Sample information should be stored as a data.frame object containing sample ID, 
group labels (control and experiment). Sample ID and groups labels are required.
Other information for each sample can be added to this data.frame object.
When TCGA data were used, group labels (control and experiment) will be automatically generated by 
fetch.mee function by specifying option TCGA=TRUE.

<<sample.information>>=
mee <- fetch.mee(meth=Meth, exp=GeneExp, TCGA=TRUE)
head(getSample(mee))
@

\subsection{Probe information}

Probe information should be stored as a GRanges object containing the coordinates 
of each probe on the DNA methylation array and names of each probe. 
The default probe information is for HM450K fetching from 
\Rpackage{IlluminaHumanMethylation450kanno.ilmn12.hg19} 

<<probe.information>>=
probe <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19, what="Locations")
probe <- GRanges(seqnames=probe$chr,
                 ranges=IRanges(probe$pos,
                                width=1,
                                names=rownames(probe)),
                 strand=probe$strand,
                 name=rownames(probe))
mee <- fetch.mee(meth=Meth, exp=GeneExp, TCGA=TRUE, probeInfo=probe)
getProbeInfo(mee)
@

\subsection{Gene information}

Gene information should be stored as a GRanges object containing coordinates of 
each gene, gene id, gene symbol and gene isoform id. The default gene information 
is the UCSC gene annotation fetching from \Rpackage{Homo.sapiens} by \Rpackage{ELMER} 
function: txs. 


<<gene.information>>=
geneAnnot <- txs()
## In TCGA expression data, geneIDs were used as the rowname for each row. However, numbers 
## can't be the rownames, "ID" was added to each gene id functioning as the rowname.
## If your geneID is consistent with the rownames of the gene expression matrix, adding "ID" 
## to each geneID can be skipped.
geneAnnot$GENEID <- paste0("ID",geneAnnot$GENEID)            
geneInfo <- promoters(geneAnnot,upstream = 0, downstream = 0)
save(geneInfo,file="./ELMER.example/Result/LUSC/geneAnnot.rda")
mee <- fetch.mee(meth=Meth, exp=GeneExp, TCGA=TRUE, geneInfo=geneInfo)
getGeneInfo(mee)
@

\subsection{MEE.data}

MEE.data object is the input for multiple main functions of \Rpackage{ELMER}. MEE.data contains
the above 5 components and making MEE.data object by fetch.mee function will keep each
component consistent with each other. For example, althougth DNA methylation and gene 
expression matrixes have different rows (probe for DNA methylation and geneid for gene
expression), the column (samples) order should be same in the two matrixes. fetch.mee 
function will keep them consistent when it generates the MEE.data object.


<<mee.data>>=
mee <- fetch.mee(meth=Meth, exp=GeneExp, TCGA=TRUE, probeInfo=probe, geneInfo=geneInfo)
mee
@

\section{Illustration of ELMER analysis steps}

The example data set is a subset of chromosome 1 data from TCGA LUSC.
ELMER analysis have 5 main steps which are shown below individually. TCGA.pipe 
introduced above is a pipeline combining all 5 steps and producing all
results and figures. 

\subsection{Selection of probes within biofeatures}

This step is to select HM450K probes, which locate far from TSS (at least 2Kb away) 
and within distal enhancer regions. These probes are called distal enhancer probes.
For distal enhancer regions, we constructed a comprehensive list of putative enhancers 
combining REMC, ENCODE and FANTOM5 data. Enhancers were identified using ChromHMM 
for 98 tissues or cell lines from REMC and ENCODE project and we used the union of 
genomic elements labeled as EnhG1, EnhG2, EnhA1 or EnhA2 (representing intergenic and 
intragenic active enhancers) in any of the 98 cell types, resulting in a total of 
389,967 non-overlapping enhancer regions. Additionally, FANTOM5 enhancers (43,011) 
identified by eRNAs for 400 distinct cell types were added to this list. 
Be default, this comprehensive list of putative enhancer and TSS annotated by GENCODE V15
and UCSC-gene will be used to select distal enhancer probes. But user can use their 
own TSS annotation or features such as H3K27ac ChIP-seq in a certain cell line.


<<selection.of.probes.within.biofeatures>>=
#get distal enhancer probes that are 2kb away from TSS and overlap with REMC and FANTOM5 
#enhancers on chromosome 1
Probe <- get.feature.probe(rm.chr=paste0("chr",c(2:22,"X","Y")))
save(Probe,file="./ELMER.example/Result/LUSC/probeInfo_feature_distal.rda")
@


\subsection{Identifying differentially methylated probes}

This step is to identify DNA methylation changes at distal enhancer probes which is
carried out by function get.diff.meth.    

For each enhancer probe, function first ranked experiment samples and control samples 
by their DNA methylation beta values. To identify hypomethylated probes, function 
compared the lower control quintile (20\% of control samples with the lowest methylation) 
to the lower experiment quintile (20\% of experiment samples with the lowest methylation), 
using an unpaired one-tailed t-test. Only the lower quintiles were used because we 
did not expect all cases to be from a single molecular subtype, and we sought to 
identify methylation changes within cases from the same molecular subtype. 20\% (i.e. a quintile) 
was picked as a cutoff to include high enough sample numbers to yield t-test p-values 
that could overcome multiple hypothesis correction, yet low enough to be able to 
capture changes in individual molecular subtypes occurring in 20\% or more of the cases.
This number can be set arbitrarily as an input to the get.diff.meth function in the 
\Rpackage{ELMER}, and should be tuned based on sample sizes in individual studies. 
The one tailed t-test was used to rule out the null hypothesis: $\mu experiment \geq \mu control$, 
where $\mu$experiment is the mean methylation within the lowest experiment quintile and $\mu$control 
is the mean within the lowest control quintile. Raw p-values were adjusted for multiple
hypothesis testing using the Benjamini-Hochberg method, and probes were selected 
when they had adjusted p-value less than 0.01. For additional stringency, probes 
were only selected if the methylation difference: $\Delta = \mu experiment - \mu control$ was 
greater than 0.3. The same method was used to identify hypermethylated probes, 
except we used upper experiment quintile and upper control quintile, and chose 
the opposite tail in the t-test. 

If save parameter of get.diff.meth is true, two cvs files will be saved. If false,
a data frame with the same content as the second file mentioned below will be reported.

The first file contains all statistic results for all probes which were fed into the 
function. Based on this file, user can change different P value or sig.dir cutoff 
to select the significant results without redo the analysis.

The second file contains statistic results for the probes that pass the significant
criteria (P value and sig.dir).

Both files contain four columns: probe, pvalue, ExperimentMinControl, adjust.p.

1. probe: the name of probes.\\
2. pvalue: the raw P value from t-test.\\
3. ExperimentMinControl:  methylation difference $\Delta$.\\
4. adjust.p: adjusted P value for t-test.\\
<<identifying.differentially.methylated.probes, eval=FALSE>>=
## fetch.mee can take path as input.
mee <- fetch.mee(meth="./ELMER.example/Result/LUSC/LUSC_meth_refined.rda",
                 exp="./ELMER.example/Result/LUSC/LUSC_RNA_refined.rda", TCGA=TRUE, 
                 probeInfo="./ELMER.example/Result/LUSC/probeInfo_feature_distal.rda", 
                 geneInfo="./ELMER.example/Result/LUSC/geneAnnot.rda") 

sig.diff <- get.diff.meth(mee, cores=detectCores()/2, dir.out ="./ELMER.example/Result/LUSC", 
                          diff.dir="hypo", pvalue = 0.01)


sig.diff[1:10,]   ## significantly hypomethylated probes

# get.diff.meth automatically save output files. 
# getMethdiff.hypo.probes.csv contains statistics for all the probes.
# getMethdiff.hypo.probes.significant.csv contains only the significant probes which
# is the same with sig.diff
dir(path = "./ELMER.example/Result/LUSC", pattern = "getMethdiff")  
@


\subsection{Identifying putative probe-gene pairs}

This step is to link enhancer probes with mehtylation changes to target genes with expression
changes and report the putative target gene for selected probes. This is carried out
by function get.pair.

For each enhancer probe with differential methylation, the closest 10 upstream 
genes and the closest 10 downstream genes were tested for correlation between 
methylation of the probe and expression of the gene. To select these genes, 
the probe-gene distance was defined as the distance from the probe to a transcription 
start site specified by the UCSC gene annotation. Thus, exactly 20 statistical 
tests were performed for each probe, as follows. For each probe-gene pair, 
the samples (all experiment samples and control samples) were divided into two 
groups: the M group, which consisted of the upper methylation quintile (the 20\% 
of samples with the highest methylation at the enhancer probe), and the U group, 
which consisted of the lowest methylation quintile (the 20\% of samples with the 
lowest methylation.) The 20\%ile cutoff is a configurable parameter in the get.pair 
Default is 20\% as a balance, which would allow us to identify changes in a 
molecular subtype making up a minority (i.e. 20\%) of cases, while also yielding 
enough statistical power to make strong predictions. For each candidate probe-gene pair, 
the Mann-Whitney U test was used to test the null hypothesis that overall gene 
expression in group M was greater or equal than that in group U. This non-parametric 
test was used in order to minimize the effects of expression outliers, which can 
occur across a very wide dynamic range. For each probe-gene pair tested, the 
raw p-value Pr was corrected for multiple hypothesis using a permutation approach 
as follows (implemented in the get.permu function of the \Rpackage{ELMER} package).
The gene in the pair was held constant, and x random methylation probes were 
used to perform the same one-tailed U test, generating a set of x permutation
p-values (Pp). We chose the x random probes only from among those that were 
“distal” (greater than 2kb from an annotated transcription start site), in order 
to make these null-model probes qualitatively similar to the probe being tested. 
An empirical p-value Pe value was calculated using the following formula 
(which introduces a pseudo-count of 1):

$$Pe = \frac{num(Pp \leq Pr)+ 1}{x+1}$$

This step is the most time consuming step since it requires a large amount calculations 
for permutation. The greater the permutation time is, the longer it will take. 
It is recommended to use multiple cores for this step. Default permutation time 
is 1000 which may need 12 hrs by 4 cores. However 10,000 permutations is recommended 
if high confidence results are desired but it may cost 2 days.

If save paramter of get.pair function is true, two cvs files will be output. 
If save parameter is false, a data frame with the same contect as the second file
mentioned as below will be output.

The first file contains all statistic results for all probe-gene pairs. Based on this file, user 
can change different P value or sig.dir cutoff to select the significant results
without redo the analysis.

The second file contains statistic results for the probes that pass the significant criteria (Pe).

Both files contain four columns: probe, GeneID, Symbol, Distance, Sides, Raw.p, Pe.

1. Probe: the name of probes.\\
2. GeneID and Symbol is for the genes which are linked to the probe.\\
3. Distance: the distance between the probe and the gene.\\
4. Sides: right (R) side or left (L) side of probe and the rank based on distance.
For example, L3 means the gene is the number 3 closest gene from the left side of the probe.\\
5. Raw.p:  P value from the Mann-Whitney U test for each pair.\\
6. Pe: the empirical P value for each pair.\\


<<identifying.putative.probe.gene.pairs, eval=FALSE>>=
### identify target gene for significantly hypomethylated probes.

Sig.probes <- read.csv("./ELMER.example/Result/LUSC/getMethdiff.hypo.probes.significant.csv",
                       stringsAsFactors=FALSE)[,1]  
head(Sig.probes)  # significantly hypomethylated probes

## Collect nearby 20 genes for Sig.probes
nearGenes <-GetNearGenes(TRange=getProbeInfo(mee,probe=Sig.probes),
                         geneAnnot=getGeneInfo(mee),cores=detectCores()/2)

## Identify significant probe-gene pairs
Hypo.pair <-get.pair(mee=mee,probes=Sig.probes,nearGenes=nearGenes,
                     permu.dir="./ELMER.example/Result/LUSC/permu",permu.size=200,Pe = 0.01,
                     dir.out="./ELMER.example/Result/LUSC",cores=detectCores()/2,label= "hypo")

head(Hypo.pair)  ## significant probe-gene pairs

# get.pair automatically save output files. 
#getPair.hypo.all.pairs.statistic.csv contains statistics for all the probe-gene pairs.
#getPair.hypo.pairs.significant.csv contains only the significant probes which is 
# same with Hypo.pair.
dir(path = "./ELMER.example/Result/LUSC", pattern = "getPair") 
@


\subsection{Motif enrichment analysis on the selected probes}

This step is to identify enriched motif in a set of probes which is carried out by
function get.enriched.motif.

The build in data Probes.motif is generated using FIMO with a p-value < 1e–4 
to scan a +/- 100bp region around each probe using Factorbook motif position weight 
matrices (PWMs) and Jasper core human motif PWMs generated from the R package MotifDb. 
For each probe set tested (i.e. the list of gene-linked hypomethylated probes in a given experiment 
group), a motif enrichment Odds Ratio and a 95\% confidence interval were 
calculated using following formulas:
$$ p= \frac{a}{a+b} $$
$$ P= \frac{c}{c+d} $$
$$ Odds\quad  Ratio = \frac{\frac{p}{1-p}}{\frac{P}{1-P}} $$
$$ SD = \sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}} $$
$$ lower\quad boundary\quad of\quad  95\%\quad  confidence\quad  interval = \exp{(\ln{OR}-SD)} $$

where a is the number of probes within the selected probe set that contain one 
or more motif occurrences; b is the number of probes within the selected probe 
set that do not contain a motif occurrence; c and d are the same counts within 
the entire enhancer probe set. A probe set was considered significantly enriched 
for a particular motif if the 95\% confidence interval of the Odds Ratio was 
greater than 1.1 (specified by option lower.OR, 1.1 is default), and the motif 
occurred at least 10 times (specified by option min.incidence. 10 is default) in 
the probe set. As described in the text, Odds Ratios were also used for ranking 
candidate motifs. 

There will be two results if save parameter of get.enriched.motif is true. When save is false, 
only second result mentioned below will be reported.

The first one is a csv file. This file contains the Odds Ratio and 95\% confidence 
interval for these Odds Ratios which pass the signficant cutoff (lower.OR and 
min.incidence). It contains 5 columns: motif, NumOfProbes, OR, lowerOR and upperOR.

1. motif: the name of motif.\\
2. NumOfProbes: the number of probes with this motif in the given set of probes
corresponding to min.incidence option.\\
3. OR: the Odds Ratio.\\
4. lowerOR: the lower boundary of 95\% confidence interval.\\
5. upperOR: the upper boundary of 95\% confidence interval.

The second file is a rda file listing each enriched motif and the probes containing
the motif.

<<motif.enrichment.analysis.on.selected.probes, eval=FALSE>>=
### identify enriched motif for significantly hypomethylated probes which 
##have putative target genes.

Sig.probes.paired <- read.csv("./ELMER.example/Result/LUSC/getPair.hypo.pairs.significant.csv",
                              stringsAsFactors=FALSE)[,1]  
head(Sig.probes.paired) # significantly hypomethylated probes with putative target genes

enriched.motif <-get.enriched.motif(probes=Sig.probes.paired, 
                                    dir.out="./ELMER.example/Result/LUSC", label="hypo",
                                    min.incidence = 10,lower.OR = 1.1)
names(enriched.motif)  # enriched motifs
head(enriched.motif["TP53"]) ## probes in the given set that have TP53 motif.

# get.enriched.motif automatically save output files. 
# getMotif.hypo.enriched.motifs.rda contains enriched motifs and the probes with the motif. 
# getMotif.hypo.motif.enrichment.csv contains summary of enriched motifs.
dir(path = "./ELMER.example/Result/LUSC", pattern = "getMotif") 

# motif enrichment figure will be automatically generated.
dir(path = "./ELMER.example/Result/LUSC", pattern = "motif.enrichment.pdf") 
@


\subsection{Identifying regulatory TFs}

This step is to identify regulatory TF whose expression associates with TF binding motif 
DNA methylation which is carried out by function get.TFs.\\
For each motif considered to be enriched within a particular probe set, function will
compare the average DNA methylation at all distal enhancer probes within +/- 100bp 
of a motif occurrence, to the expression of 1,982 human TFs A statistical test 
was performed for each motif-TF pair, as follows. The samples (all control and 
experiment samples) were divided into two groups: the M group, which consisted 
of the 20\% of samples with the highest average methylation at all motif-adjacent 
probes, and the U group, which consisted of the 20\% of samples with the lowest 
methylation. The 20th percentile cutoff is a parameter to the get.TFs function 
and was set to allow for identification of molecular subtypes present in 20\% of 
cases. For each candidate motif-TF pair, the Mann-Whitney U test was used to test 
the null hypothesis that overall gene expression in group M was greater or equal 
than that in group U. This non-parametric test was used in order to minimize the 
effects of expression outliers, which can occur across a very wide dynamic range. 
For each motif tested, this resulted in a raw p-value (Pr) for each of the 1982 TFs.
All TFs were ranked by the -log10(Pr), and those falling within the top 5\% of 
this ranking were considered candidate upstream regulators. The best upstream 
TFs which are known recognized the motif was automatically extracted as putative 
regulatory TFs.

If save parameter of get.TFs function is true, two files will be generated. 
If save parameter is false, only a data frame containing the same content 
with the first file mentioned below will be output.

The first one csv file. This file contain the regulatory TF significantly 
associate with average DNA methylation at particular motif sites. It contains 4 
columns: motif, top.potential.TF, potential.TFs and top\textunderscore 5percent.

1. motif: the names of motif.\\
2. top.potential.TF: the highest ranking upstream TFs which are known recognized the motif.\\
3. potential.TFs: TFs which are within top 5\% list and are known recognized the motif.\\
4. top\textunderscore 5percent: all TFs which are within top 5\% list.\\

The second file is rda file. This file contains a matrix storing the statistic 
results for associations between TFs (row) and average DNA methylation 
at motifs (column). This matrix can be use to generate TF ranking plots by function
TF.rank.plot.

<<identifying.regulatory.tf, eval=FALSE>>=
### identify regulatory TF for the enriched motifs

load("./ELMER.example/Result/LUSC/getMotif.hypo.enriched.motifs.rda")
TF <- get.TFs(mee=mee, enriched.motif=enriched.motif,dir.out="./ELMER.example/Result/LUSC", 
              cores=detectCores()/2, label= "hypo")

# get.TFs automatically save output files. 
# getTF.hypo.TFs.with.motif.pvalue.rda contains statistics for all TF with average 
# DNA methylation at sites with the enriched motif.
# getTF.hypo.significant.TFs.with.motif.summary.csv contains only the significant probes.
dir(path = "./ELMER.example/Result/LUSC", pattern = "getTF")  

# TF ranking plot based on statistics will be automatically generated.
dir(path = "./ELMER.example/Result/LUSC/TFrankPlot", pattern = "pdf") 
@
\section{Generating figures}
\subsection{Scatter plots}
Generate scatter plots for one probes' nearby 20 gene expression vs DNA methylation at this probe. Figure \ref{fig:figure1}

<<eval=FALSE>>=
scatter.plot(mee,byProbe=list(probe=c("cg19403323"),geneNum=20), 
             category="TN", save=TRUE) 
@

\begin{figure}[H]
\begin{center}  
\includegraphics[scale=0.5]{ELMER.example/cg19403323.byprobe.pdf} 
\protect\caption{SEach scatter plot shows the methylation level of an example 
probe cg19403323 in all LUSC samples plotted against the expression of one of 
20 adjacent genes.
\label{fig:figure1}}
\end{center}
\end{figure}

\subsubsection{Scatter plot of one pair}
Generate a scatter plot for one probe-gene pair. Figure \ref{fig:figure2}

<<eval=FALSE>>=
scatter.plot(mee,byPair=list(probe=c("cg19403323"),gene=c("ID255928")), 
             category="TN", save=TRUE,lm_line=TRUE) 
@

\begin{figure}[H]
\begin{center}  
\includegraphics[scale=0.5]{ELMER.example/cg19403323_SYT14.bypair.pdf}
\protect\caption{Scatter plot shows the methylation level of an example probe 
cg19403323 in all LUSC samples plotted against the expression of the putative 
target gene SYT14.
\label{fig:figure2}}
\end{center}
\end{figure}

\subsubsection{TF expression vs. average DNA methylation}
Generate scatter plot for TF expression vs average DNA methylation of the sites 
with certain motif. Figure \ref{fig:figure3}

<<eval=FALSE>>=
load("ELMER.example/Result/LUSC/getMotif.hypo.enriched.motifs.rda")
scatter.plot(mee,byTF=list(TF=c("TP53","TP63","TP73"),
             probe=enriched.motif[["TP53"]]), category="TN",
             save=TRUE,lm_line=TRUE)
@

\begin{figure}[H]
\begin{center}  
\includegraphics[scale=0.5]{ELMER.example/TP53_TP63_TP73.byTF.pdf} 
\protect\caption{Each scatter plot shows the average 
methylation level of sites with the TP53 motif in all LUSC samples plotted 
against the expression of the transcription factor TP53, TP63, TP73 respectively.
\label{fig:figure3}}
\end{center}
\end{figure}

\subsection{Schematic plot}
Schematic plot shows a breif view of linkages between genes and probes. To make a 
schematic plot, "Pair" object should be generated first.
<<eval=FALSE>>=
# Make a "Pair" object for schematic.plot
pair <- fetch.pair(pair="./ELMER.example/Result/LUSC/getPair.hypo.pairs.significant.withmotif.csv",
                   probeInfo = "./ELMER.example/Result/LUSC/probeInfo_feature_distal.rda",
                   geneInfo = "./ELMER.example/Result/LUSC/geneAnnot.rda")
@
\subsubsection{Nearby Genes}
Generate schematic plot for one probe with 20 nearby genes and label 
the gene significantly linked with the probe in red. Figure \ref{fig:figure4}

<<eval=FALSE>>=
schematic.plot(pair=pair, byProbe="cg19403323",save=TRUE)
@

\begin{figure}[H]
\begin{center}  
\includegraphics[scale=0.5]{ELMER.example/cg19403323.schematic.byProbe.pdf} 
\protect\caption{The schematic plot shows probe colored 
in blue and the location of nearby 20 genes. The genes significantly linked to the probe 
were shown in red.
\label{fig:figure4}}
\end{center}  
\end{figure}
\clearpage


\subsubsection{Nearby Probes}
Generate schematic plot for one gene with the probes which the gene is significantly 
linked to. Figure \ref{fig:figure5}

<<eval=FALSE>>=
schematic.plot(pair=pair, byGene="ID255928",save=TRUE)
@

\begin{figure}[H]
\begin{center}
\includegraphics[scale=0.5]{ELMER.example/ID255928.schematic.byGene.pdf} 
\protect\caption{The schematic plot shows the gene 
colored in red and all blue colored probes, which are significantly linked to the 
expression of this gene.
\label{fig:figure5}}
\end{center}
\end{figure}

\subsection{Motif enrichment plot}
Motif enrichment plot shows the enrichment levels for the selected motifs. Figure\ref{fig:figure6}

<<eval=FALSE>>=
motif.enrichment.plot(motif.enrichment="./ELMER.example/Result/LUSC/getMotif.hypo.motif.enrichment.csv", 
                      significant=list(OR=1.3,lowerOR=1.3), 
                      label="hypo", save=TRUE)  ## different signficant cut off.
@

\begin{figure}[H]
\begin{center}
\includegraphics[scale=0.5]{ELMER.example/hypo.motif.enrichment.pdf} 
\protect\caption{The plot shows the Odds Ratio (x axis) 
for the selected motifs with OR above 1.3 and lower boundary of OR above 1.3. 
The range shows the 95\% confidence interval for each Odds Ratio.
\label{fig:figure6}}
\end{center}
\end{figure}
\clearpage

\subsection{TF ranking plot}
TF ranking plot shows statistic -log10(P value) assessing the anti-correlation level 
of TFs expression level with average DNA methylation level at sites with a given motif. Figure \ref{fig:figure7}

<<eval=FALSE>>=
load("./ELMER.example/Result/LUSC/getTF.hypo.TFs.with.motif.pvalue.rda")
TF.rank.plot(motif.pvalue=TF.meth.cor, motif="TP53", TF.label=list(TP53=c("TP53","TP63","TP73")),
            save=TRUE) 
@

\begin{figure}[H]
\begin{center}  
\includegraphics[scale=0.5]{ELMER.example/TP53.TFrankPlot.pdf} 
\protect\caption{Shown are TF ranking plots based on 
the score (-log(P value)) of association between TF expression and DNA methylation of 
the TP53 motif in the LUSC cancer type . The dashed line indicates the boundary 
of the top 5\% association score. The top 3 associated TFs and the TF family members
(dots in red) that are associated with that specific motif are labeled in the plot.
\label{fig:figure7}}
\end{center}  
\end{figure}
\newpage
<<finalsession>>=
sessionInfo()
@

\end{document}