%\VignetteIndexEntry{ChIPComp}
%\VignettePackage{BiocStyle}
%\VignetteEngine{utils::Sweave}

\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex(use.unsrturl=FALSE)
@ 

\newcommand{\exitem}[3]
{\item \texttt{\textbackslash#1\{#2\}} #3 \csname#1\endcsname{#2}.}

\title{ChIPComp: A novel statistical method for quantitative comparison of multiple ChIP-seq datasets}
\author{Li Chen, Chi Wang, Zhaohui Qin, Hao Wu}

\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle

\tableofcontents


%---------------------------------------------------------
\section{Introduction}
%---------------------------------------------------------

This vignette introduces the use of the Bioconductor package \texttt{ChIPComp}, which is designed for differential binding sites analyses based on high-throughput sequencing data. The core of \texttt{ChIPComp} is a new procedure to incorporate the control sequencing data in a linear model framework. 
\texttt{ChIPComp} focus on analyzing the DBS ( transcription factor binding or histone modifications ) generated by peak-calling software
between two treatment conditions.  Since an increasing number of ChIP experiments are investigating the same type of binding event (protein-DNA binding or histone modification) under different treatment conditions (cell lines),  \texttt{ChIPComp} is to address how significant difference each binding site is between two treatment conditions by considering the control sequencing data. Compared with existing methods, ChIPComp provides excellent statistical and computational performance. Currently, \texttt{ChIPComp} only supports the situation when replicates are available for each treatment condition.


%---------------------------------------------------------
\section{Overview}
%---------------------------------------------------------

Here below is the \texttt{ChIPComp} work flow

\emph{1. Detect binding sites}:  The first step is to detect binding sites  (transcription factor binding or histone modifications) for each ChIP sequencing data using existing peak-calling software. 

\emph{2. Merge binding sites}: Binding sites from all replicates in two treatment conditions are merged into one set of binding sites. In the process, common binding sites are also recorded.

\emph{3. Count reads}: Both ChIP read counts and smoothing control read counts are calculated for each merged binding site.

\emph{4. Perform Hypothesis testing}: We fit the model and perform hypothesis testing on each merged binding site.


%---------------------------------------------------------
\section{Example}
%---------------------------------------------------------

To utilize the \texttt{ChIPComp} software, we need a data frame that represents the ChIP experiment information. We also need a design matrix retrieved from ChIP experiment to fit the linear model.  
\texttt{ChIPComp} provides two ways to obtain the configuration data frame and the design matrix.\\
The first way is to enter ChIPComp experiment information into one csv file as an input for function \texttt{makeConf}. The configuration data frame and design matrix are the output of \texttt{makeConf}, for example,
<<echo=TRUE,eval=TRUE>>=
library(ChIPComp)
confs=makeConf(system.file("extdata", "conf.csv", package="ChIPComp"))
conf=confs$conf
design=confs$design
@
\texttt{ChIPComp} defines a configuration data frame and design matrix manually, for example,
<<echo=TRUE,eval=TRUE>>=
conf=data.frame(
SampleID=1:4,
condition=c("Helas3","Helas3","K562","K562"),
factor=c("H3k27ac","H3k27ac","H3k27ac","H3k27ac"),
ipReads=system.file("extdata",c("Helas3.ip1.bed","Helas3.ip2.bed","K562.ip1.bed","K562.ip2.bed"),package="ChIPComp"),
ctReads=system.file("extdata",c("Helas3.ct.bed","Helas3.ct.bed","K562.ct.bed","K562.ct.bed"),package="ChIPComp"),
peaks=system.file("extdata",c("Helas3.peak.bed","Helas3.peak.bed","K562.peak.bed","K562.peak.bed"),package="ChIPComp")
)
conf$condition=factor(conf$condition)
conf$factor=factor(conf$factor)
design=as.data.frame(lapply(conf[,c("condition","factor")],as.numeric))-1
design=as.data.frame(model.matrix(~condition,design))
@
%
Once we have the configuration data frame and design matrix, we could merge binding sites, detect common binding sites and calculate read counts for each merged binding site.
<<echo=TRUE,eval=TRUE>>=
countSet=makeCountSet(conf,design,filetype="bed", species="hg19",binsize=1000)
@
Currently, if \texttt{filetype} is "bam", it is not necessary to specify \texttt{species}. However, if \texttt{filetype} is "bed", we need to specify \texttt{species} either "hg19" or "mm9".
%
We could explore the correlation between ChIP sample and control sample.
<<figureexample, fig=TRUE, include=TRUE, width=4.6, height=4.6>>=
plot(countSet)
@
%
Finally, we perform hypothesis testing on each binding site and print the top differential binding sites.
<<echo=TRUE,eval=TRUE>>=
countSet=ChIPComp(countSet)
print(countSet)
@
%
For the example data in the package, we collect 50 common binding sites between H3K27ac Helas3 and K562 cell lines and 10 unique binding sites for each cell line. Therefore, there are 60 binding sites for each cell line. We also extract the ChIP and control counts for each binding site in each condition. The configuration csv file, read bed files and peak bed files are stored in \texttt{inst/extdata} directory. The data frame that contains all binding sites and read counts have been pre-calculated and saved as a \texttt{ChIPComp} object \texttt{seqData} in \texttt{data} directory.
<<echo=TRUE,eval=TRUE>>=
data(seqData)
@


%---------------------------------------------------------
\section{Session info}
%---------------------------------------------------------

Here is the output of \Rfunction{sessionInfo} on the system on which
this document was compiled:
<<sessionInfo, results=tex, print=TRUE, eval=TRUE>>=
toLatex(sessionInfo())
@


\bibliography{ChIPComp}

\end{document}