%\VignetteIndexEntry{iSeq}
%\VignetteDepends{}
%\VignetteKeywords{ChIP-seq,next generation sequencing, massively parallel sequencing}
%\VignettePackage{iSeq}

\documentclass[11pt]{article}
\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\SweaveOpts{echo=FALSE}

\setlength{\textheight}{8.5in}
\setlength{\textwidth}{6in}
\setlength{\topmargin}{-0.25in}
\setlength{\oddsidemargin}{0.25in}
\setlength{\evensidemargin}{0.25in}

\begin{document}
\setkeys{Gin}{width=0.99\textwidth}

\title{\bf iSeq: Bayesian Hierarchical Modeling of ChIP-seq Data Through Hidden Ising Models}

\author{Qianxing Mo}

\maketitle

\begin{center}
Department of Epidemiology and Biostatistics\\
Memorial Sloan-Kettering Cancer Center\\
{\tt moq@mskcc.org}
\end{center}

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

This package implements the models proposed by Mo (2010) for ChIP-seq data analysis, 
which are the extensions of the models proposed for ChIP-chip data (Mo and Liang, 2010a,b).  
The package can be used to analyze ChIP-seq data with or without controls. \\

\noindent
This package processes ChIP-seq data in three steps. \\
\begin{itemize}
\item {Build a dynamic signal profile for each chromosome.  To create a dynamic signal profile, 
    sequence tags are aggregated into non-overlapping dynamic bins whose lengths vary according to 
    local tag density (See the help file of function 'mergetag').  The number of tags falling in each bin is counted.
    A dynamic signal profile is made of the bin-based tag counts and the corresponding genomic positions on 
    the chromosome.  The bins and their tag counts are ordered along the chromosome according to 
    their genomic positions. }
\item {Model the dynamic signal profiles.  It is to model the tag counts on the chromosome.  
    The models assume that each bin is associated with a binary latent variable $X_i \in (1, -1)$, 
    where $i$ denotes the ID for the bin, and $X_i=1$ denotes that the bin is an enriched bin, 
    and -1 for a non-enriched bin. In the first stage, conditioning on the latent variable $X_i$, 
    the bin-based tag counts are modeled by Poisson-Gamma distributions.  
    If $X_i=-1$, the tag count for bin $i$ is assumed to follow a Poisson distribution 
    with parameter $\lambda_0$, where $\lambda_0 \sim \textrm{Gamma}(a_0,b_0)$. 
    If $X_i= 1$,  the tag count for bin $i$ is assumed to follow a Poisson distribution with 
    parameter $\lambda_1$, where $\lambda_1 \sim \textrm{Gamma}(a_1,b_1)$. $\textrm{Gamma}(a,b)$ 
    denotes a gamma distribution with mean $a/b$ and variance $a/b^2$.  In the second stage, 
    the latent variable is modeled by ferromagnetic Ising models.  
    The Gibbs sampler and Metropolis algorithm are used to simulate from the posterior 
    distributions of the model parameters.}
\item {Call enriched regions.  The posterior probabilities for the bins in the enriched state ($X_i = 1$) 
    are used for statistical inference.  A bin with a high posterior probability in the enriched state 
    will provide strong evidence that the bin is enriched. Enriched bins are then merged into 
    enriched regions.}  
\end{itemize}

\noindent
For more information, we refer the user to Mo and Liang (2010 a, b) because the manuscript (Mo, 2010c) is still under 
review.  The major difference for modeling ChIP-chip and ChIP-seq data is at the first stage, where normal distributions
are used for ChIP-chip data, while Poisson distributions are used for ChIP-seq data. 

\section{The NRSF ChIP-seq Data}

Johnson et al.(2007) carried out genome-wide identification of the binding sites of 
human neuron-restrictive silencer factor (NRSF) in Jurkat T cells.  We use the data of chromosomes 22 and Y
to illustrate the proposed method.  The NRSF sequence tags (25 bp) were generated by 
the Illumina/Solexa sequencing platform, and mapped to The human genome May 2004 (hg17). 
We only use the uniquely mapped tags (up to two mismatches) for the analysis. 
Note that iSeq package doesn't provide functions to read the raw data generated by 
the sequencers.  However, it should not be difficult to prepare the data in the format as shown in 
the following.

\section{Example --- Analyze the NRSF Data}
\noindent
First, let's load the library and check the data.  There are three ChIP and control samples, respectively. 
<<echo=TRUE,print=FALSE>>=
library(iSeq)
data(nrsf)
names(nrsf)
@ 
\subsection{Build dynamic signal profiles using function 'mergetag'}
\noindent
The following two steps just merge the ChIP and control samples, respectively.
<<echo=TRUE,print=FALSE>>=
chip = rbind(nrsf$chipFC1592,nrsf$chipFC1862,nrsf$chipFC2002)
mock = rbind(nrsf$mockFC1592,nrsf$mockFC1862,nrsf$mockFC2002)
print(chip[1:3,])
print(mock[1:3,])
@ 
\noindent
We suggest building the signal profiles using dynamic bins (80, 40, 20, 10 bp) and 10 tags as the threshold  
for triggering bin size change for the NRSF data,  which is equivalent to setting maxlen=80,minlen=10 and 
ntagcut=10 for the arguments of function 'megertag'.  Of course, the user can use other settings 
according to the sequencing depth of the experiments.
According to my experience, the dynamic profiles with bin sizes (80, 40, 20, 10 bp) work well for both deeply 
and not-deeply sequenced data.  However, the users may need to adjust argument 'ntagcut' to reflect 
the sequencing depth.  For example, for deeply sequenced data, the users may set ntagcut = 20, or 30, etc.
In general, increasing the value of 'ntagcut' leads to bigger bin sizes used for building the dynamic profiles
and less regions called enriched.  Note, by default, the bin sizes decrease or increase by a factor of 2,
thus the user should set maxlen = ($2^n$)*minlen, where n is an integer.  

<<echo=TRUE,print=FALSE>>=
tagct = mergetag(chip=chip,control=mock,maxlen=80,minlen=10,ntagcut=10)
print(tagct[1:3,])
@ 
\noindent
See the help file for function 'mergetag' for the meanings of the tagct columns. \\
 
\noindent
The user can quickly get the 'enriched' regions without calculating
statistical confidence using function 'peakreg'.  For example, if we claim that a bin with 
adjusted tag count $>$ 10 is an enriched bin,  the 'enriched' regions can be obtained by using 
the following code.  Let's use the chromosome 22 data as an example.  
<<echo=TRUE,print=FALSE>>=
tagct22 = tagct[tagct[,1]=="chr22",]  #chr22 data 
reg0 = peakreg(chrpos = tagct22[, 1:3], count = (tagct22[, 5:6] -tagct22[, 7:8]), pp = (tagct22[,4]>10),cutoff = 0, method = "ppcut",maxgap=200)
print(dim(reg0)) 
print(reg0[1:3,]) # each row contain the information for an enriched region
@ 
Note the second argument 'count = ' is just used for inferring the predicted exact binding sites 
(or peaks) of the enriched regions.  Here, we use the adjusted tag counts without truncating at 0. 
The user can also just use the ChIP tag counts.  Usually it will not make much difference.  
The last column (sym) in 'reg0' indicates whether the forward and reverse tag counts are symmetrical in the regions.
It is to measure whether the regions are bimodal, which is a typical characteristic of binding regions. 
The values range from 0.5 (the perfect symmetry) to 0 (asymmetry).  The user can order the regions 
based on 'sym' and 'ct12' to get the 'top' binding regions.  For example, 
<<echo=TRUE,print=FALSE>>=
reg0 = reg0[order(reg0$sym,reg0$ct12,decreasing = TRUE),]
print(reg0[1:5,]) # Top five regions
@ 

\subsection{Model dynamic signal profiles using function 'iSeq1'}
\noindent 
Function iSeq1 implements a fully Bayesian hidden Ising model in which the latent variable $X$ is modeled by 
the standard 1D Ising model.  The columns 1-4 of tagct are the signal profiles for modeling.  
Note that the column 4 contains the adjusted tag counts of the ChIP samples. For one sample analysis, 
it is the total tag counts for the forward and reverse chains. 
<<echo=TRUE,print=FALSE>>=
set.seed(777)
res1 = iSeq1(Y=tagct22[,1:4],gap=200,burnin=200,sampling=1000,ctcut=0.95,a0=1,b0=1,a1=5,b1=1,
  k0=3,mink=0,maxk=10,normsd=0.1,verbose=FALSE)
@ 

\noindent 
Plot the model parameters to see whether they converge.  In general, the MCMC chains have converged when the parameters 
fluctuate around the modes of their distributions.  If there is an obvious trend(e.g. continuous increase or decrease), 
the user should increase the number of iterations in the burn-in and/or sampling phases. 
If the chains do not mix well, the user can adjust the argument normsd to see how it affects the results. 
\begin{center}
<<fig=TRUE,echo=TRUE>>=
par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0))
hist(res1$pp)
plot(res1$kappa, pch=".", xlab="Iterations", ylab="kappa")
plot(res1$lambda0, pch=".", xlab="Iterations", ylab="lambda0")
plot(res1$lambda1, pch=".", xlab="Iterations", ylab="lambda1")
@
\end{center}
\noindent
From the trace plots, we see the chains converge quite fast.  

\subsection{Call enriched regions detected by iSeq1 using function 'peakreg'}
\noindent
Call the enriched regions detected by iSeq1 using 0.5 posterior probability (pp) cutoff.  
Note the argument count is the net tag counts for the two sample analysis.   
The net tag counts are not truncated at zero, but this 
doesn't matter because it is just used for inferring the center of the enriched region, 
which is usually the true binding site.  The user can also use the ChIP tag counts only.  
The results are little different. See the help file of peakreg for details. 
<<echo=TRUE,print=FALSE>>=
reg1 = peakreg(chrpos=tagct22[,1:3],count=(tagct22[,5:6]-tagct22[,7:8]),pp=res1$pp,cutoff=0.5,method="ppcut",maxgap=200)
print(dim(reg1))
print(reg1[1:3,])
@ 
\noindent
Note, the 5' positions of the sequence tags are used as the genomic positions for the NRSF data.  To infer the actual 
binding sites, one may add 13 bp to the peak position (reg1\$peakpos + 13) because the tags' length is 25 bp.  
If the middle positions of the sequence tags are used as the genomic positions, the user doesn't need to do the adjustment.\\


\noindent
Plot some enriched regions. The dash lines indicate the region center, usually the true binding sites. 
\begin{center}
<<fig=TRUE,echo=TRUE>>=
par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0))
for(i in 1:4){
  ID = (reg1[i,4]):(reg1[i,5])
  plotreg(tagct22[ID,2:3],tagct22[ID,5:6],tagct22[ID,7:8],peak=reg1[i,6])
}
@ 
\end{center}

\noindent
Call the enriched regions using 0.05 FDR cutoff.  The FDR is calculated
using a direct posterior probability approach (Newton et al., 2004). 
<<echo=TRUE,print=FALSE>>=
reg2 = peakreg(tagct22[,1:3],tagct22[,5:6]-tagct22[,7:8],res1$pp,0.05,method="fdrcut",maxgap=200)
print(dim(reg2))
print(reg2[1:3,])

@ 

\noindent
The columns 1-3 of the enriched regions (e.g. reg2[,1:3]) can be used to extract the sequences 
from the UCSC genome browser. Alternatively, one may create a BED file using the peak position of 
the enriched regions.  For example, 
<<echo=TRUE,print=FALSE>>=
bed = data.frame(chr=reg2[,1],gstart=reg2[,6]-100,gend=reg2[,6]+100)
@ 

\subsection{Model dynamic signal profiles using function 'iSeq2' }
\noindent
The latent variable $X$ can be modeled through a high-order Ising model using function iSeq2. 
The interaction parameter $k$ for the high-order (or the standard/first-order) Ising model is fixed 
and set by the user in iSeq2.  To apply the second-order Ising model to ChIP-seq data, the user can 
let winsize = 2.  If set winsize = 1, it will be the standard Ising model.  To use a high-order Ising model,
according to our experience, a balance between high sensitivity and low FDR can be
achieved when winsize = 2.  The critical value for the second-order Ising model is about 1.0. 
In general, increasing the value of $k$ will lead to less enriched regions, which amounts to 
setting a stringent criterion for detecting enriched regions.  

\noindent
Model the NRSF data using the second-order Ising model. 
<<echo=TRUE,print=FALSE>>=
res2 = iSeq2(Y=tagct22[,1:4],gap=200, burnin=100,sampling=500,winsize=2,ctcut=0.95,
  a0=1,b0=1,a1=5,b1=1,k=1.0,verbose=FALSE)
@ 

\noindent 
Plot the model parameters to see whether they converge.  If the chains do not mix well, one can adjust 
the parameter {\bf k} and/or normsd to see how they affect the results. 
\begin{center}
<<fig=TRUE,echo=TRUE>>=
par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0))
hist(res2$pp)
plot(res2$lambda0, pch=".", xlab="Iterations", ylab="lambda0")
plot(res2$lambda1, pch=".", xlab="Iterations", ylab="lambda1")
@
\end{center}

\subsection{Call enriched regions detected by iSeq2 using function 'peakreg'}
<<echo=TRUE,print=FALSE>>=
reg2 = peakreg(tagct22[,1:3],tagct22[,5:6],res2$pp,0.5,method="ppcut",maxgap=200)
print(dim(reg2))
print(reg2[1:3,])
@ 

\noindent
Plot some enriched regions detected by iSeq2. 
\begin{center}
<<fig=TRUE,echo=TRUE>>=
par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0))
for(i in 1:4){
  ID = (reg2[i,4]):(reg2[i,5])
  plotreg(tagct22[ID,2:3],tagct22[ID,5:6],tagct22[ID,7:8],peak=reg2[i,6])
}
@ 
\end{center}

\section{Tips}
\noindent
Finally, let's analyze the chromosome Y data. Intuitively, it seems that there is no binding region on
chromosome Y.  
<<echo=TRUE,print=FALSE>>=
tagctY = tagct[tagct[,1]=="chrY",]
print(table(tagctY[,4]))
res1 = iSeq1(Y=tagctY[,1:4],gap=200,burnin=1000,sampling=5000,ctcut=0.95,a0=1,b0=1,a1=5,b1=1,
  k0=3,mink=0,maxk=10,normsd=0.5,verbose=FALSE)
res2 = iSeq2(Y=tagctY[,1:4],gap=200, burnin=1000,sampling=5000,winsize=2,ctcut=0.95,
  a0=1,b0=1,a1=5,b1=1,k=3.0,verbose=FALSE)
@ 

\begin{center}
<<fig=TRUE,echo=TRUE>>=
par(mfcol=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0))
hist(res1$pp)
plot(res1$lambda1, pch=".", xlab="Iterations", ylab="lambda1")
hist(res2$pp)
plot(res2$lambda1, pch=".", xlab="Iterations", ylab="lambda1")
@ 
\end{center}

\noindent
In this case, all the bins are only in the non-enriched state and the posterior probabilities are zero or close to zero.
In some cases, if the $\lambda_1$ is small (e.g. $\approx a_1/b_1$, the mean value of the gamma prior), it suggests that
there is no enriched region on that chromosome.  The user may not claim any enriched regions found on that chromosome, 
even if some posterior probabilities are high.  

\noindent
According to my experience, iSeq1 methods work well for most of ChIP-seq data I have analyzed.  
For very noisy data, iSeq1 may not work.  The parameter $\lambda_1$ estimates for very noisy data 
are usually less than 1 and the posterior probabilities are not dichotomized and dominated by 0,
which suggest that the Ising model is not in the super-paramagnetic phase.  
Only the super-paramagnetic phase reflects the binding events on the chromosomes.  
Therefore, if iSeq1 doesn't work, the user can use iSeq2.  In iSeq2 function, 
the interaction parameter $k$ is fixed by the user.  
The user can increase the value of $k$ to let the phase transition occur so that
the Ising model reaches the super-paramagnetic phase. \\

%\noindent
%The iSeq methods take into account the spatial dependency, the global and local distributions of the sequence tags.   
%If the signal profiles have very sparse signals and some bins have very large (adjusted) tag counts (e.g., the NRSF data), 
%the estimated $\lambda_1$ will be relatively large, which makes the bins with relatively small counts (e.g., tens) have
%low posterior probabilities.  In this case, the user can truncate
%the (adjusted) tag counts at certain value (e.g., if count $>$ 100, set count = 100) to increase the power to detect 
%the regions with small tag counts.  For the NRSF data, if the adjusted tag counts are truncated at 100, more enriched
%regions can be detected.  

\section{Parallel Computation}
\noindent
To speed up the analysis, the user can do parallel computation easily. The user needs to install the snow and snowfall 
packages.  The following is an example. \\

\noindent
library(snow) \\
library(snowfall)  \\ 
dataList = list(chr22=tagct22,chrY=tagctY) \\
sfInit(parallel=TRUE,cpus=2,type="SOCK") \\
res=sfLapply(dataList,iSeq1,gap=200,burnin=100,sampling=200,ctcut=0.95,a0=1,b0=1,a1=1,b1=1,
  k0=3,mink=0,maxk=10,normsd=0.1,verbose=FALSE) \\

\section{Citing iSeq}
\noindent
If you use iSeq package, please cite ``Mo, Q. (2010). A fully Bayesian hidden Ising model for ChIP-seq data analysis (Submitted).''. 

<<echo=TRUE,print=TRUE>>=
sessionInfo()
@ 

\begin{thebibliography}{}
\bibitem[\protect\citeauthoryear{Mo and Liang.}{2010}]{mo2010a}
Mo, Q., Liang, F. (2010a).
Bayesian Modeling of ChIP-chip data through a high-order Ising Model.  
{\it Biometrics} {\bf 66}, 1284-1294.

\bibitem[\protect\citeauthoryear{Mo and Liang.}{2010}]{mo2010b}
Mo, Q., Liang, F. (2010b).
A hidden Ising model for ChIP-chip data analysis.  
{\it Bioinformatics} {\bf 26(6)}, 777-783.

\bibitem[\protect\citeauthoryear{Mo}{2010}]{mo2010c}
Mo, Q. (2010). A fully Bayesian hidden Ising model for ChIP-seq data analysis. (Submitted). 

\bibitem[\protect\citeauthoryear{Newton et al.}{2004}]{new04}
Newton, M., Noueiry, A., Sarkar, D., Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric
hierarchical mixture method.
{\it Biostatistics} {\bf 5} , 155-176.

\bibitem[\protect\citeauthoryear{Johnson et al.}{2007}]{johnson07}
Johnson, D.S., Mortazavi, A., Myers, R.M., Wold, B. (2007).  
 Genome-wide mapping of in vivo protein-DNA interactions. 
{\it Science} {\bf 316}, 1497-1502. 

\end{thebibliography}
\end{document}