%\VignetteIndexEntry{iSeq} %\VignetteDepends{} %\VignetteKeywords{ChIP-seq,next generation sequencing, massively parallel sequencing} %\VignettePackage{iSeq} \documentclass[11pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \SweaveOpts{echo=FALSE} \setlength{\textheight}{8.5in} \setlength{\textwidth}{6in} \setlength{\topmargin}{-0.25in} \setlength{\oddsidemargin}{0.25in} \setlength{\evensidemargin}{0.25in} \begin{document} \setkeys{Gin}{width=0.99\textwidth} \title{\bf iSeq: Bayesian Hierarchical Modeling of ChIP-seq Data Through Hidden Ising Models} \author{Qianxing Mo} \maketitle \begin{center} Department of Biostatistics \& Bioinformatics\\ H. Lee Moffitt Cancer Center \& Research Institute \\ 12902 Magnolia Drive, Tampa, FL 33612 \\ {\tt qianxing.mo@moffitt.org} \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} This package implements the models proposed by Mo (2012) for ChIP-seq data analysis, which are the extensions of the models proposed for ChIP-chip data (Mo and Liang, 2010a,b). The package can be used to analyze ChIP-seq data with and without controls and replicates. A tutorial of ChIP-seq data analysis using iSeq is available at \\ \url{https://sites.google.com/site/quincymobio/teaching-materials}. \noindent This package processes ChIP-seq data in three steps. \\ \begin{itemize} \item {Build a dynamic signal profile for each chromosome. To create a dynamic signal profile, sequence tags are aggregated into non-overlapping dynamic bins whose lengths vary according to local tag density (See the help file of function 'mergetag'). The number of tags falling in each bin is counted. A dynamic signal profile is made of the bin-based tag counts and the corresponding genomic positions on the chromosome. The bins and their tag counts are ordered along the chromosome according to their genomic positions. } \item {Model the dynamic signal profiles. It is to model the tag counts on the chromosome. The models assume that each bin is associated with a binary latent variable $X_i \in (1, -1)$, where $i$ denotes the ID for the bin, and $X_i=1$ denotes that the bin is an enriched bin, and -1 for a non-enriched bin. In the first stage, conditioning on the latent variable $X_i$, the bin-based tag counts are modeled by Poisson-Gamma distributions. If $X_i=-1$, the tag count for bin $i$ is assumed to follow a Poisson distribution with parameter $\lambda_0$, where $\lambda_0 \sim \textrm{Gamma}(a_0,b_0)$. If $X_i= 1$, the tag count for bin $i$ is assumed to follow a Poisson distribution with parameter $\lambda_1$, where $\lambda_1 \sim \textrm{Gamma}(a_1,b_1)$. $\textrm{Gamma}(a,b)$ denotes a gamma distribution with mean $a/b$ and variance $a/b^2$. In the second stage, the latent variable is modeled by ferromagnetic Ising models. The Gibbs sampler and Metropolis algorithm are used to simulate from the posterior distributions of the model parameters.} \item {Call enriched regions. The posterior probabilities for the bins in the enriched state ($X_i = 1$) are used for statistical inference. A bin with a high posterior probability of enrichment will provide strong evidence that the bin is enriched. Enriched bins are then merged into enriched regions.} \end{itemize} \section{The NRSF ChIP-seq Data} Johnson et al.(2007) carried out genome-wide identification of the binding sites of human neuron-restrictive silencer factor (NRSF) in Jurkat T cells. We use the data of chromosomes 22 and Y to illustrate the proposed method. The NRSF sequence tags (25 bp) were generated by the Illumina/Solexa sequencing platform, and mapped to The human genome May 2004 (hg17). We only use the uniquely mapped tags (up to two mismatches) for the analysis. Note that iSeq package doesn't provide functions to read bam/sam files generated by tag/read aligners. The user may use bedtools (\url{https://bedtools.readthedocs.io/en/latest/}) to convert bam files to bed files, which can be read into R easily. \section{Example --- Analyze the NRSF Data} \noindent First, let's load the library and check the data. There are three ChIP and control samples, respectively. <>= library(iSeq) data(nrsf) names(nrsf) @ \subsection{Build dynamic signal profiles using function 'mergetag'} \noindent The following two steps just merge the ChIP and control samples, respectively. <>= chip = rbind(nrsf$chipFC1592,nrsf$chipFC1862,nrsf$chipFC2002) mock = rbind(nrsf$mockFC1592,nrsf$mockFC1862,nrsf$mockFC2002) print(chip[1:2,]) print(mock[1:2,]) @ \noindent Note the 'position' is the tag start position (5' position). We recommend to use the tag middle position to build the signal profiles. Since the tag length is 25 bp, we add 12 to the start position to get the middle position. <>= chip[,2] = chip[,2]+12 mock[,2] = mock[,2]+12 print(chip[1:2,]) print(mock[1:2,]) @ If tag start and end positions are provided, the user can calculate tag middle position using R code 'floor((start+end)/2)'. We recommend to build the signal profiles using dynamic bins (80, 40, 20, 10 bp) and 10 tags as the threshold for triggering bin size change for the NRSF data, which is equivalent to setting maxlen=80, minlen=10 and ntagcut=10 for the arguments of function 'megertag'. Of course, the user can use other settings according to the sequencing depth of the experiments. \noindent According to my experience, the dynamic profiles with bin sizes (80, 40, 20, 10 bp) work well for both deeply and not-deeply sequenced data. However, the users may need to adjust argument 'ntagcut' to reflect the sequencing depth. For example, for deeply sequenced data, the users may set ntagcut = 20, or 30, etc. In general, increasing the value of 'ntagcut' leads to bigger bin sizes used for building the dynamic profiles and less enriched regions called. Note, by default, the bin sizes decrease or increase by a factor of 2, thus the user should set maxlen = ($2^n$)*minlen, where n is an integer. <>= tagct = mergetag(chip=chip,control=mock,maxlen=80,minlen=10,ntagcut=10) print(tagct[1:3,]) @ \noindent Note, the 'gstart' and 'gend' in 'tagct' record the positions of the first and last tags falling in the dynamic bins and the 'gstart' is also the bin's start position. In the above printout, the first 3 bin sizes are 80 bp. Because only 1 tag falls in each bin, the 'gstart' is equal to the 'gend'. For more information, please see Mo (2012) and the help file of function 'mergetag'. \\ \noindent The user can quickly get the 'enriched' regions without calculating statistical confidence using function 'peakreg'. For example, if we claim that a bin with adjusted tag count $>$ 10 is an enriched bin, the 'enriched' regions can be obtained by using the following code. Let's use the chromosome 22 data as an example. <>= tagct22 = tagct[tagct[,1]=="chr22",] #chr22 data reg0 = peakreg(chrpos=tagct22[,1:3],count=(tagct22[,5:6]-tagct22[,7:8]), pp=(tagct22[,4]>10),cutoff=0,method="ppcut",maxgap=200) print(dim(reg0)) print(reg0[1:3,]) # each row contain the information for an enriched region @ Note the second argument 'count = ' is just used for inferring the predicted exact binding sites (or peaks) of the enriched regions. Here, we use the adjusted tag counts without truncating at 0. The user can also just use the ChIP tag counts. Usually it will not make much difference. The last column (sym) in 'reg0' indicates whether the forward and reverse tag counts are symmetrical in the regions. It is to measure whether the regions are bimodal, which is a typical characteristic of binding regions. The values range from 0.5 (the perfect symmetry) to 0 (asymmetry). The user can order the regions based on 'sym' and 'ct12' to get the 'top' binding regions. For example, <>= reg0 = reg0[order(reg0$sym,reg0$ct12,decreasing = TRUE),] print(reg0[1:5,]) # Top five regions @ \subsection{Model dynamic signal profiles using function 'iSeq1'} \noindent Function iSeq1 implements a fully Bayesian hidden Ising model in which the latent variable $X$ is modeled by the standard 1D Ising model. The columns 1-4 of tagct are the signal profiles for modeling. Note that the column 4 contains the adjusted tag counts of the ChIP samples. For one sample analysis, it is the total tag counts for the forward and reverse chains. <>= set.seed(777) res1 = iSeq1(Y=tagct22[,1:4],gap=200,burnin=200,sampling=1000,ctcut=0.95, a0=1,b0=1,a1=5,b1=1,k0=3,mink=0,maxk=10,normsd=0.1,verbose=FALSE) @ \noindent Plot the model parameters to see whether they converge. In general, the MCMC chains have converged when the parameters fluctuate around the modes of their distributions. If there is an obvious trend (e.g. continuous increase or decrease), the user should increase the number of iterations in the burn-in and/or sampling phases. If the chains do not mix well, the user can adjust the argument normsd to see how it affects the results. \begin{center} <>= par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0)) hist(res1$pp) plot(res1$kappa, pch=".", xlab="Iterations", ylab="kappa") plot(res1$lambda0, pch=".", xlab="Iterations", ylab="lambda0") plot(res1$lambda1, pch=".", xlab="Iterations", ylab="lambda1") @ \end{center} \noindent From the trace plots, we see the chains converge quite fast. \subsection{Call enriched regions detected by iSeq1 using function 'peakreg'} \noindent Call the enriched regions detected by iSeq1 using 0.5 posterior probability (pp) cutoff. Note the argument count is the net tag counts for the two sample analysis. The net tag counts are not truncated at zero, but this doesn't matter because it is just used for inferring the center of an enriched region, which is usually the true binding site. The user can also use the ChIP tag counts only. The results are little different. See the help file of peakreg for details. <>= reg1 = peakreg(chrpos=tagct22[,1:3],count=(tagct22[,5:6]-tagct22[,7:8]),pp=res1$pp, cutoff=0.5,method="ppcut",maxgap=200) print(dim(reg1)) print(reg1[1:3,]) @ \noindent Plot some enriched regions. The dash lines indicate the region centers, usually the true binding sites. \begin{center} <>= par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0)) for(i in 1:4){ ID = (reg1[i,4]):(reg1[i,5]) plotreg(tagct22[ID,2:3],tagct22[ID,5:6],tagct22[ID,7:8],peak=reg1[i,6]) } @ \end{center} \noindent Call the enriched regions using 0.05 FDR cutoff. The FDR is calculated using the direct posterior probability approach (Newton et al., 2004). <>= reg2 = peakreg(tagct22[,1:3],tagct22[,5:6]-tagct22[,7:8],res1$pp,0.05, method="fdrcut",maxgap=200) print(dim(reg2)) print(reg2[1:3,]) @ \noindent The columns 1-3 of the enriched regions (e.g. reg2[,1:3]) can be used to extract the sequences from the UCSC genome browser. Alternatively, one may create a BED file using the peak position of the enriched regions. For example, <>= bed = data.frame(chr=reg2[,1],gstart=reg2[,6]-100,gend=reg2[,6]+100) @ \subsection{Model dynamic signal profiles using function 'iSeq2' } \noindent The latent variable $X$ can be modeled through a high-order Ising model using function iSeq2. The interaction parameter $k$ for the high-order (or the standard/first-order) Ising model is fixed and set by the user in iSeq2. To apply the second-order Ising model to ChIP-seq data, the user can let winsize = 2. If set winsize = 1, it will be the standard Ising model. To use a high-order Ising model, according to our experience, a balance between high sensitivity and low FDR can be achieved when winsize = 2. The critical value for the second-order Ising model is about 1.0. In general, increasing the value of $k$ will lead to less enriched regions, which amounts to setting a stringent criterion for detecting enriched regions. \noindent Model the NRSF data using the second-order Ising model. <>= res2 = iSeq2(Y=tagct22[,1:4],gap=200, burnin=100,sampling=500,winsize=2, ctcut=0.95,a0=1,b0=1,a1=5,b1=1,k=1.0,verbose=FALSE) @ \noindent Plot the model parameters to see whether they converge. If the chains do not mix well, one can adjust the parameter {\bf k} and/or normsd to see how they affect the results. \begin{center} <>= par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0)) hist(res2$pp) plot(res2$lambda0, pch=".", xlab="Iterations", ylab="lambda0") plot(res2$lambda1, pch=".", xlab="Iterations", ylab="lambda1") @ \end{center} \subsection{Call enriched regions detected by iSeq2 using function 'peakreg'} <>= reg2 = peakreg(tagct22[,1:3],tagct22[,5:6],res2$pp,0.5,method="ppcut",maxgap=200) print(dim(reg2)) print(reg2[1:3,]) @ \noindent Plot some enriched regions detected by iSeq2. \begin{center} <>= par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0)) for(i in 1:4){ ID = (reg2[i,4]):(reg2[i,5]) plotreg(tagct22[ID,2:3],tagct22[ID,5:6],tagct22[ID,7:8],peak=reg2[i,6]) } @ \end{center} \section{Tips} \noindent Finally, let's analyze the chromosome Y data. Intuitively, it seems that there is no binding region on chromosome Y. <>= tagctY = tagct[tagct[,1]=="chrY",] print(table(tagctY[,4])) res1 = iSeq1(Y=tagctY[,1:4],gap=200,burnin=1000,sampling=5000,ctcut=0.95, a0=1,b0=1,a1=5,b1=1,k0=3,mink=0,maxk=10,normsd=0.5,verbose=FALSE) res2 = iSeq2(Y=tagctY[,1:4],gap=200, burnin=1000,sampling=5000,winsize=2,ctcut=0.95, a0=1,b0=1,a1=5,b1=1,k=3.0,verbose=FALSE) @ \begin{center} <>= par(mfcol=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0)) hist(res1$pp) plot(res1$lambda1, pch=".", xlab="Iterations", ylab="lambda1") hist(res2$pp) plot(res2$lambda1, pch=".", xlab="Iterations", ylab="lambda1") @ \end{center} \noindent In this case, all the bins are in the non-enriched state and the posterior probabilities are zero or close to zero. In some cases, if the $\lambda_1$ is small (e.g. $\approx a_1/b_1$, the mean value of the gamma prior), it suggests that there is no enriched region on that chromosome. The user may not claim any enriched regions found on that chromosome, even if some posterior probabilities are high. \noindent According to my experience, iSeq1 methods work well for most of ChIP-seq data I have analyzed. For very noisy data, iSeq1 may not work. The parameter $\lambda_1$ estimates for very noisy data are usually less than 1 and the posterior probabilities are not dichotomized and dominated by 0, which suggest that the Ising model is not in the super-paramagnetic phase. Only the super-paramagnetic phase reflects the binding events on the chromosomes. Therefore, if iSeq1 doesn't work, the user can use iSeq2. In iSeq2 function, the interaction parameter $k$ is fixed by the user. The user can increase the value of $k$ to let the phase transition occur so that the Ising model reaches the super-paramagnetic phase. \\ %\noindent %The iSeq methods take into account the spatial dependency, the global and local distributions of the sequence tags. %If the signal profiles have very sparse signals and some bins have very large (adjusted) tag counts (e.g., the NRSF data), %the estimated $\lambda_1$ will be relatively large, which makes the bins with relatively small counts (e.g., tens) have %low posterior probabilities. In this case, the user can truncate %the (adjusted) tag counts at certain value (e.g., if count $>$ 100, set count = 100) to increase the power to detect %the regions with small tag counts. For the NRSF data, if the adjusted tag counts are truncated at 100, more enriched %regions can be detected. \section{Parallel Computation and R script for running iSeq not interactively} \noindent To speed up the analysis, the user can do parallel computation easily. The user needs to install the snow and snowfall packages. The following is an example. \\ \noindent library(snow) \\ library(snowfall) \\ dataList = list(chr22=tagct22,chrY=tagctY) \\ sfInit(parallel=TRUE,cpus=2,type="SOCK") \\ res=sfLapply(dataList,iSeq1,gap=200,burnin=100,sampling=200,ctcut=0.95,a0=1,b0=1,a1=1,b1=1, k0=3,mink=0,maxk=10,normsd=0.1,verbose=FALSE) \\ \noindent An R script called 'iSeq.R' has implemented the above parallel computation strategy and can be used as a command line program in Unix/Linux environment to automate ChIP-seq data analysis, which is available at \\ \url{https://sites.google.com/site/quincymobio/teaching-materials}. \section{Citing iSeq} \noindent If you use iSeq package, please cite ``Mo, Q. (2012). A fully Bayesian hidden Ising model for ChIP-seq data analysis. Biostatistics 13(1), 113-128''. <>= sessionInfo() @ \begin{thebibliography}{} \bibitem[\protect\citeauthoryear{Mo and Liang.}{2010}]{mo2010a} Mo, Q., Liang, F. (2010a). Bayesian Modeling of ChIP-chip data through a high-order Ising Model. {\it Biometrics} {\bf 66}, 1284-1294. \bibitem[\protect\citeauthoryear{Mo and Liang.}{2010}]{mo2010b} Mo, Q., Liang, F. (2010b). A hidden Ising model for ChIP-chip data analysis. {\it Bioinformatics} {\bf 26(6)}, 777-783. \bibitem[\protect\citeauthoryear{Mo}{2011}]{mo20112} Mo, Q. (2012). A fully Bayesian hidden Ising model for ChIP-seq data analysis. {\it Biostatistics} {\bf 13(1)}, 113-28. \bibitem[\protect\citeauthoryear{Newton et al.}{2004}]{new04} Newton, M., Noueiry, A., Sarkar, D., Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. {\it Biostatistics} {\bf 5} , 155-176. \bibitem[\protect\citeauthoryear{Johnson et al.}{2007}]{johnson07} Johnson, D.S., Mortazavi, A., Myers, R.M., Wold, B. (2007). Genome-wide mapping of in vivo protein-DNA interactions. {\it Science} {\bf 316}, 1497-1502. \end{thebibliography} \end{document}