% \VignetteIndexEntry{Introduction to ChIPseqR} \documentclass[a4paper]{article} \usepackage[OT1]{fontenc} \usepackage{Sweave} \usepackage{url} \usepackage{hyperref} \usepackage{fullpage} \begin{document} \title{Introduction to ChIPseqR} \author{Peter Humburg} \maketitle \section{Introduction} The \texttt{ChIPseqR} package was developed for the analysis of nucleosome ChIP-seq data. It is suitable for the high resolution analysis of end-sequenced nucleosomes after MNase digest and can be used for the analysis of nucleosome positioning as well as histone modification experiments. The parameters of the model used by \texttt{ChIPseqR} provide some flexibility and choosing them appropriately should allow for the analysis of other types of experiments although this has not been tested extensively. In the following sections we will discuss the application of \texttt{ChIPseqR} to simulated nucleosome sequencing data to demonstrate its basic functionality. This is followed by some details on the underlying method and a discussion of how this method may be adapted to other types of experiments. \section{Simulating data} We start by simulating a small dataset representing reads from a nucleosome positioning experiment. The information in this section details the data used for the following examples. To quickly see an application of \texttt{ChIPseqR} just skip ahead to the next section. For the purpose of this example we will restrict ourselves to a very small genome and relatively few reads. This keeps time and memory requirements low but may affect \texttt{ChIPseqR}'s ability to locate binding sites. While it was designed to handle low coverage data it works better with genomes that are larger than the 2Mb we will simulate here. Also note that this simulation is somewhat simplistic and is used here for demonstration purposes only. If we were interested in carrying out a serious simulation study for ChIP-seq data we could use a more sophisticated approach\footnote{The \texttt{ChIPsim} package available from Bioconductor provides functionality for this as does the simulation of Zhang~\emph{et al.} (2008) available at \href{http://www.gersteinlab.org/proj/chip-seq-simu}{\url{www.gersteinlab.org/proj/chip-seq-simu}}} but this will be sufficient to demonstrate the basic use of \texttt{ChIPseqR}. The first step in our simple simulation is to define some nucleosome positions. We will assume that the distance between the centres of adjacent nucleosomes lies between 170 and 300~bp and follows a negative binomial distribution with mean 200. In addition we will allow nucleosomes to be missing, introducing a gap of 400~bp between adjacent nucleosomes. Here we generate the distances between nucleosomes and translate them into nucleosome positions: <>= set.seed(1) dist <- sample(c(170:300, 400), 10100, prob=c(dnbinom(0:130, mu=30, size=5), 0.2), replace=TRUE) pos <- cumsum(dist) pos <- pos[pos < 2e6 - 200] @ Now we generate read positions relative to binding sites. For this purpose we will make a number of simplifying assumptions: \begin{enumerate} \item Reads fall within 15~bp of either end of a nucleosome (with forward strand reads preceding the binding site and reverse strand reads following it); \item Reads may be located up to 5~bp inside the binding site; \item Read positions within these 20~bp are uniformly distributed; \item The same number of forward and reverse strand reads are produced; \item All nucleosomes cover exactly 147~bp; \item Nucleosome positions are fixed; \item Non-specific background reads are uniformly distributed throughout the genome. \end{enumerate} Note that these assumptions are not used for the analysis and are not made by \texttt{ChIPseqR}, they just simplify the simulation. We start by identifying all potential start positions for binding site related reads. <>= fwdRegion <- unlist(lapply(pos, function(x) (x-88):(x-68))) revRegion <- unlist(lapply(pos, function(x) (x+68):(x+88))) @ Then we sample 50,000 reads from each strand. This should give us an average of five reads per nucleosome and strand, not a lot of coverage. <>= fwd <- sample(fwdRegion, 5e4, replace=TRUE) rev <- sample(revRegion, 5e4, replace=TRUE) @ Finally we add 200,000 non-specific background reads to each strand. That is a lot of background noise so finding the nucleosomes will not be easy. <>= fwd <- c(fwd, sample(25:(2e6-25), 2e5, replace=TRUE)) rev <- c(rev, sample(25:(2e6-25), 2e5, replace=TRUE)) @ At this stage \texttt{fwd} and \texttt{rev} contain all positions of forward and reverse strand reads. The final step in our simulation is to organise the read position in a data frame that can then serve as input for \texttt{ChIPseqR}. <>= reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=rep(c("+", "-"), each=250000)) @ \section{Using \texttt{simpleNucCall}} <>= library(ChIPseqR) @ The easiest way to predict nucleosome positions with \texttt{ChIPseqR} is to call \texttt{simpleNucCall} with mapped reads. Here we will use the data frame created above but for real data it is usually more convenient to read the data into an \texttt{AlignedRead} object using the functionality provided by the \texttt{ShortRead} package. It is possible to simply pass the name of a file containing the mapped reads to \texttt{simpleNucCall} and they will be read in before the analysis. Here we first convert the read positions into read counts so that we can use the read counts again later. If we were not interested in inspecting the read counts later on we could just pass the data frame to \texttt{simpleNucCall} which would convert it into read counts for us. <>= counts <- strandPileup(reads, chrLen=2e6, extend=1, plot=FALSE) nucs <- simpleNucCall(counts, bind=147, support=15, plot=FALSE) @ Note the parameters in the above call. We specify the length of the binding and support regions as well as the chromosome length. The values chosen for binding and support region length do not match the values in the simulation perfectly (where they are 137 and 20 respectively). Below we will discuss briefly how these parameters can be estimated from the data. It is possible to obtain some diagnostic plots by setting \texttt{plot = TRUE}. Diagnostic plots can also be obtained later by plotting the R objects created by \texttt{strandPileup} and \texttt{simpleNucCall}. For example, to genarate some diagnostic plots to assess how well the data fits the null distribution used by \texttt{ChIPseqR} we can simply plot \texttt{nucs}: <>= plot(nucs, type="density") plot(nucs, type="qqplot") @ \begin{figure}[!thb] \centering <>= par(mfrow=c(1,2)) plot(nucs, type="density") plot(nucs, type="qqplot") @ \caption{Diagnostic plots to assess fit of null distribution to observed binding site scores. The blue line marks the chosen significance cut-off.} \end{figure} The output is shown in Figure~1. The heavy right tail expected in the presence of nucleosomes in the sample is clearly visible. There is also some evidence that \texttt{ChIPseqR} overestimated the variance of the null distribution. As a result of this $p$-values computed from this fit will be conservative. Here we have identified \Sexpr{length(nucs)} binding sites. Clearly some are missing but that is to be expected considering the relatively low coverage and high noise level. We can plot some of the data together with the binding site score and predicted nucleosome positions using \texttt{plotWindow}: <>= predicted <- peaks(nucs)[[1]][911] plot(counts, chr="chr1", center=predicted, score=nucs, width=1000, type="window") @ The output is shown in Figure~2. Note how we use \texttt{peaks} to obtain a list of predicted binding sites. To see how these predictions relate to the actual position of simulated nucleosomes we add some additional markers. <>= abline(v=pos[pos < predicted + 1000 & pos > predicted - 1000], col=3, lty=3) @ \begin{figure}[htb] \centering <>= predicted <- peaks(nucs)[[1]][911] plot(counts, chr="chr1", center=predicted, score=nucs, width=1000, type="window") abline(v=pos[pos < predicted + 1000 & pos > predicted - 1000], col=3, lty=3) @ \caption{Read counts and predicted binding sites. Read counts on forward (red) and reverse (blue) strand are shown as vertical bars. The binding site score is shown in green with green boxes at the bottom indicating predicted nucleosome positions. The true position of simulated nucleosomes is indicated by vertical green lines.} \end{figure} As we can see in Figure~2 some nucleosome predictions overlap. These correspond to alternative positions of the same nucleosome. On real data this may indicate variations in the actual nucleosome position observed in the sample or simply uncertainty due to low coverage, high noise level or inaccurate choices of parameters. We can easily obtain non-overlapping predictions: <>= calls <- peaks(nucs)[[1]][c(1,which(diff(peaks(nucs)[[1]]) >= 170)+1)] length(calls) @ Of course we would like to know how many of these are near actual nucleosome positions. <>= table(sapply(calls, function(x) any((x-20):(x+20) %in% pos))) @ Of the 118 apparent false positives 76 are within 30~bp of a simulated nucleosome. \section{Some details on \texttt{callBindingSites}} Now that we have seen how to obtain binding site predictions it is time to take a closer look at how they are produced. Above we used \texttt{simpleNucCall} to identify nucleosome positions. The actual work required to locate binding sites is carried out by \texttt{callBindingSites}. The process of locating binding sites is divided into several stages: \begin{enumerate} \item Estimate length of binding and support region (if required); \item Calculate binding site score; \item Determine significance threshold; \item Locate significant peaks in binding site score. \end{enumerate} In this section we will take a closer look at these different stages to get a better understanding of how binding sites are identified and briefly introducing the functions used for each stage. \subsection{Estimating parameters} In practice it may not always be obvious what the best choice of parameters is. It is possible to get estimates for both parameters from the data by considering the cross-correlation between the read counts on the forward and reverse strand. The function \texttt{getBindLen} is provided for this purpose. However, it may not always produce good results. The approach taken in \texttt{getBindLen} is to identify the first two peaks in cross-correlation and to derive values for binding and support region length consistent with the location of these peaks. One of the main underlying assumptions is that the distribution of read counts in the support region is symmetric about the centre of the support region and the accuracy of results depends largely on how well this assumption holds. It is also possible to supply one parameter and estimate the other which can improve results substantially if the supplied parameter value is accurate. It should also be noted that binding sites are assumed to occur in regular intervals and that they are located relatively close to each other. This may be approximately true for nucleosomes but does not hold for other proteins of interest. Here is an example of how we might use \texttt{getBindLen} to estimate the parameters for our simulated data. \begin{figure}[htb] \centering <>= bLen <- getBindLen(counts, bind = c(100,200), support = c(5, 50)) @ \caption{Cross-correlation between reads on forward and reverse strand used to estimate length of binding and support regions.} \end{figure} <>= bLen <- getBindLen(counts, bind = c(100,200), support = c(5, 50)) @ This tells \texttt{getBindLen} to look for peaks in the cross-correlation that correspond to a binding region length between 100 and 200~bp and a support region length between 5 and 50~bp. The resulting estimates are \Sexpr{bLen[1]}~bp for the binding region and \Sexpr{bLen[2]}~bp for the support regions. Since we know that the correct answer should be 137 and 20~bp respectively it is clear that this is not perfect. To use this parameter estimation method as part of the binding site prediction we can simply pass \texttt{bind = c(100,200)} and \texttt{support = c(5, 50)} as arguments to \texttt{callBindingSites} (or \texttt{simpleNucCall}). \subsection{Scoring binding sites} During this step the read counts are scanned along the genome and each position is assigned a score indicating how likely it is to be the centre of a binding site. This involves a sliding window partitioned into three regions, the binding region in the centre and forward and reverse strand support regions on either side. Read counts in each region are compared to a background estimate obtained from a larger window (the default is 2000~bp). A score is calculated for each region, with larger scores indication increasing departure from the background distribution, and these are then combined into a binding site score. The necessary computations are carried out by a call to \texttt{startScore}. Further parameters used here are \texttt{bgCutoff} and \texttt{supCutoff}. They can be used to limit the change in the estimate for adjacent background windows and between forward and reverse support regions. These parameters should be chosen between 0.5 and 1. Lower values correspond to a more restrictive cut-off. Setting \texttt{bgCutoff = 0.5} will essentially force the use of a uniform background (which is not recommended) while \texttt{bgCutoff = 1} allows arbitrarily large changes in background estimates. The cut-off for support regions works in a similar way. Generally lower cut-off values reduce the model's ability to adapt to changes in local coverage while larger values make it more sensitive to artificially large read counts observed in some locations. \subsection{Determining significance} Once binding site scores are calculated an attempt is made to assess their significance by estimating the parameters of the null distribution. To achieve this a (truncated) normal distribution is fitted to the left tail of the observed scores. This assumes that low scores are largely unaffected by the presence of binding sites. Computations carried out by \texttt{getCutoff} produce a binding site score cut-off corresponding to the requested false discovery rate and the parameters of the fitted null distribution. \subsection{Peak calling} During the final stage of the analysis significant peaks in the binding site score are identified and reported as predicted binding sites by \texttt{pickPeak}. This simply identifies all peaks that exceed the threshold determined in the previous step and identifies the location of the maximum of each peak as a binding site. Note that this may produce overlapping binding site predictions if two such peaks are located close to each other. By default peaks have to be seperated by at least one value below the threshold but in some cases, e.g. when peaks are relatively wide compared to the length of a binding site, we may want to relax this requirement such that peaks can be separated by a relatively low scoring region even if the score does not drop below the threshold. This can be achieved by using \texttt{sub = TRUE}. \section{Beyond nucleosomes} Although \texttt{ChIPseqR} has been designed to locate nucleosomes it should be possible to apply the same approach to other types of ChIP-seq experiments. The most obvious way to adjust \texttt{ChIPseqR} to different experiments is through the choice of binding and support region length. These relate to the physical length of the binding site and the length of DNA fragments. Generally a shorter binding site will require a shorter binding region but the best choice for this parameter also depends on the experimental protocol used. If DNA is sonicated it may be useful to further reduce the length of the binding region while increasing the length of support regions. Longer support regions will be required if the binding site is short compared to the fragment length. \section{Internal data representation} To reduce memory requirements the internal representation of read counts is compressed through run-length encoding. In some cases this can lead to an increase in required memory. Should this be the case for a particular dataset we can disable compression by calling \texttt{strandPileup} with \texttt{compress = FALSE}. If we just want to obtain an expanded representation of an already existing \texttt{ReadCounts} object we can use \texttt{decompress}: <>= counts2 <- decompress(counts) @ In this case the decompressed representation is \Sexpr{sprintf("%.1f",object.size(counts2)/object.size(counts))} times larger than the compressed one. For real datasets this factor can be substantially larger, especially when coverage is low. A similar approach is used to compress the binding site score. Since the score is a floating point value a step function approximation is used to achieve better compression. This involves rounding the calculated score to a certain number of digits. The use of fewer digits will improve compression but also results in the loss of information and may affect results. The level of compression can be set by passing argument \texttt{digits} to \texttt{callBindingSite} or \texttt{simpleNucCall}. The default is 16 which results in low compression but retains the relevant information. The scoring method used by \texttt{ChIPseqR} is likely to produce missing values, especially in regions of low coverage. This can result in relatively long runs of \texttt{NA}'s that are always compressed. Again it is possible to expand the representation of binding site scores through a call to \texttt{decompress}. Note however that information that was lost during compression can only be restored through recalculating binding site scores at a lower compression setting. It is also possible to compress an expanded representation of read counts or binding site scores through a call to \texttt{compress}. \section{Session info} <>= sessionInfo() @ \end{document}