% % NOTE -- ONLY EDIT bayespeak.rnw!!! % %\VignetteIndexEntry{BayesPeak Vignette} %\VignetteDepends{} %\VignetteKeywords{BayesPeak} %\VignettePackage{BayesPeak} \documentclass{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \usepackage{cite} \usepackage{Sweave} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \SweaveOpts{keep.source=TRUE} \newcommand{\classdef}[1]{ {\em #1} } \begin{document} \title{\Rpackage{BayesPeak}: Bayesian Analysis of ChIP-seq data} \author{Jonathan Cairns and Christiana Spyrou} \maketitle \tableofcontents \section*{Introduction} \Rpackage{BayesPeak} is a Bioconductor package for the analysis of data sets from ChIP-seq experiments, particularly for identifying the genomic sites of protein--DNA interactions. The model is designed for use on transcription factor data sets, and on H3K4me3 histone mark data sets (in which the peak structure is similar to that found in a transcription factor data set). The algorithm models the positions and orientations of the sequenced fragments and determines the locations of enriched areas, corresponding to binding sites and histone modifications, by using a hidden Markov model and Bayesian statistical methodology. The Bayesian approach to parameter and state estimation returns posterior probabilities as measure of certainty, and offers great scope for interpretation, as well as allowing for the use of these probabilities as weights in subsequent analyses (e.g. motif discovery). The other important feature of the algorithm is the use of the negative binomial distribution to model the counts of sequenced reads. This allows for overdispersion and provides a better fit to the data than the Poisson distribution that has been widely used by other methods. \section{What is ChIP-seq?} Chromatin ImmunoPrecipitation (ChIP) is an experiment designed to study protein-DNA interactions, particularly to identify the genomic sites where proteins, such as transcription factors, bind to the DNA, or sites where histone modifications occur \citep{Robertson2007}. The experiment produces samples that are enriched for the sites of interest compared to the rest of the genome. The use of this method combined with high-throughput sequencing of the samples is referred to as ChIP-seq. Given our protein of interest, the ChIP-seq protocol usually consists of the following steps. (The exact protocol may vary between different experiments, but \Rpackage{BayesPeak} will still be able to perform the peak-calling step.) \begin{itemize} \item Cross-linking the proteins to the DNA - The protein is permanently bound to the DNA, usually with formaldehyde. \item Shearing - The cells are lysed, and the DNA is randomly cut into small fragments by sonication. \item Immunoprecipitation - An antibody specific to the protein of interest is used to isolate the protein and the attached DNA fragments. The resulting sample is enriched for those genomic regions. If we are instead interested in locating histones with a certain epigenetic modification, then we use an antibody specific to histones with that modification. \item Reverse crosslinking and purification - The bonds between the protein and DNA are broken. The DNA is subsequently purified. \item Sequencing - The contents of the samples are size selected such that the fragments' length lies in the region of 200-300 bp. This step is required by the sequencing protocol. Adaptors are attached to the fragments and amplification usually takes place. Subsequently, the sample undergoes ``high-throughput sequencing" during which the sequences of the ends of the present fragments are identified. For ChIP-seq applications usually one end of each fragment is sequenced to produce ``single-end" reads. \item Alignment - The DNA is aligned back to reference genome, taking the quality of the reads into account. Usually only the reads that map to unique locations in the genome are included in the downstream analysis. \item Analysis - The sites of interest correspond to the genomic regions where there is high abundance of reads compared to the background or the control sample. \Rpackage{BayesPeak} performs this identification step, also referred to as ``peak-calling". \end{itemize} Usually this process is repeated omitting the immunoprecipitation step to produce a sample with no preferential enrichment. This control sample has the same characteristics as ChIP-seq data and its inclusion is important to identify experimental biases. There are many sources of error - for example, misalignment, impurities, or DNA that simply has a high affinity for being sequenced - which can result in noise across the genome, or even false peaks. The \Rpackage{BayesPeak} model is designed to separate the peaks from the noise, and to avoid calling false peaks. \section{Simple workflow} \label{workflow} Load the package as follows: <<>>= library(BayesPeak) @ The example data set used below, consisting of the files ``H3K4me3-chr16.bed" and ``Input-chr16.bed", can be downloaded from \newline\texttt{http://www.compbio.group.cam.ac.uk/Resources/BayesPeak/csbayespeak.html}. These data were generated from a histone mark ChIP-seq experiment, performed on mouse chromosome 16. (Other data sets will contain data from multiple chromosomes - there is no need to split these up.) %FIXME The following code is a very simple example of a \Rpackage{BayesPeak} workflow, where we analyse the region 92,000,000 - 95,000,000 bp on chromosome 16. It should take a couple of minutes on a relatively modern machine. <>= raw.output <- bayespeak("H3K4me3-chr16.bed", "Input-chr16.bed", chr = "chr16", start = 9.2E7, end = 9.5E7, job.size = 6E6) output <- summarize.peaks(raw.output, method = "lowerbound") @ \begin{itemize} \item \texttt{bayespeak()} runs the actual BayesPeak algorithm on the data. The two input files are bed files located in the working directory, with H3K4me3 being the IP-treated data set and Input ID being the control data. In each case, we could have provided a data.frame or a RangedData object (from the \Rpackage{IRanges} package) instead of a file path. (A GRange from the \Rpackage{GenomicRanges} package can trivially be coerced to a RangedData object, using the \texttt{as()} function.) In a valid bed file, each row of the file contains information for a single read. In particular, the chromosome, start position, end position and DNA strand appear in the 1st, 2nd, 3rd and 6th columns respectively. The function applies the algorithm to 6 Mb partitions of the genome (or ``jobs") by default, as explained in section \ref{algorithm}. \item \texttt{raw.output} is a list - it contains not only the bins called, but also some useful QC information (such as the model fit - in particular, this can be used to spot unreliable jobs as described in section \ref{overfitting}). This output needs to be summarized. \item \texttt{summarize.peaks()} is used to summarize the \texttt{raw.output} object. This consolidates the raw bin calls into peaks and combines data across jobs. \end{itemize} We can analyse all of the data present in the .bed file with the following code, although this will take somewhat longer. <>= raw.output <- bayespeak("H3K4me3-chr16.bed", "Input-chr16.bed") output <- summarize.peaks(raw.output, method = "lowerbound") @ A parallelization strategy is available to reduce the running time of this process, which is given in section \ref{MC}. \subsection{Exporting output} The output can now be saved to disk with the \texttt{export()} function from the \Rpackage{rtracklayer} package. Many standard file formats are supported - for example: <>= library(rtracklayer) export(output, "H3K4me3output.bed") export(output, "H3K4me3output.gff") @ These examples, however, will only output peak regions - they will not export the posterior probability associated with each peak. To export the full output, one can coerce the output to a data frame, and export this with standard output functions provided in R. For example: <>= write.table(as.data.frame(output), file = "H3K4me3output.txt", quote = FALSE) write.csv(as.data.frame(output), file = "H3K4me3output.csv", quote = FALSE) @ We now go into this workflow in more depth. \section{The algorithm} \label{algorithm} \Rpackage{BayesPeak} fits a Markov model to the data (the aligned reads) via Markov Chain Monte Carlo (MCMC) techniques. The genome is firstly divided up into ``jobs", i.e. short regions, by default of size 6 Mb, on which the algorithm is run independently. This allows us to account for the variation in read abundance across each chromosome. Within a job, we divide the region into small bins (by default, 100 bases each), and we consider the number of reads whose starts lie within each bin, for each strand. A hidden Markov model is fitted to the bins, thereby classifying them as enriched or unenriched for sites of interest. However, since the parameters of the model are unknown (for example, the mean number of counts within enriched or unenriched bins), we estimate them by sampling from their posterior distributions using MCMC methods. The output of the algorithm is the Posterior Probability (often abbreviated to \emph{PP}) of each bin being enriched. The \emph{PP} value is useful not only for calling the peaks, but could also be used in downstream analyses - for example, to weight observations when searching for a novel transcription factor motif. The \emph{PP} value is not to be confused with the \emph{p} value from hypothesis testing. For a full description of the model, please refer to \citep{spyrou2009bayespeak}. \section{The \texttt{bayespeak()} function} The \texttt{bayespeak()} function performs the algorithm described above on a given experiment, consisting of either one or two data sets - a treated data set must be supplied, which can optionally be complemented by a control data set. Each of these data sets must be in .bed format, and can be specified as a file location, a data.frame containing the columns ``chr", ``start", ``end", ``strand" (specified as forward, ``+", or reverse, ``-"), or a RangedData object from the \Rpackage{IRanges} package in BioConductor. For example: <>= raw.output <- bayespeak("H3K4me3-chr16.bed", "Input-chr16.bed") @ As mentioned in section \ref{algorithm}, we break down the chromosome into ``jobs", which by default are of length 6 Mb, and run the algorithm on each job to account for the variability of conditions across the chromosome. Thus, each job has its own set of associated parameters. Within each job, read abundance is modelled using non-overlapping bins. To avoid missing any peaks that straddle a boundary between two bins, we run the algorithm a second time. This second job is described as ``offset" and shifts the boundaries of the bins by half a bin's length. The output of this function is a list of four things: \begin{itemize} \item \texttt{raw.output\$peaks}: Locations of ``potentially enriched" bins, with their associated posterior probabilities. (A ``potentially enriched" bin is defined as any bin with \emph{PP} > 0.01.) Note that this is output is preliminary and does not correspond to the final result of the analysis. \item \texttt{raw.output\$QC}: Information about the individual jobs. \item \texttt{raw.output\$call}: A record of the arguments used when the function was called. \item \texttt{raw.output\$p.samples}: A list of parameter samples from the MCMC runs. This output can be used to assess convergence e.g. by applying the Geweke test through the CRAN packages \Rpackage{coda} or \Rpackage{boa}. See section \ref{convergence} for more details. \end{itemize} %% \subsection{Parallelizing \Rpackage{BayesPeak}} \label{MC} Due to its computational intensity, \Rpackage{BayesPeak} can be slow. However, the jobs can be run in parallel. This allows us to take advantage of multiple processors and dramatically reduce the time the algorithm takes. \Rpackage{BayesPeak} optionally supports two packages for parallelization: \Rpackage{multicore} and \Rpackage{snow}. Note that if both options are enabled, then the \Rpackage{snow} cluster will take precedence. \subsubsection{multicore} At the time of writing, the \Rpackage{multicore} package can be downloaded from \texttt{http://www.rforge.net/multicore/} and is available for Mac or Linux. \begin{sloppypar}Having installed \Rpackage{multicore}, you can run the \texttt{bayespeak()} function in parallel by using the \texttt{use.multicore = TRUE} option. You can override the number of cores that multicore uses with the \texttt{mc.cores} argument.\end{sloppypar} <>= library(multicore) raw.output <- bayespeak("H3K4me3-chr16.bed", "Input-chr16.bed", use.multicore = TRUE, mc.cores = 4) output <- summarize.peaks(raw.output, method = "lowerbound") @ \subsubsection{snow} With the use of functionality from the \Rpackage{snow} package, \Rpackage{BayesPeak} can be run on a cluster of machines. The \Rpackage{snow} package is available from CRAN and, therefore, can be directly installed with the command \texttt{install.packages("snow")}. One can create a cluster and use it to parallelize the \texttt{bayespeak()} function. For example: <>= library(snow) cl <- makeCluster(4, type="SOCK") ##adapt this line as appropriate for your cluster clusterEvalQ(cl, library(BayesPeak)) ##load BayesPeak on the cluster raw.output <- bayespeak("H3K4me3-chr16.bed", "Input-chr16.bed", snow.cluster = cl) output <- summarize.peaks(raw.output, method = "lowerbound") @ NB: If \Rpackage{snow} support is active, then \texttt{bayespeak()} will not display the progress that has been made through a chromosome. It will, however, display when a chromosome is complete. %% \section{The \texttt{summarize.peaks()} function} The raw output of \texttt{bayespeak()} consists of the details of all of the individual bins in each individual job. The function \texttt{summarize.peaks()} combines the output of the individual jobs and joins adjacent bins into contiguous regions to give the final peak calls. In more detail, \texttt{summarize.peaks()} does the following: \begin{itemize} \item Filtering of unenriched jobs: The model naturally tries to identify enriched states in the background even for jobs with consistently low read abundance. This can result in many unreliable calls and, to avoid this, all calls associated with such jobs can be removed. (This issue, ``overfitting", is described in detail in section \ref{overfitting}.) \item Filtering of unenriched bins: We remove all bins whose \emph{PP} values are below a certain value (see section \ref{PPthresh}). \item Assembly of enriched bins: The bins across all remaining jobs are collected together. If two jobs call exactly the same bin, which could happen in regions where jobs overlap, then the one with the larger \emph{PP} value is used. \item Conversion of bins to peaks: Where a number of bins form a contiguous region, we combine them into one large peak. The large peak is assigned a \emph{PP} value, based on the \emph{PP} values of its component bins. Combining the \emph{PP} values from multiple bins may be done in several ways. By default, the ``lowerbound" method is used, which calculates a lower bound for the overall \emph{PP} by a dynamic programming technique. (See the help file for more information.) The alternative method that the package provides is the ``max" method, which simply takes the maximum \emph{PP} value in the component bins and is therefore ill-equipped to provide appropriate credit for sustained regions of moderate enrichment. \end{itemize} \section{Examining the results} The quality of ChIP-seq samples should be monitored to ensure that the experimental procedure was successful, in terms of the immunoprecipitation, the library preparation and the sequencing. Occasionally, very low read abundance in the data sets may be related to a failed step in the process and will lead to unreliable findings when the experimental noise and signal cannot be distinguished. Moreover, care should be taken when even successful ChIP-seq samples are analysed and the results are interpreted. In the following sections we discuss the stringency with which the potential peaks are chosen and how to avoid false peaks being called in regions of ``noisy background". Enrichment varies along the genome and is naturally low in long regions including the centromere and the ends of the chromosomes. If an entire job is contained in such a region, the model will still try to identify peaks and in the next sections we explain how to identify and filter out these false discoveries. \section{Choosing an appropriate \emph{PP} threshold value} \label{PPthresh} Selecting the threshold for \emph{PP} values (i.e. the threshold argument in \texttt{summarize.peaks()}) is an important step in the analysis, and its choice will affect the peaks returned. First, we will observe how the \emph{PP} values are distributed within jobs. For this section, we load the example \texttt{raw.output} object included with the package: <>= data(raw.output) raw.output$call @ (Note that although \texttt{raw.output} was generated from running \texttt{bayespeak()} with the above arguments, \texttt{raw.output\$peaks} has been manually reduced to only contain the data from chr16 to save space. \texttt{raw.output\$p.samples} has also been reduced, for the same reason.) %From this object, extract the \emph{PP} values for each job with the following: %<>= %PP <- split(raw.output$peaks$PP, raw.output$peaks$job) %@ %Recall that any bin with \emph{PP} < 0.01 was removed during the initial analysis. Therefore, to show the full range of the \emph{PP} values, we assume that all of the missing bins have \emph{PP} = 0 and reinstate them: %<>== %bin.width <- raw.output$peaks$end[1] - raw.output$peaks$start[1] %job.bins <- (raw.output$QC$end - raw.output$QC$start)/bin.width %job.bins <- job.bins[as.integer(names(PP))] %for(i in 1:length(PP)) %{ % PP[[i]] <- c(PP[[i]], rep(0, job.bins[i] - length(PP[[i]]))) %} %@ %We can view each job in turn as follows: %<>== %par(mfrow = c(2,2), ask = TRUE) %for(i in 1:length(PP)) hist(PP[[i]], breaks =150, main = names(PP)[i]) %@ We can view the PP values generated from each job as follows: <>== min.job <- min(raw.output$peaks$job) max.job <- max(raw.output$peaks$job) par(mfrow = c(2,2), ask = TRUE) for(i in min.job:max.job) {plot.PP(raw.output, job = i, ylim = c(0,50))} @ (We will not wish to do this in most large data sets, as the number of jobs will be prohibitively large, making it impractical to look at each plot in turn.) Examining the plots, they appear to have two distinct behaviours, as explained below. \subsubsection*{High density of reads} For example, consider the \emph{PP} profile of job 324: %<>= %i <- 9 %hist(PP[[i]], breaks =150, main = names(PP)[i], ylim= c(0,50)) %@ <>= i <- 324 plot.PP(raw.output, job = i, ylim = c(0,50)) @ The plot is given in Figure \ref{fig:job324} \begin{figure} \begin{center} <>= <> @ \end{center} \caption{A histogram of the \emph{PP} values found in job 324. Nearly all of the bins have a \emph{PP} of 0 or 1.} \label{fig:job324} \end{figure} For data with a high density of reads, peaks are clearer and thus the \emph{PP} values of bins converge to 0 or 1, as observed in the figure above. So, the default \emph{PP} threshold value of 0.5 will suffice to distinguish between enrichment and unenrichment. This is the expected outcome from jobs that include enriched peaks. \subsubsection*{Low density of reads} On the other hand, as expected before, some jobs fall in areas of low enrichment and this affects their results. We demonstrate this by considering the \emph{PP} values across job 325: %<>= %i <- 10 %hist(PP[[i]], breaks =150, main = names(PP)[i], ylim= c(0,50)) %@ <>= i <- 325 plot.PP(raw.output, job = i, ylim = c(0,50)) @ The plot is given in Figure \ref{fig:job325}. \begin{figure} \begin{center} <>= <> @ \end{center} \caption{A histogram of the \emph{PP} values found in job 325. The PP values are more evenly distributed over the interval (0,1) than they are in Figure \ref{fig:job324}.} \label{fig:job325} \end{figure} When the coverage is sparse and therefore less information is available, the \emph{PP} values tend to be more uniformly spread over the interval [0,1], as above. This means that the distinction between peaks and background is harder to make, which is usually a result of poor enrichment, if any. Such a uniform trend may be indicative of overfitting - we explore this effect in section \ref{overfitting}. In these jobs even the bins with highest \emph{PP} cannot be classified as enriched since they are likely to correspond to random noise and, instead of choosing a high \emph{PP} threshold, the whole job should be considered as unenriched. \section{Overfitting} \label{overfitting} \Rpackage{BayesPeak} can run into an overfitting problem when a job does not contain any peaks, or when peaks are weak compared to the background. For this section, we again load the example \texttt{raw.output} object included with the package: <>= data(raw.output) @ The model assumes that there are both unenriched and enriched regions present. When the data contains no enriched regions, the model still tries to identify peaks in the data. Since some bins in the background will have higher counts than others, purely by chance, these will be marked as enriched. (See Figure \ref{fig:OF}.) %examples \begin{figure} \begin{center} \includegraphics{OF.png} \end{center} \caption{An example of overfitting. In each of these histograms, only reads on the positive strand are shown for clarity. A blue bar indicates that \Rpackage{BayesPeak} has called that bin as containing a peak at \emph{PP} > 0.5. On the left, BayesPeak uses the enriched and unenriched states to explain the variance present in the background. On the right, the algorithm correctly identifies the large peaks present in the region of interest. (Notice the difference in scale between the two plots.) These plots were produced with the \texttt{plot.job()} function.} \label{fig:OF} \end{figure} This effect will be reflected in the parameters of the model, since the expected number of reads allocated at ``enriched" regions will be much lower for the jobs where no peaks are present compared to the other jobs. This is one purpose of the QC component of the output - we can diagnose which peaks were called simply because they are in an unenriched job rather than because they are actual peaks. Jobs suffering from this problem tend to exhibit three properties: \begin{itemize} \item Unusually large number of $calls$. \item Low mean number of counts in an enriched bin, defined as $\lambda_1$ in the model description in \citep{spyrou2009bayespeak}. \item \emph{PP} values spread out over the interval [0,1] rather than mostly falling at 0 or 1, as explained in section \ref{PPthresh}. We quantify this using a score: of the bins with \emph{PP} values > 0.01, the score is the proportion that have \emph{PP} > 0.5. A low score is therefore indicative of overfitting. \end{itemize} We can observe when overfitting has occured in our data set by plotting these properties against each other. We generate two such plots with this code: %<>= %plot(log(raw.output$QC$calls), log(raw.output$QC$lambda1), % main = "Job parameters - enriched bin counts against calls") %@ %<>= %plot(log(raw.output$QC$calls), log(raw.output$QC$score), % main = "Job parameters - score against calls") %@ <>= plot.overfitdiag(raw.output, whatX = "calls", whatY = "lambda1", logX = TRUE, logY = TRUE) @ <>= plot.overfitdiag(raw.output, whatX = "calls", whatY = "score", logX = TRUE, logY = TRUE) @ The plots generated are given in Figures \ref{fig:OF3} and \ref{fig:OF3b}. Two clusters are visible in each, one of which is a cluster of jobs suffering from overfitting. \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Each point represents a job. On the X axis, we plot the log of the number of bins with \emph{PP} > 0.01 in that job. On the Y axis, we plot the log of the mean number of counts in enriched bins. The plot is on the log scale to aid visualisation of the clusters.} \label{fig:OF3} \end{figure} \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Each point represents a job. On the X axis, we plot the log of the number of bins with \emph{PP} > 0.01 in that job. On the Y axis, we plot the log of the score of that job (i.e. of the bins with \emph{PP} > 0.01, this is the proportion of those bins with \emph{PP} > 0.5).} \label{fig:OF3b} \end{figure} It is worth trying different arguments to this function to find the plot in which the overfit clusters are most clearly defined - e.g. <>= plot.overfitdiag(raw.output, whatX = "calls", whatY = "score", logX = TRUE, logY = FALSE) @ or <>= plot.overfitdiag(raw.output, whatX = "lambda1", whatY = "score", logX = TRUE, logY = TRUE) @ (These figures are not shown in this vignette.) Some jobs are expected to show no enrichment throughout, such as the centromere, and the location of these regions can be taken into account at a later version of the algorithm. At the moment, our approach is to apply BayesPeak to the whole genome and subsequently to filter out peaks that are allocated within regions of no enrichment. \subsection{Excluding calls from unenriched jobs} We can choose to simply remove all calls from the jobs that we believe to be overfit. For example, having looked at Figure \ref{fig:OF3}, we can specify the overfit cluster by removing all jobs with low counts in their enriched bins - for example, $\log(\lambda_1) < 1.5$: <>= unreliable.jobs <- log(raw.output$QC$lambda1) < 1.5 output <- summarize.peaks(raw.output, method = "lowerbound", exclude.jobs = unreliable.jobs) @ Alternatively, from looking at either Figure \ref{fig:OF3} or Figure \ref{fig:OF3b}, we could try to specify the overfit cluster better by adding jobs with excessive numbers of calls e.g. $\log(calls) > 5$. This gives us two selection criteria as follows: <>= unreliable.jobs2 <- log(raw.output$QC$lambda1) < 1.5 | log(raw.output$QC$calls) > 5 output.2 <- summarize.peaks(raw.output, method = "lowerbound", exclude.jobs = unreliable.jobs2) @ Another method provided by the \Rpackage{BayesPeak} package is the ability to define an overfit cluster on one of the above plots by drawing a polygon onto it. This is achieved via use of the \texttt{region.overfitdiag()} function (using exactly the same arguments as \texttt{plot.overfitdiag()}), which subsequently collects the job IDs of any points in that polygon for use as the \texttt{exclude.jobs} argument in \texttt{summarize.peaks}. To perform this procedure on Figure \ref{fig:OF3b}, we would use the following code: <>= unreliable.jobs3 <- region.overfitdiag(raw.output, whatX = "lambda1", whatY = "score", logX = TRUE, logY = TRUE) ##user defines a polygon on the resulting plot ##left-click to place each vertex, right-click to close polygon output.3 <- summarize.peaks(raw.output, method = "lowerbound", exclude.jobs = unreliable.jobs3) @ An example of a region that we might define on the plot is given in Figure \ref{fig:region.OF}. %examples \begin{figure} \begin{center} \includegraphics{regionOFdiag.pdf} \end{center} \caption{Application of \texttt{region.overfitdiag()} to the \texttt{raw.output} data. Jobs in the red hatched region will be excluded from the summary.} \label{fig:region.OF} \end{figure} Finally, for a more reproducible method of defining the overfit cluster, one could use one's favourite clustering method on the rows of the QC data, obtained as follows: <>= log(raw.output$QC[,c("calls", "score", "lambda1")]) @ Of course, the drawback of such methods is that they have no concept of selecting conservatively (i.e. picking a small overfit cluster) and may be confused by strangely-shaped or close-together clusters - therefore, direct use of output from clustering algorithms should be treated with care. %%example %\subsection{Choose a stronger prior} %For the mathematically inclined, choosing a strong enough prior on $\lambda_0$, $\lambda_1$ may overcome this problem. Under the prior, %$$\lambda_i \sim Gamma(\alpha_i,\beta_i)$$ %with %$$\alpha_i \sim Gamma(\alpha_{\alpha_i}, \beta_{\alpha_i})$$ %$$\beta_i \sim Gamma(\beta_{\beta_i}, \beta_{\beta_i})$$ %Note that $\beta$ is always a scale parameter and not a rate parameter. You can specify a prior by specifying this vector: %$$prior = \left(\alpha_{\alpha_0}, \beta_{\alpha_0}, \alpha_{\beta_0}, \beta_{\beta_0}, \alpha_{\alpha_1}, \beta_{\alpha_1}, \alpha_{\beta_1}, \beta_{\beta_1}\right)$$ %If the expectation of any of these variables is too small (i.e. $\alpha_?\beta_? < 0.0001$) then this will result in an error (this is to avoid division by 0 when Gibbs Sampling). %<>= %output <- bayespeak("H3K4me3-chr16.bed", "Input-chr16.bed", prior = c(5, 5, 10, 5, 25, 4, 0.5, 5)) %eps <- 1E-6 %output <- bayespeak("H3K4me3-chr16.bed", "Input-chr16.bed", prior = c(5, 5, 10, 5, 2.5/eps, eps, 0.25/eps, eps)) %output <- combine.peaks(output, method = "lowerbound") %@ %In the second case, by making the variance of $\alpha_1$ and $\beta_1$ very small, we have essentially set: $$\lambda_1 \sim gamma(2.5, 0.25)$$ \section{Assessing convergence} \label{convergence} If desired, we can assess the convergence of the MCMC chains by examining the parameter samples obtained. The output of the parameter samples from job $i$ is given in \texttt{raw.output\$p.samples[[i]]}, with half of the data discarded as burn-in and samples taken every 10 iterations. This data can be imported into MCMC packages, such as the CRAN packages \Rpackage{coda} and \Rpackage{boa}, allowing use of their inbuilt convergence assessment functions. For example, we can assess the convergence of job 316 with \Rpackage{coda}, using the Geweke diagnostic, with the following code. This code requires that \Rpackage{coda} has been installed from CRAN. <>= data(raw.output) library(coda) mcmc.job1 <- mcmc(raw.output$p.samples[[316]], thin = 10) geweke.diag(mcmc.job1) @ \begin{verbatim} Fraction in 1st window = 0.1 Fraction in 2nd window = 0.5 p theta a0 b0 lambda0 a1 b1 lambda1 0.6900 -0.1239 -0.4860 -0.7983 1.9774 0.5859 -0.3364 0.3297 loglhood 0.3716 \end{verbatim} More information on the interpretation of this diagnostic can be found in the individual packages' help files, and in \citep{Geweke1992}. If convergence is deemed unsatisfactory, this may be solved by increasing the \texttt{iterations} parameter in the \texttt{bayespeak()} function. \section{Citing \Rpackage{BayesPeak}} \label{cite} \begin{flushleft} If you use the \Rpackage{BayesPeak} algorithm, then please cite \citep{cairns2011bayespeak} and \citep{spyrou2009bayespeak}. \end{flushleft} %\section{Asking for help on \Rpackage{BayesPeak}} %Wherever possible, please send all queries about \Rpackage{BayesPeak} to the %Bioconductor mailing list at {\tt bioconductor@stat.math.ethz.ch}. This will %help maintain a searchable archive of questions and responses. %When posting to the list, please include the commands you used along with the %version of \Rpackage{BayesPeak} and {\tt R} you are working with. %Version information can be obtained by running the following command: %<>= %sessionInfo() %@ \section{Session Info} <<>>= sessionInfo() @ \section{Acknowledgements} Many thanks to Dr. Duncan Odom's group for permission to use the H3K4me3 data set used in our examples, and to Dr. Jason Carroll's group for permission to use the data set shown in the \texttt{raw.output} data file. \bibliographystyle{plainnat} \bibliography{BayesPeak} \end{document}