%\VignetteIndexEntry{The PING users guide} %\VignetteDepends{PING,snowfall,parallel} %\VignetteKeywords{Preprocessing, ChIP-Seq, Sequencing} %\VignettePackage{PING} \documentclass[11pt]{article} \usepackage{hyperref} \usepackage{url} \usepackage{color, pdfcolmk} \usepackage[authoryear,round]{natbib} \bibliographystyle{plainnat} \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \author{Xuekui Zhang\footnote{ubcxzhang@gmail.com} and Raphael Gottardo\footnote{rgottard@fhcrc.org}} \begin{document} \title{PING: Probabilistic Inference for Nucleosome Positioning with MNase-based or Sonicated Short-read Data.} \maketitle \textnormal {\normalfont} A step-by-step guide to inferring nucleosome positioning with high-throughput sequencing data using the PING package in R. %A step-by-step guide in the nucleosome positioning with high-throughput sequencing data using the PING package in R. \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newpage \section{Licensing and citing} Under the Artistic License 2.0, you are free to use and redistribute this software. %However, If you use this package for %your own a publication, we would ask you to cite the following: %paper: \begin{itemize} \item[] Xuekui Zhang, Gordon Robertson, Sangsoon Woo, Brad G. Hoffman, and Raphael Gottardo. Probabilistic Inference for Nucleosome Positioning with MNase-based or Sonicated Short-read Data. Submitted. \end{itemize} \section{Introduction} The structural unit for chromatin packaging is the nucleosome, which is composed of approximately 147 bps of DNA wrapped around a core histone octamer. Nucleosome-associated DNA is less accessible to regulatory proteins like transcription factors, and nucleosome positioning, as well as histone modifications and histone variants (e.g. H2A.Z, H3.3), are therefore influential in cellular processes that depend on chromatin accessibility. %\cite{Jiang:2009p1001, RadmanLivaja:2010p2431, Conerly:2010p3414, Elsaesser:2010p3415, Goldberg:2010p3416, Heintzman:2009p3417}. Because nucleosome positions depend on cellular processes as well as intrinsic factors (e.g. DNA sequence), understanding how these positions influence cell states can require determining nucleosome positions within individual genomic regions.% \cite{Zhang:2011p3428}. Currently, genome-wide nucleosome-based data are typically generated by high-throughput single-end short-read sequencing of DNA obtained by either MNase digestion (MNase-seq), or chromatin immunoprecipitation (ChIP-seq) of MNase-digested or sonicated DNA. MNase digests linker DNA with relatively high specificity, %\cite{Clark:2010p2479}, and this specificity is reflected in the narrow spatial distribution of aligned reads. However, sonication protocols are widely used; for example, in work to identify classes of functional genomic regions by integrated analysis of diverse sets of short-read sequence data. % \cite{modENCODEConsortium:2010p3439, Ernst:2010p3421, Bernstein:2010p3346}. PING is a probabilistic method that extends PICS (Zhang et al. Biometrics, 2011, 67(1):151-63), which was developed for ChIP-seq data for transcription factors. PING infers nucleosome positions from either MNase-digested or sonicated nucleosome-based short-read data. Like PICS, PING models bi-directional read densities, uses mixture models, and imputes missing reads; however, it uses a new prior specification for the spatial positioning of nucleosomes, has different model selection criteria, model parameters, and different post-processing for estimated parameters. In addition, PING also includes novel statistical methods to identify nucleosomes whose read densities are lower than those of neighboring nucleosomes. %\section{PING pipeline} \section{PING analysis steps} A typical PING analysis consists of the following steps: \begin{enumerate} \item Convert aligned-read data to a `GenomeData' list for efficient processing \item Segment the genome into candidate regions that have sufficient aligned reads via `segmentReads' \item Estimate nucleosome positions and other parameters with PING \item Post-process PING predictions to correct certain predictions % \item Repeat 2-4 by swapping the IP and Control samples (optional) % \item Use 2-5 to estimate the FDR via `picsFDR' (optional) % \item Output enriched regions and `PING' profiles with bed- and wig-format files \end{enumerate} As with any R package, you should first load it with the following command: <>= library(PING) @ \section{Data Input and Formatting} The first step of the PING pipeline consists of converting the files representing aligned reads for IP and optional Control samples into a format that allows efficient segmentation of the genome into a set of candidate regions, each of which has at least a threshold number of Forward and Reverse reads. The data format could be a BED type dataframe, or an `AlignedReads' object returned by the `ShortRead' package's `ReadAligned' function, which can read various file formats including Eland, MAQ, Bowtie, etc. Please refer to the `ShortRead' package vignette for details. % on how to read data into R from formats other than a BED file. In the example listed below, we use tab-delimited BED type files, in which each line represents a single read, and has the following columns: space, start, end, strand. In addition to the IP and Control datasets, we use an another datatype that represents a read-mappability profile for the genome. %interrogated. As explained in the PICS paper, for each chromosome, a mappability profile for a specific read length consists of a vector of zeros and ones that gives an estimated read mappability `score' for each base pair in the chromosome. A score of one at a position means that we should be able to align a read of that length uniquely at that position, while a score of zero indicates that no read of that length should be uniquely alignable at that position. In ChIP-seq work, reads that cannot be mapped to unique genomic locations are typically discarded. For convenience, and because transitions between mappable and non-mappable regions are typically much shorter than these regions, we compactly summarize each chromosome's mappability profile as a set of %disjoint union of non-mappable intervals in a BED-format file. %that specify only zero-valued profile. We store this information as a BED file where each line corresponds a non-mappable interval. PING makes use of such mappability profiles to estimate the reads that are missing because they could not be uniquely aligned. %For more details on mappability files, please refer to \citep{}/. Mappability profiles for a range of read lengths and species are available from \begin{verbatim} http:/wiki.rglab.org/index.php?title=Public:Mappability_Profile \end{verbatim} %Once the data have been read in, we can create the list of `GenomeData' object where each element of the list corresponds to a different chromosome. Once the data have been read in, we can create a list of `GenomeData' objects, in which each element of the list corresponds to a different chromosome, and contains the aligned IP and Control reads, and the mappability profile, for the corresponding chromosome. When reading the data, reads are automatically sorted, which is required for a PING analysis. As well, we recommend that the number of duplicates be controlled by using PING's default protocol-specific thresholds. %When reads are duplicated too many times, we recommend users to collapse them, since that might be artificial effect. We supply a demonstration set of MNase-seq data for chr1 from budding yeast GSM351492, NOCL\_R4 (Kaplan et al, Nature, 2009, 458(7236):362-6). The demonstration data is in the 'extdata' folder of the PING installation: .../PING/inst/extdata. <>= path<- system.file("extdata",package="PING") #Read the experiment : dataIP<-read.table(file.path(path,"GSM351492_R4_chr1.bed"),header=TRUE) dataIP<-as(dataIP,"RangedData") dataIP<-as(dataIP,"GenomeData") @ %<>= %map<-read.table(paste(path,"mapProfileShort",sep=""),header=TRUE,colClasses=c("factor","integer","integer","NULL")) %map<-as(map,"RangedData") %## Remove the chrM %map<-map[-23] %@ \section{PING analysis} \subsection{Genome segmentation} We segment the genome %into candidate regions by pre-processing %bidirectional aligned read data from a single-end ChIP-Seq experiment to detect `candidate regions' that have a minimum number of forward and reverse reads. Each candidate region will then be processed separately by PING. The segmentation `minReads' parameter can heavily influence the number of candidate regions returned. In principle, it is better to use a small value for `minReads' in order not to miss any nucleosomes. %true binding sites. However, too small a value will likely return many false positive regions, which can result in a longer computing time for the \texttt{PING} function. %Table , gives guidelines on how to chose `minReads' as a function of the number of aligned reads in a ChIP-seq dataset. %If a Control sample is available you may also chose a value of `minReads' that gives you a ratio of about 25\% for the number of candidate regions in the control over the number of candidate regions in the IP. A ratio larger than 25\% would most likely result in a large FDR when using \texttt{picsFDR}, and therefore it is not necessary to use too low of a threshold. If users are not sure what values to use, they can assign `minReads=NULL', so that the value will be estimated from data. See below for an illustration. \newline In order to improve the computational efficiency of the PING package, if you have access to multiple cores we recommend that you do parallel computations via the \texttt{snowfall} package. %, which simplifies doing parallel computations. In what follows, we assume that \texttt{snowfall} is installed on your machine. If it is not, you could omit the first three lines, and calculations will occur on a single CPU. %which will lead a serial execution. By default the command is not run. Note that the \texttt{segmentReads} and \texttt{PING} functions will automatically detect whether you have initialized a cluster and will use it if you have. <>= library(snowfall) sfInit(parallel=TRUE,cpus=2) sfLibrary(PING) @ <>= seg<-segmentReads(dataIP, minReads=NULL, maxLregion=1200,minLregion=80, jitter=T) @ \subsection{Parameter estimation} After having segmented the genome into candidate regions, we use the \texttt{PING} function to probabilistically detect positioned nucleosomes. %binding regions %in a way. The function returns position estimates, confidence intervals, etc. %In the code below, we assume that you are processing nucleosome data. In our case, assuming that we have already segmented the genome using \texttt{segmentReads}, we can proceed with the following command: <>= ping<-PING(seg) @ %\subsection{FDR estimation} %In order to estimate the FDR, we need to rerun our analysis after swapping the IP and control samples. Note that this requires the presence of a control sample. We proceed with the following commands to compute the FDR after removing binding site estimates with noisy parameters as specify by the filter. % %<>= %segC<-segmentReads(dataCont,dataC=dataIP,map=map,minReads=1) %picsC<-PING(segC,dataType="TF") %fdr<-picsFDR(pics,picsC,filter=list(delta=c(50,Inf),se=c(0,50),sigmaSqF=c(0,22500),sigmaSqR=c(0,22500))) %@ % %Then once the FDR has been estimated, one can visualize the FDR as a function of the score to pick a threshold that would lead to the desired FDR. % %<>= %plot(pics,picsC,xlim=c(0,50),ylim=c(0,.2),filter=list(delta=c(50,300),se=c(0,50),sigmaSqF=c(0,22500),sigmaSqR=c(0,22500)),type="l") %@ % %You can also visualize the FDR as a function of the number of regions. % %<>= %plot(fdr[,c(3,1)]) %@ \section{Post-processing PING results} Given the variation in nucleosome-based short-read data, some of PING's predictions may be inaccurate, and we need to detect and resolve such problematic cases. %For example, when one nucleosome is supported by relatively strong aligned-read signals, the mixture model may fit it well, while fitting adjacent nucleosomes that have weaker signals more poorly. %when one nucleosome has a much stronger signal, the mixture model might try to obtain best fit its forward and reverse peaks, but fit other much weaker nucleosomes badly. %In such a case a nucleosome might be fitted with two forward mixture components and one reverse mixture component. This will cause a mismatch of forward/reverse peaks of other nucleosomes. For example, a nucleosome that is supported by relatively strong aligned-read signals may be adjacent to nucleosomes that have weaker signals. In such a case PING might fit two forward mixture components and one reverse mixture component, which will cause a mismatch of forward/reverse peaks of other nucleosomes. We can detect such a problem by screening for abnormal estimated $\delta$ values. As another example, a few adjacent nucleosomes with weak signals might be described by one wide forward/reverse peak. To detect such problems, we can screen for two large estimations of $\sigma_f$ or $\sigma_r$. When we detect candidate regions with such problems, we reanalyze them with stronger prior on $\delta$, $\sigma_f$, or $\sigma_r$ to penalize atypical estimations. When initial segmentation returns too long a candidate region, recursive cutting is automatically applied to split the regions into smaller ones. However, such splitting can result in two nucleosome predictions that are too close to each other on each side of the cutting boundary. Such cases may be predictions of the same nucleosome in the adjacent candidate regions. To address this, we merge their reads and refit the model, and let the model selection method decide whether these two predictions are the same nucleosome. We post-process results with the following command: <>= PS=postPING(ping, seg) @ The result output $PS$ is a dataframe that contains estimated parameters of each nucleosome, users can use write.table command to export the selected columns of the result. %\section{Result output} % %To facilitate data processing and data output, we make use of the \texttt{IRanges} package to summarize our results as `RangedData' objects. The resulting `RangedData' objects can then be analyzed with the \texttt{IRanges} package and/or exported to bed/wig files with the \texttt{rtracklayer} package. In each case, we can make use of filters noisy regions and or regions with too low of score. %In particular the \texttt{picsFDR} function can be used to set the score threshold leading to the desired `FDR'. % %\subsection{Writing enriched regions to BED files} % %<>= %myFilter=list(delta=c(50,300),se=c(0,50),sigmaSqF=c(0,22500),sigmaSqR=c(0,22500)) %rdBed<-makeRangedDataOutput(ping,type="bed",filter=c(myFilter,list(score=c(0,Inf)))) %@ %The following command will create the appropriate bed files, and because it requires the \texttt{rtracklayer} package, we do not run it here but simply displayed for your purpose. %<>= %library(rtracklayer) %export(rdBed,"myfile.bed") %@ % %% writing WIG files have bugs, error message complain about overlaped ranges %\subsection{Writing density scores to WIG files} %The following command will create the appropriate `RangedData' and corresponding wig file, and because it requires the \texttt{rtracklayer} package, we do not run it here but simply displayed for your purpose. % %<>= %rdBed<-makeRangedDataOutput(ping,type="wig",filter=c(myFilter,list(score=c(0,Inf)))) %export(rdBed,"myfile.wig") %@ <>= sfStop() @ \end{document}