%\VignetteIndexEntry{ CexoR Vignette } %\VignetteDepends{} %\VignetteKeywords{Visualization,ChIPseq,Transcription,Genetics,Coverage,ChIPSeq} %\VignettePackage{CexoR} \documentclass{article} <<style, eval=TRUE, echo=FALSE, results=tex>>= BiocStyle::latex() @ \begin{document} \title{\Biocpkg{ CexoR }: An R package to uncover high-resolution protein-DNA interactions in ChIP-exo replicates} \date{Last Modified: October, 2021. Compiled: \today} %\author{Pedro Madrigal\footnote{ pmadrigal@ebi.ac.uk }} \author{Pedro Madrigal} \maketitle \begin{center} EMBL-EBI, UK\\ \end{center} \tableofcontents <<setup, echo=FALSE, results=hide>>= options(width=200) options(continue=" ") options(prompt="R> ") @ \section{Citation} Madrigal P (2015) CexoR: an R/Bioconductor package to uncover high-resolution protein-DNA interactions in ChIP-exo replicates. \textbf{EMBnet.journal} 21, e837. \href{http://dx.doi.org/10.14806/ej.21.0.837}{http://dx.doi.org/10.14806/ej.21.0.837}. \section{Introduction} For its unprecedented level of resolution, chromatin immunoprecipitation combined with lambda exonuclease digestion followed by sequencing (ChIP-exo) is a potential candidate to replace ChIP-seq as the standard approach for high-confidence mapping of protein-DNA interactions. Numerous algorithms have been developed for peak calling in ChIP-seq data. However, adjusting the statistical models to ChIP-exo making use of its strand-specificity can improve the identification of protein-DNA binding sites. The midpoint between the strand-specific paired peaks formed at its forward and reverse strands is delimited by the exonuclease stop sites, within the protein binding event is located (Rhee and Pugh, 2011). \section{Methodology} Lambda exonuclease stop site (\Robject{5}' end of the reads) counts are calculated separately for both DNA strands from the alignment files in BAM format using the Bioconductor \Biocpkg{Rsamtools}. Counts are then normalized using linear scaling to the same sample depth of the smaller dataset. Using the Skellam distribution (Skellam, 1946), \Biocpkg{CexoR} models at each nucleotide position the discrete signed difference of two Poisson counts at forward and reverse strands, respectively. Then, detecting nearby located significant count differences of opposed sign (peak-pairs) at both strands allows \Biocpkg{CexoR} to delimit the flanks of the protein binding event location at base pair resolution. A one-sided $p$-value is obtained for each peak using the complementary cumulative Skellam distribution function, and a final $p$-value for the peak-pair (default cut-off $1e-12$) is reported as the sum of the two $p$-values. To account for the reproducibility of replicated peak-pairs, which central point must be located at a user-defined maximum distance, $p$-values are submitted for irreproducible discovery rate estimation (Li et al., 2011). Stouffer's and Fisher's combined p-values are given for the final peak-pair calls. Finally, BED files containing reproducible binding event locations formed within peak-pairs are reported, as well as their midpoints. More information can be found in Madrigal (2015). \section{Example} We downloaded the 3 replicates of human CTCF ChIP-exo data from GEO (SRA044886) (Rhee and Pugh, 2011), and aligned the reads to the human reference genome (hg19) using Bowtie 1.0.0. Reads not mapping uniquely were discarded. We can search reproducible binding events between peak-pairs in the first million bp of Chr2 in the 3 biological replicates by: \begin{small} <<Call reproducible peak-pairs in ChIP-exo replicates, eval=TRUE>>= options(width=60) ## hg19. chr2:1-1,000,000 owd <- setwd(tempdir()) library(CexoR) rep1 <- "CTCF_rep1_chr2_1-1e6.bam" rep2 <- "CTCF_rep2_chr2_1-1e6.bam" rep3 <- "CTCF_rep3_chr2_1-1e6.bam" r1 <- system.file("extdata", rep1, package="CexoR",mustWork = TRUE) r2 <- system.file("extdata", rep2, package="CexoR",mustWork = TRUE) r3 <- system.file("extdata", rep3, package="CexoR",mustWork = TRUE) peak_pairs <- cexor(bam=c(r1,r2,r3), chrN="chr2", chrL=1e6, idr=0.01, N=3e4, p=1e-12) peak_pairs$bindingEvents peak_pairs$bindingCentres setwd(owd) @ \end{small} 13 reproducible peak-pair events are reported for the established thresholds ($p$-value $\leq 1e-12$, IDR $\leq 0.01$). We can now plot the mean profile of lambda exonuclease stop sites and reads, 500 bp around the central position of reproducible peak-pair locations, by running the function "plotcexor": \\ \begin{small} <<Visualisation of ChIP-exo replicates, fig.width=4.5, fig.height=9, eval=FALSE>>= options(width=60) plotcexor(bam=c(r1,r2,r3), peaks=peak_pairs, EXT=500) @ \end{small} \textbf{Important notes}: \begin{itemize} \item For the correct estimation of the IDR (Li et al., 2011) peak-pair calling should be relaxed (e.g., $p$-value=1e-3, or smaller depending on the sequencing depth), enabling the noise component be present in the data and therefore allowing the peak-pairs to be separated into a reproducible and an irreproducible groups. In the example shown above, as the dataset is very small and peaks are highly reproducible, IDR in the overlapped peak-pairs across the 3 replicates is zero. \item IDR calculation could produce varying results depending on the choice of initial estimates for four parameters needed by the algorithm (mu, sigma, rho, prop). Li et al. (2011) recommend trying several choices, so that the parameter estimation does not get trapped in a local maximum. \item For more information about using IDR in high-throughput sequencing datasets see Land et al. (2012) and Bailey et al. (2013), or the mathematical description in Li et al. (2011). \end{itemize} \section{References} \begin{itemize} \item Madrigal P (2015) CexoR: an R/Bioconductor package to uncover high-resolution protein-DNA interactions in ChIP-exo replicates. \textbf{EMBnet.journal} 21: e837. \item Bailey TL, et al. (2013). Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. \textbf{PLoS Comput Biol} 9: e1003326. \item Landt SG, et al. (2012). ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. \textbf{Genome Res} 22: 1813-1831. \item Skellam JG (1946) The frequency distribution of the difference between two Poisson variates belonging to different populations. \textbf{J R Stat Soc Ser A} 109: 296. \item Li Q, Brown J, Huang H, Bickel P (2011) Measuring reproducibility of high-throughput experiments. \textbf{Ann Appl Stat} 5: 1752-1779. \item Rhee HS, Pugh BF (2011) Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. \textbf{Cell} 147: 1408-1419. \end{itemize} \section{References} The author wishes to acknowledge Christoph Sebastian Borlin, Md. Abul Hassan Samee, Vivek Rexwal and Sudhir Jadhao for their feedback. \section{Details} This document was written using: \begin{small} <<sessionInfo>>= sessionInfo() @ \end{small} \end{document}