%\VignetteIndexEntry{DNaseR Vignette} %\VignetteDepends{} %\VignetteKeywords{Visualization,ChIPseq,Transcription,Genetics} %\VignettePackage{DNaseR} \documentclass{article} \usepackage[utf8]{inputenc} <>= BiocStyle::latex() @ \begin{document} \title{\Biocpkg{DNaseR}: DNase I footprinting analysis of DNase-seq data} \date{Created: February, 2013. Last modified: January, 2014. Compiled: \today} \author{Pedro Madrigal} \maketitle \begin{small} \begin{flushleft} $^{1}$ Department of Biometry and Bioinformatics, Institute of Plant Genetics, Polish Academy of Sciences, Poznan, Poland \\ $^{2}$ \underline{Present address}: Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK \\ $^{3}$ \underline{Present address}: Department of Surgery, University of Cambridge, UK \end{flushleft} \end{small} \tableofcontents <>= options(width=80) options(continue=" ") options(prompt="R> ") @ \section{Introduction to Digital Genomic Footprinting} The combination of DNase I digestion and high-throughput sequencing (DNase-seq) has been used recently to map chromatin accessibility in a given tissue or cell type on a genome-wide scale (Song and Crawford, 2010). In addition to DNase I hypersensitive sites (DHSs), short regions of protected nucleotides known as footprints can be detected using a technique known as "digital genomic footprinting" (DGF). These methodology can potentially indicate the location of transcription factor binding ocuppancy events at a nucleotide resolution (Neph et al., 2012). However, available software for DGF analysis is still at a very immature state (Madrigal and Krajewski, 2012).\\ \Biocpkg{DNaseR} is an R package that aims to identify protein binding footprints in 'double-hit' DNase I hypersensitive sites sequencing (DNase-seq) data provided in BAM standard aligment format. It relies on the cumulative function of the Skellam distribution (correlation of two Poisson distributions) to detect narrow-depleted regions of read-enrichment formed by the mapped reads in the forward and reversed DNA strands. Studying the imbalance of DNase I cuts separately at both DNA strands is of great help in the detection of reliable protein-binding footprints, as demostrated by the Wellington algorithm (Piper et al., 2013), which uses the binomial cumulative distribution function for that purpose. As in Wellington, \Biocpkg{DNaseR}'s main characteristic consists in that consensus DNA sequences (motifs) are not required \emph{a priori} to detect footprints. \\ Any BAM file storing aligned reads coming from a DNase-seq experiment is suitable for footprinting analysis, but the ones more deeply sequenced will retrieve a higher number of significant footprints at a fixed $p$-value cutoff. \section{Methodology} \subsection{Brief summary} DNase I cuts (\Robject{5}' end of the mapped reads) counts are calculated separately for both DNA strands from the alignment files in BAM format using the Bioconductor package \Biocpkg{Rsamtools}. Using the Skellam distribution (Skellam, 1946), \Biocpkg{DNaseR} models at each nucleotide position the discrete signed difference of two Poisson counts at forward and reverse strands, respectively. Then, detecting nearby located significant count differences of opposed sign at both strands (in the direction \Robject{5}' to \Robject{3}') allows \Biocpkg{DNaseR} to delimit the flanks of the footprint location at a base-pair resolution. A one-sided $p$-value is obtained for each flank using the complementary cumulative Skellam distribution function. To control for multiple testing the $p$-values delimiting each flank of the footprint (\Rcode{pval.forward} and \Rcode{pval.reverse}) are corrected using Benjamini-Hochberg procedure (default). A final $p$-value for each footprint (default cut-off $1e-9$) is reported as the sum of the two adjusted $p$-values. \subsection{Difference of two Poisson variables: The Skellam distribution} Footprints from a DGF DNase-seq assay are retrieved using the probability density function of the Skellam distribution (Skellam, 1948), on the DNase I cuts mapped at forward and reverse strands. It has been shown that at the flanking edges of a protein-DNA footprint in DNase-seq data, the difference between DNase I cuts (read-start sites) is higher than on loci not occupied by a TF (Neph et al., 2012; Piper et al., 2013). \Biocpkg{DNaseR} uses the Skellam distribution to model DNase I cuts differences around the footprints as a cross-correlation of two Poisson distributions. We model the count difference $n_{1} - n_{2}$ of two statistically independent random variables $N_{1}$ (DNase I cuts in + strand) and $N_{2}$ (DNase I cuts in - strand), each one having Poisson distribution with different (but almost equal in practice, for big samples) expected values $\mu_{1}$ and $\mu_{2}$. This is done under the assumption that DNase I enzyme cleaves each strand of DNA independently, and cleavage sites are random. The probability mass function for the Skellam distribution for a count difference $k = n_{1} - n_{2}$ from two Poisson distributed variables with means $\mu_{1}$ and $\mu_{2}$ is given by: \begin{equation} f(k;\mu_{1},\mu_{2})=e^{-(\mu_{1}+\mu_{2})} \left( \frac{\mu_{1}}{\mu_{2}} \right) I_{k}(2\sqrt{\mu_{1} \mu_{2}}) \end{equation} Where $I_{k} (z)$ is the modified Bessel function of the first kind, \begin{equation} I_{k}(z)={ \left( \frac{z}{2} \right) }^k \sum_{j=0}^{\infty} \frac{{\left( \frac{z^2}{4} \right) }^j}{j!\Gamma (k+j+1)} \end{equation} Where $\Gamma(a)$ is the gamma function. Then, a footprint ranging between a minimum $\eta_{min}$ (bp) and a maximum $\eta_{max}$ (bp.) width is reported when two consecutive statistically significant events are encountered, and only in the cases where the count difference is significantly positive in the direction \Robject{5}’ to \Robject{3}’ ($n_{1} - n_{2} >0$), and significantly negative from \Robject{3}’ to \Robject{5}’ ($n_{1} - n_{2} <0$). One side $p$-values, one for each strand, are calculated for each footprint flank using the complementary cumulative Skellam distribution. Benjamini-Hochberg correction for multiple testing was used to adjust the $p$-values (Benjamini and Hochberg, 1995). As a conservative estimate, \Biocpkg{DNaseR} sums the $p$-values calculated at each strand for each footprint event, but other ways to combine $p$-values, as the Fisher’s method are under study. \section{Examples} To test \Biocpkg{DNaseR}, we downloaded the DNase-seq data files \file{wgEncodeUwDgfTh1Aln.bam} and \file{wgEncodeUwDgfTh1Aln.bam.bai} from the ENCODE Project (Neph et al., 2012) [dataType=DnaseDgf; view=Alignments; cell=Th1; origAssembly=hg18; geoSampleAccession=GSM646569; labVersion=Bowtie 0.12.5; type=bam]. We have selected the reads in the first 3000Kb of chrY, and run the DGF analysis in by using only one execution of the function \Rfunction{footprints} (see the manual of \Biocpkg{DNaseR}): \begin{small} <>= options(width=80) ## hg18. chrY:1 - 3000Kb reads from DNase-seq dataset wgEncodeUwDgfTh1Aln.bam ## from the ENCODE Project. ## ## Downloaded from: ## http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwDgf/ ## release1/wgEncodeUwDgfTh1Aln.bam owd <- setwd(tempdir()) library(DNaseR) bamfile <- "chrY_3Kb_wgEncodeUwDgfTh1Aln.bam" f <- system.file("extdata", bamfile, package="DNaseR",mustWork = TRUE) dgf <- footprints(bam=f, chrN="chrY", chrL=3e6, p=1e-9, width=c(6,40), N=2e6) head(dgf$footprint.events) nrow(dgf$footprint.events) setwd(owd) @ \end{small} 32 protein-binding footprints are reported spannig a width range of 6bp-40bp in the first 3000Kb of chrY for this dataset at $p$-value $\leq 1e-9$.\\ If we increase the $p$-value ($\leq 1e-7$) we get a higher number of footprints (40): \begin{small} <>= options(width=80) owd <- setwd(tempdir()) library(DNaseR) bamfile <- "chrY_3Kb_wgEncodeUwDgfTh1Aln.bam" f <- system.file("extdata", bamfile, package="DNaseR",mustWork = TRUE) dgf <- footprints(bam=f, chrN="chrY", chrL=3e6, p=1e-7, width=c(6,40), N=2e6) head(dgf$footprint.events) nrow(dgf$footprint.events) setwd(owd) @ \end{small} For several reasons one might be interested only in footprints of a certain size. For example, to report only 15bp width footprints ($p \leq 1e-9$) we can do: \begin{small} <>= options(width=80) owd <- setwd(tempdir()) library(DNaseR) bamfile <- "chrY_3Kb_wgEncodeUwDgfTh1Aln.bam" f <- system.file("extdata", bamfile, package="DNaseR",mustWork = TRUE) dgf <- footprints(bam=f, chrN="chrY", chrL=3e6, p=1e-9, width=c(15,15), N=2e6) head(dgf$footprint.events) nrow(dgf$footprint.events) setwd(owd) @ \end{small} However, it is recommended to be flexible in the max. and min. width during footprint search, as transcription factors are not expected to bind forming unique footprint configurations, nor to totally overlap highly-scored matches of their consensus sequence motifs. \section{FAQ} \begin{itemize} \item \textbf{How replicates are treated in DNaseR?} The package analyses just one sample. If you have several biological replicates you might like to consider as a result only the consensus list of footprints, or submitting footprint scores (\Rcode{log10.pval.footprint.event}) for Irreproducible Discovery Rate analysis (Li et al., 2011), which is an accepted practice in RNA-seq and ChIP-seq (Bailey et al., 2013). \item \textbf{Can I get False Discovery Rate for the list of footprints reported?} See the Bioconductor package \Biocpkg{qvalue}. \end{itemize} \section{References} \begin{itemize} \item Bailey T, Krajewski P, Ladunga I, Lefebvre C, Li Q, Liu T, Madrigal P, Taslim C, Zhang J (2013) Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. \textbf{PLoS Comput Biol} 9(11): e1003326. \item Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. \textbf{J R Stat Soc Ser B} 57: 289-300. \item Li Q, Brown JB, Huang H, Bickel PJ (2011) Measuring reproducibility of high-throughput experiments. \textbf{Ann Appl Stat} 5(3): 1699-2264. \item Madrigal P, Krajewski P (2012) Current bioinformatic approaches to identify DNase I hypersensitive sites and genomic footprints from DNase-seq data. \textbf{Front Genet} 3: 230. \item Neph S, Vierstra J, Stergachis AB, Reynolds AP, Haugen E, Vernot B, Thurman RE, John S, Sandstrom R, Johnson AK, Maurano MT, Humbert R, Rynes E, Wang H, Vong S, Lee K, Bates D, Diegel M, Roach V, Dunn D, Neri J, Schafer A, Hansen RS, Kutyavin T, Giste E, Weaver M, Canfield T, Sabo P, Zhang M, Balasundaram G, Byron R, MacCoss MJ, Akey JM, Bender MA, Groudine M, Kaul R, Stamatoyannopoulos JA (2012) An expansive human regulatory lexicon encoded in transcription factor footprints. \textbf{Nature} 489: 83-90. \item Piper J, Elze MC, Cauchy P, Cockerill PN, Bonifer C, Ott S (2013) Wellington: a novel method for the accurate identification of digital genomic footprints from DNase-seq data. \textbf{Nucleic Acids Res}, in press. \item Skellam JG (1946) The frequency distribution of the difference between two Poisson variates belonging to different populations. \textbf{J R Stat Soc Ser A} 109: 296. \item Song L, Crawford GE (2010) DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. \textbf{Cold Spring Harb Protoc} 2:pdb.prot5384. \end{itemize} \section{Details} This document was written using: \begin{small} <>= sessionInfo() @ \end{small} % \begin{figure*}[t] % \centering % \includegraphics[width=1.0\textwidth]{dalliance-view.pdf} % \caption{Footprint found at chrY: 2781394-2781409 (\Rcode{pval.footprint.event} = 1.705822e-18). Figure obtained using the lightweight genome visualization tool Dalliance: http://www.biodalliance.org/.} % \end{figure*} \end{document}