%\VignetteIndexEntry{DNaseR Vignette}
%\VignetteDepends{DNaseR}
%\VignetteKeywords{Visualization,ChIPseq,Transcription,Genetics}
%\VignettePackage{DNaseR}
\documentclass{article}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\DNaseR}{\Rpackage{DNaseR}}
\newcommand{\Rsamtools}{\Rpackage{Rsamtools}}
\newcommand{\footprints}{\Rfunction{footprints}}


\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{natbib}


\begin{document}

\title{{ \DNaseR }: DNase I footprinting analysis of DNase-seq data}
\date{February, 2013}
\author{Pedro Madrigal$^{1,2,}$\footnote{pm@engineering.com}}
\maketitle

\begin{small}
\begin{flushleft}
$^{1}$ Department of Biometry and Bioinformatics, Institute of Plant Genetics, Polish Academy of Sciences, Poznan, Poland \\
$^{2}$ \underline{Present address}: Wellcome Trust Sanger Institute, Cambridge, United Kingdom
\end{flushleft}
\end{small}

<<setup, echo=FALSE, results=hide>>=
options(width=80)
options(continue=" ")
options(prompt="R> ")
@

\section{Introduction}

The combination of DNase I digestion and high-throughput sequencing (DNase-seq) has been used recently to map chromatin accessibility in a given tissue or cell type on a genome-wide scale (Song and Crawford, 2010). In addition to DNase I hypersensitive sites (DHSs), short regions of protected nucleotides known as footprints can be detected using a technique known as "digital genomic footprinting" (DGF). These analyses potentially indicate the location of transcription factor binding ocuppancy events (Neph et al., 2012). However, available software for DGF analysis is still at a very immature state (Madrigal and Krajewski, 2012).\\

{\DNaseR} is an R package that aims to identify protein binding footprints in DNase I hypersensitive sites sequencing (DNase-seq) data provided in BAM standard aligment format. It relies on the cumulative function of the Skellam distribution (correlation of two Poisson distributions) to detect narrow-depleted regions of read-enrichment formed by the mapped reads in the forward and reversed DNA strands. Study the imbalance of DNase I cuts separately at both DNA strands is of great help in the detection of reliable protein-binding footprints, as proved by the Wellington algorithm (Piper et al., 2013), which uses the binomial cumulative distribution function for that purpose. As in Wellington, DNaseR's main characteristic consists in that consensus DNA sequences (motifs) search is not required \emph{a priori} to detect footprints. \\

Any BAM file storing aligned reads coming from a DNase-seq experiment is suitable for footprinting analysis, but the ones more deeply sequenced will retrieve a higher number of significant footprints at a fixed $p$-value cutoff.


\section{Methodology}
DNase I cuts (\Robject{5}' end of the mapped reads) counts are calculated separately for both DNA strands from the alignment files in BAM format using the Bioconductor package {\Rsamtools}. Using the Skellam distribution (Skellam, 1946), {\DNaseR} models at each nucleotide position the discrete signed difference of two Poisson counts at forward and reverse strands, respectively. Then, detecting nearby located significant count differences of opposed sign at both strands (in the direction \Robject{5}' to \Robject{3}') allows {\DNaseR} to delimit the flanks of the footprint location at base pair resolution. A one-sided $p$-value is obtained for each flank using the complementary cumulative Skellam distribution function. To control for multiple testing the $p$-values delimiting each flank of the footprint (pval.forward and pval.reverse) are corrected using Benjamini-Hochberg procedure (default).
A final $p$-value for each footprint (default cut-off $1e-9$) is reported as the sum of the two adjusted $p$-values. 


\section{Examples}
To test {\DNaseR}, we downloaded the DNase-seq data files wgEncodeUwDgfTh1Aln.bam and wgEncodeUwDgfTh1Aln.bam.bai from the ENCODE Project (Neph et al., 2012) [dataType=DnaseDgf; view=Alignments; cell=Th1; origAssembly=hg18; geoSampleAccession=GSM646569; labVersion=Bowtie 0.12.5; type=bam].
We have selected the reads in the first 3000Kb of chrY, and run the digital genomic footprinting analysis in {\DNaseR} by using only one execution of the function {\footprints} (see the manual):

\begin{scriptsize}
<<DNase I footprinting analysis of DNase-seq data in Bioconductor. Example 1>>=
options(width=80)

## hg18. chrY:1 - 3000Kb reads from DNase-seq dataset wgEncodeUwDgfTh1Aln.bam
## from the ENCODE Project.
##
## Downloaded from:
## http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwDgf/
## release1/wgEncodeUwDgfTh1Aln.bam

owd <- setwd(tempdir())

library(DNaseR)

bamfile <- "chrY_3Kb_wgEncodeUwDgfTh1Aln.bam"

f <- system.file("extdata", bamfile, package="DNaseR",mustWork = TRUE)


dgf <- footprints(bam=f, chrN="chrY", chrL=3e6, p=1e-9, width=c(6,40), N=2e6)


head(dgf$footprint.events)

nrow(dgf$footprint.events)

setwd(owd)
@
\end{scriptsize}

32 protein-binding footprints are reported spannig a width range of 6bp-40bp in the first 3000Kb of chrY for this dataset at $p$-value $\leq 1e-9$.\\
If we increase the $p$-value ($\leq 1e-7$) we get a higher number of footprints (40):

\begin{scriptsize}
<<DNase I footprinting analysis of DNase-seq data in Bioconductor. Example 2>>=
options(width=80)
owd <- setwd(tempdir())
library(DNaseR)
bamfile <- "chrY_3Kb_wgEncodeUwDgfTh1Aln.bam"
f <- system.file("extdata", bamfile, package="DNaseR",mustWork = TRUE)
dgf <- footprints(bam=f, chrN="chrY", chrL=3e6, p=1e-7, width=c(6,40), N=2e6)
head(dgf$footprint.events)
nrow(dgf$footprint.events)
setwd(owd)
@
\end{scriptsize}

For several reasons one might be interested only in footprints of a certain size. For example, to report only 15bp width footprints ($p \leq 1e-9$) we can do:

\begin{scriptsize}
<<DNase I footprinting analysis of DNase-seq data in Bioconductor. Example 3>>=
options(width=80)
owd <- setwd(tempdir())
library(DNaseR)
bamfile <- "chrY_3Kb_wgEncodeUwDgfTh1Aln.bam"
f <- system.file("extdata", bamfile, package="DNaseR",mustWork = TRUE)
dgf <- footprints(bam=f, chrN="chrY", chrL=3e6, p=1e-9, width=c(15,15), N=2e6)
head(dgf$footprint.events)
nrow(dgf$footprint.events)
setwd(owd)
@
\end{scriptsize}

However, it is recommended to be flexible in the max. and min. width during footprint search, as transcription factors are not expected to bind forming unique footprint configurations, nor to totally overlap their consensus sequence motifs.


\section{References}
\begin{itemize}

\item Madrigal P, Krajewski P (2012) Current bioinformatic approaches to identify DNase I hypersensitive sites and genomic footprints from DNase-seq data. \textbf{Front Genet} 3: 230.

\item Neph S, Vierstra J, Stergachis AB, Reynolds AP, Haugen E, Vernot B, Thurman RE, John S, Sandstrom R, Johnson AK, Maurano MT, Humbert R, Rynes E, Wang H, Vong S, Lee K, Bates D, Diegel M, Roach V, Dunn D, Neri J, Schafer A, Hansen RS, Kutyavin T, Giste E, Weaver M, Canfield T, Sabo P, Zhang M, Balasundaram G, Byron R, MacCoss MJ, Akey JM, Bender MA, Groudine M, Kaul R, Stamatoyannopoulos JA (2012) An expansive human regulatory lexicon encoded in transcription factor footprints. \textbf{Nature} 489: 83-90.  

\item Piper J, Elze MC, Cauchy P, Cockerill PN, Bonifer C, Ott S (2013) Wellington: a novel method for the accurate identification of digital genomic footprints from DNase-seq data. \textbf{Nucleic Acids Res}, in press.

\item Skellam JG (1946) The frequency distribution of the difference between two Poisson variates belonging to different populations. \textbf{J R Stat Soc Ser A} 109: 296. 

\item Song L, Crawford GE (2010) DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. \textbf{Cold Spring Harb Protoc} 2:pdb.prot5384. 

\end{itemize}

\section{Details}

This document was written using:

<<sessionInfo>>=
sessionInfo()
@ 

\end{document}