%\VignetteIndexEntry{DiffBind: Differential binding analysis of ChIP-Seq peak data}
%\VignettePackage{DiffBind}
%\VignetteEngine{utils::Sweave}

\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@ 

\usepackage{graphicx}

\newcommand{\DBA}{\Biocpkg{DiffBind}~}
\newcommand{\DiffBind}{\Biocpkg{DiffBind}~}
\newcommand{\edgeR}{\Biocpkg{edgeR}~}
\newcommand{\DESeq}{\Biocpkg{DESeq2}~}

\newcommand{\reft}[1]{Table \ref{tab:#1}}
\newcommand{\reff}[1]{Figure \ref{fig:#1}}
\newcommand{\refsf}[1]{Figure \ref{fig:#1}}
\newcommand{\code}[1]{{\Rcode{#1}}}

\begin{document}
\SweaveOpts{concordance=TRUE}

\newcommand{\exitem}[3]{\item \Rcode{\textbackslash#1\{#2\}} #3 \csname#1\endcsname{#2}.}

\title{\Biocpkg{DiffBind}: Differential binding analysis of ChIP-Seq peak data}

\author{
Rory Stark\thanks{\email{rory.stark@cruk.cam.ac.uk}} and
Gord Brown\\
Cancer Research UK - Cambridge Institute\\
University of Cambridge
}

\date{Edited: 4 October 2022; Compiled: \today}

\maketitle

\tableofcontents

\section{Introduction}

This document offers an introduction and overview of the
\R~Bioconductor package \DBA, which
provides functions for processing DNA data enriched for genomic loci,
including ChIP-seq data enriched for sites
where specific protein/DNA binding occurs, 
or histone marks are enriched, as well as open-chromatin
assays such as ATAC-seq.

It is designed to work with aligned sequence reads as
well as lists of enriched loci identified by a peak caller.
The tool is optimized to work with multiple peak sets simultaneously, 
representing different ChIP experiments (antibodies, transcription factor 
and/or histone marks, experimental conditions, replicates) 
as well as managing the results of multiple peak callers. 

The primary emphasis of the package is on identifying sites that are 
differentially bound between sample groups. 
It includes functions to support the processing of peak sets, 
including overlapping and merging peak sets, 
counting sequencing reads overlapping intervals in peak sets, 
and identifying statistically significantly differentially bound sites
based on evidence of binding affinity 
(measured by differences in read densities). 
To this end it uses statistical routines developed in an RNA-Seq context
(primarily the Bioconductor packages 
\edgeR and \DESeq). 
Additionally, the package builds on \R 
graphics routines to provide a set of standardized plots to aid in 
binding analysis.

This guide includes a brief overview of the processing flow, 
followed by several sections containing examples and discussion of
more advanced analytic options. 
The first example focuses on the core task of obtaining differentially bound sites 
based on affinity data, while
the second demonstrates the main plotting routines.

This is followed by discussions of multi-factor designs,
blacklists/greylists, and normalization.

The final example revisits occupancy data (peak calls) in more detail, 
comparing the results of an occupancy-based analysis with an 
affinity-based one. 

The last portions of this document detail certain technical aspects of
how these analyses are accomplished.

\section{Processing overview}

\DBA works primarily with peaksets, which are sets of 
genomic intervals representing candidate protein binding sites. 
Each interval consists of a chromosome, a start and end position, 
and usually a score of some type 
indicating confidence in, or strength of, the peak. 
Associated with each peakset are metadata relating to the experiment 
from which the peakset was derived. 
Additionally, files containing mapped sequencing reads (generally .bam files)
can be associated with each peakset 
(one for the ChIP data, and optionally another representing a control sample).

Generally, processing data with \DBA involves five phases:
\begin{enumerate}

\item
\Rcode{Reading in peaksets}:
The first step is to read in a set of peaksets and associated metadata. 
Peaksets are derived either from ChIP-Seq peak callers, 
such as MACS (\cite{zhang2008model}), 
or using some other criterion 
(e.g. genomic windows, or all the promoter regions in a genome). 
The easiest way to read in peaksets is using a comma-separated value (csv) 
\emph{sample sheet} with one line for each peakset.
(Spreadsheets in Excel\textsuperscript{\textregistered} format, 
with a \code{.xls} or \code{.xlsx} suffix, are also accepted.)  
An individual sample can have more than one associated peakset; e.g. 
if multiple peak callers are used for comparison purposes each sample 
would have more than one line in the sample sheet. 

\item
\Rcode{Occupancy analysis}:
Peaksets, especially those generated by peak callers, 
provide an insight into the potential \emph{occupancy} 
of the protein being ChIPed for at specific genomic loci.  
After the peaksets have been loaded, it can be useful to perform 
some exploratory plotting to determine how 
these occupancy maps agree with each other, 
e.g. between experimental replicates 
(re-doing the ChIP under the same conditions), 
between different peak callers on the same experiment, 
and within groups of samples representing a common experimental condition. 
\DBA provides functions to enable overlaps to be examined, 
as well as functions to determine how well 
similar samples cluster together. 
In addition, peaks may be filtered based on
published \textbf{blacklists} of regions known to be problematic,
as well as custom \textbf{greylists} derived from
control tracks specific to the experiment (see Section~\ref{sec:blacklists}).
Beyond quality control, the product of an occupancy analysis may be a
\emph{consensus peakset}, 
representing an overall set of candidate binding sites to be used 
in further analysis.

\item
\Rcode{Counting reads}:
Once a consensus peakset has been derived, 
\DBA can use the supplied sequence read files to count 
how many reads overlap each interval for each unique sample.
By default, the peaks in the consensus peakset are 
re-centered and trimmed based on 
calculating their summits (point of greatest read overlap) 
in order to provide more standardized peak intervals.
The final result of counting is a \emph{binding affinity matrix} containing a 
read count for each sample at every consensus binding site,
whether or not it was identified as a peak in that sample.
With this matrix, the samples can be re-clustered using affinity, 
rather than occupancy, data. 
The binding affinity matrix is used for QC plotting as well as for subsequent 
differential analysis.

\item
\Rcode{Differential binding affinity analysis}:
The core functionality of \DBA is the differential binding affinity analysis, 
which enables binding sites to be identified that are 
significantly differentially bound between sample groups. 
This step includes normalizing the experimental data and establishing a 
model design and a contrast (or contrasts).
Next the underlying core analysis routines are executed, by default using \DESeq. 
This will assign a p-value and FDR to each candidate binding site 
indicating confidence that it is differentially bound.

\item
\Rcode{Plotting and reporting}:
Once one or more contrasts have been run,
\DBA provides a number of functions for reporting and plotting the results. 
MA and volcano plots give an overview of the results of the analysis, 
while correlation heatmaps and PCA plots 
show how the groups cluster based on differentially bound sites. 
Boxplots show the distribution of reads within differentially bound sites 
corresponding to whether 
they gain or lose affinity between the two sample groups. 
A reporting mechanism enables differentially bound sites to be 
extracted for further processing, 
such as annotation, motif, and pathway analyses.

\end{enumerate}

\section{Example: Obtaining differentially bound sites}

This section offers a quick example of how to use 
\DBA to identify significantly differentially bound sites 
using affinity (read count) data.

The dataset for this example consists of ChIPs against the 
transcription factor ERa using 
five breast cancer cell lines~\cite{ross2012differential}. 
Three of these cell lines are responsive to tamoxifen treatment, 
while two others are resistant to tamoxifen. 
There are at least two replicates for each of the cell lines, 
with one cell line having three replicates,
for a total of eleven sequenced libraries. 
Of the five cell lines, two are based on MCF7 cells: 
the standard MCF7 tamoxifen responsive line, 
and MCF7 cells specially treated with tamoxifen 
until a tamoxifen resistant version of the cell line is obtained.
For each sample, there is an associated peakset derived 
using the MACS peak caller~\cite{zhang2008model}, 
for a total of eleven peaksets. 

To save time and space in the package, 
only data for chromosome 18 is used for the vignette. 
The metadata and peak data\footnote{Note that 
due to space limitations the reads 
are not shipped with the package. 
See Section~\ref{sec:vignette_data} for options to obtain the full dataset. 
It is highly recommended that you obtain the 500M dataset to
work through this vignette.} 
are available in the \Rcode{extra} 
subdirectory of the \DBA package directory;
you can make this your working directory by entering:

<<eval=TRUE,echo=FALSE,results=hide>>=
tmp <-  tempfile(as.character(Sys.getpid()))
pdf(tmp)
savewarn <- options("warn")
options(warn=-1)
@


<<eval=TRUE,echo=TRUE,results=verbatim>>=
library(DiffBind)
@
<<eval=FALSE,echo=TRUE,results=hide>>=
setwd(system.file('extra',package='DiffBind'))
@

If you have downloaded the vignette data, you can
set the current working directory to where it is located.
Alternatively, the following code will download the data into a
temporary directory in which you can run the vignette:

<<eval=FALSE,echo=TRUE,results=verbatim>>=
tmpdir <- tempdir()
url <- 'https://content.cruk.cam.ac.uk/bioinformatics/software/DiffBind/DiffBind_vignette_data.tar.gz'
file <- basename(url)
options(timeout=600)
download.file(url, file.path(tmpdir,file))
untar(file.path(tmpdir,file), exdir = tmpdir )
setwd(file.path(tmpdir,"DiffBind_Vignette"))
@

Performing a full differential binding analysis
can be accomplished in a single step based on
the sample sheet:
<<eval=FALSE,echo=TRUE,results=hide>>=
tamoxifen <- dba.analyze("tamoxifen.csv")
@

Obtaining the sites significantly differentially bound (DB) 
between the samples that respond to tamoxifen 
and those that are resistant is equally straightforward:
<<eval=FALSE,echo=TRUE,results=hide>>=
tamoxifen.DB <- dba.report(tamoxifen)
@

The \Rcode{dba.analyze} function simplifies processing if you
want to perform an analysis using only defaults.
However this may not be the optimal (or even correct) analysis,
so it is often necessary to
perform each step separately in order to have greater control
of the analysis.
The default analysis involves six such steps,
as follows:
<<eval=FALSE,echo=TRUE,results=hide>>=
tamoxifen <- dba(sampleSheet="tamoxifen.csv") %>%
  dba.blacklist() %>%
  dba.count()     %>%
  dba.normalize() %>%
  dba.contrast()  %>%
  dba.analyze()
@

Along the way, there are a number of useful plots and reports
that can illuminate characteristics of the data set and guide
subsequent steps.

The following subsections describe the primary analysis steps in more detail.

\subsection{Reading in the peaksets}

The easiest way to set up an experiment to analyze is with a sample sheet. 
The sample sheet can be a \Robject{data.frame}, 
or it can be read directly from a \Rcode{csv} file.
Here is the example sample sheet read into a 
\Rcode{data.frame} from a \Rcode{csv} file:

<<sampSheet, eval=TRUE,echo=TRUE,results=verbatim>>=
samples <- read.csv(file.path(system.file("extra", package="DiffBind"),
                              "tamoxifen.csv"))
names(samples)
samples
@
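
A sample sheet can also be assembled directly as a \Rcode{data.frame}.
The following is a minimal sketch and not part of the example analysis:
the two samples and all file paths are hypothetical placeholders,
and the column names are assumed to match those listed by
\Rcode{names(samples)} above.

<<eval=FALSE,echo=TRUE,results=hide>>=
# Hypothetical two-sample sheet; every file path is a placeholder.
sheet <- data.frame(
  SampleID   = c("A1", "B1"),
  Tissue     = c("CellLineA", "CellLineB"),
  Factor     = "ER",
  Condition  = c("Resistant", "Responsive"),
  Replicate  = 1,
  bamReads   = c("reads/A1.bam", "reads/B1.bam"),
  Peaks      = c("peaks/A1_peaks.bed", "peaks/B1_peaks.bed"),
  PeakCaller = "bed",
  stringsAsFactors = FALSE
)
str(sheet)
@

Provided the referenced files exist, such a \Rcode{data.frame} could be
supplied as the \Rcode{sampleSheet} argument in place of a file name.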

The peaksets are read in using the following \DBA function:

<<dbaConstruct, eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen <- dba(sampleSheet="tamoxifen.csv",
                 dir=system.file("extra", package="DiffBind"))
@

Alternatively, the previously read-in sample sheet could be used directly
to create the \Rclass{DBA object}:

<<dbaConstructDF, eval=FALSE,echo=TRUE,results=verbatim>>=
tamoxifen <- dba(sampleSheet=samples)
@

The result is a \Rclass{DBA object}; 
the metadata associated with this object can be displayed simply as follows:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen
@

This shows how many peaks are in each peakset, as well as (in the first line) 
the total number of unique peaks after merging overlapping ones
(\Sexpr{nrow(tamoxifen$merged)}),
and the dimensions of the default binding matrix of 11 samples by the
\Sexpr{nrow(tamoxifen$binding)}
sites that overlap in at least two of the samples. 

Note: \emph{This \Rcode{DBA object}, \Rcode{tamoxifen}, 
is available for loading using \Rcode{data(tamoxifen\_peaks)}.}

Using the data from the peak calls, 
a correlation heatmap can be generated which gives an initial clustering 
of the samples using the cross-correlations of each row of the binding matrix:

<<tamox_occ_corhm, fig=TRUE, include=FALSE, width=9, height=9 >>=
plot(tamoxifen)
@
\incfig{DiffBind-tamox_occ_corhm}{.66\textwidth}
{Correlation heatmap, using occupancy (peak caller score) data.}
{Generated by: \Rcode{plot(tamoxifen)}; 
can also be generated by: \Rcode{dba.plotHeatmap(tamoxifen)}.}

The resulting plot (Figure~\ref{DiffBind-tamox_occ_corhm}) 
shows that while the replicates for each cell line 
cluster together appropriately, 
the cell lines do not cluster into groups corresponding 
to those that are responsive (MCF7, T47D, and ZR75) 
vs. those resistant (BT474 and MCF7r) to tamoxifen treatment.
It also shows that the two most highly correlated cell lines are the two
MCF7-based ones, even though they respond differently to tamoxifen treatment. 

\subsection{Blacklists and greylists}

Blacklists and greylists are discussed in a subsequent section.
See Section~\ref{sec:blacklists} for more details.

\subsection{Counting reads}
The next step is to calculate a binding matrix with scores 
based on read counts for every sample (affinity scores), 
rather than confidence scores for only those peaks 
called in a specific sample (occupancy scores). 
These reads are obtained using the \Rcode{dba.count} function:\footnote{Note 
that due to space limitations the 
reads are not shipped with the package. 
See Section~\ref{sec:vignette_data} for options to obtain the full dataset.
You can get the end result of the \Rcode{dba.count} call by 
loading the supplied \R 
object by invoking \Rcode{data(tamoxifen\_counts)}.}

<<eval=FALSE,echo=TRUE,results=hide>>=
tamoxifen <- dba.count(tamoxifen) 
@
<<eval=TRUE,echo=FALSE,results=hide>>=
data(tamoxifen_counts)
@

If you do not have the raw reads available to you, this object is available 
for loading using \Rcode{data(tamoxifen\_counts)}.

After the \Rcode{dba.count} call, the \Rclass{DBA object} can be examined:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen
@

This shows that all the samples are using the same consensus peakset,
comprising \Sexpr{nrow(tamoxifen$binding)} sites. 
Also, two new columns have been added. 
The first shows the total number of aligned reads for each sample 
(the ``Full'' library sizes).
The second is labeled \Rcode{FRiP}, 
which stands for \emph{Fraction of Reads in Peaks}. 
This is the proportion of reads for that sample that 
overlap a peak in the consensus peakset, 
and can be used to indicate which samples show more enrichment overall.
For each sample, multiplying the value in the \Rcode{Reads} 
column by the corresponding \Rcode{FRiP}
value will yield the number of reads that overlap a consensus peak.
This can be done using the \Rcode{dba.show} function:

<<eff_lib_size,eval=TRUE,echo=TRUE,results=verbatim>>=
info <- dba.show(tamoxifen)
libsizes <- cbind(LibReads=info$Reads, FRiP=info$FRiP, 
                  PeakReads=round(info$Reads * info$FRiP))
rownames(libsizes) <- info$ID
libsizes
@

As this example is based on a transcription factor that binds to the DNA, 
resulting in ``punctate'', relatively narrow peaks, 
the default option to re-center each peak around the point of 
greatest enrichment is appropriate. 
This keeps the peaks at a consistent width 
(in this case, the default \Rcode{summits=\Sexpr{tamoxifen$summits}}
results in \Sexpr{tamoxifen$summits*2+1}bp-wide intervals,
extending \Sexpr{tamoxifen$summits}bp up- and downstream of the summit).
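
The relationship between the \Rcode{summits} value and the resulting
interval can be checked with a little arithmetic.
This sketch uses a hypothetical summit position and a hypothetical
\Rcode{summits} value of 250, chosen purely for illustration;
it does not call any \Biocpkg{DiffBind} function.

<<eval=FALSE,echo=TRUE,results=hide>>=
summits    <- 250      # hypothetical value, for illustration only
summit.pos <- 100000   # hypothetical summit position on a chromosome
interval <- c(start = summit.pos - summits, end = summit.pos + summits)
interval
# Interval width in bp: 2 * summits + 1
peak.width <- unname(interval["end"] - interval["start"] + 1)
peak.width
@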

We can also plot a new correlation heatmap based on the count scores, 
seen in Figure~\ref{DiffBind-tamox_aff_corhm} 
(compare to Figure~\ref{DiffBind-tamox_occ_corhm}). 
While this shows a slightly different clustering,
with overall higher levels of correlation 
(due to using normalized read counts instead of 
whether or not a peak was called),
responsiveness to tamoxifen treatment does not appear to 
form a basis for clustering when using all of the affinity scores. 
(Note that at this point the count scores are computed 
using default normalization parameters, 
and that the clustering can change depending on which normalization 
and scoring metric are used; see Section~\ref{sec:normalization} for more details.)

<<tamox_aff_corhm, fig=TRUE, include=FALSE, width=9, height=9 >>=
plot(tamoxifen)
@
\incfig{DiffBind-tamox_aff_corhm}{.66\textwidth}
{Correlation heatmap, using affinity (read count) data.}
{Generated by: \Rcode{plot(tamoxifen)}; 
can also be generated by: \Rcode{dba.plotHeatmap(tamoxifen)}}

\subsection{Normalizing the data}

The next step is to tell \DBA how the data are to be normalized.
Normalization is discussed in detail in Section~\ref{sec:normalization};
here we consider the default normalization for our example,
obtained using the \Rcode{dba.normalize} function:

<<normalize,eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen <- dba.normalize(tamoxifen)
@

By default, the data are normalized based on
sequencing depth.

The details of the normalization can be examined:

<<show_norm,eval=TRUE,echo=TRUE,results=verbatim>>=
norm <- dba.normalize(tamoxifen, bRetrieve=TRUE)
norm
@

This shows the normalization method used (\Sexpr{norm$norm.method}),
the calculated normalization factors for each sample,
and the full library sizes (which include the total number of reads
in the associated .bam files).

The default library-size based method results
in all the library sizes being normalized to be the same (the \Rcode{mean}
library size):

<<norm_facs,eval=TRUE,echo=TRUE,results=verbatim>>=
normlibs <- cbind(FullLibSize=norm$lib.sizes, NormFacs=norm$norm.factors,
                  NormLibSize=round(norm$lib.sizes/norm$norm.factors))
rownames(normlibs) <- info$ID
normlibs
@
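
The arithmetic behind this table can be reproduced by hand.
The sketch below uses three hypothetical library sizes and assumes,
consistent with the table above, that each normalization factor is
the library size divided by the mean library size.

<<eval=FALSE,echo=TRUE,results=hide>>=
# Hypothetical full library sizes (read counts) for three samples
lib.sizes <- c(s1 = 400000, s2 = 500000, s3 = 600000)
# Assumed form of the library-size factors: size relative to the mean
norm.factors <- lib.sizes / mean(lib.sizes)
norm.factors
# Dividing each library size by its factor recovers the mean library size
norm.lib.sizes <- round(lib.sizes / norm.factors)
norm.lib.sizes
@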

Other values show that the control reads were subtracted from the ChIP reads
(this is done by default because no blacklists/greylists were applied,
see Section~\ref{sec:blacklists} for more details).

Normalization of ChIP (and related assays such as ATAC) data is a crucial,
if somewhat complex, area. 
Please see Section~\ref{sec:normalization}
for a more in-depth discussion
of normalization in \DBA.

\subsection{Establishing a model design and contrast}
Before running the differential analysis,
we need to tell \DBA how to model the data, 
including which comparison(s) we are interested in. 
This is done using the \Rcode{dba.contrast} function, as follows:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen <- dba.contrast(tamoxifen, 
                          reorderMeta=list(Condition="Responsive"))
tamoxifen
@

This call will set up any ``default'' contrasts by examining
the project metadata factors and assuming we want to
look at the differences between any two sample groups
with at least three replicates on each side of the comparison
(that is, any factor that has two different values where there
are at least three samples that share each value).
It also establishes the Responsive condition as the baseline, so
it will be in the denominator of the default contrast.
In the current case, there is only one comparison that qualifies:
the \Rcode{Condition} metadata factor has two values,
\Rcode{Resistant} and \Rcode{Responsive}, that 
have at least three replicates each 
(we see that there are four \Rcode{Resistant} sample replicates
and seven \Rcode{Responsive} sample replicates).

This function also establishes the default design,
which includes only the metadata factor directly involved in the
contrast (\emph{\Sexpr{dba.show(tamoxifen,bDesign=TRUE)}}).

While in this example we are using \Rcode{dba.contrast} in the default mode,
it does allow for fine-grained control over the 
design and contrasts one wishes to model.
See Section~\ref{sec:multifactor} for a more detailed discussion of how
including the \Rcode{Tissue} factor in the design
provides for better modeling of the example experiment.
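
As an illustration of that finer control, the same comparison could
also be requested explicitly.
This is a sketch that assumes the \Rcode{design} and \Rcode{contrast}
arguments of \Rcode{dba.contrast} as described on its help page,
and that starts from the supplied count data;
the object name \Rcode{tamoxifen.explicit} is introduced here only
for illustration.

<<eval=FALSE,echo=TRUE,results=hide>>=
library(DiffBind)
data(tamoxifen_counts)
# Explicit single-factor design; Resistant is compared against the
# Responsive baseline (factor name, numerator group, denominator group)
tamoxifen.explicit <- dba.contrast(tamoxifen, design="~Condition",
                                   contrast=c("Condition", "Resistant",
                                              "Responsive"))
dba.show(tamoxifen.explicit, bContrasts=TRUE)
@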

\subsection{Performing the differential analysis}
The main differential analysis function is invoked as follows:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen <- dba.analyze(tamoxifen)
dba.show(tamoxifen, bContrasts=TRUE)
@

This will run the default \DESeq analysis
(see Section~\ref{subsec:technical_deseq2} discussing the technical details 
of the analysis).
Displaying the results from the \Rcode{DBA object} shows that
\Sexpr{as.numeric(dba.show(tamoxifen,bContrasts=TRUE)$DB.DESeq2)}
of the
\Sexpr{nrow(tamoxifen$merged)}
sites are identified as being significantly differentially bound (DB)
using the default threshold of FDR $\leq$ \Sexpr{tamoxifen$config$th}.
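
Since the package can use either of the two underlying analysis
packages, it may be instructive to run both and compare the number of
sites each reports.
The following sketch assumes the \Rcode{method} argument of
\Rcode{dba.analyze} and the \Rcode{DBA\_ALL\_METHODS} constant as
described on the help page, and starts from the supplied count data;
the object name \Rcode{tamoxifen.both} is introduced here only for
illustration.

<<eval=FALSE,echo=TRUE,results=hide>>=
library(DiffBind)
data(tamoxifen_counts)
# Set up the default contrast, then analyze with both methods
tamoxifen.both <- dba.contrast(tamoxifen,
                               reorderMeta=list(Condition="Responsive"))
tamoxifen.both <- dba.analyze(tamoxifen.both, method=DBA_ALL_METHODS)
# One column of differentially bound site counts per method
both.summary <- dba.show(tamoxifen.both, bContrasts=TRUE)
both.summary
@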

A correlation heatmap can be plotted,
correlating only the
\Sexpr{as.numeric(dba.show(tamoxifen,bContrasts=TRUE)$DB.DESeq2)}
differentially bound sites identified by the  analysis,
as shown in Figure~\ref{DiffBind-tamox_sdb_corhm}.

<<tamox_sdb_corhm, fig=TRUE, include=FALSE, width=9, height=9 >>=
plot(tamoxifen, contrast=1)
@
\incfig{DiffBind-tamox_sdb_corhm}{.66\textwidth}
{Correlation heatmap, using only significantly differentially bound sites.}
{Generated by: \Rcode{plot(tamoxifen, contrast=1)};
can also be generated by: \Rcode{dba.plotHeatmap(tamoxifen, contrast=1)}}

Using only the differentially bound sites,
we now see that the four tamoxifen resistant samples
(representing two cell lines) cluster together,
while the seven responsive samples form a separate cluster.
Comparing Figure~\ref{DiffBind-tamox_aff_corhm}, which uses all
\Sexpr{nrow(tamoxifen$merged)}
consensus binding sites, with Figure~\ref{DiffBind-tamox_sdb_corhm},
which uses only the
\Sexpr{as.numeric(dba.show(tamoxifen,bContrasts=TRUE)$DB.DESeq2)}
differentially bound sites, demonstrates how
the differential binding analysis isolates sites
that help distinguish between the Resistant and Responsive sample groups.

Note that this plot is not a ``result'' in itself: the analysis selects
for sites that differ between the two conditions,
which are hence expected to form clusters representing the conditions.

See Section~\ref{sec:multifactor}, where a multi-factor design is 
applied to this analysis, for a more sophisticated way to model these data.

\subsection{Retrieving the differentially bound sites}
The final step is to retrieve the differentially bound sites as follows:

<<eval=TRUE,echo=TRUE,results=hide>>=
tamoxifen.DB <- dba.report(tamoxifen)
@

These are returned as a \Rclass{GRanges} object,
appropriate for downstream processing:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen.DB
@

The metadata columns show the mean read concentration over all the samples
(the default calculation uses log2 normalized read counts)
and the mean concentration over the samples in each of
the first (Resistant) group and second (Responsive) group.
The Fold column shows the log fold changes (LFCs)
between the two groups, as calculated by the \DESeq analysis.
A positive value indicates increased binding affinity in the Resistant group,
and a negative value indicates increased binding affinity
in the Responsive group. The final two columns give confidence measures for
identifying these sites as differentially bound,
with a raw p-value and a multiple-testing corrected FDR in the final column
(also calculated by the \DESeq analysis).
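
To make the relationship between the last two columns concrete,
the following sketch, which uses only base functions,
applies a Benjamini--Hochberg correction to a handful of hypothetical
p-values.
It assumes that the FDR column is a Benjamini--Hochberg adjusted
p-value (the default adjustment in \Biocpkg{DESeq2});
it is an illustration of the idea,
not the code that \Biocpkg{DiffBind} itself runs.

<<eval=FALSE,echo=TRUE,results=hide>>=
# Hypothetical raw p-values for five candidate binding sites
pvals.raw <- c(0.0001, 0.004, 0.019, 0.03, 0.6)
# Benjamini-Hochberg adjustment (assumed form of the FDR column)
fdr <- p.adjust(pvals.raw, method="BH")
fdr
# Number of sites passing an FDR threshold of 0.05
n.sig <- sum(fdr <= 0.05)
n.sig
@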

We can compare the number of differentially bound sites
that have enriched ER binding in the tamoxifen Resistant samples
and those with enriched binding in the tamoxifen Responsive samples:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
sum(tamoxifen.DB$Fold>0)
sum(tamoxifen.DB$Fold<0)
@
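
The report can also be narrowed or exported for downstream tools.
The sketch below assumes that the \Rcode{th} and \Rcode{fold} arguments
of \Rcode{dba.report} behave as described on its help page,
and uses the supplied \Rcode{tamoxifen\_analysis} object so that it
does not depend on the preceding steps;
the thresholds, object names, and output file name are arbitrary
choices made for illustration.

<<eval=FALSE,echo=TRUE,results=hide>>=
library(DiffBind)
data(tamoxifen_analysis)
# Stricter report: lower FDR threshold plus a minimum absolute log fold change
strict.DB <- dba.report(tamoxifen, th=0.01, fold=1)
# Split the sites by direction of change
gain <- strict.DB[strict.DB$Fold > 0]
loss <- strict.DB[strict.DB$Fold < 0]
# Export as a plain table (written to a temporary directory here)
out.file <- file.path(tempdir(), "tamoxifen_DB_strict.csv")
write.csv(as.data.frame(strict.DB), file=out.file, row.names=FALSE)
@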

The bias towards enriched binding in the Responsive case
(or loss of binding affinity in the Resistant case) can
be visualized using MA and Volcano plots, as shown in the following section.

\section{Plotting in \DBA}

Besides the correlation heatmaps we have been looking at,
a number of other plots are available using the affinity data.
This section covers Venn diagrams, MA plots, Volcano plots,
PCA plots, Boxplots, and Heatmaps.

\subsection{Venn diagrams}

Venn diagrams illustrate overlaps between different sets of peaks.
For example, amongst the differentially bound sites,
we can see the differences between the ``Gain'' sites
(those that increase binding enrichment in the Resistant condition)
and the ``Loss'' sites (those with lower enrichment) as follows:

<<tamox_sdb_venn, fig=TRUE, include=FALSE, width=9, height=9 >>=
dba.plotVenn(tamoxifen, contrast=1, bDB=TRUE,
             bGain=TRUE, bLoss=TRUE, bAll=FALSE)
@
\incfig{DiffBind-tamox_sdb_venn}{.66\textwidth}
{Venn diagram of Gain vs Loss differentially bound sites.}
{Generated by: \Rcode{dba.plotVenn(tamoxifen, contrast=1, bDB=TRUE,
bGain=TRUE, bLoss=TRUE, bAll=FALSE)}}

Figure~\ref{DiffBind-tamox_sdb_venn} shows the result.

Venn diagrams are also useful for examining overlaps between peaksets,
particularly when determining how best to derive consensus
peaksets for read counting and further analysis.
Section~\ref{sec:occupancy}, which discusses consensus peaksets, 
shows a number of Venn plots in context,
and the help page for \Rcode{dba.plotVenn} has a number of additional examples.

\subsection{PCA plots}
While the correlation heatmaps already seen are good for showing clustering,
plots based on principal components analysis can be used to give
a deeper insight into how samples are associated.
A PCA plot corresponding to Figure~\ref{DiffBind-tamox_aff_corhm},
which includes normalized read counts for all \Sexpr{nrow(tamoxifen$binding)}
binding sites, can be obtained as follows:

<<tamox_aff_pca, fig=TRUE, include=FALSE, width=9, height=9 >>=
dba.plotPCA(tamoxifen,DBA_TISSUE,label=DBA_CONDITION)
@
\incfig{DiffBind-tamox_aff_pca}{.66\textwidth}
{PCA plot using affinity data for all sites.}
{Generated by:
\Rcode{dba.plotPCA(tamoxifen,DBA\_TISSUE,label=DBA\_CONDITION)}}

The resulting plot (Figure~\ref{DiffBind-tamox_aff_pca})
shows all the MCF7-derived samples (red)
clustering on one side of the first (horizontal) component,
with the Responsive and Resistant samples separable in neither the first nor
the second (vertical) component.\footnote{Note
that they are separable in the second and third components;
try \Rcode{dba.plotPCA(tamoxifen,
DBA\_CONDITION,
label=DBA\_TISSUE, components=2:3)}.}

A PCA plot using only the
\Sexpr{as.numeric(dba.show(tamoxifen,bContrasts=TRUE)$DB.DESeq2)}
differentially bound sites
(corresponding to Figure~\ref{DiffBind-tamox_sdb_corhm}),
using an FDR threshold of 0.05, can be drawn as follows:

<<tamox_sdb_pca, fig=TRUE, include=FALSE, width=9, height=9 >>=
dba.plotPCA(tamoxifen, contrast=1, label=DBA_TISSUE)
@
\incfig{DiffBind-tamox_sdb_pca}{.66\textwidth}
{PCA plot using affinity data for only differentially bound sites.}
{Generated by: \Rcode{dba.plotPCA(tamoxifen,contrast=1,label=DBA\_TISSUE)}}

This plot (Figure~\ref{DiffBind-tamox_sdb_pca})
shows how the differential analysis identifies sites
that can be used to separate the Resistant and Responsive sample groups
along the first component.

The \Rcode{dba.plotPCA} function is customizable.
For example, if you want to see where the replicates
for each of the unique cell lines lie, type

\Rcode{dba.plotPCA(tamoxifen, attributes=c(DBA\_TISSUE,DBA\_CONDITION),
label=DBA\_REPLICATE)}.

If your installation of \R
supports 3D graphics using the \Rpackage{rgl} package,
try \Rcode{dba.plotPCA(tamoxifen,contrast=1,  b3D=TRUE)}.
Seeing the first three principal components
can be a useful exploratory exercise.

\subsection{MA plots}
MA plots are a useful way to visualize the relationship between
the overall binding level at each site and the magnitude of the
change in binding enrichment between conditions,
as well as the effect of normalization on data.
An MA plot can be obtained for the Resistant vs Responsive contrast as follows:

<<tamox_sdb_ma, fig=TRUE, include=FALSE, width=12, height=9 >>=
dba.plotMA(tamoxifen)
@
\incfig{DiffBind-tamox_sdb_ma}{1\textwidth}
{MA plot of Resistant-Responsive contrast.}
{Sites identified as significantly differentially bound shown in magenta.
Generated by: \Rcode{dba.plotMA(tamoxifen)}}

The plot is shown in Figure~\ref{DiffBind-tamox_sdb_ma}.
Each point represents a binding site,
with the \Sexpr{as.numeric(dba.show(tamoxifen,bContrasts=TRUE)$DB.DESeq2)}
points in magenta representing sites identified as differentially bound.
There is a blue horizontal line through the origin (0 LFC), as well as
a red curve representing a non-linear loess fit showing the underlying
relationship between coverage levels and fold changes.
The plot shows how the differentially bound sites appear
to have a minimum absolute log fold difference of somewhere
between one and two.
As we have already seen, it also shows that more ERa binding sites
lose binding affinity in the tamoxifen resistant condition
than gain binding affinity,
as evidenced by more differentially bound points below the center line than above it.
These same data can also be shown with the concentrations of each sample group
plotted against each other using \Rcode{dba.plotMA(tamoxifen, bXY=TRUE)}.

Section~\ref{sec:normalization} contains several examples of MA plots,
including showing non-normalized data, and the ability to
plot any subset of samples against any other set of samples.

\subsection{Volcano plots}

Similar to MA plots, Volcano plots also highlight significantly differentially
bound sites and show their fold changes.
Here, however, the confidence statistic (FDR or p-value) is
shown on a negative log scale, helping visualize the
relationship between the magnitude of fold changes and the
confidence that sites are differentially bound.
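
The effect of the negative log scale can be seen with a few
hypothetical FDR values, using only base functions:
smaller FDR values map to larger plotted values,
so the most confident sites appear highest on the plot.

<<eval=FALSE,echo=TRUE,results=hide>>=
# Hypothetical FDR values, from least to most significant
fdr.example <- c(0.5, 0.05, 0.001)
# Negative log10 transform used for the vertical axis of a volcano plot
neglog <- -log10(fdr.example)
neglog
@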

For example, the same data as plotted in Figure~\ref{DiffBind-tamox_sdb_ma}
can be visualized as a volcano plot:
<<tamox_sdb_volcano, fig=TRUE, include=FALSE, width=12, height=9 >>=
dba.plotVolcano(tamoxifen)
@
\incfig{DiffBind-tamox_sdb_volcano}{.66\textwidth}
{Volcano plot of Resistant-Responsive contrast.}
{Sites identified as significantly differentially bound shown in red.
Generated by: \Rcode{dba.plotVolcano(tamoxifen)}}

The plot is shown in Figure~\ref{DiffBind-tamox_sdb_volcano},
with the predominance of lower binding in the Resistant case
evidenced by the greater number of significant sites on the negative side of the
Fold Change (X) axis.

\subsection{Boxplots}

Boxplots provide a way to view how read distributions differ between
classes of binding sites. Consider the current example, in which
\Sexpr{as.numeric(dba.show(tamoxifen,bContrasts=TRUE)$DB.DESeq2)}
differentially bound sites are identified.
The MA plot (Figure~\ref{DiffBind-tamox_sdb_ma}) shows that these
are not distributed evenly
between those that increase binding affinity in the Responsive group vs.
those that increase binding affinity in the Resistant group.
This can be seen quantitatively using the sites returned in the report:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
sum(tamoxifen.DB$Fold<0)
sum(tamoxifen.DB$Fold>0)
@

But how are reads distributed amongst the different classes of
differentially bound sites and sample groups?
These data can be more clearly seen using a boxplot:

<<tamox_sdb_box, fig=TRUE, include=FALSE, width=9, height=9 >>=
pvals <- dba.plotBox(tamoxifen)
@

\incfig{DiffBind-tamox_sdb_box}{.66\textwidth}
{Box plots of read distributions for
significantly differentially bound (DB) sites.}
{Tamoxifen Resistant samples are shown in blue,
and Responsive samples are shown in red.
Left two boxes show distribution of reads over all DB sites
in the Resistant and Responsive groups;
middle two boxes show distributions of reads in DB sites that
increase in affinity in the Resistant group;
last two boxes show distributions of reads in DB sites that
increase in affinity in the Responsive group.
Generated by: \Rcode{dba.plotBox(tamoxifen)}}

The default plot (Figure~\ref{DiffBind-tamox_sdb_box}) shows in the
first two boxes that
amongst differentially bound sites overall,
the Responsive samples have a somewhat higher mean read concentration.
The next two boxes show the distribution of reads in differentially bound sites
that exhibit increased affinity in the Resistant samples,
while the final two boxes show the distribution of reads in
differentially bound sites that exhibit increased affinity
in the Responsive samples.

\Rcode{dba.plotBox} returns a matrix of p-values (computed using a two-sided
Wilcoxon `Mann-Whitney' test, paired where appropriate)
indicating which of these distributions are significantly different
from another distribution.

<<eval=TRUE,echo=TRUE,results=verbatim>>=
pvals
@

The significance of the overall difference in distribution of concentrations
amongst the differentially bound sites
in the two groups is shown to be
p-value=\Sexpr{sprintf("%1.2e",pvals['Resistant.DB','Responsive.DB'])},
while those between the Resistant and Responsive groups in the individual cases
(increased in Resistant or increased in Responsive)
have p-values computed as
\Sexpr{sprintf("%1.2e",pvals['Resistant.DB+','Responsive.DB+'])} and
\Sexpr{sprintf("%1.2e",pvals['Resistant.DB-','Responsive.DB-'])}.
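
To illustrate the kind of test behind these p-values, the following
self-contained sketch applies a two-sided Wilcoxon test to simulated data.
The vectors \Rcode{responsive} and \Rcode{resistant} are hypothetical
stand-ins for the log concentrations of the same sites in the two groups;
this is not the internal implementation of \Rcode{dba.plotBox}:

<<eval=FALSE,echo=TRUE>>=
set.seed(42)
n.sites <- 200
# hypothetical log2 concentrations for the same sites in two sample groups
responsive <- rnorm(n.sites, mean=6.0, sd=1)
resistant  <- responsive - rnorm(n.sites, mean=0.5, sd=0.5)
# paired test: each site is measured in both groups
p.paired <- wilcox.test(resistant, responsive, paired=TRUE)$p.value
# unpaired (Mann-Whitney) test: appropriate when the two sets of sites differ
p.unpaired <- wilcox.test(resistant, responsive, paired=FALSE)$p.value
c(paired=p.paired, unpaired=p.unpaired)
@

The paired form applies when the same sites are compared across groups,
while the unpaired form applies when two different sets of sites are compared.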

\subsection{Heatmaps}

\DBA provides two types of heatmaps. The first, correlation heatmaps,
we have already seen.
For example, the heatmap shown in Figure~\ref{DiffBind-tamox_aff_corhm}
can be generated as follows:
<<eval=TRUE,echo=TRUE,results=hide>>=
corvals <- dba.plotHeatmap(tamoxifen)
@

The effect of different scoring methods (normalization) can be examined in these plots by
setting the \Rcode{score} parameter to a different value.
The default value, \Rcode{DBA\_SCORE\_NORMALIZED},
uses the normalized read counts (see Section~\ref{sec:normalization}).
Another scoring method is to use RPKM fold
(RPKM of the ChIP reads divided by RPKM of the control reads);
a correlation heatmap for all the data using this scoring method can be
obtained by typing
\Rcode{dba.plotHeatmap(tamoxifen, score=DBA\_SCORE\_RPKM\_FOLD)}.

Another way to view the patterns of binding affinity directly in the differentially bound sites
is via a \emph{binding affinity heatmap},
showing the read scores for some or all of the binding sites.
This can be plotted for the example case as follows:

<<tamox_sdb_hm, fig=TRUE, include=FALSE, width=9, height=9 >>=
hmap <- colorRampPalette(c("red", "black", "green"))(n = 13)
readscores <- dba.plotHeatmap(tamoxifen, contrast=1, correlations=FALSE,
                              scale="row", colScheme = hmap)
@
\incfig{DiffBind-tamox_sdb_hm}{.66\textwidth}
{Binding affinity heatmap showing affinities for differentially bound sites.}
{Samples cluster first by whether they are responsive to tamoxifen treatment,
then by cell line, then by replicate.
Clusters of binding sites show distinct patterns of affinity levels.
Generated by:
\Rcode{dba.plotHeatmap(tamoxifen, contrast=1, correlations=FALSE)}}

Figure~\ref{DiffBind-tamox_sdb_hm} shows the affinities and clustering
of the differentially bound sites (rows),
as well as the sample clustering (columns).
In this case, the (normalized) counts have been row scaled,
and a red/green heatmap color palette applied.

\subsection{Profiling and Profile Heatmaps}

The \Rcode{dba.plotProfile()} function enables the computation of
peakset profiles and the plotting of complex heatmaps. 
It serves as a front-end to enable experiments analyzed using \DiffBind to 
more easily use the profiling and plotting functionality provided by 
the \Rpackage{profileplyr} package written by Tom Carroll and Doug Barrows
(see \url{https://bioconductor.org/packages/release/bioc/html/profileplyr.html}).

Processing proceeds in two phases.

In the first phase, specific peaksets are extracted from a 
\DiffBind \Rcode{DBA object} and profiles are calculated for these
peaks for a set of samples in the \DiffBind experiment. 
Profiles are calculated by counting the number of overlapping reads 
in a series of bins upstream and downstream of each peak center.

In the second phase, the derived profiles are plotted 
in a series of complex heatmaps showing the relative intensity 
of overlapping reads in each bin for each peak in each sample, 
along with summary plots showing the average profile 
across the sites for each sample.

Due to the computational cost of this function,
it is advised that the calculation of profiles and the plotting 
be separated into two calls, 
so that the profiles do not need to be re-generated 
if something goes wrong in the plotting. 
By default, when a \Rcode{DBA object} is passed in to generate profiles, 
plotting is turned off and a \Rcode{profileplyr} object is returned. 
When \Rcode{dba.plotProfile()} is called with a \Rcode{profileplyr} object, 
a plot is generated by default.

The main aspects of the profile plot are which \textbf{samples} 
are plotted (the X-axis) and which \textbf{sites} are plotted (the Y-axis).
These can be specified in a number of flexible ways. 
Other parameters to \Rcode{dba.plotProfile()} determine
how the data are treated, 
controlling aspects such as how many sites are included in the plot, 
data normalization, sample merging 
(computing mean profiles for groups of samples), 
and control over the appearance of the plot.

While a few brief examples are included here, a more complete
document is available showing more of the various options.
The markdown source for this can be accessed as
\Rcode{system.file('extra/plotProfileDemo.Rmd',package='DiffBind')};
an html version is available for browsing at
\url{https://content.cruk.cam.ac.uk/bioinformatics/software/DiffBind/plotProfileDemo.html},
and a pdf document can be found at
\url{https://content.cruk.cam.ac.uk/bioinformatics/software/DiffBind/plotProfileDemo.pdf}.

\subsubsection{Default profile plot}

If an analysis has been completed, the default plot 
will be based on the results of the first contrast. 
If the contrast compares two conditions, 
all of the samples in each condition will be included, 
with the heatmaps colored separately for samples in each contrast condition.

Sample groups are merged based on the \Rcode{DBA\_REPLICATE} attribute, 
such that each sample class will have one heatmap based on the 
normalized mean read counts for all the samples in that class 
that have the same metadata except for the Replicate number.

In terms of sites, two groups of differentially bound sites are included: 
\textbf{Gain} sites (with positive fold change) and 
\textbf{Loss} sites (negative fold change). 
If there are more than 1,000 sites in either category, 
the 1,000 sites with the greatest absolute fold change will be included 
(the maximum number of sites to be profiled can be altered).

<<eval=FALSE,echo=TRUE,results=hide>>=
profiles <- dba.plotProfile(tamoxifen)
dba.plotProfile(profiles)
@

\begin{figure}
	\centering
	\includegraphics[width=6in]{plotProfile-default.png}
	\caption[Default \Rcode{dba.plotProfile()} plot.]
	{Default \Rcode{dba.plotProfile()} plot, showing each sample group, colored
	by contrast condition (Resistant or Responsive).}
\end{figure}

This plot shows how the differentially bound sites are divided into 
Gain and Loss groups, and how sample groups belonging to each of the 
two contrast conditions (Resistant and Responsive) result 
in differently colored heatmaps.
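
As a sketch of how the cap on the number of profiled sites might be changed,
the following assumes that \Rcode{dba.plotProfile()} accepts a
\Rcode{maxSites} parameter (default 1,000); consult
\Rcode{?dba.plotProfile} for the installed version before relying on it.
The object name \Rcode{profiles.500} is illustrative:

<<eval=FALSE,echo=TRUE>>=
# profile at most 500 Gain and 500 Loss sites (assumed maxSites parameter)
profiles.500 <- dba.plotProfile(tamoxifen, maxSites=500)
dba.plotProfile(profiles.500)
@

Lowering the cap reduces the time needed to compute the profiles,
at the cost of showing fewer sites in each group.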

\subsubsection{Merging all samples in a contrast condition}

In the sample experiment, there are multiple sample groups comprising 
each side of the contrast: the Resistant class has two sample groups 
based on the BT474 and MCF7 cell lines, 
while the Responsive class has three groups, based on the 
MCF7, T47D, and ZR75 cell lines. 
If we want to generate composite profiles for the 
Responsive and Resistant classes, 
the \Rcode{DBA\_TISSUE} attribute can be added to the merge specification. 
By specifying \Rcode{merge=c(DBA\_TISSUE, DBA\_REPLICATE)}, 
the samples are divided into groups each with the same metadata values 
\emph{except} for the Replicate and Tissue factors.
Samples within each group are merged,
so that the (normalized) mean counts  
for all of the Resistant samples will be calculated, 
as well as for all of the Responsive samples:

<<eval=FALSE,echo=TRUE,results=hide>>=
profiles <- dba.plotProfile(tamoxifen,merge=c(DBA_TISSUE, DBA_REPLICATE))
dba.plotProfile(profiles)
@

\begin{figure}
	\centering
	\includegraphics[width=6in]{plotProfile-merge.png}
	\caption[\Rcode{dba.plotProfile()} plot with sample groups merged.]
	{Default \Rcode{dba.plotProfile()} plot with sample groups merged, 
	showing mean signal for all Resistant and Responsive samples.}
\end{figure}


\subsubsection{Avoiding merging to show all sample replicates}

Masks can be used to specify which samples we want to include in the plot.
For example, to see each of the MCF7 samples separately, divided into
contrast groups, specify the samples as a list of two sample masks, 
one combining \Rcode{MCF7} with \Rcode{Resistant}, 
and one combining \Rcode{MCF7} with \Rcode{Responsive}.
Further specifying \Rcode{merge=NULL} 
will prevent the replicates from being merged, so profiles for 
each replicate will be computed and plotted:

<<eval=FALSE,echo=TRUE,results=hide>>=
mask.MCF7 <- tamoxifen$masks$MCF7
mask.Resistant  <- tamoxifen$masks$Resistant
mask.Responsive <- tamoxifen$masks$Responsive
profiles <- dba.plotProfile(tamoxifen,
                           samples=list(MCF7_Resistant=
                                          mask.MCF7 & mask.Resistant,
                                        MCF7_Responsive=
                                          mask.MCF7 & mask.Responsive),
                           merge=NULL)
dba.plotProfile(profiles)
@

\begin{figure}
	\centering
	\includegraphics[width=6in]{plotProfile-MCF7.png}
	\caption[\Rcode{dba.plotProfile()} for unmerged MCF7 samples.]
	{Default \Rcode{dba.plotProfile()} plot showing Gain and Loss sites in 
	all MCF7 sample replicates.}
\end{figure}

Further examples can be found in the \Rcode{plotProfileDemo} notebook.

\section{Example: Multi-factor designs}
\label{sec:multifactor}

The previous discussion showed how to perform a differential binding analysis
using a single factor (\Rcode{Condition}) with two values (\Rcode{Resistant}
and \Rcode{Responsive});
that is, finding the significantly differentially bound sites between
two sample groups.
This section extends the example by including other factors in the design.

One example of this is where there is a second factor,
potentially with multiple values,
that represents a confounding condition.
Examples include cases where  there are potential batch effects,
with samples from the two conditions prepared together,
or a matched design (e.g. matched normal and tumor pairs,
where the primary factor of interest is to discover sites
consistently differentially bound between normal and tumor samples).
More complex designs can also include specific interactions between factors.

In the current example, the  effect we want to control for is the
different cell culture strains used, in particular the
fact that some samples in both of the sample groups,
based on different \Rcode{Condition} values
(one tamoxifen responsive and one tamoxifen resistant),
are both derived from the same \Rcode{Tissue} (MCF7 cell line).

In the previous analysis,
the two MCF7-derived cell lines tended to cluster together.
While the differential binding analysis was able to
identify sites that could be used to separate the
resistant from the responsive samples,
the confounding effect of the common ancestry
could still be seen even when considering only
the significantly differentially bound sites
(Figure~\ref{DiffBind-tamox_aff_corhm}).

Using the generalized linear modeling (GLM) functionality
included in \DESeq and \edgeR,
the confounding factor can be modeled, enabling a more sensitive analysis.
This is done by specifying a design formula explicitly to \Rcode{dba.contrast}.

<<eval=TRUE,echo=TRUE,results=hide>>=
tamoxifen <- dba.contrast(tamoxifen,design="~Tissue + Condition")
@
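
To see what such a two-factor design implies, the corresponding model matrix
can be built with base \R. The sketch below uses a hypothetical metadata table
that mirrors the sample groups described above; the replicate counts are
illustrative assumptions rather than values read from the \Rcode{DBA} object,
and the reference levels chosen by base \R may differ from those used by \DBA:

<<eval=FALSE,echo=TRUE>>=
# hypothetical sample metadata: one row per sample
meta <- data.frame(
  Tissue    = factor(c("BT474","BT474","MCF7","MCF7","MCF7",
                       "T47D","T47D","MCF7","MCF7","ZR75","ZR75")),
  Condition = factor(c("Resistant","Resistant","Responsive","Responsive",
                       "Responsive","Responsive","Responsive",
                       "Resistant","Resistant","Responsive","Responsive"))
)
# one coefficient per non-reference Tissue level, plus one for Condition
design <- model.matrix(~ Tissue + Condition, data=meta)
colnames(design)
# the coefficients are estimable only if the matrix has full column rank
qr(design)$rank == ncol(design)
@

Because MCF7 samples appear in both conditions, the \Rcode{Condition}
coefficient can be separated from the \Rcode{Tissue} coefficients.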

Note that by changing the design formula, the previous results are cleared, requiring the
analysis to be run anew:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen <- dba.analyze(tamoxifen)
dba.show(tamoxifen, bContrasts=TRUE)
@

This shows that where the standard, single-factor \DESeq analysis identifies
\Sexpr{length(tamoxifen.DB)}
differentially bound sites, the analysis using the two-factor design finds
\Sexpr{as.numeric(dba.show(tamoxifen,bContrasts=TRUE)$DB.DESeq2)}
such sites. 
MA and Volcano plots show how the analysis has changed:

<<tamox_block_ma, fig=TRUE, include=FALSE, width=12, height=9 >>=
dba.plotMA(tamoxifen)
@
\incfig{DiffBind-tamox_block_ma}{.66\textwidth}
{MA plot of Resistant-Responsive contrast,
using a multi-factor design "\Sexpr{tamoxifen$design}".}
{Sites identified as significantly differentially bound shown in magenta.
Generated by: \Rcode{dba.plotMA(tamoxifen)}}

<<tamox_block_vol, fig=TRUE, include=FALSE, width=12, height=9 >>=
dba.plotVolcano(tamoxifen)
@
\incfig{DiffBind-tamox_block_vol}{.66\textwidth}
{Volcano plot of Resistant-Responsive contrast,
using a multi-factor design "\Sexpr{tamoxifen$design}".}
{Sites identified as significantly differentially bound shown in magenta.
Generated by: \Rcode{dba.plotVolcano(tamoxifen)}}

The resulting plots are shown in Figure~\ref{DiffBind-tamox_block_ma}
and Figure~\ref{DiffBind-tamox_block_vol}.
Comparing these to Figure~\ref{DiffBind-tamox_sdb_ma}
and Figure~\ref{DiffBind-tamox_sdb_volcano},
a number of differences can be observed.
First, the analysis has become more sensitive, with sites being identified as
significantly differentially bound with lower magnitude fold changes.
Secondly, the distribution of differentially bound sites has shifted.
In the single-factor analysis, they were mostly concentrated in
the lower left of the MA plot,
identifying sites with increased binding in the Responsive condition,
but with relatively low concentrations.
In the multi-factor analysis, sites are identified as being
significantly differentially bound
across a wider range of concentrations and more are identified that
gain binding affinity in the Resistant condition.
Finally, the horizontal red loess fit line appears to better fit
the calculated fold changes while showing an overall slight loss of
binding enrichment.

These observations can be quantified from the report data:
<<eval=TRUE,echo=TRUE,results=verbatim>>=
multifactor.DB <- dba.report(tamoxifen)
@

Looking at sensitivity, we can compare the distribution of fold changes
of the differentially bound sites identified by the single- and multi-factor
analyses:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
min(abs(tamoxifen.DB$Fold))
min(abs(multifactor.DB$Fold))
@

Likewise, we can compare the proportions of sites identified as
being differentially bound between those that gain binding
enrichment in the Resistant condition over those more enriched
in the Responsive condition, between the single- and multi-factor analyses:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
sum(tamoxifen.DB$Fold > 0) / sum(tamoxifen.DB$Fold < 0)
sum(multifactor.DB$Fold > 0) / sum(multifactor.DB$Fold < 0)
@


It is also interesting to compare the performance of
\edgeR with that of  \DESeq on this dataset:
<<eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen <- dba.analyze(tamoxifen,method=DBA_ALL_METHODS)
dba.show(tamoxifen,bContrasts=TRUE)
@

We see that  \edgeR identifies a somewhat higher number
of sites than \DESeq.
You can check this by looking at the identified sites using \Rcode{dba.report},
and performing MA, Volcano, heatmap, and PCA plots.

We can also compare the sites identified using \edgeR and \DESeq.
An easy way to do this is to use a special feature of the
\Rfunction{dba.plotVenn} function that
shows the overlaps of contrast results:

<<tamox_block_venn, fig=TRUE, include=FALSE, width=9, height=9 >>=
tamoxifen.OL <- dba.plotVenn(tamoxifen,contrast=1,method=DBA_ALL_METHODS,
                             bDB=TRUE)
@
\incfig{DiffBind-tamox_block_venn}{.66\textwidth}
{Venn diagram showing overlap of
differentially bound peaks identified using \edgeR
and \DESeq to do a multi-factor analysis.}
{Generated by plotting the result of:
\Rcode{dba.plotVenn(tamoxifen,contrast=1,method=DBA\_ALL\_METHODS,bDB=TRUE)}}

The overlap is shown in Figure~\ref{DiffBind-tamox_block_venn}.
The largest group of sites is identified by
both \edgeR and \DESeq.
Note that the binding sites unique to \edgeR, \DESeq, and common to both are
returned in the variable \Rcode{tamoxifen.OL}.
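
A sketch of how the returned overlaps might be inspected.
It assumes that \Rcode{tamoxifen.OL} is a list with one element per region
of the Venn diagram and that each element is a \Rcode{GRanges} object;
the element names are printed rather than assumed:

<<eval=FALSE,echo=TRUE>>=
names(tamoxifen.OL)             # which groups of sites were returned
sapply(tamoxifen.OL, length)    # number of sites in each group (if GRanges)
@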

To further illustrate the ability to model the data by evaluating
multiple contrasts against a single model,
consider another comparison.
Suppose we'd like to identify ER binding sites that are differentially bound
in all of the MCF7 samples compared to the T47D samples\footnote{We consider 
this example again in Section~\ref{sec:offsets}.}.
We can do this by adding a contrast to our existing model.
For illustrative purposes, we will also re-order the values for the
\Rcode{Tissue} factor to make MCF7 the reference group:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen <- dba.contrast(tamoxifen,contrast=c("Tissue","MCF7","T47D"),
                          reorderMeta = list(Tissue="MCF7"))
tamoxifen <- dba.analyze(tamoxifen,method=DBA_ALL_METHODS)
dba.show(tamoxifen,bContrasts=TRUE)
@
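
The sites for the added contrast can then be retrieved by contrast number.
This is a minimal sketch, assuming the \Rcode{Tissue} contrast is listed as
the second contrast by \Rcode{dba.show}; the variable names
\Rcode{tissue.DB} and \Rcode{tissue.DB.edgeR} are illustrative:

<<eval=FALSE,echo=TRUE>>=
# sites for the MCF7 vs. T47D contrast using the default analysis method
tissue.DB <- dba.report(tamoxifen, contrast=2)
# the same contrast using the edgeR results
tissue.DB.edgeR <- dba.report(tamoxifen, contrast=2, method=DBA_EDGER)
length(tissue.DB)
length(tissue.DB.edgeR)
@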

\section{Blacklists and Greylists}
\label{sec:blacklists}

Good practice in analyzing ChIP-seq (and ATAC-seq) experimental data
includes the use of \textbf{blacklists} to remove certain regions
in the reference genome from the analysis\cite{amemiya2019blacklist}.
This section describes how to accomplish this
for both publicly published blacklists as well
as experiment-specific \textbf{greylists}.

\subsection{What are blacklists and greylists?}

\textbf{Blacklists} are pre-defined lists of regions specific 
to a reference genome that are known to be problematic.
The best known lists have been identified
as part of the ENCODE project\cite{amemiya2019blacklist} and are
available for a variety of reference genomes and genome versions.
The current ENCODE blacklists are available through
the \Rcode{dba.blacklist} function.

\textbf{Greylists} are specific to a ChIP-seq experiment, and
are derived from the controls generated as part of 
the experiment\cite{brown2015greylistchip}.
The idea is to analyze libraries that are not
meant to show systematic enrichment (such as Inputs, in which no
antibody is introduced), and identify anomalous regions
where a disproportionate degree of signal is present.
These regions can then be excluded from subsequent analysis.

\subsection{Why apply blacklists and greylists?}
Application of blacklists prevents identification of problematic regions 
in the reference genome as being differentially bound. 
The regions tend to be ones with a high degree of repeats or
unusual base concentrations.
Application of greylists prevents identification of problematic genomic
regions in the materials used in the experiment as being differentially bound.
For example, these could include areas of high copy-number alterations 
in a cell line.

It has been shown that problematic reads, such as duplicate reads,
are disproportionately likely to overlap blacklisted regions,
and the quality of experimental data can be increased if reads in 
blacklisted regions are excluded\cite{carroll2014artifacts}.

Another way of thinking about greylists is that they are one
way of using the information in the control tracks to
improve the reliability of the analysis.
Prior to version 3.0, the default in \DBA has been to 
simply subtract control reads
from ChIP reads in order to dampen the magnitude
of enrichment in anomalous regions.
Greylists represent a more principled way of accomplishing this.
If a greylist has been applied,
the current default in \DBA is to \emph{not} subtract control
reads.

\subsection{When should blacklists and greylists be applied?}
Within \Rpackage{DiffBind}, blacklists and greylists are applied to candidate peak
regions prior to performing a quantitative analysis.
This should be done before calculating a consensus peakset by
excluding blacklisted peaks from each individual peakset.
It can also be done after counting overlapping reads 
by excluding consensus peaks that overlap a blacklisted or greylisted region
(see examples below).

It is worth noting that, ideally, blacklists and greylists would
be applied earlier in the process, to the aligned reads (bam files) 
themselves, prior to any peak calling.
Popular peak callers, such as MACS, use the control tracks
to model background noise levels, a step that plays a critical
role in identifying truly enriched ``peak'' regions.
Excluding the blacklisted reads prior to peak calling should
result in more accurate identification of enriched regions in the
non-blacklisted areas of the genome.
The \Rcode{dba.blacklist} function offers a way
to easily retrieve any blacklists and computed greylists,
enabling the ability to go back and re-process the data
with blacklisted reads removed prior to the peak-calling step.
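
As a sketch of how the retrieved lists might be written out for use
before peak calling, the following assumes that a blacklist and a greylist
have already been applied to the \Rcode{DBA} object, that the retrieved
greylist is a list with a \Rcode{master} element, and that the
\Rpackage{rtracklayer} package is installed.
The output file names are hypothetical:

<<eval=FALSE,echo=TRUE>>=
library(rtracklayer)
# regions of the applied blacklist
bl <- dba.blacklist(tamoxifen, Retrieve=DBA_BLACKLIST)
# applied greylist (assumed to hold a master element)
gl <- dba.blacklist(tamoxifen, Retrieve=DBA_GREYLIST)
# write both as BED files; the format is inferred from the file extension
export(bl, "blacklist_hg19.bed")
export(gl$master, "greylist_master.bed")
@

The resulting BED files can then be used with an external tool of choice
to filter the aligned reads before peaks are called again.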

\subsection{Example: How to apply a blacklist}

The primary function that controls blacklists and greylists is
called \Rcode{dba.blacklist}.
In order to use this, you either need to have
an existing blacklist, or know the reference genome
used to align your sequencing data.
In the example dataset, the reference is Hg19.

ENCODE blacklists can be applied straightforwardly as follows:\footnote{While
in this example the blacklist used is specified explicitly,
\DBA can usually determine the correct blacklist to apply
using the default \Rcode{blacklist=TRUE}.}
<<blacklistPeaks,eval=TRUE,echo=TRUE,results=verbatim>>=
data(tamoxifen_peaks)
tamoxifen
peakdata  <- dba.show(tamoxifen)$Intervals
tamoxifen <- dba.blacklist(tamoxifen, blacklist=DBA_BLACKLIST_HG19, 
                           greylist=FALSE)
tamoxifen
peakdata.BL <- dba.show(tamoxifen)$Intervals
peakdata - peakdata.BL
@

This shows that a single peak was excluded from each of the three 
MCF7 (Responsive) replicates.

Alternatively, the blacklist could be applied \emph{after}
composing a consensus peakset and counting reads.
This is useful to see its impact on the analysis.

Remembering our earlier analysis, with the results stored
in \Rcode{multifactor.DB}:

<<blacklistAnal,eval=TRUE,echo=TRUE,results=verbatim>>=
length(multifactor.DB)
data(tamoxifen_counts)
tamoxifen   <- dba.blacklist(tamoxifen, blacklist=DBA_BLACKLIST_HG19,
                             greylist=FALSE)
blacklisted <- dba.blacklist(tamoxifen, Retrieve=DBA_BLACKLISTED_PEAKS)
tamoxifen   <- dba.contrast(tamoxifen, design="~Tissue + Condition")
tamoxifen   <- dba.analyze(tamoxifen)
blacklisted.DB <- dba.report(tamoxifen)
length(blacklisted.DB)
@

This shows that the peak interval excluded by the blacklist was returned
as a significantly differentially bound site in the original analysis,
but is absent in the analysis performed after blacklisting:

<<blacklistRes,eval=TRUE,echo=TRUE,results=verbatim>>=
bl_site <- match(blacklisted[[1]], multifactor.DB)
multifactor.DB[bl_site,]
is.na(match(blacklisted[[1]], blacklisted.DB))
@

\subsection{Example: How to apply a greylist}

Using greylists is somewhat more complicated.
If they have been pre-computed from control tracks,
either using the \Rpackage{GreyListChIP} package
or from a previous run of \DBA, they can be supplied
directly to \Rcode{dba.blacklist}.
A pre-computed greylist for the sample experiment has
been included with the \DBA package:
<<greylistGet,eval=TRUE,echo=TRUE,results=verbatim>>=
data(tamoxifen_greylist)
names(tamoxifen.greylist)
tamoxifen.greylist$master
@

The greylist object is a list with two elements.
The first, \Rcode{tamoxifen.greylist\$master}, contains
the full greylist to be applied.
The other element, \Rcode{tamoxifen.greylist\$controls}, is
a \Rcode{GRangesList} containing the individual greylists
computed for each of the five control tracks:

<<greylistControls,eval=TRUE,echo=TRUE,results=verbatim>>=
names(tamoxifen.greylist$controls)
tamoxifen.greylist$controls
@

These greylists can be re-combined if the controls are re-used
in other experiments.
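
For example, a greylist for a hypothetical experiment that re-uses only two
of these controls could be assembled along the following lines.
This sketch assumes that the per-control greylists are \Rcode{GRanges}
objects held in a \Rcode{GRangesList}, and that a master list can be formed
as the merged union of the per-control regions; the choice of the first two
controls is purely illustrative:

<<eval=FALSE,echo=TRUE>>=
library(GenomicRanges)
# hypothetical subset: the first two control greylists
controls.used <- tamoxifen.greylist$controls[1:2]
# pool the per-control regions and merge any that overlap
master.sub <- reduce(unlist(controls.used))
# mirror the two-element layout of the packaged greylist object
greylist.sub <- list(master=master.sub, controls=controls.used)
@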

The greylist can be applied (along with the blacklist) as follows:

<<greylistPeaks,eval=TRUE,echo=TRUE,results=verbatim>>=
data(tamoxifen_peaks)
tamoxifen <- dba.blacklist(tamoxifen, blacklist=DBA_BLACKLIST_HG19,
                           greylist=tamoxifen.greylist)
@

For this example, we can  apply it to the
binding matrix after computing a consensus peakset and counting
overlapping reads:

<<greylistCons,eval=TRUE,echo=TRUE,results=verbatim>>=
data(tamoxifen_counts)
cons.peaks <- dba.show(tamoxifen)$Intervals[1]
tamoxifen  <- dba.blacklist(tamoxifen, blacklist=DBA_BLACKLIST_HG19,
                            greylist=tamoxifen.greylist)
cons.peaks.grey <- dba.show(tamoxifen)$Intervals[1]
cons.peaks - cons.peaks.grey
@

Some \Sexpr{cons.peaks - cons.peaks.grey} peaks that overlap
the greylist have been excluded from the consensus peakset (representing 
\Sexpr{(1-round(cons.peaks.grey/cons.peaks,3))*100}\% of the total).

\subsection{Example: How to compute a greylist with \Rpackage{GreyListChIP}}

Most of the time, if your experiment has controls such as Input tracks,
the greylist needs to be computed specifically for the analysis.
This can be accomplished using \Rcode{dba.blacklist} to
invoke the \Rpackage{GreyListChIP} package automatically.

Here is how the sample greylist was computed for this example.
You can execute this code if you have all the read data for the vignette
(see Section~\ref{sec:vignette_data}).

Instead of specifying a greylist explicitly, if the \Rcode{greylist} 
is set to \Rcode{TRUE}, the genome will be determined (if possible)
automatically from the control bam files associated with the experiment.
Alternatively, a specific reference genome can be supplied, either using
a constant provided by \DBA (such as \Rcode{DBA\_BLACKLIST\_HG19}),
or the name of a \Rpackage{BSgenome} object 
(such as \Rcode{"BSgenome.Hsapiens.UCSC.hg19"}).
In these cases, the \Rpackage{GreyListChIP} package
is invoked to compute a greylist for each
control in the experiment, as well as a merged
"master" greylist of all the overlapping regions of all the
control greylists.
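
The two alternatives described above might look as follows.
This is a sketch only: it assumes that \Rcode{tamoxifen} was built from a
sample sheet whose control bam files are available locally, and the result
names \Rcode{tam.auto} and \Rcode{tam.hg19} are illustrative:

<<eval=FALSE,echo=TRUE>>=
# alternative 1: detect the genome automatically from the control bam files
tam.auto <- dba.blacklist(tamoxifen, blacklist=TRUE, greylist=TRUE)
# alternative 2: name the genome explicitly via a BSgenome package name
tam.hg19 <- dba.blacklist(tamoxifen, blacklist=DBA_BLACKLIST_HG19,
                          greylist="BSgenome.Hsapiens.UCSC.hg19")
@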

In the current example
we can use the following code. 
(Note that even for the sample data, this
can take a substantial amount of time to run, as long or longer
than running \Rcode{dba.count}):
<<greylistMake,eval=FALSE,echo=TRUE,results=verbatim>>=
tamoxifen <- dba(sampleSheet="tamoxifen.csv")
tamoxifen <- dba.blacklist(tamoxifen)
tamoxifen.greylist <- dba.blacklist(tamoxifen, Retrieve=DBA_GREYLIST)
@

The output for this command appears as follows:

\begin{verbatim}
Genome detected: Hsapiens.UCSC.hg19
Applying blacklist...
Removed: 3 of 14102 intervals.
Building greylist: BT474c
coverage: 166912 bp (0.21%)
Building greylist: MCF7c
coverage: 106495 bp (0.14%)
Building greylist: T47Dc
coverage: 56832 bp (0.07%)
Building greylist: TAMRc
coverage: 122879 bp (0.16%)
Building greylist: ZR75c
coverage: 68608 bp (0.09%)
BT474c: 58 ranges, 166912 bases
MCF7c: 14 ranges, 106495 bases
T47Dc: 11 ranges, 56832 bases
TAMRc: 10 ranges, 122879 bases
ZR75c: 12 ranges, 68608 bases
Master greylist: 69 ranges, 251391 bases
Removed: 423 of 14102 intervals.
\end{verbatim}

Note that the sensitivity of how much is excluded can be controlled
by a configuration parameter. The default is \Rcode{0.999}:

<<greylistP,eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen$config$greylist.pval <- 0.999
@

Altering this can cause more or less of the genome to be excluded 
for an experiment. See the documentation for \Rpackage{GreyListChIP} for more
details.

\section{Normalization}
\label{sec:normalization}

Normalization of experimental data is particularly important in ChIP-seq
(and ATAC-seq) analysis, and may require more careful consideration than
needed for RNA-seq analysis. 
This is because ChIP-seq experiments cover a wider range of
cases than RNA-seq experiments, which usually involve a similar set of
possible expressed genes and/or transcripts, many of which are not
expected to significantly change expression.
ChIP, ATAC, and similar enrichment-based sequencing data may
not follow the assumptions inherent in popular methods for normalizing
RNA-seq data, and may exhibit different types
of efficiency and other biases.

This section discusses options for normalization in \DBA using
the \Rcode{dba.normalize} interface function,
and shows how they affect the example experiment.
It describes the core normalization methods,
methods for calculating library sizes,
options for using background binning for a more "neutral"
normalization, and the potential
to apply separate normalization factors for each 
read measurement (offsets), including loess fit normalization.
Section~\ref{sec:comparison} compares the impact of
seven different normalization options in both \DESeq and \edgeR,
showing how normalizing against the background vs.
enriched consensus reads has a greater impact on analysis results
than which specific normalization method is chosen.

The final part of the discussion demonstrates how to 
exploit reads associated with spike-ins and parallel factors, 
as an alternative to background reads, to normalize these datasets.

\subsection{Core normalization methods}

\DBA relies on three underlying core methods for normalization.
These include the "native" normalization methods supplied
with \DESeq and \edgeR, as well as a simple library-based method.
The normalization method is specified with the \Rcode{normalize}
parameter to \Rcode{dba.normalize}.

The native \DESeq  normalization method 
is based on calculating the
geometric mean for each gene across samples\cite{Love2014},
and is referred to as \code{"RLE"} or \code{DBA\_NORM\_RLE} in \DBA.
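
The idea behind this approach can be sketched in a few lines of base \R.
The count matrix below is hypothetical, and the sketch omits details handled
by the real implementation (for example, sites with zero counts);
it is meant only to show the median-of-ratios calculation:

<<eval=FALSE,echo=TRUE>>=
# hypothetical count matrix: rows are binding sites, columns are samples
counts <- matrix(c( 10,  20,  40,
                   100, 200, 400,
                    50, 100, 200,
                    30,  60, 120),
                 ncol=3, byrow=TRUE,
                 dimnames=list(NULL, c("s1","s2","s3")))
# per-site geometric mean across samples, computed on the log scale
log.geomean <- rowMeans(log(counts))
# size factor: median ratio of a sample's counts to the geometric means
size.factors <- apply(counts, 2, function(x)
  exp(median(log(x) - log.geomean)))
size.factors
@

In this toy example each sample is an exact multiple of the first,
so the size factors recover those multiples relative to the geometric mean.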

The native \edgeR normalization method is
based on the trimmed mean of M-values approach\cite{robinson2010scaling},
and is referred to as \code{"TMM"} or \Rcode{DBA\_NORM\_TMM} in \DBA.

A third method is also provided that avoids making any assumptions
regarding the distribution of reads between binding sites in different
samples that may be specific to RNA-seq analysis and inappropriate
for ChIP-seq analysis. 
This method (\Rcode{"lib"} or \Rcode{DBA\_NORM\_LIB})
is based on the different library sizes for each sample,
computing normalization factors to weight each sample so as to appear
to have the same library size.
For \DESeq, this is accomplished by dividing the number of reads in each library
by the \Rcode{mean} library size\footnote{The \Rcode{mean} function can
be changed by the user by setting the \Rcode{libFun} parameter}. 
For \edgeR, the normalization factors are all set to \Rcode{1.0},
indicating that only library-size normalization should occur.
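
The arithmetic of the library-size approach can be illustrated as follows.
The library sizes and counts are hypothetical, and the sketch shows only the
scaling by the mean library size described above, not the package internals:

<<eval=FALSE,echo=TRUE>>=
# hypothetical library sizes (total aligned reads per sample)
lib.sizes <- c(s1=20e6, s2=30e6, s3=10e6)
# factor for each sample: library size relative to the mean library size
norm.factors <- lib.sizes / mean(lib.sizes)
norm.factors
# hypothetical raw counts at one site, proportional to sequencing depth
raw <- c(s1=200, s2=300, s3=100)
# dividing by the factors puts the samples on a common depth
raw / norm.factors
@

Here the raw counts differ only because of sequencing depth,
so the normalized values are identical across the three samples.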

Note that any of these normalization methods can be used with 
either the \DESeq or \edgeR analysis methods, 
as \DBA converts normalization factors to work
appropriately in either \DESeq or \edgeR.

Consider some potential normalization options for the example dataset.
We can start with non-normalized data:

<<norm0,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE,width=12,height=9>>=
data(tamoxifen_analysis)
dba.plotMA(tamoxifen, contrast=list(Resistant=tamoxifen$masks$Resistant),
           bNormalized=FALSE, sub="Non-Normalized")
@
\incfig{DiffBind-norm0}{1\textwidth}
{MA plot showing non-normalized data}
{Generated by plotting the result of:
\Rcode{dba.plotMA(tamoxifen, contrast=list(Resistant=tamoxifen\$masks\$Resistant),
bNormalized=FALSE, sub="Non-Normalized")}}

Figure~\ref{DiffBind-norm0} shows the non-Resistant (Responsive) samples
with a greater raw read density, 
with the darkest part of the blue cloud being located
beneath the horizontal blue line centered at zero fold change,
as well as the entirety of the red loess fit curve likewise residing
below that line.

A key question is whether this is an artifact that needs to be corrected via 
normalization, or a biological signal that needs to be retained.

Compare the non-normalized data to an analysis based on a normalization
that only takes the library size (total number of aligned reads, or 
sequencing depth) into account:

<<normDESeq2LibFull,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE,width=12,height=9>>=
tamoxifen <- dba.normalize(tamoxifen, normalize=DBA_NORM_LIB)
tamoxifen <- dba.analyze(tamoxifen)
dba.plotMA(tamoxifen, method=DBA_DESEQ2, sub="DESeq2:lib:full")
@
\incfig{DiffBind-normDESeq2LibFull}{1\textwidth}
{MA plot showing results of analysis using Library-size based normalization.}

Figure~\ref{DiffBind-normDESeq2LibFull} shows that normalizing for 
sequencing depth alters the bias in signal enrichment towards the 
Responsive samples only slightly, with the highest density closer to,
but still below, the center line, and the bulk of the loess fit line
in the lower half as well.
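
The values behind a normalization can be inspected directly.
The sketch below assumes the \Rcode{bRetrieve} parameter and the element
names described in the help page for \Rcode{dba.normalize}, and uses the
\Rcode{tamoxifen} object from the preceding chunk; it is shown here without
being evaluated:

<<sketchRetrieveNorm,eval=FALSE,echo=TRUE>>=
## Retrieve the normalization settings currently stored in the object
norm <- dba.normalize(tamoxifen, bRetrieve=TRUE)
norm$norm.method    # normalization method in effect
norm$lib.method     # how the library sizes were calculated
norm$lib.sizes      # library size used for each sample
norm$norm.factors   # normalization factor computed for each sample
@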

The following code gathers the differentially bound
sites in a variable \Rcode{dbs}, in which we will accumulate results
in order to compare the results of different types of normalization:
<<normRes1,eval=TRUE,echo=TRUE,results=verbatim>>=
dbs <- dba.report(tamoxifen, bDB=TRUE, bGain=TRUE, bLoss=TRUE)
dbs$config$factor <- "normalize"
dbs$class[DBA_ID,] <- colnames(dbs$class)[1] <-  "LIB_Full"
dbs$class[DBA_FACTOR,] <- DBA_NORM_LIB
dbs
@

The analysis results reflect the bias towards enrichment in 
the Responsive samples. Of \Sexpr{dba.show(dbs)$Intervals[1]} total sites
identified as differentially bound, \Sexpr{dba.show(dbs)$Intervals[3]}
(\Sexpr{round((dba.show(dbs)$Intervals[3]/dba.show(dbs)$Intervals[1])*100)}\%)
exhibit greater binding affinity in the Responsive condition, while
only  \Sexpr{dba.show(dbs)$Intervals[2]} are enriched in the Resistant
condition.

Compare this to the  \Rcode{RLE} normalization procedure native to
\DESeq:
<<normDESeq2RLE,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE,width=12,height=9>>=
tamoxifen <- dba.normalize(tamoxifen, normalize=DBA_NORM_NATIVE)
tamoxifen <- dba.analyze(tamoxifen)
dba.plotMA(tamoxifen, method=DBA_DESEQ2, sub="DESeq2:RLE:RiP")
@
\incfig{DiffBind-normDESeq2RLE}{1\textwidth}
{MA plot showing results of analysis using RLE based normalization.}

Figure~\ref{DiffBind-normDESeq2RLE} shows quite a different picture.
More sites are close to the zero-fold center line, and the loess
fit sits largely on top, if not slightly above, that line.
The results of the analysis have changed substantially as well:

<<normRes2,eval=TRUE,echo=TRUE,results=verbatim>>=
db <- dba.report(tamoxifen, bDB=TRUE, bGain=TRUE, bLoss=TRUE)
db$class[DBA_ID,] <- "RLE_RiP"
db$class[DBA_FACTOR,] <- DBA_NORM_RLE
dbs <- dba.peakset(dbs, db)
db
@

The \Sexpr{dba.show(dbs)$Intervals[4]} sites identified as
differentially bound in this analysis are much more evenly
divided, with a few \emph{more} sites enriched in the Resistant
condition (\Sexpr{dba.show(dbs)$Intervals[5]}) than in the
Responsive condition (\Sexpr{dba.show(dbs)$Intervals[6]}).
We can compare overlaps between the analysis based
on library size normalization vs RLE normalization:

<<normDESeq2Comparison,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE, height=12>>=
par(mfrow=c(3,1))
dba.plotVenn(dbs,c(1,4), main="Total DB Sites")
dba.plotVenn(dbs,dbs$masks$Gain,main="Gain in Resistant")
dba.plotVenn(dbs,dbs$masks$Loss,main="Gain in Responsive")
par(mfrow=c(1,1))
@
\incfig{DiffBind-normDESeq2Comparison}{1\textwidth}
{Venn diagrams showing overlaps of sites identified as differentially bound
when using library size vs RLE normalization.}

The \Rcode{RLE} normalization, developed for normalizing RNA-seq count matrices,
has resulted in normalizing the data such that the binding changes
are more evenly distributed between the two conditions.
In Figure~\ref{DiffBind-normDESeq2Comparison}, the top plot
shows there are \Sexpr{length(dba.overlap(dbs,c(1,4))$inAll)}
sites identified in both analyses, with another
\Sexpr{length(dba.overlap(dbs,c(1,4))$onlyA)+length(dba.overlap(dbs,c(1,4))$onlyB)}
unique to one analysis or the other.
The remaining diagrams show how the \Rcode{RLE}-based analysis
identifies many additional sites that gain binding affinity in the 
Resistant condition, while "missing" sites identified in the
library-size based analysis as being enriched in the Responsive condition.

These two analyses could lead to quite different biological 
conclusions regarding the dynamics of ER binding in response to tamoxifen.
Which one should be favored?
Given the nature of the experimental design
for the example data, with both biological
replication in the form of multiple cell lines as
well as multiple experimental/technical replicates,
and without having a prior reason to believe that changes
in binding affinity should be balanced,
it would be difficult to justify preferring the \Rcode{RLE}-based analysis,
as it alters the data distribution to a greater extent.
It is largely for this reason that a library-size normalization
is set as the default method within \DBA.

\subsection{Library size calculations}

An important parameter involved in normalizing is the 
calculation of library sizes.
In the previous sub-section, for the library-size based normalization,
the library size for each sample
was set to the total number of aligned reads in the bam file for that
sample, representing the sequencing depth.
This is the Full library size, represented in \DBA as
\Rcode{DBA\_LIBSIZE\_FULL} or \Rcode{"full"}.

Another popular way of calculating the library size is to
sum the reads that overlap consensus peaks in each sample (Reads in Peaks).
Within \DBA, this is known as
\Rcode{DBA\_LIBSIZE\_PEAKREADS} or \Rcode{"RiP"}.
In this case, the matrix of read counts overlapping consensus
sites functions similarly to the count matrix in a standard RNA-seq analysis.
Library sizes calculated in this manner take into account 
aspects of both the sequencing depth and the
"efficiency" of the ChIP. 
An inefficient ChIP, where a high proportion of the reads are not
in enriched peaks (high proportion of "background" reads), may not have
a strong signal even if sequenced to a greater depth.
When using the native normalization methods, developed originally for RNA-seq,
the \Rcode{"RiP"} library sizes are used.
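
The difference between the two library size definitions can be seen in a
small base \R{} sketch.
The numbers are invented: two samples are sequenced to the same depth, but
one has a less efficient ChIP:

<<sketchRiP,eval=FALSE,echo=TRUE>>=
## Made-up values for two samples sequenced to the same depth
full.lib <- c(S1=30e6,  S2=30e6)    # total aligned reads
rip      <- c(S1=3.0e6, S2=1.5e6)   # reads overlapping consensus peaks
## Fraction of reads in peaks: S2 has the less efficient ChIP
frip <- rip / full.lib
frip
## Library-size factors under each definition of library size
full.lib / mean(full.lib)   # identical depth, so both factors are 1
rip / mean(rip)             # S1 gets a factor above 1, S2 a factor below 1
@

With full library sizes the two samples are left as they are, while the
Reads in Peaks definition scales them towards each other, whether the
difference in peak reads is technical or biological.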

We can compare the results of a library-size based normalization
using the full library sizes, as in Figure~\ref{DiffBind-normDESeq2LibFull},
with one using the reads in peaks:

<<normDESeq2LibRiP,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE,width=12,height=9>>=
tamoxifen <- dba.normalize(tamoxifen, normalize=DBA_NORM_LIB,
                           library=DBA_LIBSIZE_PEAKREADS, background=FALSE)
tamoxifen <- dba.analyze(tamoxifen)
dba.plotMA(tamoxifen, method=DBA_DESEQ2, sub="DESeq2:lib:RiP")
@
\incfig{DiffBind-normDESeq2LibRiP}{1\textwidth}
{MA Plot showing results of analysis using "RiP" for library-size
normalization}

Figure~\ref{DiffBind-normDESeq2LibRiP} shows how the library-size 
normalization based on peak reads differs from one based
on sequencing depth alone.
The normalization is more "even", in that most sites have similar read densities
(and are closer to the blue, center zero-fold line), and the
loess curve is no longer below the zero-fold line.
The analysis itself shows more balance between differentially bound sites
that gain affinity in the two conditions:

<<normRes3,eval=TRUE,echo=TRUE,results=verbatim>>=
dbs$class[DBA_CONDITION,1:3] <- DBA_LIBSIZE_FULL
dbs$class[DBA_CONDITION,4:6] <- DBA_LIBSIZE_PEAKREADS
dbs$config$condition <- "lib.size"
db <- dba.report(tamoxifen, bDB=TRUE, bGain=TRUE, bLoss=TRUE)
db$class[DBA_ID,] <- "LIB_RiP"
db$class[DBA_FACTOR,] <- DBA_NORM_LIB
db$class[DBA_CONDITION,] <- DBA_LIBSIZE_PEAKREADS
dbs <- dba.peakset(dbs, db)
db
@

The numbers and ratios of the gain and loss sites look a lot more like the 
previous \Rcode{RLE}-based analysis than the full library-size based analysis,
as can be confirmed with a Venn 
diagram:

<<normDESeq2LibsizeVenn,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE, height=7>>=
dba.plotVenn(dbs,c(1,7,4),main="DB Sites")
@
\incfig{DiffBind-normDESeq2LibsizeVenn}{1\textwidth}
{Venn diagram showing overlapping differentially bound sites identified in
analyses using library-size normalization with full and reads-in-peaks
library sizes, as well as \Rcode{RLE} normalization using reads-in-peaks.}

Figure~\ref{DiffBind-normDESeq2LibsizeVenn} shows the overlaps
of the differentially bound sites from the two library-size based
analyses and the prior \Rcode{RLE} analysis.
When using reads in peaks, the library-size analysis identifies
many of the same sites enriched in the Resistant group that the 
\Rcode{RLE} analysis detected, as well as 
\Sexpr{length(dba.overlap(dbs,c(1,4,7))$notB)} sites that are
more enriched in the Responsive condition that the prior library-size based
analysis identified but that the \Rcode{RLE} one did not.

Since calculating library sizes using Reads in Peaks yields results that
are similar to the \Rcode{RLE} method, identifying
more differentially bound peaks that are more evenly balanced
between the conditions,
it may be that taking ChIP
efficiency into consideration, not just sequencing depth, is advantageous.
However, there is no unbiased way to determine if the underlying
cause of differences in reads overlapping consensus peaks is
due to a technical issue in ChIP,
or to a true biological signal whereby there "really" are different
degrees of overall binding.

In the current example, the fraction of reads in peaks (\Rcode{FRiP})
is fairly consistent between replicates for each cell type, indicating
that the differences are \emph{not} due to ChIP efficiency:

<<frip,eval=TRUE,echo=TRUE,results=verbatim>>=
dba.show(tamoxifen,attributes=c(DBA_ID,DBA_FRIP))
@

It is largely for this reason that the default normalization in \DBA 
uses library size normalization based on full library sizes.

\subsection{Background normalization}

In the previous section, we saw how a library-size normalization based
on the full library size, rather than on the reads
in the enriched areas designated by the consensus peaks set,
enabled an analysis that avoided making assumptions
about the changes in binding patterns (specifically, assumptions
that binding changes are roughly balanced). 
Another way to do this, while gaining some of the
benefits of the native normalization methods,
is to use \emph{background normalization}.
Given the difficulty in differentiating between technical biases and 
biological signal when attempting to normalize ChIP-seq
data, the idea is to normalize based on a more neutral sample
of reads.

The core background normalization technique is to divide the
genome into large bins and count overlapping reads\footnote{
This method is used in the \Rpackage{THOR}
differential analysis tool\cite{allhoff2016thor},
and is discussed in the User Guide for the \Rpackage{csaw}
package\cite{lun2016csaw}.
Internally, \DBA uses the \Rpackage{csaw} methods to compute
background reads}.
As the enrichment in ChIP-seq (and ATAC-seq) is expected
to occur over relatively narrow intervals (roughly between 100bp and 600bp),
there should not be systematic differences in signals
over much larger intervals (on the order of 10,000bp and greater).
Any differences seen should be technical rather than biological, so
it is safer to normalize based on these differences.

Note also that this type of background normalization
is one of the methods recommended for 
ATAC-seq analysis \cite{reske2020atac}.

By specifying \Rcode{background=TRUE} in \Rcode{dba.normalize},
all chromosomes that contain peaks in the consensus set are divided
into non-overlapping bins (default size 15,000bp) and overlapping reads counted.
These can then be normalized using any of the normalization methods.
While a library-size based normalization on the background bins will generally
yield the same result as one on the full library size\footnote{The only difference is that
the "full" library size includes the total number of reads in the supplied
bam files, while the background normalization, by default, 
counts the reads on chromosomes which contain peaks in the peakset.},
the native normalization methods can also be applied to the background bins.
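
The binning step can be pictured with a toy base \R{} sketch.
This is not how \DBA computes background reads (it relies on
\Rpackage{csaw} for that); the chromosome length, read positions, and
variable names are all invented:

<<sketchBackgroundBins,eval=FALSE,echo=TRUE>>=
set.seed(42)
bin.size   <- 15000     # bin width, matching the default mentioned above
chr.length <- 150000    # made-up chromosome length
## Made-up read start positions for two samples of different depth
reads <- list(S1=sample(chr.length, 2000, replace=TRUE),
              S2=sample(chr.length, 3000, replace=TRUE))
## Count the reads falling in each non-overlapping bin
n.bins <- ceiling(chr.length / bin.size)
bin.counts <- sapply(reads, function(pos) {
  tabulate((pos - 1) %/% bin.size + 1, nbins=n.bins)
})
dim(bin.counts)         # 10 bins x 2 samples
## The bin counts can feed any core method; here, a library-size factor
bg.sizes <- colSums(bin.counts)
bg.sizes / mean(bg.sizes)
@

The resulting matrix of bin counts plays the role that the consensus count
matrix plays for the native methods, but without being restricted to
enriched regions.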

For the example, we can compare to a \DESeq analysis
with \Rcode{RLE} normalization, as well as an \edgeR analysis
(with its native \Rcode{TMM} normalization). 
Note that computing background reads requires
access to the full sequencing data (bam files);
however, the loaded example objects 
\Rcode{tamoxifen\_counts} and \Rcode{tamoxifen\_analysis} include
the results of calculating background reads and can be used as-is.

<<normBG,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE,width=12,height=16>>=
data(tamoxifen_analysis)
tamoxifen <- dba.normalize(tamoxifen, method=DBA_ALL_METHODS,
                           normalize=DBA_NORM_NATIVE,
                           background=TRUE)
tamoxifen <- dba.analyze(tamoxifen, method=DBA_ALL_METHODS)
dba.show(tamoxifen,bContrasts=TRUE)
par(mfrow=c(2,1))
dba.plotMA(tamoxifen, method=DBA_EDGER, sub="edgeR:TMM:background")
dba.plotMA(tamoxifen, method=DBA_DESEQ2, sub="DESeq2:RLE:background")
par(mfrow=c(1,1))
@
\incfig{DiffBind-normBG}{1\textwidth}
{MA Plots showing results of analysis using background reads and 
(top) \Rcode{TMM} normalization with \edgeR and
(bottom) \Rcode{RLE} normalization with \DESeq}

Figure~\ref{DiffBind-normBG} shows the results of these analyses.
While the loess fit line is smoother and closer to the zero fold change line,
the sites identified as being differentially bound remain biased towards
those with a loss of binding affinity in the Resistant condition.

<<normBgRes,eval=TRUE,echo=TRUE,results=verbatim>>=
db <- dba.report(tamoxifen, bDB=TRUE, bGain=TRUE, bLoss=TRUE)
db$class[DBA_ID,] <- "RLE_BG"
db$class[DBA_FACTOR,] <- DBA_NORM_RLE
db$class[DBA_CONDITION,] <- DBA_LIBSIZE_BACKGROUND
dbs <- dba.peakset(dbs, db)
db
@

Of the \Sexpr{dba.show(db)$Intervals[1]} differentially bound sites,
only \Sexpr{dba.show(db)$Intervals[2]} gain affinity in the Resistant
condition, while \Sexpr{dba.show(db)$Intervals[3]} show a loss.
This analysis is quite similar to the one carried out using
full library size based normalization:

<<normBGVenns,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE, width=14,height=20>>= 
par(mfcol=c(3,2))
dba.plotVenn(dbs,c(1,10),   main="All Differentially Bound Sites")
dba.plotVenn(dbs,c(2,11),   main="Gain in Resistant cells")
dba.plotVenn(dbs,c(3,12),   main="Loss in Resistant cells")
dba.plotVenn(dbs,c(1,10,4), main="All Differentially Bound Sites")
dba.plotVenn(dbs,c(2,11,5), main="Gain in Resistant cells")
dba.plotVenn(dbs,c(3,12,6), main="Loss in Resistant cells")
@
\incfig{DiffBind-normBGVenns}{1\textwidth}
{Venn diagrams of overlapping differentially bound sites identified using
full library size based normalization compared to background RLE and 
Reads in Peaks RLE normalization}

Figure~\ref{DiffBind-normBGVenns} illustrates the differences. 
The left column compares the impact of full library size normalization to 
that of background normalization, showing how the background normalization
is even more conservative, identifying fewer sites among both those that gain
and those that lose binding affinity.
The right column includes the results of an analysis based 
on the native \Rcode{RLE} normalization method for \DESeq,
identifying many additional sites that gain binding affinity in the
Resistant cells, while "missing" many sites that lose binding affinity in the
Resistant cells.

\subsection{Offsets and loess normalization}
\label{sec:offsets}

While the normalization methods discussed so far have relied on computing 
normalization coefficients for each sample,
an alternative is to compute normalization factors for
each read count in the consensus count matrix (that is, for each
consensus peak for each sample).
These are called \emph{normalization factors} in \DESeq and
\emph{offsets} in \edgeR.

A matrix of offsets can be supplied via the \Rcode{dba.normalize}
function, or an offset matrix can be calculated
using a \Rcode{loess} fit.
As described in the user guide for the 
\Rpackage{csaw} package\cite{lun2016csaw},
this method can help correct certain kinds of biases in the data,
particularly trended biases where there is a systematic difference
in fold changes as mean concentration levels change.
This normalization method was also identified in
\cite{reske2020atac} as being advantageous for ATAC-seq data
that show a trended bias.
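
The principle of a loess-based correction can be sketched with simulated
data in base \R{}.
This is a simplification: a real implementation works on the count matrix
sample by sample, whereas the toy below collapses the problem to a single
vector of fold changes. All values and names are invented:

<<sketchLoessOffsets,eval=FALSE,echo=TRUE>>=
set.seed(1)
n <- 1000
## Simulated mean log concentration (A) and log fold change (M) per site
A <- runif(n, min=2, max=10)
M <- 0.2 * (A - 6) + rnorm(n, sd=0.4)   # fold change drifts with concentration
## Fit a loess curve of fold change against concentration
fit <- loess(M ~ A, span=0.5)
## Treat the fitted trend as a per-site offset and remove it
offsets   <- fitted(fit)
corrected <- M - offsets
## Correlation with concentration before and after the correction
cor(A, M)
cor(A, corrected)
@

The correction removes whatever trend the loess curve captures, regardless of
whether that trend is technical or biological.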

In the tamoxifen example, we don't see this kind of bias.
However we can approximate it by reducing the data to two
sample groups:

<<MCF7T47D,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE,width=12,height=9>>=
mcf7t47d <- dba(tamoxifen,mask=c(3:7))
dba.plotMA(mcf7t47d,
           contrast=list(MCF7=mcf7t47d$masks$MCF7,
                         T47D=mcf7t47d$masks$T47D), 
           bNormalized=FALSE)
@
\incfig{DiffBind-MCF7T47D}{1\textwidth}
{MA Plot showing evidence of a "trended bias" in reduced dataset}

Figure~\ref{DiffBind-MCF7T47D} shows the apparent trended bias between the 
Responsive MCF7 and T47D samples.

Next we will use a loess fit to generate offsets, 
followed by an \edgeR analysis 
(to better approximate the \Rpackage{csaw} example):
<<loessfit,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE,width=12,height=9>>=
mcf7t47d$config$AnalysisMethod <- DBA_EDGER
mcf7t47d <- dba.normalize(mcf7t47d, offsets=TRUE)
mcf7t47d <- dba.contrast(mcf7t47d, contrast=c("Tissue","MCF7","T47D"))
mcf7t47d <- dba.analyze(mcf7t47d)
dba.plotMA(mcf7t47d)
@
\incfig{DiffBind-loessfit}{1\textwidth}
{MA Plot showing results of  normalizing with offsets
generated using a loess fit.}

Figure~\ref{DiffBind-loessfit} shows how the bias is "corrected" by this
normalization.
The result is a very close balance between sites identified
as significantly gaining binding affinity in the MCF7 samples
versus those gaining binding affinity in the T47D samples:

<<compareLoess,eval=TRUE,echo=TRUE,results=verbatim>>=
mcf7t47d.DB <- dba.report(mcf7t47d)
sum(mcf7t47d.DB$Fold > 0)
sum(mcf7t47d.DB$Fold < 0)
@

If we are certain that the bias is technical, this could save
the dataset.
However, if we cannot rule out that the observed
trend reflects a genuine biological
signal, such as a case where the MCF7 cells have a set of
sites with much higher binding affinity overall, this normalization
could skew the data so as to prevent these sites from being identified.

\subsection{Comparing the impact of normalization methods on analysis results}
\label{sec:comparison}

As can be seen, there are many possible ways of analyzing the same data
when taking into account analysis methods (\DESeq and \edgeR),
library size calculations (relying on all sequencing reads or only those 
that overlap consensus peaks), normalization methods 
(\Rcode{RLE}, \Rcode{TMM}, \Rcode{loess offsets}, or solely by library size), 
and finally whether to normalize based on enriched versus background regions.

There are seven primary ways to normalize the example dataset:
\begin{enumerate}
\item
Library size normalization using full sequencing depth
\item
Library size normalization using Reads in Peaks
\item
\Rcode{RLE} on Reads in Peaks
\item
\Rcode{TMM} on Reads in Peaks
\item
\Rcode{loess} fit on Reads in Peaks
\item
\Rcode{RLE} on Background bins
\item
\Rcode{TMM} on Background bins
\end{enumerate}

We can cycle through the seven possibilities,
repeated for \DESeq and for \edgeR,
and gather the results in a report-based \Rcode{DBA} object
called \Rcode{dbs.all}
for plotting and further analysis:
<<normCompare,eval=TRUE,echo=FALSE,results=verbatim>>=
data(tamoxifen_analysis)
dbs.all <- NULL
for(norm in c("lib","RLE","TMM", "loess")) {
  for(libsize in c("full","RiP","background")) {
    tam <- NULL
    background <- offsets <- FALSE
    if(libsize == "full" && norm != "lib") {
      background <- NULL
    }
    if(libsize == DBA_LIBSIZE_BACKGROUND) {
      background <- TRUE
      if(norm == DBA_NORM_LIB) {
        background <- NULL
      }
    } 
    if(norm == "loess" && !is.null(background)) {
      offsets <- TRUE
      if(libsize != "background") {
        background <- FALSE
      } else {
        background <- NULL
      }
    }
    if(!is.null(background)) {
      tam <- dba.normalize(tamoxifen, method=DBA_ALL_METHODS,
                           normalize=norm, library=libsize, 
                           background=background, offsets=offsets)
    }
    if(!is.null(tam)) {
      tam <- dba.analyze(tam, method=DBA_ALL_METHODS)
      for(meth in DBA_ALL_METHODS) {
        db <- dba.report(tam, method=meth, bDB=TRUE)
        if(meth == DBA_EDGER) {
          methstr <- "edgeR"
        } else {
          methstr <- "DESeq2"
        }
        if(libsize == "background") {
          libstr <- "BG"
        } else {
          libstr <- libsize
        }
        id <- paste(norm,libstr,methstr,sep="_")
        if(libsize == "full") {
          libstr <- "BG"
        }
        if(is.null(dbs.all)) {
          dbs.all <- db
          dbs.all$config$factor    <- "Normalization Method"
          dbs.all$config$condition <- "Reference Reads"
          dbs.all$config$treatment <- "Analysis Method"
          dbs.all$class[DBA_ID,]  <- colnames(dbs.all$class)[1] <- id
          
          dbs.all$class[DBA_FACTOR,]    <- norm
          dbs.all$class[DBA_CONDITION,] <- libstr
          dbs.all$class[DBA_TREATMENT,] <- methstr
          dbs.all$class[DBA_TISSUE,]    <- NA
        } else {
          db$class[DBA_ID,]        <- id
          db$class[DBA_FACTOR,]    <- norm
          db$class[DBA_CONDITION,] <- libstr
          db$class[DBA_TREATMENT,] <- methstr
          db$class[DBA_TISSUE,]    <- NA
          dbs.all <- dba.peakset(dbs.all,db)
        }
      }
    }
  }
}
dbs.all <- dba(dbs.all,minOverlap=1)
dbs.all
@
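
A single step of this comparison, for one combination
(\Rcode{TMM} on Reads in Peaks), might look like the following sketch.
It uses only calls and parameter values that appear elsewhere in this
section, assumes the \Rcode{tamoxifen} object loaded from
\Rcode{tamoxifen\_analysis}, and is not evaluated here:

<<sketchOneCombination,eval=FALSE,echo=TRUE>>=
## Normalize for both analysis methods: TMM on reads in peaks
tam <- dba.normalize(tamoxifen, method=DBA_ALL_METHODS,
                     normalize=DBA_NORM_TMM,
                     library=DBA_LIBSIZE_PEAKREADS,
                     background=FALSE)
## Re-run the differential analysis with both edgeR and DESeq2
tam <- dba.analyze(tam, method=DBA_ALL_METHODS)
## Collect the differentially bound sites from the edgeR analysis
db <- dba.report(tam, method=DBA_EDGER, bDB=TRUE)
db
@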

First considering the \DESeq results, we can plot
a heatmap of the identified differentially bound peaks
to see how the methods cluster:

<<normClusterDESeq2,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE, width=14,height=15>>=
deseq <- dba(dbs.all,mask=dbs.all$masks$DESeq2, minOverlap = 1)
binding <- dba.peakset(deseq, bRetrieve=TRUE)
dba.plotHeatmap(deseq, maxSites=nrow(binding),  bLog=FALSE,
                correlations=FALSE,minval=-5, maxval=5, cexCol=1.3, 
                colScheme = hmap, main = "DESeq2 Differentially Bound Sites", 
                ColAttributes = c(DBA_CONDITION, DBA_FACTOR),
                key.title = "LFC")
@
\incfig{DiffBind-normClusterDESeq2}{1\textwidth}
{Clustering heatmap of differentially bound sites from \Rpackage{DESeq2}
analyses using various normalization and reference read methods}

In Figure~\ref{DiffBind-normClusterDESeq2}, sites that gain affinity in
the tamoxifen Resistant condition are shown in green (positive fold changes),
those that gain affinity in the Responsive condition are shown in red
(negative fold changes), and those that are not identified by a specific
analysis are shown in black (zero fold change).

The analyses break into two main clusters, one containing the four analyses
that relied on the main count matrix ("RiP") for normalizing, and the
other encompassing the three analyses that did not 
(relying on either background read counts 
or on the total number of reads in the sequencing libraries).
The RiP cluster shows a greater balance between sites that gain binding strength 
in each of the two conditions,
while analyses using background reads identify
mostly sites with greater binding affinity in the Responsive sample group.

Given that the choice of which sets of reads to use for normalizing 
(focusing on reads in peaks or on all the reads in the libraries)
is more important in determining analysis results than the choice of
specific normalization method, we can now look at
the relative importance of the choice of analysis methods
(\DESeq and \edgeR):
<<normCluster1,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE, width=14,height=15>>=
binding <- dba.peakset(dbs.all, bRetrieve=TRUE)
dba.plotHeatmap(dbs.all, maxSites=nrow(binding), bLog=FALSE,
                correlations=FALSE, minval=-5, maxval=5, cexCol=1.3, 
                colScheme = hmap, key.title="LFC",
                ColAttributes = c(DBA_CONDITION, DBA_TREATMENT, DBA_FACTOR),
                main="All Differentially Bound Sites")
@
\incfig{DiffBind-normCluster1}{1\textwidth}
{Heatmap of fold changes in differentially bound sites from
\DESeq and  \edgeR analyses using various normalization and 
reference read methods}

<<normCluster2,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE, width=14,height=15>>=
dba.plotHeatmap(dbs.all, cexCol=1.3, main="Correlations of DB Sites",
                ColAttributes = c(DBA_CONDITION, DBA_TREATMENT, DBA_FACTOR))
@
\incfig{DiffBind-normCluster2}{1\textwidth}
{Correlation heatmap of fold changes in differentially bound sites from
\DESeq and  \edgeR analyses using various normalization and 
reference read methods}

Figure~\ref{DiffBind-normCluster1} and Figure~\ref{DiffBind-normCluster2}  
show that the main division into a cluster
that normalizes using the consensus count matrix and another that
uses background reads (or raw sequencing depth) is maintained.
Within each of these two clusters, the analyses cluster by
analysis method (\DESeq vs \edgeR).

Again, the specific analysis method used appears to be far less important
than which sets of reads are used as the basis for normalization.
Indeed, the native RNA-seq methods, \Rcode{RLE} and \Rcode{TMM},
give very similar and highly correlated results when using
the same count data.

An assumption in RNA-seq analysis, that the read count matrix reflects
an unbiased representation of the experimental data,
may be violated when using a narrow set of consensus
peaks that are chosen specifically based on their rates
of enrichment. It is not clear that using 
normalization methods developed for RNA-seq count matrices
on the consensus reads
will not alter biological signals; 
it is probably a good idea to
avoid using the consensus count matrix (Reads in Peaks) 
for normalizing unless there is a good prior reason 
to expect balanced changes in binding.

\subsection{Spike-in normalization}

An alternative for avoiding the use of the consensus count matrix
when normalizing ChIP-seq data is to use
\emph{spike-in} data, where exogenous chromatin 
(usually from \emph{Drosophila melanogaster})
is "spiked in" to the ChIP.
If the amount of spiked-in chromatin can be precisely controlled,
then we can use the relative amounts of reads that map
to the alternative reference genome for each sample.
\DBA allows for spike-in reads to be included in the experiment,
either as an additional set of sequencing reads (bam) files, or
included in the primary reads (assuming a hybrid
reference genome, where the exogenous reads align to different
chromosome names than do the consensus peaks).
Counting these reads then forms a background we can use
in the same manner as background normalization discussed previously
(either a library size adjustment using the total number of exogenous reads
for each sample, or a native RNA-seq method (\Rcode{TMM} or \Rcode{RLE})
taking into account how those reads are distributed).

To illustrate this, an example dataset has been included in the \DBA package.
This dataset is derived from that used in \cite{guertin2018parallel}\footnote{
The data, including the sequencing reads, are available to install
from GitHub at \Rcode{andrewholding/Brundle} and 
\Rcode{andrewholding/BrundleData}.}.
In this dataset, which also looks at ER binding,
the signal is highly unbalanced between conditions.
In this case we know that this is the result of a genuine biological
phenomenon, as the condition treated with Fulvestrant is known
to block ER binding.
We are looking to normalize the data while preserving the 
bias toward binding in the non-Fulvestrant condition.

We can load the data as follows, observing the two conditions,
each with four replicates:
<<loadSpikes,eval=TRUE,echo=TRUE,results=verbatim>>=
load(system.file('extra/spikes.rda',package='DiffBind'))
spikes
@

For this experiment, \emph{Drosophila} reads are included
using the \Rcode{Spikein} column in the samplesheet:
<<spikeins,eval=TRUE,echo=TRUE,results=verbatim>>=
spikes$samples$Spikein
@
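
For orientation, a samplesheet with such a column might be laid out as in
the following sketch.
It is hypothetical: the sample names, the condition labels, and every file
path are invented, and only the \Rcode{Spikein} column name is taken from
the example above:

<<sketchSpikeinSheet,eval=FALSE,echo=TRUE>>=
## Hypothetical samplesheet; every sample name and file path is invented
ids <- c("ctrl_1", "ctrl_2", "fulv_1", "fulv_2")
samples <- data.frame(
  SampleID   = ids,
  Condition  = rep(c("Control", "Fulvestrant"), each=2),
  Replicate  = rep(1:2, times=2),
  bamReads   = file.path("reads",   paste0(ids, ".bam")),
  Spikein    = file.path("spikein", paste0(ids, "_dm.bam")),
  Peaks      = file.path("peaks",   paste0(ids, ".bed")),
  PeakCaller = "bed",
  stringsAsFactors = FALSE)
samples[, c("SampleID", "bamReads", "Spikein")]
## With real files in place, the object would be built and counted as usual,
## and spikein=TRUE passed to dba.normalize() so the spike-in reads are counted
@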

The loss of binding affinity in the Fulvestrant condition can
be seen in an MA plot:

<<spikeMAnone,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE,width=12,height=9>>=
dba.plotMA(spikes, contrast=list(Fulvestrant=spikes$masks$Fulvestrant),
           bNormalized=FALSE, sub="RAW", bSmooth=FALSE, dotSize=1.5)
@
\incfig{DiffBind-spikeMAnone}{1\textwidth}
{MA plot of ER dataset with and without Fulvestrant treatment,
non-normalized}

Figure~\ref{DiffBind-spikeMAnone} shows the non-normalized data.

Next we use the non-spikein normalization methods already discussed:
full library sizes, as well as \Rcode{RLE} normalization
using Reads-in-Peaks and background reads.

<<spikeMAstraight,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE,width=10,height=18>>=
par(mfrow=c(3,1))
spikes <- dba.normalize(spikes, normalize=DBA_NORM_LIB, 
                        background=FALSE)
spikes <- dba.analyze(spikes)
dba.plotMA(spikes, sub="LIB full", bSmooth=FALSE, dotSize=1.5)

spikes <- dba.normalize(spikes, normalize="RLE",
                        background=FALSE)
spikes <- dba.analyze(spikes)
dba.plotMA(spikes, sub="RLE RiP", bSmooth=FALSE, dotSize=1.5)

spikes <- dba.normalize(spikes, normalize="RLE", 
                        background=TRUE)
spikes <- dba.analyze(spikes)
dba.plotMA(spikes, sub="RLE BG", bSmooth=FALSE, dotSize=1.5)
par(mfrow=c(1,1))
@
\incfig{DiffBind-spikeMAstraight}{1\textwidth}
{MA plots of ER dataset with and without Fulvestrant treatment,
normalized by library size and \Rcode{RLE} using
Reads-in-Peaks and background reads.}

In Figure~\ref{DiffBind-spikeMAstraight}, the middle plot shows how
a direct application of \Rcode{RLE} using the consensus peaks
over-normalizes, causing many sites, after normalization,
to have positive fold changes and be identified as significantly
gaining in binding affinity after Fulvestrant treatment.
The analysis based on a full-depth library size adjustment does better,
but it still somewhat shifts the read density upwards towards the
Fulvestrant condition, and identifies 
at least one site as gaining binding affinity.
Using a background normalization performs best.

Now we can compare to an alternative background derived from counting
the \emph{Drosophila} reads\footnote{Note that
pre-calculated background reads are included for the example in
an object named \Rcode{spikes.spikeins},
so we do not need to re-count them for the vignette;
we can pass the pre-calculated ones in instead. 
Normally, with access to the spike-in reads, 
setting \Rcode{spikein=TRUE} will result
in the spike-in reads being counted.}

<<spikeMAspikein,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE,width=10,height=14>>=
par(mfrow=c(2,1))
spikes <- dba.normalize(spikes, normalize=DBA_NORM_LIB, 
                        spikein=spikes.spikeins)
spikes <- dba.analyze(spikes)
dba.plotMA(spikes, sub="LIB spikein", bSmooth=FALSE, dotSize=1.5)

spikes <- dba.normalize(spikes, normalize=DBA_NORM_RLE, spikein = TRUE)
spikes <- dba.analyze(spikes)
dba.plotMA(spikes, sub="RLE spikein", bSmooth=FALSE, dotSize=1.5)
par(mfrow=c(1,1))
@
\incfig{DiffBind-spikeMAspikein}{1\textwidth}
{MA plots of ER dataset with and without Fulvestrant treatment,
using \emph{Drosophila}
spike-in reads as a background, normalized using the
number of \emph{Drosophila} reads and \Rcode{RLE}.}

Figure~\ref{DiffBind-spikeMAspikein} shows how using the
spike-in reads enables a normalization in which all the apparent
gains in binding affinity are
eliminated, and the bulk of the sites are identified as significantly
losing binding affinity in the Fulvestrant condition.

\subsection{Parallel factor normalization}

Another method related to spike-ins is supported in \Biocpkg{DiffBind},
whereby an antibody for a "parallel factor" is also
pulled down in the ChIP samples, and consensus peaks from this
second factor are used instead of the consensus peaks of the primary factor.
The idea behind this \emph{parallel factor normalization}
is that while we may not know the
true biological properties of the "foreground" factor being studied,
we can perform a ChIP on the same sample for an alternative
factor that is known not to change its binding patterns under
the conditions of our experiment.
In \cite{guertin2018parallel}, the transcription factor
\textbf{CTCF} is identified as an appropriate
parallel factor\footnote{Note that the version
of parallel factor normalization supported directly in \DBA
is not the same as that discussed in \cite{guertin2018parallel},
where a more sophisticated modeling approach is used.}.

Alternatively, if there are sites for the primary pull-down that
are known to not change binding affinity, these can be used
without performing a second pull down.
This can be seen in the \Rpackage{THOR} tool, where the "housekeeping"
normalization is based on focusing on histone marks
such as H3K4me3 that should be consistently bound in the 
promoter regions associated with housekeeping genes~\cite{allhoff2016thor}.

The utility of having a parallel factor
can be demonstrated using a sample dataset, again
looking at ER binding before and after treatment with Fulvestrant:

<<loadParfac,eval=TRUE,echo=TRUE,results=verbatim>>=
load(system.file('extra/parallelFactor.rda',package='DiffBind'))
parallelFactor
@

Again, the loss of binding affinity in the Fulvestrant condition can
be seen in the MA plot:

<<parfacMAnone,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE,width=12,height=9>>=
dba.plotMA(parallelFactor,
           contrast=list(Fulvestrant=parallelFactor$masks$Fulvestrant),
           bNormalized=FALSE, sub="RAW", bSmooth=FALSE, dotSize=1.5)
@
\incfig{DiffBind-parfacMAnone}{1\textwidth}
{MA plot of second ER dataset with and without Fulvestrant treatment,
non-normalized}

Figure~\ref{DiffBind-parfacMAnone} shows the non-normalized data.

By supplying a set of CTCF consensus peaks (loaded
with the example data and called \Rcode{parallelFactor.peaks}),
we can direct \DBA to count reads overlapping these peaks to form
the background distributions and proceed to normalization:

<<parfacMA,fig=TRUE,eval=TRUE,echo=TRUE,results=verbatim,include=FALSE,width=10,height=14>>=
par(mfrow=c(2,1))
parallelFactor <- dba.normalize(parallelFactor, normalize=DBA_NORM_LIB,
                                spikein = parallelFactor.peaks)
parallelFactor <- dba.analyze(parallelFactor)
dba.plotMA(parallelFactor, sub="LIB CTCF", bSmooth=FALSE, dotSize=1.5)

parallelFactor <- dba.normalize(parallelFactor, normalize=DBA_NORM_RLE,
                                spikein = TRUE)
parallelFactor <- dba.analyze(parallelFactor)
dba.plotMA(parallelFactor, sub="RLE CTCF", bSmooth=FALSE, dotSize=1.5)
par(mfrow=c(1,1))
@
\incfig{DiffBind-parfacMA}{1\textwidth}
{MA plots of ER dataset with and without Fulvestrant treatment,
using reads overlapping CTCF sites as a background, normalized using the
number of reads overlapping CTCF sites, and \Rcode{RLE} over the CTCF sites.}

Figure~\ref{DiffBind-parfacMA} shows how using the parallel factor
avoids over-normalizing the data, resulting in analyses that identify
the majority of ER binding sites
as significantly losing binding affinity when treated with Fulvestrant.

\subsection{Normalization summary}

There is a myriad of normalization options available in \Biocpkg{DiffBind}.
Table~\ref{DiffBind-normtable} summarizes the allowable combinations
of normalization methods (columns) and the sets of reference reads
they can be applied to (rows):

<<results=tex, echo=FALSE>>=
require(xtable)
collabs <- paste(DBA_NORM_LIB," & ",DBA_NORM_RLE," & ",
                 DBA_NORM_TMM," & ",DBA_OFFSETS_LOESS)

rowlabs <- c(DBA_LIBSIZE_FULL,DBA_LIBSIZE_PEAKREADS,
             DBA_LIBSIZE_BACKGROUND,
             DBA_NORM_SPIKEIN,"parallel factor")

normtab <- matrix("X",5,4)
normtab[1,2:4] <- ""
normtab[c(1,3:5),4] <- ""
normtab <- data.frame(normtab)
rownames(normtab) <- rowlabs
addtorow <- list()
addtorow$pos <- list(0,0)
addtorow$command <- c("& \\multicolumn{4}{c|}{Normalization} \\\\\n",
                      paste("Reference & ",collabs,"\\\\\n"))

captionStr <- paste("Table of allowable normalization schemes.",
                    "Columns are normalization methods set by \\Rcode{normalize} or \\Rcode{offsets}.",
                    "Rows are reference reads, set by \\Rcode{library}, \\Rcode{background}, or \\Rcode{spikein}.")

print(xtable::xtable(normtab,align=c(rep("|c",5),"|"),
                     caption=captionStr, label="DiffBind-normtable"), 
      add.to.row = addtorow,
      include.colnames=FALSE, scalebox=1)
@

As we have seen, different normalization parameters
can alter the experimental data, potentially altering the
biological conclusions an analysis might suggest.
How then should the normalization parameters be set?

There is no single answer to this, and establishing the correct normalization
can be one of the most challenging aspects of a differential binding analysis.
Unless we have prior knowledge about the expected signal, or
technical biases are particularly obvious, it may be wise
to avoid over-normalizing the data.

In the absence of spike-ins or a parallel factor, the "safest"
method is probably to set \Rcode{background=TRUE} and
\Rcode{normalize=DBA\_NORM\_NATIVE},
resulting in the use of background reads and the native normalization method
(\Rcode{TMM} for \Biocpkg{edgeR}, and \Rcode{RLE} for \Biocpkg{DESeq2}).
This can be approximated at very low computational cost, with no extra 
reading of BAM files, by the default settings of 
\Rcode{library=DBA\_LIBSIZE\_FULL},
\Rcode{normalize=DBA\_NORM\_LIB}, and \Rcode{background=FALSE}.
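
As a minimal sketch, the two options just described could be written as follows.
It assumes a \Rcode{DBA} object that already contains read counts
(here the \Rcode{tamoxifen\_counts} example data), and uses
\Rcode{DBA\_NORM\_LIB}, the library-size constant used in the earlier chunks.
Counting background reads needs access to the BAM files,
so the chunk is not evaluated:

<<eval=FALSE,echo=TRUE,results=verbatim>>=
library(DiffBind)
data(tamoxifen_counts)  # loads a DBA object named 'tamoxifen' with counts

# Option 1: background reads with the native method of the analysis package
# (needs the BAM files in order to count reads in background bins)
tamoxifen <- dba.normalize(tamoxifen, normalize=DBA_NORM_NATIVE,
                           background=TRUE)

# Option 2: low-cost approximation using full library sizes,
# with no background counting
tamoxifen <- dba.normalize(tamoxifen, library=DBA_LIBSIZE_FULL,
                           normalize=DBA_NORM_LIB, background=FALSE)
@

Only the last call to \Rcode{dba.normalize} before \Rcode{dba.analyze}
determines the normalization used, so in practice one of the two would be chosen.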

If a well-characterized parallel factor is available, or if one is
available for "free" in the case of certain histone mark ChIPs,
this is probably preferable to spike-ins, given the
difficulties in initial spike-in quantification.

Only in certain cases where indicated by prior knowledge is the use of the
main count matrix, based on consensus peaks, appropriate.

\section{Example: Occupancy analysis and overlaps}
\label{sec:occupancy}

In this section, we look at the tamoxifen resistance ER-binding dataset in 
some more detail,
showing what a pure occupancy-based analysis would look like,
and comparing it to the results obtained using the affinity data.
For this we will start by re-loading the peaksets:

<<eval=TRUE,echo=TRUE,results=hide>>=
data(tamoxifen_peaks)
@

\subsection{Overlap rates}

One reason to do an occupancy-based analysis is to determine
what candidate sites should be used in a subsequent affinity-based analysis.
In the example so far, we took all sites that were identified in peaks
in at least two of the eleven peaksets, reducing the number of sites from
\Sexpr{nrow(tamoxifen$merged)}
overall to the
\Sexpr{nrow(tamoxifen$binding)}
sites used in the differential analysis.
We could have used a more stringent criterion,
such as only taking sites identified in five or six of the peaksets, 
or a less stringent one, such as including all
\Sexpr{nrow(tamoxifen$merged)}
sites. In deciding which criterion to use, many factors come into play,
but it helps to get an idea of the rates at which the peaksets overlap
(for more details on how overlaps are determined, 
see Section~\ref{subsec:technical_merging} on peak merging).
A global overview can be obtained using the \Rcode{RATE} mode of the 
\Rcode{dba.overlap} function as follows:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
olap.rate <- dba.overlap(tamoxifen,mode=DBA_OLAP_RATE)
olap.rate
@
The returned data in \Rcode{olap.rate} is a vector containing the number of 
peaks that appear in at least one, two, three, and so on up to all eleven peaksets.

These values can be plotted to show the overlap rate drop-off curve:

<<tamox_rate, fig=TRUE, include=FALSE, width=9, height=6 >>=
plot(olap.rate,type='b',ylab='# peaks',
     xlab='Overlap at least this many peaksets')
@
\incfig{DiffBind-tamox_rate}{.66\textwidth}{Overlap rate plot.}
{Shows how the number of overlapping peaks decreases as the overlap criterion
becomes more stringent.
X axis shows the number of peaksets in which the site is identified, 
while the Y axis shows the number of overlapping sites.
Generated by plotting the result of: 
\Rcode{dba.overlap(tamoxifen,mode=DBA\_OLAP\_RATE)}}

The rate plot is shown in Figure~\ref{DiffBind-tamox_rate}. 
These curves typically exhibit a roughly geometric drop-off, 
with the number of overlapping sites halving each time the overlap criterion 
becomes stricter by one peakset.
When the drop-off is extremely steep, this is an indication that 
the peaksets do not agree very well. 
For example, if there are replicates you expect to agree, 
there may be a problem with the experiment. 
In the current example, peak agreement is high and the curve 
exhibits a better than geometric drop-off.
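
One way to put a number on the drop-off is to look at the ratio between
successive entries of the overlap-rate vector.
The sketch below uses a hypothetical vector \Rcode{rate} with invented values,
standing in for the result of \Rcode{dba.overlap} in \Rcode{RATE} mode;
it is an illustration rather than part of the package:

<<eval=FALSE,echo=TRUE,results=verbatim>>=
# Hypothetical overlap-rate vector for five peaksets (illustrative values)
rate <- c(3200, 1650, 900, 610, 480)

# Ratio of each count to the previous one: values near 0.5 correspond to a
# geometric (halving) drop-off, values closer to 1 to better agreement
ratios <- rate[-1] / rate[-length(rate)]
round(ratios, 2)
@

With real data, the same expression can be applied to \Rcode{olap.rate} directly.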

\subsection{Deriving consensus peaksets}
When performing an overlap analysis, it is often the case that
the overlap criteria are set stringently in order to lower noise and
drive down false positives.\footnote{It is less clear that
limiting the potential binding sites in this way is appropriate
when focusing on affinity data, as the 
differential binding analysis method will identify only sites that are
significantly differentially bound, even if operating on peaksets
that include incorrectly identified sites.}
The presence of a peak in multiple peaksets is an indication that 
it is a "real" binding site, in the sense of being 
identifiable in a repeatable manner. 
The use of biological replicates (performing the ChIP multiple times), 
as in the tamoxifen dataset, can be used to guide derivation 
of a consensus peakset. 
Alternatively, an inexpensive but less powerful way to help accomplish 
this is to use multiple peak callers for each ChIP dataset and 
look for agreement between peak callers (\cite{li2011measuring}).

Consider for example the standard (tamoxifen responsive) MCF7 cell line, 
represented by three replicates in this dataset. How well do the replicates 
agree on their peak calls? 
The overlap rate for the Responsive MCF7 samples can be isolated 
using a \emph{sample mask}.
A set of sample masks is automatically associated with a 
\Rcode{DBA} object in the \Rcode{\$masks} field:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
names(tamoxifen$masks)
@

Arbitrary masks can be generated using the \Rcode{dba.mask} function,
or simply by specifying a vector of peakset numbers.
In this case, a mask that isolates the MCF7 samples can be generated 
by combining two pre-defined masks (MCF7 and Responsive) and 
passed into the \Rcode{dba.overlap} function:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
dba.overlap(tamoxifen,tamoxifen$masks$MCF7 & tamoxifen$masks$Responsive,
            mode=DBA_OLAP_RATE)
@

There are
\Sexpr{dba.overlap(tamoxifen,tamoxifen$masks$MCF7 & tamoxifen$masks$Responsive,mode=DBA_OLAP_RATE)[3]}
peaks (out of
\Sexpr{dba.overlap(tamoxifen,tamoxifen$masks$MCF7 & tamoxifen$masks$Responsive,mode=DBA_OLAP_RATE)[1]})
identified in all three replicates.
A finer-grained view of the overlaps can be obtained with the
\Rcode{dba.plotVenn} function:


<<tamox_mcf7_venn, fig=TRUE, include=FALSE, width=9, height=9 >>=
dba.plotVenn(tamoxifen, tamoxifen$masks$MCF7 & tamoxifen$masks$Responsive)
@
\incfig{DiffBind-tamox_mcf7_venn}{.66\textwidth}{Venn diagram showing
how the ER peak calls for three replicates of responsive MCF7 cell line overlap.}
{Generated by plotting the result of: 
\Rcode{dba.plotVenn(tamoxifen,tamoxifen\$masks\$MCF7 \& tamoxifen\$masks\$Responsive)}}

The resultant plot is shown as Figure~\ref{DiffBind-tamox_mcf7_venn}.
This plot shows the
\Sexpr{dba.overlap(tamoxifen,tamoxifen$masks$MCF7 & tamoxifen$masks$Responsive,mode=DBA_OLAP_RATE)[3]}
consensus peaks identified as common to all replicates, but further breaks down 
how the replicates relate to each other.
The same can be done for each of the replicated cell line experiments, 
and rather than applying a global cutoff (e.g., 3 of 11), each cell line could be
dealt with individually in deriving a final peakset.
A separate consensus peakset for each of the replicated sample types can be 
added to the DBA object using \Rcode{dba.peakset}:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen_consensus <- dba.peakset(tamoxifen, 
                                   consensus=c(DBA_TISSUE,DBA_CONDITION),
                                   minOverlap=0.66)
@

This adds a new consensus peakset for each set of samples that share the same 
Tissue and Condition values. 
The same effect could be obtained by calling
\Rcode{tamoxifen\_consensus <- dba.peakset(tamoxifen, consensus=-DBA\_REPLICATE)} 
on the original set of peaks; this tells \DiffBind to generate a consensus 
peakset for every set of samples that have identical metadata values 
\emph{except} the Replicate number.

From this, a new \Rclass{DBA} object can be generated consisting of 
only the five consensus peaksets (the \Rcode{\$Consensus} mask filters 
peaksets previously formed using \Rcode{dba.peakset}):

<<eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen_consensus <- dba(tamoxifen_consensus,
                           mask=tamoxifen_consensus$masks$Consensus,
                           minOverlap=1)
tamoxifen_consensus
@

An overall consensus peakset, which includes peaks identified in at 
least two replicates of at least one sample group, can then be retrieved:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
consensus_peaks <- dba.peakset(tamoxifen_consensus, bRetrieve=TRUE)
@

This consensus peakset could then be used as the basis for the 
binding matrix used in \Rcode{dba.count}:

\Rcode{tamoxifen <- dba.count(tamoxifen, peaks=consensus\_peaks)}

Finally, consider an analysis where we wished to treat all five MCF7 samples 
together to look for binding sites specific to that cell line irrespective of 
tamoxifen resistant/responsive status.
We can create consensus peaksets for each cell type,
and look at how the resultant peaks overlap 
(shown in Figure~\ref{DiffBind-tamox_lines_venn}):

<<tamox_lines_venn, fig=TRUE, include=FALSE, width=9, height=9 >>=
data(tamoxifen_peaks)
tamoxifen <- dba.peakset(tamoxifen, consensus=DBA_TISSUE, minOverlap=0.66)
cons.ol <- dba.plotVenn(tamoxifen, tamoxifen$masks$Consensus)
@
\incfig{DiffBind-tamox_lines_venn}{.66\textwidth}{Venn diagram showing 
how the consensus peaks for each cell type overlap.}
{Generated by plotting the result of: 
\Rcode{dba.plotVenn(tamoxifen,tamoxifen\$masks\$Consensus)}}

Figure~\ref{DiffBind-tamox_lines_venn} shows how consensus peaksets
derived for each cultured cell type overlap.
The ZR75 samples stand out for having \Sexpr{length(cons.ol$onlyD)} peaks
common to both replicates that are not identified in any other cell type.

\subsection{A complete occupancy analysis: identifying sites unique to a sample group}

Occupancy-based analysis, in addition to offering many ways of deriving
consensus peaksets, can also be used to identify sites unique to a 
group of samples. 
This is analogous to, but not the same as, finding differentially bound sites.
In the following subsections, the two approaches are directly compared.

Returning to the original tamoxifen dataset:

<<eval=TRUE,echo=TRUE,results=hide>>=
data(tamoxifen_peaks)
@

We can derive consensus peaksets for the Resistant and Responsive groups. 
First we examine the overlap rates:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
dba.overlap(tamoxifen,tamoxifen$masks$Resistant,mode=DBA_OLAP_RATE)
dba.overlap(tamoxifen,tamoxifen$masks$Responsive,mode=DBA_OLAP_RATE)
@

Requiring that consensus peaks overlap in at least one third of the samples 
in each group results in
\Sexpr{dba.overlap(tamoxifen,tamoxifen$masks$Resistant,mode=DBA_OLAP_RATE)[2]}
sites for the Resistant group and
\Sexpr{dba.overlap(tamoxifen,tamoxifen$masks$Responsive,mode=DBA_OLAP_RATE)[3]}
sites for the Responsive group:

<<tamox_cons_venn, fig=TRUE, include=FALSE, width=9, height=9 >>=
tamoxifen <- dba.peakset(tamoxifen, consensus=DBA_CONDITION, minOverlap=0.33)
dba.plotVenn(tamoxifen,tamoxifen$masks$Consensus)
@
\incfig{DiffBind-tamox_cons_venn}{.66\textwidth}{Venn diagram showing how the
ER peak calls for two response groups overlap.}
{Generated by plotting the result of: 
\Rcode{dba.plotVenn(tamoxifen, tamoxifen\$masks\$Consensus)}}

Figure~\ref{DiffBind-tamox_cons_venn} shows that
\Sexpr{nrow(dba.overlap(tamoxifen,tamoxifen$masks$Consensus,DataType=DBA_DATA_FRAME)$onlyA)}
sites are unique to the Resistant group, and
\Sexpr{nrow(dba.overlap(tamoxifen,tamoxifen$masks$Consensus,DataType=DBA_DATA_FRAME)$onlyB)}
sites are unique to the Responsive group, with
\Sexpr{nrow(dba.overlap(tamoxifen,tamoxifen$masks$Consensus,DataType=DBA_DATA_FRAME)$inAll)}
sites being identified in both groups (meaning in at least half the
Resistant samples and at least three of the seven Responsive samples).
If our primary interest is in finding binding sites that are different
between the two groups, it may seem reasonable to consider the
\Sexpr{nrow(dba.overlap(tamoxifen,tamoxifen$masks$Consensus,DataType=DBA_DATA_FRAME)$inAll)}
common sites to be uninteresting, and focus on the
\Sexpr{nrow(dba.overlap(tamoxifen,tamoxifen$masks$Consensus,DataType=DBA_DATA_FRAME)$onlyA) + nrow(dba.overlap(tamoxifen,tamoxifen$masks$Consensus,DataType=DBA_DATA_FRAME)$onlyB)}
sites that are unique to a specific group. These unique sites can be obtained
using \Rcode{dba.overlap}:

<<eval=TRUE,echo=TRUE,results=hide>>=
tamoxifen.OL <- dba.overlap(tamoxifen, tamoxifen$masks$Consensus)
@

The sites unique to the Resistant group are accessible in 
\Rcode{tamoxifen.OL\$onlyA}, with the Responsive-unique sites in 
\Rcode{tamoxifen.OL\$onlyB}:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen.OL$onlyA
tamoxifen.OL$onlyB
@

The scores associated with each site are derived from the peak caller 
confidence score, and are a measure of confidence in the peak call (occupancy), 
not a measure of how strong or distinct the peak is.

\subsection{Comparison of occupancy and affinity based analyses}

So how does this occupancy-based analysis compare to the 
previous affinity-based analysis?

First, different criteria were used to select the overall consensus peakset.
We can compare them to see how well they agree:

<<tamox_compare_venn, fig=TRUE, include=FALSE, width=9, height=9 >>=
tamoxifen <- dba.peakset(tamoxifen,tamoxifen$masks$Consensus,
                         minOverlap=1,sampID="OL Consensus")
tamoxifen <- dba.peakset(tamoxifen,!tamoxifen$masks$Consensus,
                         minOverlap=3,sampID="Consensus_3")
dba.plotVenn(tamoxifen,14:15)
@
\incfig{DiffBind-tamox_compare_venn}{.66\textwidth}{Venn diagram showing how 
the ER peak calls overlap for two different ways of deriving consensus peaksets.}
{Generated by plotting the result of: \Rcode{dba.plotVenn(tamoxifen,14:15)}}

Figure~\ref{DiffBind-tamox_compare_venn} shows that the two sets agree on 
about 85\% of their sites, so the results should be directly comparable between 
the differing parameters used to establish
the consensus peaksets.\footnote{Alternatively,
we could re-run the analysis using the newly derived consensus peakset by passing
it into the counting function: 
\Rcode{tamoxifen <- dba.count(tamoxifen, peaks=tamoxifen\$masks\$Consensus)}}

Next, re-load the affinity analysis:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
data(tamoxifen_analysis)
@

To compare the sites unique to each sample group identified from the
occupancy analysis with those sites identified as differentially bound based on 
affinity (read count) data, we use a feature of
\Rcode{dba.report} that facilitates evaluating the occupancy status of sites. 
Here we obtain a report of all the sites (\Rcode{th=1}) 
with occupancy statistics (\Rcode{bCalled=TRUE}):

<<eval=TRUE,echo=TRUE,results=hide>>=
tamoxifen.rep <- dba.report(tamoxifen,bCalled=TRUE,th=1)
@

The \Rcode{bCalled} option adds two columns to the report 
(\Rcode{Called1} and \Rcode{Called2}), one for each group, giving the number 
of samples within the group in which the site was identified as a peak in 
the original peaksets generated by the peak caller.
We can use these to recreate the overlap criteria used in the occupancy analysis:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
onlyResistant <- tamoxifen.rep$Called1>=2 & tamoxifen.rep$Called2<3
sum(onlyResistant )
onlyResponsive <- tamoxifen.rep$Called2>=3 &  tamoxifen.rep$Called1<2
sum(onlyResponsive)
bothGroups <- tamoxifen.rep$Called1>= 2 & tamoxifen.rep$Called2>=3
sum(bothGroups)
@

Comparing these numbers verifies the similarity with those seen in 
Figure~\ref{DiffBind-tamox_cons_venn}, showing again how the basic analysis
is not oversensitive to differences in how the consensus peaksets are formed.
This overlap analysis suggests that
\Sexpr{sum(onlyResistant)+sum(onlyResponsive)}
of the sites are uniquely bound in either the Responsive or Resistant groups,
while \Sexpr{sum(bothGroups)} sites are common to both.

Completing a full differential analysis and focusing on only those sites 
identified as significantly differentially bound (FDR $\leq$ 0.05),
however, shows a different story than that obtainable using only occupancy data:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
tamoxifen.DB <- dba.report(tamoxifen,bCalled=TRUE)
onlyResistant.DB  <- (tamoxifen.DB$Called1 >= 2) & (tamoxifen.DB$Called2 < 3)
sum(onlyResistant.DB)
onlyResponsive.DB <- (tamoxifen.DB$Called2 >= 3) & (tamoxifen.DB$Called1 < 2)
sum(onlyResponsive.DB)
bothGroups.DB     <- (tamoxifen.DB$Called1 >= 2) & (tamoxifen.DB$Called2 >= 3)
sum(bothGroups.DB)
neitherGroup.DB   <- (tamoxifen.DB$Called1  < 2) & (tamoxifen.DB$Called2 < 3)
sum(neitherGroup.DB)
@

There are a number of notable differences in the results.

First, overall there are fewer sites identified as differentially bound
(\Sexpr{length(tamoxifen.DB)}) 
than there are sites identified as being unique to one condition
(\Sexpr{sum(onlyResistant)}+\Sexpr{sum(onlyResponsive)} =
\Sexpr{sum(onlyResistant)+sum(onlyResponsive)}).
Only about
\Sexpr{round(sum(sum(onlyResistant.DB), sum(onlyResponsive.DB))/sum(sum(onlyResistant),sum(onlyResponsive))*100)}\%
of sites unique to one condition are identifiable as significantly differentially bound
(\Sexpr{sum(onlyResistant.DB)}+\Sexpr{sum(onlyResponsive.DB)} =
\Sexpr{sum(onlyResistant.DB) + sum(onlyResponsive.DB)}
out of \Sexpr{sum(onlyResistant)+sum(onlyResponsive)}).
Focusing only on sites unique to one condition
would result in many false positives;
most of the sites identified in the occupancy analysis as unique to a
sample group are not found to be significantly differentially bound
using the affinity data. 

Second, differentially bound sites are as likely to be called in the consensus
of both response groups as they are to be unique to one group;
of the \Sexpr{length(tamoxifen.DB)} sites identified as significantly
differentially bound,
\Sexpr{round(sum(bothGroups.DB)/length(tamoxifen.DB)*100)}\%
(\Sexpr{sum(bothGroups.DB)}) are called as peaks in \emph{both} response groups.

Third, the largest single group of differentially bound sites 
(\Sexpr{sum(neitherGroup.DB)})
were not identified as being consistently associated with \emph{either} 
sample group (peaks called in no more than one Resistant sample and
no more than two Responsive samples), yet were still shown to have
significantly different read densities.
Any reasonable criteria for isolating peaks in an occupancy analysis would miss
these completely, resulting in many false negatives.

A final advantage of a quantitative analysis is that
the differentially bound peaks identified using the affinity analysis
are associated with significance statistics (p-value and FDR)
that can be used to rank them for further examination,
while the occupancy analysis yields a relatively unordered list of peaks,
as the peak caller statistics refer only to the significance of occupancy,
and not of differential binding.

\section{Backward compatibility with pre-version 3.0 analyses}
\label{sec:backward}

It is recommended that existing analyses be re-run with the current software.
Existing scripts should execute, with the exception of two
parameters that have been moved from \Rcode{dba.analyze}
to \Rcode{dba.count} and the new interface function \Rcode{dba.normalize}.

Most existing \DBA scripts and saved objects will
run correctly using version 3.0, but there may be differences
in the results. 

This section describes how to approximate earlier
results for existing scripts and objects.

\subsection{Running with saved DBA objects}

If a \Rcode{DBA} object was created with an earlier version of 
\Biocpkg{DiffBind}, saved using the \Rcode{dba.save}
function, and loaded using the \Rcode{dba.load} function,
all settings should be preserved, such that running the analysis anew
will yield similar results.

In order to re-run the analysis using the post-version 3.0 settings,
the original script should be used to re-create the \Rcode{DBA} object.

\subsection{Re-running \DBA scripts}

By default, if you re-run a \DBA script, it will use the
new defaults from version 3.0 and beyond.
In order to re-analyze an experiment in the pre-version 3.0 mode,
a number of defaults need to be changed.

When calling \Rcode{dba.count}, the following defaults have changed:
\begin{enumerate}
\item
\Rcode{summits:} 
This parameter is now set by default.
Setting \Rcode{summits=FALSE} will prevent re-centering each 
peak interval around its point of highest pileup.
\item
\Rcode{filter:} 
The new default for this parameter is \Rcode{1}
and is based on \Rcode{RPKM} values; previously
it was set to \Rcode{filter=0} and was based on read counts.
\item
\Rcode{minCount:}
This is a new parameter representing a minimum 
read count value. It now defaults to \Rcode{0}; 
to get the previous behavior, set \Rcode{minCount=1}.
\end{enumerate}
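
Put together, a call that approximates the earlier counting behavior might look
like the following sketch.
It assumes a \Rcode{DBA} object named \Rcode{tamoxifen} whose BAM files are
accessible (counting cannot run without them, so the chunk is not evaluated),
and it only resets the three defaults listed above:

<<eval=FALSE,echo=TRUE,results=verbatim>>=
# Sketch: restore the pre-version 3.0 defaults described above
tamoxifen <- dba.count(tamoxifen,
                       summits=FALSE,  # do not re-center peaks on summits
                       filter=0,       # previous filter default
                       minCount=1)     # previous minimum read count
@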

The easiest way to perform subsequent processing in a pre-version 3.0 
manner is to set a configuration option:

<<eval=FALSE,echo=TRUE,results=verbatim>>=
DBA$config$design <- FALSE
@

This will result in the appropriate defaults being set for
the new interface function, \Rcode{dba.normalize} (which
does not need to be invoked explicitly).
The pre-version 3.0 settings for \Rcode{dba.normalize} parameters
are as follows:
\begin{enumerate}
\item
\Rcode{normalize:} 
\Rcode{DBA\_NORM\_DEFAULT}
\item
\Rcode{library:}
\Rcode{DBA\_LIBSIZE\_FULL}
\item
\Rcode{background:} 
\Rcode{FALSE}
\end{enumerate}
  
Note that two parameters that used to be available when calling 
\Rcode{dba.analyze} have been moved:

\begin{enumerate}
\item
\Rcode{bSubControl:} 
Now integrated into \Rcode{dba.count}.
\Rcode{TRUE} by default (unless a greylist has
been added using \Rcode{dba.blacklist}).
\item
\Rcode{bFullLibrarySize:} 
This is now part of the 
\Rcode{library} parameter for \Rcode{dba.normalize}. 
\Rcode{library=DBA\_LIBSIZE\_FULL} is equivalent to
\Rcode{bFullLibrarySize=TRUE}, and \Rcode{library=DBA\_LIBSIZE\_PEAKREADS} 
is equivalent to \Rcode{bFullLibrarySize=FALSE}.
\end{enumerate}
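
For scripts that prefer to state the settings explicitly rather than rely on the
configuration option, the equivalences above could be written as in the
following sketch. The object name \Rcode{tamoxifen} is an assumption
(any \Rcode{DBA} object with read counts would do), and the chunk is not evaluated:

<<eval=FALSE,echo=TRUE,results=verbatim>>=
# Explicit pre-version 3.0 normalization settings
# (counterpart of the former bFullLibrarySize=TRUE)
tamoxifen <- dba.normalize(tamoxifen, normalize=DBA_NORM_DEFAULT,
                           library=DBA_LIBSIZE_FULL, background=FALSE)

# Alternatively, the counterpart of the former bFullLibrarySize=FALSE
tamoxifen <- dba.normalize(tamoxifen, normalize=DBA_NORM_DEFAULT,
                           library=DBA_LIBSIZE_PEAKREADS, background=FALSE)

tamoxifen <- dba.analyze(tamoxifen)
@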

\section{Technical notes}
\label{sec:technical}

This section explains some of the
technical details of \DBA processing.

\subsection{Loading peaksets}
\label{subsec:technical_peaksets}

There are a number of ways to get peaksets loaded into a DBA object.
Peaksets can be read in from files or loaded from interval sets already stored in an R object.
Samples can be specified either in a sample sheet
(using \Rfunction{dba}) or loaded one at a time (using \Rfunction{dba.peakset}).

When loading in peaksets from files,
specifying what peak caller generated the file
enables peaks from supported peak callers to be read in.
See the help page for \Rfunction{dba.peakset} for a list of supported peak callers.
Any string can be used to indicate the peak caller;
if it is not one of the supported callers, a default "raw" format is assumed,
consisting of a text file with three or four columns
(indicating the chromosome, start position, and end position,
with a score for each interval found in the fourth column, if present).
You can further control how peaks are read using the
\Rcode{PeakFormat}, \Rcode{ScoreCol}, and \Rcode{LowerBetter} fields
if you want to override the defaults for the specified peak caller identifier.
For example, with the tamoxifen dataset used in this tutorial,
the peaks were called using the MACS peak caller,
but the data are supplied as text files in BED format,
not the expected MACS "xls" format.
To maintain the peak caller in the metadata,
we could specify the \Rcode{PeakCaller} as "macs"
but the \Rcode{PeakFormat} as "bed".
If we wanted to use peak scores from a column other than the fifth,
the \Rcode{ScoreCol} field (or the \Rcode{scoreCol} parameter)
could be set to indicate the appropriate column number.
When handling scoring, \DBA by default assumes that a higher score indicates a "better" peak.
If this is not the case, for example if the score is a p-value or FDR,
we could set \Rcode{LowerBetter} (or the \Rcode{bLowerScoreBetter} parameter)
to \Rcode{TRUE}.
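
As an illustration, loading a single peakset one at a time with these settings
might look like the sketch below.
The file path and sample metadata are hypothetical, and the argument names
(\Rcode{peak.caller}, \Rcode{peak.format}, \Rcode{scoreCol},
\Rcode{bLowerScoreBetter}) should be checked against the help page
for \Rfunction{dba.peakset}:

<<eval=FALSE,echo=TRUE,results=verbatim>>=
library(DiffBind)

# Hypothetical file: MACS peaks exported in BED format, score in column 5
peaks <- dba.peakset(NULL, peaks="peaks/sample1_peaks.bed",
                     sampID="sample1", tissue="MCF7", factor="ER",
                     condition="Responsive", replicate=1,
                     peak.caller="macs",   # keep the caller in the metadata
                     peak.format="bed",    # but read the file as BED
                     scoreCol=5,           # column holding the peak score
                     bLowerScoreBetter=FALSE)
@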

When using a sample sheet,
values for fields missing in the sample sheet can be supplied when calling \Rfunction{dba}.
In addition to the minimal sample sheet used for the tutorial,
an equivalent sample sheet with all the metadata fields is included, called "tamoxifen\_allfields.csv".
See the help page for \Rfunction{dba} for an example using this sample sheet.

\subsection{Merging peaks}
\label{subsec:technical_merging}

When forming the global binding matrix or consensus peaksets,
\DBA first identifies all unique peaks amongst the relevant peaksets.
As part of this process, it merges overlapping peaks,
replacing them with a single peak representing the narrowest region
that covers all peaks that overlap by at least one base.
There are at least two consequences of this that are worth noting.

First, as more peaksets are included in the analysis,
the average peak width tends to become longer as more overlapping peaks are detected
and the start/end points are adjusted outward to account for them.
Second, peak counts may not appear to add up as you may expect due to merging.
For example, if one peakset contains two small peaks near to each other,
while a second peakset includes a single peak that overlaps both of these by at least one base,
these will all be replaced in the merged matrix with a single peak. 
As more peaksets are added, multiple peaks from multiple peaksets may be merged together
to form a single, wider peak.
Use of the \Rcode{summits} parameter is recommended to control for this widening effect.
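
The merging rule can be illustrated with a small, self-contained sketch in base R.
The function \Rcode{merge\_intervals} and the peak coordinates are hypothetical and
are not part of \DBA; the code only mimics the rule described above
(closed intervals on one chromosome that overlap by at least one base are
replaced by the narrowest interval covering them):

<<eval=FALSE,echo=TRUE,results=verbatim>>=
# Merge closed intervals that overlap by at least one base
merge_intervals <- function(start, end) {
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  m_start <- start[1]
  m_end <- end[1]
  for (i in seq_along(start)[-1]) {
    n <- length(m_start)
    if (start[i] <= m_end[n]) {
      # overlaps the current merged interval: extend its end if needed
      m_end[n] <- max(m_end[n], end[i])
    } else {
      # no overlap: start a new merged interval
      m_start <- c(m_start, start[i])
      m_end <- c(m_end, end[i])
    }
  }
  data.frame(start=m_start, end=m_end)
}

# Hypothetical peaks on one chromosome, from two peaksets
setA <- data.frame(start=c(100, 180, 400), end=c(150, 220, 450))
setB <- data.frame(start=140, end=190)
merged <- merge_intervals(c(setA$start, setB$start),
                          c(setA$end, setB$end))
merged
@

Here the two nearby peaks in \Rcode{setA} and the single peak in \Rcode{setB}
collapse into one interval from 100 to 220, so four input peaks yield two merged peaks.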

\subsection{Details of \Biocpkg{DESeq2} analysis}
\label{subsec:technical_deseq2}

When \Rfunction{dba.analyze} is invoked using the default method 
\Rcode{method=DBA\_DESEQ2}, 
a standardized differential analysis is performed using the 
\Biocpkg{DESeq2} package (\cite{Love2014}).
This section details the steps in that analysis.

First, a matrix of counts is constructed using all the samples in the experiment, 
with rows for each consensus interval, 
and columns for each sample in the experiment.
The raw read count is used for this matrix; 
if the \Rcode{bSubControl} parameter is set to \Rcode{TRUE}, 
the raw number of reads in the control sample (if available) will be subtracted.
Read counts that are less than \Rcode{minCount} are set to \Rcode{minCount} 
(default is zero).
% If a non-zero \Rcode{filter} is supplied, the \Rcode{filterFun} is run
% across the rows (consensus peaks), and any peak interval for which the result
% of the function is less than the specified \Rcode{filter} is removed.
% The default \Rcode{filterFun} is \Rfunction{max}, resulting in
% any peak interval without at least one sample having at least \Rcode{filter}
% counts being removed from the subsequent analysis.

A \Rcode{DESeqDataSet} object is then created using 
\Rcode{DESeq2::DESeqDataSetFromMatrix} with the count matrix, 
metadata from the \Rcode{DBA} object, and the design formula.

Next the normalization parameters are set for the analysis.
If the normalization involves \Rcode{offsets},
the \DESeq normalization factors are set to the offsets;
if \Rcode{DBA\_NORM\_OFFSETS\_ADJUST} is set, the offsets
are assumed to be in \edgeR format and are run through the
\Rcode{edgeR::scaleOffset} function, adjusted to be mean centered on 1,
then normalized to library size.
If offsets are not used, then \Rcode{DESeq2::sizeFactors} is
called with the scaling factors computed in \Rcode{dba.normalize}.
If these were based on the native \Rcode{RLE} method,
then the result of a previous call to \Rcode{DESeq2::estimateSizeFactors} 
is used.

\Rfunction{DESeq2::estimateDispersions} is then called with the 
\Rclass{DESeqDataSet} object.
By default the \Rcode{fitType} will be set to \Rcode{local};
this can be overridden by setting a configuration option 
\Rcode{DBA\$config\$DESeq2\$fitType} to the desired value.

Finally the model is fitted and tested using 
the \Rfunction{DESeq2::nbinomWaldTest}, with defaults.

The final results, as a \Rclass{DESeqDataSet} object, are accessible
by calling \Rfunction{dba.analyze()} with 
\Rcode{bRetrieveAnalysis=DBA\_DESEQ2} (or \Rcode{bRetrieveAnalysis=TRUE}).
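
The following sketch shows how these steps might be driven from \DBA, using the \Rcode{tamoxifen\_counts} example data shipped with the package. 
The design formula and the choice of \Rcode{"parametric"} as an alternative \Rcode{fitType} are assumptions made for illustration.

<<eval=FALSE,echo=TRUE,results=verbatim>>=
library(DiffBind)
data(tamoxifen_counts)  # loads a DBA object named 'tamoxifen'
tamoxifen <- dba.normalize(tamoxifen)
tamoxifen <- dba.contrast(tamoxifen, design = "~Tissue + Condition")
# Optional: override the default "local" dispersion fit (illustrative value)
tamoxifen$config$DESeq2$fitType <- "parametric"
tamoxifen <- dba.analyze(tamoxifen, method = DBA_DESEQ2)
# Retrieve the underlying DESeqDataSet for further inspection
dds <- dba.analyze(tamoxifen, bRetrieveAnalysis = DBA_DESEQ2)
class(dds)
@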

When a contrast is evaluated, the results are obtained 
by calling \Rcode{DESeq2::results}.
If the contrast was specified to \Rfunction{dba.contrast()}
using a single character string
containing the name of a specific column coefficient, 
the call is made with \Rcode{name} set to the character string.
Otherwise, the call is made with \Rcode{contrast} set appropriately
(either as a vector of three character strings representing a design
\Rcode{Factor} and two values for that Factor,
a list of one or two character strings representing coefficient names,
or a numeric vector of the same length as the number of 
coefficients in the design matrix).
Note that coefficient names can be retrieved by calling
\Rfunction{dba.contrast()} with \Rcode{bRetrieveCoefficients=TRUE};
these are obtained from a call to \Rcode{DESeq2::resultsNames}.
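
As a hedged illustration of two of these contrast forms, the sketch below uses the \Rcode{tamoxifen\_counts} example data with an assumed design of \Rcode{\textasciitilde Tissue + Condition}. 
The numeric vector assumes that this design yields five coefficients (an intercept, three for \Rcode{Tissue}, and one for \Rcode{Condition}); check the actual coefficient names for your own design before using this form.

<<eval=FALSE,echo=TRUE,results=verbatim>>=
library(DiffBind)
data(tamoxifen_counts)  # loads a DBA object named 'tamoxifen'

# Form 1: a design Factor followed by two values of that Factor
byFactor <- dba.contrast(tamoxifen, design = "~Tissue + Condition",
                         contrast = c("Condition", "Responsive", "Resistant"))

# Form 2: one numeric value per coefficient in the design matrix
# (assumed here: 5 coefficients, the last corresponding to Condition)
byVector <- dba.contrast(tamoxifen, design = "~Tissue + Condition",
                         contrast = c(0, 0, 0, 0, 1))
@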

The fold changes used in subsequent reports and plots are the shrunk values 
estimated from the \Rcode{results} using \Rfunction{DESeq2::lfcShrink}.

When retrieving or plotting results (e.g. calling \Rfunction{dba.report()},
\Rfunction{dba.plotMA()}, or \Rfunction{dba.plotVolcano()}), 
if a \Rcode{fold} cutoff is specified other than \Rcode{0.0}, 
the results are re-computed using 
\Rfunction{DESeq2::results} with the \Rcode{lfcThreshold} supplied.
Fold values are re-computed as well 
(using \Rfunction{DESeq2::lfcShrink}).
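
The sequence of \Biocpkg{DESeq2} calls described in this section can be sketched outside of \DBA as follows. 
This is a simplified stand-in rather than the package's internal code: the counts are simulated, the size factors are derived from column sums instead of \Rcode{dba.normalize}, and the shrinkage \Rcode{type} is an assumption, since the text above does not say which estimator is used.

<<eval=FALSE,echo=TRUE,results=verbatim>>=
library(DESeq2)
set.seed(1)
# Simulated count matrix: 1000 consensus intervals x 6 samples
counts <- matrix(rnbinom(6000, mu = 50, size = 5), ncol = 6,
                 dimnames = list(paste0("peak", 1:1000), paste0("s", 1:6)))
meta <- data.frame(Condition = factor(rep(c("Resistant", "Responsive"),
                                          each = 3)),
                   row.names = colnames(counts))
dds <- DESeqDataSetFromMatrix(counts, colData = meta, design = ~Condition)

# Set pre-computed scaling factors (stand-in for dba.normalize output)
libsize <- colSums(counts)
sizeFactors(dds) <- libsize / mean(libsize)

dds <- estimateDispersions(dds, fitType = "local")
dds <- nbinomWaldTest(dds)

# Evaluate a contrast, then shrink the fold changes
con <- c("Condition", "Responsive", "Resistant")
res <- results(dds, contrast = con)
shrunk <- lfcShrink(dds, contrast = con, res = res,
                    type = "normal")  # assumed shrinkage type

# Re-test against a non-zero fold threshold
resFold <- results(dds, contrast = con, lfcThreshold = 1)
@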

\subsection{Details of \Biocpkg{edgeR} analysis}
\label{subsec:technical_edger}

When \Rfunction{dba.analyze} is invoked using the 
\Rcode{method=DBA\_EDGER}\footnote{Note that 
\Biocpkg{edgeR} can be made the default analysis method for a DBA object
by setting \Rcode{DBA\$config\$AnalysisMethod} to \Rcode{DBA\_EDGER}.}, 
a standardized differential analysis is performed using the
\Rpackage{edgeR} package (\cite{Robinson:2010p249}). 
This section details the steps in that analysis.

First, a matrix of counts is constructed using all the samples in the experiment, 
with rows for each consensus interval, 
and columns for each sample in the experiment.
The raw read count is used for this matrix; 
if the \Rcode{bSubControl} parameter is set to \Rcode{TRUE}, 
the raw number of reads in the control sample (if available) will be subtracted.
Read counts that are less than \Rcode{minCount} are set to \Rcode{minCount} 
(default is zero).
% If a non-zero \Rcode{filter} is supplied, the \Rcode{filterFun} is run
% across the rows (consensus peaks), and any peak interval for which the result
% of the function is less than the specified \Rcode{filter} is removed.
% The default \Rcode{filterFun} is \Rfunction{max}, resulting in
% any peak interval without at least one sample having at least \Rcode{filter}
% counts being removed from the subsequent analysis.

Next an \edgeR \Rcode{DGEList} is created by calling
\Rcode{edgeR::DGEList} with the count matrix, the pre-computed library
sizes and normalization factors (from \Rcode{dba.normalize}),
and meta-data derived from the \Rcode{DBA} object.
If these were based on the \Rcode{TMM} or \Rcode{RLE} methods,
then the result of a previous call to \Rcode{edgeR::calcNormFactors} 
(with \Rcode{doWeighting=FALSE})
is used as the normalization factors.

If the normalization involves \Rcode{offsets}, the offsets
are retrieved.
If \Rcode{DBA\_NORM\_OFFSETS\_ADJUST} is set, the offsets
are run through the \Rcode{edgeR::scaleOffset} function before being set as
the offsets in the \Rcode{DGEList}.

Next \Rfunction{edgeR::estimateGLMTrendedDisp} is called with the 
\Rcode{DGEList} and a design matrix derived from the design formula.
By default, \Rfunction{edgeR::estimateGLMTagwiseDisp} is called next;
this can be bypassed by setting 
\Rcode{DBA\$config\$edgeR\$bTagwise=FALSE}.\footnote{In versions prior to 3.0,
\Rcode{bTagwise} was a parameter to \Rfunction{dba.analyze()}.}

The model is fitted by calling \Rfunction{edgeR::glmQLFit} with the design matrix.
The final results, as a \Rclass{DGEGLM} object, are accessible
by calling \Rfunction{dba.analyze()} with \Rcode{bRetrieveAnalysis=DBA\_EDGER}.
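
A minimal sketch of running the \Biocpkg{edgeR} analysis from \DBA and retrieving the fitted model is shown below, using the \Rcode{tamoxifen\_counts} example data. 
The design formula is an assumption for illustration, and disabling the tagwise step is shown only to demonstrate the configuration option.

<<eval=FALSE,echo=TRUE,results=verbatim>>=
library(DiffBind)
data(tamoxifen_counts)  # loads a DBA object named 'tamoxifen'
tamoxifen <- dba.normalize(tamoxifen)
tamoxifen <- dba.contrast(tamoxifen, design = "~Tissue + Condition")
# Optional: skip the call to estimateGLMTagwiseDisp
tamoxifen$config$edgeR$bTagwise <- FALSE
tamoxifen <- dba.analyze(tamoxifen, method = DBA_EDGER)
# Retrieve the fitted model object for further inspection
fit <- dba.analyze(tamoxifen, bRetrieveAnalysis = DBA_EDGER)
class(fit)
@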

Each specified contrast is evaluated in two steps.
First, the main test is performed by a call to \Rcode{edgeR::glmQLFTest} with the fitted model 
and the values for each coefficient in the design matrix.
If the contrast was specified to \Rfunction{dba.contrast()} 
using values for each of the coefficients in the design matrix, 
those values are used.
Otherwise, if the contrast was specified using
a) a single character string containing the name of a specific column coefficient;
b) a vector of three character strings representing a design \Rcode{Factor} 
and two values for that Factor;
or c) a list of one or two character strings representing coefficient names,
a numeric vector of the same length as the number of coefficients is derived.
Note that allowable coefficient names (which differ from those used internally
by \Rpackage{edgeR}) 
can be retrieved by calling
\Rfunction{dba.contrast()} with \Rcode{bRetrieveCoefficients=TRUE}.

Finally, \Rfunction{edgeR::topTags} is called to retrieve the results of the test.
The fold changes used in subsequent reports and plots are those estimated 
from the \Rcode{TopTags}.

When retrieving or plotting results (e.g. calling \Rfunction{dba.report()},
\Rfunction{dba.plotMA()}, or \Rfunction{dba.plotVolcano()}), 
if a \Rcode{fold} cutoff is specified other than \Rcode{0.0}, 
the results are re-computed using 
\Rcode{edgeR::glmTreat} instead of \Rcode{edgeR::glmQLFTest}.
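
The sequence of \Biocpkg{edgeR} calls described in this section can be sketched outside of \DBA as follows. 
This is a simplified stand-in rather than the package's internal code: the counts are simulated, the library sizes are column sums, the normalization factors are all set to 1, and the two-coefficient design is an assumption.

<<eval=FALSE,echo=TRUE,results=verbatim>>=
library(edgeR)
set.seed(1)
# Simulated count matrix: 1000 consensus intervals x 6 samples
counts <- matrix(rnbinom(6000, mu = 50, size = 5), ncol = 6,
                 dimnames = list(paste0("peak", 1:1000), paste0("s", 1:6)))
group <- factor(rep(c("Resistant", "Responsive"), each = 3))
design <- model.matrix(~group)

# Pre-computed library sizes and normalization factors
# (stand-ins for the values from dba.normalize)
y <- DGEList(counts = counts, lib.size = colSums(counts),
             norm.factors = rep(1, ncol(counts)), group = group)

y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)  # the optional tagwise step
fit <- glmQLFit(y, design)

# Main test: one numeric value per coefficient in the design matrix
qlf <- glmQLFTest(fit, contrast = c(0, 1))
tt <- topTags(qlf, n = Inf)

# With a non-zero fold cutoff, glmTreat replaces glmQLFTest
treat <- glmTreat(fit, contrast = c(0, 1), lfc = 1)
@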

\section{Technical notes for versions prior to \DBA 3.0 (without an explicit model design)}

Prior to version \Rcode{3.0}, \DBA did not offer a way to explicitly 
specify a model design. 
The technical details of how it performed analyses are described here.

Note that these methods are still supported. 
In order to conduct analyses in the former style, the user must call
\Rcode{dba.contrast()} with \Rcode{design=FALSE}.
Contrasts can be added either automatically or explicitly 
(specifying at least \Rcode{group1}),
and a blocking factor (\Rcode{block}) may also be specified.
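
As a minimal sketch of setting up contrasts in the former style, the call below uses the \Rcode{tamoxifen\_counts} example data. 
The choice of \Rcode{DBA\_CONDITION} for the contrast and \Rcode{DBA\_TISSUE} as the blocking factor is an assumption for illustration.

<<eval=FALSE,echo=TRUE,results=verbatim>>=
library(DiffBind)
data(tamoxifen_counts)  # loads a DBA object named 'tamoxifen'
# Add contrasts automatically, without an explicit model design
tamoxifen <- dba.contrast(tamoxifen, design = FALSE,
                          categories = DBA_CONDITION, # factor to contrast
                          block = DBA_TISSUE)         # blocking factor
tamoxifen <- dba.analyze(tamoxifen)
@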

\subsection{\Biocpkg{DESeq2} analysis}
When \Rfunction{dba.analyze} is invoked using the default method 
\Rcode{method=DBA\_DESEQ2}, a standardized differential analysis is performed 
using the \Biocpkg{DESeq2} package (\cite{Love2014}). 
This section details the precise steps in that analysis.

For each contrast, a separate analysis is performed. 
First, a matrix of counts is constructed for the contrast, with columns 
for all the samples in the first group, followed by columns for all the 
samples in the second group. 
The raw read count is used for this matrix; if the \Rcode{bSubControl} 
parameter is set to \Rcode{TRUE} (as it is by default), 
the raw number of reads in the control sample (if available) will be subtracted.
Next the library size is computed for each sample for use in subsequent normalization. 
By default, this is the total number of reads in the library 
(calculated from the source BAM/BED file). 
Alternatively, if the \Rcode{bFullLibrarySize} parameter is set to FALSE, 
the total number of reads in peaks (the sum of each column) is used. 
The first step concludes with a call to \Biocpkg{DESeq2}'s 
\Rfunction{DESeqDataSetFromMatrix} function, 
which returns a \Rclass{DESeqDataSet} object. 

If \Rcode{bFullLibrarySize} is set to TRUE (default), 
then \Rfunction{sizeFactors} is called with the number of reads in
the BAM/BED files for each ChIP sample, divided by the minimum of these; 
otherwise, \Rfunction{estimateSizeFactors} is invoked. 

\Rfunction{estimateDispersions} is then called with the 
\Rclass{DESeqDataSet} object and \Rcode{fitType} set to \Rcode{local}. 
Next the model is fitted and tested using \Rfunction{nbinomWaldTest}.
The final results (as a \Rclass{DESeqDataSet}) are accessible within
the \Rclass{DBA} object as 

\Rcode{DBA\$contrasts[[n]]\$DESeq2\$DEdata} 

and may be examined and manipulated directly for further customization.
Note however that if you wish to use this object directly with
\Biocpkg{DESeq2} functions, then the \Rcode{bReduceObjects} 
parameter should be set to FALSE, otherwise the default value of TRUE
will result in essential object fields being stripped.

If a blocking factor has been added to the contrast, an additional 
\Biocpkg{DESeq2} analysis is carried out by setting the \Rcode{design}
to include all the unique values for the blocking factor. 
This occurs before the dispersion values are calculated. 
The resultant \Rclass{DESeqDataSet} object is accessible as

\Rcode{DBA\$contrasts[[n]]\$DESeq2\$block\$DEdata}. 
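
A hedged sketch of retrieving these objects follows. 
It assumes a contrast set up in the former style with a blocking factor, as described above, and that the analysis is run with \Rcode{bReduceObjects=FALSE} so that the stored objects remain usable with \Biocpkg{DESeq2} functions.

<<eval=FALSE,echo=TRUE,results=verbatim>>=
library(DiffBind)
data(tamoxifen_counts)  # loads a DBA object named 'tamoxifen'
tamoxifen <- dba.contrast(tamoxifen, design = FALSE,
                          categories = DBA_CONDITION, block = DBA_TISSUE)
# Keep the complete DESeq2 objects rather than the reduced versions
tamoxifen <- dba.analyze(tamoxifen, method = DBA_DESEQ2,
                         bReduceObjects = FALSE)
# DESeqDataSet objects for the first contrast
dds <- tamoxifen$contrasts[[1]]$DESeq2$DEdata
ddsBlock <- tamoxifen$contrasts[[1]]$DESeq2$block$DEdata
@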

\subsection{\edgeR analysis}
When \Rfunction{dba.analyze} is invoked using the 
\Rcode{method=DBA\_EDGER}\footnote{Note that 
\Biocpkg{edgeR} can be made the default analysis method for a DBA object by 
setting \Rcode{DBA\$config\$AnalysisMethod=DBA\_EDGER}.}, 
a standardized differential analysis is performed using the 
\edgeR package (\cite{Robinson:2010p249}).
This section details the precise steps in that analysis.

For each contrast, a separate analysis is performed. 
First, a matrix of counts is constructed for the contrast, with columns for 
all the samples in the first group, followed by columns for all the
samples in the second group. 
The raw read count is used for this matrix; if the \Rcode{bSubControl} 
parameter is set to \Rcode{TRUE} (as it is by default), 
the raw number of reads in the control sample (if available) 
will be subtracted (with a minimum final read count of 1). 
Next the library size is computed for each sample for use in 
subsequent normalization. 
By default, this is the total number of reads in the library 
(calculated from the source BAM/BED file).
Alternatively, if the \Rcode{bFullLibrarySize} parameter is set to FALSE,
the total number of reads in peaks (the sum of each column) is used. 
Note that "effective" library size (\Rcode{bFullLibrarySize=FALSE})
may be more appropriate for situations when the overall signal (binding rate) 
is expected to be directly comparable between the samples.
Next comes a call to \edgeR's \Rfunction{DGEList} function.
The \Rclass{DGEList} object that results is next passed to 
\Rcode{calcNormFactors} with \Rcode{method="TMM"} and 
\Rcode{doWeighting=FALSE}, returning an updated \Rcode{DGEList} object. 
This is passed to \Rcode{estimateCommonDisp} with default parameters. 

If the method is \Rcode{DBA\_EDGER\_CLASSIC}, then if \Rcode{bTagwise} 
is TRUE (most useful when there are at least three members in 
each group of a contrast), the resulting 
\Rclass{DGEList} object is then passed to \Rcode{estimateTagwiseDisp},
with the prior set to 50 divided by two less than the total number of
samples in the contrast, and \Rcode{trend="none"}. 
The final steps are to perform testing to determine the significance measure 
of the differences between the sample groups by calling 
\Rfunction{exactTest} (\cite{RobinsonandSmyth}) using the \Rclass{DGEList}
with the \Rcode{dispersion} set based on the \Rcode{bTagwise} parameter.

If the method is \Rcode{DBA\_EDGER\_GLM} (the default), 
then a design matrix is generated with two coefficients
(the Intercept and one of the groups).
Next  \Rfunction{estimateGLMCommonDisp} is called; 
if \Rcode{bTagwise=TRUE},  
\Rfunction{estimateGLMTagwiseDisp} is called as well. 
The model is fitted by calling \Rfunction{glmFit}, 
and the specific contrast fitted by calling \Rfunction{glmLRT}, 
specifying that the second coefficient be dropped.
Finally, an \Rfunction{exactTest} (\cite{McCarthyetal}) is performed, 
using either common or tagwise dispersion depending on the value 
specified for \Rcode{bTagwise}.

This final  \Rclass{DGEList} for contrast n is stored in the
\Rclass{DBA} object as 

\Rcode{DBA\$contrasts[[n]]\$edgeR} 

and may be examined and manipulated directly for further customization.
Note however that if you wish to use this object directly with \edgeR functions,
then the \Rcode{bReduceObjects} parameter should be set to FALSE, 
otherwise the default value of TRUE will result in essential object fields being stripped.

If a blocking factor has been added to the contrast, an additional 
\edgeR analysis is carried out. 
This follows the \Rcode{DBA\_EDGER\_GLM} case detailed above, 
except a more complex design matrix is generated that includes all the
unique values for the blocking factor. 
These coefficients are all included in the \Rfunction{glmLRT} call. 
The resultant object is accessible as

\Rcode{DBA\$contrasts[[n]]\$edgeR\$block}. 

\section{Vignette Data}
\label{sec:vignette_data}

Due to space limitations, the aligned reads associated with the cell line data 
used in this vignette are not included as part of the \DiffBind package. 

Data for the vignette are available for download at 
\url{https://www.cruk.cam.ac.uk/core-facilities/bioinformatics-core/software/DiffBind}.

The following code can be used to download and set up the vignette data:

<<eval=FALSE,echo=TRUE,results=verbatim>>=
tmpdir <- tempdir()
url <- 'https://content.cruk.cam.ac.uk/bioinformatics/software/DiffBind/DiffBind_vignette_data.tar.gz'
file <- basename(url)
options(timeout=600)
download.file(url, file.path(tmpdir,file))
untar(file.path(tmpdir,file), exdir = tmpdir )
setwd(file.path(tmpdir,"DiffBind_Vignette"))
@


The full data for all chromosomes are also available in the Short Read Archive
(GEO accession number GSE32222).
Email for detailed instructions on how to retrieve them in the appropriate form.

\section{Using \Biocpkg{DiffBind} and \Biocpkg{ChIPQC} together}

\Biocpkg{DiffBind} and \Biocpkg{ChIPQC} are both packages that 
help manage and analyze ChIP-seq experiments, and are designed to be used together.

If you already have a project in \Biocpkg{DiffBind}, 
then \Biocpkg{ChIPQC} can accept a \Rclass{DBA} object 
in place of the sample sheet when creating a \Rclass{ChIPQCexperiment} object.

Once a \Rclass{ChIPQCexperiment} object has been constructed, 
it can be used in place of a \Rclass{DBA} object in most calls to \Biocpkg{DiffBind}.
All plotting, counting, and analysis functions are available from \Biocpkg{DiffBind}. 

It is also possible to extract a \Rclass{DBA} object from a 
\Rclass{ChIPQCexperiment} object using the \Rfunction{QCdba} method.
The resulting \Rclass{DBA} object can be used in \Biocpkg{DiffBind} 
without restriction, although neither it nor 
\Rclass{DBA} objects based on it can be re-attached to the original 
\Rclass{ChIPQCexperiment} object
(although they can be used in lieu of a sample sheet when creating a new one). 
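
The round trip between the two packages might look like the following sketch. 
It assumes that \Biocpkg{ChIPQC} is installed, that \Rcode{tamoxifen} is an existing \Rclass{DBA} object, and that the read files it refers to are available at the recorded paths (for example after downloading the vignette data).

<<eval=FALSE,echo=TRUE,results=verbatim>>=
library(DiffBind)
library(ChIPQC)
# Assumed: 'tamoxifen' is a DBA object whose read files are accessible
qc <- ChIPQC(tamoxifen)   # DBA object used in place of a sample sheet
# Extract a DBA object from the ChIPQCexperiment object
dbaFromQC <- QCdba(qc)
@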

In a typical workflow, the first step would be to run a 
\Biocpkg{ChIPQC} analysis before peak calling to assess library quality 
and establish what filtering should be done at the read level 
(mapping quality, duplicates, and blacklists).
Next peaks would be called externally,
and read into a new \Rclass{ChIPQCexperiment} object 
to assess peak-based metrics, such as FRiP, peak profiles, and clustering.

At this point, \Biocpkg{DiffBind} could be used to perform occupancy analysis, 
derive consensus peak sets, re-count reads to form a binding matrix, 
and set up contrasts to carry out full differential binding analyses 
using the \Biocpkg{edgeR} and \Biocpkg{DESeq2} packages, 
along with  plotting and reporting functions.

\section{Acknowledgements}

This package was developed at Cancer Research UK's Cambridge Research Institute 
with the help and support of many people there. 
We wish to acknowledge everyone in the Bioinformatics Core under the leadership of Matthew Eldridge, 
as well as the Nuclear Receptor Transcription Laboratory under the leadership of Jason Carroll. 
Researchers who contributed ideas and/or pushed us in the right direction include 
Caryn-Ross Innes, Vasiliki Theodorou, and Tamir Chandra among many others. 
We also thank  members of the Gordon Smyth laboratory at the WEHI, Melbourne, 
particularly Mark Robinson and Davis McCarthy, for helpful discussions.


\section{Session Info}
<<sessionInfo, results=tex, print=TRUE, eval=TRUE>>=
toLatex(sessionInfo())
@

\bibliography{DiffBind}

\end{document}