\documentclass[12pt]{article} \topmargin 0in \headheight 0in \headsep 0in \oddsidemargin 0in \evensidemargin 0in \textwidth 176mm \textheight 215mm \usepackage{natbib} \usepackage{Sweave} %\usepackage{url} \usepackage{subfig} %\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png} \newcommand{\DBA}{\textsf{DiffBind}~} \newcommand{\edgeR}{\texttt{edgeR}~} \newcommand{\DESeq}{\texttt{DESeq}~} \newcommand{\code}[1]{{\small\texttt{#1}}} \newcommand{\R}{\textsf{R~}} \newcommand{\reft}[1]{Table \ref{tab:#1}} \newcommand{\reff}[1]{Figure \ref{fig:#1}} \newcommand{\refsf}[1]{Figure \ref{fig:#1}} \begin{document} %\VignetteIndexEntry{DiffBind User's Guide} \title{\DBA: differential binding analysis \\ of ChIP-Seq peak data} \author{Rory Stark \\ \texttt{rory.stark@cancer.org.uk} \and Gordon Brown \\ \texttt{gordon.brown@cancer.org.uk} \and Cancer Research UK\\Cambridge Research Institute } % Please increment date when working on this document, so that % date shows genuine change date, not merely date of compile. \date{20 March 2012} \maketitle \section{Introduction} This document offers an introduction and overview of the \R~Bioconductor package \DBA, which provides functions for processing ChIP-Seq data enriched for genomic loci where specific protein/DNA binding occurs, including peak sets identified by ChIP-Seq peak callers and aligned sequence read datasets. It is designed to work with multiple peak sets simultaneously, representing different ChIP experiments (antibodies, transcription factor and/or histone marks, experimental conditions, replicates) as well as managing the results of multiple peak callers. The primary emphasis of the package is on identifying sites that are differentially bound between two sample groups. It includes functions to support the processing of peak sets, including overlapping and merging peak sets, counting sequencing reads overlapping intervals in peak sets, and identifying statistically significantly differentially bound sites based on evidence of binding affinity (measured by differences in read densities). To this end it uses statistical routines developed in an RNA-Seq context (primarily the Bioconductor packages \edgeR and \DESeq). Additionally, the package builds on \R graphics routines to provide a set of standardized plots to aid in binding analysis. This guide includes a brief overview of the processing flow, followed by three sections of examples: the first focusing on the core task of obtaining differentially bound sites based on affinity data, the second working through the main plotting routines, and the third revisiting occupancy data (peak calls) in more detail, as well as comparing the results of an occupancy-based analysis with an affinity-based one. Finally, some technical aspects of the how these analyses are accomplished are detailed. \section{Processing overview} \DBA works primarily with peaksets, which are sets of genomic intervals representing candidate protein binding sites. Each interval consists of a chromosome, a start and end position, and usually a score of some type indicating confidence in, or strength of, the peak. Associated with each peakset are metadata relating to the experiment from which the peakset was derived. Additionally, files containing mapped sequencing reads (BAM/SAM/BED) can be associated with each peakset (one for the ChIP data, and optionally another representing a control dataset). Generally, processing data with \DBA involves five phases: \begin{enumerate} \item \texttt{Reading in peaksets}: The first step is to read in a set of peaksets and associated metadata. Peaksets are derived either from ChIP-Seq peak callers, such as MACS (\cite{zhang2008model}), or using some other criterion (e.g. all the promoter regions in a genome). The easiest way to read in peaksets is using a comma-separated value (csv) sample sheet with one line for each peakset. A single experiment can have more than one associated peakset, e.g. if multiple peak callers are used for comparison purposes, and hence have more than one line in the sample sheet. Once the peaksets are read in, a merging function finds all overlapping peaks and derives a single set of unique genomic intervals covering all the supplied peaks. \item \texttt{Occupancy analysis}: Peaksets, especially those generated by peak callers, provide an insight into the potential \emph{occupancy} of the protein being ChIPed for at specific genomic loci. After the peaksets have been loaded, it can be useful to perform some exploratory plotting to determine how these occupancy maps agree with each other, e.g. between experimental replicates (re-doing the ChIP under the same conditions), between different peak callers on the same experiment, and within groups of samples representing a common experimental condition. \DBA provides functions to enable overlaps to be examined, as well as functions to determine how well similar samples cluster together. Beyond quality control, the product of an occupancy analysis may be a \emph{consensus peakset}, representing an overall set of candidate binding sites to be used in further analysis. \item \texttt{Counting reads}: Once a consensus peakset has been derived, \DBA can use the supplied sequence read files to count how many reads overlap each interval for each unique sample. The result of this is a \emph{binding affinity matrix} containing a (normalized) read count for each sample at every potential binding site. With this matrix, the samples can be re-clustered using affinity, rather than occupancy, data. The binding affinity matrix is used for QC plotting as well as for subsequent differential analysis. \item \texttt{Differential binding affinity analysis}: The core functionality of \DBA is the differential binding affinity analysis, which enables binding sites to be identified that are statistically significantly differentially bound between sample groups. To accomplish this, first a contrast (or contrasts) is established, dividing the samples into groups to be compared. Next the core analysis routines are executed, by default using \edgeR. This will assign a p-value and FDR to each candidate binding site indicating the significance of their being differentially bound. \item \texttt{Plotting and reporting}: Once one or more contrasts have been run, \DBA provides a number of functions for reporting and plotting the results. MA plots give an overview of the results of the analysis, while correlation heatmaps and PCA plots show how the groups cluster based on differentially bound sites. Boxplots show the distribution of reads within differentially bound sites corresponding to whether they gain or lose affinity between the two sample groups. A reporting mechanism enables differentially bound sites to be extracted for further processing, such as annotation and/or pathway analysis. \end{enumerate} \section{Example: obtaining differentially bound sites} This section offers a quick example of how to use \DBA to identify significantly differentially bound sites using affinity (read count) data. The dataset for this example consists of ChIPs against the transcription factor ERa using five breast cancer cell lines (\cite{ross2012differential}). Three of these cell lines are responsive to tamoxifen, while two others are resistant to tamoxifen treatment. There are at least two replicates for each of the cell lines, with one cell line having three replicates, for a total of eleven sequenced libraries. Note that this experiment includes two types of MCF7 cells: the regular tamoxifen responsive line as well as MCF7 cells specially treated with tamoxifen until a tamoxifen resistant cell line is obtained. For each sample, we have one peakset originally derived using the MACS peak caller (\cite{zhang2008model}), for a total of eleven peaksets. Note that to save space in the package, only data for chromosome 18 is used. The metadata and peak data are available in the \texttt{extra} subdirectory of the \DBA package directory; you can make this your working directory by entering: <>= tmp = tempfile(as.character(Sys.getpid())) pdf(tmp) savewarn = options("warn") options(warn=-1) savewd = getwd() @ <>= library(DiffBind) setwd(system.file("extra", package="DiffBind")) @ Obtaining the sites significantly differentially bound (DB) between the samples that respond to tamoxifen and those that are resistant can be done in a five-step script: <>= tamoxifen = dba(sampleSheet="tamoxifen.csv") tamoxifen = dba.count(tamoxifen) tamoxifen = dba.contrast(tamoxifen, categories=DBA_CONDITION) tamoxifen = dba.analyze(tamoxifen) tamoxifen.DB = dba.report(tamoxifen) @ The following subsections describe these steps in more detail. \subsection{Reading in the peaksets} \begin{table}\scriptsize \centering \caption{Tamoxifen dataset sample sheet (tamoxifen.csv).} \begin{tabular}{|c|c|c|c|c|c|c|c|} \hline SampleID& Tissue& Factor& Condition& Replicate& bamReads& bamControl& Peaks \\ \hline BT474.1-& BT474& ER& Resistant& 1& BT474\_ER\_1.bed.gz& BT474\_Input.bed.gz& BT474\_ER\_1.bed.gz\\ BT474.2-& BT474& ER& Resistant& 2& BT474\_ER\_2.bed.gz& BT474\_Input.bed.gz& BT474\_ER\_2.bed.gz\\ MCF7.1+& MCF7&\ ER& Responsive& 1& MCF7\_ER\_1.bed.gz& MCF7\_Input.bed.gz& MCF7\_ER\_1.bed.gz\\ MCF7.2+& MCF7& ER& Responsive& 2& MCF7\_ER\_2.bed.gz& MCF7\_Input.bed.gz& MCF7\_ER\_2.bed.gz\\ MCF7.3+& MCF7& ER& Responsive& 3& MCF7\_ER\_3.bed.gz& MCF7\_Input.bed.gz& MCF7\_ER\_3.bed.gz\\ T47D.1+& T47D& ER& Responsive& 1& T47D\_ER\_1.bed.gz& T47D\_Input.bed.gz& T47D\_ER\_1.bed.gz\\ T47D.2+& T47D& ER& Responsive& 2& T47D\_ER\_2.bed.gz& T47D\_Input.bed.gz& T47D\_ER\_2.bed.gz\\ MCF7.1-& MCF7& ER& Resistant& 1& TAMR\_ER\_1.bed.gz& TAMR\_Input.bed.gz& TAMR\_ER\_1.bed.gz\\ MCF7.2-& MCF7& ER& Resistant& 2& TAMR\_ER\_2.bed.gz& TAMR\_Input.bed.gz& TAMR\_ER\_2.bed.gz\\ ZR75.1+& ZR75& ER& Responsive& 1& ZR75\_ER\_1.bed.gz& ZR75\_Input.bed.gz& ZR75\_ER\_1.bed.gz\\ ZR75.2+& ZR75& ER& Responsive& 2& ZR75\_ER\_2.bed.gz& ZR75\_Input.bed.gz& ZR75\_ER\_2.bed.gz\\ \hline \end{tabular} \label{tab:tamoxifen} \end{table} \reft{tamoxifen} shows the sample sheet, saved in a file called \texttt{tamoxifen.csv}. The peaksets are read in using the following \DBA function: <>= tamoxifen = dba(sampleSheet="tamoxifen.csv") @ The result is a \texttt{DBA object}; the metadata associated with this object can be displayed simply as follows: <>= tamoxifen @ %11 Samples, 2602 sites in matrix (3557 total): % ID Tissue Factor Condition Peak.caller Replicate Intervals %1 BT474.1- BT474 ER Resistant raw 1 1084 %2 BT474.2- BT474 ER Resistant raw 2 1115 %3 MCF7.1+ MCF7 ER Responsive raw 1 1513 %4 MCF7.2+ MCF7 ER Responsive raw 2 1037 %5 MCF7.3+ MCF7 ER Responsive raw 3 1372 %6 T47D.1+ T47D ER Responsive raw 1 509 %7 T47D.2+ T47D ER Responsive raw 2 347 %8 TAMR.1- TAMR ER Resistant raw 1 1148 %9 TAMR.2- TAMR ER Resistant raw 2 933 %10 ZR75.1+ ZR75 ER Responsive raw 1 2111 %11 ZR75.2+ ZR75 ER Responsive raw 2 1975 This shows how many peaks are in each peakset, as well as (in the first line) total number of unique peaks after merging overlapping ones (3,557) and the default binding matrix of 11 samples by the 2,602 sites that overlap in at least two of the samples. This object is available for loading using \texttt{data(tamoxifen\_peaks)}. Using only this peak caller data, a correlation heatmap can be generated which gives an initial clustering of the samples using the cross-correlations of each row of the binding matrix: <>= plot(tamoxifen) @ The resulting plot (\reff{tamox-occ-corhm}) shows that while the replicates for each cell line cluster together appropriately, the cell lines do not cluster into groups corresponding to those that are responsive (MCF7+, T47D, and ZR75) vs. those resistant (BT474 and MCF7-) to tamoxifen treatment. It also shows that the two most highly correlated cell lines are the two MCF7-based ones, even though they respond differently to tamoxifen treatment. \begin{figure} \centering \includegraphics[width=14cm]{tamox_occ_corhm.png} \caption{Correlation heatmap, using occupancy (peak caller score) data. Generated by: \texttt{plot(tamoxifen)}; can also be generated by: \texttt{dba.plotHeatmap(tamoxifen).}} \label{fig:tamox-occ-corhm} \end{figure} \subsection{Counting reads} The next step is to calculate a binding matrix with scores based on read counts for every sample (affinity scores), rather than confidence scores for only those peaks called in a specific sample (occupancy scores). These reads are obtained using the \texttt{dba.count} function:\footnote{Note that due to space limitations the reads are not shipped with the package. We hope to make them easily available in the future. Alternatively, you can get the result of the \texttt{dba.count} call by loading the supplied \R object by invoking \texttt{data(tamoxifen\_counts)}} <>= tamoxifen = dba.count(tamoxifen, minOverlap=3) @ <>= data(tamoxifen_counts) @ If you do not have the raw reads available to you, this object is available loading using \texttt{data(tamoxifen\_counts)}. The \texttt{dba.count} call plots a new correlation heatmap based on the affinity scores, seen in \refsf{tamox-aff-corhm}. While this shows overall higher correlation levels (evidenced by the shift in the scale), and a slightly different clustering, responsiveness to tamoxifen treatment does not appear to form a basis for clustering when using all of the affinity scores. (Note that the clustering can change based on what scoring metric is used; see Section 4.4 for more details). \subsection{Establishing a contrast} Before running the differential analysis, we need to tell \DBA which cell lines fall in which groups. This is done using the \texttt{dba.contrast} function, as follows: <>= tamoxifen = dba.contrast(tamoxifen, categories=DBA_CONDITION) @ The uses the \emph{condition} metadata (Responsive vs. Resistant) to set up a a contrast with 4 samples in the Resistant group and 7 samples in the Responsive group.\footnote{This step is actually optional: if the main analysis function \texttt{dba.analyze} is invoked with no contrasts established, \DBA will set up a default set of contrasts automatically, which will include the one we are interested in.} \begin{figure} \centering \subfloat[CORHM1][Correlation heatmap, using affinity (read count) data. Generated by: \texttt{tamoxifen = dba.count(tamoxifen)}; can also be generated by: \texttt{dba.plotHeatmap(tamoxifen)}] { \includegraphics[width=10cm]{tamox_aff_corhm.png} \label{fig:tamox-aff-corhm}} \qquad \subfloat[CORHM2][Correlation heatmap, using only significantly differentially bound sites. Generated by: \texttt{tamoxifen = dba.analyze(tamoxifen)}; can also be generated by: \texttt{dba.plotHeatmap(tamoxifen, contrast=1)}]{ \includegraphics[width=10cm]{tamox_sdb_corhm.png} \label{fig:tamox-sdb-corhm}} \caption{Correlation heatmap plots, generated using \texttt{dba.plotHeatmap}.} \label{fig:tamox_heatmaps} \end{figure} \subsection{Performing the differential analysis} The main differential analysis function is invoked as follows: <>= tamoxifen = dba.analyze(tamoxifen) @ This will run an \edgeR analysis (see subsequent section discussing the technical details of the \edgeR analysis) on the binding affinity matrix. Displaying the resultant \texttt{DBA object} shows that 235 of the 1,654 sites are identified as being significantly differentially bound (DB) using the default threshold of FDR <= 0.1: <>= tamoxifen @ %11 Samples, 1654 sites in matrix: % ID Tissue Factor Condition Peak.caller Replicate Intervals %1 BT474.1- BT474 ER Resistant counts 1 1654 %2 BT474.2- BT474 ER Resistant counts 2 1654 %3 MCF7.1+ MCF7 ER Responsive counts 1 1654 %4 MCF7.2+ MCF7 ER Responsive counts 2 1654 %5 MCF7.3+ MCF7 ER Responsive counts 3 1654 %6 T47D.1+ T47D ER Responsive counts 1 1654 %7 T47D.2+ T47D ER Responsive counts 2 1654 %8 TAMR.1- TAMR ER Resistant counts 1 1654 %9 TAMR.2- TAMR ER Resistant counts 2 1654 %10 ZR75.1+ ZR75 ER Responsive counts 1 1654 %11 ZR75.2+ ZR75 ER Responsive counts 2 1654 % %1 Contrast: % Group1 Members1 Group2 Members2 DB.edgeR %1 Resistant 4 Responsive 7 176 By default, \texttt{dba.analyze} plots a correlation heatmap if it finds any significantly differentially bound sites, shown in \refsf{tamox-sdb-corhm}. Using only the differentially bound sites, we now see that the four tamoxifen resistant samples (representing two cell lines) cluster together, although the tamoxifen-responsive MCF7 replicates cluster closer to them than to the other tamoxifen responsive samples. Comparing \refsf{tamox-aff-corhm}, which uses all 1,654 consensus binding sites, with \refsf{tamox-sdb-corhm}, which uses only the 235 differentially bound sites, demonstrates how the differential binding analysis isolates sites that help distinguish between the Resistant and Responsive sample groups. The fact that an ideal clustering is not apparent even when considering only the differentially bound sites indicates that there may be a confounding factor we are not taking into account in this analysis. See Section 5 for a more sophisticated analysis that takes this factor (the presence of MCF7 cell lines in both the responsive and resistant groups) into account. \subsection{Retrieving the differentially bound sites} The final step is to retrieve the differentially bound sites as follows: <>= tamoxifen.DB = dba.report(tamoxifen) @ These are returned as a \texttt{GRanges} object, appropriate for downstream processing: <>= tamoxifen.DB @ %RangedData with 182 rows and 6 value columns across 1 space % space ranges | Conc Conc1 Conc2 Fold p.value FDR % | %1 chr18 [ 384853, 386517] | 6.822477 8.096600 4.422122 3.674478 7.560491e-07 0.001250505 %2 chr18 [62640747, 62642453] | 6.935226 2.818403 7.556743 -4.738340 3.231268e-06 0.002672258 %3 chr18 [ 8719898, 8721018] | 6.666340 7.927585 4.356627 3.570957 9.123563e-06 0.004610124 %4 chr18 [67583814, 67584869] | 5.344121 1.001062 5.970114 -4.969051 1.318678e-05 0.004610124 %5 chr18 [72497301, 72498984] | 7.855787 4.253954 8.463994 -4.210040 1.652061e-05 0.004610124 %6 chr18 [10013642, 10014854] | 6.331710 7.540254 4.336512 3.203741 1.672354e-05 0.004610124 %7 chr18 [ 7635601, 7636863] | 6.222280 7.392168 4.415090 2.977078 2.334838e-05 0.005516888 %8 chr18 [ 8167364, 8168199] | 4.835931 6.025611 2.936172 3.089440 3.189348e-05 0.006593976 %9 chr18 [ 8947487, 8948503] | 5.915582 7.151613 3.766335 3.385278 3.767074e-05 0.006846554 %... ... ... ... ... ... ... ... ... ... %174 chr18 [ 479200, 480041] | 4.919858 5.828205 3.916731 1.911473 0.01003126 0.09457060 %175 chr18 [ 8489855, 8490579] | 3.978404 4.891862 2.964229 1.927633 0.01004482 0.09457060 %176 chr18 [40476698, 40477842] | 4.098327 2.024730 4.620057 -2.595327 0.01009693 0.09457060 %177 chr18 [ 3821937, 3822886] | 4.507370 2.320755 5.039343 -2.718587 0.01012031 0.09457060 %178 chr18 [ 7054204, 7054814] | 4.299982 5.256037 3.188626 2.067411 0.01029863 0.09565665 %179 chr18 [23704720, 23705570] | 5.901634 6.769520 4.981827 1.787693 0.01044403 0.09565665 %180 chr18 [29846635, 29847337] | 4.086749 5.006580 3.058624 1.947956 0.01044488 0.09565665 %181 chr18 [10569384, 10570571] | 6.461755 7.379404 5.438426 1.940978 0.01046787 0.09565665 %182 chr18 [52191999, 52193072] | 4.708795 5.602390 3.736879 1.865511 0.01057757 0.09612806 The value columns show the mean read concentration over all the samples (the default calculation uses log2 normalized ChIP read counts with control read counts subtracted) and the mean concentration over the first (Resistant) group and second (Responsive) group. The Fold column shows the difference in mean concentrations between the two groups (Conc\_Resistant - Conc\_Responsive), with a positive value indicating increased binding affinity in the Resistant group and a negative value indicating increased binding affinity in the Responsive group. The final two columns give confidence measures for identifying these sites as differentially bound, with a raw p-value and a multiple testing corrected FDR in the final column. \section{Example: plotting} Besides the correlation heatmaps automatically generated by the core functions, a number of other plots are available using the affinity data. This sections covers MA plots, PCA plots, Boxplots, and Heatmaps. \subsection{MA plots} MA plots are a useful way to visualize the effect of normalization on data, as well as seeing which of the datapoints are being identified as differentially bound. An MA plot can be obtained for the resistant-responsive contrast as follows: <>= dba.plotMA(tamoxifen) @ The plot is shown in \reff{tamox-sdb-ma}. Each point represents a binding site, with point in red representing sites identified as differentially bound. The plot shows how the differentially bound sites have an absolute log fold difference of at least 2. This same data can also be shown with the concentrations of each sample groups plotted against each other plot using \texttt{dba.plotMA(tamoxifen, bXY=T)}. \begin{figure} \centering \includegraphics[width=12cm]{tamox_sdb_ma.png} \caption{MA plot of Resistant-Responsive contrast, with sites identified as significantly differentially bound shown in red. Generated by: \texttt{dba.plotMA(tamoxifen)}} \label{fig:tamox-sdb-ma} \end{figure} \subsection{PCA plots} While the correlation heatmaps already seen are good for showing clustering, plots based on principal components analysis can be used to give a deeper insight into how samples are associated. A PCA plot corresponding to \refsf{tamox-aff-corhm}, which includes normalized read counts for all the binding sites, can be obtained as follows: <>= dba.plotPCA(tamoxifen) @ This draws the plot and returns a color legend. The resulting plot (\refsf{tamox-aff-pca}) shows the four Resistant samples (black) not separable from the Responsive samples (red) in either the first (horizontal) or the second (vertical) components when looking at all the binding sites. \begin{figure}[h] \centering \subfloat[PCA1][PCA plot using affinity data for all sites. Generated by: \texttt{dba.plotPCA(tamoxifen)}]{ \includegraphics[width=0.4\textwidth]{tamox_aff_pca.png} \label{fig:tamox-aff-pca}} \qquad \subfloat[PCA2][PCA plot using affinity data for only differentially bound sites. Generated by: \texttt{dba.plotPCA(tamoxifen, contrast=1)}]{ \includegraphics[width=0.4\textwidth]{tamox_sdb_pca.png} \label{fig:tamox-sdb-pca}} \caption{Principal Component Analysis (PCA) plots, generated using \texttt{dba.plotPCA}.} \label{fig:tamox_pca} \end{figure} A PCA plot using only the differentially bound sites (corresponding to \refsf{tamox-sdb-corhm}), using an FDR threshold of 0.1, can be drawn as follows: <>= dba.plotPCA(tamoxifen, contrast=1,th=.1) @ This plot (\reff{tamox-sdb-pca}) shows that the differential analysis identifies sites than can be used to separate the sample groups along both the first and second components. The \texttt{dba.plotPCA} function is customizable. For example, if you want to see where each of the unique cell lines lies, type \texttt{dba.plotPCA(tamoxifen, attributes=c(DBA\_TISSUE,DBA\_CONDITION)}. If your installation of \R supports 3D graphics using the \texttt{rgl} package, try \texttt{dba.plotPCA(tamoxifen, b3D=T)}. You may have to rotate the resultant plot top-to-bottom to align it with the standard two-dimension plot, but seeing the first three principal components can be a useful exploratory exercise. \subsection{Boxplots} Boxplots provide a way to view how read distributions differ between classes of binding sites. Consider the example, where the 235 differentially bound sites are identified. The MA plot (\reff{tamox-sdb-ma}) shows that these are not distributed evenly between those that increase binding affinity in the Responsive group vs. those that increase binding affinity in the Resistant groups. This can be seen quantitatively using the sites returned in the report: <>= sum(values(tamoxifen.DB)$Fold<0) sum(values(tamoxifen.DB)$Fold>0) @ But how are reads distributed amongst the different classes of differentially bound sites and sample groups? These data can be more clearly seen using a boxplot: <>= pvals = dba.plotBox(tamoxifen) @ \begin{figure} \centering \includegraphics[width=13cm]{tamox_sdb_box} \caption{Box plots of read distributions for significantly differentially bound (DB) sites. Tamoxifen resistant samples are shown in red, and responsive samples are shown in blue. Left two boxes show distribution of reads over all DB sites in the Resistant and Responsive groups; middle two boxes show distributions of reads in DB sites that increase in affinity in the Responsive group; last two boxes show distributions of reads in DB sites that increase in affinity in the Resistant group. Generated by: \texttt{dba.plotBox(tamoxifen)}} \label{fig:tamox-sdb-box} \end{figure} The default plot (\reff{tamox-sdb-box}) shows in the first two boxes that amongst differentially bound sites overall, the Responsive samples have a somewhat higher mean read concentration. The next two boxes show the distribution of reads in differentially bound sites that exhibit increased affinity in the Responsive samples, while the final two boxes show the distribution of reads in differentially bound sites that exhibit increased affinity in the Resistant samples. \texttt{dba.plotBox} returns a matrix of p-values (computed using a two-sided Wilcoxon `Mann-Whitney' test) indicating which of these distributions are significantly different from another distribution. <>= pvals @ % Resistant.DB Responsive.DB Resistant.DB+ Responsive.DB+ Resistant.DB- Responsive.DB- % Resistant.DB 1.000000e+00 4.577788e-05 2.044220e-08 8.729574e-11 5.477838e-12 5.842724e-01 % Responsive.DB 4.577788e-05 1.000000e+00 7.742447e-27 5.808723e-06 8.635489e-09 2.526501e-08 % Resistant.DB+ 2.044220e-08 7.742447e-27 1.000000e+00 2.924697e-30 2.880757e-27 2.176e-10 % Responsive.DB+ 8.729574e-11 5.808723e-06 2.924697e-30 1.000000e+00 6.878802e-02 2.268808e-18 % Resistant.DB- 5.477838e-12 8.635489e-09 2.880757e-27 6.878802e-02 1.000000e+00 5.719333e-20 % Responsive.DB- 5.842724e-01 2.526501e-08 2.141676e-10 2.268808e-18 5.719333e-20 1.000000e+00 The significance of the overall difference in distribution of concentrations amongst the differentially bound sites in the two groups is shown to be p-value=2.135309e-11, while those between the Resistant and Responsive groups in the individual cases (increased in Responsive or Resistant) have p-values computed as 2.926571e-41 and 2.588615e-18. \subsection{Heatmaps} \DBA provides two types of heatmaps. This first, correlation heatmaps, we have already seen. For example, the heatmap shown in \refsf{tamox-aff-corhm} can be generated as follows: <>= corvals = dba.plotHeatmap(tamoxifen) @ The effect of different scoring methods (normalization) can be examined in these plots by setting the \texttt{score} parameter to a different value. The default value, \texttt{DBA\_SCORE\_TMM\_MINUS\_FULL}, uses the TMM normalization procedure from \edgeR, with contral reads subtracted first and using the full library size. Another scoring method is to use RPKM fold (RPKM of the ChIP reads divided by RPKM of the control reads; a correaltion heatmap for all the data using this scoring method can be obtained by typing \texttt{dba.plotHeatmap(tamoxifen, score=DBA\_SCORE\_RPKM\_FOLD)}. Another way to view the patterns of binding affinity directly in the differentially bound sites is via a binding affinity heatmap. This can be plotted for the example case as follows: <>= corvals = dba.plotHeatmap(tamoxifen, contrast=1, correlations=FALSE) @ \reff{tamox-sdb-hm} shows the affinities and clustering of the differentially bound sites (rows), as well as the sample clustering (columns). This plot can be tweaked to get more contrast, for example by using row-scaling \texttt{dba.plotHeatmap(tamoxifen, contrast=1, correlations=FALSE, scale=row)}. \begin{figure} \centering \includegraphics[width=12cm]{tamox_sdb_hm} \caption{Binding affinity heatmap showing affinities for differentially bound sites. Samples cluster first by whether they are responsive to tamoxifen treatment, then by cell line. Clusters of binding sites show distinct patterns of affinity levels. Generated by: \texttt{dba.plotHeatmap(tamoxifen, contrast=1, correlations=FALSE)}} \label{fig:tamox-sdb-hm} \end{figure} \section{Example: differential binding analysis using a blocking factor} The previous example showed how to perform a differential binding analysis using a single factor with two values; that is, finding the significantly differentially bound sites between two sets of samples. This section extends the example by including a second factor, potentially with multiple values, that represents a confounding condition. Examples of experiments where it is appropriate to use a blocking factor include one where there are potential batch effects, with samples from the two conditions prepared together, or a matched design (e.g. matched normal and tumor pairs, where the primary factor of interest is to discover sites consistently differentially bound between normal and tumor samples. In the current example, the confounding effect we want to control for is the presence of two sets of samples, one tamoxifen responsive and one resistant, that are both derived from the same MCF7 cell line. In the previous analysis, the two MCF7-derived cell lines tended to cluster together. While the differential binding analysis was able to identify sites that could be used to separate the resistance from the responsive samples, the confounding effect of the common ancestry could still be seen even when considering only the significantly differentially bound sites (\reff{tamox-aff-corhm}). Using the generalized linear modelling (GLM) functionality included in \edgeR and \DESeq, the confounding factor can be explicitly modelled. This is done by specifying a blocking factor to \texttt{dba.contrast}. There are a number of ways to specify this factor. If it is encapsulated in a piece of metadata (eg. DBA\_REPLICATE, or DBA\_TREATMENT etc.), simply specifying the metadata field is sufficient. In the current case, there is no specific metadata field that captures the factor we want to block (although an unused metadata field, such as DBA\_TREATMENT, could be used to specify this factor). An alternate way of specifying the confounded sampled is to use a mask: <>= data(tamoxifen_counts) tamoxifen = dba.contrast(tamoxifen,categories=DBA_CONDITION, block=tamoxifen$masks$MCF7) @ Now when the analysis is run, it will be run using both the single-factor comparison as well as fitting a linear model with the second, blocking factor, for comparison: <>= tamoxifen = dba.analyze(tamoxifen) tamoxifen @ This indicates that where the standard, single-factor \edgeR analysis identifies 235 differentially bound sites, the analysis using the blocking factor finds 416 such sites. An MA plot shows how the analysis has changed: <>= dba.plotMA(tamoxifen,method=DBA_EDGER_BLOCK) @ \begin{figure} \centering \includegraphics[width=12cm]{tamox_block_ma.png} \caption{MA plot of Resistant-Responsive contrast, using MCF7 origin as a blocking factor, with sites identified as significantly differentially bound shown in red. Generated by: \texttt{dba.plotMA(tamoxifen, method=DBA\_EDGER\_BLOCK)}} \label{fig:tamox-block-ma} \end{figure} The resulting plot is shown in \reff{tamox-block-ma}. Comparing this to \reff{tamox-sdb-ma}, at least two differences can be observed. The analysis has become more sensitive, with sites being identified as significantly differentially bound with lower magnitude fold changes (as low as twofold, as this plot is on a log2 scale). But it is not merely lowering a fold threshold: many sites with higher fold changes are no longer found to be significant. These were included in the earlier analysis because the confounding factor was not being modelled. Consider the resulting separation and clustering using the newly discovered differentially bound sites: <>= dba.plotHeatmap(tamoxifen,contrast=1,method=DBA_EDGER_BLOCK, attributes=c(DBA_TISSUE,DBA_CONDITION,DBA_REPLICATE)) @ <>= dba.plotPCA(tamoxifen,contrast=1,method=DBA_EDGER_BLOCK, attributes=c(DBA_TISSUE,DBA_CONDITION)) @ \begin{figure} \centering \includegraphics[width=14cm]{tamox_block_corhm.png} \caption{Correlation heatmap of using scores for significantly differentially bound sites for the Resistant-Responsive contrast, using MCF7 origin as a blocking factor. Generated by: \texttt{dba.plotHeatmap(tamoxifen, contrast=1, method=DBA\_EDGER\_BLOCK, attributes=c(DBA\_TISSUE,DBA\_CONDITION,DBA\_REPLICATE))}} \label{fig:tamox-block-corhm} \end{figure} \begin{figure} \centering \includegraphics[width=11cm]{tamox_block_pca.png} \caption{Plot of first two principal components, using scores for significantly differentially bound sites for the Resistant-Responsive contrast, using MCF7 origin as a blocking factor. Generated by: \texttt{dba.plotPCA(tamoxifen,contrast=1,method=DBA\_EDGER\_BLOCK, attributes=c(DBA\_TISSUE,DBA\_CONDITION))}} \label{fig:tamox-block-pca} \end{figure} Frequently, as more sites are included in these plots, the result is often worse clustering/separation along the grouping of primary interest. With this analysis, however, clustering and separation are improved, even though many more sites have been identified. The correlation heatmap (\reff{tamox-block-corhm}, compare to \reff{tamox-sdb-corhm}) shows the tamoxifen resistant cell lines clustering together, with the tamoxifen responsive MCF7 sample clustering with the other responsive cell lines. Likewise, the PCA plot (\reff{tamox-block-pca}, compare to \reff{tamox-sdb-pca}) shows a cleaner separation of the MCF7 samples, particularly in the second component. It is also interesting to compare the performance of \edgeR with that of \DESeq on this dataset: <>= tamoxifen = dba.analyze(tamoxifen,method=DBA_DESEQ) tamoxifen @ While \DESeq has a much more conservative approach to the single factor analysis, identifying only 54 sites as differentially bound, modelling the confounding fact results in an almost identical result. You can check this by looking at the identified sites using \texttt{dba.report}, and performing MA, heatmap, and PCA plots. \section{Example: occupancy analysis and overlaps} In this section, we look at the tamoxifen resistance ER-binding dataset in some more detail, showing what a pure occupancy-based analysis would look like, and comparing it to the results obtained using the affinity data. For this we will start by re-loading the peaksets: <>= data(tamoxifen_peaks) @ \subsection{Overlap rates} One reason to do an occupancy-based analysis is to determine what candidate sites should be used in a subsequent affinity-based analysis. In the example so far, we took all sites that were identified in peaks in at least three of the eleven peaksets, reducing the number of sites from 3,557 overall to the 1,654 sites used in the differential analysis. We could have used a more stringent criterion, such as only taking sites identified in five or six of the peaksets, or a less stringent one, such as including all 3,557 sites. In making the decision of what criteria to use many factors come into play, but it helps to get an idea of the rates at which the peaksets overlap (for more details on how overlaps are determined, see Section 7.1 on peak merging). A global overview can be obtained using the RATE mode of the \texttt{dba.overlap} function as follows: <>= olap.rate = dba.overlap(tamoxifen,mode=DBA_OLAP_RATE) olap.rate @ \texttt{olap.rate} is a vector containing the number of peaks that appear in at least one, two, three, and so on up to all eleven peaksets. % [1] 3557 2602 1654 1299 1002 764 620 455 352 187 118 These values can be plotted to show the overlap rate drop-off curve: <>= plot(olap.rate,type='b') @ The rate plot is shows in \reff{tamox-rate}. These curves typically exhibit a roughly geometric drop-off, with the number of overlapping sites halving as the overlap criterion become stricter by one site. When the drop-off is extremely steep, this is an indication that the peaksets do not agree very well. For example, if there are replicates you expect to agree, there may be a problem with the experiment. In the current example, peak agreement is high and the curve exhibits a better than geometric drop-off. \begin{figure} \centering \includegraphics[width=9cm]{tamox_rate.png} \caption{Overlap rate plot, showing how the number of overlapping peaks decreases as the overlap criteria becomes more stringent. X axis shows the number of peaksets in which the site is identified, while the Y axis shows the number of overlapping sites. Generated by plotting the result of: \texttt{dba.overlap(tamoxifen,mode=DBA\_OLAP\_RATE)}} \label{fig:tamox-rate} \end{figure} \subsection{Deriving consensus peaksets} When performing an overlap analysis, it is often the case that the overlap criteria are set stringently in order to lower noise and drive down false positives.\footnote{It is less clear that limiting the potential binding sites in this way is appropriate when focusing on affinity data, as the differential binding analysis method will identify only sites that are significantly differentially bound, even if operating on peaksets that include incorrectly identified sites.} The presence of a peak in multiple peaksets is an indication that it is a "real" binding site, in the sense of being identifiable in a repeatable manner. The use of biological replicates (performing the ChIP multiple times), as in the tamoxifen dataset, can be used to guide derivation of a consensus peakset. Alternatively, an inexpensive but less powerful way to help accomplish this is to use multiple peak callers for each ChIP dataset and look for agreement between peak callers (\cite{li2011measuring}). Consider for example the standard (tamoxifen responsive) MCF7 cell line, represented by three replicates in this dataset. How well do the replicates agree on their peak calls? The overlap rate for just the MCF7 samples can be isolated using a \emph{sample mask}. A set of sample masks are automatically associated with a \texttt{DBA object} in the \texttt{\$masks} field: <>= names(tamoxifen$masks) @ % [1] "BT474" "MCF7" "T47D" "TAMR" "ZR75" "ER" % [6] "Resistant" "Responsive" "raw" "Replicate.1" "Replicate.2" % [12] "Replicate.3" Arbitrary masks can be generated using the \texttt{dba.mask} function, or simply by specifying a vector of peakset numbers. In this case, a mask that isolates the MCF7 samples can be generated by combining to pre-defined masks (MCF7 and Responsive) and passed into the \texttt{dba.overlap} function: <>= dba.overlap(tamoxifen,tamoxifen$masks$MCF7 & tamoxifen$masks$Responsive, mode=DBA_OLAP_RATE) @ % [1] 1767 1222 874 There are 874 peaks (out of 1,767) identified in all three replicates. A finer grained view of the overlaps can be obtained with the \texttt{dba.plotVenn} function: <>= dba.plotVenn(tamoxifen,tamoxifen$masks$MCF7 & tamoxifen$masks$Responsive) @ \begin{figure} \centering \includegraphics[width=13cm]{tamox_mcf7_venn.png} \caption{Venn diagram showing how the ER peak calls for three replicates of responsive MCF7 cell line overlap. Generated by plotting the result of: \texttt{dba.venn(tamoxifen,tamoxifen\$masks\$MCF7 \& tamoxifen\$masks\$Responsive)}} \label{fig:tamox-mcf7-venn} \end{figure} The resultant plot is shown as \reff{tamox-mcf7-venn}. This plot shows the 874 consensus peaks identified as common to all replicates, but further breaks down how the replicates relate to each other. The same can be done for each of the replicated cell line experiments, and rather than applying a global cutoff (3 of 11), each cell line could be dealt with individually in deriving a final peakset. For example we could replace the three tamoxifen-responsive MCF7 peaksets with one including the 1,222 peaks identified in at least two replicates by masking the non-consensus MCF7 peaksets (the \$Consensus mask filters peaksets formed from existing peaksets using \texttt{dba.peakset}) as follows: <>= tamoxifen = dba.peakset(tamoxifen,tamoxifen$masks$MCF7&tamoxifen$masks$Responsive, sampID="MCF7+") tamoxifen = dba(tamoxifen,!(!tamoxifen$masks$Consensus&tamoxifen$masks$MCF7)) tamoxifen @ % 9 Samples, 2377 sites in matrix (3322 total): % ID Tissue Factor Condition Peak.caller Replicate Intervals %1 BT474.1- BT474 ER Resistant raw 1 1084 %2 BT474.2- BT474 ER Resistant raw 2 1115 %3 T47D.1+ T47D ER Responsive raw 1 509 %4 T47D.2+ T47D ER Responsive raw 2 347 %5 TAMR.1- TAMR ER Resistant raw 1 1148 %6 TAMR.2- TAMR ER Resistant raw 2 933 %7 ZR75.1+ ZR75 ER Responsive raw 1 2111 %8 ZR75.2+ ZR75 ER Responsive raw 2 1975 %9 MCF7+ MCF7 ER Responsive raw 1-2-3 1222 \subsection{A complete occupancy analysis: identifying sites unique to a sample group} Occupancy-based analysis, in addition to offering many ways of deriving consensus peaksets, can also be used to identify sites unique to a group of samples. This is analogous to, but not the same as, finding differentially bound sites. In these subsections, the two approaches are directly compared. Returning to the original tamoxifen dataset: <>= data(tamoxifen_peaks) @ We can derive consensus peaksets for the Resistant and Responsive groups. First we examine the overlap rates: <>= dba.overlap(tamoxifen,tamoxifen$masks$Resistant,mode=DBA_OLAP_RATE) dba.overlap(tamoxifen,tamoxifen$masks$Responsive,mode=DBA_OLAP_RATE) @ % [1] 1875 1298 597 436 % [1] 3208 2293 1217 807 584 265 161 For the Resistant group, we choose an overlap rate of two, leaving 1,298 sites, while for the Responsive group, we use an overlap rate of 3, leaving 1,217 sites, and look at the overlap between the two groups: <>= tamoxifen = dba.peakset(tamoxifen,tamoxifen$masks$Resistant, sampID="Resistant",minOverlap=2) tamoxifen = dba.peakset(tamoxifen,tamoxifen$mask$Responsive, sampID="Responsive",minOverlap=3) dba.plotVenn(tamoxifen,tamoxifen$masks$Consensus) @ \begin{figure} \centering \includegraphics[width=9cm]{tamox_cons_venn.png} \caption{Venn diagram showing how the ER peak calls for two response groups overlap. Generated by plotting the result of: \texttt{dba.plotVenn(tamoxifen, tamoxifen\$masks\$Consensus)}} \label{fig:tamox-cons-venn} \end{figure} \reff{tamox-cons-venn} shows that 448 sites are unique to the Resistant group, and 392 sites are unique to the Responsive group, with 819 sites being identified in both groups (meaning in at least half the Resistant samples and at least three of the seven Responsive samples). If our primary interest is in finding binding sites that are different between the two groups, it may seem reasonable to consider the 819 common sites to be uninteresting, and focus on the 840 sites that are unique to a specific group. The sites can be obtained using \texttt{dba.overlap}: <>= tamoxifen.OL = dba.overlap(tamoxifen, tamoxifen$masks$Consensus) @ The sites unique to the Resistant group are accessible in \texttt{tamoxifen.OL\$onlyA}, with the Responsive-unique sites in \texttt{tamoxifen.OL\$onlyB}: <>= tamoxifen.OL$onlyA tamoxifen.OL$onlyB @ %RangedData with 448 rows and 1 value column across 1 space % space ranges | Resistant % | %1 chr18 [301531, 302172] | 0.12897827 %2 chr18 [346557, 347362] | 0.02545366 %3 chr18 [361121, 362106] | 0.01780650 %4 chr18 [384853, 386517] | 0.08871906 %5 chr18 [479200, 480032] | 0.08687365 %6 chr18 [562348, 563067] | 0.03202105 %7 chr18 [639178, 639957] | 0.05527764 %8 chr18 [647089, 647869] | 0.05521693 %9 chr18 [914869, 915486] | 0.06593881 %... ... ... ... ... %440 chr18 [74267807, 74268931] | 0.08594548 %441 chr18 [74313069, 74313720] | 0.05233002 %442 chr18 [74629529, 74630608] | 0.01670948 %443 chr18 [74675317, 74676231] | 0.06404468 %444 chr18 [75157775, 75158504] | 0.03463882 %445 chr18 [75163026, 75163816] | 0.03066147 %446 chr18 [75401417, 75402162] | 0.04940022 %447 chr18 [75525519, 75526188] | 0.06458959 %448 chr18 [75826088, 75826939] | 0.02528083 %> %RangedData with 392 rows and 1 value column across 1 space % space ranges | Responsive % | %1 chr18 [ 336592, 337347] | 0.04664408 %2 chr18 [ 439109, 440079] | 0.03310468 %3 chr18 [ 988767, 989698] | 0.04701041 %4 chr18 [1065304, 1066051] | 0.06951515 %5 chr18 [1231653, 1232311] | 0.05523114 %6 chr18 [1369773, 1370746] | 0.08923925 %7 chr18 [1470236, 1470902] | 0.06573040 %8 chr18 [2161751, 2162455] | 0.03927969 %9 chr18 [2197080, 2197908] | 0.04607203 %... ... ... ... ... %384 chr18 [72790966, 72791819] | 0.08770689 %385 chr18 [72914352, 72914983] | 0.03892986 %386 chr18 [73517930, 73518921] | 0.02559103 %387 chr18 [74835590, 74836200] | 0.12889933 %388 chr18 [74850775, 74851581] | 0.07360388 %389 chr18 [74860617, 74861878] | 0.05664056 %390 chr18 [74906349, 74907306] | 0.03568147 %391 chr18 [75641971, 75642647] | 0.06198451 %392 chr18 [76087923, 76089259] | 0.12598012 The scores associated with each site are derived from the peak caller confidence score, and are a measure of confidence in the peak call (occupancy), not a measure of how strong or distinct the peak is. \subsection{Comparison of occupancy and affinity based analyses} So how does this occupancy-based analysis compare to the previous affinity-based analysis? First, different criteria were used to select the overall consensus peakset. We can compare them to see how well they agree: <>= tamoxifen = dba.peakset(tamoxifen,tamoxifen$masks$Consensus, minOverlap=1,sampID="OL Consensus") tamoxifen = dba.peakset(tamoxifen,!tamoxifen$masks$Consensus, minOverlap=3,sampID="Consensus_3") dba.plotVenn(tamoxifen,14:15) @ \reff{tamox-compare-venn} shows that the two sets agree on about 85\% of their sites, so the results should be directly comparable.\footnote{Alternatively, we could re-run the analysis using the newly derived consensus peakset by passing it into the counting function: \texttt{> tamoxifen = dba.count(tamoxifen, peaks = dba.peakset(tamoxifen, peaks=14, bRetrieve=T))}} \begin{figure} \centering \includegraphics[width=9cm]{tamox_compare_venn.png} \caption{Venn diagram showing how the ER peak calls for two different ways of deriving consensus peaksets. Generated by plotting the result of: \texttt{dba.plotVenn(tamoxifen,14:15)}} \label{fig:tamox-compare-venn} \end{figure} Next re-load the affinity analysis: <>= data(tamoxifen_analysis) @ To compare the sites unique to each sample group identified from the occupancy analysis with those sites identified as differentially bound based on affinity (read count) data, we use a feature of \texttt{dba.report} that facilitates evaluating the occupancy status of sites. Here we obtain a report of all the sites (\texttt{th=1}) with occupancy statistics (\texttt{bCalled=T}): <>= tamoxifen.rep = dba.report(tamoxifen,bCalled=T,th=1) @ The \texttt{bCalled} option adds two columns to the report (\texttt{Called1} and \texttt{Called2}), one for each group, giving the number of samples within the group in which the site was identified as a peak in the original peaksets generated by the peak caller. We can use these to recreate the overlap criteria used in the occupancy analysis: <>= onlyResistant = values(tamoxifen.rep)$Called1>=2 & values(tamoxifen.rep)$Called2<3 sum(onlyResistant ) onlyResponsive = values(tamoxifen.rep)$Called2>=3 & values(tamoxifen.rep)$Called1<2 sum(onlyResponsive) bothGroups = values(tamoxifen.rep)$Called1>=2 & values(tamoxifen.rep)$Called2>=3 sum(bothGroups) @ % [1] 313 % [1] 391 % [1] 821 Comparing these numbers verifies the similarity with those seen in \reff{tamox-cons-venn}. Focusing on only those sites identified as significantly differentially bound (FDR <= 0.1), however, shows a different story than that obtainable using only occupancy data: <>= tamoxifen.DB = dba.report(tamoxifen,bCalled=T,th=.1) onlyResistant = values(tamoxifen.DB)$Called1>=2 & values(tamoxifen.DB)$Called2<3 sum(onlyResistant) onlyResponsive = values(tamoxifen.DB)$Called2>=3 & values(tamoxifen.DB)$Called1<2 sum(onlyResponsive) bothGroups = values(tamoxifen.DB)$Called1>=2 & values(tamoxifen.DB)$Called2>=3 sum(bothGroups) @ %[1] 32 % [1] 87 %[1] 54 There are a number of notable differences in the results. First there are many fewer sites identified as differentially bound. Indeed, most of the sites identified in the occupancy analysis as unique to a sample group are not found to be significantly differentially bound using the affinity data. While partly this is a result of the stringency of the statistical tests, it shows how the affinity analysis can discriminate between sites where peak callers are making occupancy decisions that do not reflect significant differences in read densities at these sites. Secondly, while the sites unique to a sample group are more likely to be identified as differentially bound, differentially bound sites are as likely to be called in the consensus of both response groups as they are to be unique to one group, as \emph{nearly half of the sites identified as significantly differentially bound are called as peaks in both response groups}. Finally, the differentially bound peaks identified using the affinity analysis are associated with significance statistics (p-value and FDR) that can be used to rank them for further examination, while the occupancy analysis yields a relatively unordered list of peaks, as the peak caller statistics refer only to the significance of occupancy, and not of differential binding. \section{Technical notes} This section includes some technical notes explaining some of the internal technical details of \DBA processing. \subsection{Merging peaks} When forming the global binding matrix consensus peaksets, \DBA first identifies all unique peaks amongst the relevant peaksets. As part of this process, it merges overlapping peaks, replacing them with a single peak representing the narrowest region that covers all peaks that overlap by at least one base. There are at least two consequences of this that are worth noting. First, as more peaksets are included in analysis, the average peak width tends to become longer as more overlapping peaks are detected and the start/end points are adjusted outward to account for them. Secondly, peak counts may not appear to add up as you may expect due to merging. For example, if one peakset contains two small peaks near to each other, while a second peakset includes a single peak that overlaps both of these by at least one base, these will all be replaced in the merged matrix with a single peak. As more peaksets are added, multiple peaks from multiple peaksets may be merged together to form a single, wider peak. \subsection{\edgeR analysis} When \texttt{dba.analyze} is invoked using the default \texttt{method=DBA\_EDGER}, a standardized differential analysis is performed using the \edgeR package (\cite{Robinson:2010p249}). This section details the precise steps in that analysis. For each contrast, a separate analysis is performed. First, a matrix of counts is constructed for the contrast, with columns for all the samples in the first group, followed by columns for all the samples in the second group. The raw read count is used for this matrix; if the \texttt{bSubControl} parameter is set to \texttt{TRUE} (as it is by default), the raw number of reads in the control sample (if available) will be subtracted (with a minimum final read count of 1). Next the library size is computed for each sample for use in subsequent normalization. By default, this is the total number of reads in peaks (the sum of each column). Alternatively, if the \texttt{bFullLibrarySize} parameter is set to TRUE, the total number of reads in the library (calculated from the source BAM/SAM/BED file) is used. The default setting is appropriate for situations when the overall signal is expected to be directly comparable between the samples; using the full library size may be preferable if samples are expected to have dramatically different signals (e.g., if some are expected to have very low binding rates compared to others). Next comes a call to \edgeR's \texttt{DGEList} function. The \texttt{DGEList} object that results is next passed to \texttt{calcNormFactors} with all other parameters retained as defaults (\texttt{method="TMM"}), returning an updated \texttt{DGEList} object. This is passed to \texttt{estimateCommonDisp} with default parameters. If the method is \texttt{DBA\_EDGER\_CLASSIC}, then if \texttt{ bTagwise} is TRUE (most useful when there are at least three members in each group of a contrast), the resulting \texttt{DGEList} object is then passed to \texttt{estimateTagwiseDisp}, with the prior set to 50 divided by two less than the total number of samples in the contrast, and \texttt{trend="none"}. The final steps are to perform testing to determine the significance measure of the differences between the sample groups by calling \texttt{exactTest} using the \texttt{DGEList} with the \texttt{dispersion} set based on the \texttt{bTagwise} parameter. If the method is \texttt{DBA\_EDGER\_GLM} (the default), then a a design matrix is generated with two coefficients (the Intercept and one of the groups). Next \texttt{estimateGLMCommonDisp} is called; if \texttt{bTagwise=TRUE}, \texttt{estimateGLMTagwiseDisp} is called as well. The model is fitted by calling \texttt{glmFit}, and the specific contrast fitted by calling \texttt{glmLRT}, specifying that the second coefficient be dropped. Finally, and \texttt{exactTest} is performed, using either common or tagwise dispersion depending on the value specified for \texttt{bTagwise}. This final \texttt{DGEList} for contrast n is stored in the \texttt{DBA object} as \texttt{DBA\$contrasts[[n]]\$edgeR} and may be examined and manipulated directly for further customization. Note however that if you wish to use this object directly with \edgeR functions, then the \texttt{bReduceObjects} parameter should be set to FALSE, otherwise the default value of TRUE will result in essential object fields being stripped. If a blocking factor has been added to the contrast, an additional \edgeR analysis is carried out. This follows the \texttt{DBA\_EDGER\_GLM} case detailed above, except a more complex design matrix is generated that includes all the unique values for the blocking factor. These coefficients are all included in the \texttt{glmLRT} call. The resultant object is accessible as \texttt{DBA\$contrasts[[n]]\$edgeR\$block}. \subsection{\DESeq analysis} When \texttt{dba.analyze} is invoked using \texttt{method=DBA\_DESEQ}\footnote{Note that \DESeq can be made the default analysis method for a DBA object by setting \texttt{DBA\$config\$AnalysisMethod=DBA\_DESEQ}.}, a standardized differential analysis is performed using the \DESeq package (\cite{Anders:2010p792}). This section details the precise steps in that analysis. For each contrast, a separate analysis is performed. First, a matrix of counts is constructed for the contrast, with columns for all the samples in the first group, followed by columns for all the samples in the second group. The raw read count is used for this matrix; if the \texttt{bSubControl} parameter is set to \texttt{TRUE} (as it is by default), the raw number of reads in the control sample (if available) will be subtracted. Next the library size is computed for each sample for use in subsequent normalization. By default, this is the total number of reads in peaks (the sum of each column). Alternatively, if the \texttt{bFullLibrarySize} parameter is set to TRUE, the total number of reads in the library (calculated from the source BAM/SAM/BED file) is used. The first step concludes with a call to DESeq's \texttt{newCountDataSet} function, which returns a \texttt{CountDataSet} object. If \texttt{bFullLibrarySize} is set to TRUE, then \texttt{sizeFactors} is called with the number of reads in the BAM/SAM/BED files for each ChIP sample, divided by the minimum of these; otherwise, \texttt{estimateSizeFactors} is invoked. Next, \texttt{estimateDispersions} is called with the \texttt{CountDataSet} object and \texttt{fitType} set to \texttt{local}. If there are no replicates, (only one sample in each group), \texttt{method} is set to \texttt{blind}. Otherwise, if \texttt{bTagwise} is TRUE, \texttt{method} is set to \texttt{per-condition}; if it is FALSE, \texttt{method} is set to \texttt{pooled}. If the method is \texttt{DBA\_DESEQ\_CLASSIC}, \texttt{nbinomTest} is called, and the result (reordered by adjusted p-value) saved for reporting. If the method is \texttt{DBA\_DESEQ\_GLM} (the default), two models are fitted using \texttt{fitNbinomGLMs}: a full model is fitted with all the coefficients, and a second model is fitted with the second coefficient dropped. These are tested against each other using \texttt{nbinomGLMTest}, witht he resulting p values adjusted using \texttt{p.adjust} (with \texttt{method="BH"}). The final results are acessible within the \texttt{DBA object} as \texttt{DBA\$contrasts[[n]]\$DESeq\$DEdata} and may be examined and manipulated directly for further customization. Note however that if you wish to use this object directly with \DESeq functions, then the \texttt{bReduceObjects} parameter should be set to FALSE, otherwise the default value of TRUE will result in essential object fields being stripped. If a blocking factor has been added to the contrast, an additional \DESeq analysis is carried out. This follows the \texttt{DBA\_DESEQ\_GLM} case detailed above, except a more complex design is generated when \texttt{newCountDataSet} is called that includes all the unique values for the blocking factor. These coefficients are all included in the \texttt{fitNbinomGLMs} calls. The resultant object is accessible as \texttt{DBA\$contrasts[[n]]\$DESeq\$block}. \section{Acknowledgements} This package was developed at Cancer Research UK's Cambridge Research Institute with the help and support of many people there. We wish to acknowledge everyone the Bioinformatics Core under the leadership of Matthew Eldridge, as well as the Nuclear Receptor Transcription Laboratory under the leadership of Jason Carroll. Researchers who contributed ideas and/or pushed us in the right direction include Caryn-Ross Innes, Vasiliki Theodorou, and Tamir Chandra among many others. We also thank members of the Gordon Smyth laboratory at the WEHI, Melbourne, particularly Mark Robinson and Davis McCarthy, for helpful discussions. \section{Setup} This vignette was built on: <>= sessionInfo() @ <>= dev.off() unlink(tmp) setwd(savewd) @ \bibliographystyle{plainnat} \bibliography{DiffBind_bib} \end{document}