%\VignetteIndexEntry{Analysing ChIP-Seq data with the "MMDiff" package}
%\VignettePackage{MMDiff}

\documentclass[12pt]{article}

\topmargin 0in
\headheight 0in
\headsep 0in
\oddsidemargin 0in
\evensidemargin 0in
\textwidth 176mm
\textheight 215mm
\usepackage{natbib}
\usepackage[english]{babel}
\usepackage{Sweave}
%\usepackage{url}
\usepackage{subfig}
%\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}
\newcommand{\DBA}{\textsf{DiffBind}}
\newcommand{\edgeR}{\texttt{edgeR}}
\newcommand{\DESeq}{\texttt{DESeq}}
\newcommand{\MD}{\texttt{MMDiff}}
\newcommand{\code}[1]{{\small\texttt{#1}}}
\newcommand{\R}{\textsf{R}}

\newcommand{\tab}[1]{Table \ref{tab:#1}}
\newcommand{\fig}[1]{Figure \ref{fig:#1}}


\begin{document}
\SweaveOpts{concordance=TRUE}


\title{\MD: Statistical Testing for ChIP-Seq data sets}
\author{Gabriele Schweikert \\ \texttt{G.Schweikert@ed.ac.uk}\\
  University of Edinburgh, UK\\
}


\date{1 October 2012}
\maketitle

\section{Introduction}

ChIP-Seq has rapidly become the dominant experimental technique to
determine the location of transcription factor binding sites and
histone modifications. Typically, computational peak finders, such as
Macs \citep{Macs}, are used to identify potential candidate regions,
i.e. regions with significantly enriched read coverage compared to
some background. In the following, we will simply call these regions
{\it peaks} and will assume that their {\it genomic coordinates} are
provided. Going beyond this basic analysis, it is often of interest to
detect a subset of peaks where significant {\it changes of read
  coverage} occur in a treatment experiment relative to a control (see
\fig{Example}). Statistical analysis of ChIP-Seq data however remains
challenging, due to the highly structured nature of the data and the
paucity of replicates.  Current approaches to detect differentially
bound regions are mainly borrowed from RNA-Seq data analysis, thus
focusing on total counts of fragments mapped to a region, ignoring any
information encoded in the shape of the peak profile.

Higher order features of ChIP-Seq peak enrichment profiles carry
important and often complementary information to total counts, and
hence are potentially important in assessing differential binding. We
therefore incorporate higher order information into testing for
differential binding by adapting recently proposed kernel-based
statistical tests to ChIP-Seq data.

\section{Analysis pipeline}

Suppose we have performed an experiment to study the binding of a
certain transcription factor or the occurrence of a certain histone
modification. We obtained $n_{Samples 1}$ ChIP-Seq data sets for a
control group and $n_{Samples 2}$ ChIP-Seq data sets for a treatment
group. On each of the sets we have run a peak finder, e.g. Macs
\citep{Macs}, to determine the regions of read enrichment,
i.e. binding sites. Eventually, we came up with a set of $n_{Peaks}$
regions, which we will examine across all data sets. The task now will
be to determine the set of peaks that have significantly changed read
coverage between the control group and the treatment group. A change
in this context may correspond to a difference in overall binding
(i.e. a change in the total number of reads mapping to the region
after normalisation) and/or to a change that affects the shape of the
peak. The first kind of changes can be detected with packages like
\DBA~\citep{DiffBind,DiffBind2}. Here, we will be most interested in
the second type of changes. However, this package is compatible with
DiffBind and results can easily be compared.


As an example we use data from \cite{Cfp1Data}. It examines the role
of CxxC finger protein 1 (Cfp1) in the establishment of Trimethylation
of histone H3 Lys 4 (H3K4me3). The experiment consists of H3K4me3
ChIP-Seq measurements from three different cell lines: (1) a wild-type
mouse embryonic stem cell line (WT), (2) a mutant line lacking the
protein Cfp1 (Null) and (3) a rescue cell line where Cfp1 was
re-inserted into the genome (Resc). Cfp1 is known to be a conserved
DNA-binding subunit of the H3K4 histone methyltransferase Set1
complex. It is therefore expected that H3K4me3 is reduced in the Null
cells. However, as the H3K4 histone methyltransferase activity is
redundantly encoded in at least six different complexes in mammals,
the precise target regions of Cfp1 are unknown. In addition, under the
assumption that the different enzymes potentially act cooperatively at
the same target regions, it is to be expected that binding is not
completely abolished at these regions but rather reduced, potentially
leading to altered H3K4me3 peak profiles. In the rescue cell line 'normal'
H3K4me3 levels should be observed and it will therefore be treated as
a replicate of the WT cell lines.

We have aligned the short reads from the ChIP-Seq experiments using BWA
\citep{BWA}, creating bam files. Subsequently, we run Macs \citep{Macs}, on
each bam file creating Peak lists containing the coordinates of enriched
regions. The complete bam files are available as part of ArrayExpress
Experiment E-ERAD-79. To run the following examples, we provide subsets of
the BAM files with reads mapping to chr1:3000000...75000000 in the data
package MMDiffBamSubset. Due to space limitation these files include only
one replicate for each cell line (WT, Resc, Null) as well as a control input
sample. The package MMDiffBamSubset also contains a corresponding subset of
the called peaks.  In analogy to DiffBind, information on the experiment is
stored in a csv sample sheet. In the case of our example, it is called
Cfp1.csv (\tab{cfp1}) and it is also available with the provided data
package. Among other information the sample sheet contains for each sample
the path to the sample bam files and to the peak files.


\begin{table}[!h]\scriptsize
  \centering
  \caption{Cfp1 dataset sample sheet (Cfp1.csv).}
  \begin{tabular}{|c|c|c|c|c|c|c|c|c|}
    \hline
    SampleID& Tissue& Factor& Condition& Replicate& bamReads& bamControl& Peaks& PeakCaller \\ \hline

    WT.AB2&WT&H3K4me3&1&2&reads/WT\_2.bam&reads/Input.bam&peaks/WT\_2\_Macs.xls&macs\\
    Null.AB2&Null&H3K4me3&2&2&reads/Null\_2.bam&reads/Input.bam&peaks/Null\_2\_Macs.xls&macs\\
    Resc.AB2&Resc&H3K4me3&1&2&reads/Resc\_2.bam&reads/Input.bam&peaks/Resc\_2\_Macs.xls&macs\\

    \hline
  \end{tabular}
  \label{tab:cfp1}
\end{table}


<<eval=TRUE,echo=FALSE,results=hide>>=
tmp <-  tempfile(as.character(Sys.getpid()))
pdf(tmp)
savewarn <- options("warn")
options(warn=-1)
savewd <- getwd()
@


\begin{figure}
  \centering
  \includegraphics[width=14cm]{ExamplePeak419.pdf}
  \caption{Example Peak: H3K4me3 read enrichment profiles at chromosome
    1 :36124037-36133089. The Null sample shows strong signal
    depletion between 36126000 and 36128000 when compared to the WT
    and Rescue samples.
  }
  \label{fig:Example}
\end{figure}

Our analysis pipeline consists of 5 steps, which will be described in
more detail in the following subsections.

\begin{enumerate}
\item {\bf Building histograms}
\item {\bf Outlier detection and normalisation}
\item {\bf Computing of distances between histograms}
\item {\bf Determination of p values}
\item {\bf Analysis of results}
\end{enumerate}

\subsection{Step 1:  Building histograms}


The experiment details along with the provided peak sets detected by
Macs are read in using the following \DBA~function:

<<echo=FALSE,results=hide>>=
MMDiffBS_extdata <- system.file("extdata", package="MMDiffBamSubset")
leftoverPlots <- dir(MMDiffBS_extdata, pattern=glob2rx("Rplots*.pdf"))
if(length(leftoverPlots) > 0) {
    try({
        file.remove(paste(MMDiffBS_extdata, leftoverPlots,
                          sep=.Platform$file.sep))
    }, silent=TRUE)
}
@

<<eval=TRUE,echo=TRUE,results=verbatim>>=
library('MMDiff')
library('MMDiffBamSubset')
oldwd <- setwd(system.file("extdata", package="MMDiffBamSubset"))
Cfp1 <- dba(sampleSheet="Cfp1.csv", minOverlap=3)
@

\noindent As a result, an S3 object (class DBA) is created. This class is
defined in the \DBA package and additional information can be found in the
\DBA vignette. An overview over methods for this class can be viewed by
typing \texttt{?summary.DBA}. This DBA object will form the basic data
structure for the following analysis. However, we will extend the original
DBA object by an element called \texttt{MD} (see below). Note, that we
called \texttt{dba()} such that a consensus peak set is created consisting
of regions which occur in at least \texttt{minOverlap=3} samples. Therefore
out of a total of 7370 peaks 2466 regions are selected:


<<eval=TRUE,echo=TRUE,results=verbatim>>=
Cfp1
@

\noindent Next, we will specify the regions of interest, i.e. the
peaks, with a "GRanges" Object. For this example we will examine the
first 1000 consensus peaks:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
Peaks <- dba.peakset(Cfp1,bRetrieve=TRUE)
Peaks <- Peaks[1:1000]
@

\noindent Note, that we don't have to use the consensus peaks but we
could chose an arbitrary set of regions, for example, these 200 100bp
consecutive regions on chromosome 1:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
Peaks.2 <- GRanges(seqnames = Rle('chr1'),
                 ranges = IRanges(start=seq(3200000,3219900,100), 
                                  width=100))
@

\noindent We now want to create read count profiles across each peak
for each sample. To do so, we call the \MD~function
\texttt{getPeakProfiles()}, which uses Rsamtools to collect reads
mapping to a certain region in a bam file. Additionally, strand shifts
between forward and reverse strand are determined and corrected for and
reads in each region are binned (default bin.length = 20bp). As a
result, histograms are obtained for each sample and peak.

To correct for the strand shift, only peaks are selected whose total
number of reads mapping to each strand is in the 9th decile. If
draw.on=TRUE, a plot is generated for each bam file (all bamRead and
bamControl files), showing smoothscatter plots of total number of
reads mapping to the peaks on forward vs reverse strand, see
\fig{Strands}. This can be used as a quality control (Points should
lie on the diagonal). The peaks used to determine the strand shift are
shown in red. For each of the selected peaks the shift between forward
and reverse strand is determined using the cross-correlation function
ccf. If draw.on=TRUE, histograms are plotted for each bam file,
showing the distribution of shifts (\fig{StrandShifts}). The median is
used to correct all reads mapping to any peak in the respective bam
file. (Note, the shifts can vary between samples i.e. different bam
files.)


\begin{figure}
  \centering
  \includegraphics[width=14cm]{forwardVSReverse_WT_2.pdf}
  \caption{Smooth scatter plot of total number of
reads mapping to the peaks on forward vs reverse strand. The red
circles indicate peaks used to determine the strand shift.
  }
  \label{fig:Strands}
\end{figure}


\begin{figure}
  \centering
  \includegraphics[width=14cm]{strandShifts_WT_2.pdf}
  \caption{Histogram of determined strand shifts. Red line indicates
    the median which is used for correction.
  }
  \label{fig:StrandShifts}
\end{figure}


<<eval=TRUE,echo=TRUE,results=hide>>=

Cfp1Profiles <- getPeakProfiles(Cfp1, Peaks,
                                bin.length=50, 
                                save.files=FALSE,
                                run.parallel=FALSE)
setwd(oldwd)
@


This object is available for loading using \texttt{data(Cfp1Profiles)}. The
original DBA object has been extended by an element called \texttt{MD},
which is a list containing elements such as \texttt{RawTotalCounts} and
\texttt{PeakRawHists}:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
names(Cfp1Profiles$MD)
@

\noindent \texttt{RawTotalCounts} is a simple matrix ($n_{Samples}$ x
$n_{Peaks}$) containing the total number of reads found in the
corresponding bam file mapping to a given peak.


\texttt{PeakRawHists} is a list with one element per peak. The name of the
elements are the peak coordinates.

<<eval=TRUE,echo=TRUE,results=verbatim>>=
PeakRawHists <- Cfp1Profiles$MD$PeakRawHists
names(PeakRawHists)[1:10]
@

\noindent For each peak, a matrix can be accessed which contains
histograms for each sample, e.g. for peak 5 ("chr1:3842066-3843373"),
this is a 4 by 24 matrix, containing counts on 24 bins for 4 samples
(WT, Resc, Null, Input):

<<eval=TRUE,echo=TRUE,results=verbatim>>=
peak.id <- 5
dim(PeakRawHists[[peak.id]])
@

\noindent The column names of the matrix correspond to histogram
mid points (genomic coordinates), and the row names signify the bam file
from which the histogram was generated. Note that histogram length
can vary between different peaks, as the peaks are not necessary of
the same length.

<<eval=TRUE,echo=TRUE,results=verbatim>>=
colnames(PeakRawHists[[peak.id]])
rownames(PeakRawHists[[peak.id]])
@

\noindent Profiles for a given peak (e.g. peak 33) can be plotted like this:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
Peak.id <- 33
Sample.ids <- c("WT.AB2", "Null.AB2", "Resc.AB2")
plotPeak(Cfp1Profiles, Peak.id, Sample.ids, NormMethod=NULL)
@


\noindent Note, additional information can be kept if
\texttt{getPeakProfiles()} is called with the parameter \texttt{keep.extra =
  TRUE}. In this case an extra element \texttt{RawHists} is appended to the
list \texttt{MD}, which contains a list for each bam file:

\noindent For instance:

<<eval=TRUE,echo=TRUE,results=hide>>=
oldwd <- setwd(system.file("extdata", package="MMDiffBamSubset"))
Cfp1Profiles2 <- getPeakProfiles(Cfp1, Peaks, 
                                 bin.length=50,
                                 save.files=FALSE, 
                                 keep.extra=TRUE)
names(Cfp1Profiles2$MD$RawHists$WT_2)
setwd(oldwd)
@

\noindent where "Counts" contains the histogram counts, "Mids" the histogram
mid points, "Counts.p" and "Counts.n" histograms for the forward and reverse
strand histograms, respectively. "Meta" contains meta information such as
the applied shift and bin length.


\subsection{Step 2: Normalisation}

So far we have looked at raw read counts. However, each sample may
have been sequenced to a different depth and in order to make samples
comparable, they have to be normalised. Different normalisation
strategies have been suggested, here we use the method proposed by
\cite{DESeq}: Briefly, a size factor estimate for data set $s$ is
computed as the median of the ratios of the s-th data set's counts to
those of a pseudo-reference obtained by taking the geometric mean
across data sets. Note, that under certain situations only a subset of
samples should be normalised with respect to each other (e.g. when
groups of the samples have different signal-to-noise ratios, one might
want to analyse the groups independently.). In this case we can
specify which samples should be used. It is also possible to specify a
subset of peaks that are used to determine the normalisation
factors. For example, it might be necessary to determine outliers with
extreme total counts to be excluded. The function
\texttt{findOutliers} can be used for this task (see the example in
the man page).


<<eval=TRUE,echo=TRUE,results=verbatim>>=
SampleIDs <- c("WT.AB2", "Null.AB2", "Resc.AB2")
Cfp1Norm <- getNormFactors(Cfp1Profiles, 
                           method = "DESeq", 
                           SampleIDs=SampleIDs)
@

\noindent The estimated factors can be accessed as:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
Cfp1Norm$MD$NormFactors$DESeq
@

\noindent Additionally, NormTotalCounts is appended which contains
normalised total counts (RawTotalCounts/NormFactors) for all specified
samples (all others are set to 0).

\noindent Note, that histograms (e.g. \texttt{PeakRawHists}) are not
normalised and normalisation factors have to be specified in the subsequent
steps.

\noindent To plot normalised peaks, e.g. peak 33, use:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
Peak.id <- 33
Sample.ids <- c("WT.AB2", "Null.AB2", "Resc.AB2")
plotPeak(Cfp1Norm, Peak.id, Sample.ids, NormMethod='DESeq')
@

\subsection{Step 3: Determine distances between histograms}

Next, we want to determine a distance measure for each peak comparing WT,
Resc, and Null, i.e. for each peak we need the three pairwise distances: 'WT
vs Null', 'WT vs Resc' and 'Resc vs Null'.

When running the \MD function
\texttt{compHistDists()}, we can specify which 'distance measure' we want to
compute. We recommend the default method 'MMD' \citep{MMD}, which computes
the maximum mean discrepancy between pairs of
histograms \citep{MMD}. Alternatively, \texttt{method=GMD} (Generalized
Minimum Distances) \citep{GMD} can be used, which we will use here, because
it is faster:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
Cfp1Dists <- compHistDists(Cfp1Norm, method='GMD',
   overWrite=FALSE, NormMethod='DESeq')
@


\noindent We could also specify the comparisons directly, for example:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
SampleIDs <- Cfp1Norm$samples$SampleID
ID <- which(upper.tri(matrix(1, length(SampleIDs), length(SampleIDs)),
                      diag = F), arr.ind=T)
CompIDs <- rbind(SampleIDs[ID[,1]], SampleIDs[ID[,2]])
CompIDs
@

\noindent and run the function \texttt{compHistDists}:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
Cfp1Dists <- compHistDists(Cfp1Norm, method='GMD', CompIDs=CompIDs,
                           overWrite=FALSE, NormMethod='DESeq')
@


As a result, \texttt{Cfp1\$MD} is appended with a new element
called \texttt{DISTS}. In \texttt{DISTS\$GMD} you will find a matrix of
dimension ($n_{Peaks}$ x $n_{Comps}$) containing all pairwise distances
for each peak:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
Cfp1Dists$MD$DISTS$GMD[1:10,]
@

\noindent We provide \texttt{Cfp1Dists}, which was created running
\texttt{compHistDists} three times, once with each methods.

<<eval=TRUE,echo=TRUE,results=verbatim>>=
data(Cfp1Dists)
names(Cfp1Dists$MD$DISTS)
@


\subsection{Step 4: Determine p-values}

Finally we want to detect peaks that are different in a treatment
group relative to a control group. In this example, we only have one
replicate for the treatment group (Null.AB2) and we will use WT.AB2
and Resc.AB2 as replicates for the control group. We compute empirical
p-values pooling peaks with similar mean total counts. The p-values
are adjusted for multiple testing with the method by Benjamini and
Hochberg (1995).

<<eval=TRUE,echo=TRUE,results=verbatim>>=
data(Cfp1Dists)
group1 <- c("WT.AB2","Resc.AB2")
group2 <- c("Null.AB2")
Cfp1Pvals <- detPeakPvals(Cfp1Dists, group1=group1, group2=group2,
                    name1='Wt/Resc', name2='Null', method='MMD')

Cfp1Pvals <- detPeakPvals(Cfp1Pvals, group1=group1, group2=group2,
                    name1='Wt/Resc', name2='Null', method='GMD')

Cfp1Pvals <- detPeakPvals(Cfp1Pvals, group1=group1, group2=group2,
                    name1='Wt/Resc', name2='Null', method='Pearson')
@


\subsection{Step 5: Analysis results}

The indices of peaks with a p-value $<0.05$ can be obtained, by:

<<eval=TRUE,echo=TRUE,results=verbatim>>=

idxMMD <- which(Cfp1Pvals$MD$Pvals$MMD<0.05)
@

\noindent and similarly for the other methods:

<<eval=TRUE,echo=TRUE,results=verbatim>>=
idxGMD <- which(Cfp1Pvals$MD$Pvals$GMD<0.05)
idxPearson <- which(Cfp1Pvals$MD$Pvals$Pearson<0.05)
@

\noindent The coordinates of the peaks can be obtained by:
<<eval=TRUE,echo=TRUE,results=verbatim>>=
rownames(Cfp1Pvals$MD$Pvals$MMD)[idxMMD[1:10]]
@


\noindent For sanity checks, computed distances can be plotted as a
function of mean total count (\fig{MMDs}- \ref{fig:Pearson}):

<<eval=TRUE,echo=TRUE,results=verbatim>>=
group1 <- c("WT.AB2", "Resc.AB2")
group2 <- c("Null.AB2")

plotHistDists(Cfp1Pvals, group1=group1, group2=group2, method='MMD')
plotHistDists(Cfp1Pvals, group1=group1, group2=group2, method='GMD')
plotHistDists(Cfp1Pvals, group1=group1, group2=group2, method='Pearson')
@

\noindent Note, that with MMD, one can observe a number of peaks which have
higher distances in the between group comparison than any of the with-in
group distances. If we assume that the within group-distances give us a good
estimate of naturally occurring variation between peaks under the same
condition, these are likely to be peaks that are truly affected when Cfp1
is knocked out. In this example, MMD therefore seems to be best suited to
detect target regions of Cfp1.


\begin{figure}
  \centering
  \includegraphics[width=14cm]{MMDDists.pdf}
  \caption{MMD based distances as a function of averaged normalised
    total counts. Shown are between-group distances (i.e. Wt/Resc vs
    Null), where each black cross corresponds to a peak. In addition,
    within-group distances (i.e. between WT and Resc) are overlayed
    (blue dots). Differentially called peaks with large enough
    between-group distances are additionally labelled with red
    circles.}
  \label{fig:MMDs}
\end{figure}

\begin{figure}
  \centering
  \includegraphics[width=14cm]{GMDDists.pdf}
  \caption{GMD based distances as a function of averaged normalised
    total counts.}
  \label{fig:GMDs}
\end{figure}


\begin{figure}
  \centering
  \includegraphics[width=14cm]{PearsonDists.pdf}
  \caption{Pearson correlation as a function of averaged normalised
    total counts}
  \label{fig:Pearson}
\end{figure}


\section{Acknowledgements}

The package was developed at the University of Edinburgh, School of
Informatics and the Wellcome Trust Centre for Cell Biology with
support from Guido Sanguinetti and other members of his group. The
data used to illustrate the usage of the package was kindly provided
by Thomas Clouaire and Adrian Bird. System administration and
computing support was provided by Shaun Webb and Alastair Kerr. Arthur
Gretton was very helpful in discussing MMD-based statistical testing
and Rory Stark gave valuable insights into ChIP-Seq data analysis and
his package \DBA.

The research leading to these results has received funding from the
People Program (Marie Curie Actions) of the European Union's Seventh
Framework Programme (FP7/2007-2013) under REA grant agreement number
$n^o$ PIEF-GA-2011-299192.


\section{Setup}

This vignette was built on:

<<setup,eval=TRUE,echo=TRUE,results=verbatim>>=
sessionInfo()
@

<<eval=TRUE,echo=FALSE,results=hide>>=
dev.off()
unlink(tmp)
setwd(savewd)
@

\bibliographystyle{plainnat}
\bibliography{MMDiff_bib}

\end{document}