%\VignetteIndexEntry{Analysing ChIP-Seq data with the "MMDiff" package} %\VignettePackage{MMDiff} \documentclass[12pt]{article} \topmargin 0in \headheight 0in \headsep 0in \oddsidemargin 0in \evensidemargin 0in \textwidth 176mm \textheight 215mm \usepackage{natbib} \usepackage[english]{babel} \usepackage{Sweave} %\usepackage{url} \usepackage{subfig} %\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png} \newcommand{\DBA}{\textsf{DiffBind}} \newcommand{\edgeR}{\texttt{edgeR}} \newcommand{\DESeq}{\texttt{DESeq}} \newcommand{\MD}{\texttt{MMDiff}} \newcommand{\code}[1]{{\small\texttt{#1}}} \newcommand{\R}{\textsf{R}} \newcommand{\tab}[1]{Table \ref{tab:#1}} \newcommand{\fig}[1]{Figure \ref{fig:#1}} \begin{document} \SweaveOpts{concordance=TRUE} \title{\MD: Statistical Testing for ChIP-Seq data sets} \author{Gabriele Schweikert \\ \texttt{G.Schweikert@ed.ac.uk}\\ University of Edinburgh, UK\\ } \date{1 October 2012} \maketitle \section{Introduction} ChIP-Seq has rapidly become the dominant experimental technique to determine the location of transcription factor binding sites and histone modifications. Typically, computational peak finders, such as Macs \citep{Macs}, are used to identify potential candidate regions, i.e. regions with significantly enriched read coverage compared to some background. In the following, we will simply call these regions {\it peaks} and will assume that their {\it genomic coordinates} are provided. Going beyond this basic analysis, it is often of interest to detect a subset of peaks where significant {\it changes of read coverage} occur in a treatment experiment relative to a control (see \fig{Example}). Statistical analysis of ChIP-Seq data however remains challenging, due to the highly structured nature of the data and the paucity of replicates. Current approaches to detect differentially bound regions are mainly borrowed from RNA-Seq data analysis, thus focusing on total counts of fragments mapped to a region, ignoring any information encoded in the shape of the peak profile. Higher order features of ChIP-Seq peak enrichment profiles carry important and often complementary information to total counts, and hence are potentially important in assessing differential binding. We therefore incorporate higher order information into testing for differential binding by adapting recently proposed kernel-based statistical tests to ChIP-Seq data. \section{Analysis pipeline} Suppose we have performed an experiment to study the binding of a certain transcription factor or the occurrence of a certain histone modification. We obtained $n_{Samples 1}$ ChIP-Seq data sets for a control group and $n_{Samples 2}$ ChIP-Seq data sets for a treatment group. On each of the sets we have run a peak finder, e.g. Macs \citep{Macs}, to determine the regions of read enrichment, i.e. binding sites. Eventually, we came up with a set of $n_{Peaks}$ regions, which we will examine across all data sets. The task now will be to determine the set of peaks that have significantly changed read coverage between the control group and the treatment group. A change in this context may correspond to a difference in overall binding (i.e. a change in the total number of reads mapping to the region after normalisation) and/or to a change that affects the shape of the peak. The first kind of changes can be detected with packages like \DBA~\citep{DiffBind,DiffBind2}. Here, we will be most interested in the second type of changes. However, this package is compatible with DiffBind and results can easily be compared. As an example we use data from \cite{Cfp1Data}. It examines the role of CxxC finger protein 1 (Cfp1) in the establishment of Trimethylation of histone H3 Lys 4 (H3K4me3). The experiment consists of H3K4me3 ChIP-Seq measurements from three different cell lines: (1) a wild-type mouse embryonic stem cell line (WT), (2) a mutant line lacking the protein Cfp1 (Null) and (3) a rescue cell line where Cfp1 was re-inserted into the genome (Resc). Cfp1 is known to be a conserved DNA-binding subunit of the H3K4 histone methyltransferase Set1 complex. It is therefore expected that H3K4me3 is reduced in the Null cells. However, as the H3K4 histone methyltransferase activity is redundantly encoded in at least six different complexes in mammals, the precise target regions of Cfp1 are unknown. In addition, under the assumption that the different enzymes potentially act cooperatively at the same target regions, it is to be expected that binding is not completely abolished at these regions but rather reduced, potentially leading to altered H3K4me3 peak profiles. In the rescue cell line 'normal' H3K4me3 levels should be observed and it will therefore be treated as a replicate of the WT cell lines. We have aligned the short reads from the ChIP-Seq experiments using BWA \citep{BWA}, creating bam files. Subsequently, we run Macs \citep{Macs}, on each bam file creating Peak lists containing the coordinates of enriched regions. The complete bam files are available as part of ArrayExpress Experiment E-ERAD-79. To run the following examples, we provide subsets of the BAM files with reads mapping to chr1:3000000...75000000 in the data package MMDiffBamSubset. Due to space limitation these files include only one replicate for each cell line (WT, Resc, Null) as well as a control input sample. The package MMDiffBamSubset also contains a corresponding subset of the called peaks. In analogy to DiffBind, information on the experiment is stored in a csv sample sheet. In the case of our example, it is called Cfp1.csv (\tab{cfp1}) and it is also available with the provided data package. Among other information the sample sheet contains for each sample the path to the sample bam files and to the peak files. \begin{table}[!h]\scriptsize \centering \caption{Cfp1 dataset sample sheet (Cfp1.csv).} \begin{tabular}{|c|c|c|c|c|c|c|c|c|} \hline SampleID& Tissue& Factor& Condition& Replicate& bamReads& bamControl& Peaks& PeakCaller \\ \hline WT.AB2&WT&H3K4me3&1&2&reads/WT\_2.bam&reads/Input.bam&peaks/WT\_2\_Macs.xls&macs\\ Null.AB2&Null&H3K4me3&2&2&reads/Null\_2.bam&reads/Input.bam&peaks/Null\_2\_Macs.xls&macs\\ Resc.AB2&Resc&H3K4me3&1&2&reads/Resc\_2.bam&reads/Input.bam&peaks/Resc\_2\_Macs.xls&macs\\ \hline \end{tabular} \label{tab:cfp1} \end{table} <>= tmp <- tempfile(as.character(Sys.getpid())) pdf(tmp) savewarn <- options("warn") options(warn=-1) savewd <- getwd() @ \begin{figure} \centering \includegraphics[width=14cm]{ExamplePeak419.pdf} \caption{Example Peak: H3K4me3 read enrichment profiles at chromosome 1 :36124037-36133089. The Null sample shows strong signal depletion between 36126000 and 36128000 when compared to the WT and Rescue samples. } \label{fig:Example} \end{figure} Our analysis pipeline consists of 5 steps, which will be described in more detail in the following subsections. \begin{enumerate} \item {\bf Building histograms} \item {\bf Outlier detection and normalisation} \item {\bf Computing of distances between histograms} \item {\bf Determination of p values} \item {\bf Analysis of results} \end{enumerate} \subsection{Step 1: Building histograms} The experiment details along with the provided peak sets detected by Macs are read in using the following \DBA~function: <>= MMDiffBS_extdata <- system.file("extdata", package="MMDiffBamSubset") leftoverPlots <- dir(MMDiffBS_extdata, pattern=glob2rx("Rplots*.pdf")) if(length(leftoverPlots) > 0) { try({ file.remove(paste(MMDiffBS_extdata, leftoverPlots, sep=.Platform$file.sep)) }, silent=TRUE) } @ <>= library('MMDiff') library('MMDiffBamSubset') oldwd <- setwd(system.file("extdata", package="MMDiffBamSubset")) Cfp1 <- dba(sampleSheet="Cfp1.csv", minOverlap=3) @ \noindent As a result, an S3 object (class DBA) is created. This class is defined in the \DBA package and additional information can be found in the \DBA vignette. An overview over methods for this class can be viewed by typing \texttt{?summary.DBA}. This DBA object will form the basic data structure for the following analysis. However, we will extend the original DBA object by an element called \texttt{MD} (see below). Note, that we called \texttt{dba()} such that a consensus peak set is created consisting of regions which occur in at least \texttt{minOverlap=3} samples. Therefore out of a total of 7370 peaks 2466 regions are selected: <>= Cfp1 @ \noindent Next, we will specify the regions of interest, i.e. the peaks, with a "GRanges" Object. For this example we will examine the first 1000 consensus peaks: <>= Peaks <- dba.peakset(Cfp1,bRetrieve=TRUE) Peaks <- Peaks[1:1000] @ \noindent Note, that we don't have to use the consensus peaks but we could chose an arbitrary set of regions, for example, these 200 100bp consecutive regions on chromosome 1: <>= Peaks.2 <- GRanges(seqnames = Rle('chr1'), ranges = IRanges(start=seq(3200000,3219900,100), width=100)) @ \noindent We now want to create read count profiles across each peak for each sample. To do so, we call the \MD~function \texttt{getPeakProfiles()}, which uses Rsamtools to collect reads mapping to a certain region in a bam file. Additionally, strand shifts between forward and reverse strand are determined and corrected for and reads in each region are binned (default bin.length = 20bp). As a result, histograms are obtained for each sample and peak. To correct for the strand shift, only peaks are selected whose total number of reads mapping to each strand is in the 9th decile. If draw.on=TRUE, a plot is generated for each bam file (all bamRead and bamControl files), showing smoothscatter plots of total number of reads mapping to the peaks on forward vs reverse strand, see \fig{Strands}. This can be used as a quality control (Points should lie on the diagonal). The peaks used to determine the strand shift are shown in red. For each of the selected peaks the shift between forward and reverse strand is determined using the cross-correlation function ccf. If draw.on=TRUE, histograms are plotted for each bam file, showing the distribution of shifts (\fig{StrandShifts}). The median is used to correct all reads mapping to any peak in the respective bam file. (Note, the shifts can vary between samples i.e. different bam files.) \begin{figure} \centering \includegraphics[width=14cm]{forwardVSReverse_WT_2.pdf} \caption{Smooth scatter plot of total number of reads mapping to the peaks on forward vs reverse strand. The red circles indicate peaks used to determine the strand shift. } \label{fig:Strands} \end{figure} \begin{figure} \centering \includegraphics[width=14cm]{strandShifts_WT_2.pdf} \caption{Histogram of determined strand shifts. Red line indicates the median which is used for correction. } \label{fig:StrandShifts} \end{figure} <>= Cfp1Profiles <- getPeakProfiles(Cfp1, Peaks, bin.length=50, save.files=FALSE, run.parallel=FALSE) setwd(oldwd) @ This object is available for loading using \texttt{data(Cfp1Profiles)}. The original DBA object has been extended by an element called \texttt{MD}, which is a list containing elements such as \texttt{RawTotalCounts} and \texttt{PeakRawHists}: <>= names(Cfp1Profiles$MD) @ \noindent \texttt{RawTotalCounts} is a simple matrix ($n_{Samples}$ x $n_{Peaks}$) containing the total number of reads found in the corresponding bam file mapping to a given peak. \texttt{PeakRawHists} is a list with one element per peak. The name of the elements are the peak coordinates. <>= PeakRawHists <- Cfp1Profiles$MD$PeakRawHists names(PeakRawHists)[1:10] @ \noindent For each peak, a matrix can be accessed which contains histograms for each sample, e.g. for peak 5 ("chr1:3842066-3843373"), this is a 4 by 24 matrix, containing counts on 24 bins for 4 samples (WT, Resc, Null, Input): <>= peak.id <- 5 dim(PeakRawHists[[peak.id]]) @ \noindent The column names of the matrix correspond to histogram mid points (genomic coordinates), and the row names signify the bam file from which the histogram was generated. Note that histogram length can vary between different peaks, as the peaks are not necessary of the same length. <>= colnames(PeakRawHists[[peak.id]]) rownames(PeakRawHists[[peak.id]]) @ \noindent Profiles for a given peak (e.g. peak 33) can be plotted like this: <>= Peak.id <- 33 Sample.ids <- c("WT.AB2", "Null.AB2", "Resc.AB2") plotPeak(Cfp1Profiles, Peak.id, Sample.ids, NormMethod=NULL) @ \noindent Note, additional information can be kept if \texttt{getPeakProfiles()} is called with the parameter \texttt{keep.extra = TRUE}. In this case an extra element \texttt{RawHists} is appended to the list \texttt{MD}, which contains a list for each bam file: \noindent For instance: <>= oldwd <- setwd(system.file("extdata", package="MMDiffBamSubset")) Cfp1Profiles2 <- getPeakProfiles(Cfp1, Peaks, bin.length=50, save.files=FALSE, keep.extra=TRUE) names(Cfp1Profiles2$MD$RawHists$WT_2) setwd(oldwd) @ \noindent where "Counts" contains the histogram counts, "Mids" the histogram mid points, "Counts.p" and "Counts.n" histograms for the forward and reverse strand histograms, respectively. "Meta" contains meta information such as the applied shift and bin length. \subsection{Step 2: Normalisation} So far we have looked at raw read counts. However, each sample may have been sequenced to a different depth and in order to make samples comparable, they have to be normalised. Different normalisation strategies have been suggested, here we use the method proposed by \cite{DESeq}: Briefly, a size factor estimate for data set $s$ is computed as the median of the ratios of the s-th data set's counts to those of a pseudo-reference obtained by taking the geometric mean across data sets. Note, that under certain situations only a subset of samples should be normalised with respect to each other (e.g. when groups of the samples have different signal-to-noise ratios, one might want to analyse the groups independently.). In this case we can specify which samples should be used. It is also possible to specify a subset of peaks that are used to determine the normalisation factors. For example, it might be necessary to determine outliers with extreme total counts to be excluded. The function \texttt{findOutliers} can be used for this task (see the example in the man page). <>= SampleIDs <- c("WT.AB2", "Null.AB2", "Resc.AB2") Cfp1Norm <- getNormFactors(Cfp1Profiles, method = "DESeq", SampleIDs=SampleIDs) @ \noindent The estimated factors can be accessed as: <>= Cfp1Norm$MD$NormFactors$DESeq @ \noindent Additionally, NormTotalCounts is appended which contains normalised total counts (RawTotalCounts/NormFactors) for all specified samples (all others are set to 0). \noindent Note, that histograms (e.g. \texttt{PeakRawHists}) are not normalised and normalisation factors have to be specified in the subsequent steps. \noindent To plot normalised peaks, e.g. peak 33, use: <>= Peak.id <- 33 Sample.ids <- c("WT.AB2", "Null.AB2", "Resc.AB2") plotPeak(Cfp1Norm, Peak.id, Sample.ids, NormMethod='DESeq') @ \subsection{Step 3: Determine distances between histograms} Next, we want to determine a distance measure for each peak comparing WT, Resc, and Null, i.e. for each peak we need the three pairwise distances: 'WT vs Null', 'WT vs Resc' and 'Resc vs Null'. When running the \MD function \texttt{compHistDists()}, we can specify which 'distance measure' we want to compute. We recommend the default method 'MMD' \citep{MMD}, which computes the maximum mean discrepancy between pairs of histograms \citep{MMD}. Alternatively, \texttt{method=GMD} (Generalized Minimum Distances) \citep{GMD} can be used, which we will use here, because it is faster: <>= Cfp1Dists <- compHistDists(Cfp1Norm, method='GMD', overWrite=FALSE, NormMethod='DESeq') @ \noindent We could also specify the comparisons directly, for example: <>= SampleIDs <- Cfp1Norm$samples$SampleID ID <- which(upper.tri(matrix(1, length(SampleIDs), length(SampleIDs)), diag = F), arr.ind=T) CompIDs <- rbind(SampleIDs[ID[,1]], SampleIDs[ID[,2]]) CompIDs @ \noindent and run the function \texttt{compHistDists}: <>= Cfp1Dists <- compHistDists(Cfp1Norm, method='GMD', CompIDs=CompIDs, overWrite=FALSE, NormMethod='DESeq') @ As a result, \texttt{Cfp1\$MD} is appended with a new element called \texttt{DISTS}. In \texttt{DISTS\$GMD} you will find a matrix of dimension ($n_{Peaks}$ x $n_{Comps}$) containing all pairwise distances for each peak: <>= Cfp1Dists$MD$DISTS$GMD[1:10,] @ \noindent We provide \texttt{Cfp1Dists}, which was created running \texttt{compHistDists} three times, once with each methods. <>= data(Cfp1Dists) names(Cfp1Dists$MD$DISTS) @ \subsection{Step 4: Determine p-values} Finally we want to detect peaks that are different in a treatment group relative to a control group. In this example, we only have one replicate for the treatment group (Null.AB2) and we will use WT.AB2 and Resc.AB2 as replicates for the control group. We compute empirical p-values pooling peaks with similar mean total counts. The p-values are adjusted for multiple testing with the method by Benjamini and Hochberg (1995). <>= data(Cfp1Dists) group1 <- c("WT.AB2","Resc.AB2") group2 <- c("Null.AB2") Cfp1Pvals <- detPeakPvals(Cfp1Dists, group1=group1, group2=group2, name1='Wt/Resc', name2='Null', method='MMD') Cfp1Pvals <- detPeakPvals(Cfp1Pvals, group1=group1, group2=group2, name1='Wt/Resc', name2='Null', method='GMD') Cfp1Pvals <- detPeakPvals(Cfp1Pvals, group1=group1, group2=group2, name1='Wt/Resc', name2='Null', method='Pearson') @ \subsection{Step 5: Analysis results} The indices of peaks with a p-value $<0.05$ can be obtained, by: <>= idxMMD <- which(Cfp1Pvals$MD$Pvals$MMD<0.05) @ \noindent and similarly for the other methods: <>= idxGMD <- which(Cfp1Pvals$MD$Pvals$GMD<0.05) idxPearson <- which(Cfp1Pvals$MD$Pvals$Pearson<0.05) @ \noindent The coordinates of the peaks can be obtained by: <>= rownames(Cfp1Pvals$MD$Pvals$MMD)[idxMMD[1:10]] @ \noindent For sanity checks, computed distances can be plotted as a function of mean total count (\fig{MMDs}- \ref{fig:Pearson}): <>= group1 <- c("WT.AB2", "Resc.AB2") group2 <- c("Null.AB2") plotHistDists(Cfp1Pvals, group1=group1, group2=group2, method='MMD') plotHistDists(Cfp1Pvals, group1=group1, group2=group2, method='GMD') plotHistDists(Cfp1Pvals, group1=group1, group2=group2, method='Pearson') @ \noindent Note, that with MMD, one can observe a number of peaks which have higher distances in the between group comparison than any of the with-in group distances. If we assume that the within group-distances give us a good estimate of naturally occurring variation between peaks under the same condition, these are likely to be peaks that are truly affected when Cfp1 is knocked out. In this example, MMD therefore seems to be best suited to detect target regions of Cfp1. \begin{figure} \centering \includegraphics[width=14cm]{MMDDists.pdf} \caption{MMD based distances as a function of averaged normalised total counts. Shown are between-group distances (i.e. Wt/Resc vs Null), where each black cross corresponds to a peak. In addition, within-group distances (i.e. between WT and Resc) are overlayed (blue dots). Differentially called peaks with large enough between-group distances are additionally labelled with red circles.} \label{fig:MMDs} \end{figure} \begin{figure} \centering \includegraphics[width=14cm]{GMDDists.pdf} \caption{GMD based distances as a function of averaged normalised total counts.} \label{fig:GMDs} \end{figure} \begin{figure} \centering \includegraphics[width=14cm]{PearsonDists.pdf} \caption{Pearson correlation as a function of averaged normalised total counts} \label{fig:Pearson} \end{figure} \section{Acknowledgements} The package was developed at the University of Edinburgh, School of Informatics and the Wellcome Trust Centre for Cell Biology with support from Guido Sanguinetti and other members of his group. The data used to illustrate the usage of the package was kindly provided by Thomas Clouaire and Adrian Bird. System administration and computing support was provided by Shaun Webb and Alastair Kerr. Arthur Gretton was very helpful in discussing MMD-based statistical testing and Rory Stark gave valuable insights into ChIP-Seq data analysis and his package \DBA. The research leading to these results has received funding from the People Program (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement number $n^o$ PIEF-GA-2011-299192. \section{Setup} This vignette was built on: <>= sessionInfo() @ <>= dev.off() unlink(tmp) setwd(savewd) @ \bibliographystyle{plainnat} \bibliography{MMDiff_bib} \end{document}