\documentclass{article} \usepackage{url} \usepackage{breakurl} \usepackage{amsmath} \usepackage{amssymb} %\VignetteIndexEntry{The DMRcate package user's guide} %\VignetteEngine{knitr::knitr} \begin{document} \title{The \texttt{DMRcate} package user's guide} \author{Peters TJ, Buckley MJ, Statham A, Pidsley R, Clark SJ, Molloy PL} \maketitle \renewcommand{\abstractname}{Summary} \begin{abstract} \texttt{DMRcate} extracts the most differentially methylated regions (DMRs) and variably methylated regions (VMRs) from both Whole Genome Bisulphite Sequencing (WGBS) and Illumina\textregistered Infinium BeadChip Array samples via kernel smoothing. \end{abstract} <>= source("http://bioconductor.org/biocLite.R") biocLite("DMRcate") @ Load \texttt{DMRcate} into the workspace: <>= library(DMRcate) @ \section*{Illumina\textregistered Array Workflow} We now can load in the test data set of beta values. We assume at this point that normalisation and filtering out bad-quality probes via their detection \textit{p}-values have already been done. Many packages are available for these purposes, including \texttt{minfi}, \texttt{wateRmelon} and \texttt{methylumi}. M-values (logit-transform of beta) are preferable to beta values for significance testing via \texttt{limma} because of increased sensitivity, but we will retain the beta matrix for visualisation purposes later on. The TCGA (Cancer Genome Atlas - colorectal cancer) data in \texttt{myBetas} only comes from chromosome 20, but DMRcate will have no problem taking in the approximately half million probes as input for this pipeline either. <>= data(dmrcatedata) myMs <- logit2(myBetas) @ Some of the methylation measurements on the array may be confounded by proximity to SNPs, and cross-hybridisation to other areas of the genome\cite{Chen}. In particular, probes that are 0, 1, or 2 nucleotides from the methylcytosine of interest show a markedly different distribution to those farther away, in healthy tissue (Figure 1). \begin{figure}[htbp!] \caption{Beta distribution of 450K probes from publically available data from blood samples of healthy individuals \cite{Heyn} by their proximity to a SNP. ``All SNP probes'' refers to the 153 113 probes listed by Illumina\textregistered\ whose values may potentially be confounded by a SNP. } \centering \includegraphics[width=\textwidth]{heynSNP.pdf} \end{figure} It is with this in mind that we filter out probes 2 nucleotides or closer to a SNP that have a minor allele frequency greater than 0.05, and the approximately 30,000 \cite{Chen} cross-reactive probes, so as to reduce confounding. Here we use Illumina\textregistered's database of approximately 150,000 potentially SNP-confounded probes, and an internally-loaded dataset of the probes from \cite{Chen}, to filter these probes out. About 600 are removed from our M-matrix of approximately 10,000: <>= nrow(illuminaSNPs) nrow(myMs) myMs.noSNPs <- rmSNPandCH(myMs, dist=2, mafcut=0.05) nrow(myMs.noSNPs) @ Next we want to annotate our matrix of M-values with relevant information. The default is the \texttt{ilmn12.hg19} annotation, but this can be substituted for any argument compatible with the interface provided by the \texttt{minfi} package. We also use the backbone of the \texttt{limma} pipeline for differential array analysis to get \textit{t}-statistics changes and, optionally, filter probes by their fdr-corrected \textit{p}-value. Here we have 38 patients with 2 tissue samples each taken from them. We want to compare within patients across tissue samples, so we set up our variables for a standard limma pipeline, and set \texttt{coef=39} in \texttt{cpg.annotate} since this corresponds to the phenotype comparison in \texttt{design}. <>= patient <- factor(sub("-.*", "", colnames(myMs))) type <- factor(sub(".*-", "", colnames(myMs))) design <- model.matrix(~patient + type) myannotation <- cpg.annotate("array", myMs.noSNPs, analysis.type="differential", design=design, coef=39) @ Now we can find our most differentially methylated regions with \texttt{dmrcate()}. For each chromosome, two smoothed estimates are computed: one weighted with \texttt{myannotation\$stat} and one not, for a null comparison. The two estimates are compared via a Satterthwaite approximation\cite{Satterthwaite}, and a significance test is calculated at all hg19 coordinates that an input probe maps to. After fdr-correction, regions are then agglomerated from groups of significant probes where the distance to the next consecutive probe is less than \texttt{lambda} nucleotides. <>= dmrcoutput <- dmrcate(myannotation, lambda=1000, C=2) @ We can convert our DMR list to a GRanges object, which uses the \texttt{genome} argument to annotate overlapping promoter regions (+/- 2000 bp from TSS). and pass it to DMR.plot, which uses the \texttt{Gviz} package as a backend for contextualising each DMR. We'll choose one associated with the GATA5 locus. <>= results.ranges <- extractRanges(dmrcoutput, genome = "hg19") results.ranges @ Now we can plot a significant DMR. We use functionality from the \texttt{Gviz} package as a backend for this purpose. We will plot a DMR associated with the GATA5 locus for the first 6 tumour/normal matched pairs. <>= groups <- c(Tumour="magenta", Normal="forestgreen") cols <- groups[as.character(type)] samps <- c(1:6, 38+(1:6)) DMR.plot(ranges=results.ranges, dmr=1, CpGs=myBetas, phen.col=cols, genome="hg19", samps=samps) @ \section*{WGBS Workflow} WGBS is a little different. Because the data is represented binomially (that is, by the number of methylated reads followed by the total coverage for that particular CpG site) rather than the continuous distribution afforded by array intensities, we must model the differential methylation signal in a way that respects this. A popular way of doing this is via the beta-binomial distribution. We currently recommend using the method implemented in the DSS package\cite{Feng}, because it uses dispersion shrinkage via a Bayesian framework - similar to \texttt{edgeR} for RNA-Seq count data. The \texttt{CpGs} GRanges object contains simulated data for 3 Treatment vs. 3 Control samples for $10^{5}$ CpG sites, generated by WGBSSuite\cite{Rackham}. <>= CpGs @ Note the structure of the metadata columns for this object: samples come in column pairs, with the number of methylated reads followed by the total coverage for that CpG site. Naturally, $\langle$sample$\rangle$.cov must always be $\geq \langle$sample$\rangle$.C. This structure must be in place in order for downstream tasks such as \texttt{DMR.plot()} to be run. Using this structure, we can now extract the methylation and coverage counts, and prepare a \texttt{bsseq} object as we would for \texttt{DSS}, and call differentially methylated CpG sites. <>= meth <- as.data.frame(CpGs)[,c(1:2, grep(".C$", colnames(as.data.frame(CpGs))))] coverage <- as.data.frame(CpGs)[,c(1:2, grep(".cov$", colnames(as.data.frame(CpGs))))] treat1 <- data.frame(chr=coverage$seqnames, pos=coverage$start, N=coverage$Treatment1.cov, X=meth$Treatment1.C) treat2 <- data.frame(chr=coverage$seqnames, pos=coverage$start, N=coverage$Treatment2.cov, X=meth$Treatment2.C) treat3 <- data.frame(chr=coverage$seqnames, pos=coverage$start, N=coverage$Treatment3.cov, X=meth$Treatment3.C) ctrl1 <- data.frame(chr=coverage$seqnames, pos=coverage$start, N=coverage$Control1.cov, X=meth$Control1.C) ctrl2 <- data.frame(chr=coverage$seqnames, pos=coverage$start, N=coverage$Control2.cov, X=meth$Control2.C) ctrl3 <- data.frame(chr=coverage$seqnames, pos=coverage$start, N=coverage$Control3.cov, X=meth$Control3.C) samples <- list(treat1, treat2, treat3, ctrl1, ctrl2, ctrl3) sampnames <- sub("\\..*", "", colnames(meth))[-c(1:2)] obj_bsseq <- makeBSseqData(samples, sampnames) DSSres <- DMLtest(obj_bsseq, group1=sampnames[1:3], group2=sampnames[4:6], smoothing=FALSE) @ We can now enter \texttt{DSSres} into the \texttt{DMRcate} workflow. Because CpGs are much closer together than they are when represented by Illumina arrays, we will shrink the kernel size by increasing \texttt{C}. We will also run this in serial (\texttt{mc.cores=1}). If you want to run \texttt{dmrcate()} in parallel (1 chromosome per core), please check your processor specifications by running \texttt{detectCores()}. <>= wgbsannot <- cpg.annotate("sequencing", DSSres) wgbs.DMRs <- dmrcate(wgbsannot, lambda = 1000, C = 50, pcutoff = 0.05, mc.cores = 1) wgbs.ranges <- extractRanges(wgbs.DMRs, genome = "hg19") groups <- c(Treatment="darkorange", Control="blue") cols <- groups[sub("[0-9]", "", sampnames)] DMR.plot(ranges=wgbs.ranges, dmr=1, CpGs=CpGs, phen.col=cols, genome="hg19") @ <>= sessionInfo() @ \begin{thebibliography}{9} \bibitem{Chen} Chen YA, Lemire M, Choufani S, Butcher DT, Grafodatskaya D, Zanke BW, Gallinger S, Hudson TJ, Weksberg R. Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. \emph{Epigenetics}. 2013 Jan 11;8(2). \bibitem{Heyn} Heyn H, Li N, Ferreira HJ, Moran S, Pisano DG, Gomez A, Esteller M. Distinct DNA methylomes of newborns and centenarians. \emph{Proceedings of the National Academy of Sciences}. 2012 \textbf{109}(26), 10522-7. \bibitem{Satterthwaite} Satterthwaite FE. An Approximate Distribution of Estimates of Variance Components., \emph{Biometrics Bulletin}. 1946 \textbf{2}: 110-114 \bibitem{Feng} Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. \emph{Nucleic Acids Research}. 2014 \textbf{42}(8), e69. \bibitem{Rackham} Rackham, OJL, Dellaportas P, Petretto E, Bottolo, L. WGBSSuite: Simulating Whole Genome Bisulphite Sequencing data and benchmarking differential DNA methylation analysis tools. \emph{Bioinformatics} 2015. (Oxford, England), (March). \end{thebibliography} \end{document}