%\VignetteIndexEntry{charm Vignette} %\VignetteDepends{charmData, BSgenome.Hsapiens.UCSC.hg18} %\VignetteKeywords{} %\VignettePackage{charm} \documentclass{article} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \usepackage{graphicx} \usepackage{hyperref} \usepackage{pdfpages} \begin{document} \title{Using the charm package to estimate DNA methylation levels and find differentially methylated regions} \date{October 10, 2012} \author{Peter Murakami\footnote{pmurakam@jhsph.edu}, Martin Aryee, Rafael Irizarry} \maketitle \begin{center} Johns Hopkins School of Medicine / Johns Hopkins School of Public Health\\Baltimore, MD, USA \end{center} <>= options(width=60) options(continue=" ") options(prompt="R> ") @ \section{Introduction} The Bioconductor package \Rpackage{charm} can be used to analyze DNA methylation data generated using McrBC fractionation and two-color Nimblegen microarrays. It is customized for use with data the from the custom CHARM microarray \cite{IrizarryGenomeRes2008}, but can also be applied to many other Nimblegen designs. The preprocessing and normalization methods are described in detail in \cite{AryeeBiostatistics2011}. Functions include: \begin{itemize} \item Quality control \item Finding suitable control probes for normalization \item Percentage methylation estimates \item Identification of differentially methylated regions \item Plotting of differentially methylated regions \end{itemize} As input we will need raw Nimblegen data (.xys) files and a corresponding annotation package built with pdInfoBuilder. This vignette uses the following packages: \begin{itemize} \item \Rpackage{charm}: contains the analysis functions \item \Rpackage{charmData}: an example dataset \item \Rpackage{pd.charm.hg18.example}: the annotation package for the example dataset \item \Rpackage{BSgenome.Hsapiens.UCSC.hg18}: A BSgenome object containing genomic sequence used for finding non-CpG control probes \end{itemize} Each sample is represented by two xys files corresponding to the untreated (green) and methyl-depleted (red) channels. The 532.xys and 635.xys suffixes indicate the green and red channels respectively. \section{Analyzing data from the custom CHARM microarray} Load the \Rpackage{charm} package: <>= library(charm) library(charmData) @ \section{Read in raw data} Get the name of your data directory (in this case, the example data): <>= dataDir <- system.file("data", package="charmData") dataDir @ First we read in the sample description file: <>= phenodataDir <- system.file("extdata", package="charmData") pd <- read.delim(file.path(phenodataDir, "phenodata.txt")) phenodataDir pd @ A valid sample description file should contain at least the following (arbitrarily named) columns: \begin{itemize} \item a filename column \item a sample ID column \item a group label column (optional) \end{itemize} The sample ID column is used to pair the methyl-depleted and untreated data files for each sample. The group label column is used when identifying differentially methylated regions between experimental groups. The \Rcode{validatePd} function can be used to validate the sample description file. When called with only a sample description data frame and no further options \Rcode{validatePd} will try to guess the contents of the columns. <>= res <- validatePd(pd) @ Now we read in the raw data. The \Rcode{readCharm} command makes the assumption (unless told otherwise) that the two xys files for a sample have the same file name up to the suffixes 532.xys (untreated) and 635.xys (methyl-depleted). The sampleNames argument is optional. Note that if the \Rpackage{ff} package has been loaded previously in your R session, the output of readCharm will contain \Rcode{ff} rather than \Rcode{matrix} objects, and all subsequent \Rpackage{charm} functions acting on it (except those in pipeline 2 described below) will recognize this and use \Rcode{ff} objects also. Using the \Rpackage{ff} package is recommended when the data set is otherwise too large for the amount of memory available. <>= rawData <- readCharm(files=pd$filename, path=dataDir, sampleKey=pd, sampleNames=pd$sampleID) rawData @ \section{Array quality assessment} We can calculate array quality scores and generate a pdf report with the \Rcode{qcReport} command. A useful quick way of assessing data quality is to examine the untreated channel where we expect every probe to have signal. Very low signal intensities on all or part of an array can indicate problems with hybridization or scanning. The CHARM array and many other designs include background probes that do not match any genomic sequence. Any signal at these background probes can be assumed to be the result of optical noise or cross-hybridization. Since the untreated channel contains total DNA a successful hybridization would have strong signal for all untreated channel genomic probes. The array signal quality score (pmSignal) is calculated as the average percentile rank of the signal robes among these background probes. A score of 100 means all signal probes rank above all background probes (the ideal scenario). <>= qual <- qcReport(rawData, file="qcReport.pdf") qual @ The PDF quality report is shown in Appendix A. Three quality metrics are calculated for each array: \begin{enumerate} \item Average signal strength: the average percentile rank of untreated channel signal probes among the background (anti-genomic) probes. \item Untreated channel signal standard deviation. The array is divided into a series of rectangular blocks and the average signal level calculated for each. Since probes are arranged randomly on the array there should be no large differences between blocks. Arrays with spatial artifacts have a larg standard deviation between blocks. \item Methyl-depleted channel signal standard deviation. \end{enumerate} To remove samples with a quality score less than 78, we could do this: <>= qc.min = 78 ##Remove arrays with quality scores below qc.min: rawData=rawData[,qual$pmSignal>=qc.min] qual=qual[qual$pmSignal>=qc.min,] pd=pd[pd$sampleID%in%rownames(qual),] pData(rawData)$qual=qual$pmSignal @ and to identify which probes have a mean quality score above 75 we could do this: <>= pmq = pmQuality(rawData) rmpmq = rowMeans(pmq) okqc = which(rmpmq>75) @ We now want to calculate probe-level percentage methylation estimates for each sample. As a first step we need to identify a suitable set of unmethylated control probes from CpG-free regions to be used in normalization. <>= library(BSgenome.Hsapiens.UCSC.hg18) ctrlIdx <- getControlIndex(rawData, subject=Hsapiens, noCpGWindow=600) @ We can check the success of the control probes by comparing their intensity distribution with the non-control probes (before any normalization in which the control probes are used). <>= cqc = controlQC(rawData=rawData, controlIndex=ctrlIdx, IDcol="sampleID", expcol="tissue", ylimits=c(-6,8), outfile="boxplots_check.pdf", height=7, width=9) cqc @ We can also access the probe annotation using standard functions from the oligo package. <>== chr = pmChr(rawData) pns = probeNames(rawData) pos = pmPosition(rawData) seq = pmSequence(rawData) pd = pData(rawData) @ \section{Percentage methylation estimates and differentially methylated regions (DMRs)} The minimal code required to estimate methylation would be \Rcode{p <- methp(rawData, controlIndex=ctrlIdx)}. However, it is often useful to get \Rcode{methp} to produce a series of diagnostic density plots to help identify non-hybridization quality issues. The \Rcode{plotDensity} option specifies the name of the output pdf file, and the optional \Rcode{plotDensityGroups} can be used to give groups different colors. Remember that if the \Rpackage{ff} package was loaded before producing \Rcode{rawData} with \Rcode{readCharm}, the output of \Rcode{methp} will be an \Rcode{ff} rather than a \Rcode{matrix} object. \Rpackage{charm} functions (except those in pipeline 2 described below) will handle \Rcode{ff p} objects automatically. <>= p <- methp(rawData, controlIndex=ctrlIdx, plotDensity="density.pdf", plotDensityGroups=pd$tissue) head(p) @ For a simple unsupervised clustering of the samples, we can plot the results of a classical multi-dimensional scaling analysis. <>= cmdsplot(labcols=c("red","black","blue"), expcol="tissue", rawData=rawData, p=p, okqc=okqc, noXorY=TRUE, outfile="cmds_topN.pdf", topN=c(100000,1000)) @ The density plots are shown in Appendix B and the MDS plot is shown in Appendix C. \subsection{Pipeline 1 (recommended): Regression-based DMR-finding after correcting for batch effects} Optionally, we may wish to restrict our search for DMRs to non-control probes exceeding some quality threshold. We may do that simply by subsetting: <>= Index = setdiff(which(rmpmq>75),ctrlIdx) Index = Index[order(chr[Index], pos[Index])] p0 = p #save for pipeline 2 example p = p[Index,] seq = seq[Index] chr = chr[Index] pos = pos[Index] pns = pns[Index] pns = clusterMaker(chr,pos) @ You might also wish to consider excluding some probes from the between-array normalization step in \Rcode{methp} earlier using the \Rcode{excludeIndex} argument, e.g., \Rcode{excludeIndex=which(rmpmq<=50)}, however, note that it is probably inadvisable to remove probes from between-array normalization in \Rcode{methp} that you will end up using in the analysis (note that the probes with mean qc < x1 are a subset of the probes with mean qc < x2 when x1=50 and x2=75 as in this example. Setting x1>x2 is not recommended as it would result in un-normalized probes being used in the analysis). Using the \Rcode{clusterMaker} function was necessary in order to redefine the array regions since removing probes may result in too few probes per region or unacceptably large gaps between probes within the same region. At this point it may also be helpful to remove arrays whose average correlation with all other arrays is below some threshold, since it is often reasonable to assume that most probes are not differentially methylated between arrays. Unsupervised clustering would also probably tend to show such arrays as clustering separately. Another reason for removing arrays at this point is if they have missing data on any of the covariates to be used in the following analysis. Remember that any arrays excluded from this point forward should be removed from both \Rcode{p} and \Rcode{pd}. To identify DMR candidates, we use the \Rcode{dmrFind} function. As it requires the same \Rcode{mod} and \Rcode{mod0} arguments as the \Rcode{sva()} function from the \Rpackage{sva} package, we must first create these. Data with paired samples may be accommodated by including the pair ID column as a factor in \Rcode{mod} and \Rcode{mod0}. <>= mod0 = matrix(1,nrow=nrow(pd),ncol=1) mod = model.matrix(~1 +factor(pd$tissue,levels=c("liver","colon","spleen"))) @ We may now call \Rcode{dmrFind}. Setting the \Rcode{coeff} argument to 2 means that we are interested in the colon-liver comparison, since it is the second column of \Rcode{mod} that defines that comparison in the linear model. <>= library(corpcor) thedmrs = dmrFind(p=p, mod=mod, mod0=mod0, coeff=2, pns=pns, chr=chr, pos=pos) @ To compare liver and spleen, set \Rcode{coeff} to 3. To compare colon and spleen, you must redefine \Rcode{mod} such that the first level is either colon or spleen, and then set \Rcode{coeff} appropriately, e.g., \Rcode{mod = model.matrix(~1 + factor(pd\$tissue, levels=c("colon","spleen","liver")))} and \Rcode{coeff=2}. To avoid repeating the SVA analysis within \Rcode{dmrFind}, you may provide the surrogate variables already identified above as an argument to subsequent \Rcode{dmrFind} calls through the \Rcode{svs} argument. The surrogate variables are located in \Rcode{thedmrs\$args\$svs}, so adding \Rcode{svs=thedmrs\$args\$svs} to the call to \Rcode{dmrFind} would prevent SVA from being called again. As long as \Rcode{p} (or \Rcode{logitp}, if you provided \Rcode{logitp} to \Rcode{dmrFind}), \Rcode{mod}, and \Rcode{mod0} are the same, the surrogate variables will be the same regardless of which comparison you explore. Also note that if you adjust for covariates, their effects will be controlled for when finding DMRs, however their effects are not removed from the matrix of "cleaned" percent methylation estimates (i.e., \Rcode{cleanp}) returned by \Rcode{dmrFind}, which by default removes only batch effects (i.e., the surrogate variables identified by SVA). Consequently the adjustment covariate effects will still show up in the clustering results (and will probably be enhanced) and in the DMR plots (since they do not get removed from the \Rcode{cleanp} matrix). Only the surrogate variables identified by SVA will be removed from the clustering results and the DMR plots, regardless of whether or not you adjust for covariates. Setting \Rcode{rob=FALSE} in dmrFinder will cause the covariates' effects to be removed from \Rcode{cleanp} as well (all except the covariate of interest). \Rcode{rob=TRUE} by default because covariates explicitly adjusted for should typically be real biological rather than technical confounders, and removing the effects of real biological confounders from the percent methylation estimates would change them from being our best estimate of what the true percent methylation is for each probe in our sample to an adjusted version of this. If you want to obtain FDR q-values for the DMR candidates returned by \Rcode{dmrFind}, you may use the \Rcode{qval} function as follows: <>= withq = qval(p=p, dmr=thedmrs, numiter=3, verbose=FALSE, mc=1) @ The \Rcode{numiter} argument is set to 3 here only for convenience of demonstration. In reality it should be much higher (hundreds, if not thousands). The \Rcode{p} argument provided must be the same as the one used in \Rcode{dmrFind}. The \Rcode{qval} function utilizes the \Rpackage{parallel} package that comes with R as of version 2.14. By default, \Rcode{mc=1} (no parallelization), however on multiple-core machines you can set \Rcode{qval} to use more cores to parallelize the process. If you are working in a shared computing environment, take care not to request more cores than are available to you. We may plot DMR candidates from \Rcode{dmrFind} using the \Rcode{plotDMRs} function. In order to mark the location of CpG islands in the second panel of each plot, we must first obtain a table identifying CpG islands. CpG island definitions according to the method of Wu et al (2010) \cite{Wu2010} are available for a large number of genomes and are one source for such a table. Alternatively, CpG island definitions may be obtained from the UCSC Genome Browser, which is what we will use here for this example. <>= con <- gzcon(url(paste("http://hgdownload.soe.ucsc.edu/goldenPath/hg18/database/","cpgIslandExt.txt.gz", sep=""))) txt <- readLines(con) cpg.cur <- read.delim(textConnection(txt), header=FALSE, as.is=TRUE) cpg.cur <- cpg.cur[,1:3] colnames(cpg.cur) <- c("chr","start","end") cpg.cur <- cpg.cur[order(cpg.cur[,"chr"],cpg.cur[,"start"]),] plotDMRs(dmrs=thedmrs, Genome=Hsapiens, cpg.islands=cpg.cur, exposure=pd$tissue, outfile="./colon-liver.pdf", which_plot=c(1), which_points=c("colon","liver"), smoo="loess", ADD=3000, cols=c("black","red","blue")) @ Instead of plotting the probe p-values in the 3rd panel, you may also wish to inspect the behavior of the green channel (total) across the DMR regions. To do this, you must first have obtained the green channel intensity matrix, which we do here after spatial adjustment and background correction. In addition to specifying \Rcode{panel3="G"}, we must also provide \Rcode{G} and the sequences corresponding its rows (because the intensities are further corrected for gc-content). <>= dat0 = spatialAdjust(rawData, copy=FALSE) dat0 = bgAdjust(dat0, copy=FALSE) G = pm(dat0)[,,1] #from oligo G = G[Index,] plotDMRs(dmrs=thedmrs, Genome=Hsapiens, cpg.islands=cpg.cur, exposure=pd$tissue, outfile="./colon-liver2.pdf", which_plot=c(1), which_points=c("colon","liver"), smoo="loess", ADD=3000, cols=c("black","red","blue"), panel3="G", G=G, seq=seq) @ \subsubsection{Continuous covariate of interest} The \Rcode{dmrFind} function also handles a continuous covariate of interest. Here we generate an artificial continuous covariate called \Rcode{x} and perform the analysis using that. <>= pd$x = c(1,2,3,4,5,6) mod0 = matrix(1,nrow=nrow(pd),ncol=1) mod = model.matrix(~1 +pd$x) coeff = 2 thedmrs2 = dmrFind(p=p, mod=mod, mod0=mod0, coeff=coeff, pns=pns, chr=chr, pos=pos) @ To plot the DMR results, you may either categorize the continuous covariate as for example as follows <>= groups = as.numeric(cut(mod[,coeff],c(-Inf,2,4,Inf))) #You can change these cutpoints. pd$groups = c("low","medium","high")[groups] plotDMRs(dmrs=thedmrs2, Genome=Hsapiens, cpg.islands=cpg.cur, exposure=pd$groups, outfile="./test.pdf", which_plot=c(1), smoo="loess", ADD=3000, cols=c("black","red","blue")) @ or you may plot the correlation of each probe with the covariate as follows: <>= plotDMRs(dmrs=thedmrs2, Genome=Hsapiens, cpg.islands=cpg.cur, exposure=pd$x, outfile="./x.pdf", which_plot=c(1), smoo="loess", ADD=3000, cols=c("black","red","blue")) @ An additional function that can be helpful for working with tables with columns "chr", "start", and "end" as many of the objects required or returned by these functions are is the \Rcode{regionMatch} function, which finds for each region in one table the nearest region in another table (using the \Rcode{nearest()} function in the \Rpackage{IRanges} package) and provides information on how near they are to each other. <>= ov = regionMatch(thedmrs$dmrs,thedmrs2$dmrs) head(ov) @ One may also plot regions other than DMR candidates returned by \Rcode{dmrFind}, using the \Rcode{plotRegions} function. <>= mytable = thedmrs$dmrs[,c("chr","start","end")] mytable[2,] = c("chr1",1,1000) #not on array mytable$start = as.numeric(mytable$start) mytable$end = as.numeric(mytable$end) plotRegions(thetable=mytable[c(1),], cleanp=thedmrs$cleanp, chr=chr, pos=pos, Genome=Hsapiens, cpg.islands=cpg.cur, outfile="myregions.pdf", exposure=pd$tissue, exposure.continuous=FALSE) @ \subsection{Pipeline 2: DMR-finding without adjusting for batch or other covariates} We can identify differentially methylated regions using the original \Rcode{dmrFinder}: <>= dmr <- dmrFinder(rawData, p=p0, groups=pd$tissue, compare=c("colon", "liver","colon", "spleen"), removeIf=expression(nprobes<4 | abs(diff)<.05 | abs(maxdiff)<.05)) @ <>= names(dmr) names(dmr$tabs) head(dmr$tabs[[1]]) @ When called without the \Rcode{compare} option, \Rcode{dmrFinder} performs all pairwise comparisons between the groups. We can also plot DMR candidates with the dmrPlot function. Here we plot just the top DMR candidate from the first DMR table. <>= dmrPlot(dmr=dmr, which.table=1, which.plot=c(1), legend.size=1, all.lines=TRUE, all.points=FALSE, colors.l=c("blue","black","red"), colors.p=c("blue","black"), outpath=".", cpg.islands=cpg.cur, Genome=Hsapiens) @ We can also plot any given genomic regions using this data by using the \Rcode{regionPlot} function, supplying the regions in a data frame that must have columns with names "chr", "start", and "end". Naturally, regions that are not on the array will not appear in the resulting file. <>= mytab = data.frame(chr=as.character(dmr$tabs[[1]]$chr[1]), start=as.numeric(c(dmr$tabs[[1]]$start[1])), end=as.numeric(c(dmr$tabs[[1]]$end[1])), stringsAsFactors=FALSE) regionPlot(tab=mytab, dmr=dmr, cpg.islands=cpg.cur, Genome=Hsapiens, outfile="myregions.pdf", which.plot=1:5, plot.these=c("liver","colon"), cl=c("blue","black"), legend.size=1, buffer=3000) @ The DMR plot is shown in Appendix D, and the plot of the user-provided region is shown in Appendix E. \subsubsection{Analysis of paired samples} If the samples are paired, we can also analyze them as such. To show this, let's pretend that the samples in our test data set are paired, and then use the \Rcode{dmrFinder} function with the \Rcode{"paired"} argument set to TRUE and the \Rcode{"pairs"} argument specifying which samples are pairs. (In this example we also have to lower the cutoff since there are not enough samples to find any regions with the default cutoff of 0.995.) <>= pData(rawData)$pair = c(1,1,2,2,1,2) dmr2 <- dmrFinder(rawData, p=p0, groups=pd$tissue, compare=c("colon", "liver","colon", "spleen"), removeIf=expression(nprobes<4 | abs(diff)<.05 | abs(maxdiff)<.05), paired=TRUE, pairs=pData(rawData)$pair, cutoff=0.95) @ We plot the, say, third DMR with the \Rcode{dmrPlot} function (shown in Appendix F) <>= dmrPlot(dmr=dmr2, which.table=1, which.plot=c(3), legend.size=1, all.lines=TRUE, all.points=FALSE, colors.l=c("blue","black"), colors.p=c("blue","black"), outpath=".", cpg.islands=cpg.cur, Genome=Hsapiens) @ Plotting user-provided regions using the results of paired analysis is done using the \Rcode{regionPlot} function as before (shown in Appendix G). <>= regionPlot(tab=mytab, dmr=dmr2, cpg.islands=cpg.cur, Genome=Hsapiens, outfile="myregions_paired.pdf", which.plot=1:5, plot.these=c("colon-liver"), cl=c("black"), legend.size=1, buffer=3000) @ \bibliography{charmVignette}{} \bibliographystyle{plain} \section{Appendix A: Quality report} \includepdf[pages=-]{qcReport.pdf} \section{Appendix B: Density plots} Each row corresponds to one stage of the normalization process (Raw data, After spatial and background correction, after within-sample normalization, after between-sample normalization, percentage methylation estimates). The left column shows all probes, while the right column shows control probes. \includepdf{density.pdf} \section{Appendix C: MDS plot} \includepdf{cmds_topN.pdf} \section{Appendix D: DMR plot} DMR plot for the first DMR in the list. \includepdf{colon-liver.pdf} \section{Appendix E: Plot of an arbitrary genomic region} For the arbtirary region we just chose the first DMR. \includepdf{myregions.pdf} \section{Appendix F: DMR plot from analysis of paired samples} DMR plot for the third DMR in the list \includepdf{colon-liver_paired.pdf} \section{Appendix G: Plot of an arbitrary genomic region, shown using paired results} For the arbtirary region we simply chose the same first DMR as in appendix E. \includepdf{myregions_paired.pdf} \section{Details} This document was written using: <<>>= sessionInfo() @ \end{document}