% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteIndexEntry{Analyze Illumina Infinium methylation microarray data} %\VignetteKeywords{Illumina, BeadArray, DNA methylation, Microarray preprocessing} %\VignetteDepends{lumi, Biobase} %\VignettePackage{lumi} \documentclass[a4paper]{article} \usepackage{amsmath,pstricks} % With MikTeX 2.9, using pstricks also requires using auto-pst-pdf or running % pdflatex will fail. Note that using auto-pst-pdf requires to set environment % variable MIKTEX_ENABLEWRITE18=t on Windows, and to set shell_escape = t in % texmf.cnf for Tex Live on Unix/Linux/Mac. \usepackage{auto-pst-pdf} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \SweaveOpts{keep.source=TRUE} \author{Pan Du$^\ddagger$\footnote{dupan (at) northwestern.edu}, Gang Feng$^\ddagger$\footnote{g-feng (at) northwestern.edu}, Spencer Huang$^\ddagger$\footnote{huangcc (at) northwestern.edu}, Warren A. Kibbe$^\ddagger$\footnote{wakibbe (at) northwestern.edu}, Simon Lin$^\ddagger$\footnote{s-lin2 (at) northwestern.edu}} \begin{document} \setkeys{Gin}{width=1\textwidth} \title{Analyze Illumina Infinium methylation microarray data} \maketitle \begin{center}$^\ddagger$Biomedical Informatics Center \\ Northwestern University, Chicago, IL, 60611, USA \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Overview} This is a tutorial of processing Illumina Infinium methylation microarray data using \Rpackage{lumi} package. For the processing of Illumina GoldenGate methylation microarray data, please check the \Rpackage{methylumi} package. The tutorial will briefly describe the class structure, supported methods and functions in quality and color balance assessment, color balance adjustment, background adjustment, normalization and modeling the methylation status. It will also illustrate why we suggest using M-value instead of Beta-value in the statistics analysis. \section{Major classes of Illumina methylation microarray data} By default, Illumina suggests using the Beta-value to quantify the methylation levels. The Beta-value is a ratio between Illumina methylated probe intensity and total probe intensities (sum of methylated and unmethylated probe intensities). It is in the range of 0 and 1, which can be interpreted as the percentage of methylation. However, we have shown the Beta-value method has severe heteroscedasticity in the low and high methylation range, which imposes serious challenges in applying many statistic models [1]. The M-value, which is the log2 ratio of methylated probe intensity and unmethylated probe intensity, is another method used to measure the methylation level. We have shown the M-value method is approximately homoscedastic in the entire methylation range. As a result, it is more statistically valid in differential and other statistic analysis [1]. For this reason, we defined a new \Rclass{MethyLumiM} class. \Rclass{MethyLumiM} is a class inherited from \Rclass{ExpressionSet} class. The "assayData" slot includes three required data matrix, which are "exprs", "methylated" and "unmethylated". The "exprs" is a matrix of M-values, which is the log2 ratio of methylated and unmethylated probe intensities; "methylated" and "unmethylated" are intensity matrix measured by methylated and unmethylalted probes of Illumina Infinium methylation microarray. "detection" is an optional matrix in "assayData" slot. It is the detection p-value outputted by Illumina GenomeStudio software. To keep the control data information, which is mainly used for the quality control purpose, the \Rclass{MethyLumiM} class has a "controlData" slot. The "controlData" slot can be either NULL or a \Rclass{MethyLumiQC} object. Users are able to plot the control data using \Rfunction{plotControlData} function. We use the \Rfunction{methylumiR} function in \Rpackage{methylumi} package to read in the Illumina methylation data, which was outputted by Illumina GenomeStudio or BeadStudio software. The \Rfunction{methylumiR} function returns a \Rclass{MethyLumiSet} class object. For convenience, we wrote a coerce function to map from \Rclass{MethyLumi}, \Rclass{MethyLumiSet} or other \Rclass{eSet} inherited objects to \Rclass{MethyLumiM} class object. And We wrote a \Rfunction{lumiMethyR} function as a wrap function of \Rfunction{methylumiR} and coerce the returned object as a \Rclass{MethyLumiM} class object. \section{Data preprocessing} The first thing is to load the \Rpackage{lumi} package. <>= library(lumi) @ \subsection{Read the methylation microarray data} We define a \Rfunction{lumiMethyR} function to read the Illumina methylation data, which is a wrap of the \Rfunction{methylumiR} function in \Rpackage{methylumi} package. The \Rfunction{lumiMethyR} function coerces the returned object (in \Rclass{MethyLumiSet} class) as a \Rclass{MethyLumiM} class object. The annotation library is not required if the GenomeStudio output data has already included the "COLOR\_CHANNEL" column. \begin{Sinput} > ## specify the file name > # fileName <- 'Example_Illumina_Methylation_profile.txt' > ## load the data > # example.lumiMethy <- lumiMethyR(fileName, lib="IlluminaHumanMethylation27k.db") # Not Run \end{Sinput} \subsection{Example methylation datasets} The \Rpackage{lumi} package includes two example datasets. One is a control and treatment dataset (included in \Robject{example.lumiMethy} data object), which is measured in two batches. Another one is a titration dataset (included in \Robject{example.methyTitration} data object). To save storage space, only a randomly selected subset rows (CpG sites) were kept in the example datasets. The control and treatment dataset (\Robject{example.lumiMethy}) includes four control and four treatment samples together with their technique replicates. The original samples and technique replicates were measured in two batches. Figure \ref{fig:sampleRelation} and \ref{fig:sampleRelationCluster} show the overall sample relations of the control and treatment dataset in a PCA plot and Hierarchical cluster tree. We can see the batch difference is the major effect of sample differences. <>= ## load example data (a methyLumiM object) data(example.lumiMethy) ## summary of the example data example.lumiMethy ## print sample Names sampleNames(example.lumiMethy) @ % overall data distribution \begin{figure} \centering <>= plotSampleRelation(example.lumiMethy, method='mds', cv.Th=0) @ \caption{Overall sample relations in example.lumiMethy dataset} \label{fig:sampleRelation} \end{figure} \begin{figure} \centering <>= plotSampleRelation(example.lumiMethy, method='cluster', cv.Th=0) @ \caption{Overall sample relations shown as a hierarchical tree} \label{fig:sampleRelationCluster} \end{figure} The methylation titration dataset (\Robject{example.methyTitration}) includes 8 samples: "A1", "A2", "B1", "B2", "C1", "C2", "D" and "E". They are mixtures of Sample A (a B-lymphocyte sample) and Sample B (is (a colon cancer sample from a female donor) at five different titration ratios: 100:0 (A), 90:10 (C), 75:25 (D), 50:50 (E) and 0:100 (B). Sample A, B and C have technique replicates. Figure \ref{fig:sampleRelationTitration} shows the overall sample relations of the titration dataset in a PCA plot. <>= ## load the tritration data (a methyLumiM object) data(example.methyTitration) ## summary of the example data example.methyTitration ## print sample Names sampleNames(example.methyTitration) @ \begin{figure} \centering <>= plotSampleRelation(example.methyTitration, method='mds', cv.Th=0) @ \caption{Overall sample relations of the titration dataset} \label{fig:sampleRelationTitration} \end{figure} @ \subsection{Check data distribution} Similar with expression microarray data, the \Rpackage{lumi} defines density and boxplot to check the overall distribution of the data. Figure \ref{fig:densityMTitration} and \ref{fig:densityM} show the density of the example datasets. We can see the sample distributions can be quite different from each other. Because the density plot of M-values usually includes two modes, using the traditional boxplot cannot accurately represent the distribution of the data. We used another type of boxplot (method signature is \Rclass{MethyLumiM} class) which is capable to show data with multiple modes. It uses different color levels to represent the M-values in different levels of HDRs (highest density regions) as specified in "prob" (probability of density coverage). The M-values locating outside the range of the maximum probability specified in "prob" are plotted as dots, which is similar with the outliers in the regular boxplot. By default, parameter "prob" is set as c(seq(10,90, by=10), 95). This means the M-values outside of 95\% of HDR are plotted as outliers. Figure \ref{fig:boxplotM} shows the boxplot of M-values of example methylation data. The color depth represents the height of density. The short horizontal bar at the position with the deepest color corresponds to the maximum position of density distribution of each sample. Based on Figure \ref{fig:boxplotM}, we can see the distribution difference across samples. Because the methylation levels (measured by M-values or Beta-values) can have big changes across different conditions and treatment, only the methylation level distributions of technique or biological replicates are expected to have consistency. Therefore, we cannot claim a sample has quality issue if its distribution is quite different from others. \begin{figure} \centering <>= ## plot the density density(example.methyTitration, xlab="M-value") @ \caption{Density plot of M-value methylation levels of titration data before normalization} \label{fig:densityMTitration} \end{figure} \begin{figure} \centering <>= ## specify the colors of control and treatment samples sampleColor <- rep(1, ncol(example.lumiMethy)) sampleColor[grep("Treat", sampleNames(example.lumiMethy))] <- 2 density(example.lumiMethy, col=sampleColor, xlab="M-value") ## plot the density @ \caption{Density plot of M-value methylation levels before normalization} \label{fig:densityM} \end{figure} \begin{figure} \centering <>= ## Because the distribution of M-value has two modes, we use a boxplot different from regular ones boxplot(example.lumiMethy) @ \caption{Multi-mode boxplot of M-value methylation levels before normalization} \label{fig:boxplotM} \end{figure} \subsection{Check color balance} Based on the current Infinium assay design, Illumina uses two colors to label the final extended base following the hybridization of mehtylated or unmethylated probes. As a result, some of the CpG sites are measured in the red channel (final extended bases are A or T), whereas others are measured in the green channel (final extended bases are C or G). Note that the methylated and unmethylated probes of the same CpG site have the same color. Due to the difference in labeling efficiency and scanning properties of two color channels, the intensities measured in two color channels might be imbalanced. Because the methylation level is estimated based on the ratio of methylated and unmethylated probe intensities, many probe specific variations (like binding affinity and color balance) can be greatly reduced. However, because of the non-linearity of color effects, especially this effects will be different across different experiment conditions, color imbalance is not ignorable if the color effects are quite different across samples. Therefore, color balance adjustment will be important if the color effect is very inconsistent across samples. Moreover, the inconsistence of color balance is another indicator of sample quality problems. So checking color balance is another important measure of QC. The \Rpackage{lumi} package provides three different types of plotting functions to check color balance. Function \Rfunction{plotColorBias1D} plots the density of two color channels separately and shows them in red and green colors. By default, it pools both methylated and unmethylated probe intensities together, as shown in Figure \ref{fig:densityColorBiasBoth}. It can also separately plot the methylated and unmethylated probe intensities, as shown in Figure \ref{fig:densityColorBiasMethy} and \ref{fig:densityColorBiasUnmethy}, or plot the CpG-site Intensity (details shown in the next subsection). \begin{figure} \centering <>= plotColorBias1D(example.lumiMethy) @ \caption{Density plot of two color channels (both methylated and unmethylated probe intensities)} \label{fig:densityColorBiasBoth} \end{figure} \begin{figure} \centering <>= plotColorBias1D(example.lumiMethy, channel='methy') @ \caption{Density plot of two color channels (methylated probe intensities)} \label{fig:densityColorBiasMethy} \end{figure} \begin{figure} \centering <>= plotColorBias1D(example.lumiMethy, channel='unmethy') @ \caption{Density plot of two color channels (unmethylated probe intensities)} \label{fig:densityColorBiasUnmethy} \end{figure} Similarly, we defined boxplot function, which plot two color channels separately. Function \Rfunction{boxplotColorBias} separately plots boxplot of the red and green color channels. Like \Rfunction{plotColorBias1D}, we can select to plot methylated and unmethylated probe intensities, as shown in Figure \ref{fig:boxplotColorBiasMethy} and \ref{fig:boxplotColorBiasUnmethy}. Function \Rfunction{colorBiasSummary} can produce a quantile CpG-site Intensity summary of each sample. It has the same options of "channel" parameter. \begin{figure} \centering <>= boxplotColorBias(example.lumiMethy, channel='methy') @ \caption{Box plot of two color channels (methylated probe intensities)} \label{fig:boxplotColorBiasMethy} \end{figure} \begin{figure} \centering <>= boxplotColorBias(example.lumiMethy, channel='unmethy') @ \caption{Box plot of two color channels (unmethylated probe intensities)} \label{fig:boxplotColorBiasUnmethy} \end{figure} <>= ## summary of color balance information of individual samples colorBiasSummary(example.lumiMethy[,1:8], channel='methy') @ Function \Rfunction{plotColorBias2D} separately plots the methylated and unmethylated probe intensities in a 2-D scatter plot, and shows the interrogated CpG sites in red and green dots based on their color channels. Each time, \Rfunction{plotColorBias2D} can only plot one sample. By comparing the 2D scatter plots, we can easily see the details of color balance difference between samples. Figure \ref{fig:scatterColorBias1} shows one example of color imbalance. \begin{figure} \centering <>= plotColorBias2D(example.lumiMethy, selSample=1, cex=2) @ \caption{Scatter plot of methylated and unmethylated probe intensities to show color imbalance (example 1)} \label{fig:scatterColorBias1} \end{figure} \subsection{Quality assessment based on the distribution of CpG-site Intensity} Apart from color balance assessment, we propose to assess the sample quality based on the across sample distribution of the measured CpG site intensities. We defined the intensity of a measured CpG site, CpG-site Intensity, as the sum of methylated probe intensity and unmethylated probe intensity. As a single CpG site on one chromosome can only have two methylation status, if methylated probe for this CpG site has higher intensity, then the corresponding unmethylated probe should have lower intensity because only one probe (either methylated or unmethylated probe) can be bound on a particular CpG site on one copy of chromosome. That means the CpG-site Intensity is in proportional to the total copies of CpG sites. It also means that the methylation level (ratio between methylated and unmethylated probe intensities) changes should not have big effects on the CpG-site Intensity. Therefore, the CpG-site Intensity distribution of methylation microarray should still be similar across samples in different conditions if they have only methylation level difference, and we can check the sample quality based on CpG-site Intensity distribution. Figure \ref{fig:boxplotColorBiasSum} and \ref{fig:densityColorBiasSum} show the boxplot and density plot of two color channels. We can see it is much more obvious when using CpG-site Intensity to check color imbalance. Figure \ref{fig:densityIntensity} and \ref{fig:boxplotIntensity} show the CpG-site Intensity distribution of the example data. Because this example data has severe color imbalance, we can see the CpG-site Intensity distributions are similar but still have many differences. We can see the improvements after color balance adjustment. \begin{figure} \centering <>= boxplotColorBias(example.lumiMethy, channel='sum') @ \caption{Box plot of of CpG-site Intensity (two color channels were plotted separately)} \label{fig:boxplotColorBiasSum} \end{figure} \begin{figure} \centering <>= plotColorBias1D(example.lumiMethy, channel='sum') @ \caption{Density plot of CpG-site Intensity (two color channels were plotted separately)} \label{fig:densityColorBiasSum} \end{figure} \begin{figure} \centering <>= density(estimateIntensity(example.lumiMethy), xlab="log2(CpG-site Intensity)") @ \caption{Density plot of CpG-site Intensity} \label{fig:densityIntensity} \end{figure} \begin{figure} \centering <>= boxplot(estimateIntensity(example.lumiMethy)) @ \caption{Boxplot of CpG-site Intensity} \label{fig:boxplotIntensity} \end{figure} We can also use other QC plots for expression data to check the sample consistency of methylation data base on CpG-site Intensity. Figure \ref{fig:pairIntensity} shows pairwise plot of CpG-site Intensities of four samples, Ctrl1, Ctrl1.rep, Treat1 and Treat1.rep. Ctrl1, Ctrl1.rep and Treat1, Treat1.rep are technique replicates. We can clearly see two bands in the pairwise plot, which were caused by the difference of color imbalance between two batches. \begin{figure} \centering <>= ## get the color channel information colorChannel <- as.character(pData(featureData(example.lumiMethy))[, "COLOR_CHANNEL"]) ## replace the "Red" and "Grn" as color names defined in R colorChannel[colorChannel == 'Red'] <- 'red' colorChannel[colorChannel == 'Grn'] <- 'green' ## select a subet of sample for pair plot selSample <- c( "Ctrl1", "Ctrl1.rep", "Treat1", "Treat1.rep") ## plot pair plot with the dots in scatter plot colored based on the color channels pairs(estimateIntensity(example.lumiMethy[, selSample]), dotColor= colorChannel, main="Pair plot of CpG-site Intensity") @ \caption{Pair plot of CpG-site Intensity with colors} \label{fig:pairsColor} \end{figure} \subsection{Color balance adjustment} Based on the color balance assessment plots shown in previous two sections, we can see there are clear color balance differences between two batches in the example control and treatment dataset. That means we need to perform color balance adjustment before further analysis. Alternatively, we can separately process two color channels, but will course many inconveniences in the following up analysis. For example, we have to use different p-value or fold-change thresholds in identifying the differentially methylated CpG sites. Moreover, considering we may not have access to the color channel information later on, performing color balance adjustment first will make following up analysis much more convenient. The \Rpackage{lumi} package provides \Rfunction{lumiMethyC} function for color balance adjustment. The basic idea of color balance adjustment is to treat it as the normalization between two color channels. Function \Rfunction{lumiMethyC} provides two color balance adjustment methods: "quantile" and "ssn". Because two color channels have different number of probes and not match each other, we cannot directly apply regular "quantile" normalization used in expression microarray analysis. To solve this problem, we designed a \Rfunction{smoothQuantileNormalization} method. Basically, it requires the quantile normalization should be smooth. By doing this, the \Rfunction{smoothQuantileNormalization} can further resolve one major withdraw of regular quantile normalization, i.e., quantile normalization doesn't work well in the areas with sparse density, like two extremes of data distribution. Figure \ref{fig:densityColorBiasSumAdj} and \ref{fig:boxplotColorBiasSumAdj} show density and boxplot of two color channels after color balance adjustment using smooth quantile normalization. Figure \ref{fig:scatterColorBias1Adj} shows the scatter plots after color balance adjustment using smooth quantile normalization, we can see the distribution becomes very similar between two color channels. And the change of pair plot after color balance is more obvious by comparing Figure \ref{fig:pairsColorAdj} and Figure \ref{fig:pairsColor}. <>= ## summary of color balance information of individual samples lumiMethy.c.adj <- lumiMethyC(example.lumiMethy) @ \begin{figure} \centering <>= plotColorBias1D(lumiMethy.c.adj, channel='sum') @ \caption{Density plot of CpG-site Intensity (two color channels were plotted separately) after color balance adjustment} \label{fig:densityColorBiasSumAdj} \end{figure} \begin{figure} \centering <>= boxplotColorBias(lumiMethy.c.adj, channel='sum') @ \caption{Box plot of of CpG-site Intensity (two color channels were plotted separately) after color balance adjustment } \label{fig:boxplotColorBiasSumAdj} \end{figure} \begin{figure} \centering <>= ## plot the color balance adjusted scatter plot of two color channels plotColorBias2D(lumiMethy.c.adj, selSample=1, cex=2) @ \caption{Scatter plot of methylated and unmethylated probe intensities after color balance adjustment (example 1)} \label{fig:scatterColorBias1Adj} \end{figure} \begin{figure} \centering <>= ## plot pairwise plot after color balance adjustment pairs(estimateIntensity(lumiMethy.c.adj[, selSample]), dotColor= colorChannel, main="Pair plot of CpG-site Intensity") @ \caption{Pair plot of CpG-site Intensity with colors} \label{fig:pairsColorAdj} \end{figure} \subsection{Background level correction} Illumina provides negative control probes for the estimation of background levels. The information of the control probe information is kept in the controlData slot, which is a \Rclass{MethyLumiQC} object. When the controlData includes the negative control probe information, the background estimation will be the median of the negative control probes. Red and Green color channels will be estimated separately. However, in many cases the information of negative control probes usually was not outputted together with the methylation data. As a result, we have to find other methods to estimate the background levels of individual samples. As Illumina methylation has two color channels, the background levels of two color channel can be different due to color imbalance. So color balance adjustment is suggested to be performed before background adjustment, or background estimation should be separately performed in each color channel. The estimation of background level of Infinium methylaton microarray is based on the assumption that the lots of CpG sites are unmethylated, which results in a density mode of the intensities measured by methylated probes. The position of this mode represents the background level. Figure \ref{fig:bgDensityMethy} shows the background mode of methylated probe data of first five example samples. We can see the differences between two color channels. Function \Rfunction{estimateMethylationBG} estimates the background level of each sample. Function \Rfunction{bgAdjustMethylation} adjusts the background levels based on the estimation returned by \Rfunction{estimateMethylationBG}. The \Rfunction{lumiMehtyB} function is a general function for background correction, which by default call \Rfunction{estimateMethylationBG} function. If we directly perform background adjustment on the raw data, we have to perform background adjustment separately on two color channels, which can be done by setting the parameter "separateColor" of \Rfunction{lumiMethyB} as TRUE. Figure \ref{fig:bgAdjDensityMethy} shows the density plot of methylated probes after background correction with two color channels processed separately. We can see the density modes of two color channels overlapping each other after background correction performed separately performed in each color channel. Alternatively, we can perform background adjustment after color balance adjustment, as shown in Figure \ref{fig:bcAdjDensityMethy}. <>= ##separately adjust backgrounds of two color channels lumiMethy.b.adj <- lumiMethyB(example.lumiMethy, method="bgAdjust2C", separateColor=TRUE) ##background adjustment of individual samples lumiMethy.bc.adj <- lumiMethyB(lumiMethy.c.adj, method="bgAdjust2C") @ \begin{figure} \centering <>= ## plot the background mode of methylated probe data of first five example samples plotColorBias1D(example.lumiMethy[,1:5], channel='methy', xlim=c(-1000,5000), logMode=FALSE) @ \caption{Background mode shown in the density plot of methylated probes} \label{fig:bgDensityMethy} \end{figure} \begin{figure} \centering <>= ## plot the background mode of methylated probe data of first five example samples plotColorBias1D(lumiMethy.b.adj [,1:5], channel='methy', xlim=c(-1000,5000), logMode=FALSE) @ \caption{Background mode shown in the density plot of methylated probes after background adjustment (separately in two color channel)} \label{fig:bgAdjDensityMethy} \end{figure} \begin{figure} \centering <>= ## plot the background mode of methylated probe data of first five example samples plotColorBias1D(lumiMethy.bc.adj [,1:5], channel='methy', xlim=c(-1000,5000), logMode=FALSE) @ \caption{Background mode shown in the density plot of methylated probes after background and color adjustment} \label{fig:bcAdjDensityMethy} \end{figure} \subsection{Data normalization} Because the total amount of CpG methylation can differ significantly from sample to sample in different conditions, many assumptions used by mRNA expression microarrays are not valid in processing DNA methylation data. So directly applying the normalization methods designed for expression microarray to the methylation data, like M-value or Beta-value, is inappropriate. To circumvent this difficulty, we perform normalization at the probe level, i.e., normalize the intensities of methylated and unmethylated probes instead of the summarized methylation levels. The \Rpackage{lumi} package provides \Rfunction{lumiMethyN} function for probe level normalization. Currently, it supports two methods, "ssn" and "quantile". Users can also provide there own normalization methods, as long as it imports and outputs a data matrix. See subsection " Use user provided preprocessing functions" of for more details. The "ssn" (simple scaling normalization) method first estimates the background level use function \Rfunction{estimateMethylationBG}, then scale the background subtracted data to the same average intensity level, and finally adjust the background level to a user specified level. The assumption of "quantile" normalization is that the intensity distribution of the pooled methylated and unmethylated probes, as shown in Figure \ref{fig:densityColorBiasBoth}, are the similar for different samples. The quantile normalization itself is the same as expression microarrays. To check the effectiveness of normalization, we first check the sample relations after SSN normalization, as shown in Figure \ref{fig:sampleRelationClusterNormalized}. We can see most of the technique replicates are clustered together after normalization. Comparing with the sample relations of the raw data shown in Figure \ref{fig:sampleRelationCluster}, the improvement is obvious. Figure \ref{fig:densitySSN} and \ref{fig:densityQuantile} show the density plots of methylation levels after SSN and quantile normalization, separately. The red and black colors represent "Treatment" and "Control" samples, respectively. We can see the mode positions of normalized data are more consistent across samples while keeping the methylation differences in the middle methylation range. Figure \ref{fig:densityIntensityNormalizedSSN} and \ref{fig:densityIntensityNormalizedQ} show the density plot of CpG-site Intensity after SSN and quantile normalization, respectively. Comparing with the density plot of CpG-site Intensity before preprocessing, shown in Figure \ref{fig:densityIntensity}, we can see they are much more consistent across samples. We can see the distribution of CpG-site Intensity of quantile normalized data is more consistent than SSN normalized data. This is because the assumption of quantile normalization is more aggressive than SSN, it assumes the distribution of pooled intensities of methylated and unmethylated probes are the same. The selection of SSN and quantile normalization is similar with the case in expression microarray. It depends on the data itself and the purpose of analysis. Figure \ref{fig:boxplotColorBiasNormalized} shows the color bias boxplot after preprocessing. We can see they are very consistent across samples. Finally, we show the pair plot of M-value after preprocessing. We expect there is little difference between technique replicates, but more difference between "Treatment" and "Control" samples, this is exactly what is shown in Figure \ref{fig:pairMNormalize}. All of these results show the color balance adjustment plus normalization is effective. <>= ## Perform SSN normalization based on color balance adjusted data lumiMethy.c.ssn <- lumiMethyN(lumiMethy.c.adj, method='ssn') ## Perform quantile normalization based on color balance adjusted data lumiMethy.c.quantile <- lumiMethyN(lumiMethy.c.adj, method='quantile') @ \begin{figure} \centering <>= plotSampleRelation(lumiMethy.c.ssn, method='cluster', cv.Th=0) @ \caption{Overall sample relations shown as a hierarchical tree} \label{fig:sampleRelationClusterNormalized} \end{figure} \begin{figure} \centering <>= ## plot the density of M-values after SSN normalization density(lumiMethy.c.ssn, col= sampleColor, main="Density plot after SSN normalization") @ \caption{Density plot of M-values after SSN normalization} \label{fig:densitySSN} \end{figure} \begin{figure} \centering <>= ## plot the density of M-values after quantile normalization density(lumiMethy.c.quantile, col= sampleColor, main="Density plot after quantile normalization") @ \caption{Density plot of M-values after quantile normalization} \label{fig:densityQuantile} \end{figure} \begin{figure} \centering <>= density(estimateIntensity(lumiMethy.c.ssn), col= sampleColor, xlab="log2(CpG-site Intensity)") @ \caption{Density plot of CpG-site Intensity after quantile SSN normalization} \label{fig:densityIntensityNormalizedSSN} \end{figure} \begin{figure} \centering <>= density(estimateIntensity(lumiMethy.c.quantile), col= sampleColor, xlab="log2(CpG-site Intensity)") @ \caption{Density plot of CpG-site Intensity after quantile normalization} \label{fig:densityIntensityNormalizedQ} \end{figure} \begin{figure} \centering <>= boxplotColorBias(lumiMethy.c.quantile, channel='sum') @ \caption{Box plot of of CpG-site Intensity (two color channels were plotted separately) after normalization} \label{fig:boxplotColorBiasNormalized} \end{figure} \begin{figure} \centering <>= ## select a subet of sample for pair plot selSample <- c( "Ctrl1", "Ctrl1.rep", "Treat1", "Treat1.rep") ## plot pair plot with the dots in scatter plot colored based on the color channels pairs(lumiMethy.c.quantile[, selSample], dotColor= colorChannel, main="Pair plot of M-value after normalization") @ \caption{Pair plot of M-value after normalization} \label{fig:pairMNormalize} \end{figure} \subsection{Modeling the methylation status} Comparing with the Beta-value, we cannot directly know the methylation status of the CpG site based on the M-value. By checking the density plots of M-values, as shown in Figure \ref{fig:densityIntensityNormalizedQ} and Figure \ref{fig:densityIntensityNormalizedSSN}, we can see the density plot has two obvious modes, with one corresponds to the unmethylated CpG sites and another one corresponds to the methylated CpG sites. Based on this observation, we developed an algorithm to estimate the methylation status by fitting a two component Gamma mixture model. Function \Rfunction{gammaFitEM} estimates the parameters of gamma mixture models using EM algorithm. Figure \ref{fig:gammaFit} shows the fitted two component gamma mixture model in comparison with the histogram of the first sample in example.lumiMethy dataset. \begin{figure} \centering <>= ## Fit the two component gamma mixture model of the first sample of example.lumiMethy fittedGamma <- gammaFitEM(exprs(example.lumiMethy)[,1], plotMode=TRUE) @ \caption{Compare the fitted two component gamma mixture model with the histogram of the first sample of example.lumiMethy} \label{fig:gammaFit} \end{figure} Function \Rfunction{methylationCall} estimates the methylation status based on the fitting results of \Rfunction{gammaFitEM}. Three methylation statuses are reported: "Unmethy" (unmethylation posterior probability > unmethylation threshold), "Methy" (methylation posterior probability > methylation threshold), or "Margin". By default, both unmethylation and methylation probability thresholds are 0.95. Following is the summary of methylation status of the first sample in example.lumiMethy dataset: <>= ## estimate the methylation status based on the results of gammaFitEM methyCall <- methylationCall(fittedGamma) table(methyCall) @ The inputs of both \Rfunction{gammaFitEM} and \Rfunction{methylationCall} are a M-value vector of a particular sample. Function \Rfunction{lumiMethyStatus} is a wrap function of \Rfunction{methylationCall} to process a \Rclass{MethyLumiM} object. It returns a methylation status matrix of the same dimension as the input \Rclass{MethyLumiM} object. The methylation probability matrix is kept as the "probability" attributes of the result. Following is an example: <>= ## estimate the methylation status of a LumiMethyM object methyCall <- lumiMethyStatus(example.lumiMethy[,1:4]) head(methyCall) ## retrieve the methylation probability matrix methyProb <- attr(methyCall, "probability") head(methyProb) @ \subsection{Use user provided preprocessing functions} For convenience to the users, the user can specify their own preprocessing function when call \Rfunction{lumiMethyB}, \Rfunction{lumiMethyC} and \Rfunction{lumiMethyN}. The input and output of the user provided function should be a intensity matrix (pool of methylated and unmethylated probe intensities). Following are some examples of using user defined preprocessing functions. <>= ## suppose "userB" is a user defined background adjustment method lumiMethy.b.adj <- lumiMethyB(example.lumiMethy, method=userB, separateColor=TRUE) # not Run ## suppose "userC" is a user defined color balance adjustment method lumiMethy.c.adj <- lumiMethyC(example.lumiMethy, method=userC, separateColor=TRUE) # not Run ## suppose "userN" is a user defined probe level normalization method lumiMethy.c.n <- lumiMethyN(lumiMethy.c.adj, method= userN, separateColor=TRUE) # not Run @ \subsection{Options to separately process each color channel} Functions \Rfunction{lumiMethyB}, \Rfunction{lumiMethyC} and \Rfunction{lumiMethyN} also has the option to separately process in each color channel. The presumption is that the data (\Rclass{MethyLumiM} object) should include the color channel column (with column name "COLOR\_CHANNEL") in the featureData and the green and red colors are represented as "Red" and "Grn". If the GenomeStudio (BeadStudio) output the annotation information together with the data (we recommend it), the \Rfunction{lumiMethyR} function will automatically add the annotation information, including color channel information, to the featureData slot. If the data does not include color channel information, users can add it use function \Rfunction{addAnnotationInfo}. <>= ## retrieve the featureData information ff <- pData(featureData(example.lumiMethy)) ## show the color channel information head(ff) ## add user provided color channel information if it is not existed in the featureData # example.lumiMethy <- addAnnotationInfo(example.lumiMethy, lib="IlluminaHumanMethylation27k.db") @ Here are examples of separately processing each color channel: <>= ## suppose "userB" is a user defined background adjustment method lumiMethy.b.adj <- lumiMethyB(example.lumiMethy, separateColor=TRUE) ## suppose "userC" is a user defined color balance adjustment method lumiMethy.c.adj <- lumiMethyC(example.lumiMethy, separateColor=TRUE) ## suppose "userN" is a user defined probe level normalization method lumiMethy.c.n <- lumiMethyN(lumiMethy.c.adj, separateColor=TRUE) @ \subsection{About the detection p-value of a CpG site} The detection p-values of methylation arrays reflect the strength of DNA hybridization over the background (comparing the CpG-intensity with the intensities of negative control probes). Because all genomic DNA is expect existing in a normal diploid sample, almost all detection p-values are significant. Non-significant detection p-values usually means bad probe design, bad hybridization or possible chromosome abnormalities (like mutations, indels) at the probe matching locations. Given one sample in the titration data set as an example, the 99 percentile of the -log10(detection p-value) is 28.3, which is extraordinarily significant p-value. And there are only 27 probes with detection p-value worse than 0.0001. This is in great contrast to expression microarrays, where up to 40 percent or more of the genes might not be expressed in a given sample, and filtering based on detection p-values can dramatically reduce the false-positive rate for the expression data. However, removing the bad quality CpG measurements is still an important step during preprocessing of Illumina methylation data. Same as Illumina Expression data, we can use function \Rfunction{detectionCall} to estimate the detection call. Please check the help of \Rfunction{detectionCall} for more details of its usage. <>= ## Estimate the detection call of a CpG site presentCount <- detectionCall(example.lumiMethy) @ \section{Gene annotation} The annotation package, \Rpackage{IlluminaHumanMethylation27k.db}, of the HumanMethylation27 BeadChip can be downloaded from Bioconductor. The usage of this package is the same as the expression microarrays. \section{Performance comparison} To be added in the near future ... \section{A use case: from raw data to functional analysis} \subsection{Preprocessing the Illumina Infinium Methylation microarray data} <>= library(lumi) ## specify the file name # fileName <- 'Example_Illumina_Methylation_profile.txt' ## load the data # example.lumiMethy <- lumiMethyR(fileName, lib="IlluminaHumanMethylation27k.db") # Not Run ## Quality and color balance assessment data(example.lumiMethy) ## summary of the example data example.lumiMethy ## preprocessing and quality control after normalization plotColorBias1D(example.lumiMethy, channel='sum') boxplotColorBias(example.lumiMethy, channel='sum') ## select interested sample to further check color balance in 2D scatter plot plotColorBias2D(example.lumiMethy, selSample=1) ## Color balance adjustment between two color channels lumiMethy.c.adj <- lumiMethyC(example.lumiMethy) ## Check color balance after color balance adjustment boxplotColorBias(lumiMethy.c.adj, channel='sum') ## Background adjustment is skipped because the SSN normalization includes background adjustment ## Normalization ## Perform SSN normalization based on color balance adjusted data lumiMethy.c.ssn <- lumiMethyN(lumiMethy.c.adj, method='ssn') ## Or we can perform quantile normalization based on color balance adjusted data # lumiMethy.c.q <- lumiMethyN(lumiMethy.c.adj, method='quantile') ## plot the density of M-values after SSN normalization density(lumiMethy.c.ssn, main="Density plot of M-value after SSN normalization") ## comparing with the density of M-values before normalization density(example.lumiMethy, main="Density plot of M-value of the raw data") ## output the normlized M-value as a Tab-separated text file write.exprs(lumiMethy.c.ssn, file='processedMethylationExampleData.txt') @ \subsection{Identify differentially methylated genes and Functional analysis} The identification of differentially methylated genes and Functional analysis are the same as expression data, shown in the lumi.pdf. Here we will not repeat this part. \subsection{GEO submission of the methylation data} The submission of methylation data is similar with expression microarray data. The submission file will be in the SOFT format. So users can submit all the data in a batch. We also use \Rfunction{produceGEOSampleInfoTemplate} to produce a template of sample information with some default fillings. Some fields have been filed in with default settings. Users should fill in or modify the detailed sample descriptions by referring some previous submissions. No blank fields are allowed. Users are also required to fill in the "Sample\_platform\_id" by checking information of the GEO Illumina platform. \Rfunction{produceMethylationGEOSubmissionFile} is the main function of produce GEO submission file including both normalized and raw methylation data information in the SOFT format. By default, the R objects, methyLumiM, methyLumiM.raw and sampleInfo, will be saved in a file named 'supplementaryData.Rdata'. Users can include this R data file as a GEO supplementary data file. \begin{Sinput} ## Produce the sample information template > produceGEOSampleInfoTemplate(methyLumiM, fileName = "GEOsampleInfo.txt") ## After editing the 'GEOsampleInfo.txt' by filling in sample information > produceMethylationGEOSubmissionFile(methyLumiM, methyLumiM.raw, sampleInfo='GEOsampleInfo.txt') \end{Sinput} \section{Session Info} <>= toLatex(sessionInfo()) @ \section{Acknowledgement} We appreciate Dr. Sean Davis from NIH developed methylumi package and built IlluminaHumanMethylation27k.db package for us! This work was supported in part by the NIH award 1RC1ES018461-01 (to Dr. Lifang Hou), P30CA060553 and UL1RR025741. We would like to thank Lifang Hou for supporting this work, Xiao Zhang for preparing the titration samples, Nadereh Jafari and Vivi Frangidakis for conducting the Illumina BeadChip experiments. \section{References} 1. Du P, Kibbe WA and Lin SM: "lumi: a Bioconductor package for processing Illumina microarray" Bioinformatics 2008 24(13):1547-1548 2. Du P, Zhang X, Huang CC, Jafari N, Kibbe WA, Hou L, and Lin SM: "Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis", BMC Bioinformatics 2010, 11:587 %\bibliographystyle{plainnat} %\bibliography{lumi} \end{document}