%\VignetteIndexEntry{The \Rpackage{ChAMP} Package} %\VignetteKeywords{Illumina DNA methylation normalisation array normalization performance } %\VignettePackage{ChAMP} %% http://www.bioconductor.org/help/course-materials/2010/AdvancedR/BuildPackage.pdf \documentclass[11pt]{article} \usepackage{Sweave} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \usepackage{natbib} \usepackage{url} \title{The \Rpackage{ChAMP} Package} \author{Morris TJ, Butcher L, Feber A, Teschendorff A, Chakravarthy A, Beck S} \begin{document} \maketitle \begin{figure}[!tpb]%logo \centerline{\includegraphics[scale=2.5]{logo4.jpg}}%{system.file("extdata","logo4.jpg",package="ChAMP")}} \end{figure} The \Rpackage{ChAMP} package is a pipeline that integrates currently available 450k analysis methods and also offers its own novel functionality. It utilises the data import, quality control and SWAN \citep{Maksimovic12} normalization functions offered by the \Rpackage{minfi} \citep{minfi} package. In addition, the \Rpackage{ChAMP} package includes the Peak Based Correction (PBC) method \citep{Dedeurwaerder11} and the BMIQ normalization \citep{Teschendorff13} is set as the default method. A number of other pipelines and packages are available for 450k analysis including \Rpackage{IMA} \citep{Wang12IMA}, \Rpackage{minfi} \citep{minfi}, \Rpackage{methylumi} \citep{methylumi}, \Rpackage{R n Beads} \citep{rnbeads} and \Rpackage{watermelon} \citep{Pidsley13}. However, \Rpackage{ChaMP} includes several analysis options that are not available in other packages. The singular value decomposition (SVD) method \citep{Teschendorff09} allows an in-depth look at batch effects and for correction of this the ComBat method \citep{Johnson07} has been implemented. For the identification of differentially methylated regions (DMRs) \Rpackage{ChAMP} offers the new Probe Lasso method. Finally, \Rpackage{ChAMP} has an additional function to analyse 450k for copy number aberrations \citep{Feber13}. %mention shinymethy \section{Installation} It is essential that you have R already installed on your computer. \Rpackage{ChAMP} is a pipeline that utilises many Bioconductor packages that are currently available from CRAN and Bioconductor. For all of the steps of the pipeline to work make sure that you have installed \Rpackage{minfi}, \Rpackage{DNAcopy}, \Rpackage{impute}, \Rpackage{marray}, \Rpackage{limma}, \Rpackage{preprocessCore}, \Rpackage{RPMM}, \Rpackage{sva} and \Rpackage{IlluminaHumanMethylation450kmanifest}.\\ \\ This can be done in one go using the following commands:\\ <>= source("http://bioconductor.org/biocLite.R") biocLite(c('minfi', 'DNAcopy', 'impute', 'marray', 'limma', 'preprocessCore', 'RPMM', 'sva', 'IlluminaHumanMethylation450kmanifest','wateRmelon')) @ Load the \Rpackage{ChAMP} package. \\ <>= library(ChAMP) @ It is then easier to set your working directory to the folder containing your .idat files and sample sheet. There can only be one sample sheet in this folder. \\ \section{Test Dataset} The package contains a test dataset of HumanMethylation450 data which can be used to test functions available in \Rpackage{ChAMP}.\\ This can be loaded by pointing the directory to the testDataSet:\\ <>= testDir=system.file("extdata",package="ChAMPdata") @ Also a pre-filtered saved version can be loaded using <>= data(testDataSet) myLoad=testDataSet @ \section{Full Pipeline} Figure \ref{fig:pipeline} outlines the steps in the \Rpackage{ChAMP} pipeline. Each step can be run individually as a separate function. This allows integration with other analysis pipelines. It also enables the user to save the results of each step for future reference or further analysis. Alternatively the full pipeline can be run at once with one command:\\ <>= champ.process(directory = testDir) @ When running the full pipeline through the champ.process() function a number of parameters can be adjusted.\\ \begin{figure}[!tpb]%pipeline \centerline{\includegraphics[scale = 0.55]{figure1.pdf}}%{system.file("extdata","figure1.pdf",package="ChAMP")}} \caption{An overview of the \Rpackage{ChAMP} pipeline showing all major steps available in the pipeline. This steps can be run as a full pipeline or individually as functions combined with other packages. This vignette will explain in detail how to proceed.} \label{fig:pipeline} \end{figure} \section{A note on computational requirements} The ability to run the pipeline on a large number of samples depends somewhat on the memory available. The \Rpackage{ChAMP} pipeline runs 200 samples successfully on a computer with 8GB of memory. Beyond this it may be necessary to find a server/cluster to run the analysis on. The champ.load() function uses the most memory. If you plan to run the analysis more than once it is recommended to to run myLoad=champ.load() and save this list for future analyses. In this case, when the list of load objects is saved to the variable 'myLoad' you can simply run champ.process(fromIDAT=F). <>= save(myLoad,file="currentStudyloadedData.RData") load("currentStudyloadedData.RData") champ.process(fromIDAT=FALSE) @ %Add info on combining datasets... Another option if you have time or space constraints or if you are combining \Rpackage{ChAMP} with other analysis pipelines is to run the analysis step by step and not use the champ.process() function. Each separate function is described in detail below, but the full pipeline in a step-wise process would go: <>= myLoad=champ.load(directory = testDir) myNorm=champ.norm() champ.SVD() batchNorm=champ.runCombat() limma=champ.MVP() lasso=champ.lasso(fromFile =TRUE, limma=limma) champ.CNA() @ \section{Description of \Rpackage{ChAMP} functions} As previously mentioned, the user has the option to run each of the \Rpackage{ChAMP} functions individually to allow integration with other pipelines or to save the results of each step. Here each function is described in detail. \subsection{Load} \subsubsection{Load Data from idat files} The champ.load() function utilises \Rpackage{minfi} to load data from .idat files. By default this loads data from the current working directory, in this directory you should have all .idat files and a sample sheet. The sample sheet currently needs to be a .csv file as came with your results following hybridization. You can choose whether you want M-values or beta-values. For small datasets M-values are recommended \citep{Zhuang12}. The newest version of \Rpackage{minfi} automatically filters out the 65 SNP probes that were included on the chip as internal controls which can useful for identifying sample mixups. As such, before filtering for low quality probes the dataset will include 485,512 probes. <>= myLoad$pd @ %\subsubsection{Load Data from Genome Studio files} \subsubsection{Filtering for failed probes} By default \Rpackage{ChAMP} filters the data for detection p-value (< 0.01). This utilises \Rpackage{minfi} method for calculating the detection p-value which differs from the method used in Genome Studio. A file failedSamples.txt is saved (see Figure \ref{fig:failed}) and also printed to the screen showing the fraction of failed probes per sample. If any of these values is high (>0.05) you may want to consider removing that sample from the analysis and rerunning. The option to filter out probes with <3 is available by changing the filterBeads parameter to TRUE. The default will filter probes with <3 beads in at least 5\% of samples, but this can be adjusted with the beadCutoff parameter. <>= myLoad=champ.load(directory = testDir, filterBeads=TRUE) @ \subsubsection{Output} The load function saves 3 quality control images (see Figures \ref{fig:density}, \ref{fig:mds} and \ref{fig:cluster}. The clustering image will not be saved if there are more than 65 samples in the dataset. \begin{figure}[!ht]%densityPlot \centerline{\includegraphics[scale = 0.55]{densityPlot.jpg}}%{system.file("extdata","densityPlot.jpg",package="ChAMP")}} \caption{An example of a density plot showing the density of unnormalised beta values for all the samples that have been uploaded. They are coloured based on the Sample Group as defined in the sample sheet (Figure \ref{fig:sampleSheet}). This plot may identify any samples with significantly different profiles.} \label{fig:density} \end{figure} \begin{figure}[!ht]%mdsPlot \centerline{\includegraphics[scale = 0.55]{mdsPlot.jpg}}%{system.file("extdata","mdsPlot.jpg",package="ChAMP")}} \caption{An example of a MDS (multidimensional scaling) plot. This plot allows a visualisation of the similarity of samples based on the top 1000 most variable probes amongst all samples. The samples are coloured by Sample\_Group (as defined in the sample sheet Figure sampleSheet. } \label{fig:mds} \end{figure} \begin{figure}[!ht]%clusterPlot \centerline{\includegraphics[scale = 0.55]{rawSampleCluster.jpg}}%{system.file("extdata","rawSampleCluster.jpg",package="ChAMP")}} \caption{An example of a cluster plot. This offers a second way to visualise the similarity of samples based on all probes using hierarchical clustering. } \label{fig:cluster} \end{figure} \subsubsection{Usage} <>= myLoad = champ.load(directory=testDir) @ \subsection{Normalization for adjustment of type-2 bias} There are several normalization methods available: BMIQ \citep{Teschendorff13}, SWAN \citep{Maksimovic12}, PBC \citep{Dedeurwaerder11} or NONE. SQN \citep{Touleimat12} is not yet implemented. The default method is BMIQ. It will save three quality control images to the folder '/Normalization' for each sample (see Figures \ref{fig:type1}, \ref{fig:type2} and \ref{fig:check}). These images are showing the fit that the normalization is applying to each probe type. After normalization a second set of quality control images will be saved similar to pre-normalization (see Figures \ref{fig:density}, \ref{fig:mds} and \ref{fig:cluster}). The clustering image will not be saved if there are more than 65 samples in the dataset. \begin{figure}[!ht]%type1 \centerline{\includegraphics[scale = 0.55]{type1fit.jpg}}%{system.file("extdata","type1fit.jpg", package="ChAMP")}} \caption{An example of the Type 1 fit with BMIQ.} \label{fig:type1} \end{figure} \begin{figure}[!ht]%type2 \centerline{\includegraphics[scale = 0.55]{type2fit.jpg}}%{system.file("extdata","type2fit.jpg",package="ChAMP")}} \caption{An example of the Type 2 fit with BMIQ. } \label{fig:type2} \end{figure} \begin{figure}[!ht]%check \centerline{\includegraphics[scale = 0.55]{checkBMIQ.jpg}}%{system.file("extdata","checkBMIQ.jpg",package="ChAMP")}} \caption{An overview of both type 1 and 2 fit achieved with BMIQ. } \label{fig:check} \end{figure} \subsubsection{Usage} <>= myNorm=champ.norm() @ \subsection{SVD for identification of components of variation (batch effects)} The singular value decomposition method (SVD) implemented by Teschendorff \citep{Teschendorff09} for 27k data is used to identify the most significant components of variation. These components of variation would ideally be biological factors of interest, but it may be technical variation (batch effects). To get the most from this analysis it is useful to include as much information as possible. If samples have been loaded from .idat files then the 18 internal controls on the bead chip (including bisulphite conversion efficiency) are included as well as any study details defined in the sample sheet (well position, sentrix id, batch etc) (Figure \ref{fig:sampleSheet}). Additional phenotype/study information (age, gender, batch) can be included in a separate file (set studyInfo=TRUE) (Figure \ref{fig:studyInfo}). This must be a .txt file saved as studyInfo.txt and the first column must be labelled "Sample\_Name" with the sample names as used in the sample sheet. These new covariates will be considered categorical by default. If any are continuous it is necessary to include a list in the parameter infoFactor=c() that defines them. TRUE = categorical, FALSE=continuous. These settings will print to the screen after the SVD analysis has been run to confirm they were correctly assigned. It is important to realise SVD does not manipulate the data, but analyses it to compute p-values based on the significance of each component. The result is a heatmap (saved as SVDsummary.pdf) of the top 6 principle components correlated to the information provided (Figure \ref{fig:SVD}). The darker colours represent a lower p-value indicating a larger component of variation. If it becomes clear from this SVD analysis that the largest components of variation are technical factors (batch effects) then it is worth considering the experimental design and implementing other normalization methods that may help remove technical variation. ComBat is included in this pipeline to remove variation related to chip number but it or other methods may be implemented independently to remove other sources of technical variation revealed in the SVD analysis. \begin{figure}[!ht]%study info \centerline{\includegraphics[scale=0.9]{studyInfo.jpg}}%{system.file("extdata","studyInfo.jpg",package="ChAMP")}} \caption{An example of the study info file. This file must include the first column "Sample\_Name" and be identical to the first column of the sample sheet (Figure \ref{fig:sampleSheet}). It should be saved as studyInfo.txt and can include any addition study information including gender, age, batch, cell type, patientID etc.} \label{fig:studyInfo} \end{figure} \begin{figure}[!ht]%SVD \centerline{\includegraphics[scale = 0.55]{SVDsummary.pdf}}%{system.file("extdata","SVDsummary.pdf",package="ChAMP")}} \caption{An example of the SVD output. This heatmap shows all the components of variation that have been included in the analysis. The first 18 are internal controls on the bead chip representing potential technical variation, The remaining covariates are obtained from data in the sample sheet (Figure \ref{fig:sampleSheet}) and the study info file (Figure \ref{fig:studyInfo}).} \label{fig:SVD} \end{figure} \subsubsection{Usage} <>= champ.SVD() @ \subsection{Correction for batch effects related to bead chip using ComBat} This function implements the ComBat normalization method \citep{Johnson07} that was developed for microarrays. The \Rpackage{sva} R package is used to implement this function. ComBat is specifically defined in the \Rpackage{ChAMP} package to correct for batch effects related to the slide (Sentrix\_ID). More advanced users can implement ComBat using \Rpackage{sva} documentation to adjust for other batch effects. \subsubsection{Procedure} ComBat uses an empirical Bayes method to correct for technical variation. If ComBat were applied directly to beta values zeros could be introduced as beta values have a range between 0 and 1. For this reason \Rpackage{ChAMP} logit transforms beta values before ComBat adjustment and then computes the reverse logit transformation following the ComBat adjustment. If the user has chosen to use M-values no logit transformation will be done. If the dataset only includes data from one slide the adjustment will abort. As the CNA method utilises intensity values rather than beta or M -values the ComBat adjustment will be repeated for that function. The ComBat function can be a time consuming step in the pipeline if you have a large number of samples. \subsubsection{Usage} <>= batchNorm=champ.runCombat() @ \subsection{Calling MVPs - Methylation Variable Positions} This function implements the \Rpackage{limma} package to calculate the p-value for differential methylation using a linear model. At present, the function can only compute differential methylation between two groups. The two groups are determined by the Sample\_Group column in the sample sheet. If more than two groups are present the function will calculate the differential methylation between the first two unique groups in the file. As the function is running it will print which two groups the contrast is being computed for and will then printout how many MVPs were found for the given p-value. All probes are saved in a file "MVP\_ALL.txt" which can be filtered for a p-value of interest. The filename also includes the two groups compared and the method used to adjust the p-values (for example "MVP\_ALL\_CvsT\_BHadjust.txt"). The file includes the annotation for each probe, the average beta (for the sample group) and the delta beta for the two groups used in the comparison. In addition, a bedfile is saved with only the significant MVPs. This may be useful for downstream pathway analysis. It is planned that a future version of \Rpackage{ChAMP} will include the ability to compare multiple groups. However, a more advanced user can run \Rpackage{limma} with other settings and use the output (including all probes and their p-values) for the DMR hunter. %bedfile \subsubsection{Usage} <>= limma=champ.MVP() head(limma) @ \subsection{DMR Hunter - Probe Lasso} This function computes and returns a data frame of probes binned into discreet differentially methylated regions (DMRs), with accompanying p-value. The data frame is distilled from a \Rpackage{limma} output of probes and their association statistics. It additionally writes three informative images as pdfs: 1) a box plot of probe spacing for the input data as a function of genomic feature/CGI relation (see Figure \ref{fig:features}); 2) a cumulative quantile plot illustrating how the user-specified parameters affect the sizes of all windows employed for DMR-calling (see Figure \ref{fig:dmr}) and; 3) a dot plot with size-scaled dots to further illustrate the resulting window sizes for each genomic feature that follows from the user-specified parameters (see Figure \ref{fig:radius}). Differentially methylated regions (DMRs) are extended and discreet segments of the genome that show a quantitative alteration in DNA methylation levels between two groups. A number of different approaches to the identification of DMRs have been implemented, most notably ''windows'' in which differential methylation is sought over a stretch of DNA of pre-defined - and often fixed - size. This approach is utilitarian and effective but is confounded by the fluctuating CpG density throughout a genome; it is further limited when using microarrays, in that coverage is often incomplete and non-uniform. Consequently, DMR identification is likely to be biased towards regions of densely tiled probes. To compensate for this the Probe Lasso DMR Hunter is based on a feature-oriented dynamic window ("lasso"), which aims to capture neighbouring, significant probes and bundle them into DMRs. \subsubsection{Procedure} Because the Probe Lasso DMR Hunter takes into account probe spacing, the first step to identify DMRs is to calculate a dataset-specific record of probe spacing. Although it is tempting to calculate probe spacing based on all 485,577 probes on the Infinium 450k BeadChip, the Probe Lasso DMR Hunter requires input for a number of user-specified parameters (e.g., filtering of X-chromosome probes and putatively polymorphic probes) that can truncate the dataset; additionally, probe failures are inevitable and will differ between experiments. As such, each dataset is bestowed a unique set of probes, which underscores the need for an experiment-specific catalogue of probe-spacing. Once the dataset has been defined, the nearest neighbouring probe is calculated for all available probes; this data is then partitioned into one of 28 different categories comprising information on the genomic feature (1st exon, 3$\prime$UTR, 5$\prime$UTR, body, intergenic region, TSS200, or TSS1500) and that features relation (if any) to a nearby CpG island (island, shore, shelf, or not associated). Figure \ref{fig:features} illustrates the wildly variable probe spacing as a function of genomic feature on the Infinium 450k BeadChip. Data were derived from an experiment in which the X-chromosome was omitted, polymorphic probes were included, and >98\% of probes were detected across all samples. \begin{figure}[!ht] \centerline{\includegraphics[scale=0.75]{probeFeatures.jpg}}%{system.file("extdata","probeFeatures.jpg",package="ChAMP")}} \caption{A study specific plot showing the number of probes found in each feature. Note the different spacing on the 450k array between neighbouring probes for different features (1st exon, 3$\prime$UTR, 5$\prime$UTR, gene body, intergenic region, transcription start sites) and their relation to the nearest CpG island (in the island, in a shore, in a shelf, not associated). Intergenic regions, 3$\prime$UTRs and gene bodies with no CGI relation are very sparsely spaced; whereas probes in the 1st exon and TSSs are very closely spaced.} \label{fig:features} \end{figure} The user then specifies a maximum (or minimum) lasso size (bp) and the Probe Lasso DMR Hunter determines which feature/CGI category fulfils this criterion first and what quantile of the relevant feature/CGI category it corresponds to. This quantile is applied to all feature/CGI categories to define feature/CGI category-specific lasso sizes (see Figures \ref{fig:dmr} and \ref{fig:radius}). \begin{figure}[!ht] \centerline{\includegraphics[scale=0.75]{DMR.jpg}}%{system.file("extdata","DMR.jpg",package="ChAMP")}} \caption{Defining the lasso for each feature type. This illustrates when someone has specified a minimum lasso of 10bp and how this sets the quantile for all feature/CGI relation-specific lassos to \~48\%. These lassos are then cast out (centred on each probe) to capture a minimum number of significant probes. Probes that fulfil this are set aside.} \label{fig:dmr} \end{figure} \begin{figure}[!ht] \centerline{\includegraphics[scale=1]{radius.jpg}}%{system.file("extdata","radius.jpg",package="ChAMP")}} \caption{This image shows the radius size for each feature.} \label{fig:radius} \end{figure} Next, operating solely on the significant probes in the dataset, an appropriately-sized lasso is thrown around each probe; if the lasso captures a user-specified number of significant probes (including itself), that probe (and the probes captured by its lasso) is set aside, in essence, a ''mini-DMR''. Next, Probe Lasso DMR Hunter attempts to close-up gaps between neighbouring mini-DMRs whose lasso boundaries are either overlapping or less than a user-specified distance apart (e.g., 1000bp), effectively defining a ''final DMR''. The coordinates for each DMR are calculated by throwing an appropriately-sized lasso around each probe in the DMR and taking the minimum and maximum genomic coordinates. Finally, Probe Lasso DMR Hunter pulls back all probes within the DMR coordinates and returns them as data frame containing genomic annotation and association statistics of each MVP. Additionally, a p-value for each DMR is calculated using Stouffler's method. This method combines the p-values of individual probes within a DMR by weighting them according to the underlying correlation structure of methylation scores between probes; p-values of probes whose methylation scores are correlated are down-weighted, while independent probes are not penalised. The concept of the probe lasso is visualized as a cartoon in Figure \ref{fig:lasso}. \begin{figure}[!ht] \centerline{\includegraphics[scale=1]{lasso.jpg}}%{system.file("extdata","lasso.jpg",package="ChAMP")}} \caption{The details on calling a DMR. This shows the application of feature/CGI relation-specific lassos to the dataset. In the above example, six probes are 'successful' in capturing the minimum number of probes (e.g., 3) in their lasso; however, because there is less than (the user-specified) 1kb separating the first and last 'successful' probes, the DMR boundaries are defined between these (+/- their lassos!). Note that 'unsuccessful' probes are now included in the DMR. } \label{fig:lasso} \end{figure} \subsubsection{SNP filtering} SNPs can be filtered from the DMR hunter using data obtained from the 1000 Genomes Project. This data includes SNP information for 4 populations (European, American, Asian and African). \citep{1000genome} \subsubsection{Output} %DMR output file \begin{figure}[!ht] \centerline{\includegraphics[scale=1]{DMRoutput.jpg}}%{system.file("extdata","DMRoutput.jpg",package="ChAMP")}} \caption{This is an example of the champ.lasso() output that is saved if the dataset has produced a DMR list. The columns show the probeID, adjusted p-value, chromosome, map info, chromosome arm, feature relation, SNP allele frequency for the forward strand for the selected population (pol.af.f), SNP allele frequency for the reverse strand for the selected population (pol.af.r), the distance in base pairs to the next nearest significant probe (nrst.probe), the size in base pairs of the lasso used to capture this DMR (lasso.radii), the DMR number (dmr.no), the DMR start location (dmr.start), the DMR end location (dmr.end), the DMR size (dmr.size) and the p-value of significance for the DMR (dmr.p)} \label{fig:DMRoutput} \end{figure} %bedfile \subsubsection{Usage} <>= lasso=champ.lasso(fromFile=TRUE, limma=limma, image=FALSE,bedFile=FALSE) if(!is.null(lasso)) { head(lasso) } @ \subsection{Copy Number Aberrations} This function uses the HumanMethylation450 data to identify copy number aberrations \citep{Feber13}. The function utilises the intensity values for each probe to count copy number and determine if copy number aberrations are present. Copy number is determined using the \Rpackage{CopyNumber} package. \subsubsection{Procedure} Intensity values are quantile normalized and the user has the option to also run the ComBat normalization for correction of batch effects related to the slide before the copy number is calculated. Feber \citep{Feber13} compared the results obtained using this method to copy number data from Illumina CytoSNP array and Affymetrix SNP 6.0 arrays and found that using this method for 450k data was effective in identifying regions of gain and loss. \begin{figure}[!ht] \centerline{\includegraphics[scale = 0.55]{CNAtext.jpg}}%{system.file("extdata","CNAtext.jpg",package="ChAMP")}} \caption{This is an example of the champ.CNA() output that is saved for each sample. The columns show the sample name, chromosome, segment start, segment end, the number of probes in the segment and the segment mean. Typically 0.3 is used as a cut-off for the segment mean to call gains and losses.} \label{fig:cnatext} \end{figure} \begin{figure}[!ht] \centerline{\includegraphics[scale = 0.55]{CNAimage.jpg}}%{system.file("extdata","CNAimage.jpg",package="ChAMP")}} \caption{This image shows all the data for a single sample. Chromosomes are shown alternating green and black.} \label{fig:cna} \end{figure} \subsubsection{Usage} <>= CNA=champ.CNA() @ \section{Further analysis} Additional options for downstream analysis are available using other packages. It is recommended that DMRs be investigated using pathway analysis software and Encode data. In addition data from previously published 450k datasets can be downloaded using \Rpackage{marmal-aid} and combined with study data for meta-analysis. \bibliographystyle{natbib} \bibliography{450k} \end{document}