%\VignetteIndexEntry{The \Rpackage{wateRmelon} Package} %\VignetteKeywords{Illumina DNA methylation 450k array normalization normalisation preformance } %\VignettePackage{wateRmelon} % Sweave Vignette Template derived bt Leo Schalkwyk from %% http://www.bioconductor.org/help/course-materials/2010/AdvancedR/BuildPackage.pdf \documentclass[11pt]{article} \usepackage{Sweave} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \title{The \Rpackage{wateRmelon} Package} \author{Chloe CY Wong, Ruth Pidsley and Leonard C Schalkwyk} \begin{document} \maketitle \section{ About the package } The \Rpackage{wateRmelon} package is designed to make it convenient to use the data quality metrics and normalization methods from our paper [1] as part of existing pipelines or work flows, and so as much as possible we have implemented S4 methods for \Rclass{MethyLumiSet} objects (\Rpackage{methylumi} package), \Rclass{MethylSet} and \Rclass{RGChannelSet} objects (\Rpackage{minfi} package) and \Rclass{exprmethy450} objects (\Rpackage{IMA} package).\\ In addition to our own functions, the package also contains functions by Matthieu Defrance [2] and Nizar Touleimat [3] and Andrew Teschendorff [4] as well as a wrapper for the SWAN method [5]. \section{Installation} Because it is designed to work with several Bioconductor packages it unavoidably has many dependencies from CRAN as well as Bioconductor. In principle install.packages() installs dependencies automatically, but if there are problems you can install them by hand using the following commands: <>= install.packages('ROCR', 'matrixStats') if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install( 'limma', 'minfi', 'IlluminaHumanMethylation450kmanifest', 'methylumi', 'lumi') @ Installing the latest package from a local copy (assuming it is in the current working directory of your R session): <>= install.packages('wateRmelon_0.9.9.tar.gz', repos=NULL, type='source') @ \section{Trying it out} The package contains a small subset of 450K array data which can be used to explore functions quickly; the \Robject{melon} data set for example is a \Rclass{MethyLumiSet} with 12 samples but only 3363 features: %<>= %library( 'wateRmelon' ) %@ % data ( package='wateRmelon' ) <>= library('wateRmelon') # load in melon dataset data (melon) # display dimensions of data matrix dim(melon) # quality filter using default thresholds melon.pf<-pfilter(melon) # preprocess using our best method melon.dasen.pf <- dasen(melon.pf) @ \section{Our performance metrics} We have taken advantage of known DNA methylation patterns associated with genomic imprinting and X-chromosome inactivation (XCI), in addition to the performance of SNP genotyping assays present on the array, to derive three independent metrics which we use to test alternative schemes of correction and normalization. These metrics also have potential utility as quality scores for datasets. All of them are expressed in such a way that lower values indicate better performance (i.e. better predicted ability to detect real methylation differences between samples). \subsection{Genomic imprinting} This is based on the hemi-methylation of genomic imprinting differentially methylated regions (DMRs), and is a standard-error-like measure of dispersion(SE). <>= # calculate iDMR metrics on QC'd betas dmrse_row(melon.pf) # calculate iDMR metrics on QC'd and preprocessed betas dmrse_row(melon.dasen.pf) # slightly lower (better) standard errors @ \subsection{SNP genotypes} A very simple genotype calling by one-dimensional K-means clustering is performed on each SNP, and for those SNPs where there are three genotypes represented the squared deviations are summed for each genotype (similar to a standard deviation for each of allele A homozygote, AB heterozygote and allele B homozygote). By default these are further divided by the square root of the number of samples to get a standard error-like statistic. <>= # calculate SNP metrics on QC'd betas genki(melon.pf) # calculate SNP metrics on QC'd and preprocessed betas genki(melon.dasen.pf) # slightly lower (better) standard errors @ \subsection{X-chromosome inactivation} This is based on the male-female difference in DNA methylation, almost all of which is due to hypermethylation of the inactive X in females. This difference is thus a good predictor of X-chromosome location for probes, and this can be used for a Receiver Operating Characteristic (ROC) curve analysis. We report 1-(area under curve) (AUC) so that smaller values indicate better performance, just as in our other two metrics. This requires the samples to be of known sex (and not all the same) and chromosome assignments for all probes. It takes more time to calculate than the other two metrics. Note that the roc and the auk are both sea birds. <>= # calculate X-chromosome metrics on QC'd betas seabi(melon.pf, sex=pData(melon.pf)$sex, X=fData(melon.pf)$CHR=='X') # calculate X-chromosome metrics on QC'd and preprocessed betas seabi(melon.dasen.pf, sex=pData(melon.dasen.pf)$sex, X=fData(melon.dasen.pf)$CHR=='X' ) @ \section{Suggested analysis workflow} \subsection{Load data} You can use a variety of methods to load your data, either from GenomeStudio final report text files or from iDAT files. \Rpackage{methylumi} and \Rpackage{IMA} can read text files, we recommend \Rpackage{methylumi} because the \Robject{exprmethy450} object only stores betas and not raw intensities. \Rpackage{methylumi} and \Rpackage{minfi} can both read iDAT files, and produce objects that can be used by our functions. Neither contains the full annotation that comes inside the final report text file. If you use the GenomeStudio file we recommend saving the unnormalized, uncorrected version of the data. We also recommend keeping the barcode names (SentrixID\_RnnCnn) as the column headers or in a separate dataframe. <>= library(methylumi) melon <- methyLumiR('finalreport.txt') @ \subsection{Reading IDAT files in \Rpackage{wateRmelon}} Due to the release of the EPIC micro-array and the \Rpackage{methylumi} package not being updated in a long time, we have provided our own IDAT reader that will convert idat files into \Robject{MethyLumiSet} objects. The \Rfunction{readEPIC} function is very similar to the \Rpackage{methylumi} idat reader but can handle the EPIC array. In addition to this, \Rfunction{readEPIC} can recursively search a specified file path for idat files and parse them together. <>= mlumi <- readEPIC('path/to/directory') @ \subsection{Tidying data} Normalization only works well at cleaning up minor distributional differences between samples. Failed or otherwise atypical samples should be filtered out beforehand. Also if you have different tissues or similar drastic divides in your data it may not be optimal to normalize everything together. Visualization of raw intensities is a good way of identifying grossly atypical (failed) samples. <>= boxplot(log(methylated(melon)), las=2, cex.axis=0.8 ) @ <>= boxplot(log(unmethylated(melon)), las=2, cex.axis=0.8 ) @ Additionally one can use the \Rfunction{outlyx} and \Rfunction{bscon} functions to easily check data quality. The outlyx function takes any beta matrix (preferably raw) and will identify any samples that are inconsistent with the rest of the data, from the plot we can observe that any data points that fall into the red squares are indeed outlying and should be removed from analysis. If performing this on large data-sets it is possible to take a random subset of probes and run outlyx on that to assess a general idea. <>= outlyx(melon) # Can take some time on large data-sets @ Another way to check data quality is to assess how well the bisulfite conversion had gone when preparing the DNA. There are numerous control probes that assess this and the bscon function converts the intensities from these probes into an easy to understand percentage. For general purposes removing samples with less than 80\% bisulfite conversion is usually appropriate for most analysis, but you can be more stringent if you like! <>= bsc <- bscon(melon) hist(bsc) @ Filtering by detection p-value provides a straightforward approach for removing both failed samples and probes. The \Rfunction{pfilter} function conveniently discards samples with more than (by default ) 1\% of probes above the .05 detection p-value threshold, and probes with any samples with beadcount under 3 or more than 1\% above the p-value threshold. <>= melon.pf <- pfilter(melon) @ It has come to our attention that data read in using the various packages and input methods will give subtly variable data output as they calculate detection p-value and beta values differently, and do/don?t give information about beadcount. The pfilter function does not correct for this, but simply uses the detection p-value and bead count provided by each package. \subsection{Normalize and calculate betas} In our analysis[1] we tested 15 preprocessing and normalization methods. This involved processing 10 data sets 15 different ways and calculating three metrics of performance from each one. You can do the same thing with your data if you like, but our recommended method \Rfunction{dasen} will work well for most data sets. If you suspect varying dye bias (if you have arrays scanned on different instruments, for example), you might want to try \Rfunction{nanes}. <>= melon.dasen.pf <- dasen(melon.pf) @ \section{Additional Functionality} We have recently updated the \Rpackage{wateRmelon} package to facilitate a variety of common analyses that are used in EWAS. These include age prediction, cell-type composition estimation and a suite of quality-control tools that we feel are useful for analysis. \subsection{Age Prediction} To perform age prediction within wateRmelon we provide a simple function that will perform the linear function using Horvath's coefficients. These are supplied by default and can be supplied with a different set by manually specifying the \Rfunction{coef} argument. <>= data(coef) agep(melon.dasen.pf, coeff= coef, method='horvath') @ \subsection{Cell Type Proportion Estimation} Another useful tool for EWAS is estimating cell type proportions, wateRmelon provides a \Robject{MethyLumiSet} method for those who prefer using these objects. The two functions are mostly identical, however there is a small distinction between to two. In wateRmelon we do not normalise the biological data set and the reference data-set together and instead impose the quanitles of the data-set onto the reference data-set then perform the deconvolution step. This approach allows us to one - provide prenormalised data and secondly exploit the nature of normalised data (where each sample has identical quantiles) and impose the normalised quantiles onto the reference dataset. This is beneficial for particularly large data-sets where the sample numbers are in the thousands and the normalising the refernece and biological datasets does not have that much effect on the resultant quantiles. Currently no EPIC reference datasets exist so any data from HumanMethylationEPIC microarrays will be converted to match the 450K reference dataset. <>= # Code will not work with melon as melon only has a subset of probes estimateCellCounts.wmln(melon.dasen.pf) @ \subsection{Assess Normalisation Violence - \Rfunction{qual}} The \Rfunction{qual} function seeks to explore a facet of analysis that has not been explored in EWAS extensively and that is the potentially source of confounding introduced by the normalisation method used. The \Rfunction{qual} function seeks to quantify how much a sample has changed during normalisation and will relay it back. The idea behind this is that a sample that has changed drastically from normalisation may have had some nascent problems with it. As a result if you suspect a sample has changed too much during normalisation the \Rfunction{qual} function will identify this. If you do remove samples using \Rfunction{qual}, make sure to renormalise the data afterwards! <>= normv <- qual(betas(melon.dasen.pf), betas(melon.pf)) plot(normv[,1:2]) @ \subsection{Remove MAF/SNP heterozygotes before analysis} The last function we will discuss is pwod (probe wise outlier detection). This tool rather simply scans a betas matrix and removes any highly outlying signals (4 IQRs away). These signals are likely caused my MAFs and or SNP heterozygotes within your sample population and may be in probes that are not in the SNP lists/cross reactive probe lists that you may want to remove from the data prior to statistical testing. <>= melon.pwod.dasen.pf <- pwod(betas(melon.dasen.pf)) @ \subsection{Example workflow: method for MethyLumiSet} <>= data(melon) # load in melon dataset # Optional read-in data using: # melon <- readEPIC('path/to/idats') outliers <- outlyx(melon, plot=F) sum(outliers$out) sum(bscon(melon)<85) # Check for outliers melon.pf<-pfilter(melon) # perform QC on raw data matrix using default thresholds melon.dasen.pf<-dasen(melon.pf) # preprocess using our best method qual <- qual(betas(melon.dasen.pf), betas(melon.pf)) # Check for bad samples (again) sex <- pData(melon.dasen.pf)$sex # extract phenotypic information for test bet<-betas(melon.dasen.pf) # extract processed beta values melon.sextest<-sextest(bet,sex) # run t-test to idenitify sex difference agep(melon.dasen.pf) # Check ages (see if they match up e.t.c) melon.pwod <- pwod(bet) # Clean up data for statistical testing @ \subsection{Further analysis} We can't offer much help with the actual analysis, which will be different for every experiment. In general though, you need to get your experimental variables into the same order as your arrays and apply some kind of statistical test to each row of the table of betas. In the workflow shown here, the array barcodes are preserved and can be retrieved with \Rfunction{sampleNames} or \Rfunction{colnames}. It's convenient to read in a samplesheet, for example in csv format, with the barcodes as row names. \section{References} [1] Pidsley R, Wong CCY, Volta M, Lunnon K, Mill J, Schalkwyk LC: A data-driven approach to preprocessing Illumina 450K methylation array data (submitted) [2] Dedeurwaerder S, Defrance M, Calonne E, Sotiriou C, Fuks F: Evaluation of the Infinium Methylation 450K technology . Epigenetics 2011, 3(6):771-784. [3] Touleimat N, Tost J: Complete pipeline for Infinium R Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics 2012, 4:325-341) [4]Teschendorff AE, Marabita F, Lechner M, Bartlett T, Tegner J, Gomez-Cabrero D, Beck S. A Beta-Mixture Quantile Normalisation method for correcting probe design bias in Illumina Infinium 450k DNA methylation data. Bioinformatics. 2012 Nov 21. [5] Maksimovic J, Gordon L, Oshlack A: SWAN: Subset quantile Within-Array Normalization for Illumina Infinium HumanMethylation450 BeadChips. Genome biology 2012, 13(6):R44 \end{document} % R CMD Sweave waternette.rnw % R CMD texi2dvi --pdf --clean waternette.tex % R CMD Stangle waternette.rnw