%\VignetteIndexEntry{The \Rpackage{wateRmelon} Package}
%\VignetteKeywords{Illumina DNA methylation 450k array normalization normalisation preformance }
%\VignettePackage{wateRmelon}
% Sweave Vignette Template derived bt Leo Schalkwyk from
%% http://www.bioconductor.org/help/course-materials/2010/AdvancedR/BuildPackage.pdf
\documentclass[11pt]{article}
\usepackage{Sweave}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\title{The \Rpackage{wateRmelon} Package}
\author{Chloe CY Wong, Ruth Pidsley and Leonard C Schalkwyk}
\begin{document}
\maketitle
\section{ About the package }
The \Rpackage{wateRmelon} package is designed to make it convenient to use
the data quality metrics and normalization methods from our paper [1] as part
of existing pipelines or work flows, and so as much as possible we have
implemented S4 methods for \Rclass{MethyLumiSet} objects (\Rpackage{methylumi}
 package), \Rclass{MethylSet} and \Rclass{RGChannelSet} objects (\Rpackage{minfi} package) and \Rclass{exprmethy450} objects (\Rpackage{IMA}
package).\\

In addition to our own functions, the package also contains functions by
Matthieu Defrance [2] and Nizar Touleimat [3] and Andrew Teschendorff [4]
as well as a wrapper for the SWAN method [5].

\section{Installation}
Because it is designed to work with several Bioconductor packages it unavoidably
has many dependencies from CRAN as well as Bioconductor. In principle
install.packages() installs dependencies automatically, but if there are problems
you can install them by hand using the following commands:

<<UnevaluatedCode, eval=FALSE>>=
install.packages('ROCR', 'matrixStats')
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install( 'limma', 'minfi',
   'IlluminaHumanMethylation450kmanifest',
   'methylumi', 'lumi')
@


Installing the latest package from a local copy (assuming it is in the current working
directory of your R session):

<<UnevaluatedCode, eval=FALSE>>=
install.packages('wateRmelon_0.9.9.tar.gz', repos=NULL, type='source')
@

\section{Trying it out}
The package contains a small subset of 450K array data which can be used to
explore functions quickly; the \Robject{melon} data set for example is a \Rclass{MethyLumiSet}
with 12 samples but only 3363 features:

%<<UnseenMessages, results=hide>>=
%library( 'wateRmelon' )
%@
% data ( package='wateRmelon' )


<<code-block, keep.source=TRUE>>=
library('wateRmelon')
# load in melon dataset
data (melon)
# display dimensions of data matrix
dim(melon)
# quality filter using default thresholds
melon.pf<-pfilter(melon)
# preprocess using our best method
melon.dasen.pf <- dasen(melon.pf)
@

\section{Our performance metrics}
We have taken advantage of known DNA methylation patterns associated with genomic
imprinting and X-chromosome inactivation (XCI), in addition to the performance of
SNP genotyping assays present on the array, to derive three independent metrics
which we use to test alternative schemes of correction and normalization.
These metrics also have potential utility as quality scores for datasets. All
of them are expressed in such a way that lower values indicate better performance
(i.e. better predicted ability to detect real methylation differences between samples).

\subsection{Genomic imprinting}
This is based on the hemi-methylation of genomic imprinting differentially methylated
regions (DMRs), and is a standard-error-like measure of dispersion(SE).

<<dmrse, keep.source=TRUE>>=
# calculate iDMR metrics on QC'd betas
dmrse_row(melon.pf)
# calculate iDMR metrics on QC'd and preprocessed betas
dmrse_row(melon.dasen.pf)
# slightly lower (better) standard errors
@

\subsection{SNP genotypes}
A very simple genotype calling by one-dimensional K-means clustering is
performed on each SNP, and for those SNPs where there are three genotypes
represented the squared deviations are summed for each genotype (similar
to a standard deviation for each of allele A homozygote, AB heterozygote and
allele B homozygote).  By default these are further divided by the square root
of the number of samples to get a standard error-like statistic.

<<genki, keep.source=TRUE>>=
# calculate SNP metrics on QC'd betas
genki(melon.pf)
# calculate SNP metrics on QC'd and preprocessed betas
genki(melon.dasen.pf)
# slightly lower (better) standard errors
@


\subsection{X-chromosome inactivation}
This is based on the male-female difference in DNA methylation, almost all of which is
due to hypermethylation of the inactive X in females. This difference is thus a good
predictor of X-chromosome location for probes, and this can be used for a Receiver
Operating Characteristic (ROC) curve analysis.  We report 1-(area under curve) (AUC)
so that smaller values indicate better performance, just as in our other two metrics. This requires the samples
to be of known sex (and not all the same) and chromosome assignments for all probes. It
takes more time to calculate than the other two metrics. Note that the roc and
the auk are both sea birds.


<<seabi, keep.source=TRUE>>=
# calculate X-chromosome metrics on QC'd betas
seabi(melon.pf, sex=pData(melon.pf)$sex, X=fData(melon.pf)$CHR=='X')
# calculate X-chromosome metrics on QC'd and preprocessed betas
seabi(melon.dasen.pf,
   sex=pData(melon.dasen.pf)$sex,
   X=fData(melon.dasen.pf)$CHR=='X'
)
@



\section{Suggested analysis workflow}

\subsection{Load data}
You can use a variety of methods to load your data,
either from GenomeStudio final report text files
or from iDAT files.  \Rpackage{methylumi} and
\Rpackage{IMA} can read text files, we recommend
\Rpackage{methylumi} because the \Robject{exprmethy450}
object only stores betas and not raw intensities.
\Rpackage{methylumi} and \Rpackage{minfi} can both read
iDAT files, and produce objects that can be used by
our functions.  Neither contains the full annotation
that comes inside the final report text file.  If you
use the GenomeStudio file we recommend saving the
unnormalized, uncorrected version of the data. We
also recommend keeping the barcode names
(SentrixID\_RnnCnn) as the column headers or in a separate
dataframe.

<<UnevaluatedCode, eval=FALSE>>=
library(methylumi)
melon <- methyLumiR('finalreport.txt')
@

\subsection{Reading IDAT files in \Rpackage{wateRmelon}}
Due to the release of the EPIC micro-array and the \Rpackage{methylumi} package not being updated in a long time, we have provided our own IDAT reader that will convert idat files into \Robject{MethyLumiSet} objects. The \Rfunction{readEPIC} function is very similar to the \Rpackage{methylumi} idat reader but can handle the EPIC array. In addition to this, \Rfunction{readEPIC} can recursively search a specified file path for idat files and parse them together.

<<UnevaluatedCodeecc, eval=FALSE>>=
mlumi <- readEPIC('path/to/directory')
@


\subsection{Tidying data}
Normalization only works well at cleaning up minor
distributional differences between samples. Failed
or otherwise atypical samples should be filtered out
beforehand.  Also if you have different tissues
or similar drastic divides in your data it may not be optimal
to normalize everything together.

Visualization of raw intensities is a good way of
identifying grossly atypical (failed) samples.
<<IncludeGraphic, fig=TRUE>>=
boxplot(log(methylated(melon)), las=2, cex.axis=0.8 )
@
<<IncludeGraphic, fig=TRUE>>=
boxplot(log(unmethylated(melon)), las=2, cex.axis=0.8 )
@

Additionally one can use the \Rfunction{outlyx} and \Rfunction{bscon} functions to easily check data quality. The outlyx function takes any beta matrix (preferably raw) and will identify any samples that are inconsistent with the rest of the data, from the plot we can observe that any data points that fall into the red squares are indeed outlying and should be removed from analysis.
If performing this on large data-sets it is possible to take a random subset of probes and run outlyx on that to assess a general idea.
<<outlyx, fig=TRUE>>=
outlyx(melon) # Can take some time on large data-sets
@


Another way to check data quality is to assess how well the bisulfite conversion had gone when preparing the DNA. There are numerous control probes that assess this and the bscon function converts the intensities from these probes into an easy to understand percentage. For general purposes removing samples with less than 80\% bisulfite conversion is usually appropriate for most analysis, but you can be more stringent if you like!
<<bscon, fig=TRUE>>=
bsc <- bscon(melon)
hist(bsc)
@

Filtering by detection p-value provides a straightforward
approach for removing both failed samples and probes.
The \Rfunction{pfilter} function
conveniently discards samples with more than (by default ) 1\% of
probes above the .05 detection p-value threshold, and probes with any samples
with beadcount under 3 or more than 1\% above the p-value threshold.

<<pfilter, keep.source=TRUE>>=
melon.pf <- pfilter(melon)
@

It has come to our attention that data read in using the various packages and
input methods will give subtly variable data output as they
calculate detection p-value and beta values differently, and do/don?t give
information about beadcount.  The pfilter function does not correct
for this, but simply uses the detection p-value and bead count
provided by each package.

\subsection{Normalize and calculate betas}

In our analysis[1] we tested 15 preprocessing and normalization methods.
This involved processing 10 data sets 15 different ways and calculating three
metrics of performance from each one.  You can do the same thing with
your data if you like, but our recommended method \Rfunction{dasen}
will work well for most data sets.  If you suspect varying dye bias (if
you have arrays scanned on different instruments, for example), you might
want to try \Rfunction{nanes}.

<<dasen, keep.source=TRUE>>=
melon.dasen.pf <- dasen(melon.pf)
@

\section{Additional Functionality}
We have recently updated the \Rpackage{wateRmelon} package to facilitate a variety of common analyses that are used in EWAS. These include age prediction, cell-type composition estimation and a suite of quality-control tools that we feel are useful for analysis.

\subsection{Age Prediction}
To perform age prediction within wateRmelon we provide a simple function that will perform the linear function using Horvath's coefficients. These are supplied by default and can be supplied with a different set by manually specifying the \Rfunction{coef} argument.
<<agep, keep.source=TRUE>>=
data(coef)
agep(melon.dasen.pf, coeff= coef, method='horvath')
@

\subsection{Cell Type Proportion Estimation}
Another useful tool for EWAS is estimating cell type proportions, wateRmelon provides a \Robject{MethyLumiSet} method for those who prefer using these objects. The two functions are mostly identical, however there is a small distinction between to two.

In wateRmelon we do not normalise the biological data set and the reference data-set together and instead impose the quanitles of the data-set onto the reference data-set then perform the deconvolution step. This approach allows us to one - provide prenormalised data and secondly exploit the nature of normalised data (where each sample has identical quantiles) and impose the normalised quantiles onto the reference dataset. This is beneficial for particularly large data-sets where the sample numbers are in the thousands and the normalising the refernece and biological datasets does not have that much effect on the resultant quantiles.

Currently no EPIC reference datasets exist so any data from HumanMethylationEPIC microarrays will be converted to match the 450K reference dataset.
<<UnevaluatedCodeecc, eval=FALSE>>=
# Code will not work with melon as melon only has a subset of probes
estimateCellCounts.wmln(melon.dasen.pf)
@

\subsection{Assess Normalisation Violence - \Rfunction{qual}}
The \Rfunction{qual} function seeks to explore a facet of analysis that has not been explored in EWAS extensively and that is the potentially source of confounding introduced by the normalisation method used. The  \Rfunction{qual} function seeks to quantify how much a sample has changed during normalisation and will relay it back. The idea behind this is that a sample that has changed drastically from normalisation may have had some nascent problems with it. As a result if you suspect a sample has changed too much during normalisation the  \Rfunction{qual} function will identify this. If you do remove samples using  \Rfunction{qual}, make sure to renormalise the data afterwards!

<<qual, keep.source=TRUE, fig=TRUE>>=
normv <- qual(betas(melon.dasen.pf), betas(melon.pf))
plot(normv[,1:2])
@

\subsection{Remove MAF/SNP heterozygotes before analysis}
The last function we will discuss is pwod (probe wise outlier detection). This tool rather simply scans a betas matrix and removes any highly outlying signals (4 IQRs away). These signals are likely caused my MAFs and or SNP heterozygotes within your sample population and may be in probes that are not in the SNP lists/cross reactive probe lists that you may want to remove from the data prior to statistical testing.

<<pwod, keep.source=TRUE>>=
melon.pwod.dasen.pf <- pwod(betas(melon.dasen.pf))
@

\subsection{Example workflow: method for MethyLumiSet}
<<workflow, keep.source=TRUE>>=
data(melon)
# load in melon dataset
# Optional read-in data using:
# melon <- readEPIC('path/to/idats')

outliers <- outlyx(melon, plot=F)
sum(outliers$out)
sum(bscon(melon)<85)
# Check for outliers

melon.pf<-pfilter(melon)
# perform QC on raw data matrix using default thresholds
melon.dasen.pf<-dasen(melon.pf)
# preprocess using our best method

qual <- qual(betas(melon.dasen.pf), betas(melon.pf))
# Check for bad samples (again)

sex  <- pData(melon.dasen.pf)$sex
# extract phenotypic information for test
bet<-betas(melon.dasen.pf)
# extract processed beta values
melon.sextest<-sextest(bet,sex)
# run t-test to idenitify sex difference

agep(melon.dasen.pf)
# Check ages (see if they match up e.t.c)

melon.pwod <- pwod(bet)
# Clean up data for statistical testing
@

\subsection{Further analysis}
We can't offer much help with the actual analysis, which will be different
for every experiment.  In general though, you need to get your experimental
variables into the same order as your arrays and apply some kind of statistical
test to each row of the table of betas.  In the workflow shown here, the
array barcodes are preserved and can be retrieved with \Rfunction{sampleNames}
or \Rfunction{colnames}.  It's convenient to read in a samplesheet, for
example in csv format, with the barcodes as row names.

\section{References}
[1] Pidsley R, Wong CCY, Volta M, Lunnon K, Mill J, Schalkwyk LC:
A data-driven approach to preprocessing Illumina 450K methylation
array data (submitted)

[2] Dedeurwaerder S, Defrance M, Calonne E, Sotiriou C, Fuks F:
Evaluation of the Infinium Methylation 450K technology . Epigenetics
2011, 3(6):771-784.

[3] Touleimat N, Tost J: Complete pipeline for Infinium R Human
Methylation 450K BeadChip data processing using subset quantile
normalization for accurate DNA methylation estimation. Epigenomics
2012, 4:325-341)

[4]Teschendorff AE, Marabita F, Lechner M, Bartlett T, Tegner J, Gomez-Cabrero D,
Beck S. A Beta-Mixture Quantile Normalisation method for correcting probe design
bias in Illumina Infinium 450k DNA methylation data. Bioinformatics. 2012 Nov 21.

[5] Maksimovic J, Gordon L, Oshlack A: SWAN: Subset quantile
Within-Array Normalization for Illumina Infinium HumanMethylation450
BeadChips. Genome biology 2012, 13(6):R44


\end{document}

% R CMD Sweave waternette.rnw
% R CMD texi2dvi --pdf --clean waternette.tex
% R CMD Stangle waternette.rnw