%\VignetteIndexEntry{CopyNumber450k User's Guide} %\VignetteDepends{minfiData} %\VignetteDepends{CopyNumber450k} %\VignetteDepends{CopyNumber450kData} %\VignettePackage{CopyNumber450k} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{\textsf{#1}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \documentclass[12pt]{article} \usepackage[utf8]{inputenc} \usepackage{authblk} \title{The CopyNumber450k User's Guide:\\Calling CNVs from Illumina 450k Methylation Arrays} \author[1,3]{Simon Papillon-Cavanagh} \author[2]{Jean-Philippe Fortin} \author[3]{Nicolas De Jay} \affil[1]{McGill University and Genome Quebec Innovation Centre, Montreal, QC, Canada} \affil[2]{Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA} \affil[3]{Department of Human Genetics, McGill University, Montreal, QC, Canada} \SweaveOpts{highlight=TRUE, tidy=TRUE, keep.space=TRUE, keep.blank.space=FALSE, keep.comment=TRUE} <>= options(keep.source=TRUE) @ \begin{document} \maketitle \tableofcontents \section{Introduction} The \Rpackage{CopyNumber450k} package provides tools for calling copy number variantions (CNVs) from Illumina 450k methylation arrays. \Rpackage{CopyNumber450k} uses the sum of methylated and unmethylated alleles as a proxy of DNA molecule abundance. Aberrant intensities is used as an indication of putative CNVs. \Rpackage{CopyNumber450k} provides a set of user-friendly methods to infer, plot and assess statistical significance of CNVs, as described below. \subsection{Loading libraries} <>= library(CopyNumber450k) library(CopyNumber450kData) library(minfiData) @ \section{Example Dataset} The starting point of \Rpackage{CopyNumber450k} is an \Rcode{RGChannelSet}: an object that must be created with the \Rpackage{minfi} package directly from the raw IDAT files. To learn how to create such an object, please refer to the detailed vignette of the \Rpackage{minfi} package. The \Rcode{RGChannelSet} object is read into a \Rcode{CNV450kSet} object, which internally extracts the needed data and annotation. We strongly suggest that the reader consults \Rpackage{minfi}'s vignette on how to load data and to use quality control plots to ensure that the data is of acceptable quality \cite{minfi}. \subsection{Data requirements} In order to construct a \Rcode{CNV450kSet} object, the input sample set must contain at least three samples whose associated \Rcode{Sample\_Group} phenotypic data value is \Rcode{control}. A sample whose \Rcode{Sample\_Group} value is not \Rcode{control} is considered to be a case sample and will be analyzed and segmented. In order to facillitate the use of this package, we provide 52 control samples in the \Rpackage{CopyNumber450kData} R/Bioconductor package. We strongly suggest that you use those controls if you do not have sufficient control samples in your data set. <>= # Load control data (n=52) from CopyNumber450kData data(RGcontrolSetEx) # Load example data (n=6) from minfiData data(RGsetEx) # In order to reduce example time, let's use only one sample # and 30 controls (instead of 52). In real life situations, it is advised to # use all the available controls. RGsetEx <- RGsetEx[, 5] RGcontrolSetEx <- RGcontrolSetEx[, sample(1:ncol(RGcontrolSetEx), 30)] @ \subsection{Creating the object} Most users will directly load their \Rcode{RGChannelSet} using the \Rcode{CNV450kSet(RGset)} method. However, the \Rpackage{minfi} package provides the \Rcode{combine} function, a helper method which merges two \Rcode{RGChannelSet}s into a single object. This method may be used when using the \Rpackage{CopyNumber450kData} control set in order to combine it with the user's data set. It is important to ensure that samples have been assigned to relevant groups (i.e. control samples have \Rcode{control} as \Rcode{Sample\_Group}). Note that the provided control samples in \Rpackage{CopyNumber450kData} are already properly annotated. Please note that in the following code example we will use only a subset of probes to proceed. This is to be avoided in real life cases as it will significantly reduce the resolution and will output unrealistic results by artificially smoothing the segmentation. <>= # Ensure that Sample_Group values are valid head(pData(RGcontrolSetEx)$Sample_Group) head(pData(RGsetEx)$Sample_Group) # Combine both RGsets in a single RGset RGset <- combine(RGcontrolSetEx, RGsetEx) # Create the object mcds <- CNV450kSet(RGset) # In order to speed up example computation, we will randomly subset the # probes used by CopyNumber450k. THIS SHOULD NEVER BE DONE AS IT SERVES # ONLY FOR SPEEDING UP THE EXAMPLE. mcds <- mcds[sample(1:nrow(mcds), 10000), ] @ \subsection{SNP probes} Single nucleotide polymorphisms (SNPs) affect probe hydridization affinity and may therefore artificially skew intensity levels. Hence, it is useful to drop the probes that contain or target SNPs. Although \Rpackage{minfi} offers a function to drop SNP probes, it does so only for probes that directly target SNP CpGs. We recommend that the \Rpackage{CopyNumber450k} \Rcode{dropSNPprobes} method be used with our package. Since \Rcode{dropSNPprobes} drops probes that both target and contain known SNPs, it leads to a more representative normalization and segmentation output. <>= mcds <- dropSNPprobes(mcds, maf_threshold=0.01) @ \section{Data Visualization and Quality Assessment} \Rpackage{CopyNumber450k} offers quality assessment plots to identify potential batch and normalization effects. \subsection{Density Plot} The density plot allows the user to observe differences in density distributions between samples. Certain effects, such as array position on the chip, can be clearly observed using the density plot (Figure ~\ref{fig:densityplot}). Additionnaly, we offer coloring features which can be based on different sample phenotypic features. Please refer to the method documentation for all possible groupings. \begin{figure} \begin{center} <>= plotDensity(mcds, main="Pre-normalization density plot", color.by="array.row") @ \end{center} \caption{Global sample intensity by array position} \label{fig:densityplot} \end{figure} \subsection{Principal Component Analysis (PCA) Plot} A PCA plotting method, with the same coloring features as \Rcode{plotDensity} is provided (Figure~\ref{fig:PCAplot}). This method may allow users to identify confounding factors influencing data variability, such a chip bias or batch effects. \begin{figure} \begin{center} <>= plotPCA(mcds, main="Pre-normalization PCA plot", color.by="sample.group") @ \end{center} \caption{PCA plot of sample intensities} \label{fig:PCAplot} \end{figure} \section{Normalization} Currently, \Rpackage{CopyNumber450k} offers two normalization procedures: quantile normalization \cite{Bolstad:2003} and functional normalization \cite{Fortin:2013}. Underlying quantile normalization is the assumption that the distributions of signal intensities are similar across subjects\cite{Bolstad:2003}. Therefore, we suggest that this method be used for datasets in which no global differences in CNVs across samples are expected. \subsection{Quantile normalization} The quantile normalization will match all the samples distribution and mostly remove any batch effects \cite{Bolstad:2003}. However, in samples with gross chromosomal aberrations, the signal may be lost. We suggest that you use the quantile normalization in samples where you expect small aberrations. <>= mcds.q <- normalize(mcds, "quantile") @ \subsection{Functional normalization} In situations where large-scale differences are expected (e.g. in rare cases of polyploidy), we recommend the functional normalization, a new method developed specifically for the 450k array in which the similarity of distributions between samples is not assumed \cite{Fortin:2013}. In this sense functional normalization is more conservative than quantile normalization. \begin{figure} \begin{center} <>= mcds.f <- normalize(mcds, "functional") plotDensity(mcds.f, main="Density plot of functional normalized data") @ \end{center} \caption{Density plot of sample intensities after functional normalization} \label{fig:funnormDensity} \end{figure} Compared to the previous \Rcode{densityPlot} (Figure~\ref{fig:densityplot}), post-normalization density distributions are much more similar to one another and the array position bias is removed (Figure~\ref{fig:funnormDensity}). \section{Segmentation} After normalization, it is necessary to bin probes together in order to identify possibly amplified or deleted genomic segments. We use a binary circular segmentation algorithm provided by the \Rpackage{DNAcopy} package \cite{dnacopy}. After segmentation, each segment is further processed by comparing its median intensity to a distribution inferred from the control samples. This allows us to assess whether the segment intensity value is attributable to random variability or if it is significantly different from what would be expected by chance. Finally, each segment is annotated with the genes it contains. For simplicity, let us assume that the assayed sample does not contain gross aberrations and proceed with the object that has been normalized with the quantile normalization (\Rcode{mcds.q}). <>= mcds.q <- segmentize(mcds.q) @ \section{Results} Provided that the segmentation has been performed, calling \Rcode{getSegments} on the object returns a list containing the genomic segments in which each row represents a genomic segment found in a sample. \subsection{Text format} The segments can be saved in a \Rcode{CSV} format, by using the common \Rcode{write.csv} function that was redefined to \Rcode{CNV450kSet} objects. <>= write.csv(mcds.q, file="segments.csv") @ \subsection{A plot of the sample's genome} In order to obtain a visual representation of the sample's genome, \Rpackage{CopyNumber450k} offers the \Rcode{plotSample} function, which takes the segmentized \Rcode{CNV450kSet} object and the index of the sample to plot as input. In this example, since we randomly subsetted the probes, the resolution will be too low to find any interesting results. (Figure~\ref{fig:plotSample}). \begin{figure} \begin{center} <>= plotSample(mcds.q, 1, main="Genomic view of Sample 1") @ \end{center} \caption{Genomic view of a sample} \label{fig:plotSample} \end{figure} The user may specify a region of interest and plot only this region to observe chromosomal breakpoints, as illustrated by the focal amplification on chr1 in the assayed sample (Figure~\ref{fig:plotSamplefocal}). \begin{figure} \begin{center} <>= plotSample(mcds.q, 1, chr="chr1", main="Zoom of chr1", ylim=c(-.25,.25)) @ \end{center} \label{fig:plotSamplefocal} \end{figure} \section{SessionInfo} <>= toLatex(sessionInfo()) @ \bibliographystyle{plain} \bibliography{biblio} \end{document}