% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteIndexEntry{cn.farms: Manual for the R package} %\VignetteDepends{cn.farms} %\VignettePackage{cn.farms} %\VignetteKeywords{copy number analysis, factor analysis, sparse coding, latent variables, Laplace distribution, EM algorithm, microarray, farms, CNV, copy number variant, sparse overcomplete representation} \documentclass[article]{bioinf} \usepackage{amsmath,amssymb} \usepackage{bm} \usepackage{natbib} \usepackage{hyperref} \usepackage{cnFarmsDefs} \hypersetup{colorlinks=false, pdfborder=0 0 0, pdftitle={cn.FARMS: Manual for the R package}, pdfauthor={Djork-Arn\'e Clevert and Andreas Mitterecker}} \title{cn.FARMS: a latent variable model to detect copy number variations in microarray data with a low false discovery rate \\ \textit{--- Manual for the \Rpackage{cn.farms} package ---}} \author{Djork-Arn\'e Clevert and Andreas Mitterecker} \affiliation{Institute of Bioinformatics, Johannes Kepler University Linz\\Altenberger Str. 69, 4040 Linz, Austria\\ \email{okko@clevert.de and mitterecker@bioinf.jku.at}} \usepackage[noae]{Sweave} \SweaveOpts{eps=FALSE} \begin{document} << echo=FALSE,results=hide>>= options(width=80) set.seed(0) library(cn.farms) farmsVers <- packageDescription("cn.farms")$Version ## toy-data which is used for testing the vignette load(system.file("exampleData/normData.RData", package="cn.farms")) load(system.file("exampleData/slData.RData", package="cn.farms")) notes(experimentData(normData))$annotDir <- system.file("exampleData/annotation/pd.genomewidesnp.6/1.1.0", package="cn.farms") cores <- 1 runtype <- "ff" npData <- slData @ \newcommand{\farmsVers}{\Sexpr{farmsVers}} \manualtitlepage[Version \farmsVers, \today] \newlength{\auxparskip} \setlength{\auxparskip}{\parskip} \setlength{\parskip}{0pt} \tableofcontents \clearpage \setlength{\parskip}{\auxparskip} \section{Introduction} The \Rpackage{cn.farms} package provides a novel copy number variation (CNV) detection method, called ``cn.FARMS'', which is based on our FARMS (``factor analysis for robust microarray summarization'' \citep{Hochreiter:06}) algorithm. FARMS is since 2006 the leading summarization method of the international ``affycomp'' competition if sensitivity and specificity are considered simultaneously. We extended FARMS to cn.FARMS \citep{Clevert:11} for detecting CNVs by moving from mRNA copy numbers to DNA copy numbers.\\ In the following section we will briefly describe the algorithm and provide a quick start guide. For further information regarding the algorithm and its assessment see the \Rpackage{cn.farms} homepage at \href{http://www.bioinf.jku.at/software/cnfarms/cnfarms.html}{http://www.bioinf.jku.at/software/cnfarms/cnfarms.html}. \section{cn.FARMS: FARMS for CNV Detection} \label{sec:farmspipeline} cn.FARMS is described ``in a nutshell'' by the preprocessing pipeline depicted in Figure \ref{fig:preprocess_chain_fig}:\linebreak {\bf (1) Normalization} is performed at two levels. It has as {\em input} the raw probe intensity values and as {\em output} intensity values at chromosome locations which are leveled between arrays and are allele independent. At the {\em first level} normalization methods remove technical variations between arrays arising from differences in sample preparation or labeling, array production (e.g.\ batch effects), or scanning differences. The goal of the first level is to correct for array-wide effects. At the {\em second level} alleles are combined to one intensity value at a chromosome location and a correction for cross-hybridization between allele A and allele B probes is performed. Cross-hybridization arise due to close sequence similarity between the probes of different alleles, therefore a probe of one allele picks up a signal of the other allele. The optional corrections for differences in PCR yield can be performed at this step or after ``single-locus modeling''. \begin{figure}[!b] \begin{center} \includegraphics[angle=0,width=0.99\textwidth]{figures/figure2} \caption{Copy number analysis for (Affymetrix) DNA genotyping arrays as a three-step pipeline: (1) Normalization, (2) Modeling, and (3) Segmentation. Modeling is divided into ``single-locus modeling'' and ``multi-loci modeling'' with ``fragment length correction'' as an optional intermediate step. The cn.FARMS pipeline is: normalization by sparse overcomplete representation, single-locus modeling by FARMS, fragment length correction, and multi-loci modeling by FARMS. \label{fig:preprocess_chain_fig}} \end{center} \end{figure} We propose sparse overcomplete representation in the two-dimensional space of allele A and B intensity to correct for cross-hybridization between allele A and allele B probes. Therefore we do not only estimate the AA and the BB cross-hybridization like CRMA \citep{Bengtsson:08} but also the AB cross-hybridization. The latter takes into account that hybridization and cross-hybridization may be different for the AB genotype, where for both allele probes target fragments are available and compete for hybridization. After allele correction, we follow CRMA and normalize by scaling the probes to a pre-specified mean intensity value. CNV probes which have only one allele are scaled in the same way. \linebreak {\bf(2) Modeling} is also performed at two levels. The {\em input} is the probe intensity values which independently measure the copy number of a specific target fragment or DNA probe locus. The {\em output} is an estimate for the region copy number. At the {\em first level}, ``single-locus modeling'' the probes which measure the same fragment are combined to a raw fragment copy number (``raw'' means that the copy number is still a continuous values) --- see Figure~\ref{fig:meta_probeset_fig}. These raw fragment copy numbers are estimated by FARMS. The original FARMS was designed to summarize probes which target the same mRNA. This can readily transfered to CNV analysis where FARMS now summarizes probes which target the same DNA fragment. Either both strands can be summarized together or separately where our default is the former. \cite{Nannya:05} suggested considering fragment characteristics like sequence patterns and the length because they affect PCR amplification. For example, PCR is usually less efficient for longer fragments, which lead to fewer copies to hybridize and result in weaker probe intensities. Following these suggestions cn.FARMS performs an optional intermediate level to correct for the fragment length and sequence features to make raw fragment copy numbers comparable along the chromosome. At the {\em second level}, ``multi-loci modeling'', the raw copy numbers of neighboring fragments or neighboring DNA probe loci are combined to a ``meta-probe set'' which targets a DNA region. \begin{figure}[t] \begin{center} \includegraphics[angle=0,width= 0.75\columnwidth]{figures/figure1} \caption{The copy number hierarchy probes-fragment-region. Fragment copy numbers serve as meta-probes used for ``multi-loci modeling'' which yields region copy numbers. Inner boxes: The probes which target a fragment (often at a SNP position) are single-locus summarized to a raw copy number of this fragment. Note, that instead of fragments a DNA probe loci can be summarized. Outer box: The raw fragment copy numbers are the meta-probes for a DNA region and are multi-loci summarized to a raw region copy number. \label{fig:meta_probeset_fig}} \end{center} \end{figure} The raw fragment copy numbers from single-locus modeling are now themselves probes for a DNA region as depicted in Figure~\ref{fig:meta_probeset_fig}. Again we use FARMS to summarize meta-probes and to estimate a raw copy number for the region. This modeling across samples is novel as previous methods only model along the chromosome. Multi-loci modeling considerably reduces the false discovery rates, because raw copy numbers of neighboring fragments or neighboring DNA probe loci must agree to each other on the copy number, which reduces the likelihood of a discovery by chance. However, low FDR is traded against high resolution by the window size for multi-loci modeling, i.e.~ by how many raw copy numbers of neighboring fragments or neighboring DNA probe loci are combined. The more loci are combined, the more the FDR is reduced, because more meta-probes must mutually agree on the region's copy number. The window size for multi-loci modeling is a hyperparameter which trades off low FDR against high resolution. We recommend a window size of 5 as default, 3 for high resolution, and 10 for low FDR. Alternatively to a fixed number of CNV or SNP sites, the cn.FARMS software allows defining a window in terms of base pairs. In this case, multi-loci modeling may use a different number of meta-probes at different DNA locations, in particular for less than two meta-probes multi-loci modeling is skipped. Note, however that controlling the FDR is more difficult because a minimal number of meta-probes cannot be assured for each window and modeling with few meta-probes is prone to false discoveries. FARMS supplies an informative/non-informative (I/NI) call \citep{Talloen:07,Talloen:10} which is used to detect CNVs. Additionally, the I/NI value gives the signal-to-noise-ratio of the estimated raw copy number.\linebreak {\bf(3) Segmentation} can afterwards be performed by \Rpackage{DNAcopy}. \section{Getting Started: cn.FARMS} \label{sec:started} The following example was created by a very small subset of the HapMap data set. As usual, it is necessary to load the \Rpackage{cn.farms} package: \begin{Sinput} library(cn.farms) \end{Sinput} \subsection{Quick start : Process SNP 6.0 array} \noindent The \Rpackage{hapmapsnp6} package is loaded for testing purpose. \begin{Sinput} > library("hapmapsnp6") > celDir <- system.file("celFiles", package="hapmapsnp6") > filenames <- dir(path=celDir, full.names=TRUE) \end{Sinput} \noindent Next, the user specifies a working directory on the harddisk whereto save the results. \begin{Sinput} > workDir <- tempdir() > dir.create(workDir, showWarnings=FALSE, recursive=TRUE) > setwd(workDir) \end{Sinput} \noindent For reasons of computational time and memory consumption \Rpackage{cn.farms} supports high-performance computation. The parameter \verb+cores+ specifies number of CPUs requested for the cluster and the parameter \verb+runtype+ indicates how the data matrix should be stored. \verb+runtype="ff"+ creates a transient flat-file which will not be saved automatically. Whereas \verb+runtype="bm"+ creates a persistent flat-file which can be saved permanently. \begin{Sinput} > cores <- 2 > runtype <- "ff" \end{Sinput} \noindent Next, the user specifies a subdirectory whereto save the flat-files. \begin{Sinput} > dir.create("ffObjects/ff", showWarnings=FALSE, recursive=TRUE) > ldPath(file.path(getwd(), "ffObjects")) > options(fftempdir=file.path(ldPath(), "ff")) \end{Sinput} \noindent The directory (\verb+celDir="where/are/my/cel-files"+) which contain the cel-files has to be specified. \begin{Sinput} > celDir <- system.file("celFiles", package="hapmapsnp6") > filenames <- dir(path=celDir, full.names=TRUE) \end{Sinput} \noindent The following step will create the annotation file. \begin{Sinput} > if(exists("annotDir")) { > createAnnotation(filenames=filenames, annotDir=annotDir) > } else { > createAnnotation(filenames=filenames) > } \end{Sinput} \noindent Afterwards, the data will be corrected for cross-hybridization and normalized. \begin{Sinput} > normMethod <- "SOR" > ## normalization of SNP data > if(exists("annotDir")) { > normData <- normalizeCels(filenames, method=normMethod, cores, > alleles=TRUE, annotDir=annotDir, runtype=runtype) > } else { > normData <- normalizeCels(filenames, method=normMethod, cores, > alleles=TRUE, runtype=runtype) > } \end{Sinput} \noindent Now, the normalized data will be summarized at DNA probe loci. \verb+summaryMethod <- "Variational"+ indicates which FARMS approach should be used and \verb+summaryParam$cyc <- c(10, 10)+ specifies the number of iterations of the EM-algorithm. The parameter \verb+summaryWindow+ indicates whether DNA probe loci on the same DNA fragments are summarized together (\verb+summaryWindow="fragment"+) or if the DNA probe loci are summarized separately (\verb+summaryWindow="std"+ is the default setting). << echo=TRUE>>= summaryMethod <- "Variational" summaryParam <- list() summaryParam$cyc <- c(10) callParam <- list(cores=cores, runtype=runtype) slData <- slSummarization(normData, summaryMethod=summaryMethod, summaryParam=summaryParam, callParam=callParam, summaryWindow="std") show(slData) assayData(slData)$intensity[1:10, 1:5] ## intensity values assayData(slData)$L_z[1:10, 1:5] ## relative values @ \noindent Now, the intensity values of the non-polymorphic probes (CN-probes) were normalized. \begin{Sinput} > if (exists("annotDir")) { npData <- normalizeNpData(filenames, cores, annotDir=annotDir) } else { npData <- normalizeNpData(filenames, cores, runtype=runtype) } \end{Sinput} \noindent This step combines non-polymorphic probes and single-locus summarized SNP-probes. << echo=TRUE>>= combData <- combineData(slData, npData, runtype=runtype) show(combData) @ \noindent In this final step intensity values of non-polymorphic probes and single-locus summarized SNP-probes are multi-locus summarized with a windows size of 5 probes (\verb+windowParam$windowSize <- 5+). The window size for multi-loci modeling is a hyperparameter which trades off low FDR against high resolution. We recommend a window size of 5 as default, 3 for high resolution, and 7 for low FDR. Setting \verb+windowParam$overlap <- TRUE+ inidicates that the multi-locus summariaztion is done by step-wise moving the window over the genome. Alternatively to a fixed number of CNV or SNP sites, the cn.FARMS software allows defining a window in terms of base pairs. To make use of this option set \verb+windowMethod <- "bps"+. In this case, multi-loci modeling may use a different number of meta-probes at different DNA locations, in particular for less than two meta-probes multi-loci modeling is skipped. Note, however that controlling the FDR is more difficult because a minimal number of meta-probes cannot be assured for each window and modeling with few meta-probes is prone to false discoveries. << echo=TRUE>>= windowMethod <- "std" windowParam <- list() windowParam$windowSize <- 5 windowParam$overlap <- TRUE summaryMethod <- "Variational" summaryParam <- list() summaryParam$cyc <- c(20) callParam <- list(cores=cores, runtype=runtype) mlData <- mlSummarization(slData, windowMethod =windowMethod, windowParam =windowParam, summaryMethod=summaryMethod, summaryParam =summaryParam, callParam =callParam) names(assayData(mlData)) assayData(mlData)$intensity[1:10, 1:5] assayData(mlData)$L_z[1:10, 1:5] @ \noindent Next, the summarized data will be segmented in order to identify break points. Therefore we provide a parallelized version of \Rpackage{DNAcopy}. << echo=TRUE>>= colnames(assayData(mlData)$L_z) <- sampleNames(mlData) segments <- dnaCopySf( x =assayData(mlData)$L_z[, 1:10], chrom =fData(mlData)$chrom, maploc =fData(mlData)$start, cores =cores, smoothing=FALSE) head(fData(segments)) @ \noindent To get further information, e.g. how to process Affymetrix 500K arrays, please check the followings demos. \begin{Sinput} > demo(package="cn.farms") Demos in package 'cn.farms': demo01Snp6 Demo for an Affymetrix SNP6 data set demo02Snp5 Demo for an Affymetrix SNP5 data set demo03Snp500k Demo for an Affymetrix 500K data set demo04Snp250k Demo for an Affymetrix 250K NSP data set demo05Testing Run the examples \end{Sinput} \noindent The most recent \Rpackage{cn.farms} version can be found at \href{http://www.bioinf.jku.at/software/cnfarms/cnfarms.html}{http://www.bioinf.jku.at/software/cnfarms/cnfarms.html}. \section{Setup} \label{sec:setup} This vignette was built on: << echo=TRUE>>= sessionInfo() @ \bibliographystyle{natbib} \bibliography{cnv} \end{document}