% \VignetteIndexEntry{ADaCGH2 Overview} % \VignetteDepends{ADaCGH2} % \VignetteKeywords{ADaCGH2 CGH segmentation parallelization} % \VignettePackage{ADaCGH2} \documentclass[a4paper,11pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \SweaveOpts{echo=TRUE} \usepackage[latin1]{inputenc} \title{Analysis of data from aCGH experiments using parallel computing and ff objects} \author{Ram\'on D\'{\i}az-Uriarte$^{1}$ and Daniel Rico$^{1}$} \begin{document} \maketitle \begin{center} 1. Structural Computational Biology Group. Spanish National Cancer Center (CNIO), Madrid (SPAIN). {\tt rdiaz02@gmail.com}, {\tt drico@cnio.es} \end{center} \tableofcontents \section{Overview:} ADaCGH2 is a package for the analysis of CGH data. The main features of ADaCGH2 are: \begin{itemize} \item Parallelization of (several of) the main segmentation/calling algorithms currently available, allow efficient usage of computing clusters. \item Data access data using the \textit{ff} package (\url{http://cran.r-project.org/web/packages/ff/index.html}), making it possible to analyze data from very large projects. \item Parallelization and \textit{ff} are used simultaneously. WaviCGH \cite{Carro2010} (\url{http://wavi.bioinfo.cnio.es}), a web-server application for the analysis and visualization of array-CGH data that uses ADaCGH2, consitutes a clear demonstration of the usage of \textit{ff} on a computing cluster with shared storage over NFS. \end{itemize} ADaCGH2 is a major re-write of our former package ADaCGH \cite{Diaz-Uriarte2007}. We have improved the parallelization and, specially, changed completely the data handling routines. Thanks to the usage of the \textit{ff} package, ADaCGH2 can now analyze data sets of more than four million probes in machines with no more than 2 GB of RAM. \section{A commented example} For both interactive and non-interactive executions we will often execute the following in sequence: \begin{enumerate} \item Convert the original data to \textit{ff} objects \item Initialize the computing cluster \item Carry out segmentation and calling \item Plot the results \end{enumerate} We cover each in turn in the remaining of this section. <<>>= library(ADaCGH2) @ <>= options(width = 70) @ \subsection{Choosing a working directory} This package reads and writes quite a few files to the hard drive. The easiest way to organize your work is to create a separated directory for each project. All plot files and ff data will be stored there. At the end of this example, we will remove this directory. (Just in case, I check for its existence first.) <<>>= if(!file.exists("ADaCGH2_vignette_tmp_dir")) dir.create("ADaCGH2_vignette_tmp_dir") originalDir <- getwd() setwd("ADaCGH2_vignette_tmp_dir") @ \subsection{Convert original data to \texttt{ff} objects} The first two steps need not be carried out in that sequence. Conversion of original data does not use the computing cluster, but can be done after initialization of the cluster. The purpose of this step is to write down the \texttt{ff} files to disk, so they are available for the segmentation and ploting functions. (See below, section \ref{snap} for examples of using objects from Limma and snapCGH). To allow the conversion to be carried out in data from previous sessions, the conversion takes as input the name of an RData. <<>>= fname <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx1") tableChromArray <- inputDataToADaCGHData(filename = fname) tableChromArray @ The first command is used in this example to find the complete path of the example data set. The actual call to the function is the second expression. The return object is not used for anything further and can be deleted (or not saved). With large data sets that can be the single step that consumes the most RAM, since we need to load the original data into R. Even if we remove the original data and call \texttt{gc()}, R might not return all of the memory to the operating system, and this might be inconvenient in multiuser environments and/or long running processes. Thus, it is possible to execute the above in a separate R process that is spawned exclusively just for the conversion. For instance, we could use the \texttt{multicore} package and do: <>= require(multicore) parallel(inputDataToADaCGHData(filename = fname), silent = FALSE) tableChromArray <- collect()[[1]] if(inherits(tableChromArray, "try-error")) { stop("ERROR in input data conversion") } @ That way, the \texttt{ff} are produced and stored locally in the hard drive, but the R process where the original data was loaded (and the conversion to \texttt{ff} carried out) dies after the conversion, freeing back the memory to the operating system. \subsection{Initialize computing cluster} Initialization is done with the function \texttt{snowfallInit}, a function which uses the package \texttt{snowfall} to initialize the cluster and set up the slave nodes. In the example we will use a sockect cluster, but MPI can also be used (our servers use LAM-MPI). In this example we will use 2 nodes. <<>>= snowfallInit(universeSize = 2, typecluster = "SOCK") @ \subsection{Carry out segmentation and calling} Segmentation and calling are carried out with the \texttt{pSegment} functions. Here we show just one such example. Many more are available from the help. <>= help(pSegment) @ <<>>= hs_mad.out <- pSegmentHaarSeg("cghData.RData", "chromData.RData", merging = "MAD") @ We can take a quick look at the output. We first open the \texttt{ff} objects (the output is a list of \texttt{ff} objects) and then call \texttt{summary} on the smoothed list <<>>= lapply(hs_mad.out, open) summary(hs_mad.out[[1]][,]) @ \subsection{Plot the results} Plotting produces PNG files for easier sharing over the Internet. The plotting function takes as main arguments the names of RData files. We first write to disk the (\texttt{ff}) object with the results, and then call the plotting function: <<>>= save(hs_mad.out, file = "hs_mad.out.RData", compress = FALSE) pChromPlot(outRDataName = "hs_mad.out.RData", cghRDataName = "cghData.RData", chromRDataName = "chromData.RData", posRDataName = "posData.RData", probenamesRDataName = "probeNames.RData", imgheight = 350) @ \section{Using CGHregions} The CGHregions package \cite{CGHregions-manual} is a BioConductor package that implements a well known method \cite{VandeWiel2007} for dimension reduction for aCGH data (see a review of common regions issues and methods in \cite{Rueda2010a}). The \texttt{CGHregions} function accepts different type of input, among others a data frame. The function \texttt{outputToCGHregions} produces that data frame, ready to be used as input to CGHregions (for the next example, you will need to have the \textit{CGHregions} package installed). <<>>= forcghr <- outputToCGHregions(hs_mad.out) if(require(CGHregions)) { regions1 <- CGHregions(forcghr) regions1 } @ Please note that \texttt{outputToCGHregions} does NOT check if the calls are something that can be meaningfully passed to CGHregions. In particular, you probably do NOT want to use this function when \texttt{pSegment} has been called using \texttt{merging = "none"}. \section{\label{snap}Input data from Limma and snapCGH} Many aCGH studies use pre-processing steps similar to those of gene expression data. The \texttt{MAList} object, from \textit{Limma} and \texttt{SegList} object, from \textit{snapCGH}, are commonly used to store aCGH information. The following examples illustrate the usage of the function \texttt{inputDataToADaCGHData} to convert \texttt{MAList} and \texttt{SrgList} data into a format suitable for \textit{ADaCGH2}. We will start with objects produced by \textit{snapCGH}. The following code is copied from the \textit{snapCGH} vignette (pp.\ 2 and 3). Please check the original vignette for details. In summary, a set of array files are read, the data are normalized and, finally, averaged over clones. snapCGH uses limma for the initial import of data and, next, with the \texttt{read.clonesinfo} function adds additional information such as chromosome and position. The \texttt{MA} object created is of class \texttt{MAList}, but with added information (compared to a basic, original, limma MAList object). \texttt{MA2} is of type \texttt{SegList}. <<>>= datadir <- system.file("testdata", package = "snapCGH") targets <- readTargets("targets.txt", path = datadir) RG1 <- read.maimages(targets$FileName, path = datadir, source = "genepix") RG1 <- read.clonesinfo("cloneinfo.txt", RG1, path = datadir) ## snapCGH-specific RG1$printer <- getLayout(RG1$genes) types <- readSpotTypes("SpotTypes.txt", path = datadir) RG1$genes$Status <- controlStatus(types, RG1) RG1$design <- c(-1, -1) RG2 <- backgroundCorrect(RG1, method = "minimum") ## class RGList MA <- normalizeWithinArrays(RG2, method = "median") ## class MAList class(MA) MA2 <- processCGH(MA, method.of.averaging = mean, ID = "ID") ## class SegList class(MA2) @ All the information (intensity ratios and location) is available in the \texttt{MA} and \texttt{MA2} objects. We can directly convert them to \textit{ADaCGH2} objects (we set \texttt{na.omit = TRUE} as the data contain missing values). The first call process the \texttt{MAList} and the second the \texttt{SegList}. <<>>= tmp <- inputDataToADaCGHData(MAList = MA, na.omit = TRUE) tmp <- inputDataToADaCGHData(MAList = MA2, na.omit = TRUE, minNumPerChrom = 4) @ We need to change the argument to \texttt{minNumPerChrom} because, after the data processing step in \texttt{processCGH}, chromosome 21 has only four observations. The original \texttt{MAList} as produced directly from \textit{limma} do not have chromosome and position information. That is what the \texttt{read.clonesinfo} function from \textit{snapCGH} did. To allow using objects directly from \textit{limma} and incorporating position information, we will use an approach to directly mimicks that in \textit{snapCGH}. If you use and \texttt{MAList} you can also provide a \texttt{cloneinfo} argument; this can be either the full path to a file with the format required by \texttt{read.clonesinfo} or, else, the name of an object with (at least) three columns, names \texttt{ID}, \texttt{Chr}, and \texttt{Position}. We copy from the limma vignette (section 3.2, p.8), changing the names of objets by appending ``.limma''. <<>>= targets.limma <- readTargets("targets.txt", path = datadir) RG.limma <- read.maimages(targets.limma, path = datadir, source="genepix") RG.limma <- backgroundCorrect(RG.limma, method="normexp", offset=50) MA.limma <- normalizeWithinArrays(RG.limma) @ We can add the chromosomal and position information in two different ways. First, as in \texttt{read.clonesinfo} or, else, we can provide the name of a file (with the same format as required by \texttt{read.clonesinfo}). Note that \texttt{fclone} is a path (and, thus, a character vector). <<>>= fclone <- list.files(path = system.file("testdata", package = "snapCGH"), full.names = TRUE, pattern = "cloneinfo.txt") fclone tmp <- inputDataToADaCGHData(MAList = MA.limma, cloneinfo = fclone, na.omit = TRUE) @ Alternatively, we can provide the name of an object with the additional information. For illustrative purposes, we can use here the columns of the \texttt{MA} object. <<>>= acloneinfo <- MA$genes tmp <- inputDataToADaCGHData(MAList = MA.limma, cloneinfo = acloneinfo, na.omit = TRUE) @ \section{Clean up actions} These are not strictly necessary, but we will explicitly stop the cluster and remove the directory we created. To make sure there are no file permission problems, we also explicitly delete some of the "ff" files and objects (and we wait a few seconds to allow pending I/O operations to happen before we delete the directory). <<>>= sfStop() load("chromData.RData") load("posData.RData") load("cghData.RData") delete(cghData); rm(cghData) delete(posData); rm(posData) delete(chromData); rm(chromData) lapply(hs_mad.out, delete) rm(hs_mad.out) setwd(originalDir) print(getwd()) Sys.sleep(3) @ <<>>= unlink("ADaCGH2_vignette_tmp_dir", recursive = TRUE) print(dir()) Sys.sleep(3) @ %% \section{Implementation details} \newpage \bibliographystyle{apalike} \bibliography{ADaCGH2} \end{document} % To try to get call graphs with proftools % <<>>= % library(proftools) % library(ADaCGH2) % ## Do not use cluster if drawing call graphs % ## snowfallInit(universeSize = 2, typecluster = "SOCK") % fname <- list.files(path = system.file("data", package = "ADaCGH2"), % full.names = TRUE, pattern = "inputEx2") % tableChromArray <- inputDataToADaCGHData(filename = fname) % Rprof(tmp <- tempfile()) % cbs.out <- pSegmentDNAcopy("cghData.RData", % "chromData.RData") % Rprof() % profileCallGraph2Dot(readProfileData(tmp), score = "total", % filename = "pSegment.dot", % rankdir = "TB") % fname <- list.files(path = system.file("data", package = "ADaCGH2"), % full.names = TRUE, pattern = "inputEx1") % Rprof(tmp <- tempfile()) % hmm_mad.out <- pSegmentHMM("cghData.RData", % "chromData.RData", merging = "MAD") % Rprof() % profileCallGraph2Dot(readProfileData(tmp), score = "total", % filename = "pSegmentHMM.dot", % rankdir = "TB") % ## edit, change name, see and produce pdf % ## dotty pSegment-DNAcopy.dot % ## dot -Tpdf pSegment-DNAcopy.dot > foo.pdf % ## xpdf foo.pdf % @