%\VignetteIndexEntry{Segmentation Overview} %\VignetteDepends{limma, DNAcopy, GLAD} %\VignetteKeywords{Segmentation, aCGH} %\VignettePackage{snapCGH} \documentclass[11pt,a4paper]{article} \title{snapCGH: Segmentation,\\ Normalization and Processing of aCGH Data\\ Users' Guide} \author{ML Smith, JC Marioni, TJ Hardcastle, NP Thorne} %\settocname{Table of Contents} \setlength{\oddsidemargin}{0.5cm} %margin \setlength{\evensidemargin}{0.5cm} %margin \setlength{\textwidth}{15.2cm} \addtolength\footskip{1cm} \pagestyle{plain} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \usepackage{Sweave} \begin{document} \maketitle \section*{Citing snapCGH} If you have used \Rpackage{snapCGH} in your work please cite the package using the following:\\ Smith, M.L., Marioni, J.C., Hardcastle, T.J., Thorne, N.P.\\ \indent snapCGH: Segmentation, Normalization and Processing of aCGH Data Users' Guide,\\ \indent Bioconductor, 2006\\ \noindent If you make use of the function \Rfunction{runBioHMM}, please cite:\\ Marioni, J. C., Thorne, N. P., and Tavar\'{e}, S. (2006).\\ \indent BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data.\\ \indent \textit{Bioinformatics} \textbf{22}, 1144 - 1146 \section*{Introduction} This document outlines some of the commands used to read in, investigate and subsequently segment array CGH data. The files analysed represent 2 breast cancer cell lines obtained from Jessica M Pole and Paul AW Edwards. <<1,results=hide>>= library(snapCGH) @ \Rpackage{snapCGH} is designed to be used in conjunction with \Rpackage{limma} and so it will automatically load that library before proceeding. In addition to \Rpackage{limma}, the following packages are also loaded: \Rpackage{GLAD}, \Rpackage{DNAcopy}, \Rpackage{tilingArray} and \Rpackage{aCGH}. Each of these impliments an alternative segmentation method that may be applied to the data.\\ \section*{Reading Data} We read in the samples and create the initial RG object using the following commands. <<2, results=hide>>= datadir <- system.file("testdata", package="snapCGH") targets <- readTargets("targets.txt", path=datadir) RG1 <- read.maimages(targets$FileName, path=datadir, source = "genepix") @ % % Positional information about the clones on the array (e.g. which chromosome and the position on the chromosome a clone if from) can be incorporated using the \Rfunction{readPositionalInfo} function. This function accepts either an \Robject{RGList} or \Robject{MAList} along with an argument specifying which array platform the data was produced on. Currently the only supported platforms are Agilent, Bluefuse and Nimblegen, but this should expand in the near future. If your platform isn't supported or \Rfunction{readPostitionalInfo} returns an error (e.g. if the a new version of the array output data becomes incompatible with the current function) then the positional information can be read in separately using the \Rfunction{read.clonesinfo} function. In order to do this it is necessary to create a clones info file. Such a file (which can be created using Excel and saved as a 'txt' file) must contain columns called Position and Chr which give the position along a chromosome (in Mb) and the chromosome to which a clone belongs. The X and Y chromosomes can be labelled either as X and Y or 23 and 24. Both instances are handled by \Rfunction{read.clonesinfo}. The clones info file must be ordered in the same way as the clones are ordered in the array output data. The second command adds information about the structure of the slide (blocks/rows/columns) to the \Robject{RG} object. Finally we read in a spot types file. This file contains information about the control status of particular spots on the array and allows specific spots to be highlighted in many of the plotting functions. The content of a spot types file is covered extensively within the \Rpackage{limma} manual. <<3, results=hide>>= RG1 <- read.clonesinfo("cloneinfo.txt", RG1, path=datadir) RG1$printer <- getLayout(RG1$genes) types <- readSpotTypes("SpotTypes.txt", path=datadir) RG1$genes$Status <- controlStatus(types, RG1) @ Commonly, when aCGH experiments are carried out the reference channel is dyed using Cy5 and the test channel is dyed using Cy3. This is the opposite way to expression data. In order to take this into account we need to specify which channel is the reference within our \Rclass{RGList}. To do this we create a design vector with each column corresponding to an array in the experiment. A value of 1 indicates that the Cy3 channel is the reference, whilst a value of -1 equates to Cy5 being the reference, as is the case in this example. <<3a>>= RG1$design <- c(-1,-1) @ We now proceed to use the function \Rfunction{backgroundCorrect} to remove the background intensity for each spot. In this example we have chosen the method 'minimum' which subtracts the background value from the forground. Any intensity which is zero or negative after the background subtraction is set equal to half the minimum of the positive corrected intensities for that array. For other background correction methods please see the appropriate help file. <<4>>= RG2 <- backgroundCorrect(RG1, method="minimum") @ %% Drop this step below?????? this should be a weight function Next, we normalise the data. Here we will carry out a (global) median normalisation. Other options for normalization methods are: \Rfunarg{none}, \Rfunarg{loess}, \Rfunarg{printtiploess}, \Rfunarg{composite} and \Rfunarg{robustspline}. The output of the normalization function is a new type of object called an \Rclass{MAList}. This is composed of the $log_{2}$ ratios, intensities, gene and slide layout information which it gleans from the \Robject{RG} object. If you feel that no normalisation is required then it is possible to skip this stage and proceed directly onto the next step maintaining the \Rclass{RGList}. <<5>>= MA <- normalizeWithinArrays(RG2, method="median") @ We are now ready to process the data with the purpose of segmenting the dataset into regions corresponding to sections of the genome where there are the same number of copy number gains or losses.\\ Firstly, we use the \Rfunction{processCGH} to 'tidy up' the \Rclass{MAList} object. The \Rfunarg{method.of.averaging} option defines how clones of the same type should be averaged. If this is specified the duplicates are removed following the averaging leaving only one occurence of each clone set. The \Rfunarg{ID} argument is used to indicate the name of the column containing a unique identifier for each clone type. It is this identifier that is used when averaging replicates. As mentioned above this function accepts both \Rclass{RGLists} and \Rclass{MALists}. <<6, results=hide>>= MA2 <- processCGH(MA,method.of.averaging=mean, ID = "ID") @ \section*{Segmentation} We are now ready to fit the segmentation method. For larger data sets this step can take a long time (several hours). In this example we call the homogeneous hidden Markov model available in the \Rpackage{aCGH} package. <>= SegInfo.Hom <- runHomHMM(MA2, criteria = "AIC") @ At the current time there are methods for calling four other segmentation algorithms in addition to the method shown above. Three of these are included as wrapper functions to methods available in other packages, specifically: \Rpackage{DNAcopy}, \Rpackage{GLAD} and \Rpackage{tilingArray}. These functions can be called using the following code: <>= SegInfo.GLAD <- runGLAD(MA2) SegInfo.DNAcopy <- runDNAcopy(MA2) SegInfo.TilingArray <- runTilingArray(MA2) @ The final method, called BioHMM, is a heterogeneous hidden Markov model and is maintained within \Rpackage{snapCGH}. It can be called using the following command. By default it incorporates the distance between clones into the model assigning a higher probability of state change to clones that are a larger distance apart on a chromosome. This option is control using the argument \Rfunarg{useCloneDists}. <>= SegInfo.Bio <- runBioHMM(MA2) @ We now deal with the fact that the segmentation methods sometimes have a tendency to fit states whose means are very close together. We overcome this problem by merging states whose means are within a given threshold. There are two different methods for carrying out the merging process. For more information on their differences please see the appropriate page in the helpfiles. <>= SegInfo.Hom.merged <- mergeStates(SegInfo.Hom, MergeType = 1) @ We are now ready to use any of the plotting functions available in the library. \section*{Plotting Functions} The library comes with a variety of plotting functions that provide visual representations of the data at various stages of the analysis process. Firstly we will look at the \Rfunction{genomePlot} function. This function takes either an \Rclass{MAList} or a \Rclass{SegList} object (in this example we've used an \Rclass{MAList}) and plots the M-value for each gene against it's position on the genome. The \Rfunarg{array} argument indicates which array is plotted. This function utilizes the spot types data that was read in earlier to highlight specific genes of interest. <>= genomePlot(MA2, array = 1) @ \begin{figure}[!ht] \begin{center} <>= <> @ \end{center} \end{figure} \pagebreak It is also possible to look at specific chromosomes, rather than the entire genome as in the previous example. Which particular chromosome is to be plotted is specified using the \Rfunarg{chrom.to.plot} argument. <