%\VignetteIndexEntry{Clonality} %\VignetteDepends{DNAcopy} %\VignetteKeywords{Clonality testing} %\VignettePackage{Clonality} %\VignetteDepends{DNAcopy, gdata, Clonality} \documentclass[11pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \SweaveOpts{echo=FALSE} \setlength{\textheight}{8.5in} \setlength{\textwidth}{6in} \setlength{\topmargin}{-0.25in} \setlength{\oddsidemargin}{0.25in} \setlength{\evensidemargin}{0.25in} \begin{document} \setkeys{Gin}{width=0.99\textwidth} \title{\bf Clonality: A Package for Clonality testing} \author{Irina Ostrovnaya} \maketitle \begin{center} Department of Epidemiology and Biostatistics\\ Memorial Sloan-Kettering Cancer Center\\ {\tt ostrovni@mskcc.org}\\ \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Overview} This document presents an overview of the {\tt Clonality} package. This package can be used to test whether two tumors are clonal (metastases) or independent (double primaries) using their copy number or loss of heterozygosity (LOH) profiles. For LOH data it implements Concordant Mutations (CM) test \citep{begg07} and Likelihood Ratio (LR) test \citep{lrloh}. For copy number profiles the package implements the methodology based on the likelihood ratio described in \citep{cghSM}. \section{Copy number profiles} We will show how to test independence of the copy number profiles from the same patient using breast cancer data. The BAC arrays of the pairs of lobular carcinoma in situ (LCIS) and invasive lobular carcinoma (ILC) were studied in \citep{hwang} and available at \url{http://waldman.ucsf.edu/Breast/Hwang.data.xls}. We will load package {\tt gdata} in order to read the excel file. <>= library(DNAcopy) library(Clonality) library(gdata) @ We will read the dataset and remove rows or columns with too many NAs. <>= data<-read.xls("http://waldman.ucsf.edu/Breast/Hwang.data.xls") data<-data[!is.na(data[,2]),] data<-data[apply(is.na(data),1,sum)<=50,] data<-data[,apply(is.na(data),2,sum)<=1000] @ <>= data[1:5,1:10] @ Rows of data correspond to probes (genomic markers). The first column is probe name; the second column is the chromosome where the probe is located; the third column is probe's genomic position recorded as an index. All subsequent columns correspond to the samples and contain log-ratios. Since there are no genomic locations in this dataset, we will download another dataset and map the genomic locations to the probes. If the genomic locations were known, we would not need this step and the column with the probe names. <>= arrayinfo<-read.xls("http://waldman.ucsf.edu/Colon/nakao.data.xls") #needed to extract genomic locations data$Position<-arrayinfo$Mb[match(toupper(as.character(data[,1])),toupper(as.character(arrayinfo[,1])))] data<-data[!is.na(data$Position),] @ Now we will remove repeated genomic locations: <>= length(unique(paste(data$Chromosome, data$Position))) #there are repeated genomic locations data<-data[c(TRUE,data$Position[-1]!=data$Position[-1864]),] #discard probes with repeated genomic locations data<-data[data$Chromosome<=22,] #getting rid of X and Y chromosomes @ <>= dim(data) @ As the final step of data preparation, we have to create a CNA (copy number array) object as described DNAcopy. To save computational time, we only take the first three patients. (As a result, gain/loss frequencies used for analysis will be very imprecise and the reference distribution will have very few comparisons.) <>= dataCNA<-CNA(as.matrix(data[,c(4:6,28:30)]),maploc=data$Position,chrom=data$Chromosome,sampleid=names(data)[c(4:6,28:30)]) #taking the first 3 patients only to shorten the computation time; use c(4:51) for the full dataset @ Our methodology allows at most one genomic change per chromosome arm, estimated by the one-step Circular Binary Segmentation (CBS) algorithm (\citep{cbs2}). If the data had many more than 15,000 markers, most outstanding, and likely a short change would be picked up, which would not be representative of the chromosome pattern. To avoid this, one can use the following function: <>= dataAve<- ave.adj.probes(dataCNA,2) @ Here we have averaged every two consecutive marker. For this dataset, though, averaging is not necessary. The chromosomes should be split into arms before the clonality analysis since it increases the number of independent genomic units. <>= dataCNA$maploc<-dataCNA$maploc*1000 #transforming maploc to Kb scale dataCNA$chrom<- splitChromosomes(dataCNA$chrom,dataCNA$maploc) #splits the chromosomes into arms @ Next we have to create a vector of patient labels that matches the samples. <>= ptlist<-substr(names(dataCNA)[-c(1,2)],1,4) @ Finally, we can run the clonality analysis: <
>= results<-clonality.analysis(dataCNA, ptlist, nmad = 1.25, reference = TRUE, allpairs = FALSE) @ The main information is in the output LR: <>= results$LR @ The likelihood ratios LR2 for sample LC02 is much smaller than 1, therefore these tumors are independent. Patients LC03 and LC04 have LR2 much higher than one, and we can conclude that their tumors are clonal. The reference distribution for LR2 under the hypothesis of independence is constructed by pairing tumors from different patients that are independent by default. The $p$-value column reflects the percentiles of a particular patient's LR2 in the reference distribution: clonal tumors would have small $p$-values. We can view the genomewide plots of patient LC03 using: <>= genomewidePlots(results$OneStepSeg, results$ChromClass,ptlist , c("LC03LCIS", "LC03ILC"),results$LR, plot.as.in.analysis = TRUE) @ Patterns for each chromosome would be plotted by: %<>= %chromosomePlots(results$OneStepSeg, ptlist,ptname="LC03",nmad=1.25) % @ The overlap between the histograms of LR2 from original pairs of tumors and the reference distribution are produced by: <>= histogramPlot(results$LR[,4], results$refLR[,4]) @ \subsection{Choice of segmentation algorithm} Note that the user can potentially specify the segmentation method to be used. Currently the default behavior of the clonality.analysis function is to use the CBS algorithm to identify the most significant change in each chromosome arm. The internal function for this purpose is "oneseg" called as oneseg(x, alpha, nperm, sbdry) There are 4 arguments to oneseg:\\ x: is the finite logratio data ordered by genomic position. \\ alpha: the significance level used by CBS. \\ nperm: the number of permutations for the reference distribution. \\ sbdry: early stopping boundary for declaring no change (calculated from alpha and nperm). \\ The output of this function is a vector of 3 numbers where the first is the number of change-points detected (must be 0, 1 or 2), and the second and the third numbers are the start and end of the left segment if there is only one change-point, and of the middle segment when there are 2 change-points. The function allows the user to specify alternative alpha and nperm for 'oneseg' as a list using the segpar argument e.g. segpar=list(alpha=0.05, nperm=1000). Since sbdry is always calculated in clonality.analysis function from alpha and nperm it is not specified. Alternate segmentation algorithm can be used. It requires the user to create a function that takes the ordered logratio from one chromosome arm as argument "x" as in oneseg. The name of this function should not be 'oneseg' and is passed through the 'segmethod' argument and all other necessary arguments that are needed passed as a list through 'segpar' argument. \section{LOH data} The LOH data has to be combined in a matrix where first column has marker names and the following columns have LOH calls for each sample. Here we simulate a dataset with 10 pairs of tumors and 20 markers. First pair of tumor is clonal, and the rest of them are independent. If the marker is heterozygous and there is no LOH, then it is denoted by 0. LOH at maternal or paternal alleles is marked by 1 or 2. <>= set.seed(25) LOHtable<-cbind(1:20,matrix(sample(c(0,1,2),20*20,replace=TRUE),20)) LOHtable[,3]<-LOHtable[,2] LOHtable[1,3]<-0 @ <>= LOHtable[,1:5] LOHclonality(LOHtable,rep(1:10,each=2),pfreq=NULL,noloh=0,loh1=1,loh2=2) @ First p-value is small, indicating clonality, for both CM and LR tests. The rest of the p-values are not significant. Markers that are not informative (e.g. homozygous) in a particular tumor should be given NA instead of a call. Such markers will be dropped from the analysis of this specific patient. Below are the details of the session information: <>= sessionInfo() @ \bibliographystyle{apalike} \bibliography{Clonality} \end{document}