%\VignetteIndexEntry{mBPCR} %\VignetteDepends{} %\VignetteKeywords{Bayesian Piecewise Constant Regression for DNA copy number estimation} %\VignettePackage{mBPCR} \documentclass[11pt]{article} \usepackage{amsmath} \SweaveOpts{echo=FALSE} \begin{document} \setkeys{Gin}{width=0.99\textwidth} \title{\bf mBPCR: A package for DNA copy number profile estimation} \author{P. M. V. Rancoita$^{1,2,3}$ and M. Hutter$^{4}$} \date{} \maketitle \noindent $^1$Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), Manno-Lugano, Switzerland\\ $^2$Laboratory of Experimental Oncology, Oncology Institute of Southern Switzerland (IOSI), Bellinzona, Switzerland\\ $^3$Dipartimento di Matematica, Universit\`{a} degli Studi di Milano, Milano, Italy\\ $^4$RSISE$\,@\,$ ANU and SML$\,@\,$NICTA, Canberra, ACT, 0200, Australia \begin{center} {\tt paola@idsia.ch} \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} The algorithm mBPCR is a tool for estimating the profile of the log$_2$ratio of copy number data. The procedure is a Bayesian piecewise constant regression and can be applied, generally, to estimate any piecewise constant function (like the log$_2$ratio of the copy number data). The method is described in \cite{Rancoita:08a} and represents a significant improvement of the original algorithm BPCR, presented in \cite{Hutter:07pcregx} and \cite{Hutter:07pcreg}. This document shows several examples of how to use the package. The data used are principally: the Affymetrix GeneChip Mapping 10K Array data of cell line REC-1 \cite{Rinaldi:06a} and the Affymetrix GeneChip Mapping 250K Array data of chromosome 11 of cell line JEKO-1 (unpublished). \section{Example 1: profile estimation} In this example we estimate the copy number profile of sample {\tt rec10k}. <>= library(mBPCR) @ \noindent First, we import the 10K Array data of cell line REC-1. <>= data(rec10k) @ \noindent During the computation, the algorithm needs to create a vector of size ({\tt maxProbeNumber}+1)({\tt maxProbeNumber}+2)/2, where {\tt maxProbeNumber} is the maximum number of probes of a chromosome (or arm of a chromosome, for denser Array). Hence, before the estimation, we must verify if we have enough RAM to allocate such a vector. In case of the 10K Array data, we know that all chromosomes have less than 1000 probes, thus we verify if we can set {\tt maxProbeNumber=1000}, with the following commands, <>= maxProbeNumber <- 1000 @ <>= A <- array(1, dim=(maxProbeNumber+1)*(maxProbeNumber+2)/2) @ \noindent If last command does not give any error regarding the memory allocation, then we can set {\tt maxProbeNumber=1000} and remove {\tt A} to save space. <>= remove(A) @ \noindent To estimate the profile of one or more chromosomes, we need to set the parameter {\tt chrToBeAnalyzed} with the vector of the names of the chromosomes that we want to analyze (the names allowed are: {\tt X}, {\tt Y} and any integer from {\tt 1} to {\tt 22}). In the following example, we estimate the profile of chromosomes 3 and 5 of sample REC-1. Instead, to estimate the profile of the whole genome, we need to set {\tt chrToBeAnalyzed = c(1:22,"X")}. <>= results <- estProfileWithMBPCR(rec10k$SNPname, rec10k$Chromosome, rec10k$PhysicalPosition, rec10k$log2ratio, chrToBeAnalyzed=c(3,5), maxProbeNumber=1000) @ \noindent We can nicely write the results on tab delimited files in the working directory, by using the function {\tt writeEstProfile} (Tables \ref{tab_mBPCR} and \ref{tab_bounds} show the first lines of the two tables created by the command below). Setting {\tt sampleName="rec10k"}, the name of the files will contain the name of the sample {\tt rec10k}. If {\tt path=NULL}, the tables will not be written on files, but only returned by the function. <>= writeEstProfile(path='', sampleName='rec10k',rec10k$SNPname, rec10k$Chromosome, rec10k$PhysicalPosition, rec10k$log2ratio, chrToBeWritten=c(3,5), results$estPC, results$estBoundaries) @ <>= library(xtable) temp <- writeEstProfile(path=NULL, sampleName='rec10k',rec10k$SNPname,rec10k$Chromosome, rec10k$PhysicalPosition, rec10k$log2ratio,chrToBeWritten=c(3,5), results$estPC, results$estBoundaries) @ <>= xx <- xtable(head(temp$mBPCRestimate)) label(xx) <- "tab_mBPCR" caption(xx) <- "Example of table containing the profile estimated with mBPCR." align(xx) <- c("|c|","|c|","c|","c|","c|","c|") print(xx,latex.environments=c("center","small"),include.rownames=FALSE) @ \tabcolsep=2pt <>= xx <- xtable(head(temp$mBPCRbreakpoints)) label(xx) <- "tab_bounds" caption(xx) <- "Example of table containing a summary of the breakpoints estimated with mBPCR." align(xx) <- c("|c|","|c|","c|","c|","c|","c|","c|","c|") print(xx,latex.environments=c("center","small"),include.rownames=FALSE) @ \noindent We can also estimate the profile with a Bayesian regression curve \cite{Rancoita:08a}. For example, with the following command we estimate the profile of chromosome~3 using both mBPCR and the Bayesian Regression Curve with $\hat{K}_2$. <>= results <- estProfileWithMBPCR(rec10k$SNPname, rec10k$Chromosome, rec10k$PhysicalPosition, rec10k$log2ratio, chrToBeAnalyzed=3, regr='BRC', maxProbeNumber=1000) @ \noindent After the estimation, we can plot the profiles using the function {\tt plotEstProfile}. For example, the following command plots the profile of chromosome~3 estimated with both methods. <>= plotEstProfile(sampleName='rec10k', rec10k$Chromosome, rec10k$PhysicalPosition, rec10k$log2ratio, chrToBePlotted=3, results$estPC, maxProbeNumber=2000, regrCurve=results$regrCurve, regr='BRC') @ As second example, we estimate the profile of chromosome 11 of sample JEKO-1. Notice that we need to set {\tt maxProbeNumber <- 9000} (because both arms of chromosome 11 contain less than 9000 probes) and, if this is possible on your machine, the computation can be long. Moreover, for the estimation, we use the estimates of the parameters computed on the whole genome to achieve a better profile (for the estimation of the global parameters, see the use of function {\tt estGlobParam} in Section~3). <>= data(jekoChr11Array250Knsp) @ <>= maxProbeNumber <- 9000 @ <>= A <- array(1, dim=(maxProbeNumber+1)*(maxProbeNumber+2)/2) @ <>= remove(A) @ <>= results <- estProfileWithMBPCR(jekoChr11Array250Knsp$SNPname, jekoChr11Array250Knsp$Chromosome, jekoChr11Array250Knsp$PhysicalPosition, jekoChr11Array250Knsp$log2ratio, chrToBeAnalyzed=11, maxProbeNumber=9000, rhoSquare=0.0479, nu=-3.012772e-10, sigmaSquare=0.0699) @ <>= plotEstProfile(sampleName='jeko250Knsp', jekoChr11Array250Knsp$Chromosome, jekoChr11Array250Knsp$PhysicalPosition, jekoChr11Array250Knsp$log2ratio, chrToBePlotted=11, results$estPC, maxProbeNumber=9000) @ \section{Example 2: use of function {\tt estGlobParam}} In general, even if we are not interested in the analysis of the whole genome, the global parameters should be estimated on the entire sample, using the function {\tt estGlobParam}. Here, we estimate the global parameters of sample REC-1 (in the following, the variance of the segment $\rho^2$ is estimated with $\hat{\rho}^2_1$). <>= data(rec10k) @ <>= estGlobParam(rec10k$log2ratio) @ \section{Example 3: use of function {\tt computeMBPCR}} If we are interested in estimating only a part of a chromosome or a simulated sample, we should not use the function {\tt estProfileWithMBPCR}, but use the function {\tt computeMBPCR} which estimates the profile directly. In the following example, we estimates the profile of a part of chromosome 11 of sample JEKO-1. <>= data(jekoChr11Array250Knsp) @ \noindent We select a part of chromosome 11. <>= y <- jekoChr11Array250Knsp$log2ratio[10600:11200] @ <>= p <- jekoChr11Array250Knsp$PhysicalPosition[10600:11200] @ \noindent We estimate the profile with mBPCR and BRC with $\hat{K}_2$, using the global parameters estimated on the whole genome. <>= results <- computeMBPCR(y, nu=-3.012772e-10, rhoSquare=0.0479, sigmaSquare=0.0699, regr='BRC') @ \noindent Finally, we plot the results. <>= plot(p,y) points(p, results$estPC, type='l', col='red') points(p, results$regrCurve, type='l', col='green') legend(x='bottomleft', legend=c('mBPCR', 'BRC with K_2'), lty=c(1, 1), col=c(4, 2)) @ \section{Example 4: {\tt importCNData}, an easy function to import data} There is also the possibility to easily import external data, by using {\tt importCNData}. The data should be in a tab delimited file and the data table should have at least four columns representing, respectively, the probe names, the chromosome to which each probe belongs, the physical positions of the probes inside the chromosome and the copy number data (an example of table can be found in Table \ref{tab_example}). The allowed names of the chromosomes are: {\tt X}, {\tt Y} and any integer from {\tt 1} to {\tt 22}). In the following example, we import data of sample REC-1. \begin{table} \centering \begin{small} \begin{tabular}{|c|c|c|c|} \hline % after \\: \hline or \cline{col1-col2} \cline{col3-col4} ... SNPname & Chromosome & PhysicalPosition & log2ratio\\ \hline SNP\_A-1509443 & 1 & 2882121 & -0.184424571\\ SNP\_A-1518557 & 1& 3985402& 0.097610797 \\ SNP\_A-1517286 & 1 &4804829& 0.443606651 \\ SNP\_A-1516024& 1 &4982250 &-1.089267338\\ SNP\_A-1514538 & 1& 5468765 &-0.862496476\\ SNP\_A-1516403& 1 & 5596686 &1.097610797\\ $\vdots$&$\vdots$&$\vdots$&$\vdots$\\ \hline \end{tabular} \end{small} \caption{Example of data table.}\label{tab_example} \end{table} As first step, we need to set a variable with the path of the file containing the data. To import our data, we set {\tt path} as the path of REC-1 data in the folder of package {\tt mBPCR}, <>= path <- system.file("extdata", "rec10k.txt", package = "mBPCR") @ \noindent Then, we use function {\tt importCNData}. The parameter {\tt NRowSkip} denotes how many rows there are before the table (notice that the name of the columns must be skipped). If the copy number data are not in log$_2$ratio scale, the parameter {\tt ifLogRatio} should be put as zero. <>= rec10k <- importCNData(path, NRowSkip=1) @ \noindent Now, the SNP name are in the variable {\tt SNPname}, the chromosomes of the probes are in {\tt chr}, the physical positions in {\tt position} and the raw log$_2$ratio data in {\tt logratio}. Here, we plot the raw data of chromosome 3. <>= plot(rec10k$position[rec10k$chr == 3], rec10k$logratio[rec10k$chr == 3], xlab='Chromosome 3', ylab='log2ratio') @ \section{Example 5: estimation of samples in an {\tt oligoSnpSet} object} In this example, we estimate the profile of chromosome 2 of a sample that is contained in an {\tt oligoSnpSet} object. We load the data that are contained in the package {\tt oligoClasses}. <>= data(oligoSetExample, package="oligoClasses") @ The object {\tt oligoSet} contains the data of a HapMap sample. Since the sample should not have many copy number changes, we use {\tt rhoSquare} equal to the one of the sample REC-1. Moreover, the data are not in log$_2$ratio scale, thus we set {\tt ifLogRatio=0}. <>= library(mBPCR) @ <>= r <-estProfileWithMBPCRforOligoSnpSet(oligoSet, sampleToBeAnalyzed=1, chrToBeAnalyzed=2, maxProbeNumber=1000, ifLogRatio=0, rhoSquare=0.0889637) @ After the estimation, we can plot the profile. <>= cc <- r$estPC @ <>= cc1 <- cc[chromosome(cc) == "2",1] @ <>= par(las=1) @ <>= plot(position(cc1), copyNumber(cc1), ylim=c(-0.23, 0.1), ylab="copy number", xlab="base position") @ \section{Suggestions} \noindent For an optimal use of mBPCR, especially in case of samples coming from patients, we suggest to take care to the following issues: \begin{itemize} \item even if the goal is to estimate the profile of only a part of the genome, the global parameters should be estimated on the whole genome; \item if the goal is to estimate the profile of one or more patients, it is better to estimate the variance of the segment levels ($\rho^2$) on a cell line, or on a sample with many aberrations, and use this value in the profile estimation of all patients. In fact, we need many aberrations to estimate well $\rho^2$. \end{itemize} \begin{thebibliography}{99} \bibitem{Hutter:07pcregx} M. Hutter. Exact Bayesian regression of piecewise constant functions. {\it {Bayesian Analysis}}, 2(4): 635--664, 2007. \bibitem{Hutter:07pcreg} M. Hutter. Bayesian Regression of Piecewise Constant Functions. In J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. David, D. Heckerman, A.F.M. Smith, and M. West, editors, {\it {Bayesian Statistics: Proceedings of the Eighth Valencia International Meeting}}. %pages? Universitat de Val\`{e}ncia and International Society for Bayesian Analysis, 2007. \bibitem{Rancoita:08a} P.M.V. Rancoita, M. Hutter, F. Bertoni, and I. Kwee. Bayesian DNA copy number analysis. {\it BMC Bioinformatics}, 10(10), 2009. \bibitem{Rinaldi:06a} A. Rinaldi, I. Kwee, M. Taborelli, C. Largo, S. Uccella, V. Martin, G. Poretti, G. Gaidano, G. Calabrese, G. Martinelli, {\it et al.}. Genomic and expression profiling identifies the B-cell associated tyrosine kinase Syk as a possible therapeutic target in mantle cell lymphoma. {\it {British Journal of Haematology}}, 132: 303--316, 2006. \end{thebibliography} \end{document}