%\VignetteIndexEntry{MicroArray Chromosome Analysis Tool}
%\VignetteDepends{graphics,Biobase,annotate}
%\VignetteKeywords{microarray chromosome differential gene expression}
%\VignettePackage{macat}

%%% check whether we are running pdflatex
\newif\ifmacatpdf
\ifx\pdfoutput\undefined
  \macatpdffalse % we are not running pdflatex
\else
  \pdfoutput=1 % we are running pdflatex
  \pdfcompresslevel=9 % compression level for text and image;
  \macatpdftrue
\fi

%%% choose global options depending on whether
%%% we are running under pdflatex
\ifmacatpdf
  \documentclass[11pt, a4paper, fleqn, pdftex]{article}
  \usepackage[pdftex]{graphicx, geometry, color}
  \geometry{headsep=3ex,hscale=.9}
  \usepackage[pdftex,linktocpage]{hyperref}
  \hypersetup{pdfstartview={FitH},
    pdfpagemode={UseOutlines},
    pdftitle={User's guide to MACAT},
    pdfauthor={Joern Toedling},
    pdfsubject={Bioconductor vignette},
  }
  \pdfpageheight= \paperheight
  \pdfpagewidth= \paperwidth
\else
  \documentclass[11pt, a4paper, fleqn, dvips]{article}
  \usepackage[dvips]{graphicx,geometry,color}
  \usepackage{hyperref}
\fi

\usepackage{amsmath,a4,t1enc}

\parindent0mm
\parskip2ex plus0.5ex minus0.3ex

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}

\addtolength{\textwidth}{1cm}
\addtolength{\oddsidemargin}{-0.5cm}
\addtolength{\evensidemargin}{-0.5cm}

\begin{document}
\SweaveOpts{eps=false}

\title{MACAT - MicroArray Chromosome Analysis Tool}
\author{Joern Toedling, Sebastian Schmeier, Matthias Heinig,\\
Benjamin Georgi, and Stefan Roepcke}
\date{}
\maketitle
\tableofcontents
\pagebreak

\section{Introduction}

This project aims at linking the term \emph{differential gene expression} to the chromosomal localization of genes.
\\
Microarray data analysts have defined tumor subtypes by specific gene expression profiles, consisting of genes that show differential expression between subtypes (see, e.g., \cite{Yeoh2002}). However, tumor subtypes have also been characterized by phenomena involving large chromosomal regions. For instance, Christiansen et al. \cite{Christiansen2004} report on a subtype of acute myeloid leukemia, showing mutations in the \emph{AML1} gene on chromosome 21 along with deletions or loss of chromosome arm 7q. A natural approach to bridging the gap between these two paradigms is to link scoring for differential gene expression to the chromosomal localization of genes. Tumor subtypes can then be defined by differential expression patterns affecting sizeable regions of certain chromosomes.\\
In the following, we propose a statistical approach for identifying significantly differentially expressed regions on the chromosomes based on a regularized t-statistic (see section \ref{sec:scoringttest}). We address the problem of interpolating the scores between genes by applying kernel functions (see section \ref{smoothing}). In order to evaluate the significance of scores, we conduct permutation experiments (section \ref{StatEval}). An integral part of the project is the visualization of the results in order to provide convenient access to the statistical analysis without requiring a profound mathematical background (section \ref{sec:viz}). \\
The package is implemented in the R statistical programming language (\emph{www.r-project.org}). It requires functionality provided by the Bioconductor project \cite{bioconductor}, which is a collection of R packages dealing with various aspects of the analysis of biological data, including normalization, accessing background information on genes, and visualization of data. \\
We apply our method to a publicly available data set of acute lymphoblastic leukemia (ALL), described in \cite{Yeoh2002}.
This data set consists of 327 tumor samples subdivided into 10 classes. In order to investigate chromosomal phenomena defining one tumor class, we consider the expression levels of that class versus all the other subclasses.

\section{Methods}

\subsection{Data Preprocessing}

We assume normalized expression data, which can be provided either as a matrix in R or in the form of a delimited text file. In the preprocessing step, the expression data is integrated with gene location data for the given microarray into one common data format. For this, \Rpackage{macat} provides the function \Rfunction{preprocessedLoader} and its convenience wrapper \Rfunction{buildMACAT}, which employ various Bioconductor \cite{bioconductor} functions.\\
The resulting data format is a list containing:
\begin{itemize}
\item Gene identifier
\item Gene location (chromosome, strand, coordinate)
\item Sample labels, denoting for instance tumor (sub)types
\item Expression levels as a matrix
\item An identifier for the type of microarray used in the experiments
\end{itemize}
Currently \Rpackage{macat} is limited to commercial Affymetrix$^\textup{\textregistered}$ oligonucleotide microarrays.

\subsection{Scoring of Differential Gene Expression}
\label{sec:scoringttest}

For each gene we compute a statistic denoting the degree of differential expression between two given groups of samples. In the context of the leukemia data set, the first group consists of one tumor class and the second group of the remaining nine classes. The statistic is the regularized t-score introduced in \cite{Tusher2001}. This so-called \emph{relative difference} is defined as
$$ d(i) = \frac{ \overline{x}_A(i) - \overline{x}_B(i) } { s(i) + s_0 } $$
where $\overline{x}_A(i)$ and $\overline{x}_B(i)$ are the mean expression levels of gene $i$ in group $A$ and $B$ respectively, $s(i)$ is the pooled standard deviation of the expression values of gene $i$, and $s_0$ is constant for all genes.
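
As a purely illustrative calculation (the numbers are invented and not taken from the leukemia data), consider two genes and assume $s_0 = 0.2$. For a gene with $\overline{x}_A = 8.2$, $\overline{x}_B = 7.0$ and $s = 0.4$,
$$ d = \frac{8.2 - 7.0}{0.4 + 0.2} = 2.0 , $$
whereas for a gene with $\overline{x}_A = 7.12$, $\overline{x}_B = 7.00$ and $s = 0.04$,
$$ d = \frac{7.12 - 7.00}{0.04 + 0.2} = 0.5 . $$
Without the constant $s_0$, both ratios would equal $3.0$, although the second gene differs by only $0.12$ between the two groups.
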
In essence, $d(i)$ is Student's t-statistic augmented by a fudge factor $s_0$ in the denominator, which is introduced to prevent a high statistic for genes with a very low standard deviation $s(i)$. We set $s_0$ to the median over all gene standard deviations $s(i)$, analogous to \cite{Tibshirani2002}. \\
Since the null-distribution of these regularized t-scores differs from a known t-distribution, we have to approximate it by random permutations (for details see section \ref{StatEval}).\\
However, \Rpackage{macat} also allows computing Student's classical t-statistic for each gene and assessing significance based on the underlying t-distribution.

\subsection{Smoothing Kernels}
\label{smoothing}

On each chromosome only a limited number of genes can be measured with a microarray. The distance in base pairs between measured genes also varies greatly. Since we want to compute differential expression statistics for larger chromosomal regions, we need a method to interpolate scores between measured values. This interpolation, however, does not aim to assign statistics to non-coding regions, but to provide a smooth estimate of differential expression over large chromosomal regions.\\
Formally this can be seen as a regression problem, $y = f(x)$, with $y$ being the t-score and $x$ being one base pair position on the chromosome. (In the following, $x$ will often be called \emph{chromosomal coordinate}.) Score $y_j$ is to be estimated for a large number of coordinates $x_j$, when $f(x)$ is known for only a few observations $x_i$. Kernel methods achieve \emph{smooth} estimates of the function $f(x)$ by fitting different models at each query point $x_0$ using only observations close to $x_0$ \cite{HastieElementsChapter6}.\\
Three different kernel approaches will be discussed here:
\label{kernel}
\begin{itemize}
\item k-Nearest Neighbor: For every chromosomal coordinate compute the average of the k nearest genes.
\item Radial basis function (rbf): For every chromosomal coordinate compute the average over all genes weighted by distance as explained in detail below.
\item Base-Pair-Distance Kernel: Similar to the k-Nearest Neighbor kernel, but the average is taken over all genes within a certain radius of the position for which the value is to be determined.
\end{itemize}
\label{math}
The kernel approaches presented above can be expressed as weighted sums of scores. Thus, we use a matrix multiplication as the basic framework for the computations, where we multiply our score matrix with the so-called kernel matrix. Let $E$ be the score matrix, with rows corresponding to $n$ genes and columns corresponding to $m$ samples. For each gene $g$ the chromosomal coordinate $geneLocation_g$ is known. Further, let $K$ be the kernel matrix. Let us assume that we want to interpolate the scores at $s$ genomic locations that are stored in a vector we call $\vec{s}$. So the kernel matrix has the dimensions ($n \times s$). One entry $K(g, l)$ represents the weight that the gene $g$ has at the location $l$. So the product
$$ E^TK = S $$
gives the smoothed matrix $S$ of dimension ($m \times s$), where one entry $S(i, l)$ represents the interpolation of the score of sample $i$ at location $\vec{s}_{l}$. The weights in $K$ depend only on the locations of the genes relative to the positions for which interpolated values are to be computed.\\
For the three kernels listed above the kernel functions are:
\begin{itemize}
\item k-nearest neighbors
\[ K_{kNN}(k, g, l) = \left\{
  \begin{array}{ll}
    1 & \mbox{if $g$ is one of the $k$ nearest neighbors of $l$} \\
    0 & \mbox{else} \\
  \end{array} \right. \]
\item radial basis function
\[ K_{rbf}(\gamma, g, l) = \exp (-\gamma \cdot \| geneLocation_{g} - \vec{s}_{l} \|^2)
\]
\item base-pair distance
\[ K_{bpd}(d, g, l) = \left\{
  \begin{array}{ll}
    1 & \mbox{if $geneLocation_{g}$ is within radius $d$ of $l$} \\
    0 & \mbox{else}
  \end{array} \right. \]
\end{itemize}
\label{params}
The free parameters ($k,~\gamma,~d$) of the kernel functions determine the degree of smoothing. Take for instance the kNN kernel: the smaller $k$ gets, the fewer genes are averaged, and in the extreme case ($k = 1$) each interpolated value is simply the value of the nearest gene. In the case of the base-pair distance kernel with a very small distance $d$, you will see spikes at the locations where genes are located, and zero elsewhere. For very large distances the smoothing will remove all spikes and the value for each position is roughly the same. \\
More formally, kernel parameters define the width of the window over which values are used to estimate the regression function. Changing the window width involves a \emph{bias-variance tradeoff}. If the window around $x_0$ is narrow, only a small number of observations will be used to estimate $f(x_0)$. In this case the estimate will have a relatively large variance but only a small bias. However, if the window is wide, a large number of observations will be used for the estimate. This estimate tends to show a low variance but comes with a high bias, since now observations $x_i$ far from $x_0$ are used as well and we still assume that $f(x_i)$ is close to $f(x_0)$ \cite{HastieElementsChapter6}.\\
This shows that a good choice of the kernel parameters is very important. For the three kernel functions presented above, we fit the parameters from the data:
\begin{itemize}
\item kNN: The number of genes on the chromosome is determined and $k$ is set to cover approximately 10\% of the genes.
\item rbf: The width of the window is controlled by the parameter $\gamma$. We require that for each position $l$, for which we interpolate the score, there should be at least two genes within the window for averaging.
If both of these genes are located equally far from $l$, both will receive a weight of 1/2, thus resulting in a simple average over these two genes' scores. To allow for these weights, the maximal gene distance ($max$) between any two neighboring genes on the chromosome has to be determined. Then $\gamma$ can be computed as:
$$\gamma = \frac{\ln{2}}{(max/2)^2}$$
\item base-pair distance: as with the rbf kernel, we require that the average is computed over at least two genes. So the parameter $d$, which describes the radius within which genes are used to compute the average, is set to $max$, the maximal distance between any two neighboring genes on the chromosome.
\end{itemize}
These biologically motivated heuristics for choosing the kernel parameters provide good smoothing results on test data without any overfitting of estimated scores to the observed values.\\
Nevertheless, we recommend determining the optimal parameter settings by cross-validation \cite{HastieElementsChapter7}, for which \Rpackage{macat} provides the function \Rfunction{evaluateParameters}. Here, one part of the genes is left out, the interpolation is computed based on the remaining genes only, and the quality of the interpolation is assessed by the residual sum of squares between the scores of the left-out genes and their interpolated values. By default, the above-mentioned biologically motivated heuristics provide the starting point for a grid search over multiples of these parameter settings. The optimal parameter setting is the one with the minimal residual sum of squares. For instance, Figure \ref{fig:evalparam} depicts the result of an evaluation of the parameter $k$ of the $k$-nearest-neighbor kernel.
From this evaluation, the optimal $k$ seems to be approximately 30.\\
\begin{figure}[htbp]
  \begin{center}
    \includegraphics[width = 0.9 \columnwidth]{evalkNN6}
    \caption{Result plot for evaluating kNN parameter settings}
    \label{fig:evalparam}
  \end{center}
\end{figure}

\subsection{Statistical Evaluation}
\label{StatEval}

To judge the significance of differential gene expression, we propose to investigate random permutations of the class labels. To obtain a reliable simulation of the empirical distribution, we suggest choosing $B \geq 1000$ permutations, preferably more. For each of these permutations the (regularized) t-statistic is recomputed for each gene. Thus, for each gene we obtain $B$ permutation statistics and consequently an \emph{empirical p-value}, denoting the proportion of the $B$ permutation statistics that are greater than or equal to the observed statistic, which is based on the true class labels. The permutation statistics also provide upper and lower significance borders, which are smoothed using the same kernel function as for the original statistics.

Optionally, to judge the significance of differential expression over whole chromosomal regions, one could instead investigate permutations of the ordering of genes on chromosomes. For each of these permutations the smoothing of gene-specific scores is recalculated. This way, one obtains a null-distribution for scores over chromosomal regions. Given this null-distribution, it is possible to define confidence intervals for scores from random sample groupings and assess the significance of scores observed in a relevant sample grouping.\\
However, the standard procedure in \Rpackage{macat} is to permute the class labels, which is considerably faster and yields results that are easier to validate and interpret statistically.\\
This approach is implemented in the \Rpackage{macat} function \Rfunction{evalScoring}, which can be seen as the core function of \Rpackage{macat}.
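
The empirical p-value described above can be written compactly. The notation $d^{(b)}(i)$ for the statistic of gene $i$ under the $b$-th label permutation is introduced here for illustration only, and the formula follows the one-sided wording of the text rather than any particular implementation:
$$ p(i) = \frac{ \#\{\, b \in \{1,\dots,B\} : d^{(b)}(i) \geq d(i) \,\} }{ B } $$
As an invented numerical example, if $B = 1000$ and $12$ of the permutation statistics of a gene are at least as large as its observed statistic, then $p(i) = 12/1000 = 0.012$.
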
See appendix \ref{ExampleSession} for an exemplary use of \Rfunction{evalScoring} and its arguments.\\
One important issue is that, when analyzing many chromosomes and classes, one might obtain statistically significant results by chance (\emph{multiple-testing problem}). In the given setting, classical procedures correcting for multiple testing, such as the \emph{Bonferroni} correction, cannot be applied. We advise the user to be aware of the problem and to validate results by alternative methods.

\subsection{Visualization}
\label{sec:viz}

In order to facilitate a better understanding of both the data and the statistical analysis, it is helpful to employ meaningful and concise visualizations. \Rpackage{macat} includes plotting functionality for several questions of interest.
\begin{itemize}
\item Plotting raw and kernelized expression levels versus coordinates of genes on one chromosome.
\item Plotting raw and kernelized statistical scores versus coordinates of genes on one chromosome.
\item Emphasizing interesting chromosomal regions and listing the relevant genes.
\end{itemize}
Chromosomal regions showing significant differential expression can be visualized by plotting the result of the \Rfunction{evalScoring} function (see figure \ref{fig:exampleplot}). The plot shows, along one chromosome, the scores for genes (black dots), the sliding average of the 0.025 and 0.975 quantiles of the permuted scores (grey lines), the sliding average of the permuted scores (red line), and highlighted regions (yellow dotted) where the score exceeds the quantiles. The yellow regions are deemed interesting, showing significant over- or under-expression according to the underlying statistic. The plot region ranges from zero to the length of the respective chromosome.
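
One way to read the highlighting rule, using symbols that are introduced here only for illustration, is the following. Let $\tilde{d}(l)$ denote the smoothed observed score at chromosomal coordinate $l$, and let $\tilde{q}_{0.025}(l)$ and $\tilde{q}_{0.975}(l)$ denote the smoothed lower and upper quantiles of the permuted scores. Under this reading, a coordinate $l$ belongs to a highlighted region if
$$ \tilde{d}(l) > \tilde{q}_{0.975}(l) \quad \mbox{or} \quad \tilde{d}(l) < \tilde{q}_{0.025}(l) , $$
where the first case corresponds to over-expression and the second to under-expression.
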
\begin{figure}[htbp]
  \begin{center}
    \includegraphics[width = 0.9 \columnwidth]{chrom6TkNN}
    \caption{Example plot of plot.MACATevalScoring}
    \label{fig:exampleplot}
  \end{center}
\end{figure}
One can generate an HTML page (see figure \ref{fig:html}) on the fly by setting the argument \Rfunarg{output} in the function \Rfunction{plot.MACATevalScoring} to ``html''. The generated HTML page provides information about genes located within the highlighted chromosomal regions. For each gene, some annotation, a clickable LocusID linked to the NCBI web site, and the empirical p-value are provided.
\begin{figure}[htbp]
  \begin{center}
    %\epsfig{file=chrom6T.eps, width = 0.9 \columnwidth}
    \includegraphics[width = 0.9 \columnwidth]{chrom6T}
    \caption{Example for a generated HTML-page}
    \label{fig:html}
  \end{center}
\end{figure}

\section{Results}

This section describes the results we obtain from an exemplary analysis of ``T- vs. B-lymphocyte ALL''. \\
First the regularized t-score \cite{Tibshirani2002} is computed as described in section \ref{sec:scoringttest}. To include information about the distances between the genes, we used the radial basis kernel for smoothing with the default parameters (see section \ref{smoothing}). We have investigated regions of chromosome six for significant differential expression. Figure \ref{fig:html} shows that there is a region on the p-arm of chromosome six that is significantly under-expressed. The genes within that region include the well-known MHC class II genes, which are known to be expressed by B-lymphocytes, but not by T-lymphocytes. Other genes in this region remain to be investigated in more detail.\\
Apart from this region, the data reveal no other significant differences in gene expression between T- and B-lymphocyte ALL on chromosome six.
In table \ref{examplegenes}, we show a list of genes found to be located in significantly differentially expressed regions on different chromosomes when analyzing different subtypes of leukemia versus all other subtypes.
\begin{table}[htbp]
\begin{center}
\begin{tabular}{llllp{7.5cm}}
\textbf{Class} & \textbf{Chrom} & \textbf{Cytoband} & \textbf{LocusID} & \textbf{OMIM Annotation} \\
MLL & 8 & 8q24.12 & 4609 & ``Alitalo et al. \cite{Alitalo} found that the MYC gene, which is involved by \emph{translocation} in the generation of \emph{Burkitt lymphoma}, is amplified, resulting in homogeneously staining chromosomal regions (HSRs) in a human neuroendocrine tumor cell line derived from a \emph{colon cancer}. The HSR resided on a distorted X chromosome; amplification of MYC had been accompanied by translocation of the gene from its normal position on 8q24.''\\
MLL & 9 & 9p22.3 & 6595 & ``Mammalian SWI/SNF complexes are ATP-dependent chromatin remodeling enzymes that have been implicated in the regulation of gene expression, cell cycle control, and \emph{oncogenesis}.''\\
MLL & 9 & 9p24.3 & 23189 & ``The suggested role for this protein is in tumorigenesis of renal cell \emph{carcinoma}.''\\
MLL & 11 & 11p13 & 7490 & ``Mutations in this gene can be associated with the development of \emph{Wilms tumors} in the kidney or with abnormalities of the genito-urinary tract.'' \\
E2A & 1 & 1q23 & 5087 & ``Wiemels et al. \cite{wiemels} sequenced the genomic fusion between the PBX1 and E2A genes in 22 pre-B acute lymphoblastic leukemias and 2 cell lines. Kamps et al. \cite{kamps} discussed the chimeric genes created by the human t(1;19) translocation in pre-B-cell acute lymphoblastic leukemias. The authors cloned 2 different E2A-PBX1 fusion transcripts and showed that NIH-3T3 cells transfected with cDNAs encoding the fusion proteins were able to cause malignant tumors in nude mice.''\\
E2A & 9 & 9p24.3 & 23189 & ``By RT-PCR and Western blot analysis, Sarkar et al.
\cite{sarkar} demonstrated that KANK expression was suppressed in most renal tumors and in kidney tumor cell lines due to methylation at CpG sites in the gene.''
\end{tabular}
\caption{Example genes with OMIM annotation \cite{omim} detected by MACAT to be located in differentially expressed regions. `Class' denotes the analyzed tumor subtype.}
\label{examplegenes}
\end{center}
\end{table}

\section{Discussion and Outlook}

As described in the previous section, by applying our method we detected a chromosomal region as significantly differentially expressed, which is in agreement with biological knowledge of the two sample classes involved. This indicates that the chromosomal regions annotated as significant by the method are indeed biologically meaningful. Many of the genes found by MACAT (see table \ref{examplegenes}) are known oncogenes or are at least associated with oncogenesis. This fits well with our expectation, since it appears reasonable that different tumor subtypes can be characterized on the molecular level by different expression levels for genes relevant to oncogenesis.

One point of interest for future research would be the application of \Rpackage{macat} to other publicly available data sets and contrasting the results with relevant biological expert knowledge to evaluate performance. This way, one could get a clearer impression of the extent of possible applications. Another point would be the exploration of different approaches for obtaining relative gene expression values for groups of samples. For instance, one could use another data set as a reference for samples with ``normal'' gene expression levels (``normal'' denoting samples from patients without a tumor). Of course, due to the noisy nature of microarray data, this approach would have to overcome some structural and technical difficulties to make the measured expression levels comparable.
Given that a suitable data set could be found and the compatibility issues resolved, one could examine the characteristic patterns of chromosomal phenomena in tumor subclasses as opposed to normal tissue.

The method we have described can detect significant differential expression for chromosomal regions. However, the reason for the differential expression, be it a mutation, translocation, hypermethylation, loss of heterozygosity, or another event, remains to be investigated. Thus, results obtained by our method should be verified by suitable experiments. One could think of a customized cDNA or oligo chip that contains all known genes, including fusion genes stemming from translocation events, in the regions that were found to be differentially expressed. To validate the borders of the regions found, it would be useful to also incorporate genes that neighbor these regions. Other possibly useful experiments include real-time PCR to measure the amount of mRNA for one (or a few) specific genes, comparing the results to the measured expression levels. The nature of the chromosomal event leading to the differential expression can be investigated by methylation array experiments or by fluorescence in situ hybridization.

Nevertheless, we have shown that our method provides a solid baseline for gene-wise experiments.

\pagebreak
\appendix

\section{Example Session}
\label{ExampleSession}

\subsection{The Beginning}

In this section, we show an example \Rpackage{macat} session. First of all, one has to load the library and the data set provided in the data package \Rpackage{stjudem}.
<<>>=
library(macat)
loaddatapkg("stjudem")
data(stjude)
@
We now have a data object called \Robject{stjude}.

\subsection{First Investigation}

Let us have a closer look at the data. The data is already in the right format. It has been pre-processed by the function \Rfunction{preprocessedLoader}.
<<>>=
summary(stjude)
@
We can, for example, access the first 10 gene names by typing
<<>>=
stjude$geneName[1:10]
@
The different labels in the data set can be accessed by
<<>>=
unique(stjude$labels)
table(stjude$labels)
@
There are ten different classes of tumor patients. The next question is how many probe sets are mapped to chromosome 1.
<<>>=
sum(stjude$chromosome==1)
@
Now for some visualization. We want to plot the sliding average of the expression values from sample 3 along chromosome 6 with the default rbf-kernel (for details see section \ref{sec:viz}).
<<eval=FALSE>>=
plotSliding(stjude, 6, sample=3, kernel=rbf)
@
\begin{figure}[htbp]
  \begin{center}
%<>=
%print(plotSliding(stjude, chrom=6, sample=3, kernel=rbf))
%@
    \includegraphics[width = 0.9 \columnwidth]{Slidingchrom6s3}
    \caption{Sliding Average of the expression values of chromosome 6 with rbf-kernel.}
    \label{fig:sliding}
  \end{center}
\end{figure}
See the result in figure \ref{fig:sliding}.

\subsection{Deeper Investigation}

Next we investigate the data for chromosomal regions showing differential expression. Again, we search chromosome 6 for some specific differences between T-lymphocyte ALL (class ``T'') and B-lymphocyte ALL (all other classes). See section \ref{sec:scoringttest} (Scoring of Differential Gene Expression) for details on the score. \\
We want to use a kNN-kernel for interpolation between the scores. First, we determine the optimal parameter settings, using the function \Rfunction{evaluateParameters}.
<<eval=FALSE>>=
evalkNN6 = evaluateParameters(stjude, class="T", chromosome=6, kernel=kNN,
                              paramMultipliers=c(0.01,seq(0.1,2.0,0.1),2.5))
@
Let's have a look at the result:
\begin{verbatim}
> evalkNN6$best
$k
[1] 30

> plot(evalkNN6)
\end{verbatim}
The result should look similar to Figure \ref{fig:evalparam}. According to this result, the optimal parameter setting for the kNN kernel with our data seems to be $k=30$. We use this $k$ in the search for significantly differentially expressed chromosome regions.
The function \Rfunction{evalScoring} is used to build a \Rclass{MACATevalScoring} object. This may take some time due to the number of permutations.
\begin{verbatim}
> e1 = evalScoring(stjude, class = "T", chromosome = 6, nperms = 1000,
                   kernel = kNN, kernelparams = evalkNN6$best,
                   cross.validate= FALSE)
Investigating 43 samples of class T ...
Compute observed test statistics...
Building permutation matrix...
Compute 1000 permutation test statistics...
250 ...500 ...750 ...1000 ...
Compute empirical p-values...
Compute quantiles of empirical distributions...
Computing sliding values for scores...
Compute sliding values for permutations...
All done.
\end{verbatim}

\subsection{Collecting Results}

Next, we use the plot function \Rfunction{plot.MACATevalScoring} to generate an HTML page. To do so, we set the parameter \Rfunarg{output} to ``html'' (see section \ref{sec:viz}).
\begin{verbatim}
> plot(e1, output="html")
\end{verbatim}
See the result in your web browser; it should look similar to Figure \ref{fig:html}. Some annotation of the genes that lie in the highlighted regions is provided.\\[0.3cm]
This ends our short example session. Have fun using \Rpackage{macat}!

\newpage

\section*{Acknowledgments}

The authors would like to thank Stefanie Scheid and Florian Markowetz for helpful discussions and proof-reading, Claudio Lottaz for countless hints on building an R package, and Alexander Schliep, Rainer Spang, Eike Staub, and Martin Vingron for providing the knowledge and motivation for this project. Our special thanks go to two reviewers, whose comments have led to great improvements of \Rpackage{macat}.

\begin{thebibliography}{00}

\bibitem{Yeoh2002}
E.J.~Yeoh et~al.
\newblock Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling.
\newblock {\em Cancer Cell}, 1(2):133--143, March 2002.

\bibitem{Christiansen2004}
D.~H. Christiansen, M.~K. Andersen, and J. Pedersen-Bjergaard.
\newblock Mutations of {\em AML1} are common in therapy-related myelodysplasia following therapy with alkylating agents and are significantly associated with deletion or loss of chromosome arm 7q and with subsequent leukemic transformation.
\newblock {\em Blood}, 104(5):1474--1481, 2004.

\bibitem{bioconductor}
Robert~C. Gentleman, Vincent~J. Carey, Douglas~M. Bates, Benjamin~M. Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, Kurt Hornik, Torsten Hothorn, Wolfgang Huber, Stefano Iacus, Rafael Irizarry, Friedrich Leisch, Cheng Li, Martin Maechler, Anthony~J. Rossini, Guenther Sawitzki, Colin Smith, Gordon~K. Smyth, Luke Tierney, Yee~Hwa Yang, and Jianhua Zhang.
\newblock Bioconductor: Open software development for computational biology and bioinformatics.
\newblock {\em Genome Biology}, 5:\penalty0 R80, 2004.

\bibitem{Tusher2001}
V.G. Tusher, R.~Tibshirani, and G.~Chu.
\newblock Significance analysis of microarrays applied to ionizing radiation response.
\newblock {\em Proc. Nat. Acad. Sci.}, 98(9):5116--5121, April 2001.

\bibitem{Tibshirani2002}
R.~Tibshirani et~al.
\newblock Diagnosis of multiple cancer types by shrunken centroids of gene expression.
\newblock {\em Proc. Nat. Acad. Sci.}, 99(10):6567--6572, 2002.

\bibitem{HastieElementsChapter6}
T.~Hastie, R.~Tibshirani, and J.~Friedman.
\newblock {\em The Elements of Statistical Learning. Chapter 6}.
\newblock Springer Verlag, New York, 2001.

\bibitem{HastieElementsChapter7}
T.~Hastie, R.~Tibshirani, and J.~Friedman.
\newblock {\em The Elements of Statistical Learning. Chapter 7}.
\newblock Springer Verlag, New York, 2001.

\bibitem{Alitalo}
K.~Alitalo, M.~Schwab, C.~Lin, H.E. Varmus, and J.M. Bishop.
\newblock Homogeneously staining chromosomal regions contain amplified copies of an abundantly expressed cellular oncogene (c-myc) in malignant neuroendocrine cells from a human colon carcinoma.
\newblock {\em Proc. Nat. Acad. Sci.}, 80:1707--1711, 1983.

\bibitem{wiemels}
J.~L. Wiemels, B.~C. Leonard, Y.~Wang, M.~R. Segal, S.~P. Hunger, M.~T. Smith, V.~Crouse, X.~Ma, P.~A. Buffler, and S.~R. Pine.
\newblock Site-specific translocation and evidence of postnatal origin of the t(1;19) {E2A-PBX1} fusion in childhood acute lymphoblastic leukemia.
\newblock {\em Proc. Nat. Acad. Sci.}, 99:15101--15106, 2002.

\bibitem{kamps}
M.~P. Kamps, A.~T. Look, and D.~Baltimore.
\newblock The human t(1;19) translocation in pre-{B} {ALL} produces multiple nuclear {E2A-Pbx1} fusion proteins with differing transforming potentials.
\newblock {\em Genes Dev.}, 5:358--368, 1991.

\bibitem{sarkar}
S.~Sarkar, B.~C. Roy, N.~Hatano, T.~Aoyagi, K.~Gohji, and R.~Kiyama.
\newblock A novel ankyrin repeat-containing gene ({Kank}) located at 9p24 is a growth suppressor of renal cell carcinoma.
\newblock {\em J. Biol. Chem.}, 277:36585--36591, 2002.

\bibitem{omim}
V.A. McKusick.
\newblock {\em Mendelian Inheritance in Man. A Catalog of Human Genes and Genetic Disorders}.
\newblock Johns Hopkins University Press, Baltimore, 12th edition, 1998.

\end{thebibliography}

\end{document}