%\VignetteIndexEntry{CMA Manual}
%\VignetteKeywords{Classification}
%\VignetteDepends{CMA}
%\VignettePackage{CMA}
\input{preamble.tex}
\title{\Huge{\texttt{CMA}} package vignette}
\author{Martin Slawski \footnote{\url{ms@cs.uni-sb.de}}\\
Anne-Laure Boulesteix \footnote{\url{www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/}}}
\begin{document}
\date{ \normalsize{Department of Computer Science, Saarland University, D - 66041 Saarbruecken, Germany} \\
\normalsize{Department of Medical Informatics, Biometry and Epidemiology (IBE), University of Munich, D-81377 Munich, Germany}}
\maketitle
<<>>=
library(CMA)
@
\section{Statistical background}
\label{sec:0}
For the last few years, microarray-based class prediction has been a major topic in statistics and machine learning. Traditional methods often yield unsatisfactory results or are even inapplicable in the $p \gg n$ setting. Hence, microarray studies have stimulated the development of new approaches and motivated the adaptation of known traditional methods to high-dimensional data. Moreover, model selection and evaluation of prediction rules prove to be highly difficult in this situation for several reasons. Firstly, the hazard of overfitting, which is common to all prediction problems, is increased by high dimensionality. Secondly, the usual evaluation scheme based on the splitting into learning and test data sets often applies only partially in the case of small samples. Lastly, modern classification techniques rely on the proper choice of hyperparameters, whose optimization is highly computer-intensive, especially in the case of high-dimensional data.

\section{Class prediction based on high-dimensional data with small samples}\label{sec:1}
\subsection{Settings}
The classification problem can be briefly outlined as follows:
\begin{itemize}
\item we have a predictor space $\mc{X}$, here $\mc{X} \subseteq \R^p$,
\item we have a finite set of class labels $\mc{Y} = \{0,\ldots,K-1\}$, with $K$ denoting the total number of classes,
\item $P(\bm{x},y)$ denotes the joint probability distribution on $\mc{X} \times \mc{Y}$,
\item we are given a finite sample $S = \{(\bm{x}_1,y_1),\ldots,(\bm{x}_n,y_n) \}$ of $n$ predictor-class pairs.
\end{itemize}
%%% correct
The task is to find a decision function
\begin{align*}
\begin{split}
\widehat{f}: \; \mc{X} \; & \rightarrow \; \mc{Y} \\
\bm{x} \; & \mapsto \; \widehat{f}(\bm{x})
\end{split}
\end{align*}
(the $\widehat{\cdot}$ indicates estimation from the given sample $S$) such that the \emph{generalization error}
\begin{equation}\label{riskfunctional}
R[f] = \E_{P(\bm{x},y)}[L(y, f(\bm{x}))] = \int_{\mc{X} \times \mc{Y}} L(y, f(\bm{x})) \; d P(\bm{x},y)
\end{equation}
is minimized, where $L(\cdot, \cdot)$ is a suitable loss function, usually taken to be the indicator loss ($1$ if $f(\bm{x}) \neq y$, $0$ otherwise).
%Other loss functions and performances measures are discussed extensively in
%section \ref{sec:2}.\\
\subsection{Estimation of prediction accuracy}
As we are only equipped with a finite sample $S$ and the underlying distribution is unknown, approximations to (\ref{riskfunctional}) have to be found.
Its empirical counterpart
\begin{equation}\label{erisk}
R[f]_{\text{emp}} = n^{-1} \su L(y_i, f(\bm{x}_i))
\end{equation}
has a (usually large) negative bias for (\ref{riskfunctional}); model selection based on (\ref{erisk}) therefore leads to overfitting the sample $S$.\\
A first improvement is to split $S$ into two parts, $\mc{L}$ (learning sample) and $\mc{T}$ (test sample), with the intention of separating model selection from model evaluation: model selection is performed using $\mc{L}$ only, and the resulting decision function $f(\cdot)$ is evaluated on $\mc{T}$ only.
For microarray data, the sample size $n$ is usually very small, leading to serious problems both when estimating prediction accuracy and when constructing a prediction rule based on the available data, a problem which is also related to model choice (see Section \ref{subsec:pred}). When the original data are split into two approximately equally sized data sets (learning set and test set), the performance of $f(\cdot)$ is strongly diminished due to the further reduction of the sample size, and the evaluation is unreliable and highly variable. Hence, alternative designs are needed. In the package \texttt{CMA}, we pursue the following route, based on resampling and aggregation, to mitigate at least the second consequence:
\begin{itemize}
\item Generate $B$ splits $(\mc{L}_b, \mc{T}_b), \; b=1,\ldots,B,$ of $S$ into learning and test samples.
\item Obtain $\widehat{f}_b$ from $\mc{L}_b, \; b=1,\ldots,B$.
\item Setting $\mc{I}_b = \{i: \; (\bm{x}_i, y_i) \notin \mc{L}_b \}$, use
\begin{equation}\label{errorglobal}
\wh{\varepsilon} = \frac{1}{B} \sum_{b=1}^B \frac{1}{|\mathcal{I}_b|} \sum_{i \in \mathcal{I}_b} I(y_i \neq \widehat{f}_b(\bm{x}_i))
\end{equation}
as estimator for (\ref{riskfunctional}).
\end{itemize}
The strategy is to reduce the variance of the error estimator by averaging.\\
If $n$ is large, this procedure will not improve much on a simple split.\\
As splitting rules, the function \texttt{GenerateLearningsets} implements:
\begin{itemize}
\item \textbf{Leave-one-out cross-validation} (\texttt{method = "LOOCV"}):\\
$\mc{T}_b$ consists of a single observation; this is repeated for each observation in $S$, so that $B = n$.
\item $k$-\textbf{fold cross-validation} (\texttt{method = "CV", fold = , niter = }):\\
$S$ is split into $k$ parts of equal size. For each iteration $b$, the $b$-th part is used as $\mc{T}_b$ and the union of the remaining parts as $\mc{L}_b$. Setting \texttt{fold = n} is equivalent to \texttt{method = "LOOCV"}. As the splitting is not uniquely determined for \texttt{fold} $< n$, the whole procedure can be repeated \texttt{niter} times.
\item \textbf{Monte-Carlo cross-validation} (\texttt{method = "MCCV", fold = , ntrain = , niter = }):\\
$B$ = \texttt{niter} random learning samples of cardinality \texttt{ntrain} are generated.
\item \textbf{Bootstrap} (\texttt{method = "bootstrap", ntrain = , niter = }):\\
$B$ = \texttt{niter} bootstrap samples of cardinality \texttt{ntrain} are used as learning samples.
\end{itemize}
Furthermore, \emph{stratified sampling} is possible by setting the argument \texttt{strat = TRUE}. This means that in each $\mc{L}_b$ the proportions of the classes $\{0,\ldots,K-1 \}$ are the same as in the full sample $S$.
This option is very useful (and sometimes even necessary) to guarantee that each class is represented sufficiently often in each $\mc{L}_b$, in particular if some classes are small.\\
Benefits and drawbacks of the above splitting rules are discussed in \citet{BragaNeto} and \citet{Mol2005}.\\
\subsection{Constructing a prediction rule}\label{subsec:pred}
The second main issue is the construction of an appropriate prediction rule. In microarray data analysis, we have to deal with the $n\ll p$ situation, i.e.\ the number of predictors far exceeds the number of observations. Some class prediction methods only work in the classical $n \gg p$ case, e.g.\ linear or quadratic discriminant analysis, which are based on the inversion of a $p\times p$ matrix of rank at most $n-1$. In the $n\ll p$ setting, the hazard of overfitting is especially acute: perfect separation of the classes for a given sample based on a high number of predictors is always possible. However, the resulting classification rule may generalize poorly on independent test data. There are basically three approaches to cope with the $n\ll p$ setting:
\begin{enumerate}
\item variable selection using, e.g., univariate statistical tests,
\item regularization or shrinkage methods, such as the Support Vector Machine \citep{Boser1992}, $\ell_2$- or $\ell_1$-penalized logistic regression (\citealp{Zhu2004}; \citealp{glmpath}) or Boosting (\citealp{Friedman2001}; \citealp{Buhlmann2003}), some of which also perform variable selection,
\item dimension reduction or feature extraction, most prominently the Partial Least Squares method \citep{Bou2007a}.
\end{enumerate}
Most classification methods depend on a vector of hyperparameters $\bm{\lambda}$ that has to be chosen appropriately. Together with variable selection, this is part of model selection and thus has to be performed separately for each learning sample $\mc{L}_b$ to avoid bias. An optimal vector of hyperparameters $\bm{\lambda}^{\text{opt}}$ is determined by defining a discrete set of candidate values whose performance is then measured by cross-validation. This involves a further splitting step, which is sometimes called 'inner loop' or 'nested' cross-validation \citep{Sta2005a}. More precisely, each learning set $\mc{L}_b$ is divided into $k$ parts, and the $l$-th part, $l=1,\ldots,k$, forms the test sample for the $l$-th (inner) iteration. Analogously to (\ref{errorglobal}), this procedure can be used to derive an error estimator for each value on the grid of candidate hyperparameter values. The optimal vector $\bm{\lambda}^{\text{opt}}$ is chosen to minimize this error criterion. Note that there are often several such minimizers. Furthermore, the minimizer found by this procedure can be relatively far away from the true minimizer, depending on how finely the discrete grid has been chosen.\\
The choice of the inner cross-validation scheme is difficult. With a high $k$, computation times soon become prohibitively long. With a low $k$, the size of the inner learning sets $\mc{L}_{b_l}, \; l=1,\ldots,k,$ is strongly reduced compared to the complete sample $S$, which may have an impact on the derived optimal parameter values. Nevertheless, nested cross-validation is commonly used in this context \citep{review1}.\\
Considering the computational effort required for hyperparameter optimization and the small sample sizes, one may prefer class prediction methods that do not depend on many hyperparameters and/or behave robustly against changes of the hyperparameter values.
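
To make the resampling-and-aggregation scheme and the nested tuning loop described above more concrete, the following small sketch illustrates them on simulated toy data. It is purely illustrative and makes several assumptions that are not part of \texttt{CMA}: the simulated matrix \texttt{toyX}, the labels \texttt{toyY}, the grid \texttt{kgrid} and the choice of a $k$-nearest-neighbour classifier from the package \texttt{class} are all hypothetical. \texttt{CMA} itself encapsulates these steps in \texttt{GenerateLearningsets}, \texttt{tune} and \texttt{classification}, as demonstrated in Section~\ref{sec:ex1}.
<<resamplingsketch, eval = FALSE>>=
## Toy sketch (not part of the CMA workflow): the aggregated error estimator
## from the resampling scheme above, with an inner CV loop that tunes the
## number of neighbours k of a k-nearest-neighbour classifier. All names and
## settings (toyX, toyY, kgrid, ...) are purely illustrative assumptions.
library(class)                                   # provides knn()
set.seed(1)
n <- 60; p <- 200
toyX <- matrix(rnorm(n * p), nrow = n)           # simulated 'expression' matrix
toyY <- factor(rep(0:1, each = n / 2))           # simulated class labels
B <- 20; ntrain <- 40; kgrid <- c(1, 3, 5, 7)
errs <- numeric(B)
for (b in seq_len(B)) {
  learn <- sample(seq_len(n), ntrain)            # learning sample L_b
  test  <- setdiff(seq_len(n), learn)            # test sample T_b
  ## inner 5-fold CV, using the learning sample only, to choose k
  folds <- sample(rep(1:5, length.out = ntrain))
  innererr <- sapply(kgrid, function(k) {
    foldwise <- sapply(1:5, function(l) {
      inL <- learn[folds != l]
      inT <- learn[folds == l]
      mean(knn(toyX[inL, , drop = FALSE], toyX[inT, , drop = FALSE],
               toyY[inL], k = k) != toyY[inT])
    })
    mean(foldwise)
  })
  kopt <- kgrid[which.min(innererr)]             # smallest k wins in case of ties
  ## evaluate the tuned rule on the untouched test sample T_b
  pred <- knn(toyX[learn, ], toyX[test, ], toyY[learn], k = kopt)
  errs[b] <- mean(pred != toyY[test])
}
mean(errs)                                       # resampling-based error estimate
@
Note that tuning only ever sees the learning sample $\mc{L}_b$, so each test sample enters the aggregated estimate without having been used for model selection.
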
\section{Overview of CMA features}\label{sec:2}
In a nutshell, the package has the following features.
\begin{itemize}
\item It offers a uniform, user-friendly interface to a total of 21 classification methods (Table \ref{tab:methods}), comprising classical approaches (such as discriminant analysis) as well as more sophisticated methods, e.g. Support Vector Machines (SVM) or boosting techniques. User-friendliness means that the input formats are uniform among different methods, that the user may choose between three different input formats and that the output is self-explanatory and informative.
\item Probability estimates for the predicted observations are provided by most of the classifiers, with only a few exceptions. This is more informative than returning class labels only and enables a more precise comparison of different classifiers.
\item It automatically generates learning samples as explained in Section~\ref{sec:1}, including the generation of stratified samples.
\item Preliminary variable selection (if any) is performed for each iteration separately, based on one of the following ranking procedures, using the method \texttt{GeneSelection}:
\begin{itemize}
\item ordinary two-sample $t$ test (\texttt{method = "t.test"})
\item Welch modification of the $t$ test (\texttt{method = "welch.test"})
\item Wilcoxon rank sum test (\texttt{method = "wilcox.test"})
\item $F$ test (\texttt{method = "f.test"}) when $K>2$
\item Kruskal-Wallis test (\texttt{method = "kruskal.test"}) when $K>2$
\item 'moderated' $t$ and $F$ test, respectively, \\ using the package \texttt{limma} \citep{limma} (\texttt{method = "limma"})
\item one-step Recursive Feature Elimination \citep{rfe} (\texttt{method = "rfe"})
\item random forest variable importance measure (\texttt{method = "rf"})
\item the Lasso (\texttt{method = "lasso"})
\item the elastic net (\texttt{method = "elasticnet"})
\item componentwise boosting (\texttt{method = "boosting"})
\item the ad-hoc criterion used in \citet{golub1999}
\end{itemize}
For most methods, the implementation is very fast, because the package \texttt{CMA} uses its own functions instead of pre-defined \texttt{R} functions. Additionally, the multi-class case is fully supported, even if the chosen \texttt{method} is not defined for it; the workaround is realized by using either a pairwise or a one-vs-all scheme.
\item Hyperparameter tuning is carried out using the scheme outlined in Section~\ref{sec:1}, for a \emph{fixed} (sub)set of variables. It can be performed fully automatically using pre-defined grids. Alternatively, it can be completely customized by the user.
\item The method \texttt{classification} enables the user to combine gene selection, hyperparameter tuning and class prediction into one single step. Alternatively, each step can be performed separately.
\item Performance can be assessed using several performance measures that are commonly used in practice:
\begin{itemize}
\item the misclassification rate
\item the sensitivity and specificity, for $K=2$
\item the empirical area under the curve (AUC), for $K=2$, if the class prediction method returns a probability
\item the Brier score
\item the average probability of correct classification
\end{itemize}
Each performance measure can be 1) averaged over all predictions globally, 2) averaged within each iteration first and then over all iterations, or 3) averaged within each observation first and then over all observations.
Based on the results, the function \texttt{obsinfo} can be used to identify observations that are frequently misclassified (and are thus candidates for outliers).
\item Comparison of the performance of several classifiers can be performed using one or several of the above performance measures. This comparison can be tabulated and visualized using the method \texttt{compare}.
\item Most results can quickly be summarized and visualized using pre-defined convenience methods, for example:
\begin{itemize}
\item \texttt{plot,cloutput-method} produces a probability plot, also known as 'voting plot'
\item \texttt{plot,genesel-method} visualizes variable importance via a barplot
\item \texttt{roc,cloutput-method} draws the empirical ROC curve
\item \texttt{toplist,genesel-method} shows the most important variables
\item \texttt{summary,evaloutput-method} summarizes iteration- or observationwise performance measures
\end{itemize}
\item The implementation is fully organized in \texttt{S4} classes, thus making the extension of \texttt{CMA} very easy. In particular, own classification methods can easily be integrated if they return a proper object of class \texttt{cloutput}.
\item The class prediction methods implemented in \texttt{CMA} are summarized in Table \ref{tab:methods}.
\end{itemize}

\begin{table}[h!]
\centering
\begin{tabular}{llll}
%\hline
\textbf{method name} & \texttt{CMA} \textbf{function name} & \textbf{Package} & \textbf{Reference} \\
\hline
Componentwise Boosting & \texttt{compBoostCMA} & \texttt{CMA} & \citet{Buhlmann2003} \\
Diagonal Discriminant Analysis & \texttt{dldaCMA} & \texttt{CMA} & \citet{dabook} \\
Elastic Net & \texttt{ElasticNetCMA} & \texttt{glmpath} & \citet{elasticnet} \\
Fisher's Discriminant Analysis & \texttt{fdaCMA} & \texttt{CMA} & \citet{Rip1996} \\
Flexible Discriminant Analysis & \texttt{flexdaCMA} & \texttt{mgcv} & \citet{Rip1996} \\
Tree-based Boosting & \texttt{gbmCMA} & \texttt{gbm} & \citet{Friedman2001} \\
$k$-nearest neighbours & \texttt{knnCMA} & \texttt{class} & \citet{Rip1996} \\
Linear Discriminant Analysis ${\ast}$ & \texttt{ldaCMA} & \texttt{MASS} & \citet{dabook} \\
Lasso & \texttt{LassoCMA} & \texttt{glmpath} & \citet{glmpath} \\
Feed-Forward Neural Networks & \texttt{nnetCMA} & \texttt{nnet} & \citet{Rip1996} \\
Probabilistic nearest neighbours & \texttt{pknnCMA} & \texttt{CMA} & $-$ \\
Penalized Logistic Regression & \texttt{plrCMA} & \texttt{CMA} & \citet{Zhu2004} \\
Partial Least Squares ${\star}$ + $\ast$ & \verb+pls_ldaCMA+ & \texttt{plsgenomics} & \citet{Bou2007a} \\
$\star$ + logistic regression & \verb+pls_lrCMA+ & \texttt{plsgenomics} & \citet{Bou2007a} \\
$\star$ + Random Forest & \verb+pls_rfCMA+ & \texttt{plsgenomics} & \citet{Bou2007a} \\
Probabilistic Neural Networks & \texttt{pnnCMA} & \texttt{CMA} & \citet{specht} \\
Quadratic Discriminant Analysis ${\ast}$ & \texttt{qdaCMA} & \texttt{MASS} & \citet{dabook} \\
Random Forest & \texttt{rfCMA} & \texttt{randomForest} & \citet{Breiman2001} \\
PAM & \texttt{scdaCMA} & \texttt{CMA} & \citet{Tib2002} \\
Shrinkage Discriminant Analysis & \texttt{shrinkldaCMA} & \texttt{CMA} & $-$ \\
Support Vector Machine & \texttt{svmCMA} & \texttt{e1071} & \citet{ScholkopfSmola2002} \\
\hline
\end{tabular}
\caption{Class prediction methods available in \texttt{CMA}, the corresponding \texttt{CMA} function names, the packages the implementations are taken from, and references.}
\label{tab:methods}
\end{table}

\section{Comparison with existing packages}\label{sec:3}
The idea of an interface for the integration of classification methods for microarray data is not new, and we here argue why \texttt{CMA} can be a significant improvement with respect to the
following aspects: standardized and reproducible analysis, neutral comparison of existing methods, and comfortable use.
The \texttt{CMA} package shows similarities to the \texttt{Bioconductor} package \texttt{MLInterfaces}, standing for 'An interface to various machine learning methods' \citep{Carey2007}; see also the Bioconductor textbook for a presentation of an older version. Contrary to \texttt{CMA}, \texttt{MLInterfaces} also offers access to popular 'unsupervised learning' methods such as clustering, independent component analysis etc. that can be beneficial for exploratory analyses as well as for revealing classes in a preparatory step preceding supervised classification. The package architecture is very similar to the \texttt{CMA} structure in the sense that wrapper functions are used to call classification methods from other packages. Up to now, \texttt{CMA} includes more predefined features than \texttt{MLInterfaces} as far as variable selection, hyperparameter tuning, classifier evaluation and comparison are concerned. While the method \texttt{xval} is flexible for experienced users, it provides only LOOCV and leave-one-group-out cross-validation as predefined options. As our package also addresses inexperienced users, we decided to include the most common validation schemes in a standardized manner. As a consequence, we additionally hope to increase the reproducibility of results, which is a major concern in statistics in general and for microarray data analysis in particular. In the current version, variable selection can also be carried out separately for each learning set, but this does not seem to be a standard procedure. In the examples presented in the book mentioned above, variable selection is only performed once using the \emph{complete} sample $S$, although this procedure is widely known to yield optimistically biased results \citep{Amb2002}. Moreover, hyperparameter tuning is completely missing in \texttt{MLInterfaces}. In our opinion, this makes the objective comparison of different class prediction methods difficult: if tuning is ignored, simpler methods without (or with few) tuning parameters tend to perform seemingly better than more complex algorithms. We 'borrowed' from \texttt{MLInterfaces} some ideas regarding the selection of classification methods (Section~\ref{sec:2}) and some functionalities such as the variable importance plot or the \texttt{Planarplot}.
%For the sake of fairness of comparison, it
%has to be stated that our thorough test was unfortunately performed
%at a time where (according to the authors) \texttt{MLInterfaces}
%is in a phase of new developments.
We would also like to mention the package \texttt{e1071} \citep{e1071pack}, whose \texttt{tune} function served as the basic idea for the \texttt{CMA} tuning functionalities.\\
The package \texttt{MCRestimate} \citep{MCRestimate} emphasizes very similar aspects as \texttt{CMA}, focussing on the estimation of misclassification rates and on cross-validation for model selection and evaluation. It is (to our knowledge) the only \texttt{Bioconductor} package that supports hyperparameter tuning, apparently relying on the function \texttt{e1071:::tune}. Compared to \texttt{CMA}, it is a bit less comprehensive, in particular with respect to variable selection. Moreover, the package structure is, in our opinion, less stringent than that of \texttt{CMA}, mainly because it lacks a class structure (neither \texttt{S3} nor \texttt{S4} classes are used).
\section{Example 1: Focussing on one method}\label{sec:ex1}
The following two sections demonstrate the usual workflow when using \texttt{CMA} and its essential features, methods and objects. While this section focusses on optimizing and evaluating one method, the following section is devoted to classifier comparison. We use the famous leukemia dataset of \citet{golub1999} as a data example. The sample consists of 38 observations in total, of which 27 belong to class 0 (acute lymphoblastic leukemia) and 11 to class 1 (acute myeloid leukemia). We start by loading the dataset and extracting the gene expressions and class labels, respectively.
<<>>=
data(golub)
golubY <- golub[,1]
golubX <- as.matrix(golub[,-1])
@
Following the approach described in Section~\ref{sec:1}, we generate several learning samples by different splitting rules, taking into account that each of them has advantages and disadvantages. Learning samples are generated by the function \texttt{GenerateLearningsets}, which returns an object of class \texttt{learningsets}:
<<>>=
loodat <- GenerateLearningsets(y=golubY, method ="LOOCV")
class(loodat)
getSlots(class(loodat))
show(loodat)
@
For five-fold cross-validation, we use the following commands:
<<>>=
set.seed(321)
fiveCVdat <- GenerateLearningsets(y=golubY, method = "CV", fold = 5, strat = TRUE)
@
Note that stratified learning sets are generated by setting the argument \texttt{strat = TRUE}. The random seed should be set for reproducibility. We can proceed analogously for Monte-Carlo cross-validation and the bootstrap:
<<>>=
set.seed(456)
MCCVdat <- GenerateLearningsets(y=golubY, method = "MCCV", niter = 3,
                                ntrain = floor(2/3*length(golubY)), strat = TRUE)
set.seed(651)
bootdat <- GenerateLearningsets(y=golubY, method = "bootstrap", niter = 3, strat = TRUE)
@
In this example, we choose the Support Vector Machine with linear kernel as classification method. Variable selection is not strictly necessary, but it has empirically been shown that the performance of the SVM can be improved significantly when noise features are removed \citep{Has2001}. For simplicity, we choose the distribution-free Wilcoxon test to rank the variables, separately for each learning sample:
<<>>=
varsel_fiveCV <- GeneSelection(X = golubX, y=golubY, learningsets = fiveCVdat,
                               method = "wilcox.test")
varsel_MCCV <- GeneSelection(X = golubX, y=golubY, learningsets = MCCVdat,
                             method = "wilcox.test")
varsel_boot <- GeneSelection(X = golubX, y=golubY, learningsets = bootdat,
                             method = "wilcox.test")
@
Now let us have a closer look at \verb+varsel_fiveCV+. The \texttt{toplist} method provides easy access to the top-ranked variables:
<<>>=
show(varsel_fiveCV)
toplist(varsel_fiveCV, iter=1)
seliter <- numeric()
for(i in 1:5)
  seliter <- c(seliter, toplist(varsel_fiveCV, iter = i, top = 10, show = FALSE)$index)
sort(table(seliter), dec = TRUE)
@
We see that just one variable is among the top 10 in every learning sample. The next step is hyperparameter tuning. It is well known that the SVM needs careful tuning \citep{Sta2005a} and that its performance can decrease drastically without tuning. For simplicity, we use a linear kernel, thus avoiding the tuning of kernel parameters. The choice of the parameter $C$ in the primal objective of the SVM
\begin{equation*}
P(\bm{w}) = \norm{\bm{w}}^2 + C \su \xi_i, \; C > 0
\end{equation*}
has still to be made.
$\bm{w}$ denotes the weight vector of the maximum margin hyperplane, while the slack variables $\{ \xi_i \}_{i=1}^n$ quantify the amount by which this hyperplane is violated by the learning sample. Increasing $C$ penalizes violations more severely, forcing the hyperplane to separate the learning sample (and thus probably producing overfitting).\\
In order to find an appropriate value for $C$ (which corresponds to the argument \texttt{cost} in \texttt{e1071:::svm}), we take the best 100 genes from the previous variable ranking step. As hyperparameter tuning is very time-intensive, we perform it only for the five-fold CV procedure in this illustrative example. The resulting best value for \texttt{cost} will then also be used for the other CV procedures. In this example, we consider five candidate values: 0.1, 1, 10, 100, 200. Due to the nested cross-validation procedure, the SVM is trained $3 \times 5 \times 5 = 75$ times (!) in the following example: three is the number of inner cross-validation folds, the first five is the number of candidate values and the second five is the number of outer cross-validation folds.
<<>>=
set.seed(351)
tuningstep <- CMA:::tune(X = golubX, y=golubY, learningsets = fiveCVdat,
                         genesel = varsel_fiveCV, nbgene = 100, classifier = svmCMA,
                         grids = list(cost = c(0.1, 1, 10, 100, 200)), probability = TRUE)
@
<<>>=
show(tuningstep)
unlist(best(tuningstep))
@
The results seem to unequivocally favour \texttt{cost = 0.1}. However, if we visualize the tuning results, we see that almost all tuning parameter values perform quite well. Please note that \texttt{CMA} will by default choose the smallest value if the performances of several values are equal:
<<>>=
par(mfrow = c(2,2))
for(i in 1:4) plot(tuningstep, iter = i, main = paste("iteration", i))
@
We nevertheless choose \texttt{cost = 0.1}, since it produced good results.\\
We can now turn our attention to class prediction:
<<>>=
#class_loo <- classification(X = golubX, y=golubY, learningsets = loodat,
#                            genesel = varsel_loo, nbgene = 100, classifier = svmCMA,
#                            cost = 100)
class_fiveCV <- classification(X = golubX, y=golubY, learningsets = fiveCVdat,
                               genesel = varsel_fiveCV, nbgene = 100, classifier = svmCMA,
                               cost = 0.1, probability = TRUE)
class_MCCV <- classification(X = golubX, y=golubY, learningsets = MCCVdat,
                             genesel = varsel_MCCV, nbgene = 100, classifier = svmCMA,
                             cost = 0.1, probability = TRUE)
class_boot <- classification(X = golubX, y=golubY, learningsets = bootdat,
                             genesel = varsel_boot, nbgene = 100, classifier = svmCMA,
                             cost = 0.1, probability = TRUE)
@
The results of \texttt{classification} are lists in which each element is an object of class \texttt{cloutput}, generated for each learning sample. For visualization purposes, we use the \texttt{join} function to combine the single list elements into 'big' objects.
We first put the results from the three splitting schemes into one list and then use \texttt{lapply()}:
<<>>=
resultlist <- list(class_fiveCV, class_MCCV, class_boot)
result <- lapply(resultlist, join)
@
The probability (or voting) plot is one of the most popular visualization methods in microarray-based classification:
<<>>=
schemes <- c("five-fold CV", "MCCV", "bootstrap")
par(mfrow = c(3,1))
for(i in seq(along = result)) plot(result[[i]], main = schemes[i])
@
\texttt{ftable} applied to objects of class \texttt{cloutput} yields confusion matrices:
<<>>=
invisible(lapply(result, ftable))
@
\texttt{roc} draws simple ROC curves:
<<>>=
par(mfrow = c(2,2))
for(i in seq(along = result)) roc(result[[i]])
@
We can now \texttt{join} again to aggregate over the different splitting rules:
<<>>=
totalresult <- join(result)
ftable(totalresult)
@
Confusion matrices implicitly quantify performance via misclassification. For more advanced performance evaluation, one can use \texttt{evaluation}. Note that the input has to be a list (and not an object of class \texttt{cloutput}). For the Monte-Carlo cross-validation scheme, we have:
<<>>=
av_MCCV <- evaluation(class_MCCV, measure = "average probability")
show(av_MCCV)
boxplot(av_MCCV)
summary(av_MCCV)
@
\texttt{measure = "average probability"} stands for the average predicted probability for the correct class, or formally:
\begin{equation*}
\frac{1}{B} \sum_{b=1}^B \frac{1}{|\mc{I}_b|} \sum_{i \in \mc{I}_b} \sum_{k=0}^{K-1} I(y_i = k) \, \widehat{p}(y_i = k \,|\, \bm{x}_i),
\end{equation*}
with $\widehat{p}(k|\bm{x})$ denoting the predicted conditional probability for class $k$, given $\bm{x}$.\\ \\
By default, the evaluation scheme is iterationwise, but it can also be done observationwise:
<<>>=
av_obs_MCCV <- evaluation(class_MCCV, measure = "average probability", scheme = "obs")
show(av_obs_MCCV)
@
One might also wonder which observations are misclassified particularly often. To find out, one can use:
<<>>=
obsinfo(av_obs_MCCV, threshold = 0.6)
@
\section{Example 2: classifier comparison}\label{sec:ex2}
\texttt{CMA} implements a complete bundle of methods based on the principle of discriminant analysis. Here, we compare six of them: diagonal, linear and quadratic discriminant analysis, Fisher's discriminant analysis, shrunken centroids discriminant analysis (also known as PAM) and Partial Least Squares followed by linear discriminant analysis, applied to the small blue round cell tumour dataset of \citet{Khan2001}, which comprises 65 samples from four tumour classes.\\
From a theoretical point of view, linear, quadratic and Fisher's discriminant analysis are a priori inferior because they do not work in the $p \gg n$ setting without variable selection. Shrunken centroids discriminant analysis is assumed to work better than simple diagonal discriminant analysis because it can 'shrink out' noise variables. Partial Least Squares is also expected to work well.\\
But let us see how this looks in practice. As a data basis, we will use (stratified) five-fold cross-validation, repeated five times in order to achieve more stable results.
<<>>=
data(khan)
khanY <- khan[,1]
khanX <- as.matrix(khan[,-1])
set.seed(27611)
fiveCV5iter <- GenerateLearningsets(y=khanY, method = "CV", fold = 5, niter = 5, strat = TRUE)
@
In contrast to the step-by-step procedure in the previous example, we will here always use the flexible method \texttt{classification}. We start with diagonal discriminant analysis, which needs neither variable selection nor tuning.
<<>>=
class_dlda <- classification(X = khanX, y=khanY, learningsets = fiveCV5iter,
                             classifier = dldaCMA)
@
We now rank the variables (genes) according to the $t$ statistic as a basis for the variable selection that is necessary for linear, quadratic and Fisher's discriminant analysis:
<<>>=
genesel_da <- GeneSelection(X=khanX, y=khanY, learningsets = fiveCV5iter,
                            method = "t.test", scheme = "one-vs-all")
@
For the class prediction that follows, we pass the generated learning sets and the gene rankings (\verb+genesel_da+) to \texttt{classification}. The number of genes that are retained is specified by the argument \texttt{nbgene}.
<<>>=
class_lda <- classification(X = khanX, y=khanY, learningsets = fiveCV5iter,
                            classifier = ldaCMA, genesel= genesel_da, nbgene = 10)
class_fda <- classification(X = khanX, y=khanY, learningsets = fiveCV5iter,
                            classifier = fdaCMA, genesel = genesel_da, nbgene = 10, comp = 2)
class_qda <- classification(X = khanX, y=khanY, learningsets = fiveCV5iter,
                            classifier = qdaCMA, genesel = genesel_da, nbgene = 1)
@
Shrunken centroids discriminant analysis does not require variable selection, but hyperparameter tuning of the shrinkage intensity. We here use the pre-defined grid by setting the \texttt{grids} element of \texttt{tuninglist} to an empty list.
<<>>=
set.seed(876)
class_scda <- classification(X = khanX, y=khanY, learningsets = fiveCV5iter,
                             classifier = scdaCMA, tuninglist = list(grids = list()))
@
Finally, we use Partial Least Squares followed by linear discriminant analysis (with two components, which is the default).
<<>>=
class_plsda <- classification(X = khanX, y=khanY, learningsets = fiveCV5iter,
                              classifier = pls_ldaCMA)
@
A comparison can be performed very quickly, needing only a few lines of code. The method \texttt{compare} does the whole job:
<<>>=
dalike <- list(class_dlda, class_lda, class_fda, class_qda, class_scda, class_plsda)
par(mfrow = c(3,1))
comparison <- compare(dalike, plot = TRUE,
                      measure = c("misclassification", "brier score", "average probability"))
print(comparison)
@
\bibliographystyle{plainnat}
\bibliography{classification}
\end{document}