% NOTE -- ONLY EDIT THE .Rnw FILE!!!
% The .tex file is likely to be overwritten.
%
% \VignetteIndexEntry{pdmclass Overview}
% \VignetteDepends{pdmclass, Biobase, fibroEset}
% \VignetteKeywords{Expression Analysis, Postprocessing}
% \VignettePackage{pdmclass}
\documentclass[11pt]{article}

\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\usepackage{times}
\usepackage{comment}

\parindent 0.5in

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{\textit{#1}}

\bibliographystyle{plainnat}

\begin{document}

\title{\bf Using pdmclass}
\author{James W. MacDonald}
\maketitle

\section{Overview}

Classification is a statistical technique that uses measurements on a
defined set of samples (a training set) to build a rule that can be used
to infer the group membership of future samples. An example would be
using gene expression data to classify cancer patients according to
their expected response to a certain course of therapy. There are many
ways to build classification rules, but the general idea is the same:
find patterns in the training set that are unique to each sample type,
and use this information to determine the class of new samples.

Microarrays hold great promise for building classifiers because of the
amount of information that can be generated from each array. However,
this is a double-edged sword -- much of the information cannot be
distinguished from noise (e.g., low-expressing genes), and having many
more predictor variables (genes) than samples can make the analysis
computationally and statistically difficult. It is therefore usually
desirable to pre-filter the genes down to a much smaller set (say,
100--200 genes) before building the classifier. This is itself not a
simple proposition -- ideally this list should contain genes that are as
uncorrelated as possible, because data from correlated genes are in some
sense redundant information.

An alternative to manually subsetting the data is to use regularized
regression models (partial least squares, ridge regression, principal
components regression), which are designed to work with large numbers of
correlated predictor variables. Since these methods are generally used
in situations where the response is continuous (whereas in
classification the response is categorical), we can use the optimal
scoring algorithm of \citet{hastie:1994}, which extends these methods to
classification problems. For a more detailed description of these
methods, please refer to \citet{ghosh:2003}.
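If a pre-filtering step is nonetheless desired, one common heuristic is
to rank the genes by their variance across samples and keep only the
most variable ones. The following sketch is purely illustrative and is
not part of the \Rpackage{pdmclass} API; it assumes an
\Robject{ExpressionSet} named \Robject{eset} (a hypothetical object
standing in for your own data) and uses only base R and
\Rpackage{Biobase} functions.

<<eval=FALSE>>=
library("Biobase")
## `eset' is an assumed, illustrative ExpressionSet
## Compute the per-gene variance across all samples
vars <- apply(exprs(eset), 1, var)
## Keep the 200 most variable genes (the cutoff is arbitrary)
eset.filt <- eset[order(vars, decreasing = TRUE)[1:200], ]
@

Note that a variance filter does nothing about correlation among the
retained genes, which is precisely the problem the regularized methods
used below are designed to handle.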
\section{A Simple Example}

In this example we will use the \Rpackage{fibroEset} package, which
contains expression data from Affymetrix HG-U95Av2 chips that were used
to analyze early-passage primary fibroblast cell lines from 18 human, 10
bonobo, and 11 gorilla samples. Although the utility of a classifier
based on this data set is questionable at best, it does provide a
workable example.

We first load the package:

<<>>=
library("pdmclass")
data("fibroEset")
fibroEset
pData(fibroEset)
@

First we fit a classifier to these data, using the expression values and
the second column of the \Robject{phenoData} slot. Note that in
classical statistical applications the convention is for rows to contain
subjects and columns to contain variables -- we therefore have to
transpose the expression data to meet this convention. In addition, we
use the usual R formula interface -- for more information, see the
\Rfunction{formula} help page.

<<>>=
y <- as.factor(pData(fibroEset)[, 2])
x <- t(exprs(fibroEset))
gn.class <- pdmClass(y ~ x, method = "pls")
@

Once we have fit the classifier, we can make a plot that shows how well
the samples are grouping.

<<eval=FALSE>>=
plot(gn.class, pch = levels(y))
@

\begin{figure}
\centering
<<fig=TRUE, echo=FALSE>>=
plot(gn.class, pch = levels(y))
@
\caption{Plot of Fitted PLS Classifier}
\label{fig:pls}
\end{figure}

Figure~\ref{fig:pls} shows the fitted PLS classifier. As would be
expected, the different species are quite well separated and very
tightly grouped. Samples with more subtle differences would not be
expected to group this nicely.

Having built the classifier, we will most likely want to use it to
predict the class of new samples for which we don't know the classes
\textit{a priori}. However, before we do this, it is prudent to test the
classifier to see how accurate it is. We could simply take the data used
to build the classifier, treat them as if they were new data, and
predict the class of each sample.

<<>>=
predict(gn.class)
@

Since we are predicting the class of the data that were used to build
the classifier, we expect these results to be much better than what
could be expected with a set of new data. To get an unbiased estimate of
the accuracy, we need a `test set'. The canonical method of creating and
testing a classifier is to split a set of data into a training set and a
testing set. The training set is used to build the classifier, and the
testing set is then used to estimate its accuracy by comparing the
predicted class for each sample to the actual class. Unfortunately, it
is often difficult to get sufficient numbers of samples to build an
accurate classifier, so it may not be possible to split the data into a
training set and a testing set. An alternative is to perform a `leave
one out' cross-validation, where we remove a single sample and build a
classifier with the remaining samples. We then predict the class of the
sample that was removed, repeating the process for each sample in turn.
This yields a vector of class assignments that we can compare to the
true class memberships to create a `confusion matrix'.

<<>>=
tst <- pdmClass.cv(y, x, method = "pls")
confusion(tst, y)
@

Here we can see that we expect an error rate of approximately
\Sexpr{round(attr(confusion(tst, y), "error"), 3)} when we apply this
classifier to new samples.

After building a classifier, we may be interested in the genes that have
the most influence in differentiating between sample types. For this we
can use the \Rfunction{pdmGenes} function.

<<>>=
gns <- featureNames(fibroEset)
len <- 10
tmp <- pdmGenes(y ~ x, genelist = gns, list.length = len, B = 10)
tmp
@

The \Rfunction{pdmGenes} function selects the top $n$ genes (set by the
argument \Rfunarg{list.length}) from the original classifier, and then
determines the importance of these genes by repeatedly taking bootstrap
samples from the data and seeing what proportion of the time these genes
are found to be influential in classifiers fit to the bootstrap samples.
The basic idea is that a truly influential gene will repeatedly show up
as influential as we make slight perturbations to the data. Note that we
are using \Rfunarg{contr.treatment} contrasts for these model fits, so
one class will always be set as the baseline, and the output will list
the genes influential in the comparison of each of the other classes to
that baseline.

\bibliography{pdmclass}

\end{document}