--- title: "An Introduction to **ClassifyR**" author: Dario Strbenac, Graham Mann, Jean Yang, John Ormerod
The University of Sydney, Australia. output: BiocStyle::html_document: toc: true vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{An Introduction to the ClassifyR Package} --- ```{r, echo = FALSE, results = "asis"} options(width = 130) ``` ## Overview **ClassifyR** provides a structured pipeline for cross-validated classification. Classification is viewed in terms of four stages, data transformation, feature selection, classifier training, and prediction. The stages can be run in any order that is sensible. Each step can be provided with custom functions that follow some rules about parameters. The driver function *runTests* implements different varieties of cross-validation. They are: * Ordinary k-fold cross-validation * Permutation of the order of samples followed by k-fold cross-validation * Repeated x% test set cross-validation * leave-k-out cross-validation *runTests* can use parallel processing capabilities in R to speed up cross-validations when many CPUs are available. The output of *runTests* is a *ClassifyResult* object which can be directly used by the performance evaluation functions. The process of classification is summarised by a flowchart. ```{r, echo = FALSE} htmltools::img(src = knitr::image_uri("ClassifyRprocedure.png"), style = 'margin-left: auto;margin-right: auto') ``` Importantly, ClassifyR implements a number of methods for classification using different kinds of changes in measurements between classes. Most classifiers work with features where the means are different. In addition to changes in means (DM), **ClassifyR** also allows for classification using differential varibility (DV; changes in scale) and differential distribution (DD; changes in location and/or scale). See the appendix section "Common *params* Specifications for Common Classifications" for some ready-to-use parameter sets for standard use of some classifiers. In the following sections, some of the most useful functions provided in **ClassifyR** will be demonstrated. However, a user can provide any feature selection, training, or prediction function to the classification framework, as long as it meets some simple rules about the input and return parameters. See the appendix section of this guide titled "Rules for New Functions" for a description of these. ## Comparison to Existing Classification Frameworks There are a few other frameworks for classification in R. The table below provides a comparison of which features they offer. Package | Run User-defined Classifiers | Parallel Execution on any OS | Parameter Tuning | Intel DAAL Performance Metrics | Ranking and Selection Plots | Class Distribution Plot | Sample-wise Error Heatmap | Direct Support for MultiAssayExperiment Input -------|--------|-------|--------|--------|---------|-----------|----------|---------- **ClassifyR** | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes caret | Yes | Yes | Yes | No | No | No | No | No MLInterfaces | Yes | No | No | No | No | No | No | No MCRestimate | Yes | No | Yes | No | No | No | No | No CMA | No | No | Yes | No | No | No | No | No ## Provided Functionality Although being a cross-validation framework, a number of popular feature selection and classification functions are provided by the package which meet the requirements of functions to be used by it (see the last section). ### Provided Feature Selection and Classification Methods Functions with names ending in "interface" indicate wrappers for existing methods implemented in other packages. Different methods select different types of changes (i.e. location and/or scale) between classes. ```{r, echo = FALSE} htmltools::img(src = knitr::image_uri("providedSelection.png"), style = 'margin-left: auto;margin-right: auto') ``` Likewise, a variety of classifiers is also provided. ```{r, echo = FALSE} htmltools::img(src = knitr::image_uri("providedClassifiers.png"), style = 'margin-left: auto;margin-right: auto') ``` If a desired selection or classification method is not already implemented, rules for writing functions to work with **ClassifyR** are outlined in the next section. ## Case Study : Diagnosing Asthma To demonstrate some key features of ClassifyR, a data set consisting of the 2000 most variably expressed genes and 190 people will be used to quickly obtain results. The journal article corresponding to the data set was published in *Scientific Reports* in 2018 and is titled [A Nasal Brush-based Classifier of Asthma Identified by Machine Learning Analysis of Nasal RNA Sequence Data](http://www.nature.com/articles/s41598-018-27189-4). ```{r, message = FALSE} library(ClassifyR) data(asthma) measurements[1:5, 1:5] head(classes) ``` The numeric matrix variable *measurements* stores the normalised values of the RNA gene abundances for each sample and the factor vector *classes* identifies which class the samples belong to. The measurements were normalised using **DESeq2**'s *varianceStabilizingTransformation* function, which produces $log_2$-like data. For more complex data sets with multiple kinds of experiments (e.g. DNA methylation, copy number, gene expression on the same set of samples) a [*MultiAssayExperiment*](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html) is recommended for data storage and supported by **ClassifyR**'s methods. ### runTests Driver Function of Cross-validated Classification *runTests* is the main function in **ClassifyR** which handles the sample splitting and parallelisation, if used, of cross-validation. To begin with, a simple classifier will be specified. It uses a **limma** moderated t-test ranking for feature selection and DLDA for classification. The *limmaSelection* function also uses DLDA for estimating a resubstitution error rate for a number of top-*f* ranked features, as a heuristic for picking *f* features from the feature ranking which are used in the training and prediction stages of classification. This classifier relies on differences in means between classes. No parameters need to be specified, because this is the default classification of *runTests*. ```{r, tidy = FALSE} DMresults <- runTests(measurements, classes, datasetName = "Asthma", classificationName = "Different Means", permutations = 20, folds = 5, seed = 2018, verbose = 1) DMresults ``` Here, 20 permutations and 5 folds cross-validation is specified by the values of *permutations* and *folds*. For computers with more than 1 CPU, the number of cores to use can be given to *runTests* by using the argument *parallelParams*. The parameter *seed* is important to set for result reproducibility when doing a cross-validation such as this, because it employs randomisation to partition the samples into folds. For more details about *runTests* and the parameter classes used by it, consult the help pages of such functions. ### Evalulation of a Classification The most frequently selected gene can be identified using the *distribution* function and its relative abundance values for all samples can be displayed visually by *plotFeatureClasses*. ```{r, fig.height = 8, fig.width = 8, results = "hold", message = FALSE} selectionPercentages <- distribution(DMresults, plot = FALSE) sortedPercentages <- sort(selectionPercentages, decreasing = TRUE) head(sortedPercentages) mostChosen <- names(sortedPercentages)[1] bestGenePlot <- plotFeatureClasses(measurements, classes, mostChosen, dotBinWidth = 0.1, xAxisLabel = "Normalised Expression") ``` The means of the abundance levels of *ZDHHC1* are substantially different between the people with and without asthma. *plotFeatureClasses* can also plot categorical data, such as may be found in a clinical data table, as a bar chart. Classification error rates, as well as many other prediction performance measures, can be calculated with *calcCVperformance*. Next, the balanced error rate is calculated considering all samples, each of which was in the test set once. The balanced error rate is defined as the average of the classification errors of each class. See the documentation of *calcCVperformance* for a list of performance metrics which may be calculated. ```{r} DMresults <- calcCVperformance(DMresults, "balanced error") DMresults performance(DMresults) ``` The error rate is about 20%. If only a vector of predictions and a vector of actual classes is available, such as from an old study which did not use **ClassifyR** for cross-validation, then *calcExternalPerformance* can be used on a pair of factor vectors which have the same length. ### Comparison of Different Classifications The *samplesMetricMap* function allows the visual comparison of sample-wise error rate or accuracy measures from different *ClassifyResult* objects. Firstly, a classifier will be run that uses Kullback-Leibler divergence ranking and resubstitution error as a feature selection heuristic and a naive Bayes classifier for classification. This classification will use features that have either a change in location or in scale between classes. ```{r, tidy = FALSE} selectParams <- SelectParams(KullbackLeiblerSelection, resubstituteParams = ResubstituteParams()) trainParams <- TrainParams(naiveBayesKernel) predictParams <- PredictParams(predictor = NULL, getClasses = function(result) result, weighted = "weighted", weight = "height difference", returnType = "both") paramsList <- list(selectParams, trainParams, predictParams) DDresults <- runTests(measurements, classes, datasetName = "Asthma", classificationName = "Differential Distribution", permutations = 20, folds = 5, seed = 2018, params = paramsList, verbose = 1) DDresults ``` The naive Bayes kernel classifier has options specifying how the distances between class densities are used. For more information, consult the documentation of the *naiveBayesKernel* function. Now, the classification error for each sample is also calculated for both the differential means and differential distribution classifiers and both *ClassifyResult* objects generated so far are plotted with *samplesMetricMap*. ```{r, fig.width = 10, fig.height = 7} library(grid) DMresults <- calcCVperformance(DMresults, "sample error") DDresults <- calcCVperformance(DDresults, "sample error") resultsList <- list(Abundance = DMresults, Distribution = DDresults) errorPlot <- samplesMetricMap(resultsList, metric = "error", xAxisLabel = "Sample", showXtickLabels = FALSE, plot = FALSE) grid.draw(errorPlot) ``` The benefit of this plot is that it allows the easy identification of samples which are hard to classify and could be explained by considering additional information about them. Differential distribution class prediction appears to be biased to the majority class (No Asthma). The features being ranked and selected in the feature selection stage can be compared within and between classifiers by the plotting functions *rankingPlot* and *selectionPlot*. Consider the task of visually representing how consistent the feature rankings of the top 50 different features were for the differential distribution classifier for all 5 folds in the 20 cross-validations. ```{r} rankOverlaps <- rankingPlot(list(DDresults), topRanked = 1:100, xLabelPositions = c(1, seq(10, 100, 10)), lineColourVariable = "None", pointTypeVariable = "None", columnVariable = "None", plot = FALSE) rankOverlaps ``` The top-ranked features are fairly similar between all pairs of the 100 cross-validations. For a large cross-validation scheme, such as leave-2-out cross-validation, or when *results* contains many classifications, there are many feature set comparisons to make. Note that *rankingPlot* and *selectionPlot* have a *parallelParams* options which allows for the calculation of feature set overlaps to be done on multiple processors. ## Using an Independent Test Set Sometimes, cross-validation is unnecessary. This happens when studies have large sample sizes and are well-designed such that a large number of samples is prespecified to form a test set. The classifier is only trained on the training sample set, and makes predictions only on the test sample set. This can be achieved by using the function *runTest* directly. See its documentation for required inputs. ## Cross-validating Selected Features on a Different Data Set Once a cross-validated classification is complete, the usefulness of the features selected may be explored in another data set. *previousSelection* is a function which takes an existing *ClassifyResult* object and returns the features selected at the equivalent iteration which is currently being processed. This is necessary, because the models trained on one dataset are not directly transferrable to a new dataset; the classifier training (e.g. choosing thresholds, fitting model coefficients) is redone. ## Generating a ROC Plot Some classifiers can be set to output scores or probabilities representing how likely a sample is to be from one of the classes, rather than class labels. This enables different score thresholds to be tried, to generate pairs of false positive and false negative rates. The naive Bayes classifier used previously had its *returnType* parameter set to *"both"*, so class predictions and scores were both stored in the classification result. In this case, a data.frame with two columns (named "class" and "score") is returned by the classifier to the framework. Setting *returnType* to *"score"* is also sufficient to generate a ROC plot. Many existing classifiers in other R packages also have an option that allows a score or probability to be calculated. ```{r, fig.height = 5, fig.width = 6} ROCcurves <- ROCplot(list(DDresults), fontSizes = c(24, 12, 12, 12, 12)) ``` This ROC plot shows the classifiability of the asthma data set is high. Other included functions which can output scores are *fisherDiscriminant*, *DLDApredictInterface*, and *SVMpredictInterface*. ## Parameter Tuning Some classifiers allow the setting of a tuning parameter, which controls some aspect of their model learning. An example of doing parameter tuning with a linear SVM is presented. This particular SVM has a single tuning parameter, the cost. Higher values of this parameter penalise misclassifications more. This is achieved in ClassifyR by providing a variable called *tuneParams* to the TrainParams container constructor. *tuneParams* is a named list, with the names being the names of the tuning variables, and the contents being vectors of values to try. If *tuneParams* has more than one element, all combination of values of the tuning variables are tried. The performance criterion specified to *resubstituteParams* is also used as the criterion for choosing the best tuning parameter(s). Any of the non-sample-specific performance metrics which *calcCVperformance* calculates can be optimised. ```{r} SVMresults <- runTests(measurements, classes, datasetName = "Asthma", classificationName = "Tuned SVM", permutations = 20, folds = 5, seed = 2018, params = list(SelectParams(limmaSelection, resubstituteParams = ResubstituteParams()), TrainParams(SVMtrainInterface, kernel = "linear", resubstituteParams = ResubstituteParams(), tuneParams = list(cost = c(0.01, 0.1, 1, 10))), PredictParams(SVMpredictInterface, getClasses = function(result) result)) ) ``` The chosen values of the parameters are stored for every validation, and can be accessed with the *tunedParameters* function. ```{r} length(tunedParameters(SVMresults)) tunedParameters(SVMresults)[[1]] ``` The cost value of 10 is chosen all five folds of the first sample permutation. ## Summary **ClassifyR** is a framework for cross-validated classification that provides a variety of unique functions for performance evaluation. It provides wrappers for many popular classifiers but is designed to be extensible if other classifiers are desired. ## Appendix: Common *params* Specifications for Common Classifications These sets of parameters can simply be specified as the *params* parameter to *runTest* or *runTests*. - Likelihood ratio (negative binomial) feature selection and Poisson LDA for count data. ```{r} selectParams <- SelectParams(edgeRselection, resubstituteParams = ResubstituteParams()) trainParams <- TrainParams(classifyInterface) predictParams <- PredictParams(NULL, getClasses = function(result) result[["ytehat"]]) params = list(selectParams, trainParams, predictParams) ``` - Transform values into distances from the median (mean is default), select based on different variability using Bartlett's test and classify using Fisher's LDA. ```{r} transformParams <- TransformParams(subtractFromLocation, intermediate = "training", location = "median") selectParams <- SelectParams(bartlettSelection, resubstituteParams = ResubstituteParams()) trainParams <- TrainParams(fisherDiscriminant) predictParams <- PredictParams(NULL, getClasses = function(result) result) params = list(transformParams, selectParams, trainParams, predictParams) ``` - Use a differential distribution feature selection based on Kullback-Leibler divergence and a naive Bayes classifier. ```{r} selectParams <- SelectParams(KullbackLeiblerSelection, resubstituteParams = ResubstituteParams()) trainParams <- TrainParams(naiveBayesKernel) predictParams <- PredictParams(NULL, getClasses = function(result) result) params = list(selectParams, trainParams, predictParams) ``` - Use the selection and classification provided by the **pamr** package implementing nearest shrunken centroid methods. Note that the training occurs before the selecting in this example. ```{r} trainParams <- TrainParams(NSCtrainInterface) selectParams <- SelectParams(NSCselectionInterface, intermediate = "trained") predictParams <- PredictParams(NSCpredictInterface, getClasses = function(result) result) params = list(trainParams, selectParams, predictParams) ``` - A random forest classifer. Typically, feature selection is not done for random forests because they can find some interesting relationships between variables that feature selection methods may overlook. The features used by the trees can be added to the *ClassifyResult* object after the training has been done using the *forestFeatures* convenience method. The number of trees is changed from the default of 500 to 100. ```{r} trainParams <- TrainParams(randomForestInterface, ntree = 100, getFeatures = forestFeatures) predictParams <- PredictParams(NULL, getClasses = function(result) result[["test"]][["predicted"]]) params = list(trainParams, predictParams) ``` ## Appendix: Rules Regarding Input Variables of New Functions The required inputs and type of output that each stage of classifiation has is summarised by the table below. The functions can have any number of other arguments after the set of arguments which are mandatory. ```{r, echo = FALSE} htmltools::img(src = knitr::image_uri("functionRules.png"), style = 'margin-left: auto;margin-right: auto') ``` The argument *verbose* is sent from *runTest* to these functions so they must handle it, even if not explicitly using it. In the **ClassifyR** framework, *verbose* is a number which indicates the amount of progress messages to be printed. If verbose is 0, no progress messages are printed. If it is 1, only one message is printed for every 10 cross-validations completed. If it is 2, in addition to the messages printed when it is 1, a message is printed each time one of the stages of classification (transformation, feature selection, training, prediction) is done. If it is 3, in addition to the messages printed for values 1 and 2, progress messages are printed from within the classification functions themselves. A version of each included transformation, selection, training and prediction function is typically implemented for (1) a numeric matrix for which the rows are for features and columns are for samples (a data storage convention in bioinformatics) and a factor vector of the same length as the number of columns of the matrix, (2) a *DataFrame* where the columns are naturally for the features, possibly of different data types (i.e. categorical and numeric), and rows are for samples, and a class specification and (3) a *MultiAssayExperiment* which stores sample class information in the *colData* slot's *DataFrame* with column name "class". For the inputs (1 and 3) which are not *DataFrame*, they are converted to one, because the other data types can be stored as a *DataFrame* without loss of information and the transformation, selection and classification functions which accept a *DataFrame* contain the code to do the actual computations. At a minimum, a new function must have a method taking a *DataFrame* as input with the sample classes either stored in a column named "class" or provided as a factor vector. Although not required, providing a version of a function that accepts a numeric matrix with an accompanying factor vector and another version that accepts a MultiAssayExperiment is desirable to provide flexibility regarding input data. See the code of existing functions in the package for examples of this, if intending to implement novel classification-related functions to be used with **ClassifyR**. ## References Strbenac D., Yang, J., Mann, G.J. and Ormerod, J. T. (2015) [ClassifyR: an R package for performance assessment of classification with applications to transcriptomics](http://bioinformatics.oxfordjournals.org/content/31/11/1851), *Bioinformatics*, 31(11):1851-1853
Strbenac D., Mann, G.J., Yang, J. and Ormerod, J. T. (2016) [Differential distribution improves gene selection stability and has competitive classification performance for patient survival](http://nar.oxfordjournals.org/content/44/13/e119), *Nucleic Acids Research*, 44(13):e119