Contents

0.1 Overview

ClassifyR provides a structured pipeline for cross-validated classification. Classification is viewed in terms of four stages: data transformation, feature selection, classifier training, and prediction. The stages can be run in any sensible order.
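For example, a full four-stage specification can be written as an ordered list of parameter objects, one per stage (a sketch re-using a combination of functions shown in the appendix of this guide):

library(ClassifyR)
# One parameter object per stage, in the order the stages will be run:
# transformation, feature selection, training, prediction.
params <- list(TransformParams(subtractFromLocation, intermediate = "training",
                               location = "median"),
               SelectParams(bartlettSelection, resubstituteParams = ResubstituteParams()),
               TrainParams(fisherDiscriminant),
               PredictParams(NULL))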

Each step can be provided with custom functions that follow some rules about their parameters. The driver function runTests implements different varieties of cross-validation, including permutation of the sample order followed by k-fold cross-validation (demonstrated later in this guide) and leave-k-out cross-validation.

runTests can use the parallel processing capabilities of R to speed up cross-validation when many CPUs are available. The output of runTests is a ClassifyResult object which can be used directly by the performance evaluation functions. The process of classification is summarised by a flowchart.

Importantly, ClassifyR implements a number of methods for classification using different kinds of changes in measurements between classes. Most classifiers work with features where the means are different. In addition to changes in means (DM), ClassifyR also allows for classification using differential variability (DV; changes in scale) and differential distribution (DD; changes in location and/or scale). See the appendix section “Common params Specifications for Common Classifications” for some ready-to-use parameter sets for standard use of some classifiers.

In the following sections, some of the most useful functions provided in ClassifyR will be demonstrated. However, a user can provide any feature selection, training, or prediction function to the classification framework, as long as it meets some simple rules about the input and return parameters. See the appendix section of this guide titled “Rules Regarding Input Variables of New Functions” for a description of these.

0.2 Comparison to Existing Classification Frameworks

There are a few other frameworks for classification in R. The table below provides a comparison of which features they offer.

Package | Run User-defined Classifiers | Parallel Execution on any OS | Parameter Tuning | Intel DAAL Performance Metrics | Ranking and Selection Plots | Class Distribution Plot | Sample-wise Error Heatmap | Direct Support for MultiAssayExperiment Input
ClassifyR | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
caret | Yes | Yes | Yes | No | No | No | No | No
MLInterfaces | Yes | No | No | No | No | No | No | No
MCRestimate | Yes | No | Yes | No | No | No | No | No
CMA | No | No | Yes | No | No | No | No | No

0.3 Provided Functionality

Although ClassifyR is primarily a cross-validation framework, the package provides a number of popular feature selection and classification functions which meet its requirements for user-provided functions (see the last section).

0.3.1 Provided Feature Selection and Classification Methods

Functions with names ending in “interface” indicate wrappers for existing methods implemented in other packages. Different methods select different types of changes (i.e. location and/or scale) between classes.

Likewise, a variety of classifiers is also provided.

If a desired selection or classification method is not already implemented, rules for writing functions to work with ClassifyR are outlined in the next section.

0.3.2 Provided Network-based Methods

A number of methods are provided to enable users to perform classification in a network-centric way. The sub-network and meta-feature creation functions should be used before cross-validation is done.

0.3.3 Pre-validation

Pre-validation is an approach that provides a fairer way to compare the benefit of omics data to traditional, freely-available clinical data. For each omics technology considered, training and testing are done on all of the partitions of the data, and the prediction for each sample is simply added as a column to the clinical data table. Then, cross-validation is done as usual. If an omics data set is informative, its prediction column will often be among the selected features.

Pre-validation is activated by specifying params as a named list, with one of the elements named “prevalidated”, which specifies the kind of classification to do on the resultant clinical data table. The classification procedure is typically a logistic regression type, such as elastic net regularised regression. The other list elements must be named so that each name matches an assay in the measurements object, which must be of class MultiAssayExperiment. For example, if a data object had two assays named RNA and protein, as well as some clinical data about the patients, then a suitable specification of params for the function runTests would be:

library(ClassifyR)
resubParams <- ResubstituteParams(nFeatures = 1:10, performanceType = "balanced error", better = "lower")
paramsList <- list(RNA = list(SelectParams(limmaSelection, resubstituteParams = resubParams),
                              TrainParams(DLDAtrainInterface),
                              PredictParams(DLDApredictInterface)),
                   protein = list(SelectParams(limmaSelection, resubstituteParams = resubParams),
                                  TrainParams(DLDAtrainInterface),
                                  PredictParams(DLDApredictInterface)),
                   prevalidated = list(TrainParams(elasticNetGLMtrainInterface, getFeatures = elasticNetFeatures),
                                       PredictParams(elasticNetGLMpredictInterface))
                   )

elasticNetFeatures is a function that extracts the selected features from the trained models; these are simply the variables with a non-zero beta coefficient. Unlike most classifiers, for elastic net GLM the feature selection happens during model training, not independently before it.
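A minimal sketch of how this list might then be passed to runTests, assuming a MultiAssayExperiment named dataContainer whose assays are named RNA and protein and whose colData stores the classes in a column named “class” (any other arguments of the MultiAssayExperiment method, such as which assays to use, are assumed to take sensible defaults):

# Sketch only: 'dataContainer' is a hypothetical MultiAssayExperiment, not created here.
prevalResults <- runTests(dataContainer, datasetName = "Multi-omics",
                          classificationName = "Pre-validation",
                          permutations = 20, folds = 5, seed = 2018,
                          params = paramsList)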

0.4 Case Study : Diagnosing Asthma

To demonstrate some key features of ClassifyR, a data set consisting of the 2000 most variably expressed genes measured on 190 people will be used, so that results are obtained quickly. The journal article corresponding to the data set was published in Scientific Reports in 2018 and is titled “A Nasal Brush-based Classifier of Asthma Identified by Machine Learning Analysis of Nasal RNA Sequence Data”.

data(asthma)
measurements[1:5, 1:5]
##        Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
## HBB        9.72    11.98    12.15    10.60     8.18
## BPIFA1    14.06    13.89    17.44    11.87    15.01
## XIST      12.28     6.35    10.21     6.27    11.21
## FCGR3B    11.42    13.25     7.87    14.75     6.77
## HBA2       7.83     9.42     9.68     8.96     6.43
head(classes)
## [1] No  No  No  No  Yes No 
## Levels: No Yes

The numeric matrix variable measurements stores the normalised values of the RNA gene abundances for each sample and the factor vector classes identifies which class the samples belong to. The measurements were normalised using DESeq2’s varianceStabilizingTransformation function, which produces \(log_2\)-like data.
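As a rough sketch of the kind of preprocessing described (not run here; countsMatrix is a hypothetical matrix of raw RNA-seq counts):

library(DESeq2)
# countsMatrix is a hypothetical matrix of raw counts (genes as rows, samples as
# columns); the variance-stabilising transformation returns log2-like values.
transformedMatrix <- varianceStabilizingTransformation(countsMatrix)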

For more complex data sets with multiple kinds of experiments (e.g. DNA methylation, copy number, gene expression on the same set of samples) a MultiAssayExperiment is recommended for data storage and supported by ClassifyR’s methods.
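For example, a small MultiAssayExperiment might be assembled as follows (a sketch with made-up data; note the colData column named “class”, which is where ClassifyR’s MultiAssayExperiment methods expect the sample classes to be stored, as described in the appendix on input rules):

library(MultiAssayExperiment)
# Made-up example data: two assays measured on the same four samples.
samples <- c("Sample 1", "Sample 2", "Sample 3", "Sample 4")
RNAmatrix <- matrix(rnorm(20), nrow = 5,
                    dimnames = list(paste0("Gene", 1:5), samples))
proteinMatrix <- matrix(rnorm(20), nrow = 5,
                        dimnames = list(paste0("Protein", 1:5), samples))
clinical <- DataFrame(age = c(35, 42, 60, 51),
                      class = factor(c("No", "Yes", "Yes", "No")),
                      row.names = samples)
dataContainer <- MultiAssayExperiment(experiments = ExperimentList(list(RNA = RNAmatrix,
                                                                        protein = proteinMatrix)),
                                      colData = clinical)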

0.4.1 runTests Driver Function of Cross-validated Classification

runTests is the main function in ClassifyR; it handles the sample splitting and, if used, the parallelisation of cross-validation. To begin with, a simple classifier will be demonstrated. It uses a t-test or ANOVA ranking (depending on the number of classes) for feature selection and DLDA for classification. The differentMeansSelection function also uses DLDA to estimate a resubstitution error rate for a range of numbers of top-ranked features, as a heuristic for choosing how many of the top-ranked features are used in the training and prediction stages of classification. This classifier relies on differences in means between classes. No parameters need to be specified, because this is the default classification of runTests.

DMresults <- runTests(measurements, classes, datasetName = "Asthma",
                      classificationName = "Different Means", permutations = 20, folds = 5,
                      seed = 2018, verbose = 1)
DMresults
## An object of class 'ClassifyResult'.
## Data Set Name: Asthma.
## Classification Name: Different Means.
## Feature Selection Name: Difference in Means.
## Features: List of length 20 of lists of length 5 of feature identifiers.
## Validation: 20 Permutations, 5 Folds.
## Predictions: List of data frames of length 20.
## Performance Measures: None calculated yet.

Here, cross-validation with 20 permutations and 5 folds is specified by the values of permutations and folds. For computers with more than 1 CPU, the number of cores to use can be given to runTests with the argument parallelParams. The parameter seed is important to set for reproducibility of results when doing a cross-validation such as this, because the partitioning of samples into folds is randomised. For more details about runTests and the parameter classes used by it, consult their help pages.
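For example, a portable multi-process back-end from BiocParallel might be supplied (a sketch, assuming parallelParams accepts a BiocParallel back-end object):

library(BiocParallel)
# Run the cross-validation with 4 worker processes. SnowParam works on all
# operating systems; MulticoreParam is an alternative on Linux and macOS.
DMresultsParallel <- runTests(measurements, classes, datasetName = "Asthma",
                              classificationName = "Different Means",
                              permutations = 20, folds = 5, seed = 2018,
                              parallelParams = SnowParam(workers = 4))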

0.4.2 Evaluation of a Classification

The most frequently selected gene can be identified using the distribution function and its relative abundance values for all samples can be displayed visually by plotFeatureClasses.

selectionPercentages <- distribution(DMresults, plot = FALSE)
sortedPercentages <- sort(selectionPercentages, decreasing = TRUE)
head(sortedPercentages)
mostChosen <- names(sortedPercentages)[1]
bestGenePlot <- plotFeatureClasses(measurements, classes, mostChosen, dotBinWidth = 0.1,
                                   xAxisLabel = "Normalised Expression")

## allFeatures
##  CLEC2B FAM126A   IFRD1   SDCBP  DUSP10  TM4SF1 
##      98      98      98      97      89      81

The means of the abundance levels of CLEC2B are substantially different between the people with and without asthma. plotFeatureClasses can also plot categorical data, such as may be found in a clinical data table, as a bar chart.

Classification error rates, as well as many other prediction performance measures, can be calculated with calcCVperformance. Next, the balanced error rate is calculated considering all samples, each of which was in the test set once. The balanced error rate is defined as the average of the classification errors of each class.
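As a small worked example with made-up numbers: if 10 of 100 samples in the No class and 27 of 90 samples in the Yes class were misclassified, the balanced error rate would be the average of the two per-class error rates:

# Per-class error rates are averaged, so the smaller class is weighted equally.
errorNo  <- 10 / 100      # error rate for the "No" class
errorYes <- 27 / 90       # error rate for the "Yes" class
(errorNo + errorYes) / 2  # balanced error rate of 0.2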

See the documentation of calcCVperformance for a list of performance metrics which may be calculated.

DMresults <- calcCVperformance(DMresults, "balanced error")
DMresults
## An object of class 'ClassifyResult'.
## Data Set Name: Asthma.
## Classification Name: Different Means.
## Feature Selection Name: Difference in Means.
## Features: List of length 20 of lists of length 5 of feature identifiers.
## Validation: 20 Permutations, 5 Folds.
## Predictions: List of data frames of length 20.
## Performance Measures: Balanced Error Rate.
performance(DMresults)
## $`Balanced Error Rate`
##  [1] 0.2284946 0.2078446 0.1992913 0.2234848 0.2078446 0.1810850 0.1886608 0.1917155 0.2154203 0.1917155 0.1957478 0.2184751
## [13] 0.2104106 0.1886608 0.2108993 0.2300831 0.1871945 0.2033236 0.2073558 0.2113881

The error rate is about 20%. If only a vector of predictions and a vector of actual classes are available, such as from an older study which did not use ClassifyR for cross-validation, then calcExternalPerformance can be used on a pair of factor vectors of the same length.
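For example (a small sketch with made-up factor vectors; the performance type is assumed to be specified in the same way as for calcCVperformance):

# Both vectors must be factors of the same length with the same levels.
actual    <- factor(c("No", "No", "No", "Yes", "Yes"))
predicted <- factor(c("No", "Yes", "No", "Yes", "No"))
calcExternalPerformance(actual, predicted, "balanced error")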

0.4.3 Comparison of Different Classifications

The samplesMetricMap function allows the visual comparison of sample-wise error rate or accuracy measures from different ClassifyResult objects. Firstly, a classifier will be run that uses Kullback-Leibler divergence ranking and resubstitution error as a feature selection heuristic and a naive Bayes classifier for classification. This classification will use features that have either a change in location or in scale between classes.

selectParams <- SelectParams(KullbackLeiblerSelection, resubstituteParams = ResubstituteParams())
trainParams <- TrainParams(naiveBayesKernel)
predictParams <- PredictParams(predictor = NULL, weighted = "weighted",
                               weight = "height difference", returnType = "both")
paramsList <- list(selectParams, trainParams, predictParams)
DDresults <- runTests(measurements, classes, datasetName = "Asthma",
                      classificationName = "Differential Distribution",
                      permutations = 20, folds = 5, seed = 2018,
                      params = paramsList, verbose = 1)
DDresults
## An object of class 'ClassifyResult'.
## Data Set Name: Asthma.
## Classification Name: Differential Distribution.
## Feature Selection Name: Kullback-Leibler Divergence.
## Features: List of length 20 of lists of length 5 of feature identifiers.
## Validation: 20 Permutations, 5 Folds.
## Predictions: List of data frames of length 20.
## Performance Measures: None calculated yet.

The naive Bayes kernel classifier has options specifying how the distances between class densities are used. For more information, consult the documentation of the naiveBayesKernel function.

Now, the classification error for each sample is also calculated for both the differential means and differential distribution classifiers and both ClassifyResult objects generated so far are plotted with samplesMetricMap.

library(grid)
DMresults <- calcCVperformance(DMresults, "sample error")
DDresults <- calcCVperformance(DDresults, "sample error")
resultsList <- list(Abundance = DMresults, Distribution = DDresults)
errorPlot <- samplesMetricMap(resultsList, metric = "error", xAxisLabel = "Sample",
                              showXtickLabels = FALSE, plot = FALSE)
grid.draw(errorPlot)

The benefit of this plot is that it makes it easy to identify samples which are hard to classify, which could be explained by considering additional information about them. Differential distribution class prediction appears to be biased towards the majority class (No Asthma).

The features being ranked and selected in the feature selection stage can be compared within and between classifiers by the plotting functions rankingPlot and selectionPlot. Consider the task of visually representing how consistent the rankings of the top 100 features were for the differential distribution classifier, across all 5 folds of the 20 permutations of the cross-validation.

rankOverlaps <- rankingPlot(list(DDresults), topRanked = 1:100,
                            xLabelPositions = c(1, seq(10, 100, 10)),
                            lineColourVariable = "None", pointTypeVariable = "None",
                            columnVariable = "None", plot = FALSE)
rankOverlaps

The top-ranked features are fairly similar between all pairs of the 100 cross-validation iterations (20 permutations of 5 folds).

For a large cross-validation scheme, such as leave-2-out cross-validation, or when the results list contains many classifications, there are many feature-set comparisons to make. Note that rankingPlot and selectionPlot have a parallelParams option which allows the calculation of feature-set overlaps to be done on multiple processors.
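For example, the rankingPlot call above could be given a multi-worker back-end (a sketch, assuming parallelParams accepts a BiocParallel back-end object, as for runTests):

library(BiocParallel)
# Compute the ranking overlaps with 4 worker processes.
rankOverlaps <- rankingPlot(list(DDresults), topRanked = 1:100,
                            xLabelPositions = c(1, seq(10, 100, 10)),
                            lineColourVariable = "None", pointTypeVariable = "None",
                            columnVariable = "None", plot = FALSE,
                            parallelParams = SnowParam(workers = 4))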

0.5 Using an Independent Test Set

Sometimes, cross-validation is unnecessary. This happens when studies have large sample sizes and are well-designed such that a large number of samples is prespecified to form a test set. The classifier is only trained on the training sample set, and makes predictions only on the test sample set. This can be achieved by using the function runTest directly. See its documentation for required inputs.

0.6 Cross-validating Selected Features on a Different Data Set

Once a cross-validated classification is complete, the usefulness of the features selected may be explored in another data set. previousSelection is a function which takes an existing ClassifyResult object and returns the features that were selected at the equivalent iteration of the cross-validation currently being run. This is necessary because models trained on one data set are not directly transferable to a new data set; the classifier training (e.g. choosing thresholds, fitting model coefficients) is redone.
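A minimal sketch of how this might look, re-using the features chosen in DMresults; the argument names used here (classifyResult, intermediate) are assumptions to be checked against the documentation of previousSelection and SelectParams:

# Sketch only: at each iteration of a new cross-validation, select the features
# that were chosen at the matching iteration of the earlier cross-validation.
previousParams <- SelectParams(previousSelection, intermediate = ".iteration",
                               classifyResult = DMresults)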

0.7 Generating a ROC Plot

Some classifiers can be set to output scores or probabilities representing how likely a sample is to be from one of the classes, rather than class labels. This enables different score thresholds to be tried, to generate pairs of false positive and false negative rates. The naive Bayes classifier used previously had its returnType parameter set to “both”, so class predictions and scores were both stored in the classification result. In this case, a data.frame with two columns (named “class” and “score”) is returned by the classifier to the framework. Setting returnType to “score” is also sufficient to generate a ROC plot. Many existing classifiers in other R packages also have an option that allows a score or probability to be calculated.
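For instance, the prediction parameters used earlier for the differential distribution classifier could request only scores, which is still sufficient for ROCplot (the same settings as before, with only returnType changed):

# Only scores are returned to the framework; class labels are not stored.
predictParamsScores <- PredictParams(predictor = NULL, weighted = "weighted",
                                     weight = "height difference", returnType = "score")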

ROCcurves <- ROCplot(list(DDresults), fontSizes = c(24, 12, 12, 12, 12))

This ROC plot shows the classifiability of the asthma data set is high. Other included functions which can output scores are fisherDiscriminant, DLDApredictInterface, and SVMpredictInterface.

0.8 Parameter Tuning

Some classifiers allow the setting of a tuning parameter, which controls some aspect of their model learning. An example of doing parameter tuning with a linear SVM is presented. This particular SVM has a single tuning parameter, the cost. Higher values of this parameter penalise misclassifications more.

This is achieved in ClassifyR by providing a variable called tuneParams to the TrainParams container constructor. tuneParams is a named list, with the names being the names of the tuning variables, and the contents being vectors of values to try. The parameter tuneOptimise is a character vector of length two for specifying the criterion for choosing the best tuning parameter(s). Any of the non-sample-specific performance metrics which calcCVperformance calculates can be optimised.

selectParams <- SelectParams(differentMeansSelection, resubstituteParams = ResubstituteParams())
trainParams <- TrainParams(SVMtrainInterface, kernel = "linear",
                           tuneParams = list(cost = c(0.01, 0.1, 1, 10)),
                           tuneOptimise = c(metric = "balanced error", better = "lower"))
predictParams <- PredictParams(SVMpredictInterface)
SVMresults <- runTests(measurements, classes, datasetName = "Asthma",
                       classificationName = "Tuned SVM", permutations = 20, folds = 5, seed = 2018,
                       params = list(selectParams, trainParams, predictParams)
                       )

The chosen values of the parameters are stored for every validation, and can be accessed with the tunedParameters function.

length(tunedParameters(SVMresults))
## [1] 20
tunedParameters(SVMresults)[[1]]
## [[1]]
## [[1]]$cost
## [1] 10
## 
## 
## [[2]]
## [[2]]$cost
## [1] 10
## 
## 
## [[3]]
## [[3]]$cost
## [1] 10
## 
## 
## [[4]]
## [[4]]$cost
## [1] 10
## 
## 
## [[5]]
## [[5]]$cost
## [1] 10

The cost value of 10 is chosen in all of the folds of the first sample permutation.
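If desired, the tuned SVM's error rate can be summarised in the same way as for the earlier classifiers, so that the three classifications can be compared:

SVMresults <- calcCVperformance(SVMresults, "balanced error")
performance(SVMresults)[["Balanced Error Rate"]]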

0.9 Summary

ClassifyR is a framework for cross-validated classification that provides a variety of unique functions for performance evaluation. It provides wrappers for many popular classifiers but is designed to be extensible if other classifiers are desired.

0.10 Appendix: Common params Specifications for Common Classifications

These sets of parameters can simply be specified as the params parameter to runTest or runTests.

# Differential expression of counts: edgeR ranking with Poisson LDA classification
# (classifyInterface wraps the PoiClaClu package).
selectParams <- SelectParams(edgeRselection, resubstituteParams = ResubstituteParams())
trainParams <- TrainParams(classifyInterface)
predictParams <- PredictParams(NULL)
params <- list(selectParams, trainParams, predictParams)

# Differential variability: median-centring of the training set, Bartlett's test
# ranking and a Fisher discriminant.
transformParams <- TransformParams(subtractFromLocation, intermediate = "training",
                                   location = "median")
selectParams <- SelectParams(bartlettSelection,
                             resubstituteParams = ResubstituteParams())
trainParams <- TrainParams(fisherDiscriminant)
predictParams <- PredictParams(NULL)
params <- list(transformParams, selectParams, trainParams, predictParams)

# Differential distribution: Kullback-Leibler divergence ranking and a kernel
# naive Bayes classifier.
selectParams <- SelectParams(KullbackLeiblerSelection,
                             resubstituteParams = ResubstituteParams())
trainParams <- TrainParams(naiveBayesKernel)
predictParams <- PredictParams(NULL)
params <- list(selectParams, trainParams, predictParams)

# Nearest shrunken centroid (pamr package), which selects features as part of
# training, so selection happens after training (intermediate = "trained").
trainParams <- TrainParams(NSCtrainInterface)
selectParams <- SelectParams(NSCselectionInterface, intermediate = "trained")
predictParams <- PredictParams(NSCpredictInterface)
params <- list(trainParams, selectParams, predictParams)

# Random forest; forestFeatures extracts the features used by the trained forest.
trainParams <- TrainParams(randomForestTrainInterface, ntree = 100, getFeatures = forestFeatures)
predictParams <- PredictParams(randomForestPredictInterface)
params <- list(trainParams, predictParams)

# Support vector machine with a linear kernel, preceded by a difference-in-means ranking.
selectParams <- SelectParams(differentMeansSelection, resubstituteParams = ResubstituteParams())
trainParams <- TrainParams(SVMtrainInterface, kernel = "linear")
predictParams <- PredictParams(SVMpredictInterface)
params <- list(selectParams, trainParams, predictParams)

# Network-based classification: sub-network correlation selection with a kernel
# naive Bayes classifier.
resubstituteParams <- ResubstituteParams(nFeatures = 1:10, # The top 1 to 10 sub-networks.
                                         performanceType = "balanced error", better = "lower")
selectParams <- SelectParams(networkCorrelationsSelection, resubstituteParams = resubstituteParams)
trainParams <- TrainParams(naiveBayesKernel)
predictParams <- PredictParams(NULL)
params <- list(selectParams, trainParams, predictParams)
metaFeatures <- interactorDiffsTable # Creation by the interactorDifferences function is suggested.
featureSets <- networkSets # An object of class FeatureSetCollection.

0.11 Appendix: Rules Regarding Input Variables of New Functions

The required inputs and the type of output of each stage of classification are summarised by the table below. The functions can have any number of other arguments after the set of arguments which are mandatory.

The argument verbose is sent from runTest to these functions so they must handle it, even if not explicitly using it. In the ClassifyR framework, verbose is a number which indicates the amount of progress messages to be printed. If verbose is 0, no progress messages are printed. If it is 1, only one message is printed for every 10 cross-validations completed. If it is 2, in addition to the messages printed when it is 1, a message is printed each time one of the stages of classification (transformation, feature selection, training, prediction) is done. If it is 3, in addition to the messages printed for values 1 and 2, progress messages are printed from within the classification functions themselves.

A version of each included transformation, selection, training and prediction function is typically implemented for:

1. a numeric matrix with features as rows and samples as columns (a common data storage convention in bioinformatics), together with a factor vector whose length equals the number of columns of the matrix;
2. a DataFrame with samples as rows and features as columns, where the features may be of different data types (i.e. categorical and numeric), together with a class specification;
3. a MultiAssayExperiment which stores the sample class information in the colData slot’s DataFrame, in a column named “class”.

Inputs of types 1 and 3 are converted to a DataFrame, because the other data types can be stored as a DataFrame without loss of information and the transformation, selection and classification functions which accept a DataFrame contain the code that does the actual computations. At a minimum, a new function must have a method taking a DataFrame as input, with the sample classes either stored in a column named “class” or provided as a factor vector. Although not required, also providing a version of the function that accepts a numeric matrix with an accompanying factor vector, and another that accepts a MultiAssayExperiment, gives flexibility regarding input data. See the code of existing functions in the package for examples of this, if intending to implement novel classification-related functions to be used with ClassifyR.

0.12 References

Strbenac D., Yang, J., Mann, G.J. and Ormerod, J. T. (2015) ClassifyR: an R package for performance assessment of classification with applications to transcriptomics, Bioinformatics, 31(11):1851-1853
Strbenac D., Mann, G.J., Yang, J. and Ormerod, J. T. (2016) Differential distribution improves gene selection stability and has competitive classification performance for patient survival, Nucleic Acids Research, 44(13):e119