---
title: "Creating a Wrapper Function"
author: Dario Strbenac
        The University of Sydney, Australia.
output:
    BiocStyle::html_document:
        toc: true
vignette: >
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteIndexEntry{Example: Creating a Wrapper Function for the k-NN Classifier}
---

```{r, echo = FALSE, results = "asis"}
options(width = 130)
```

## Introduction

**ClassifyR** is a *framework* for cross-validated classification. The rules that functions must follow to be used with it are shown in the table below, followed by a fully worked example of how to incorporate an existing classifier into the framework. Functions can have any number of other arguments after the set of mandatory arguments.

```{r, echo = FALSE}
htmltools::img(src = knitr::image_uri("functionRules.png"),
               style = 'margin-left: auto;margin-right: auto')
```

## *k* Nearest Neighbours

There is an implementation of the *k* Nearest Neighbours algorithm in the package **class**. Its function has the form `knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)`. It accepts a `matrix` or a `data.frame` variable as input, but **ClassifyR** calls transformation, feature selection and classifier functions with a `DataFrame`, a core Bioconductor data container from [S4Vectors](https://bioconductor.org/packages/release/bioc/html/S4Vectors.html). **ClassifyR** also expects the training data to be the first parameter, its classes to be the second and the test data to be the third. Therefore, a wrapper which accepts a `DataFrame` and reorders the parameters is created.

```{r, eval = FALSE}
setGeneric("kNNinterface", function(measurementsTrain, ...) standardGeneric("kNNinterface"))

setMethod("kNNinterface", "DataFrame", function(measurementsTrain, classesTrain, measurementsTest, ..., verbose = 3)
{
  # Separate the classes from the measurements, in case they are stored together.
  splitDataset <- .splitDataAndOutcomes(measurementsTrain, classesTrain)
  classesTrain <- splitDataset[["outcomes"]]
  measurementsTrain <- splitDataset[["measurements"]]

  # Keep only the numeric variables of the training and test tables.
  isNumeric <- sapply(measurementsTrain, is.numeric)
  measurementsTrain <- measurementsTrain[, isNumeric, drop = FALSE]
  measurementsTest <- measurementsTest[, isNumeric, drop = FALSE]

  if(!requireNamespace("class", quietly = TRUE))
    stop("The package 'class' could not be found. Please install it.")
  if(verbose == 3)
    message("Fitting k Nearest Neighbours classifier to data and predicting classes.")

  # knn expects the training data, then the test data, then the training classes.
  class::knn(as.matrix(measurementsTrain), as.matrix(measurementsTest), classesTrain, ...)
})
```

The function only emits a progress message if `verbose` is 3. The verbosity levels are explained in the introductory vignette. `.splitDataAndOutcomes` is an internal function in **ClassifyR** which ensures that outcomes are not in `measurements` when model training happens. If `classesTrain` is a factor vector, then the function has no effect. If `classesTrain` is the character name of a column in `measurementsTrain`, that column is removed from the table and returned as a separate variable. The `...` parameter captures any options to be passed on to `knn`, such as `k` (number of neighbours considered) and `l` (minimum vote for a definite decision). The function is also defensive and removes any non-numeric columns from the input tables.

**ClassifyR** also accepts a `matrix` and a `MultiAssayExperiment` as input. Provide convenience methods for these inputs which convert them into a `DataFrame`. In this way, only the `DataFrame` version of `kNNinterface` does the classification.

```{r, eval = FALSE}
setMethod("kNNinterface", "matrix", function(measurementsTrain, classesTrain, measurementsTest, ...)
{
  # Transpose so that samples are in rows, as a DataFrame stores them.
  kNNinterface(DataFrame(t(measurementsTrain), check.names = FALSE),
               classesTrain,
               DataFrame(t(measurementsTest), check.names = FALSE), ...)
})

setMethod("kNNinterface", "MultiAssayExperiment", function(measurementsTrain, measurementsTest, targets = names(measurementsTrain), classesTrain, ...)
{
  tablesAndClasses <- .MAEtoWideTable(measurementsTrain, targets, classesTrain)
  trainingTable <- tablesAndClasses[["dataTable"]]
  classes <- tablesAndClasses[["outcomes"]]
  testingTable <- .MAEtoWideTable(measurementsTest, targets)

  .checkVariablesAndSame(trainingTable, testingTable)
  kNNinterface(trainingTable, classes, testingTable, ...)
})
```

The `matrix` method simply transposes the input matrices, which **ClassifyR** expects to store features in the rows and samples in the columns (customary in bioinformatics), and casts them to a `DataFrame`. The call then dispatches to the `kNNinterface` method for a `DataFrame`, which carries out the classification.

The conversion of a `MultiAssayExperiment` is more complicated. **ClassifyR** has an internal function `.MAEtoWideTable` which converts a `MultiAssayExperiment` to a wide `DataFrame`. `targets` specifies which assays to include in the conversion. By default, it also filters the resultant table to contain only numeric variables. The internal validity function `.checkVariablesAndSame` checks that there is at least 1 column after filtering and that the training and testing tables have the same number of variables.

## Verifying the Implementation

Create a data set with 10 samples and 10 features with a clear difference between the two classes. Run leave-one-out cross-validation.

```{r, message = FALSE}
classes <- factor(rep(c("Healthy", "Disease"), each = 5), levels = c("Healthy", "Disease"))
measurements <- matrix(c(rnorm(50, 10), rnorm(50, 5)), ncol = 10)
colnames(measurements) <- paste("Sample", 1:10)
rownames(measurements) <- paste("mRNA", 1:10)

library(ClassifyR)
knnParams <- ModellingParams(selectParams = NULL, trainParams = TrainParams(kNNinterface),
                             predictParams = NULL)
CVparams <- CrossValParams("Leave-k-Out", leave = 1)
classified <- runTests(measurements, classes, CVparams, knnParams)
classified
cbind(predictions(classified), known = actualOutcomes(classified))
```

`NULL` is specified for `predictParams` because one function does both training and prediction. As expected for this easy task, the classifier predicts all samples correctly.

## Appendix: Rules Regarding Input Variables of New Functions

The argument *verbose* is sent from *runTest* to these functions, so they must handle it, even if not explicitly using it.
In the **ClassifyR** framework, *verbose* is a number which indicates how many progress messages are printed. If *verbose* is 0, no progress messages are printed. If it is 1, only one message is printed for every 10 cross-validations completed. If it is 2, in addition to the messages printed when it is 1, a message is printed each time one of the stages of classification (transformation, feature selection, training, prediction) is done. If it is 3, in addition to the messages printed for values 1 and 2, progress messages are printed from within the classification functions themselves.

A version of each included transformation, selection, training and prediction function is typically implemented for (1) a numeric matrix for which the rows are for features and the columns are for samples (a data storage convention in bioinformatics), accompanied by a factor vector of the same length as the number of columns of the matrix, (2) a *DataFrame* where the columns are naturally for the features, possibly of different data types (i.e. categorical and numeric), and the rows are for samples, accompanied by a class specification, and (3) a *MultiAssayExperiment* which stores sample class information in the *colData* slot's *DataFrame* with column name "class". Inputs (1) and (3), which are not a *DataFrame*, are converted to one, because the other data types can be stored as a *DataFrame* without loss of information and the transformation, selection and classification functions which accept a *DataFrame* contain the code to do the actual computations. At a minimum, a new function must have a method taking a *DataFrame* as input with the sample classes either stored in a column named "class" or provided as a factor vector. Although not required, providing a version of a function that accepts a numeric matrix with an accompanying factor vector and another version that accepts a *MultiAssayExperiment* is desirable to provide flexibility regarding input data.
If intending to implement novel classification-related functions to be used with **ClassifyR**, see the code of existing functions in the package for examples of this.
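
The rules in this appendix can be illustrated with a minimal sketch that runs without any Bioconductor packages. It is only an illustration under stated assumptions, not code from **ClassifyR**: the function name `nearestNeighbourSketch` is hypothetical, a base R `data.frame` stands in for a `DataFrame`, a hand-written 1-nearest-neighbour rule replaces `class::knn`, and the measurements are made up. It shows the mandatory argument order, the two ways of specifying classes, the removal of non-numeric columns and the handling of *verbose*.

```r
# Hypothetical stand-in for a wrapper function, using base R only.
# Assumption: a plain data.frame plays the role of the S4Vectors DataFrame.
nearestNeighbourSketch <- function(measurementsTrain, classesTrain, measurementsTest,
                                   ..., verbose = 3)
{
  # Extra arguments in ... are accepted but ignored in this sketch.
  # Classes may be given as the name of a column in the training table.
  if(is.character(classesTrain) && length(classesTrain) == 1)
  {
    classColumn <- classesTrain
    classesTrain <- factor(measurementsTrain[[classColumn]])
    keep <- colnames(measurementsTrain) != classColumn
    measurementsTrain <- measurementsTrain[, keep, drop = FALSE]
  }

  # Keep only numeric features, using the training table to decide which.
  isNumeric <- vapply(measurementsTrain, is.numeric, logical(1))
  trainMatrix <- as.matrix(measurementsTrain[, isNumeric, drop = FALSE])
  testMatrix <- as.matrix(measurementsTest[, colnames(trainMatrix), drop = FALSE])

  # Progress messages from within the function are only printed at level 3.
  if(verbose == 3)
    message("Predicting classes with a 1-nearest-neighbour rule.")

  # For each test sample, find the training sample with the smallest
  # squared Euclidean distance and return its class.
  nearest <- apply(testMatrix, 1, function(testSample)
    which.min(colSums((t(trainMatrix) - testSample)^2)))
  classesTrain[nearest]
}

# Made-up data: samples in rows, classes stored in a column named "class".
train <- data.frame(
  geneA = c(10.2, 9.8, 10.5, 9.6, 10.1, 5.3, 4.7, 5.1, 4.9, 5.4),
  geneB = c(9.9, 10.4, 10.0, 9.7, 10.3, 5.0, 5.2, 4.6, 5.5, 4.8),
  batch = rep(c("x", "y"), times = 5),
  class = rep(c("Healthy", "Disease"), each = 5))
test <- data.frame(geneA = c(10.0, 5.0), geneB = c(10.0, 5.0), batch = c("y", "x"))

predicted <- nearestNeighbourSketch(train, "class", test, verbose = 0)
```

Passing `train[["class"]]` converted to a factor as the second argument, together with a table that lacks the class column, follows the same path after the first `if` block. A real wrapper would be written as an S4 method for `DataFrame`, as in the worked example in this vignette.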