---
title: 'MLInterfaces 2.0 -- a new design'
author:
- name: "VJ Carey"
- name: "P. Atieno"
  affiliation: "Vignette translation from Sweave to Rmarkdown / HTML"
package: MLInterfaces
vignette: >
  %\VignetteIndexEntry{MLInterfaces 2.0 -- a new design}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document
---
Introduction
============
MLearn, the workhorse method of MLInterfaces, has been streamlined to
support simpler development.
In the 1.* series, MLearn included a substantial switch statement, and
the external learning function was identified by a string. Many
data-massaging tasks were wrapped up in switch-case elements devoted to
each method. MLearn returned instances of MLOutput, but these had
complicated subclasses.
MLearn now takes the signature `c("formula", "data.frame",
"learnerSchema", "numeric")`, with the expectation that extra
parameters captured in `...` are passed to the fitting function. The
complexity of dealing with the expectations and return values of
different machine learning functions is handled primarily by the
learnerSchema instances. The basic realizations are that

- most learning functions use the formula/data idiom, and additional
  parameters can go in `...`
- the problem of converting from the function's output structures
  (typically lists, but sometimes also objects with attributes) to the
  uniform structure delivered by MLearn should be handled as
  generically as possible, but specialization will typically be needed
- the conversion process can be handled in most cases using only the
  native R object returned by the learning function, the data, and the
  training index set
- some functions, like `knn`, are so idiosyncratic (lacking a formula
  interface or predict method) that special software is needed to
  adapt MLearn to work with them

Thus we have defined a learnerSchema class,
```{r setup, message=FALSE}
library(MLInterfaces)
library(gbm)
getClass("learnerSchema")
```
along
with a constructor used to define a family of schema objects that help
MLearn carry out specific tasks of learning.
Some examples
=============
We define interface schema instances with suffix "I".
randomForest has a simple converter:
```{r lkrf}
randomForestI@converter
```
The job of the converter is to populate as much of the
classifierOutput instance as possible. For something like nnet, we can
do more:
```{r lknn}
nnetI@converter
```
We can get posterior class
probabilities.
To obtain the predictions needed to compute a confusion matrix, we may
need the converter to know about parameters used in the fit. Here,
closures are used.
```{r lkknn}
knnI(k=3, l=2)@converter
```
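
The pattern is ordinary R lexical scoping. Here is a minimal sketch,
with a hypothetical `makeKConverter` factory (the shipped `knnI`
differs in detail), showing how a parameter bound at factory-call time
stays visible inside the converter:

```{r closureSketch}
# hypothetical factory: k is captured in the closure's environment, so
# the returned converter can consult it when MLearn eventually calls it
makeKConverter = function(k) {
  function(obj, data, trainInd) {
    message("converting a fit that used k = ", k)
    # a real converter would construct and return a classifierOutput here
  }
}
cv3 = makeKConverter(3)
environment(cv3)$k  # the captured parameter
```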
So we can have the following calls:
```{r show, message=FALSE}
library(MASS)
data(crabs)
kp = sample(1:200, size=120)
rf1 = MLearn(sp~CL+RW, data=crabs, randomForestI, kp, ntree=100)
rf1
RObject(rf1)
knn1 = MLearn(sp~CL+RW, data=crabs, knnI(k=3,l=2), kp)
knn1
```
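
Held-out performance for either fit can then be tabulated with
confuMat:

```{r confDemo}
confuMat(rf1)
confuMat(knn1)
```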
Making new interfaces
=====================
A simple example: ada
---------------------
The ada method of the *ada* package has a formula interface and a
predict method. We can create a learnerSchema on the fly, and then use
it:
```{r mkadaI, message=FALSE}
adaI = makeLearnerSchema("ada", "ada", standardMLIConverter )
arun = MLearn(sp~CL+RW, data=crabs, adaI, kp )
confuMat(arun)
RObject(arun)
```
What is the standardMLIConverter?
```{r lks}
standardMLIConverter
```
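
Any converter obeys the same contract: it is called with the native
fitted object, the full data.frame, and the training index vector, and
it returns a classifierOutput. As a minimal hand-rolled sketch
(assuming the native object has a predict method that yields a factor;
real converters also fill trainPredictions and score slots):

```{r mycvt}
myConverter = function(obj, data, trainInd) {
  teData = data[-trainInd, ]   # held-out rows
  tepr = predict(obj, teData)  # native predictions on the test set
  new("classifierOutput", testPredictions = factor(tepr), RObject = obj)
}
```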
Dealing with gbm
----------------
The *gbm* package's workhorse fitter is `gbm`. The formula input must
have a numeric response, and the predict method returns only a numeric
vector. There is also no namespace. We introduced a gbm2 function
```{r lkggg}
gbm2
```
that requires a two-level factor response and recodes it for use by
gbm. It also returns an S3 object of the newly defined class gbm2, for
which the predict method returns a factor. At this stage we could use a
standard interface, but the prediction values would be unpleasant to
work with. Furthermore, the predict method requires specification of
n.trees, so we pass a parameter n.trees.pred.
```{r tryg}
BgbmI
set.seed(1234)
gbrun = MLearn(sp~CL+RW+FL+CW+BD, data=crabs, BgbmI(n.trees.pred=25000,thresh=.5), kp, n.trees=25000, distribution="bernoulli", verbose=FALSE )
gbrun
confuMat(gbrun)
summary(testScores(gbrun))
```
Additional features
===================
The xvalSpec class allows us to specify types of cross-validation, and
to control carefully how partitions are formed. More details are
provided in the MLprac2_2 vignette.
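
For example, supplying an xvalSpec in place of the training index
vector requests cross-validation; leave-one-out is the simplest form:

```{r xvdemo}
loo = MLearn(sp~CL+RW, data=crabs, knnI(k=3, l=2), xvalSpec("LOO"))
confuMat(loo)
```
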
The MLearn approach to clustering and other forms of unsupervised learning
==========================================================================
A learner schema for a clustering method needs to specify the feature
distance measure clearly. We will experiment here. Our main
requirements are

- ExpressionSets are the basic input objects
- the typical formula interface would be `~.`, but one can imagine
  cases where a factor from phenoData is specified as a 'response' to
  color items, and this will be allowed
- a clusteringOutput class will need to be defined to contain the
  results, and it will propagate the result object from the native
  learning method.
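
As a hedged sketch of the intended call pattern, assuming the kmeansI
schema shipped with current MLInterfaces and the formula method for
unsupervised MLearn (the result container is still evolving along the
lines above):

```{r clusterSketch}
km1 = MLearn(~CW+RW+CL+FL+BD, data=crabs, kmeansI, centers=5)
km1
```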