---
title: "Machine learning pitfalls"
author:
-   name: "Jakob Wirbel and Georg Zeller"
    affiliation: "EMBL Heidelberg"
    email: "georg.zeller@embl.de"
date: "Date last modified: 2020-11-07"
output: BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{SIAMCAT ML pitfalls}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{ASCII}
---

```{r,include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```

# About This Vignette

In this vignette, we want to explore two pitfalls for machine learning analysis
that can lead to overly optimistic performance estimates.  
When setting up cross-validation workflows, the main objective is usually to 
estimate how well a trained model would perform on external data, which is
specifically important when considering biomarker discovery. However, more 
complex workflows involving feature selection or time-course data can be
challenging to setup correctly. Incorrect workflows in which information leaks
from the test to the training data can lead to overfitting and poor 
generalization to external datasets.  
Here, we focus on supervised feature selection and the naive splitting of 
dependent data.

## Setup

First, we load the packages needed to perform the analyses.

```{r load, message=FALSE, warning=FALSE}
library("tidyverse")
library("SIAMCAT")
```


# Supervised Feature Selection

Supervised feature selection means that the label information is taken into
account before the cross-validation split. Within this procedure, the features
are selected if they are associated with the label (for example after 
differential abundance testing), using the complete dataset for the calculation
of feature association and leaving no data aside for unbiased model 
evaluation.  
A correct way to perform feature selection would be to nest the selection step
into the cross-validation procedure. That means that the calculation of 
feature association is performed for each training fold separately.


## Load the Data

As an example, we are going to use two datasets of colorectal cancer (CRC) 
which are available through the `curatedMetagenomicData` package.     
Since the model trainig procedure takes a long time, this vignette is not 
evaluated upon build of the package, but if you execute the code chunks for 
yourself, you should get similar results.

```{r load_curatedMetagenomicsData, eval=FALSE}
library("curatedMetagenomicData")
```

First, we are going to load the dataset from 
[Thomas et al](https://doi.org/10.1038/s41591-019-0405-7)
as training dataset.

```{r load_data_thomas, eval=FALSE}
x <- 'ThomasAM_2018a.metaphlan_bugs_list.stool'
feat.t <- curatedMetagenomicData(x=x, dryrun=FALSE)
feat.t <- feat.t[[x]]@assayData$exprs
# clean up metaphlan profiles to contain only species-level abundances
feat.t <- feat.t[grep(x=rownames(feat.t), pattern='s__'),]
feat.t <- feat.t[grep(x=rownames(feat.t),pattern='t__', invert = TRUE),]
stopifnot(all(colSums(feat.t) != 0))
feat.t <- t(t(feat.t)/100)
```

As an external dataset, we are going to use the data from 
[Zeller et al.](https://doi.org/10.15252/msb.20145645).

```{r load_data_zeller, eval=FALSE}
x <- 'ZellerG_2014.metaphlan_bugs_list.stool'
feat.z <- curatedMetagenomicData(x=x, dryrun=FALSE)
feat.z <- feat.z[[x]]@assayData$exprs
# clean up metaphlan profiles to contain only species-level abundances
feat.z <- feat.z[grep(x=rownames(feat.z), pattern='s__'),]
feat.z <- feat.z[grep(x=rownames(feat.z),pattern='t__', invert = TRUE),]
stopifnot(all(colSums(feat.z) != 0))
feat.z <- t(t(feat.z)/100)
```

We can also extract the corresponding metadata from the `combined_metadata` 
object which is part of the `curatedMetagenomicData` package.

```{r metadata, eval=FALSE}
meta.t <- combined_metadata %>% 
    filter(dataset_name == 'ThomasAM_2018a') %>% 
    filter(study_condition %in% c('control', 'CRC'))
rownames(meta.t) <- meta.t$sampleID
meta.z <- combined_metadata %>% 
    filter(dataset_name == 'ZellerG_2014') %>% 
    filter(study_condition %in% c('control', 'CRC'))
rownames(meta.z) <- meta.z$sampleID
```

The MetaPhlAn2 profiler used for the profiles outputs only species which are
present in the dataset. Therefore, we can have the case that there are species 
in the matrix for `ThomasAM_2018` which are not present in the matrix for 
`ZellerG_2014` and vice verse. In order to use them as training and external
test set for `SIAMCAT`, we have to first make sure that the set of
features for both datasets overlap completely (see also the **Holdout Testing
with SIAMCAT** vignette).

```{r combine features, eval=FALSE}
species.union <- union(rownames(feat.t), rownames(feat.z))
# add Zeller_2014-only species to the Thomas_2018 matrix
add.species <- setdiff(species.union, rownames(feat.t))
feat.t <- rbind(feat.t, 
            matrix(0, nrow=length(add.species), ncol=ncol(feat.t),
                dimnames = list(add.species, colnames(feat.t))))

# add Thomas_2018-only species to the Zeller_2014 matrix
add.species <- setdiff(species.union, rownames(feat.z))
feat.z <- rbind(feat.z, 
            matrix(0, nrow=length(add.species), ncol=ncol(feat.z),
                dimnames = list(add.species, colnames(feat.z))))
```

Now, we are ready to start the model training process. For this, we chose 
three different feature selection cutoffs and prepare a tibble to hold the
results:

```{r setup_2}
fs.cutoff <- c(20, 100, 250)

auroc.all <- tibble(cutoff=character(0), type=character(0), 
                    study.test=character(0), AUC=double(0))
```

## Train Model without Feature Selection

First, we will train a model without any feature selection, using all the 
available features. We add it to the results matrix twice (both with `correct`
and `incorrect`) for easier plotting later.

```{r train_full, eval=FALSE}
sc.obj.t <- siamcat(feat=feat.t, meta=meta.t,
                    label='study_condition', case='CRC')
sc.obj.t <- filter.features(sc.obj.t, filter.method = 'prevalence',
                            cutoff = 0.01)
sc.obj.t <- normalize.features(sc.obj.t, norm.method = 'log.std',
                                norm.param=list(log.n0=1e-05, sd.min.q=0))
sc.obj.t <- create.data.split(sc.obj.t,
                                num.folds = 10, num.resample = 10)
sc.obj.t <- train.model(sc.obj.t, method='lasso')
sc.obj.t <- make.predictions(sc.obj.t)
sc.obj.t <- evaluate.predictions(sc.obj.t)

auroc.all <- auroc.all %>% 
    add_row(cutoff='full', type='correct', 
            study.test='Thomas_2018', 
            AUC=as.numeric(sc.obj.t@eval_data$auroc)) %>% 
    add_row(cutoff='full', type='incorrect', study.test='Thomas_2018', 
            AUC=as.numeric(sc.obj.t@eval_data$auroc)) 
```

We then also apply the model to the external dataset and record the 
generalization to another dataset:
```{r ext_val_full, eval=FALSE}
sc.obj.z <- siamcat(feat=feat.z, meta=meta.z,
                    label='study_condition', case='CRC')
sc.obj.z <- make.predictions(sc.obj.t, sc.obj.z)
sc.obj.z <- evaluate.predictions(sc.obj.z)
auroc.all <- auroc.all %>% 
    add_row(cutoff='full', type='correct', 
            study.test='Zeller_2014', 
            AUC=as.numeric(sc.obj.z@eval_data$auroc)) %>% 
    add_row(cutoff='full', type='incorrect', 
            study.test='Zeller_2014', 
            AUC=as.numeric(sc.obj.z@eval_data$auroc)) 
```

## Incorrect Procedure: Train with Supervised Feature Selection

For the incorrect feature selection procedure, we can test the features for
differential abundance, using the complete dataset, and then chose the top 
associated features.


```{r train_global, eval=FALSE}
sc.obj.t <- check.associations(sc.obj.t, detect.lim = 1e-05,
                                fn.plot = 'assoc_plot.pdf')
mat.assoc <- associations(sc.obj.t)
mat.assoc$species <- rownames(mat.assoc)
# sort by p-value
mat.assoc <- mat.assoc %>% as_tibble() %>% arrange(p.val)
```

Based on the P values from the `check.association` function, we now
chose `X` number of features on which to train the model.

```{r train_global_2, eval=FALSE}
for (x in fs.cutoff){
    # select x number of features based on p-value ranking
    feat.train.red <- feat.t[mat.assoc %>%
                                slice(seq_len(x)) %>%
                                pull(species),]
    sc.obj.t.fs <- siamcat(feat=feat.train.red, meta=meta.t,
                            label='study_condition', case='CRC')
    # normalize the features without filtering
    sc.obj.t.fs <- normalize.features(sc.obj.t.fs, norm.method = 'log.std',
        norm.param=list(log.n0=1e-05,sd.min.q=0),feature.type = 'original')
    # take the same cross validation split as before
    data_split(sc.obj.t.fs) <- data_split(sc.obj.t)
    # train
    sc.obj.t.fs <- train.model(sc.obj.t.fs, method = 'lasso')
    # make predictions
    sc.obj.t.fs <- make.predictions(sc.obj.t.fs)
    # evaluate predictions and record the result
    sc.obj.t.fs <- evaluate.predictions(sc.obj.t.fs)
    auroc.all <- auroc.all %>% 
        add_row(cutoff=as.character(x), type='incorrect', 
                study.test='Thomas_2018',
                AUC=as.numeric(sc.obj.t.fs@eval_data$auroc))
    # apply to the external dataset and record the result
    sc.obj.z <- siamcat(feat=feat.z, meta=meta.z,
                        label='study_condition', case='CRC')
    sc.obj.z <- make.predictions(sc.obj.t.fs, sc.obj.z)
    sc.obj.z <- evaluate.predictions(sc.obj.z)
    auroc.all <- auroc.all %>% 
        add_row(cutoff=as.character(x), type='incorrect', 
                study.test='Zeller_2014', 
                AUC=as.numeric(sc.obj.z@eval_data$auroc))
}
```

## Correct Procedure: Train with Nested Feature Selection

Feature selection can be performed correctly if it is nested within the
cross-validation procedure. We can do it using `SIAMCAT` by specifying the 
`perform.fs` parameter in the `train.model` function. 

```{r train_nested, eval=FALSE}
for (x in fs.cutoff){
    # train using the original SIAMCAT object 
    # with correct version of feature selection
    sc.obj.t.fs <- train.model(sc.obj.t, method = 'lasso', perform.fs = TRUE,
        param.fs = list(thres.fs = x,method.fs = "AUC",direction='absolute'))
    # make predictions
    sc.obj.t.fs <- make.predictions(sc.obj.t.fs)
    # evaluate predictions and record the result
    sc.obj.t.fs <- evaluate.predictions(sc.obj.t.fs)
    auroc.all <- auroc.all %>% 
        add_row(cutoff=as.character(x), type='correct', 
                study.test='Thomas_2018',
                AUC=as.numeric(sc.obj.t.fs@eval_data$auroc))
    # apply to the external dataset and record the result
    sc.obj.z <- siamcat(feat=feat.z, meta=meta.z,
                        label='study_condition', case='CRC')
    sc.obj.z <- make.predictions(sc.obj.t.fs, sc.obj.z)
    sc.obj.z <- evaluate.predictions(sc.obj.z)
    auroc.all <- auroc.all %>% 
        add_row(cutoff=as.character(x), type='correct', 
                study.test='Zeller_2014', 
                AUC=as.numeric(sc.obj.z@eval_data$auroc))
}
```

## Plot the Results

Now, we can plot the resulting performance estimates for the cross-validation
and the external validation as well:

```{r data, echo=FALSE}
auroc.all <- tibble(
    cutoff=rep(rep(c('20', '100', '250', 'full'), each=2), 2),
    type=rep(c('incorrect', 'correct'), 8),
    study.test=rep(c('Thomas_2018', 'Zeller_2014'), each=8),
    AUC=c(0.809, 0.608, 0.812, 0.659, 0.727, 0.678, 0.677, 0.677,
            0.620, 0.688, 0.694, 0.732, 0.737, 0.737, 0.736, 0.736))
```
```{r plot_auroc}
auroc.all %>%
    # facetting for plotting
    mutate(split=case_when(study.test=="Thomas_2018"~
                            'Cross validation (Thomas 2018)',
                        TRUE~"External validation (Zeller 2014)")) %>%
    # convert to factor to enforce ordering
    mutate(cutoff=factor(cutoff, levels = c(fs.cutoff, 'full'))) %>%
    ggplot(aes(x=cutoff, y=AUC, col=type)) +
        geom_point() + geom_line(aes(group=type)) +
        facet_grid(~split) +
        scale_y_continuous(limits = c(0.5, 1), expand = c(0,0)) +
        xlab('Features selected') +
        ylab('AUROC') +
        theme_bw() + 
        scale_colour_manual(values = c('correct'='blue', 'incorrect'='red'),
            name='Feature selection procedure') + 
        theme(panel.grid.minor = element_blank(), legend.position = 'bottom')
    
```

As you can see, the incorrect feature selection procedure leads to inflated 
AUROC values but lower generalization to a truly external dataset, especially
when very few features were selected. In contrast, the correct procedure 
gives a lower cross-validation results but a better estimation for how the 
model would perform on external data.


# Naive Splitting of Dependent Data

Another issue in machine learning workflows can occur when samples are not
independent. For example, microbiome samples taken from the same individual 
at different time points are usually more similar to each other than to samples
from other individuals. If these samples are split randomly in a naive 
cross-validation procedure, the case could arise that samples from the same
individual will end up in the training and the test fold. In this case, the
model would learn to generalize across time-points for the same individual 
compared to the desired model that should learn to distinguish the label 
across individuals.  
To avoid this issue, dependent measurements need to be blocked during 
cross-validation, to ensure that samples within the same block will stay in the
same fold (for training and testing). 

## Load the Data

As an example, we are going to use several datasets of Crohn's disease (CD) 
which are available through the EMBL cluster. The data have already been
filtered and cleaned.  
Since the model training would take again quite a long time, this part of the
vignette is not evaluated upon building of the package, but you should be
able to execute it yourself.

```{r load_data_ibd}
data.loc <- 'https://zenodo.org/api/files/d81e429c-870f-44e0-a44a-2a4aa541b6c1/'

# metadata
meta.all <- read_tsv(paste0(data.loc, 'meta_all_cd.tsv'))

# features
feat.motus <- read.table(paste0(data.loc, 'feat_rel_filt_cd.tsv'),
                        sep='\t', stringsAsFactors = FALSE,
                        check.names = FALSE)
```

When we look at the number of samples and number of individuals, we see that
that there are several samples per individual for example in the `HMP2` study.

```{r no_samples_per_indiv}
x <- meta.all %>% 
    group_by(Study, Group) %>% 
    summarise(n.all=n(), .groups='drop')
y <- meta.all %>% 
    select(Study, Group, Individual_ID) %>% 
    distinct() %>% 
    group_by(Study, Group) %>% 
    summarize(n.indi=n(),  .groups='drop')
full_join(x,y)
```

Therefore, we are going to train a model on the `HMP2` study. However, the 
number of samples per individual varies quite a lot across samples, therefore
we want to randomly select a set of 5 samples per individual:

```{r hmp_samples}
meta.all %>% 
    filter(Study=='HMP2') %>% 
    group_by(Individual_ID) %>% 
    summarise(n=n(), .groups='drop') %>% 
    pull(n) %>% hist(20)

# sample 5 samples per individual
meta.train <- meta.all %>% 
    filter(Study=='HMP2') %>% 
    group_by(Individual_ID) %>%
    sample_n(5, replace = TRUE) %>%
    distinct() %>%
    as.data.frame()
rownames(meta.train) <- meta.train$Sample_ID
```

For evaluation, we only want a single sample per individual, therefore we 
can create a new matrix removing repeated samples for the other studies:

```{r meta_ind}
meta.ind <- meta.all %>% 
    group_by(Individual_ID) %>% 
    filter(Timepoint==min(Timepoint)) %>% 
    ungroup()
```

Lastly, we can already create a tibble to hold the resulting AUROC values:
```{r create_auc_tibble}
auroc.all <- tibble(type=character(0), study.test=character(0), AUC=double(0))
```

## Train with Naive Cross-validation

The naive way to split samples for cross-validation does not take into account
the dependency between samples. Therefore, the pipeline would look basically
like this:

```{r train_incorrect_ibd_models, eval=FALSE}
sc.obj <- siamcat(feat=feat.motus, meta=meta.train,
                    label='Group', case='CD')
sc.obj <- normalize.features(sc.obj, norm.method = 'log.std',
    norm.param=list(log.n0=1e-05,sd.min.q=1),feature.type = 'original')
sc.obj.naive <- create.data.split(sc.obj, num.folds = 10, num.resample = 10)
sc.obj.naive <- train.model(sc.obj.naive, method='lasso')
sc.obj.naive <- make.predictions(sc.obj.naive)
sc.obj.naive <- evaluate.predictions(sc.obj.naive)
auroc.all <- auroc.all %>% 
    add_row(type='naive', study.test='HMP2', 
        AUC=as.numeric(eval_data(sc.obj.naive)$auroc))
```

## Train with Blocked Cross-validation

The correct way to to take into account repeated samples would be to block
the cross-validation procedure by individuals. This way, samples from the same
individual will always end up in the same fold. This can be performed in
`SIAMCAT` by specifying the `inseparable` parameter in the `create.data.split`
function:

```{r train_correct_ibd_models, eval=FALSE}
sc.obj.block <- create.data.split(sc.obj, num.folds = 10, num.resample = 10,
                                inseparable = 'Individual_ID')
sc.obj.block <- train.model(sc.obj.block, method='lasso')
sc.obj.block <- make.predictions(sc.obj.block)
sc.obj.block <- evaluate.predictions(sc.obj.block)
auroc.all <- auroc.all %>% 
    add_row(type='blocked', study.test='HMP2', 
        AUC=as.numeric(eval_data(sc.obj.block)$auroc))
```

## Apply to External Datasets

Now, we can apply both models to external datasets and record the resulting
accuracy:

```{r apply_ibd_models, eval=FALSE}
for (i in setdiff(unique(meta.all$Study), 'HMP2')){
    meta.test <- meta.ind %>% 
        filter(Study==i) %>% 
        as.data.frame()
    rownames(meta.test) <- meta.test$Sample_ID
    # apply naive model
    sc.obj.test <- siamcat(feat=feat.motus, meta=meta.test, 
                            label='Group', case='CD')
    sc.obj.test <- make.predictions(sc.obj.naive, sc.obj.test)
    sc.obj.test <- evaluate.predictions(sc.obj.test)
    auroc.all <- auroc.all %>% 
    add_row(type='naive', study.test=i,
            AUC=as.numeric(eval_data(sc.obj.test)$auroc))
    # apply blocked model
    sc.obj.test <- siamcat(feat=feat.motus, meta=meta.test, 
                            label='Group', case='CD')
    sc.obj.test <- make.predictions(sc.obj.block, sc.obj.test)
    sc.obj.test <- evaluate.predictions(sc.obj.test)
    auroc.all <- auroc.all %>% 
        add_row(type='blocked', study.test=i,
                AUC=as.numeric(eval_data(sc.obj.test)$auroc))
}
```

## Plot the Results

Now, we can compare the resulting AUROC values between the two different
approaches:

```{r load_results_dp, echo=FALSE}
auroc.all <- tibble(
    type=rep(c('naive', 'blocked'), 5),
    study.test=(rep(c('metaHIT', 'Lewis_2015', 'He_2017',
        'Franzosa_2019', 'HMP2'), each=2)),
    AUC=c(0.77, 0.82, 0.80, 0.82, 0.788, 0.855, 0.739, 0.774, 0.988, 0.667))
```
```{r plot_results_db}
auroc.all %>%
    # convert to factor to enforce ordering
    mutate(type=factor(type, levels = c('naive', 'blocked'))) %>%
    # facetting for plotting
    mutate(CV=case_when(study.test=='HMP2'~'CV', 
                        TRUE~'External validation')) %>%
    ggplot(aes(x=study.test, y=AUC, fill=type)) +
        geom_bar(stat='identity', position = position_dodge(), col='black') +
        theme_bw() +
        coord_cartesian(ylim=c(0.5, 1)) +
        scale_fill_manual(values=c('red', 'blue'), name='') +
        facet_grid(~CV, space = 'free', scales = 'free') +
        xlab('') + ylab('AUROC') +
        theme(legend.position = c(0.8, 0.8))
```

As you can see, the naive cross-validation procedure leads to a inflated 
performance estimation compared to the blocked cross-validation. However, 
when assessing generalization to truly external datasets, the blocked 
procedure results in better performance.

# Session Info

```{r session_info}
sessionInfo()
```