---
title: "Meta-analysis using SIAMCAT"
author:
-   name: "Jakob Wirbel, and Georg Zeller"
    affiliation: "EMBL Heidelberg"
    email: "georg.zeller@embl.de"
date: "Date last modified: 2020-11-05"
output: BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{SIAMCAT meta-analysis}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{ASCII}
---

```{r,include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```

# About This Vignette

In this vignette, we want to demonstrate how `SIAMCAT` can facilitate
metagenomic meta-analyses, focussing both on association testing and 
ML workflows.
As an example, we use five different studies of Crohn's disease (CD), 
since we have taxonomic profiles from five different metagenomic datasets 
available. Those studies are:

1. [metaHIT](htpps://doi.org/10.1038/nbt.2939)
2. [Lewis et al. 2015](https://doi.org/10.1016/j.chom.2015.09.008)
3. [He et al. 2017](https://doi.org/10.1093/gigascience/gix050)
4. [Franzosa et al. 2019](https://doi.org/10.1038/s41564-018-0306-4)
5. [HMP2](https://doi.org/10.1038/s41586-019-1237-9)


## Setup

```{r setup, warning=FALSE, message=FALSE}
library("tidyverse")
library("SIAMCAT")
```

First, we load the data for all studies, which are available for download 
from [Zenodo](https://doi.org/10.5281/zenodo.7117162). The raw data have been 
preprocessed and taxonomically profiled with 
[mOTUs2](https://doi.org/10.1038/s41467-019-08844-4) and were 
then aggregated at genus level.

```{r load_data, message=FALSE}
# base url for data download
data.loc <- 'https://zenodo.org/api/files/d81e429c-870f-44e0-a44a-2a4aa541b6c1/'
# datasets
datasets <- c('metaHIT', 'Lewis_2015', 'He_2017', 'Franzosa_2019', 'HMP2')
# metadata
meta.all <- read_tsv(paste0(data.loc, 'meta_all_cd.tsv'))
# features
feat <- read.table(paste0(data.loc, 'feat_genus_cd.tsv'), 
                check.names = FALSE, stringsAsFactors = FALSE, quote = '', 
                sep='\t')
feat <- as.matrix(feat)
# check that metadata and features agree
stopifnot(all(colnames(feat) == meta.all$Sample_ID))
```

Let us have a look at the distribution of groups across the studies
```{r table}
table(meta.all$Study, meta.all$Group)
```

Some of the studies contain more than one sample for the same subject. For
example, the HMP2 publication focussed on the longitudinal aspect of CD.
Therefore. we want to take this into account when training and evaluating 
the machine learning model (see the vignette about **Machine learning 
pitfalls**) and when performing the association testing. Thus, it will be 
convenient to create a second metadata table containing a single entry for 
each individual.

```{r meta_dereplicate}
meta.ind <- meta.all %>% 
    group_by(Individual_ID) %>% 
    filter(Timepoint==min(Timepoint)) %>% 
    ungroup()
```

# Compare Associations

## Compute Associations with SIAMCAT

To test for associations, we can encapsulate each dataset into a different
`SIAMCAT` object and use the `check.associations` function:

```{r compute_associations, message=FALSE, warning=FALSE}
assoc.list <- list()
for (d in datasets){
    # filter metadata and convert to dataframe
    meta.train <- meta.ind %>% 
        filter(Study==d) %>% 
        as.data.frame()
    rownames(meta.train) <- meta.train$Sample_ID

    # create SIAMCAT object
    sc.obj <- siamcat(feat=feat, meta=meta.train, label='Group', case='CD')
    # test for associations
    sc.obj <- check.associations(sc.obj, log.n0=1e-05, 
                                feature.type = 'original')
    # extract the associations and save them in the assoc.list
    temp <- associations(sc.obj)
    temp$genus <- rownames(temp)
    assoc.list[[d]] <- temp %>% 
        select(genus, fc, auc, p.adj) %>% 
        mutate(Study=d)
}
# combine all associations
df.assoc <- bind_rows(assoc.list)
df.assoc <- df.assoc %>% filter(genus!='unclassified')
head(df.assoc)
```

## Plot Heatmap for Interesting Genera

Now, we can compare the associations stored in the `df.assoc` tibble. For 
example, we can extract features which are very strongly associated with the
label (single-feature AUROC > 0.75 or < 0.25) in at least one of the studies
and plot the generalized fold change as heatmap.

```{r genera_of_interest}
genera.of.interest <- df.assoc %>% 
    group_by(genus) %>% 
    summarise(m=mean(auc), n.filt=any(auc < 0.25 | auc > 0.75), 
        .groups='keep') %>% 
    filter(n.filt) %>% 
    arrange(m)
```

After we extracted the genera, we plot them:

```{r heatmap}
df.assoc %>% 
    # take only genera of interest
    filter(genus %in% genera.of.interest$genus) %>% 
    # convert to factor to enforce an ordering by mean AUC
    mutate(genus=factor(genus, levels = rev(genera.of.interest$genus))) %>% 
    # convert to factor to enforce ordering again
    mutate(Study=factor(Study, levels = datasets)) %>% 
    # annotate the cells in the heatmap with stars
    mutate(l=case_when(p.adj < 0.01~'*', TRUE~'')) %>%  
    ggplot(aes(y=genus, x=Study, fill=fc)) + 
        geom_tile() + 
        scale_fill_gradient2(low = '#3B6FB6', high='#D41645', mid = 'white', 
            limits=c(-2.7, 2.7), name='Generalized\nfold change') + 
        theme_minimal() + 
        geom_text(aes(label=l)) +
        theme(panel.grid = element_blank()) + 
        xlab('') + ylab('') +
        theme(axis.text = element_text(size=6))
```


# Study as Confounding Factor

Additionally, we can check how differences between studies might influence the
variance of specific genera. To do so, we create a singel `SIAMCAT` object 
which holds the complete datasets and then we run the `check.confounder`
function.

```{r check_confounders, warning=FALSE}
df.meta <- meta.ind %>% 
    as.data.frame()
rownames(df.meta) <- df.meta$Sample_ID
sc.obj <- siamcat(feat=feat, meta=df.meta, label='Group', case='CD')
check.confounders(sc.obj, fn.plot = './confounder_plot_cd_meta.pdf',
                feature.type='original')
```


![](./confounder_plot_cd_meta.png)


The resulting variance plot shows that some genera are strongly impacated by
differences between studies, other genera not so much. Of note, the genera that
vary most with the label (CD vs controls) do not show a lot of variance across
studies.


# ML Meta-analysis

## Train LASSO Models

Lastly, we can perform the machine learning (ML) meta-analysis: we first train
one model for each datasets and then apply it to the other datasets using
the holdout testing functionality of `SIAMCAT`. For datasets with repeated 
samples across subjects, we block the cross-validation for subjects in order 
not to bias the results (see also the vignette about **Machine learning 
pitfalls**).

```{r ml_meta_analysis, message=FALSE, warning=FALSE, eval=FALSE}
# create tibble to store all the predictions
auroc.all <- tibble(study.train=character(0), 
                    study.test=character(0),
                    AUC=double(0))
# and a list to save the trained SIAMCAT objects
sc.list <- list()
for (i in datasets){
    # restrict to a single study
    meta.train <- meta.all %>% 
        filter(Study==i) %>% 
        as.data.frame()
    rownames(meta.train) <- meta.train$Sample_ID

    ## take into account repeated sampling by including a parameters
    ## in the create.data.split function
    ## For studies with repeated samples, we want to block the
    ## cross validation by the column 'Individual_ID'
    block <- NULL
    if (i %in% c('metaHIT', 'Lewis_2015', 'HMP2')){
        block <- 'Individual_ID'
        if (i == 'HMP2'){ 
            # for the HMP2 dataset, the number of repeated sample per subject 
            # need to be reduced, because some subjects have been sampled 
            # 20 times, other only 5 times
            meta.train <- meta.all %>% 
                filter(Study=='HMP2') %>% 
                group_by(Individual_ID) %>% 
                sample_n(5, replace = TRUE) %>% 
                distinct() %>% 
                as.data.frame()
            rownames(meta.train) <- meta.train$Sample_ID
        }
    }
    # create SIAMCAT object
    sc.obj.train <- siamcat(feat=feat, meta=meta.train, 
                            label='Group', case='CD')
    # normalize features
    sc.obj.train <- normalize.features(sc.obj.train, norm.method = 'log.std',
        norm.param=list(log.n0=1e-05, sd.min.q=0),feature.type = 'original')
    # Create data split
    sc.obj.train <- create.data.split(sc.obj.train,
        num.folds = 10, num.resample = 10, inseparable = block)
    # train LASSO model
    sc.obj.train <- train.model(sc.obj.train, method='lasso')


    ## apply trained models to other datasets

    # loop through datasets again
    for (i2 in datasets){
        if (i == i2){
            # make and evaluate cross-validation predictions (same dataset)
            sc.obj.train <- make.predictions(sc.obj.train)
            sc.obj.train <- evaluate.predictions(sc.obj.train)
            auroc.all <- auroc.all %>% 
                add_row(study.train=i, study.test=i,
                    AUC=eval_data(sc.obj.train)$auroc %>% as.double())
        } else {
            # make and evaluate on the external datasets
            # use meta.ind here, since we want only one sample per subject!
            meta.test <- meta.ind %>% 
                filter(Study==i2) %>%
                as.data.frame()
            rownames(meta.test) <- meta.test$Sample_ID
            sc.obj.test <- siamcat(feat=feat, meta=meta.test,
                                    label='Group', case='CD')
            # make holdout predictions
            sc.obj.test <- make.predictions(sc.obj.train, 
                                            siamcat.holdout = sc.obj.test)
            sc.obj.test <- evaluate.predictions(sc.obj.test)
            auroc.all <- auroc.all %>% 
                add_row(study.train=i, study.test=i2,
                    AUC=eval_data(sc.obj.test)$auroc %>% as.double())
        }
    }
    # save the trained model
    sc.list[[i]] <- sc.obj.train
}
```
```{r load_aurocs, echo=FALSE, message=FALSE}
fn.in.auroc <- system.file(
    "extdata",
    "cd_meta_auroc.tsv",
    package = "SIAMCAT"
)
auroc.all <- read_tsv(fn.in.auroc)
```

After we trained and applied all models, we can calculate the test average for
each dataset:

```{r get_test_average}
test.average <- auroc.all %>% 
    filter(study.train!=study.test) %>% 
    group_by(study.test) %>% 
    summarise(AUC=mean(AUC), .groups='drop') %>% 
    mutate(study.train="Average")
```

Now that we have the AUROC values, we can plot them into a nice heatmap:

```{r plot_auroc}
# combine AUROC values with test average
bind_rows(auroc.all, test.average) %>% 
    # highlight cross validation versus transfer results
    mutate(CV=study.train == study.test) %>%
    # for facetting later
    mutate(split=case_when(study.train=='Average'~'Average', TRUE~'none')) %>% 
    mutate(split=factor(split, levels = c('none', 'Average'))) %>% 
    # convert to factor to enforce ordering
    mutate(study.train=factor(study.train, levels=c(datasets, 'Average'))) %>% 
    mutate(study.test=factor(study.test, 
                            levels=c(rev(datasets),'Average'))) %>% 
    ggplot(aes(y=study.test, x=study.train, fill=AUC, size=CV, color=CV)) +
        geom_tile() + theme_minimal() +
        # text in tiles
        geom_text(aes_string(label="format(AUC, digits=2)"), 
            col='white', size=2)+
        # color scheme
        scale_fill_gradientn(colours=rev(c('darkgreen','forestgreen', 
                                        'chartreuse3','lawngreen', 
                                        'yellow')), limits=c(0.5, 1)) +
        # axis position/remove boxes/ticks/facet background/etc.
        scale_x_discrete(position='top') + 
        theme(axis.line=element_blank(), 
                axis.ticks = element_blank(), 
                axis.text.x.top = element_text(angle=45, hjust=.1), 
                panel.grid=element_blank(), 
                panel.border=element_blank(), 
                strip.background = element_blank(), 
                strip.text = element_blank()) + 
        xlab('Training Set') + ylab('Test Set') + 
        scale_color_manual(values=c('#FFFFFF00', 'grey'), guide=FALSE) + 
        scale_size_manual(values=c(0, 1), guide=FALSE) + 
        facet_grid(~split, scales = 'free', space = 'free')
```

## Investigate Feature Weights

Now that we the trained models (and we saved them in the `sc.list` object), 
we can also extract the model weights using `SIAMCAT` and compare to the 
associations we computed above.

```{r weights, warning=FALSE, message=FALSE, eval=FALSE}
weight.list <- list()
for (d in datasets){
    sc.obj.train <- sc.list[[d]]
    # extract the feature weights out of the SIAMCAT object
    temp <- feature_weights(sc.obj.train)
    temp$genus <- rownames(temp)
    # save selected info in the weight.list
    weight.list[[d]] <- temp %>% 
        select(genus, median.rel.weight, mean.rel.weight, percentage) %>% 
        mutate(Study=d) %>% 
        mutate(r.med=rank(-abs(median.rel.weight)), 
            r.mean=rank(-abs(mean.rel.weight)))
}
# combine all feature weights into a single tibble
df.weights <- bind_rows(weight.list)
df.weights <- df.weights %>% filter(genus!='unclassified')
```
```{r load_weights, echo=FALSE, message=FALSE}
fn.in.weights<- system.file(
    "extdata",
    "cd_meta_weights.tsv",
    package = "SIAMCAT"
)
df.weights <- read_tsv(fn.in.weights)
```

Using this, we can plot another heatmap with the weights, 
focussing on the genera of interest for which we plotted the associations
as heatmap above.

```{r plot_weights_heatmap, warning=FALSE}
# compute absolute feature weights
abs.weights <- df.weights %>% 
    group_by(Study) %>% 
    summarise(sum.median=sum(abs(median.rel.weight)),
                sum.mean=sum(abs(mean.rel.weight)),
                .groups='drop')

df.weights %>% 
    full_join(abs.weights) %>% 
    # normalize by the absolute model size
    mutate(median.rel.weight=median.rel.weight/sum.median) %>% 
    # only include genera of interest
    filter(genus %in% genera.of.interest$genus) %>% 
    # highlight feature rank for the top 20 features
    mutate(r.med=case_when(r.med > 20~NA_real_, TRUE~r.med)) %>%
    # enforce the correct ordering by converting to factors again
    mutate(genus=factor(genus, levels = rev(genera.of.interest$genus))) %>% 
    mutate(Study=factor(Study, levels = datasets)) %>% 
    ggplot(aes(y=genus, x=Study, fill=median.rel.weight)) + 
        geom_tile() + 
        scale_fill_gradientn(colours=rev(
            c('#007A53', '#009F4D', "#6CC24A", 'white',
            "#EFC06E", "#FFA300", '#BE5400')), 
            limits=c(-0.15, 0.15)) +
        theme_minimal() + 
        geom_text(aes(label=r.med), col='black', size= 2) +
        theme(panel.grid = element_blank()) + 
        xlab('') + ylab('') +
        theme(axis.text = element_text(size=6))


```


# Session Info
```{r session_info}
sessionInfo()
```