--- title: "Holdout Testing with SIAMCAT" author: - name: "Jakob Wirbel, Konrad Zych, and Georg Zeller" affiliation: "EMBL Heidelberg" email: "georg.zeller@embl.de" date: "Date last modified: 2018-09-24" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{SIAMCAT holdout testing} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{ASCII} --- # Introduction One of the functionalities of the `SIAMCAT` package is the training of statistical machine learning models on metagenomics data. In this vignette, we demonstrate how such a model can be built on one dataset and then be applied on another, similarly processed holdout dataset. This might be of interest when comparing data from two different studies on the same disease. In this vignette, we look at two datasets from studies on colorectal cancer (CRC). The first study from [Zeller et al.](http://europepmc.org/abstract/MED/25432777) investigated metagenomic markers for CRC in a population in France, while the second study from [Yu et al.](https://europepmc.org/abstract/MED/26408641) used samples from China for the same goal. Both datasets were profiled with the same [taxonomic profiling tool](https://github.com/motu-tool/mOTUs_v2), yielding the same taxonomic identifiers, which is required for holdout testing. # Load the Data The datasets can be found on the web repository for public metagenomics datasets from the [Zeller group](https://www.embl.de/research/units/scb/zeller/index.html). ```{r start, message=FALSE, warning=FALSE} library("SIAMCAT") data.loc <- 'https://zenodo.org/api/files/d81e429c-870f-44e0-a44a-2a4aa541b6c1/' # this is data from Zeller et al., Mol. Syst. Biol. 2014 fn.meta.fr <- paste0(data.loc, 'meta_Zeller.tsv') fn.feat.fr <- paste0(data.loc, 'specI_Zeller.tsv') # this is the external dataset from Yu et al., Gut 2017 fn.meta.cn <- paste0(data.loc, 'meta_Yu.tsv') fn.feat.cn <- paste0(data.loc, 'specI_Yu.tsv') ``` First of all, we build a `SIAMCAT` object using the data from the French study in the same way that we have seen before in the main `SIAMCAT` vignette. ```{r siamcat_fr} # features # be vary of the defaults in R!!! feat.fr <- read.table(fn.feat.fr, sep='\t', quote="", check.names = FALSE, stringsAsFactors = FALSE) # the features are counts, but we want to work with relative abundances feat.fr.rel <- prop.table(as.matrix(feat.fr), 2) # metadata meta.fr <- read.table(fn.meta.fr, sep='\t', quote="", check.names=FALSE, stringsAsFactors=FALSE) # create SIAMCAT object siamcat.fr <- siamcat(feat=feat.fr.rel, meta=meta.fr, label='Group', case='CRC') ``` We can load the data from the Chinese study in a similar way and also create a `SIAMCAT` object for the holdout dataset. ```{r siamcat_cn} # features feat.cn <- read.table(fn.feat.cn, sep='\t', quote="", check.names = FALSE) feat.cn.rel <- prop.table(as.matrix(feat.cn), 2) # metadata meta.cn <- read.table(fn.meta.cn, sep='\t', quote="", check.names=FALSE, stringsAsFactors = FALSE) # SIAMCAT object siamcat.cn <- siamcat(feat=feat.cn.rel, meta=meta.cn, label='Group', case='CRC') ``` # Model Building on the French Dataset ## Preprocessing With the French dataset, we perform the complete process of model building within `SIAMCAT`, including data preprocessing steps like data validation, filtering, and data normalization. ```{r preprocessing_fr} siamcat.fr <- filter.features( siamcat.fr, filter.method = 'abundance', cutoff = 0.001, rm.unmapped = TRUE, verbose=2 ) siamcat.fr <- normalize.features( siamcat.fr, norm.method = "log.std", norm.param = list(log.n0 = 1e-06, sd.min.q = 0.1), verbose = 2 ) ``` ## Model Training Now, we can build the statistical model. We use the same parameters as in the main `SIAMCAT` vignette, where the process is explained in more detail. ```{r build_model_fr, results='hide'} siamcat.fr <- create.data.split( siamcat.fr, num.folds = 5, num.resample = 2 ) siamcat.fr <- train.model( siamcat.fr, method = "lasso" ) ``` ## Predictions Finally, we can make predictions for each cross-validation fold and evaluate the predictions as seen in the main `SIAMCAT` vignette. ```{r predict_evaluate_fr, results='hide'} siamcat.fr <- make.predictions(siamcat.fr) siamcat.fr <- evaluate.predictions(siamcat.fr) ``` # Application on the Holdout Dataset Now that we have successfully built the model for the French dataset, we can apply it to the Chinese holdout dataset. First, we will normalize the Chinese dataset with the same parameters that we used for the French dataset in order to make the data comparable. For that step, we can use the frozen normalization functionality in the `normalize.features` function in `SIAMCAT`. We supply to the function all normalization parameters saved in the `siamcat.fr` object, which can be accessed using the `norm_params` accessor. ## Frozen Normalization ```{r normalize_cn} siamcat.cn <- normalize.features(siamcat.cn, norm.param=norm_params(siamcat.fr), feature.type='original', verbose = 2) ``` ## Holdout Predictions Next, we apply the trained model to predict the holdout dataset. ```{r predict_cn, results='hide'} siamcat.cn <- make.predictions( siamcat = siamcat.fr, siamcat.holdout = siamcat.cn, normalize.holdout = FALSE) ``` Note that the `make.predictions` function can also take care of the normalization of the holdout dataset. ```{r alternative_pipeline_cn, eval=FALSE} ## Alternative Code, not run here siamcat.cn <- siamcat(feat=feat.cn.rel, meta=meta.cn, label='Group', case='CRC') siamcat.cn <- make.predictions(siamcat = siamcat.fr, siamcat.holdout = siamcat.cn, normalize.holdout = TRUE) ``` Again, we have to evaluate the predictions: ```{r eval_cn, message=FALSE} siamcat.cn <- evaluate.predictions(siamcat.cn) ``` # Model Evaluation Now, we can compare the performance of the classifier on the original and the holdout dataset by using the `model.evaluation.plot` function. Here, we can supply several `SIAMCAT` objects for which the model evaluation will be plotted in the same plot. Note that we can supply the objects as named objects in order to print the names in the legend. ```{r eval_plot, eval=FALSE} model.evaluation.plot('FR-CRC'=siamcat.fr, 'CN-CRC'=siamcat.cn, colours=c('dimgrey', 'orange')) ``` ![](./eval_plot_holdout.png) # Session Info ```{r session_info} sessionInfo() ```