---
title: "Peak Matrix Processing for metabolomics datasets"
author: 
- name: "Andris Jankevics"
  affiliation: Phenome Centre Birmingham, University of Birmingham, UK
  email: a.jankevics@bham.ac.uk
- name: Ralf Johannes Maria Weber
  affiliation: Phenome Centre Birmingham, University of Birmingham, UK
  email: r.j.weber@bham.ac.uk

package: pmp

date: "`r Sys.Date()`"
output: 
    BiocStyle::html_document:
        toc: true
        toc_depth: 2  
        number_sections: true  
        toc_float: true

bibliography: pmp.bib
csl: harvard-university-of-birmingham.csl

vignette: >
    %\VignetteIndexEntry{Peak Matrix Processing for metabolomics datasets}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# Introduction

Metabolomics data (pre-)processing workflows consist of multiple steps
including peak picking, quality assessment, missing value imputation,
normalisation and scaling. Several software solutions (commercial and
open-source) are available for raw data processing, including r-package XCMS, 
to generate processed outputs in the form of a two dimensional data matrix.

These outputs contain hundreds or thousands of so called "uninformative" or
"irreproducible" features. Such features could strongly hinder outputs of
subsequent statistical analysis, biomarker discovery or metabolic pathway
inference. Common practice is to apply peak matrix validation and filtering
procedures as described in @guida2016, @broadhurst2018 and @schiffman2019. 

Functions within the `pmp` (Peak Matrix Processing) package are designed to
help users to prepare data for further statistical data analysis in a fast,
easy to use and reproducible manner.

This vignette showcases a range of commonly applied Peak Matrix Processing
steps for metabolomics datasets.

# Installation

You should have R version 4.0.0 or above and Rstudio installed to be able to
run this notebook.

Execute following commands from the R terminal.

```{r eval=FALSE, include=TRUE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("pmp")
```

```{r message=FALSE, warning=FALSE}
library(pmp)
library(SummarizedExperiment)
library(S4Vectors)
```

# Data formats

Recently a review by @stanstrup2019 reported and discussed a broad range of
heterogeneous R tools and packages that are available via `Bioconductor`,
`CRAN`, `Github` and similar public repositories. 

`pmp` package utilises `r Biocpkg("SummarizedExperiment")` class from 
Bioconductor for data input and output.

For example, outputs from widely used `r Biocpkg("xcms")` package can be 
converted to a `SummarizedExperiment` object using functions
`featureDefinitions`, `featureValues` and `pData` on the `xcms` output object.

Additionally `pmp` also supports any matrix-like `R` data 
structure (e.g. an ordinary matrix, a data frame) as an input. If the input is
a matrix-like structure `pmp` will perform several checks for data integrity.
Please see section \@ref(endomorphisms) for more details.

# Example dataset, MTBLS79

In this tutorial we will be using a direct infusion mass spectrometry (DIMS) 
dataset consisting of 172 samples measured across 8 batches and is included in
`pmp` package as `SummarizedExperiemnt` class object `MTBLS79`. 
More detailed description of the dataset is available from  @kirwan2014, 
[MTBLS79](https://www.ebi.ac.uk/metabolights/MTBLS79) and R man page.

```{r, eval=FALSE}
help ("MTBLS79")
```

```{r}
data(MTBLS79)
MTBLS79
```

This dataset before peak matrix filtering contains 172 samples, 2488 features
and 18222 missing values across all samples what is roughly around 4.2%.

```{r}
sum(is.na(assay(MTBLS79)))
sum(is.na(assay(MTBLS79))) / length(assay(MTBLS79)) * 100
```

# Filtering a dataset

Missing values in the dataset can be filtered across samples or features. The
command below will remove all samples with more than 10 % missing values.

```{r}
MTBLS79_filtered <- filter_samples_by_mv(df=MTBLS79, max_perc_mv=0.1)

MTBLS79_filtered

sum(is.na(assay(MTBLS79_filtered)))
```

Missing values sample filter has removed two samples from the dataset. 
Outputs from any `pmp` function can be used as inputs for another `pmp`
function. For example we can apply missing value filter across features on the
output of the previous call. The function call below will filter features based
on the quality control (QC) sample group only.

```{r}
MTBLS79_filtered <- filter_peaks_by_fraction(df=MTBLS79_filtered, min_frac=0.9, 
    classes=MTBLS79_filtered$Class, method="QC", qc_label="QC")

MTBLS79_filtered

sum(is.na(assay(MTBLS79_filtered)))
```

We can add another filter on top of the previous result. For this additional
filter  we will use the same function call, but this time missing values will
be calculated across all samples and not only within the “QC” group.

```{r}
MTBLS79_filtered <- filter_peaks_by_fraction(df=MTBLS79_filtered, min_frac=0.9, 
    classes=MTBLS79_filtered$Class, method="across")

MTBLS79_filtered

sum(is.na(assay(MTBLS79_filtered)))
```

Applying these 3 filters has reduced the number of missing values from 18222 to 
4779. 

Another common filter approach is to filter features by the coefficient of
variation (CV) or RSD% of QC samples. The example shown below will use a 30%
threshold.

```{r}
MTBLS79_filtered <- filter_peaks_by_rsd(df=MTBLS79_filtered, max_rsd=30, 
    classes=MTBLS79_filtered$Class, qc_label="QC")

MTBLS79_filtered

sum(is.na(assay(MTBLS79_filtered)))
```

# Processing history

Every function in `pmp` provides a history of parameter values that have been
applied. If a user has saved outputs from an R session, it’s also easy to check
what function calls were executed.

```{r}
processing_history(MTBLS79_filtered)
```

# Data normalisation

Next, we will apply probabilistic quotient normalisation (PQN).

```{r}
MTBLS79_pqn_normalised <- pqn_normalisation(df=MTBLS79_filtered, 
    classes=MTBLS79_filtered$Class, qc_label="QC")
```

# Missing value imputation

A unified function call for several commonly used missing value imputation
algorithms is also included in pmp. Supported methods are: k-nearest neighbours
(knn), random forests (rf), Bayesian PCA missing value estimator (bpca), mean
or median value of the given feature and a constant small value. In the example
below we will apply knn imputation. 

Within `mv_imputaion` interface user can easily apply different 
mehtod without worrying about input data type or tranposing dataset.

```{r}
MTBLS79_mv_imputed <- mv_imputation(df=MTBLS79_pqn_normalised,
    method="knn")
```

# Data scaling
The generalised logarithm (glog) transformation algorithm is available to
stabilise the variance across low and high intensity mass spectral features.

```{r}
MTBLS79_glog <- glog_transformation(df=MTBLS79_mv_imputed,
    classes=MTBLS79_filtered$Class, qc_label="QC")
```

`glog_transformation` function uses QC samples to optimse scaling factor 
`lambda`. Using the function `glog_plot_plot_optimised_lambda` it's possible to
visualise if the optimsation of the given parameter has converged at the
minima.

```{r plot_glog, fig.width=5}
opt_lambda <- 
    processing_history(MTBLS79_glog)$glog_transformation$lambda_opt
glog_plot_optimised_lambda(df=MTBLS79_mv_imputed,
    optimised_lambda=opt_lambda,
    classes=MTBLS79_filtered$Class, qc_label="QC")
```

# Data integrity check and endomorphisms {#endomorphisms}

Functions in the `pmp` package are designed to validate input data if the user
chooses not to use the `r Biocpkg("SummarizedExperiment")` class object.

For example, if the input `matrix` consists of features stored in columns and
samples in rows or *vice versa*, any function within the `pmp` package will be
able to handle this in the correct manner.

```{r}
peak_matrix <- t(assay(MTBLS79))
sample_classes <- MTBLS79$Class

class(peak_matrix)
dim(peak_matrix)
class(sample_classes)
```

Let's try to use these objects as input for `mv_imputation` and 
`filter_peaks_by_rsd`.

```{r message=TRUE, warning=TRUE}
mv_imputed <- mv_imputation(df=peak_matrix, method="mn")
rsd_filtered <- filter_peaks_by_rsd(df=mv_imputed, max_rsd=30, 
    classes=sample_classes, qc_label="QC")

class (mv_imputed)
dim (mv_imputed)

class (rsd_filtered)
dim (rsd_filtered)
```

Note that `pmp` has automatically transposed the input object to use the
largest dimension as features, while the original R data type `matrix` has been
retained also for the function output.

# Session information

```{r}
sessionInfo()
```

# References