---
title: "Parallel Annotation"
date: "2020-10-11"
package: peakPantheR
output:
    BiocStyle::html_document:
        toc_float: true
vignette: >
    %\VignetteIndexEntry{Parallel Annotation}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
    %\VignetteDepends{peakPantheR,faahKO,pander,BiocStyle,foreach,doParallel}
    %\VignettePackage{peakPantheR}
    %\VignetteKeywords{mass spectrometry, metabolomics}
---
```{r biocstyle, echo = FALSE, results = "asis" }
BiocStyle::markdown()
```
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
**Package**: `r Biocpkg("peakPantheR")`
**Authors**: Arnaud Wolfer, Goncalo Correia
```{r init, message = FALSE, echo = FALSE, results = "hide" }
## Silently loading all packages
library(BiocStyle)
library(peakPantheR)
library(faahKO)
library(pander)
library(doParallel)
library(foreach)
```
# Introduction
The `peakPantheR` package is designed for the detection, integration and
reporting of pre-defined features in MS files (_e.g. compounds, fragments,
adducts, ..._).
The **Parallel Annotation** is set to detect and integrate **multiple**
compounds in **multiple** files in **parallel** and store results in a
**single** object. It can be employed to integrate a large number of expected
features across a dataset.
Using the `r Biocpkg("faahKO")` raw MS dataset as an example, this vignette
will:
* Detail the **Parallel Annotation** concept
* Apply the **Parallel Annotation** to a subset of pre-defined features in the
`r Biocpkg("faahKO")` dataset
## Abbreviations
- **ROI**: _Regions Of Interest_
    * reference _RT_ / _m/z_ windows in which to search for a feature
- **uROI**: _updated Regions Of Interest_
    * modified ROI adapted to the current dataset, which override the
      reference ROI
- **FIR**: _Fallback Integration Regions_
    * _RT_ / _m/z_ window to integrate if no peak is found
- **TIC**: _Total Ion Chromatogram_
    * the intensities summed across all masses for each scan
- **EIC**: _Extracted Ion Chromatogram_
    * the intensities summed over a mass range, for each scan
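
The TIC and EIC definitions above can be illustrated on a toy table of data points. This is a minimal base R sketch with made-up intensities; the `scans` table and its column names are assumptions for illustration and are not how `peakPantheR` stores raw data.

```r
# Toy centroided data: one row per (scan, m/z, intensity) data point.
# All values are made up for illustration.
scans <- data.frame(
    scan      = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
    rt        = c(10, 10, 10, 11, 11, 11, 12, 12, 12),
    mz        = c(496.2, 522.2, 600.1, 496.2, 522.2, 600.1, 496.2, 522.2, 600.1),
    intensity = c(5, 100, 20, 8, 400, 25, 6, 150, 30)
)

# TIC: intensities summed across all masses for each scan
tic <- tapply(scans$intensity, scans$scan, sum)

# EIC: intensities summed over a mass range, for each scan
mzMin <- 522.194778
mzMax <- 522.205222
in_window <- scans$mz >= mzMin & scans$mz <= mzMax
eic <- tapply(scans$intensity[in_window], scans$scan[in_window], sum)

tic
eic
```

Here the EIC only picks up the `m/z` 522.2 signal, while the TIC sums every mass in each scan.
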
# Parallel Annotation Concept
Parallel compound integration is set to process **multiple** compounds in
**multiple** files in **parallel**, and store results in a **single** object.
```{r, out.width = "700px", echo = FALSE}
knitr::include_graphics("../man/figures/parallelAnnotation.png")
```
To achieve this, `peakPantheR` will:
1. load a list of expected _RT_ / _m/z_ ROI and a list of files to process
2. initialise an output object with expected ROI and file paths
3. first pass (_without peak filling_) on a subset of representative samples
(e.g. QC samples):
    + for each file, detect features in each ROI and keep highest intensity
    + determine peak statistics for each feature
    + store results + EIC for each ROI
4. visual inspection of first pass results, update ROI:
    + diagnostic plots: all EICs, peak apex _RT_ / _m/z_ & peak width evolution
    + correct ROI (remove interfering feature, correct _RT_ shift)
    + define fallback integration regions (FIR) if no feature is detected
    (median _RT_ / _m/z_ start and end of found features)
5. initialise a new output object, with updated regions of interest (uROI) and
fallback integration regions (FIR), with all samples
6. second pass (_with peak filling_) on all samples:
    + for each file, detect features in each uROI and keep highest intensity
    + determine peak statistics for each feature
    + integrate FIR when no peaks are found
    + store results + EIC for each uROI
7. summary statistics:
    + plot EICs, apex and peakwidth evolution
    + compare first and second pass
8. return the resulting object and/or table (_row: file, col: compound_)
```{r, out.width = "700px", echo = FALSE}
knitr::include_graphics("../man/figures/parallelAnnotation_procedure.png")
```
> Diagram of the workflow and functions used for parallel annotation.
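
The "keep highest intensity" and "integrate FIR when no peaks are found" steps of the procedure can be sketched on a single toy EIC. The helpers `integrate_window()` and `pick_or_fill()` and the `minIntensity` threshold are hypothetical and not part of `peakPantheR`; a naive maximum search and trapezoidal integration stand in for the package's peak detection and curve fitting.

```r
# Hypothetical helpers, NOT part of peakPantheR: a deliberately naive stand-in
# for the package's peak detection and curve fitting.
integrate_window <- function(rt, int, rtMin, rtMax) {
    keep <- rt >= rtMin & rt <= rtMax
    x <- rt[keep]
    y <- int[keep]
    if (length(x) < 2) return(0)
    # trapezoidal rule over the points inside the window
    sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

pick_or_fill <- function(rt, int, roi, fir, minIntensity = 50) {
    in_roi <- rt >= roi[["rtMin"]] & rt <= roi[["rtMax"]]
    found <- any(in_roi) && max(int[in_roi]) >= minIntensity
    if (found) {
        # keep only the highest-intensity point in the ROI as the apex
        apex <- rt[in_roi][which.max(int[in_roi])]
        area <- integrate_window(rt, int, roi[["rtMin"]], roi[["rtMax"]])
        list(found = TRUE, is_filled = FALSE, rt = apex, peakArea = area)
    } else {
        # no peak: integrate the fallback integration region instead
        area <- integrate_window(rt, int, fir[["rtMin"]], fir[["rtMax"]])
        list(found = FALSE, is_filled = TRUE, rt = NA_real_, peakArea = area)
    }
}

# Toy EIC with a clear peak, and the same trace scaled down to noise level
rt  <- seq(3300, 3360, by = 10)
int <- c(0, 10, 200, 400, 200, 10, 0)
roi <- c(rtMin = 3310, rtMax = 3350)
fir <- c(rtMin = 3320, rtMax = 3340)

peak  <- pick_or_fill(rt, int, roi, fir)
blank <- pick_or_fill(rt, int / 100, roi, fir)
```

In this sketch `peak` reports a found apex inside the ROI, while `blank` falls back to the FIR window and is flagged as filled.
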
# Parallel Annotation Example
We can target 2 pre-defined features in 6 raw MS spectra files from the
`r Biocpkg("faahKO")` package using `peakPantheR_parallelAnnotation()`. For more
details on the installation and input data employed, please consult the
[Getting Started with peakPantheR](getting-started.html) vignette.
## Input Data
First, the paths to 3 MS files from the `r Biocpkg("faahKO")` package are
located and used as input spectra. In this example, these 3 samples are
considered representative of the whole run (e.g. Quality Control samples):
```{r}
library(faahKO)
## file paths
input_spectraPaths <- c(system.file('cdf/KO/ko15.CDF', package = "faahKO"),
                        system.file('cdf/KO/ko16.CDF', package = "faahKO"),
                        system.file('cdf/KO/ko18.CDF', package = "faahKO"))
input_spectraPaths
```
Two targeted features (_e.g. compounds, fragments, adducts, ..._) are defined
and stored in a table with the following columns:
* `cpdID` (character)
* `cpdName` (character)
* `rtMin` (sec)
* `rtMax` (sec)
* `rt` (sec, optional / `NA`)
* `mzMin` (m/z)
* `mzMax` (m/z)
* `mz` (m/z, optional / `NA`)
```{r, eval = FALSE}
# targetFeatTable
input_targetFeatTable <- data.frame(matrix(vector(), 2, 8, dimnames=list(c(),
                        c("cpdID", "cpdName", "rtMin", "rt", "rtMax", "mzMin",
                        "mz", "mzMax"))), stringsAsFactors=FALSE)
input_targetFeatTable[1,] <- c("ID-1", "Cpd 1", 3310., 3344.888, 3390.,
                                522.194778, 522.2, 522.205222)
input_targetFeatTable[2,] <- c("ID-2", "Cpd 2", 3280., 3385.577, 3440.,
                                496.195038, 496.2, 496.204962)
input_targetFeatTable[,c(3:8)] <- sapply(input_targetFeatTable[,c(3:8)],
                                        as.numeric)
```
```{r, results = "asis", echo = FALSE}
# use pandoc for improved readability
input_targetFeatTable <- data.frame(matrix(vector(), 2, 8, dimnames=list(c(),
c("cpdID", "cpdName", "rtMin", "rt", "rtMax", "mzMin",
"mz", "mzMax"))), stringsAsFactors=FALSE)
input_targetFeatTable[1,] <- c("ID-1", "Cpd 1", 3310., 3344.888, 3390.,
522.194778, 522.2, 522.205222)
input_targetFeatTable[2,] <- c("ID-2", "Cpd 2", 3280., 3385.577, 3440.,
496.195038, 496.2, 496.204962)
input_targetFeatTable[,c(3:8)] <- sapply(input_targetFeatTable[,c(3:8)],
as.numeric)
rownames(input_targetFeatTable) <- NULL
pander::pandoc.table(input_targetFeatTable, digits = 9)
```
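
As an alternative sketch, the same table can be declared column by column so that each column has its intended type from the start, which avoids the character-to-numeric conversion step. The values are those shown above; the sanity checks at the end are our own additions and not something the package requires.

```r
# Same two features as above, declared column by column so each column
# already has the intended type (no character-to-numeric conversion step).
input_targetFeatTable <- data.frame(
    cpdID   = c("ID-1", "ID-2"),
    cpdName = c("Cpd 1", "Cpd 2"),
    rtMin   = c(3310, 3280),
    rt      = c(3344.888, 3385.577),
    rtMax   = c(3390, 3440),
    mzMin   = c(522.194778, 496.195038),
    mz      = c(522.2, 496.2),
    mzMax   = c(522.205222, 496.204962),
    stringsAsFactors = FALSE
)

# Basic sanity checks on the windows (our own checks, not a package feature)
stopifnot(
    all(input_targetFeatTable$rtMin < input_targetFeatTable$rtMax),
    all(input_targetFeatTable$mzMin < input_targetFeatTable$mzMax)
)
str(input_targetFeatTable)
```
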
Additional compound and spectra metadata can be provided but are not used
during the fitting procedure:
```{r, eval=FALSE}
# spectra Metadata
input_spectraMetadata <- data.frame(matrix(c("sample type 1", "sample type 2",
"sample type 1"), 3, 1,
dimnames=list(c(),c("sampleType"))),
stringsAsFactors=FALSE)
```
```{r, results = "asis", echo = FALSE}
# use pandoc for improved readability
input_spectraMetadata <- data.frame(matrix(c("sample type 1", "sample type 2",
"sample type 1"), 3, 1,
dimnames=list(c(),c("sampleType"))),
stringsAsFactors=FALSE)
pander::pandoc.table(input_spectraMetadata)
```
## Initialise and Run Parallel Annotation
A `peakPantheRAnnotation` object is first initialised with the paths to the
files to process (`spectraPaths`), the features to integrate
(`targetFeatTable`) and additional information and parameters such as
`spectraMetadata`, `uROI`, `FIR` and whether they should be used
(`useUROI=TRUE`, `useFIR=TRUE`):
```{r}
library(peakPantheR)
init_annotation <- peakPantheRAnnotation(spectraPaths = input_spectraPaths,
                                        targetFeatTable = input_targetFeatTable,
                                        spectraMetadata = input_spectraMetadata)
```
The resulting `peakPantheRAnnotation` object is not annotated and neither
contains nor uses `uROI` and `FIR`:
```{r}
init_annotation
```
`peakPantheR_parallelAnnotation()` will run the annotation across files in
parallel (if `ncores > 0`) and return the successful annotations
(`result$annotation`) and failures (`result$failures`):
```{r}
# annotate files serially
annotation_result <- peakPantheR_parallelAnnotation(init_annotation, ncores=0,
                                                    curveModel='skewedGaussian',
                                                    verbose=TRUE)
# successful fit
nbSamples(annotation_result$annotation)
data_annotation <- annotation_result$annotation
data_annotation
# list failed fit
annotation_result$failures
```
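
The example above runs serially with `ncores=0`. When running in parallel, one possible way to choose a value for `ncores` is sketched below using `parallel::detectCores()` from base R. The variable `n_files` and the "leave one core free" rule are assumptions for illustration, not a recommendation from the package.

```r
# Number of files to process (3 in the first pass of this vignette)
n_files <- 3

# detectCores() can return NA on some platforms, so fall back to 1
available <- parallel::detectCores()
if (is.na(available)) available <- 1

# Leave one core free and never request more workers than files;
# 0 keeps the serial behaviour used in the example above
ncores <- max(0, min(available - 1, n_files))
ncores
```

The resulting value could then be passed as the `ncores` argument of `peakPantheR_parallelAnnotation()`.
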
## Process Parallel Annotation Results
Based on the fit results, updated ROI (`uROI`) and fallback integration region
(`FIR`) can be automatically determined using `annotationParamsDiagnostic()`:
* `uROI` are established as the min/max (`rt` and `m/z`) of the found peaks
(+/- 5% in RT)
* `FIR` are established as the median of found `rtMin`, `rtMax`, `mzMin`,
`mzMax`
```{r}
updated_annotation <- annotationParamsDiagnostic(data_annotation, verbose=TRUE)
# uROI now exist
updated_annotation
```
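
The two rules above can be illustrated on toy peak boundaries for a single compound. This is a base R sketch, not the package implementation: in particular, it assumes that the 5% margin is taken relative to the width of the found RT window, which may differ from what `annotationParamsDiagnostic()` actually computes.

```r
# Toy peak boundaries found for one compound in three samples
# (made-up numbers, not the faahKO results)
found <- data.frame(
    rtMin = c(3320, 3325, 3330),
    rtMax = c(3400, 3405, 3410),
    mzMin = c(522.19, 522.19, 522.19),
    mzMax = c(522.21, 522.21, 522.21)
)

# uROI: min/max of the found peaks, widened in RT.
# ASSUMPTION: the 5% margin is relative to the width of the found RT window;
# the package may define the margin differently.
rt_span <- max(found$rtMax) - min(found$rtMin)
margin  <- 0.05 * rt_span
uROI <- c(rtMin = min(found$rtMin) - margin,
          rtMax = max(found$rtMax) + margin,
          mzMin = min(found$mzMin),
          mzMax = max(found$mzMax))

# FIR: median of the found boundaries
FIR <- vapply(found, median, numeric(1))

uROI
FIR
```
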
`outputAnnotationDiagnostic()` will save to disk
`annotationParameters_summary.csv`, which contains the original `ROI` and the
newly determined `uROI` and `FIR` for manual validation. Additionally, a
diagnostic plot for each compound is saved for reference; these plots can be
generated in parallel with the argument `ncores`:
```{r, eval=FALSE}
# create a colourScale based on the sampleType
uniq_sType <- sort(unique(spectraMetadata(updated_annotation)$sampleType),
na.last=TRUE)
col_sType <- unname( setNames(c('blue', 'red'),
c(uniq_sType))[spectraMetadata(updated_annotation)$sampleType] )
# create a temporary location to save the diagnostic (otherwise provide the
# path to the selected location)
output_folder <- tempdir()
# output fit diagnostic to disk
outputAnnotationDiagnostic(updated_annotation, saveFolder=output_folder,
savePlots=TRUE, sampleColour=col_sType,
verbose=TRUE, ncores=2)
```
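
The colour scale construction used above is compact, so here it is unpacked on a plain character vector. The `sampleType` vector stands in for `spectraMetadata(updated_annotation)$sampleType`, and `palette_sType` is a name introduced only for this sketch.

```r
# Toy metadata standing in for spectraMetadata(updated_annotation)$sampleType
sampleType <- c("sample type 1", "sample type 2", "sample type 1")

# One colour per unique group, in sorted order
uniq_sType <- sort(unique(sampleType), na.last = TRUE)
palette_sType <- setNames(c("blue", "red"), uniq_sType)

# Index the named palette by group to get one colour per sample
col_sType <- unname(palette_sType[sampleType])
col_sType
```

The number of colours supplied needs to match the number of unique groups, otherwise some samples would end up without a matching colour.
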
The data saved in `annotationParameters_summary.csv` is as follows:
```{r, results = "asis", echo = FALSE}
# use pandoc for improved readability, display the diagnostic results
tmp_csv <- data.frame(matrix(nrow=2,ncol=21,dimnames=list(c(), c('cpdID',
'cpdName', 'X', 'ROI_rt', 'ROI_mz','ROI_rtMin', 'ROI_rtMax',
'ROI_mzMin', 'ROI_mzMax', 'X', 'uROI_rtMin', 'uROI_rtMax', 'uROI_mzMin',
'uROI_mzMax', 'uROI_rt', 'uROI_mz', 'X', 'FIR_rtMin', 'FIR_rtMax',
'FIR_mzMin', 'FIR_mzMax'))), stringsAsFactors=FALSE)
tmp_csv[1,] <- c('ID-1','Cpd 1', '|', 3344.888, 522.2, 3310., 3390., 522.194778,
522.205222,'|', 3305.75893, 3411.436284, 522.194778, 522.205222,
3344.888, 522.2, '|', 3326.10635, 3407.272648, 522.194778,
522.205222)
tmp_csv[2,] <- c('ID-2','Cpd 2', '|', 3385.577, 496.2, 3280., 3440., 496.195038,
496.204962,'|',3337.376665, 3462.449033, 496.195038, 496.204962,
3385.577, 496.2, '|', 3365.023857, 3453.404957, 496.195038,
496.204962)
tmp_csv[,-c(1,2,3,10,17)] <- sapply(tmp_csv[,-c(1,2,3,10,17)], as.numeric)
colnames(tmp_csv) <- c('cpdID', 'cpdName', 'X', 'ROI_rt', 'ROI_mz','ROI_rtMin',
'ROI_rtMax', 'ROI_mzMin', 'ROI_mzMax', 'X', 'uROI_rtMin',
'uROI_rtMax', 'uROI_mzMin', 'uROI_mzMax', 'uROI_rt',
'uROI_mz', 'X', 'FIR_rtMin', 'FIR_rtMax', 'FIR_mzMin',
'FIR_mzMax')
pander::pandoc.table(tmp_csv, digits=9)
```
```{r, out.width = "700px", echo = FALSE}
knitr::include_graphics(
"../man/figures/parallel_annotation_diagnostic_cpd1.png")
```
> Diagnostic plot for compound 1: The top panel is an overlay of the extracted
EIC across all samples with the fitted curve as a dotted line. The panel under
the EIC represents the RT peakwidth of each found peak (`rtMin`, `rtMax` and
apex marked as a dot), ordered with the first sample at the top. The bottom 3
panels represent found `RT` (peakwidth), `m/z` (peakwidth) and `peak area` by
run order, with the corresponding histograms to the right.
`ROI` exported to `.csv` can be updated based on the diagnostic plots; `uROI`
(updated ROI potentially used for all samples) and `FIR` (fallback integration
regions for when no peak is found) can also be tweaked to better fit the peaks.
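
Editing the exported parameters can be done in any spreadsheet tool or directly in R. The sketch below is self-contained: it first writes a cut-down stand-in for `annotationParameters_summary.csv` (only a few of the real columns, with values taken from the table above), then narrows one `uROI` boundary. The new value of 3400 is arbitrary and purely illustrative.

```r
# Write a cut-down stand-in for annotationParameters_summary.csv so that the
# example is self-contained; only a few of the real columns are reproduced.
csv_path <- tempfile(fileext = ".csv")
write.csv(
    data.frame(cpdID = c("ID-1", "ID-2"),
               cpdName = c("Cpd 1", "Cpd 2"),
               uROI_rtMin = c(3305.75893, 3337.376665),
               uROI_rtMax = c(3411.436284, 3462.449033)),
    csv_path, row.names = FALSE
)

# Load, narrow the uROI of one compound, and save the edited parameters.
# check.names = FALSE keeps column names untouched (the real file repeats
# the separator column name 'X').
params <- read.csv(csv_path, stringsAsFactors = FALSE, check.names = FALSE)
params$uROI_rtMax[params$cpdID == "ID-1"] <- 3400
write.csv(params, csv_path, row.names = FALSE)

edited <- read.csv(csv_path, stringsAsFactors = FALSE)
edited
```
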
### Retention time correction
The optional `retentionTimeCorrection()` method provides an interface to adjust
the expected ROI `rt` values and account for chromatographic batch effects. By
comparing expected and found `rt` values for a set of reference compounds, a
model of the chromatographic shift for the present batch can be established.
This model can in turn be used to correct the expected retention time of all
targeted compounds.
In order to apply this method, the `peakPantheRAnnotation` must be previously
annotated (`isAnnotated=TRUE`).
The retention time correction algorithm to use can be selected using the
`method` argument (currently `polynomial` and
`constant` methods are available).
`retentionTimeCorrection()` fits a correction function to model the dependency
of the mean `rt_dev_sec` per reference feature on the expected (database)
retention time.
If `useUROI=TRUE`, the expected retention time value is taken from the `uROI_rt`
field, otherwise `ROI_rt` is used.
If `robust=TRUE`, the RANSAC algorithm is used to automatically detect outliers
and exclude them from the fit (this should only be used with a large number of
reference features).
`retentionTimeCorrection()` returns a list with 2 elements:
* a modified `peakPantheRAnnotation` object
* a `ggplot2` diagnostic plot (optional, depending on whether `TRUE` or `FALSE`
is passed to the `diagnostic` argument).
The returned `peakPantheRAnnotation` object contains the same `uROI` and `FIR`
`mz` values as the original annotation, but the retention time related
parameters (`rt`, `rtMin` and `rtMax`) are replaced by the adjusted values.
The `rtMax` and `rtMin` are set as the corrected `rt` value plus or minus half
the value passed to the `rtWindowWidth` argument, respectively.
`useUROI` is also set to `TRUE`.
To continue with the workflow, simply set a new annotation object with the
fit parameters established by `retentionTimeCorrection()` and call
`peakPantheR_parallelAnnotation()` for the final annotation.
```{r, results="asis", echo=FALSE}
# Example with constant correction.
rtCorrectionOutput <- retentionTimeCorrection(updated_annotation,
rtCorrectionReferences=c('ID-1'),
method='constant',
robust=FALSE,
rtWindowWidth=15,
diagnostic=TRUE)
updated_annotation <- rtCorrectionOutput$annotation
# The ggplot2 plot object
rtCorrectionOutput$plot
# Example with second degree polynomial, without using RANSAC
# to obtain a robust fit
rtCorrectionOutput <- retentionTimeCorrection(updated_annotation,
rtCorrectionReferences=NULL,
method='polynomial',
params=list(polynomialOrder=2),
robust=FALSE, rtWindowWidth=15,
diagnostic=TRUE)
```
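
To make the idea behind the two correction methods concrete, the sketch below reproduces it with base R on made-up reference features. It is not the package implementation: it assumes that `rt_dev_sec` is "found minus expected" (so the corrected `rt` is the expected value plus the predicted deviation), uses `lm()` for the polynomial fit, and leaves out the RANSAC outlier detection. The `ref` and `targets` tables are hypothetical.

```r
# Toy reference features: expected rt and mean observed deviation (seconds).
# ASSUMPTION: rt_dev_sec is "found minus expected", so the corrected rt is
# expected + predicted deviation.
ref <- data.frame(
    rt         = c(100, 200, 300, 400, 500),
    rt_dev_sec = c(2.1, 3.9, 6.2, 7.8, 10.0)
)

# 'constant' flavour: a single shift applied to every compound
constant_shift <- mean(ref$rt_dev_sec)

# 'polynomial' flavour: deviation modelled as a polynomial of expected rt
fit <- lm(rt_dev_sec ~ poly(rt, 2, raw = TRUE), data = ref)

# Apply to the targets and rebuild the rt window from rtWindowWidth
targets <- data.frame(cpdID = c("ID-1", "ID-2"), rt = c(150, 450))
rtWindowWidth <- 15
targets$rt_constant <- targets$rt + constant_shift
targets$rt_poly     <- targets$rt + predict(fit, newdata = targets)
targets$rtMin       <- targets$rt_poly - rtWindowWidth / 2
targets$rtMax       <- targets$rt_poly + rtWindowWidth / 2
targets
```

With only a handful of reference features, a constant or low-order model is the safer choice; a higher polynomial order can follow noise in the reference deviations.
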
## New Initialisation with Updated Parameters to be Applied to All Study Samples
Following this manual validation of the fit on reference samples, the modified
parameters in the `.csv` file can be reloaded and applied to all study samples.
### Load new fit parameters
`peakPantheR_loadAnnotationParamsCSV()` will load the new `.csv` parameters (as
generated by `outputAnnotationDiagnostic()`) and initialise a
`peakPantheRAnnotation` object without `spectraPaths`, `spectraMetadata` or
`cpdMetadata` which will need to be added before annotation. `useUROI` and
`useFIR` are set to `FALSE` by default and will need to be modified according to
the analysis to run. `uROIExist` is established depending on the `.csv` uROI
columns, and will only be set to `TRUE` if no `NA` is present. It is possible
to reset the `FIR` values with the `uROI` windows using `resetFIR()`.
```{r, eval=FALSE}
update_csv_path <- '/path_to_new_csv/'
# load csv
new_annotation <- peakPantheR_loadAnnotationParamsCSV(update_csv_path)
#> uROIExist set to TRUE
#> New peakPantheRAnnotation object initialised for 2 compounds
new_annotation
#> An object of class peakPantheRAnnotation
#> 2 compounds in 0 samples.
#> updated ROI exist (uROI)
#> does not use updated ROI (uROI)
#> does not use fallback integration regions (FIR)
#> is not annotated
new_annotation <- resetFIR(new_annotation)
#> FIR will be reset with uROI values
```
### Add new samples to process
Now that the fit parameters were set on 3 representative samples (e.g. QC), the
same processing can be applied to all study samples. `resetAnnotation()` will
reinitialise all the results and modify the samples or compounds targeted as
required:
```{r}
## new files
new_spectraPaths <- c(system.file('cdf/KO/ko15.CDF', package = "faahKO"),
                      system.file('cdf/WT/wt15.CDF', package = "faahKO"),
                      system.file('cdf/KO/ko16.CDF', package = "faahKO"),
                      system.file('cdf/WT/wt16.CDF', package = "faahKO"),
                      system.file('cdf/KO/ko18.CDF', package = "faahKO"),
                      system.file('cdf/WT/wt18.CDF', package = "faahKO"))
new_spectraPaths
```
Below we define the metadata of these new samples:
```{r}
## new spectra metadata
new_spectraMetadata <- data.frame(matrix(c("KO", "WT", "KO", "WT", "KO", "WT"),
6, 1, dimnames=list(c(), c("Group"))),
stringsAsFactors=FALSE)
```
```{r, results = "asis", echo = FALSE}
# use pandoc for improved readability
new_spectraMetadata <- data.frame(matrix(c("KO", "WT", "KO", "WT", "KO", "WT"),
6, 1, dimnames=list(c(), c("Group"))),
stringsAsFactors=FALSE)
pander::pandoc.table(new_spectraMetadata)
```
```{r, echo=FALSE}
new_annotation <- resetAnnotation(updated_annotation,
spectraPaths=new_spectraPaths,
spectraMetadata=new_spectraMetadata,
useUROI=TRUE, useFIR=TRUE, verbose=FALSE)
```
```{r, eval=FALSE}
## add new samples to the annotation loaded from csv, useUROI, useFIR
new_annotation <- resetAnnotation(new_annotation, spectraPaths=new_spectraPaths,
spectraMetadata=new_spectraMetadata,
useUROI=TRUE, useFIR=TRUE)
#> peakPantheRAnnotation object being reset:
#> Previous "ROI", "cpdID" and "cpdName" value kept
#> Previous "uROI" value kept
#> Previous "FIR" value kept
#> Previous "cpdMetadata" value kept
#> New "spectraPaths" value set
#> New "spectraMetadata" value set
#> Previous "uROIExist" value kept
#> New "useUROI" value set
#> New "useFIR" value set
```
```{r}
new_annotation
```
## Run Final Parallel Annotation
We can now run the final annotation on all samples with the optimised targeted
features:
```{r}
# annotate files serially
new_annotation_result <- peakPantheR_parallelAnnotation(new_annotation,
ncores=0, verbose=FALSE)
# successful fit
nbSamples(new_annotation_result$annotation)
final_annotation <- new_annotation_result$annotation
final_annotation
# list failed fit
new_annotation_result$failures
```
### Output final results
The final fits can be saved to disk with `outputAnnotationDiagnostic()`:
```{r, eval=FALSE}
# create a colourScale based on the Group
uniq_group <- sort(unique(spectraMetadata(final_annotation)$Group),na.last=TRUE)
col_group <- unname( setNames(c('blue', 'red'),
                    c(uniq_group))[spectraMetadata(final_annotation)$Group] )
# create a temporary location to save the diagnostic (otherwise provide the
# path to the selected location)
final_output_folder <- tempdir()
# output fit diagnostic to disk
outputAnnotationDiagnostic(final_annotation, saveFolder=final_output_folder,
savePlots=TRUE, sampleColour=col_group, verbose=TRUE)
```
For each processed sample, `peakTables` contains all the fit information for
all targeted compounds. `annotationTable( , column)` will group the values
across all samples and compounds for any `peakTables` column:
```{r, eval=FALSE}
# peakTables for the first sample
peakTables(final_annotation)[[1]]
```
```{r, results = "asis", echo = FALSE}
# use pandoc for improved readability
pander::pandoc.table(peakTables(final_annotation)[[1]])
```
```{r, eval=FALSE}
# Extract the found peak area for all compounds and all samples
annotationTable(final_annotation, column='peakArea')
```
```{r, results = "asis", echo = FALSE}
# use pandoc for improved readability
pander::pandoc.table(annotationTable(final_annotation, column='peakArea'))
```
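
Conceptually, this grouping turns a list of per-sample tables into a single sample-by-compound matrix. The base R sketch below mimics that reshaping on made-up data; `toy_peakTables` and `collect_column()` are hypothetical names used for illustration and do not reflect how `annotationTable()` is implemented.

```r
# Toy stand-in for peakTables(final_annotation): one data.frame per sample,
# one row per compound (made-up areas)
toy_peakTables <- list(
    data.frame(peakArea = c(100, 200), rt = c(3345, 3386)),
    data.frame(peakArea = c(110, 190), rt = c(3346, 3385)),
    data.frame(peakArea = c( 90, 210), rt = c(3344, 3387))
)
samples   <- c("ko15.CDF", "ko16.CDF", "ko18.CDF")
compounds <- c("ID-1", "ID-2")

# Hypothetical helper: pull one column out of every sample table and stack
# the values as rows (samples) by columns (compounds)
collect_column <- function(tables, column, samples, compounds) {
    out <- do.call(rbind, lapply(tables, function(tbl) tbl[[column]]))
    dimnames(out) <- list(samples, compounds)
    out
}

area <- collect_column(toy_peakTables, "peakArea", samples, compounds)
area
```
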
Finally, all annotation results can be saved to disk as `.csv` with
`outputAnnotationResult()`. These `.csv` files will contain the compound
metadata, spectra metadata and one file for each column of `peakTables` (with
samples as rows and compounds as columns):
```{r, eval=FALSE}
# create a temporary location to save the results (otherwise provide the path
# to the selected location)
final_output_folder <- tempdir()
# save
outputAnnotationResult(final_annotation, saveFolder=final_output_folder,
annotationName='ProjectName', verbose=TRUE)
#> Compound metadata saved at /final_output_folder/ProjectName_cpdMetadata.csv
#> Spectra metadata saved at
#> /final_output_folder/ProjectName_spectraMetadata.csv
#> Peak measurement "found" saved at /final_output_folder/ProjectName_found.csv
#> Peak measurement "rtMin" saved at /final_output_folder/ProjectName_rtMin.csv
#> Peak measurement "rt" saved at /final_output_folder/ProjectName_rt.csv
#> Peak measurement "rtMax" saved at /final_output_folder/ProjectName_rtMax.csv
#> Peak measurement "mzMin" saved at /final_output_folder/ProjectName_mzMin.csv
#> Peak measurement "mz" saved at /final_output_folder/ProjectName_mz.csv
#> Peak measurement "mzMax" saved at /final_output_folder/ProjectName_mzMax.csv
#> Peak measurement "peakArea" saved at
#> /final_output_folder/ProjectName_peakArea.csv
#> Peak measurement "maxIntMeasured" saved at
#> /final_output_folder/ProjectName_maxIntMeasured.csv
#> Peak measurement "maxIntPredicted" saved at
#> /final_output_folder/ProjectName_maxIntPredicted.csv
#> Peak measurement "is_filled" saved at
#> /final_output_folder/ProjectName_is_filled.csv
#> Peak measurement "ppm_error" saved at
#> /final_output_folder/ProjectName_ppm_error.csv
#> Peak measurement "rt_dev_sec" saved at
#> /final_output_folder/ProjectName_rt_dev_sec.csv
#> Peak measurement "tailingFactor" saved at
#> /final_output_folder/ProjectName_tailingFactor.csv
#> Peak measurement "asymmetryFactor" saved at
#> /final_output_folder/ProjectName_asymmetryFactor.csv
#> Summary saved at /final_output_folder/ProjectName_summary.csv
```
# See Also
* [Getting Started with peakPantheR](getting-started.html)
* [Real Time Annotation](real-time-annotation.html)
* [Graphical user interface use](peakPantheR-GUI.html)