---
title: "LC-MS data preprocessing and analysis with xcms"
package: xcms
output:
BiocStyle::html_document:
toc_float: true
includes:
in_header: xcms.bioschemas.html
vignette: >
%\VignetteIndexEntry{LC-MS data preprocessing and analysis with xcms}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\VignetteDepends{xcms,RColorBrewer,faahKO,pander,BiocStyle,pheatmap,SummarizedExperiment}
%\VignettePackage{xcms}
%\VignetteKeywords{mass spectrometry, metabolomics}
bibliography: references.bib
csl: biomed-central.csl
---
```{r biocstyle, echo = FALSE, results = "asis" }
BiocStyle::markdown()
```
**Package**: `r Biocpkg("xcms")`
**Authors**: Johannes Rainer
**Modified**: `r file.info("xcms.Rmd")$mtime`
**Compiled**: `r date()`
```{r init, message = FALSE, echo = FALSE, results = "hide"}
## Silently loading all packages
library(BiocStyle)
library(xcms)
library(faahKO)
library(pander)
register(SerialParam())
```
# Introduction
The `r Biocpkg("xcms")` package provides the functionality to perform the
preprocessing of LC-MS, GC-MS or LC-MS/MS data in which raw signals from mzML,
mzXML or CDF files are processed into *feature* abundances. This preprocessing
includes chromatographic peak detection, sample alignment and correspondence
analysis.
The first version of the package was already published in 2006 [@Smith:2006ic]
and has since been updated and modernized in several rounds to better integrate
it with other R-based packages for the analysis of untargeted metabolomics
data. This includes version 3 of *xcms* that used the `r Biocpkg("MSnbase")`
package for MS data representation [@gattoMSnbaseEfficientElegant2020a]. The
most recent update (*xcms* version 4) enables in addition preprocessing of MS
data represented by the modern `r Biocpkg("MsExperiment")` and
`r Biocpkg("Spectra")` packages which provides an even better integration with
the [RforMassSpectrometry](https://rformassspectrometry.org) R package ecosystem
simplifying e.g. also compound annotation [@rainer_modular_2022].
This document describes data import, exploration and preprocessing of a simple
test LC-MS data set with the *xcms* package version >= 4. The same functions can
be applied to the older *MSnbase*-based workflows (xcms version 3). Additional
documents and tutorials covering also other topics of untargeted metabolomics
analysis are listed at the end of this document. There is also a [xcms
tutorial](https://jorainer.github.io/xcmsTutorials) available with more examples
and details.
To get a complete overview of LCMS-MS analysis, an end-to-end workflow
[Metabonaut website](https://rformassspectrometry.github.io/metabonaut/), which
integrate the *xcms* preprocessing steps with the downstream analysis, is available.
# Preprocessing of LC-MS data
## Data import
*xcms* supports analysis of any LC-MS(/MS) data that can be imported with the
`r Biocpkg("Spectra")` package. Such data will typically be provided in
(AIA/ANDI) NetCDF, mzXML and mzML format but can, through dedicated extensions
to the *Spectra* package, also be imported from other sources, e.g. also
directly from raw data files in manufacturer's formats.
For demonstration purpose we will analyze in this document a small subset of the
data from [@Saghatelian04] in which the metabolic consequences of the knock-out
of the fatty acid amide hydrolase (FAAH) gene in mice was investigated. The raw
data files (in NetCDF format) are provided through the `r Biocpkg("faahKO")`
data package. The data set consists of samples from the spinal cords of 6
knock-out and 6 wild-type mice. Each file contains data in centroid mode
acquired in positive ion polarity from 200-600 m/z and 2500-4500 seconds. To
speed-up processing of this vignette we will restrict the analysis to only 8
files.
Below we load all required packages, locate the raw CDF files within the
*faahKO* package and build a *phenodata* `data.frame` describing the
experimental setup. Generally, such data frames should contain all relevant
experimental variables and sample descriptions (including also the names of the
raw data files) and will be imported into R using either the `read.table()`
function (if the file is in *csv* or tabulator delimited text file format) or
also using functions from the *readxl* R package if it is in Excel file format.
```{r load-libs-pheno, message = FALSE }
library(xcms)
library(faahKO)
library(RColorBrewer)
library(pander)
library(pheatmap)
library(MsExperiment)
## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
recursive = TRUE)[c(1, 2, 5, 6, 7, 8, 11, 12)]
## Create a phenodata data.frame
pd <- data.frame(sample_name = sub(basename(cdfs), pattern = ".CDF",
replacement = "", fixed = TRUE),
sample_group = c(rep("KO", 4), rep("WT", 4)),
stringsAsFactors = FALSE)
```
We next load our data using the `readMsExperiment` function from the
`r Biocpkg("MsExperiment")` package.
```{r}
faahko <- readMsExperiment(spectraFiles = cdfs, sampleData = pd)
faahko
```
The MS spectra data from our experiment is now available as a `Spectra` object
within `faahko`. Note that this `MsExperiment` container could in addition to
spectra data also contain other types of data or also references to other
files. See the vignette from the `r Biocpkg("MsExperiment")` for more
details. Also, when loading data from mzML, mzXML or CDF files, by default only
general spectra data is loaded into memory while the actual *peaks data*,
i.e. the m/z and intensity values are only retrieved on-the-fly from the raw
files when needed (this is similar to the *MSnbase* *on-disk* mode described in
[@gattoMSnbaseEfficientElegant2020a]). This guarantees a low memory footprint
hence allowing to analyze also large experiments without the need of high
performance computing environments. Note that also different alternative
*backends* (and hence data representations) could be used for the `Spectra`
object within `faahko` with eventually even lower memory footprint, or higher
performance. See the package vignette from the `r Biocpkg("Spectra")` package or
the [SpectraTutorials](https://jorainer.github.io/SpectraTutorials) tutorial for
more details on `Spectra` backends and how to change between them.
## Initial data inspection
The `MsExperiment` object is a simple and flexible container for MS
experiments. The *raw* MS data is stored as a `Spectra` object that can be
accessed through the `spectra()` function.
```{r}
spectra(faahko)
```
All spectra are organized *sequentially* (i.e., not by file) but the
`fromFile()` function can be used to get for each spectrum the information to
which of the data files it belongs. Below we simply count the number of spectra
per file.
```{r}
table(fromFile(faahko))
```
Information on samples can be retrieved through the `sampleData()` function.
```{r}
sampleData(faahko)
```
Each row in this `DataFrame` represents one sample (input file). Using `[` it is
possible to subset a `MsExperiment` object **by sample**. Below we subset the
`faahko` to the 3rd sample (file) and access its spectra and sample data.
```{r}
faahko_3 <- faahko[3]
spectra(faahko_3)
sampleData(faahko_3)
```
As a first evaluation of the data we below plot the base peak chromatogram (BPC)
for each file in our experiment. We use the `chromatogram()` method and set the
`aggregationFun` to `"max"` to return for each spectrum the maximal intensity
and hence create the BPC from the raw data. To create a total ion chromatogram
we could set `aggregationFun` to `"sum"`.
```{r data-inspection-bpc, message = FALSE, fig.align = "center", fig.width = 12, fig.height = 6, fig.cap = "Base peak chromatogram."}
## Get the base peak chromatograms. This reads data from the files.
bpis <- chromatogram(faahko, aggregationFun = "max")
## Define colors for the two groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:2], "60")
names(group_colors) <- c("KO", "WT")
## Plot all chromatograms.
plot(bpis, col = group_colors[sampleData(faahko)$sample_group])
```
The `chromatogram()` method returned a `MChromatograms` object that organizes
individual `Chromatogram` objects (which in fact contain the chromatographic
data) in a two-dimensional array: columns represent samples and rows
(optionally) m/z and/or retention time ranges. Below we extract the chromatogram
of the first sample and access its retention time and intensity values.
```{r data-inspection-chromatogram, message = FALSE }
bpi_1 <- bpis[1, 1]
rtime(bpi_1) |> head()
intensity(bpi_1) |> head()
```
From the BPC above it seems that after around 4200 seconds no signal is measured
anymore. Thus, we filter below the full data set to a retention time range from
2550 to 4250 seconds using the `filterRt()` function. Note that at present this
will only subset the spectra within the `MsExperiment`. Subsequently we
re-create also the BPC.
```{r}
faahko <- filterRt(faahko, rt = c(2550, 4250))
## creating the BPC on the subsetted data
bpis <- chromatogram(faahko, aggregationFun = "max")
```
We next create boxplots representing the distribution of the total ion currents
per data file. Such plots can be very useful to spot potentially problematic MS
runs. To extract this information, we use the `tic()` function on the `Spectra`
object within `faahko` and split the values by file using `fromFile()`.
```{r data-inspection-tic-boxplot, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 4, fig.cap = "Distribution of total ion currents per file." }
## Get the total ion current by file
tc <- spectra(faahko) |>
tic() |>
split(f = fromFile(faahko))
boxplot(tc, col = group_colors[sampleData(faahko)$sample_group],
ylab = "intensity", main = "Total ion current")
```
In addition, we can also cluster the samples based on similarity of their base
peak chromatograms. Samples would thus be grouped based on similarity of their
LC runs. For that we need however to *bin* the data along the retention time
axis, since retention times will generally differ between samples. Below we use
the `bin()` function on the BPC to bin intensities into 2 second wide retention
time bins. The clustering is then performed using complete linkage hierarchical
clustering on the pairwise correlations of the binned base peak chromatograms.
```{r data-inspection-bpc-heatmap, message = FALSE, fig.align = "center", fig.width = 7, fig.height = 6, fig.cap = "Grouping of samples based on similarity of their base peak chromatogram."}
## Bin the BPC
bpis_bin <- bin(bpis, binSize = 2)
## Calculate correlation on the log2 transformed base peak intensities
cormat <- cor(log2(do.call(cbind, lapply(bpis_bin, intensity))))
colnames(cormat) <- rownames(cormat) <- bpis_bin$sample_name
## Define which phenodata columns should be highlighted in the plot
ann <- data.frame(group = bpis_bin$sample_group)
rownames(ann) <- bpis_bin$sample_name
## Perform the cluster analysis
pheatmap(cormat, annotation = ann,
annotation_color = list(group = group_colors))
```
The samples cluster in a pairwise manner, with the KO and WT samples for the
same sample index having the most similar BPC.
## Chromatographic peak detection
Chromatographic peak detection aims at identifying all signal in each sample
created from ions of the same originating compound species. Chromatographic peak
detection can be performed in *xcms* with the `findChromPeaks()` function and a
*parameter* object which defines and configures the algorithm that should be
used (see `?findChromPeaks` for a list of supported algorithms). Before running
any peak detection it is however strongly suggested to first visually inspect
the extracted ion chromatogram of e.g. internal standards or compounds known to
be present in the samples in order to evaluate and adapt the settings of the
peak detection algorithm since the default settings will not be appropriate for
most LC-MS setups.
Below we extract the EIC for one compound using the `chromatogram()` function by
specifying in addition the m/z and retention time range where we would expect
the signal for that compound.
```{r peak-detection-plot-eic, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 5, fig.cap = "Extracted ion chromatogram for one peak." }
## Define the rt and m/z range of the peak area
rtr <- c(2700, 2900)
mzr <- c(334.9, 335.1)
## extract the chromatogram
chr_raw <- chromatogram(faahko, mz = mzr, rt = rtr)
plot(chr_raw, col = group_colors[chr_raw$sample_group])
```
Note that `Chromatogram` objects extracted by the `chromatogram()` method
contain an `NA` value if in a certain scan (i.e. in a spectrum for a specific
retention time) no signal was measured in the respective m/z range. This is
reflected by the lines not being drawn as continuous lines in the plot above.
The peak above has thus a width of about 50 seconds. We can use this information
to define the `peakwidth` parameter of the *centWave* peak detection method
[@Tautenhahn:2008fx] that we will use for chromatographic peak detection on our
data. The `peakwidth` parameter allows to define the expected lower and upper
width (in retention time dimension) of chromatographic peaks. For our data we
set it thus to `peakwidth = c(20, 80)`. The second important parameter for
*centWave* is `ppm` which defines the expected maximum deviation of m/z values
of the centroids that should be included into one chromatographic peak (note
that this is **not** the precision of m/z values provided by the MS instrument
manufacturer).
For the `ppm` parameter we extract the full MS data (intensity, retention time
and m/z values) corresponding to the above peak. To this end we first filter the
raw object by retention time, then by m/z and finally plot the object with `type
= "XIC"` to produce the plot below. We use the *pipe* (`|>`) operator to better
illustrate the corresponding workflow.
```{r peak-detection-plot-ms-data, message = FALSE, warning = FALSE, fig.aligh = "center", fig.width = 14, fig.height = 14, fig.cap = "Visualization of the raw MS data for one peak. For each plot: upper panel: chromatogram plotting the intensity values against the retention time, lower panel m/z against retention time plot. The individual data points are colored according to the intensity." }
faahko |>
filterRt(rt = rtr) |>
filterMz(mz = mzr) |>
plot(type = "XIC")
```
In the present data there is actually no variation in the m/z values. Usually
the m/z values of the individual centroids (lower panel) in the plots above
would scatter around the *real* m/z value of the compound (in an intensity
dependent manner).
The first step of the *centWave* algorithm defines *regions of interest* (ROI)
that are subsequently screened for the presence of chromatographic peaks. These
ROIs are defined based on the difference of m/z values of centroids from
consecutive scans (spectra). In detail, centroids from consecutive scans are
included into a ROI if the difference between their m/z and the mean m/z of the
ROI is smaller than the user defined `ppm` parameter. A reasonable choice for
the `ppm` could thus be the maximal m/z difference of data points from
neighboring scans/spectra that are part of a chromatographic peak for an
internal standard of known compound. It is suggested to inspect the ranges of
m/z values for several compounds (either internal standards or compounds known
to be present in the sample) and define the `ppm` parameter for *centWave*
according to these. See also this
[tutorial](https://jorainer.github.io/xcmsTutorials) for additional
information and examples on choosing and testing peak detection settings.
Chromatographic peak detection can also be performed on extracted ion
chromatograms, which helps testing and tuning peak detection settings. Note
however that peak detection on EICs does not involve the first step of
*centWave* described above and will thus **not** consider the `ppm`
parameter. Also, since less data is available to the algorithms, background
signal estimation is performed differently and different settings for `snthresh`
will need to be used (generally a lower `snthresh` will be used for EICs since
the estimated background signal tends to be higher for data subsets than for the
full data). Below we perform the peak detection with the `findChromPeaks()`
function on the EIC generated above. The submitted *parameter* object defines
which algorithm will be used and allows to define the settings for this
algorithm. We use a `CentWaveParam` parameter object to use and configure the
*centWave* algorithm with default settings, except for `snthresh`.
```{r peak-detection-eic, message = FALSE}
xchr <- findChromPeaks(chr_raw, param = CentWaveParam(snthresh = 2))
```
We can access the identified chromatographic peaks with the `chromPeaks()`
function.
```{r peak-detection-eic-chromPeaks}
chromPeaks(xchr)
```
Parallel to the `chromPeaks()` matrix there is also a `chromPeakData()` data
frame that allows to add arbitrary annotations to each chromatographic peak,
such as e.g. the MS level in which the peak was detected:
```{r peak-detection-chromatogram-chromPeakData}
chromPeakData(xchr)
```
Below we plot the EIC along with all identified chromatographic peaks using the
`plot()` function on the result object from above. Additional parameters
`peakCol` and `peakBg` allow to define a foreground and background (fill) color
for each identified chromatographic peak in the provided result object (i.e., we
need to define one color for each row of `chromPeaks(xchr)` - column `"column"`
(or `"sample"` if present) in that peak matrix specifies the sample in which the
peak was identified).
```{r peak-detection-eic-plot, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 8, fig.cap = "Signal for an example peak. Red and blue colors represent KO and wild type samples, respectively. Peak area of identified chromatographic peaks are highlighted in the sample group color."}
## Define a color for each sample
sample_colors <- group_colors[xchr$sample_group]
## Define the background color for each chromatographic peak
bg <- sample_colors[chromPeaks(xchr)[, "column"]]
## Parameter `col` defines the color of each sample/line, `peakBg` of each
## chromatographic peak.
plot(xchr, col = sample_colors, peakBg = bg)
```
If we are happy with the settings we can use them for the peak detection on the
full data set (i.e. on the `MsExperiment` object with the data for the whole
experiment). Note that below we set the argument `prefilter` to `c(6, 5000)` and
`noise` to `5000` to reduce the run time of this vignette. With this setting we
consider only ROIs with at least 6 centroids with an intensity larger than 5000
for the *centWave* chromatographic peak detection.
```{r peak-detection-centwave, message = FALSE, results = "hide" }
cwp <- CentWaveParam(peakwidth = c(20, 80), noise = 5000,
prefilter = c(6, 5000))
faahko <- findChromPeaks(faahko, param = cwp)
```
The results of `findChromPeaks()` on a `MsExperiment` object are returned as an
`XcmsExperiment` object. This object extends `MsExperiment` directly (hence
providing the same access to all raw data) and contains all *xcms*
preprocessing results. Note also that additional rounds of chromatographic peak
detections could be performed and their results being added to existing peak
detection results by additional calls to `findChromPeaks()` on the result object
and using parameter `add = TRUE`.
The `chromPeaks` function can also here be used to access the results from the
chromatographic peak detection. Below we show the first 6 identified
chromatographic peaks.
```{r peak-detection-chromPeaks, message = FALSE }
chromPeaks(faahko) |>
head()
```
Columns of this `chromPeaks()` matrix might differ depending on the used peak
detection algorithm. Columns that all algorithms have to provide are: `"mz"`,
`"mzmin"`, `"mzmax"`, `"rt"`, `"rtmin"` and `"rtmax"` that define the m/z and
retention time range of the chromatographic peak (i.e. all mass peaks within
that area are used to integrate the peak signal) as well as the m/z and
retention time of the peak apex. Other mandatory columns are `"into"` and
`"maxo"` with the absolute integrated peak signal and the maximum peak
intensity. Finally, `"sample"` provides the index of the sample in which the
peak was identified.
Additional annotations for each individual peak can be extracted with the
`chromPeakData()` function. This data frame could also be used to add/store
arbitrary annotations for each detected peak (that don't necessarily need to be
numeric).
```{r peak-detection-chromPeakData}
chromPeakData(faahko)
```
Peak detection will not always work perfectly for all types of peak shapes
present in the data set leading to peak detection artifacts, such as (partially
or completely) overlapping peaks or artificially split peaks (common issues
especially for *centWave*). *xcms* provides the `refineChromPeaks()` function
that can be called on peak detection results in order to *refine* (or clean)
peak detection results by either removing identified peaks not passing a certain
criteria or by merging artificially split or partially or completely overlapping
chromatographic peaks. Different algorithms are available that can again be
configured with their respective parameter objects: `CleanPeaksParam` and
`FilterIntensityParam` allow to remove peaks with their retention time range or
intensity being below a specified threshold, respectively. With
`MergeNeighboringPeaksParam` it is possible to merge chromatographic peaks and
hence remove many of the above mentioned (*centWave*) peak detection
artifacts. See also `?refineChromPeaks` for more information and help on the
different methods.
Below we post-process the peak detection results merging peaks that overlap in a
4 second window per file and for which the signal between them is lower than 75%
of the smaller peak's maximal intensity. See the `?MergeNeighboringPeaksParam`
help page for a detailed description of the settings and the approach.
```{r peak-postprocessing, message = FALSE}
mpp <- MergeNeighboringPeaksParam(expandRt = 4)
faahko_pp <- refineChromPeaks(faahko, mpp)
```
An example for a merged peak is given below.
```{r peak-postprocessing-merged, fig.widht = 10, fig.height = 5, fig.cap = "Result from the peak refinement step. Left: data before processing, right: after refinement. The splitted peak was merged into one."}
mzr_1 <- 305.1 + c(-0.01, 0.01)
chr_1 <- chromatogram(faahko[1], mz = mzr_1)
chr_2 <- chromatogram(faahko_pp[1], mz = mzr_1)
par(mfrow = c(1, 2))
plot(chr_1)
plot(chr_2)
```
*centWave* thus detected originally 3 chromatographic peaks in the m/z slice
(left panel in the plot above) and peak refinement with
`MergeNeighboringPeaksParam` and the specified settings merged the first two
peaks into a single one (right panel in the figure above). Other close peaks,
with a lower intensity between them, were however not merged (see below).
```{r peak-postprocessing-not-merged, fig.widht = 10, fig.height = 5, fig.cap = "Result from the peak refinement step. Left: data before processing, right: after refinement. The peaks were not merged."}
mzr_1 <- 496.2 + c(-0.01, 0.01)
chr_1 <- chromatogram(faahko[1], mz = mzr_1)
chr_2 <- chromatogram(faahko_pp[1], mz = mzr_1)
par(mfrow = c(1, 2))
plot(chr_1)
plot(chr_2)
```
It is also possible to perform the peak refinement on extracted ion
chromatograms. This could again be used to test and fine-tune the settings for
the parameter and to avoid potential problematic behavior. The `minProp`
parameter for example has to be carefully chosen to avoid merging of isomer
peaks (like in the example above). With the default `minProp = 0.75` only peaks
are merged if the signal between the two peaks is **higher** than 75% of the
smaller peak's maximal intensity. Setting this value too low could eventually
result in merging of isomers as shown below.
```{r peak-postprocessing-chr, fig.width = 5, fig.height = 5}
#' Too low minProp could cause merging of isomers!
res <- refineChromPeaks(chr_1, MergeNeighboringPeaksParam(minProp = 0.05))
chromPeaks(res)
plot(res)
```
Thus, before running such a peak refinement evaluate that isomers present in the
data set were not wrongly merged based on the chosen settings.
Before proceeding we next replace the `faahko` object with the results from the
peak refinement step.
```{r}
faahko <- faahko_pp
```
Below we use the data from the `chromPeaks()` matrix to calculate per-file
summaries of the peak detection results, such as the number of peaks per file as
well as the distribution of the retention time widths.
```{r peak-detection-peaks-per-sample, message = FALSE, results = "asis" }
summary_fun <- function(z)
c(peak_count = nrow(z), rt = quantile(z[, "rtmax"] - z[, "rtmin"]))
T <- chromPeaks(faahko) |>
split.data.frame(f = chromPeaks(faahko)[, "sample"]) |>
lapply(FUN = summary_fun) |>
do.call(what = rbind)
rownames(T) <- basename(fileNames(faahko))
pandoc.table(
T,
caption = paste0("Summary statistics on identified chromatographic",
" peaks. Shown are number of identified peaks per",
" sample and widths/duration of chromatographic ",
"peaks."))
```
While by default `chromPeaks()` will return all identified chromatographic peaks
in a result object it is also possible to extract only chromatographic peaks for
a specified m/z and/or rt range:
```{r peak-detection-chrom-peak-table-selected, message = FALSE}
chromPeaks(faahko, mz = c(334.9, 335.1), rt = c(2700, 2900))
```
We can also plot the location of the identified chromatographic peaks in the
m/z - retention time space for one file using the `plotChromPeaks()`
function. Below we plot this information for the third sample.
```{r peak-detection-chrom-peaks-plot, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 8, fig.cap = "Identified chromatographic peaks in the m/z by retention time space for one sample." }
plotChromPeaks(faahko, file = 3)
```
As a general overview of the peak detection results we can in addition visualize
the number of identified chromatographic peaks per file along the retention time
axis. Parameter `binSize` allows to define the width of the bins in rt dimension
in which peaks should be counted. This number of chromatographic peaks within
each bin is then shown color-coded in the resulting plot.
```{r peak-detection-chrom-peak-image, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 8, fig.cap = "Frequency of identified chromatographic peaks along the retention time axis. The frequency is color coded with higher frequency being represented by yellow-white. Each line shows the peak frequency for one file." }
plotChromPeakImage(faahko, binSize = 10)
```
Note that extracting ion chromatorams from an *xcms* result object will in
addition to the chromatographic data also extract any identified chromatographic
peaks within that region. This can thus also be used to validate and verify that
the used peak detection settings identified e.g. peaks for known compounds or
internal standards properly. Below we extract the ion chromatogram for the m/z -
rt region above and access the detected peaks in that region using the
`chromPeaks()` function.
```{r peak-detection-eic-example-peak, message = FALSE}
chr_ex <- chromatogram(faahko, mz = mzr, rt = rtr)
chromPeaks(chr_ex)
```
We can also plot this extracted ion chromatogram which will also visualize all
identified chromatographic peaks in that region.
```{r peak-detection-highlight-chrom-peaks-plot-polygon, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 8, fig.cap = "Signal for an example peak. Red and blue colors represent KO and wild type samples, respectively. The signal area of identified chromatographic peaks are filled with a color." }
sample_colors <- group_colors[chr_ex$sample_group]
plot(chr_ex, col = group_colors[chr_raw$sample_group], lwd = 2,
peakBg = sample_colors[chromPeaks(chr_ex)[, "sample"]])
```
Chromatographic peaks can be visualized also in other ways: by setting `peakType
= "rectangle"` the they are indicated with a rectangle instead of the default
highlighting option (`peakType = "polygon"`) used above. As a third alternative
it would also possible to just indicate the apex position for each identified
chromatographic peak with a single point (`peakType = "point"`). Below we plot
the data again using `peakType = "rectangle"`.
```{r peak-detection-highlight-chrom-peaks-plot, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 8, fig.cap = "Signal for an example peak. Red and blue colors represent KO and wild type samples, respectively. The rectangles indicate the identified chromatographic peaks per sample." }
plot(chr_ex, col = sample_colors, peakType = "rectangle",
peakCol = sample_colors[chromPeaks(chr_ex)[, "sample"]],
peakBg = NA)
```
Finally we plot also the distribution of peak intensity per sample. This allows
to investigate whether systematic differences in peak signals between samples
are present.
```{r peak-detection-chrom-peak-intensity-boxplot, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 8, fig.cap = "Peak intensity distribution per sample." }
## Extract a list of per-sample peak intensities (in log2 scale)
ints <- split(log2(chromPeaks(faahko)[, "into"]),
f = chromPeaks(faahko)[, "sample"])
boxplot(ints, varwidth = TRUE, col = sample_colors,
ylab = expression(log[2]~intensity), main = "Peak intensities")
grid(nx = NA, ny = NULL)
```
Over and above the signal of the identified chromatographic peaks is comparable
across samples, with the exception of samples 3, 4 and, to some degree, also 7
that show slightly higher average intensities, but also a lower number of
detected peaks (indicated by the smaller width of the boxes).
Note that in addition to the above described identification of chromatographic
peaks, it is also possible to *manually* define and add chromatographic peaks
with the `manualChromPeaks()` function (see `?manualChromPeaks` help page for
more information).
## Alignment
The time at which analytes elute in the chromatography can vary between samples
(and even compounds). Such differences were already visible in the BPC and even
the extracted ion chromatogram plot in the previous section. The alignment
step, also referred to as retention time correction, aims to adjust these
differences by shifting signals along the retention time axis and aligning them
between different samples within an experiment.
A plethora of alignment algorithms exist (see [@Smith:2013gr]), with some of
them being also implemented in *xcms*. Alignment of LC-MS data can be performed
in *xcms* using the `adjustRtime()` method and an algorithm-specific parameter
class (see `?adjustRtime` for an overview of available methods in *xcms*).
In the example below we use the *obiwarp* method [@Prince:2006jj] to align the
samples. We use a `binSize = 0.6` which creates warping functions in m/z bins of
0.6. Also here it is advisable to modify and adapt the settings for each
experiment.
```{r alignment-obiwarp, message = FALSE, results = "hide" }
faahko <- adjustRtime(faahko, param = ObiwarpParam(binSize = 0.6))
```
Note that `adjustRtime()`, besides calculating adjusted retention times for each
spectrum, adjusts also the retention times of the identified chromatographic
peaks in the *xcms* result object. Adjusted retention times of individual
spectra can be extracted from the result object using either the
`adjustedRtime()` function or using `rtime()` with parameter `adjusted = TRUE`
(the default):
```{r alignment-rtime, message = FALSE }
## Extract adjusted retention times
adjustedRtime(faahko) |> head()
## Or simply use the rtime method
rtime(faahko) |> head()
## Get raw (unadjusted) retention times
rtime(faahko, adjusted = FALSE) |> head()
```
To evaluate the impact of the alignment we plot the BPC on the adjusted data. In
addition we plot also the differences between the adjusted and the raw retention
times per sample using the `plotAdjustedRtime()` function. To disable the
automatic extraction of all identified chromatographic peaks by the
`chromatogram()` function (which would not make much sense for a BPC) we use
`chromPeaks = "none"` below.
```{r alignment-obiwarp-plot, message = FALSE, fig.align = "center", fig.width = 12, fig.height = 8, fig.cap = "Obiwarp aligned data. Base peak chromatogram before (top) and after alignment (middle) and difference between adjusted and raw retention times along the retention time axis (bottom)." }
## Get the base peak chromatograms.
bpis_adj <- chromatogram(faahko, aggregationFun = "max", chromPeaks = "none")
par(mfrow = c(3, 1), mar = c(4.5, 4.2, 1, 0.5))
plot(bpis, col = sample_colors)
grid()
plot(bpis_adj, col = sample_colors)
grid()
## Plot also the difference of adjusted to raw retention time.
plotAdjustedRtime(faahko, col = sample_colors)
grid()
```
Too large differences between adjusted and raw retention times could indicate
poorly performing samples or alignment.
At last we evaluate also the impact of the alignment on the test peak.
```{r alignment-peak-groups-example-peak, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 10, fig.cap = "Example extracted ion chromatogram before (top) and after alignment (bottom)." }
par(mfrow = c(2, 1))
## Plot the raw data
plot(chr_raw, col = sample_colors)
grid()
## Extract the chromatogram from the adjusted object
chr_adj <- chromatogram(faahko, rt = rtr, mz = mzr)
plot(chr_adj, col = sample_colors, peakType = "none")
grid()
```
**Note**: *xcms* result objects (`XcmsExperiment` as well as `XCMSnExp`) contain
both the raw and the adjusted retention times for all spectra and subset
operation will in many cases drop adjusted retention times. Thus it might
sometimes be useful to immediately **replace** the raw retention times in the
data using the `applyAdjustedRtime()` function.
### Subset-based alignment
For some experiments it might be better to perform the alignment based on only a
subset of the available samples, e.g. if pooled QC samples were injected at
regular intervals or if the experiment contains blanks. All alignment methods in
*xcms* support such a subset-based alignment in which retention time shifts are
estimated on only a specified subset of samples followed by an alignment of the
whole data set based on the aligned subset.
The subset of samples for such an alignment can be specified with the parameter
`subset` of the `PeakGroupsParam` or `ObiwarpParam` object. Parameter
`subsetAdjust` allows to specify the *method* by which the *left-out* samples
will be adjusted. There are currently two options available:
- `subsetAdjust = "previous"`: adjust the retention times of a non-subset
sample based on the alignment results of the previous subset sample (e.g. a
QC sample). If samples are e.g. in the order *A1*, *B1*, *B2*, *A2*, *B3*,
*B4* with *A* representing QC samples and *B* study samples, using
`subset = c(1, 4)` and `subsetAdjust = "previous"` would result in all *A*
samples to be aligned with each other and non-subset samples *B1* and *B2*
being adjusted based on the alignment result of subset samples *A1* and *B3*
and *B4* on those of *A2*.
- `subsetAdjust = "average"`: adjust retention times of non-subset samples based
on an interpolation of the alignment results of the previous and subsequent
subset sample. In the example above, *B1* would be adjusted based on the
average of adjusted retention times between subset (QC) samples *A1* and
*A2*. Since there is no subset sample after non-subset samples *B3* and *B4*
these will be adjusted based on the alignment results of *A2* alone. Note
that a weighted average is used to calculate the adjusted retention time
averages, which uses the inverse of the difference of the index of the
non-subset sample to the subset samples as weights. Thus, if we have a
setup like *A1*, *B1*, *B2*, *A2* the adjusted retention times of *A1*
would get a larger weight than those of *A2* in the adjustment of
non-subset sample *B1* causing it's adjusted retention times to be closer
to those of *A1* than to *A2*. See below for examples.
**Important**: both cases require a meaningful/correct ordering of the samples
within the object (i.e., samples should be ordered by injection index).
The examples below aim to illustrate the effect of these alignment options. We
assume samples 1, 4 and 7 in the *faahKO* data set to be QC pool samples. We
thus want to perform the alignment based on these samples and subsequently
adjust the retention times of the left-out samples (2, 3, 5, 6 and 8) based on
interpolation of the results from the neighboring *subset* (QC) samples. After
initial peak grouping we perform the subset-based alignment with the *peak
groups* method by passing the indices of the QC samples with the `subset`
parameter to the `PeakGroupsParam` function and set `subsetAdjust = "average"`
to adjust the study samples based on interpolation of the alignment results from
neighboring subset/QC samples.
Note that for any subset-alignment all parameters such as `minFraction` are
relative to the `subset`, not the full experiment!
Below we first remove any previous alignment results with the
`dropAdjustedRtime()` function to allow a fresh alignment using the subset-based
option outlined above. In addition to removing adjusted retention times for all
spectra, this function will also *restore* the original retention times for
identified chromatographic peaks.
```{r subset-define, message = FALSE, warning = FALSE}
faahko <- dropAdjustedRtime(faahko)
## Define the experimental layout
sampleData(faahko)$sample_type <- "study"
sampleData(faahko)$sample_type[c(1, 4, 7)] <- "QC"
```
For an alignment with the *peak groups* method an initial peak grouping
(correspondence) analysis is required, because the algorithm estimates retention
times shifts between samples using the retention times of *hook peaks*
(i.e. chromatographic peaks present in most/all samples). Here we use the
default settings for an *peak density* method-based correspondence, but it is
strongly advised to adapt the parameters for each data set (details in the next
section). The definition of the sample groups (i.e. assignment of individual
samples to the sample groups in the experiment) is mandatory for the
`PeakDensityParam`. If there are no sample groups in the experiment,
`sampleGroups` should be set to a single value for each file (e.g. `rep(1,
length(fileNames(faahko))`).
```{r alignment-subset, message = FALSE, warning = FALSE}
## Initial peak grouping. Use sample_type as grouping variable
pdp_subs <- PeakDensityParam(sampleGroups = sampleData(faahko)$sample_type,
minFraction = 0.9)
faahko <- groupChromPeaks(faahko, param = pdp_subs)
## Define subset-alignment options and perform the alignment
pgp_subs <- PeakGroupsParam(
minFraction = 0.85,
subset = which(sampleData(faahko)$sample_type == "QC"),
subsetAdjust = "average", span = 0.4)
faahko <- adjustRtime(faahko, param = pgp_subs)
```
Below we plot the results of the alignment highlighting the subset samples in
green. This nicely shows how the interpolation of the `subsetAdjust = "average"`
works: retention times of sample 2 are adjusted based on those from subset
sample 1 and 4, giving however more weight to the closer subset sample 1 which
results in the adjusted retention times of 2 being more similar to those of
sample 1. Sample 3 on the other hand gets adjusted giving more weight to the
second subset sample (4).
```{r alignment-subset-plot-2, message = FALSE, warning = FALSE, fig.align = "center", fig.width = 10, fig.height = 10, fig.cap = "Subset-alignment results with option average. Difference between adjusted and raw retention times along the retention time axis. Samples on which the alignment models were estimated are shown in green, study samples in grey." }
clrs <- rep("#00000040", 8)
clrs[sampleData(faahko)$sample_type == "QC"] <- c("#00ce0080")
par(mfrow = c(2, 1), mar = c(4, 4.5, 1, 0.5))
plot(chromatogram(faahko, aggregationFun = "max", chromPeaks = "none"),
col = clrs)
grid()
plotAdjustedRtime(faahko, col = clrs, peakGroupsPch = 1,
peakGroupsCol = "#00ce0040")
grid()
```
Option `subsetAdjust = "previous"` would adjust the retention times of a
non-subset sample based on a single subset sample (the previous), which results
in most cases in the adjusted retention times of the non-subset sample being
highly similar to those of the subset sample which was used for adjustment.
## Correspondence
Correspondence is usually the final step in LC-MS data preprocessing in which
data, presumably representing signal from the same originating ions, is matched
across samples. As a result, chromatographic peaks from different samples with
similar m/z and retention times get grouped into LC-MS *features*. The function
to perform the correspondence in *xcms* is called `groupChromPeaks()` that again
supports different algorithms which can be selected and configured with a
specific parameter object (see `?groupChromPeaks` for an overview). For our
example we will use the *peak density* method [@Smith:2006ic] that, within small
slices along the m/z dimension, combines chromatographic peaks depending on the
density of these peaks along the retention time axis. To illustrate this, we
*simulate* below the peak grouping for an m/z slice containing multiple
chromatoghaphic peaks within each sample using the `plotChromPeakDensity()`
function and a `PeakDensityParam` object with parameter `minFraction = 0.4`
(features are only defined if in at least 40% of samples a chromatographic peak
was present) - parameter `sampleGroups` is used to define to which sample group
each sample belongs.
```{r correspondence-example-slice, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 10, fig.cap = "Example for peak density correspondence. Upper panel: chromatogram for an mz slice with multiple chromatographic peaks. lower panel: identified chromatographic peaks at their retention time (x-axis) and index within samples of the experiments (y-axis) for different values of the bw parameter. The black line represents the peak density estimate. Grouping of peaks (based on the provided settings) is indicated by grey rectangles." }
## Define the mz slice.
mzr <- c(305.05, 305.15)
## Extract and plot the chromatograms
chr_mzr <- chromatogram(faahko, mz = mzr)
## Define the parameters for the peak density method
pdp <- PeakDensityParam(sampleGroups = sampleData(faahko)$sample_group,
minFraction = 0.4, bw = 30)
plotChromPeakDensity(chr_mzr, col = sample_colors, param = pdp,
peakBg = sample_colors[chromPeaks(chr_mzr)[, "sample"]],
peakCol = sample_colors[chromPeaks(chr_mzr)[, "sample"]],
peakPch = 16)
```
The upper panel in the plot shows the extracted ion chromatogram for each sample
with the detected peaks highlighted. The retention times for the individual
chromatographic peaks in each sample (y-axis being the index of the sample in
the data set) are shown in the lower panel with the solid black line
representing the density estimate for the distribution of detected peaks along
the retention time. Parameter `bw` defines the *smoothness* of this
estimation. The grey rectangles indicate which chromatographic peaks would be
grouped into a feature (each grey rectangle thus representing one feature). This
type of visualization is thus ideal to test, validate and optimize
correspondence settings on manually defined m/z slices before applying them to
the full data set. For the tested m/z slice the settings seemed to be OK and we
are thus applying them to the full data set below. Especially the parameter `bw`
will be very data set dependent (or more specifically LC-dependent) and should
be adapted to each data set.
Another important parameter is `binSize` that defines the size of the m/z slices
(bins) within which peaks are being grouped. This parameter thus defines the
required similarity in m/z values for the chromatographic peaks that are then
assumed to represent signal from the same (type of ion of a) compound and hence
evaluated for grouping. By default, a constant m/z bin size is used, but by
changing parameter `ppm` to a value larger than 0, m/z-relative bin sizes would
be used instead (i.e., the bin size will increase with the m/z value hence
better representing the measurement error/precision of some MS instruments). The
bin sizes (and subsequently the m/z width of the defined features) would then
reach a maximal value of `binSize` plus `ppm` parts-per-million of the largest
m/z value of any chromatographic peak in the data set.
See also the [xcms
tutorial](https://jorainer.github.io/xcmsTutorials) for more examples and
details.
```{r correspondence, message = FALSE }
## Perform the correspondence using fixed m/z bin sizes.
pdp <- PeakDensityParam(sampleGroups = sampleData(faahko)$sample_group,
minFraction = 0.4, bw = 30)
faahko <- groupChromPeaks(faahko, param = pdp)
```
As an alternative we perform the correspondence using m/z relative bin sizes.
```{r}
## Drop feature definitions and re-perform the correspondence
## using m/z-relative bin sizes.
faahko_ppm <- groupChromPeaks(
dropFeatureDefinitions(faahko),
PeakDensityParam(sampleGroups = sampleData(faahko)$sample_group,
minFraction = 0.4, bw = 30, ppm = 10))
```
The results will be *mostly* similar, except for the higher m/z range (in which
larger m/z bins will be used). Below we plot the m/z range for features against
their median m/z. For the present data set (acquired with a triple quad
instrument) no clear difference can be seen for the two approaches hence we
proceed the analysis with the fixed bin size setting. A stronger relationship
would be expected for example for data measured on TOF instruments.
```{r, fig.cap = "Relationship between a feature's m/z and the m/z width (max - min m/z) of the feature. Red points represent the results with the fixed m/z bin size, blue with the m/z-relative bin size."}
## Calculate m/z width of features
mzw <- featureDefinitions(faahko)$mzmax - featureDefinitions(faahko)$mzmin
mzw_ppm <- featureDefinitions(faahko_ppm)$mzmax -
featureDefinitions(faahko_ppm)$mzmin
plot(featureDefinitions(faahko_ppm)$mzmed, mzw_ppm,
xlab = "m/z", ylab = "m/z width", pch = 21,
col = "#0000ff20", bg = "#0000ff10")
points(featureDefinitions(faahko)$mzmed, mzw, pch = 21,
col = "#ff000020", bg = "#ff000010")
```
Results from the correspondence analysis can be accessed with the
`featureDefinitions()` and `featureValues()` function. The former returns a data
frame with general information on each of the defined features, with each row
being one feature and columns providing information on the median m/z and
retention time as well as the indices of the chromatographic peaks assigned to
the feature in column `"peakidx"`. Below we show the information on the first 6
features.
```{r}
featureDefinitions(faahko) |> head()
```
The `featureValues()` function returns a `matrix` with rows being features and
columns samples. The content of this matrix can be defined using the `value`
argument which can be any column name in the `chromPeaks()` matrix. With the
default `value = "into"` a matrix with the integrated signal of the peaks
corresponding to a feature in a sample are returned. This is then generally used
as the intensity matrix for downstream analysis. Below we extract the
intensities for the first 6 features.
```{r correspondence-feature-values, message = FALSE }
featureValues(faahko, value = "into") |> head()
```
As we can see we have several missing values in this feature matrix. Missing
values are reported if in one sample no chromatographic peak was detected in the
m/z - rt region of the feature. This does however not necessarily mean that
there is no signal for that specific ion in that sample. The chromatographic
peak detection algorithm could also just have failed to identify any peak in
that region, e.g. because the signal was too noisy or too low. Thus it is
advisable to perform, after correspondence, also a gap-filling (see next
section).
The performance of peak detection, alignment and correspondence should always be
evaluated by inspecting extracted ion chromatograms e.g. of known compounds,
internal standards or identified features in general. The
`featureChromatograms()` function allows to extract chromatograms for each
feature present in `featureDefinitions()`. The returned `MChromatograms` object
contains an ion chromatogram for each feature (each row containing the data for
one feature) and sample (each column representing containing data for one
sample). Parameter `features` allows to define specific features for which the
EIC should be returned. These can be specified with their index or their ID
(i.e. their row name in the `featureDefinitions()` data frame. If `features` is
not defined, EICs are returned for **all** features in a data set, which can
take also a considerable amount of time. Below we extract the chromatograms for
the first 4 features.
```{r featureChromatograms, message = FALSE }
feature_chroms <- featureChromatograms(faahko, features = 1:4)
feature_chroms
```
And plot the extracted ion chromatograms. We again use the group color for each
identified peak to fill the area.
```{r feature-eic, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 8, fig.cap = "Extracted ion chromatograms for features 1 to 4." }
plot(feature_chroms, col = sample_colors,
peakBg = sample_colors[chromPeaks(feature_chroms)[, "sample"]])
```
To access the EICs of the second feature we can simply subset the
`feature_chroms` object.
```{r}
eic_2 <- feature_chroms[2, ]
chromPeaks(eic_2)
```
## Gap filling
Missing values in LC-MS data are in many cases the result of the chromatographic
peak detection algorithm failing to identify peaks (because of noisy or low
intensity signal). The aim of the gap filling step is to reduce the number of
such missing values by integrating signals from the original data files for
samples in which no chromatographic peak was found from the m/z - rt region
where signal from the ion is expected. Gap filling can be performed in *xcms*
with the `fillChromPeaks()` function and a parameter object selecting and
configuring the gap filling algorithm. The method of choice is
`ChromPeakAreaParam` that integrates the signal (in samples in which no
chromatographic peak was found for a feature) in the m/z - rt region that is
defined based on the m/z and retention time ranges of all detected
chromatographic peaks of that specific feature. The lower m/z limit of the area
is defined as the lower quartile (25% quantile) of the `"mzmin"` values of all
peaks of the feature, the upper m/z value as the upper quartile (75% quantile)
of the `"mzmax"` values, the lower rt value as the lower quartile (25% quantile)
of the `"rtmin"` and the upper rt value as the upper quartile (75% quantile) of
the `"rtmax"` values. Below we perform this gap filling on our test data and
extract the feature values for the first 6 features after gap filling. An `NA`
is reported if no signal is measured at all for a specific sample.
```{r fill-chrom-peaks, message = FALSE }
faahko <- fillChromPeaks(faahko, param = ChromPeakAreaParam())
featureValues(faahko, value = "into") |> head()
```
```{r export-result, eval = FALSE, echo = FALSE}
save(faahko, file = "faahko.RData")
```
## Final result
While we can continue using the *xcms* result set for further analysis
(e.g. also for feature grouping with the `r Biocpkg("MsFeatures")` package; see
the LC-MS feature grouping vignette for details) we could also extract all
results as a `SummarizedExperiment` object. This is the *standard* data
container for Bioconductor defined in the `r Biocpkg("SummarizedExperiment")`
package and integration with other Bioconductor packages might thus be easier
using that type of object. Below we use the `quantify()` function to extract the
*xcms* preprocessing results as such a `SummarizedExperiment`
object. Internally, the `featureValues()` function is used to generate the
feature value matrix. We can pass any parameters from that function to the
`quantify()` call. Below we use `value = "into"` and `method = "sum"` to report
the integrated peak signal as intensity and to sum these values in samples in
which more than one chromatographic peak was assigned to a feature (for that
option it is important to run `refineChromPeaks()` like described above to merge
overlapping peaks in each sample).
```{r}
library(SummarizedExperiment)
res <- quantify(faahko, value = "into", method = "sum")
res
```
The information from `featureDefinitions()` is now stored in the `rowData()` of
this object. The `rowData()` provides annotations and information for each
**row** in the `SummarizedExperiment` (which in our case are the **features**).
```{r}
rowData(res)
```
Annotations for **columns** (in our case **samples**) are stored as
`colData()`. In this data frame each row contains annotations for one sample
(and hence one column in the feature values matrix).
```{r}
colData(res)
```
Finally, the feature matrix is stored as an *assay* within the object. Note
that a `SummarizedExperiment` can have multiple assays which have to be numeric
matrices with the number of rows and columns matching the number of features and
samples, respectively. Below we list the names of the available assays.
```{r}
assayNames(res)
```
And we can access the actual data using the `assay()` function, optionally also
providing the name of the assay we want to access. Below we show the first 6
lines of that matrix.
```{r}
assay(res) |> head()
```
Since a `SummarizedExperiment` supports multiple assays, we in addition add also
the feature value matrix **without** filled-in values (i.e. feature intensities
that were added by the gap filling step).
```{r}
assays(res)$raw_nofill <- featureValues(faahko, filled = FALSE, method = "sum")
```
With that we have now two assays in our result object.
```{r}
assayNames(res)
```
And we can extract the feature values without gap-filling:
```{r}
assay(res, "raw_nofill") |> head()
```
Finally, a history of the full processing with *xcms* is available as *metadata*
in the `SummarizedExperiment`.
```{r}
metadata(res)
```
This same information can also be extracted from the *xcms* result object using
the `processHistory()` function. Below we extract the information for the first
processing step.
```{r}
processHistory(faahko)[[1]]
```
These processing steps contain also the individual parameter objects used for
the analysis, hence allowing to exactly reproduce the analysis.
```{r}
processHistory(faahko)[[1]] |> processParam()
```
At last we perform also a principal component analysis to evaluate the grouping
of the samples in this experiment. Note that we did not perform any data
normalization hence the grouping might (and will) also be influenced by
technical biases.
```{r final-pca, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 8, fig.cap = "PCA for the faahKO data set, un-normalized intensities." }
## Extract the features and log2 transform them
ft_ints <- log2(assay(res, "raw"))
## Perform the PCA omitting all features with an NA in any of the
## samples. Also, the intensities are mean centered.
pc <- prcomp(t(na.omit(ft_ints)), center = TRUE)
## Plot the PCA
pcSummary <- summary(pc)
plot(pc$x[, 1], pc$x[,2], pch = 21, main = "",
xlab = paste0("PC1: ", format(pcSummary$importance[2, 1] * 100,
digits = 3), " % variance"),
ylab = paste0("PC2: ", format(pcSummary$importance[2, 2] * 100,
digits = 3), " % variance"),
col = "darkgrey", bg = sample_colors, cex = 2)
grid()
text(pc$x[, 1], pc$x[,2], labels = res$sample_name, col = "darkgrey",
pos = 3, cex = 2)
```
We can see the expected separation between the KO and WT samples on PC2. On PC1
samples separate based on their ID, samples with an ID <= 18 from samples with
an ID > 18. This separation might be caused by a technical bias
(e.g. measurements performed on different days/weeks) or due to biological
properties of the mice analyzed (sex, age, litter mates etc).
# Further data processing and analysis
## Quality-based filtering of features
When dealing with metabolomics results, it is often necessary to filter features
based on certain criteria. These criteria are typically derived from statistical
formulas applied to full rows of data, where each row represents a feature. The
`filterFeatures()` function provides a robust solution for filtering features
based on these conventional quality assessment criteria. It supports multiple
types of filtering, allowing users to tailor the filtering process to their
specific needs, all controlled by the `filter` argument. This function and its
implementations are applicable to both `XcmsExperiment` results objects and
`SummarizedExperiment` objects.
We will demonstrate how to use the `filterFeatures()` function to perform
quality assessment and filtering on both the `faahko` and `res` variables
defined above. The `filter` argument can accommodate various types of input,
each determining the specific type of quality assessment and filtering to be
performed.
The `PercentMissingFilter` allows to filter features based on the percentage of
missing values for each feature. This function takes as an input the parameter
`f` which is supposed to be a vector of length equal to the length of the object
(i.e. number of samples) with the sample type for each. The function then
computes the percentage of missing values per sample groups and filters
features based on this. Features with a percent of missing values larger than
the threshold in all sample groups will be removed. Another option is to base
this quality assessment and filtering only on QC samples.
Both examples are shown below:
```{r}
# To set up parameter `f` to filter only based on QC samples
f <- sampleData(faahko)$sample_type
f[f != "QC"] <- NA
# To set up parameter `f` to filter per sample type excluding QC samples
f <- sampleData(faahko)$sample_type
f[f == "QC"] <- NA
missing_filter <- PercentMissingFilter(threshold = 30, f = f)
# Apply the filter to faakho object
filtered_faahko <- filterFeatures(object = faahko, filter = missing_filter)
# Apply the filter to res object
missing_filter <- PercentMissingFilter(threshold = 30, f = f)
filtered_res <- filterFeatures(object = res, filter = missing_filter)
```
Here, no feature was removed, meaning that all the features had less than 30%
of `NA` values in at least one of the sample type.
Although not directly relevant to this experiment, the `BlankFlag` filter can be
used to flag features based on the intensity relationship between blank and QC
samples. More information can be found in the documentation of the filter:
```{r}
# Retrieve documentation for the main function and the specific filter.
?filterFeatures
?BlankFlag
```
The `RsdFilter` enable users to filter features based on their relative
standard deviation (coefficient of variation) for a specified `threshold`. It
is recommended to base the computation on quality control (QC) samples,
as demonstrated below:
```{r}
# Set up parameters for RsdFilter
rsd_filter <- RsdFilter(threshold = 0.3,
qcIndex = sampleData(filtered_faahko)$sample_type == "QC")
# Apply the filter to faakho object
filtered_faahko <- filterFeatures(object = filtered_faahko, filter = rsd_filter)
# Now apply the same strategy to the res object
rsd_filter <- RsdFilter(threshold = 0.3, qcIndex = filtered_res$sample_type == "QC")
filtered_res <- filterFeatures(object = filtered_res, filter = rsd_filter, assay = "raw")
```
All features with an RSD (CV) strictly larger than 0.3 in QC samples were thus
removed from the data set.
The `DratioFilter` can be used to filter features based on the D-ratio or
*dispersion ratio*, which compares the standard deviation in QC samples to that
in study samples.
```{r}
# Set up parameters for DratioFilter
dratio_filter <- DratioFilter(
threshold = 0.5,
qcIndex = sampleData(filtered_faahko)$sample_type == "QC",
studyIndex = sampleData(filtered_faahko)$sample_type == "study")
# Apply the filter to faahko object
filtered_faakho <- filterFeatures(object = filtered_faahko,
filter = dratio_filter)
# Now same but for the res object
dratio_filter <- DratioFilter(
threshold = 0.5,
qcIndex = filtered_res$sample_type == "QC",
studyIndex = filtered_res$sample_type == "study")
filtered_res <- filterFeatures(object = filtered_res,
filter = dratio_filter)
```
All features with an D-ratio strictly larger than 0.5 were thus removed from
the data set.
## Alignment to an external reference dataset
In certain experiments, aligning two different datasets is necessary. This
can involve comparing runs of the same samples conducted across different
laboratories or runs with MS2 recorded after the initial MS1 run. Across
laboratories and over time, the same samples may result in variation in
retention time, especially because the LC system can be quite unstable. In these
cases, an alignment step using the `adjustRtime()` function with the
`LamaParam` parameter can allow the user to perform this type of alignment.
We will go through this step by step below.
Let's load an already analyzed dataset `ref` and our previous dataset before
alignment, which will be `tst`. We will first restrict their retention time
range to be the same for both dataset.
```{r}
ref <- loadXcmsData("xmse")
tst <- loadXcmsData("faahko_sub2")
```
Now, we will attempt to align these two samples with the previous dataset. The
first step is to extract landmark features (referred to as *lamas*). To achieve
this, we will identify the features present in every QC sample of the `ref`
dataset. To do so, we will categorize (using `factor()`) our data by
`sample_type` and only retain the QC samples. This variable will be utilized to
filter the features using the `PercentMissingFilter` parameter within the
`filterFeatures()` function (see section above for more information on this
method)
```{r}
f <- sampleData(ref)$sample_type
f[f != "QC"] <- NA
ref <- filterFeatures(ref, PercentMissingFilter(threshold = 0, f = f))
ref_mz_rt <- featureDefinitions(ref)[, c("mzmed","rtmed")]
head(ref_mz_rt)
nrow(ref_mz_rt)
```
This is what the `lamas` input should look like for alignment. In terms of
how this method works, the alignment algorithm matches chromatographic peaks
from the experimental data to the lamas, fitting a model based on this
match to adjust their retention times and minimize differences between
the two datasets.
Now we can define our `param` object `LamaParama` to prepare for the
alignment. Parameters such as `tolerance`, `toleranceRt`, and `ppm` relate
to the matching between chromatographic peaks and lamas. Other parameters
are related to the type of fitting generated between these data points.
More details on each parameter and the overall method can be found by
searching `?adjustRtime`. Below is an example using default parameters.
```{r}
param <- LamaParama(lamas = ref_mz_rt, method = "loess", span = 0.5,
outlierTolerance = 3, zeroWeight = 10, ppm = 20,
tolerance = 0, toleranceRt = 20, bs = "tp")
#' input into `adjustRtime()`
tst_adjusted <- adjustRtime(tst, param = param)
tst_adjusted <- applyAdjustedRtime(tst_adjusted)
```
We extract the base peak chromatogram (BPC) to visualize and evaluate the
alignment:
```{r fig.height=12, fig.width=5}
#' evaluate the results with BPC
bpc <- chromatogram(ref, chromPeaks = "none")
bpc_tst_raw <- chromatogram(tst, chromPeaks = "none")
bpc_tst_adj <- chromatogram(tst_adjusted, chromPeaks = "none")
```
We generate plots to visually compare the alignment to the reference
dataset (black) both before (red) and after (blue) adjustment:
```{r fig.height=4, fig.width=10}
#' BPC of a sample
par(mfrow = c(1, 2), mar = c(4, 2.5, 1, 0.5))
plot(bpc[1, 1], col = "#00000080", main = "Before Alignment")
points(rtime(bpc_tst_raw[1, 1]), intensity(bpc_tst_raw[1, 1]), type = "l",
col = "#ff000080")
grid()
plot(bpc[1, 1], col = "#00000080", main = "After Alignment")
points(rtime(bpc_tst_adj[1, 1]), intensity(bpc_tst_adj[1, 1]), type = "l",
col = "#0000ff80")
grid()
```
It appears that certain time intervals (2500 to 3000 and 3500 to 4500 seconds)
exhibit better alignment than others. This variance can be elucidated by
examining the distribution of matched peaks, as illustrated below. The
`matchLamaChromPeaks()` function facilitates the assessment of how well the
`lamas` correspond with the chromatographic peaks in each file. This analysis
can be conducted prior to any adjustments.
```{r}
param <- matchLamasChromPeaks(tst, param = param)
mtch <- matchedRtimes(param)
#' BPC of the first sample with matches to lamas overlay
par(mfrow = c(1, 1))
plot(bpc[1, 1], col = "#00000080", main = "Distribution CP matched to Lamas")
points(rtime(bpc_tst_adj[1, 1]), intensity(bpc_tst_adj[1, 1]), type = "l",
col = "#0000ff80")
grid()
abline(v = mtch[[1]]$obs)
```
The overlay of BPC above provides insight into the correlation between accurate
alignment and the presence of peaks matching with `lamas`. For this particular
sample no chromatographic peaks were matched to the `lamas` between 2500 and
3000 seconds and hence the alignment in that region was not good. For the second
file, chrom peaks could also be matched in that region resulting in a better
alignment.
```{r}
par(mfrow = c(1, 1))
plot(bpc[1, 2], col = "#00000080", main = "Distribution CP matched to Lamas")
points(rtime(bpc_tst_adj[1, 2]), intensity(bpc_tst_adj[1, 2]), type = "l",
col = "#0000ff80")
grid()
abline(v = mtch[[2]]$obs)
```
Furthermore, a more detailed examination of the matching and the model used for
fitting each file is possible. Numerical information can be obtained using the
`summarizeLamaMatch()` function. From this, the percentage of chromatographic
peaks utilized for alignment can be computed relative to the total number of
peaks in the file. Additionally, it is feasible to directly `plot()` the `param`
object for the file of interest, showcasing the distribution of these
chromatographic peaks along with the fitted model line.
```{r}
#' access summary of matches and model information
summary <- summarizeLamaMatch(param)
summary
#' coverage for each file
summary$Matched_peaks / summary$Total_peaks * 100
#' access the information on the model of for the first file
summary$Model_summary[[1]]
#' Plot obs vs. ref with fitting line
plot(param, index = 1L, main = "ChromPeaks versus Lamas for the first file",
colPoint = "red")
abline(0, 1, lty = 3, col = "grey")
grid()
```
# Additional details and notes
## Subsetting and filtering
*xcms* result objects can be subset/filtered by sample using the `[`
method or one of the `filter*` functions (although the `XcmsExperiment` supports
at present only few selected filter functions). In some cases filtering can
remove preprocessing results, but most filter functions support parameters
`keepFeatures` and `keepAdjustedRtime` that can be set to `TRUE` to avoid their
removal.
## Parallel processing
Most functions in *xcms* support parallel processing, which is enabled by
default and is performed, for most operations, on a per-file basis. Parallel
processing is handled and configured by the `BiocParallel` Bioconductor package
and can be globally defined for an R session. Note that, while all data objects
are designed to have a low memory footprint by loading only peak data into
memory when needed, parallel processing will increase this demand: in general,
the full data for as many files as there are parallel processes are loaded into
memory. This needs also to be considered, when the number of parallel processes
is defined.
Unix-based systems (Linux, macOS) support `multicore`-based parallel
processing. To configure it globally we `register()` the parameter class. Note
also that `bpstart()` is used below to initialize the parallel processes.
```{r multicore, message = FALSE, eval = FALSE }
register(bpstart(MulticoreParam(2)))
```
Windows supports only socket-based parallel processing:
```{r snow, message = FALSE, eval = FALSE }
register(bpstart(SnowParam(2)))
```
## Main differences to the `MSnbase`-based xcms version 3
- `Spectra` is used as the main container for the (raw) MS data. Thus changing
backends is supported at any stage of the analysis (e.g. to load all data into
memory or to load data from remote databases).
# Additional documentation resources
Some of the documentations listed here are still based on xcms version 3 but
will be subsequently updated.
- [Exploring and analyzing LC-MS data with Spectra and
xcms](https://jorainer.github.io/xcmsTutorials/): tutorial explaining general
data handling using the *Spectra* package and LC-MS data preprocessing with
*xcms*.
- [MetaboAnnotationTutorials](https://jorainer.github.io/MetaboAnnotationTutorials):
examples for annotation of metabolomics data from [@rainer_modular_2022].
- [@gattoMSnbaseEfficientElegant2020a]: describes the concept of the *on-disk*
data modes used also in *xcms*.
- [SpectraTutorials](): tutorials describing the `r Biocpkg("Spectra")` package
and its functionality.
# Session information
R packages used for this document are listed below.
```{r si}
sessionInfo()
```
# References