MSPrep 1.16.0
MSPrep
provides a convenient set of functionalities used in the pre-analytic
processing pipeline for mass spectrometry based metabolomics data. Functions are
included for the following processes commonly performed prior to analysis of
such data:
The sections which follow provide an explanation of each function contained in
MSPrep
, those functions’ respective options, and examples of the pre-analysis
pipeline using two data sets provided in the MSPrep
package, MSQuant
and
COPD_131
.
For further information, please see MSPrep—Summarization, normalization and diagnostics for processing of mass spectrometry–based metabolomic data (Hughes et al., 2014) and Pre-analytic Considerations for Mass Spectrometry-Bas ed Untargeted Metabolomics Data (Reinhold et al., 2019).
Data my be input as a Data Frame or SummarizedExperiment
.
When using the functions provided by MSPrep
on a data frame, the following
format is expected throughout the pipeline.
Most often, two or more columns of the data frame will identify unique
compounds. This may include columns which specify the mass-to-charge ratio, the
retention time, or the name of each compound. Using the parameter compVars
,
the names of these columns should be provided to each function as a vector of
character strings.
The remainder of the columns in the data frame should specify the respective
abundances of each compound for each sample. It is expected that one or more
identifying variables for each sample will be specified by the column name (e.g.
Sample ID, batch number, or replicate). Each piece of information contained in
the column names must be separated by a consistent character not present
anywhere else in the column name. Using the parameter sampleVars
, the sample
variables present in the column names should be provided to each function as a
vector of character strings specifying the order the variables appear, and the
parameter separator
should identify the character which separates each sample
variable. Each column name may also include consistent non-identifying text at
the beginning of each column name. This text should be provided to each function
using the colExtraText
parameter.
As an example see the provided data set msquant
and its use in the pipeline
below.
When using the functions provided by MSPrep
on a SummarizedExperiment
, it is
expected that the data will include a single assay
of abundances, rowData
identifying characteristics of each metabolite, and colData
specifying
characteristics of each sample. The parameters discussed in the previous section
may be ignored.
## [1] "Neutral_Operator_Dif_Pos_1x_O1_A_01"
Above is the third column name in msquant
. The first part
“Neutral_Operator_Dif_Pos_” will not be used in this analysis, so we will assign
it to colExtraText
parameter. The next value, “1x”, is the spike-in value.
The following value, “O1”, specifies the sample’s batch. The remaining values,
“A” and “01”, specify the replicate and subject IDs. We will pass these
sample variables to each function with the sampleVars
parameter. Finally, note
that msquant
contains two columns, mz
and rt
which identify each
compounds’ mass-to-charge ratio and retention time, respectively. We will pass
these column names to each function using the compVars
parameter.
With our data in this format, we can start the pipeline.
This step summarizes the technical replicates using the following procedure for each compound in each batch.
If the name of variable specifying replicate in sampleVars
for a data frame
or the column data of a SummarizedExperiment
, specify the name of the variable
using the replicate
parameter.
The technical replicates in MSQuant
are summarized below using a CV maximum of
.50 and a minimum proportion present of 1 out of 3 replicates. Note that in the
MSQuant
dataset, missing values are represented as ‘1’, which is specified in
the missingValue
argument below. msSummarize()
will replace these missing
values with ‘0’ in all instances where the summarization algorithm determines
the values to be truly missing.
summarizedDF <- msSummarize(msquant,
cvMax = 0.50,
minPropPresent = 1/3,
compVars = c("mz", "rt"),
sampleVars = c("spike", "batch", "replicate",
"subject_id"),
colExtraText = "Neutral_Operator_Dif_Pos_",
separator = "_",
missingValue = 1)
## # A tibble: 10 × 6
## mz rt `1x_O1_01` `1x_O1_02` `1x_O2_01` `1x_O2_02`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 78.0 21.362432 120937. 121018. 118425. 118527.
## 2 80.0 9.772964 63334. 63415. 69530 69631.
## 3 80.1 0.6517916 78601 78668. 154636 154737.
## 4 83.1 1.3226668 58473. 58554. 298703. 298804.
## 5 84.1 7.864 0 0 0 0
## 6 85.1 22.307388 348686 348753. 342413 342420.
## 7 85.1 0.7104762 0 0 0 0
## 8 85.1 1.3228333 335092. 335172. 1753681 1753782.
## 9 86.0 22.587963 226792. 226872. 240137 240238.
## 10 87.0 1.702 0 0 674771. 674872.
Following the summarization of technical replicates, the data can be filtered to
only contain compounds present in a specified proportion of samples. To do so,
the msFilter()
function is provided. By default, msFilter()
uses the 80%
rule and filters the compounds in the data set leaving only those which are
present in 80% of the samples.
filteredDF <- msFilter(summarizedDF,
filterPercent = 0.8,
compVars = c("mz", "rt"),
sampleVars = c("spike", "batch", "subject_id"),
separator = "_")
## # A tibble: 10 × 6
## mz rt `1x_O1_01` `1x_O1_02` `1x_O2_01` `1x_O2_02`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 78.0 21.362432 120937. 121018. 118425. 118527.
## 2 80.0 9.772964 63334. 63415. 69530 69631.
## 3 80.1 0.6517916 78601 78668. 154636 154737.
## 4 86.0 22.587963 226792. 226872. 240137 240238.
## 5 90.0 3.0758798 0 0 148358. 148460.
## 6 99.1 22.379221 6216101 6216182. 6137392 6137493.
## 7 99.9 21.431955 117396 117477. 110735 110836.
## 8 102. 3.076125 92308 92389. 0 0
## 9 104. 9.770778 78100 78181. 61308 61409.
## 10 107. 22.5569 134960 135039. 83090. 83281.
Next, depending on the downstream analysis, you may need to impute missing data. Three imputation methods are provided:
Half-min imputes each missing value as one half of the minimum observed value
for that compound. Half-min imputation performs faster than other methods,
but may introduce bias. The BPCA algorithm, provided by the pcaMethods
package, estimates the missing value by a linear combination
of principle axis vectors, with the number of principle components specified by
the user with the nPcs
argument. KNN uses a K-Nearest Neighbors algorithm
provided by the VIM
package. Users may provide
their preferred value of k using the kKnn
argument. By default, KNN
uses samples as neighbors, but by specifying compoundsAsNeighbors = TRUE
,
compounds will be used as neighbors instead. Note that this is significantly
slower than using samples as neighbors and may take several minutes or more to
run depending on the size of your data set.
imputedDF <- msImpute(filteredDF,
imputeMethod = "knn",
compVars = c("mz", "rt"),
sampleVars = c("spike", "batch", "subject_id"),
separator = "_",
returnToSE = FALSE,
missingValue = 0)
## # A tibble: 10 × 6
## mz rt `1x_O1_01` `1x_O1_02` `1x_O2_01` `1x_O2_02`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 78.0 21.362432 120937. 121018. 118425. 118527.
## 2 80.0 9.772964 63334. 63415. 69530 69631.
## 3 80.1 0.6517916 78601 78668. 154636 154737.
## 4 86.0 22.587963 226792. 226872. 240137 240238.
## 5 90.0 3.0758798 121246 116549. 148358. 148460.
## 6 99.1 22.379221 6216101 6216182. 6137392 6137493.
## 7 99.9 21.431955 117396 117477. 110735 110836.
## 8 102. 3.076125 92308 92389. 1215434. 908742.
## 9 104. 9.770778 78100 78181. 61308 61409.
## 10 107. 22.5569 134960 135039. 83090. 83281.
In order to make comparisons between samples, the data may need to be transformed and normalized This step transforms the data and performs one of eight normalization strategies:
msNormalize()
also provides options to transform the data using a log base 10
(default), log base 2, or natural log transformation. To select either option,
or to forego transformation, use the transform
argument to specify "log10"
,
"log2"
, "ln"
, or "none"
respectively.
Quantile normalization, provided by the preprocessCore
package, ensures that
the provided samples have the same quantiles. Median normalization subtracts
the median abundance of each sample from every compound in that sample, thereby
aligning the median abundance of each sample at 0.
ComBat, provided by the sva
package, is an empirical Bayes batch effect
correction algorithm to remove unwanted batch effects and may be used separately
or in conjunction with quantile or median normalization. When using ComBat, a
sampleVar
called “batch” must be present for data frames, or for
SummarizedExperiment
“batch” must be present in the columns names of
colData
. Or, if the sample variable corresponding to batch differs from
“batch”, you may specify the batch variables name using the batch
parameter.
RUV and SVA normalization each estimate a matrix of unobserved factors of
importance using different methods of supervised factor analysis. For both
methods, known covariates (e.g. sex, age) should be provided using the
covariatesOfInterest
parameter, and must correspond to the sample variables
specified by sampleVars
in the case of a data frame or colData
in the case
of a SummarizedExperiment
. For RUV normalization, the kRUV
argument
specifies the number of factors on which the data is normalized.
Cross-Contribution Compensating Multiple Standard Normalization (CRMN), provided
by the crmn
package, normalizes based on internal standards. The sample
variable identifying internal standards must be provided using the
covariatesOfInterest
parameter. For experiments which have control compounds,
a vector of the row numbers containing them should be provided in the
controls variable. If a vector of control compounds is not provided, data driven
controls will be generated.
Below, we apply a log base 10 transformation, quantile normalization, and ComBat batch correction.
normalizedDF <- msNormalize(imputedDF,
normalizeMethod = "quantile + ComBat",
transform = "log10",
compVars = c("mz", "rt"),
sampleVars = c("spike", "batch", "subject_id"),
covariatesOfInterest = c("spike"),
separator = "_")
## # A tibble: 10 × 6
## mz rt `1x_O1_01` `1x_O1_02` `1x_O2_01` `1x_O2_02`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 78.0 21.362432 5.07 5.08 5.03 5.04
## 2 80.0 9.772964 4.80 4.80 4.82 4.83
## 3 80.1 0.6517916 4.97 4.97 5.08 5.09
## 4 86.0 22.587963 5.39 5.39 5.29 5.31
## 5 90.0 3.0758798 5.12 5.10 5.11 5.12
## 6 99.1 22.379221 6.82 6.82 6.82 6.82
## 7 99.9 21.431955 5.06 5.06 5.01 5.03
## 8 102. 3.076125 5.05 5.05 5.68 5.63
## 9 104. 9.770778 4.90 4.90 4.82 4.83
## 10 107. 22.5569 5.09 5.09 4.95 4.96
Often, all the above steps will need to be conducted. This can be done in a
single statement using the msPrepare()
function. Simply provide the
function the same arguments that you would provide to the individual functions.
preparedDF <- msPrepare(msquant,
minPropPresent = 1/3,
missingValue = 1,
filterPercent = 0.8,
imputeMethod = "knn",
normalizeMethod = "quantile + ComBat",
transform = "log10",
covariatesOfInterest = c("spike"),
compVars = c("mz", "rt"),
sampleVars = c("spike", "batch", "replicate",
"subject_id"),
colExtraText = "Neutral_Operator_Dif_Pos_",
separator = "_")
## # A tibble: 10 × 6
## mz rt `1x_O1_01` `1x_O1_02` `1x_O2_01` `1x_O2_02`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 78.0 21.362432 5.07 5.08 5.03 5.04
## 2 80.0 9.772964 4.80 4.80 4.82 4.83
## 3 80.1 0.6517916 4.97 4.97 5.08 5.09
## 4 86.0 22.587963 5.39 5.39 5.29 5.31
## 5 90.0 3.0758798 5.12 5.10 5.11 5.12
## 6 99.1 22.379221 6.82 6.82 6.82 6.82
## 7 99.9 21.431955 5.06 5.06 5.01 5.03
## 8 102. 3.076125 5.05 5.05 5.68 5.63
## 9 104. 9.770778 4.90 4.90 4.82 4.83
## 10 107. 22.5569 5.09 5.09 4.95 4.96
Next, the functionality of MSPrep
will be demonstrated using the included data
COPD_131
. The raw data set can be found here, at Metabolomics Workbench.
Note that only a portion of the compounds in the original COPD_131
data set
are included in this package in order to limit file size and example run time.
Generally, the number of compounds in a data set will greatly
exceed the number of samples, and the functions included in this package will
take more time to process the data.
This data set differs from msquant
in several ways. First, it has a column
Compound.Name
which specifies compound names, and the mass-to-charge ratio and
retention-time columns are named Mass
and Retention.Time
respectively.
Second, this data set does not have spike-ins or batches (but it does have
technical replicates). Finally, the data has already been transformed, so that
step of the pipeline will be excluded.
Next, the technical replicates in COPD_131
need to be summarized. This process
is largely the same as before, but with different column names passed to
compVars
, so msSummarize()
is called as follows:
Again, this process is largely the same as before, choosing a filter percentage
of 0.8. So, we call msFilter()
as follows:
For this example, msImpute()
will be called using Bayesian PCA using three
principle components to impute the missing values for the data.
For this example, msNormalize()
will be called using median normalization with
no transformation applied.
As with the previous example, the above steps can be performed in a pipeline
using the msPrepare()
function. To skip the transformation step of the
pipeline, set the transform
parameter to “none” (note that the same can
be done for imputationMethod
and normalizationMethod
).
preparedSE <- msPrepare(COPD_131,
cvMax = 0.5,
minPropPresent = 1/3,
compVars = c("Mass", "Retention.Time",
"Compound.Name"),
sampleVars = c("subject_id", "replicate"),
colExtraText = "X",
separator = "_",
filterPercent = 0.8,
imputeMethod = "bpca",
normalizeMethod = "median",
transform = "none",
nPcs = 3,
missingValue = 0,
returnToSE = TRUE)
## Summarizing
## Filtering
## Imputing
## Normalizing
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MSPrep_1.16.0 BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 rlang_1.1.4
## [3] magrittr_2.0.3 matrixStats_1.4.1
## [5] e1071_1.7-16 compiler_4.4.1
## [7] RSQLite_2.3.7 mgcv_1.9-1
## [9] png_0.1-8 vctrs_0.6.5
## [11] sva_3.54.0 stringr_1.5.1
## [13] pkgconfig_2.0.3 crayon_1.5.3
## [15] fastmap_1.2.0 XVector_0.46.0
## [17] utf8_1.2.4 rmarkdown_2.28
## [19] UCSC.utils_1.2.0 preprocessCore_1.68.0
## [21] itertools_0.1-3 purrr_1.0.2
## [23] bit_4.5.0 xfun_0.48
## [25] randomForest_4.7-1.2 zlibbioc_1.52.0
## [27] cachem_1.1.0 GenomeInfoDb_1.42.0
## [29] jsonlite_1.8.9 blob_1.2.4
## [31] DelayedArray_0.32.0 BiocParallel_1.40.0
## [33] parallel_4.4.1 R6_2.5.1
## [35] bslib_0.8.0 stringi_1.8.4
## [37] vcd_1.4-13 limma_3.62.0
## [39] ranger_0.16.0 genefilter_1.88.0
## [41] car_3.1-3 boot_1.3-31
## [43] lmtest_0.9-40 GenomicRanges_1.58.0
## [45] jquerylib_0.1.4 Rcpp_1.0.13
## [47] bookdown_0.41 SummarizedExperiment_1.36.0
## [49] iterators_1.0.14 knitr_1.48
## [51] zoo_1.8-12 IRanges_2.40.0
## [53] splines_4.4.1 Matrix_1.7-1
## [55] nnet_7.3-19 tidyselect_1.2.1
## [57] abind_1.4-8 yaml_2.3.10
## [59] codetools_0.2-20 doRNG_1.8.6
## [61] lattice_0.22-6 tibble_3.2.1
## [63] withr_3.0.2 KEGGREST_1.46.0
## [65] Biobase_2.66.0 evaluate_1.0.1
## [67] survival_3.7-0 proxy_0.4-27
## [69] Biostrings_2.74.0 pillar_1.9.0
## [71] BiocManager_1.30.25 MatrixGenerics_1.18.0
## [73] carData_3.0-5 rngtools_1.5.2
## [75] VIM_6.2.2 foreach_1.5.2
## [77] stats4_4.4.1 generics_0.1.3
## [79] sp_2.1-4 S4Vectors_0.44.0
## [81] laeken_0.5.3 xtable_1.8-4
## [83] class_7.3-22 glue_1.8.0
## [85] tools_4.4.1 robustbase_0.99-4-1
## [87] data.table_1.16.2 locfit_1.5-9.10
## [89] annotate_1.84.0 XML_3.99-0.17
## [91] grid_4.4.1 tidyr_1.3.1
## [93] missForest_1.5 edgeR_4.4.0
## [95] AnnotationDbi_1.68.0 colorspace_2.1-1
## [97] nlme_3.1-166 GenomeInfoDbData_1.2.13
## [99] crmn_0.0.21 Formula_1.2-5
## [101] cli_3.6.3 fansi_1.0.6
## [103] S4Arrays_1.6.0 dplyr_1.1.4
## [105] pcaMethods_1.98.0 DEoptimR_1.1-3
## [107] sass_0.4.9 digest_0.6.37
## [109] BiocGenerics_0.52.0 SparseArray_1.6.0
## [111] memoise_2.0.1 htmltools_0.5.8.1
## [113] lifecycle_1.0.4 httr_1.4.7
## [115] statmod_1.5.0 bit64_4.5.2
## [117] MASS_7.3-61