\documentclass[a4paper]{article} %\VignetteIndexEntry{runLC} \usepackage{geometry} \usepackage{layout} \geometry{ includeheadfoot, margin=2.54cm } \newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}} \newcommand{\proglang}[1]{{\sffamily #1}} \newcommand{\code}[1]{{\ttfamily #1}} \newcommand{\R}{\proglang{R}} \newcommand{\bC}{\mbox{\boldmath{$C$}}} \newcommand{\bE}{\mbox{\boldmath{$E$}}} \newcommand{\bS}{\mbox{\boldmath{$S$}}} \newcommand{\bX}{\mbox{\boldmath{$X$}}} \newcommand{\compresslist}{% \setlength{\itemsep}{1pt}% \setlength{\parskip}{0pt}% \setlength{\parsep}{0pt}% } \renewcommand{\textfraction}{0} \title{Annotation of LC-MS metabolomics datasets by the ``metaMS'' package} \author{Pietro Franceschi} \begin{document} %%%\SweaveOpts{concordance=TRUE} \maketitle \section{Introduction} \pkg{metaMS} is designed to perform the analysis of LC-MS and GC-MSbased metabolomics assays. For LC-MS data, its major function \code{runLC()} is a wrapper around the functions and classes of {\bf xcms} and {\bf CAMERA} and it is designed to process a series of data files producing a peak table with the intensity of each feature - an mz,rt couple - in each one of the samples. The functions and classes of \pkg{xcms} and \pkg{CAMERA} are used to perform the peak picking, grouping and retention correction, peak filling, and CAMERA annotation. The parameters required at each step are collected into a settings object which is passed as an argument to \code{runLC}. Three examples, fine-tuned for instruments in our own laboratory, are included: <<>>= library(metaMS) data(FEMsettings) Synapt.RP @ Consult the manual pages for more details. Package \pkg{metaMS} implements a strategy to generate an annotation database by processing the injections of a list chemical standards, which have been analyzed under the same chromatographic conditions as the samples. The injections of the standards are processed by \pkg{xcms} and \pkg{CAMERA} and the peak lists are matched against a manually validated reference table to identify the set of features associated to each compound. These features are then organized in the database used for the annotation which is included in the output of \code{runLC()}. In the following, database generation and feature annotation are described in details, while for the discussion of \pkg{xcms} and \pkg{CAMERA} the reader should refer to their specific documentation. \section{Example Data} A part of the data used to illustrate database creation and feature annotation is included in \code{metaMS}, while the more heavy experimental data have been included in the \pkg{metaMSdata} package. For LC-MS \pkg{metaMSdata} contains four CDF files referring to the analysis of a mixture of chemical standards. These files have been converted to the open source CDF format by using proprietary vendor software (Waters, Databridge). The system specific position of the example files can be determined as follows: <<>>= library(metaMSdata) cdfpath <- system.file("extdata", package = "metaMSdata") files <- list.files(cdfpath, "_RP_", full.names=TRUE) files @ The construction of the database is performed on the basis of a manually validated reference table which contains the key analytical information for each standard. The example table is part of \pkg{metaMS}: <<>>= library(metaMS) data(exptable) @ The content of the table can be visualized as follows: <<>>= head(exptable) @ \begin{itemize} \item \code{ChemSpiderID}: an unique numeric identifier for a chemical standard from the freeware ChemSpider database (http://www.chemspider.com). \item \code{compound}: a string with the human readable name of the standard. This name is used to produce the output of the annotation. \item \code{formula}: the chemical formula of the compound. This field is included in view of future developments. \item \code{M.ref}: the theoretical mass for the observed ion. This field is included in view of future developments. \item \code{mz.observed}: the manually validated {\it m/z} value which identifies the ``main" ion for this specific compound. In the majority of cases, one would choose the most most intense ion of each compound. To avoid wrong assignments the best practice should be to ask to the analyst to ``identify" the nature of the ion (protonated/deprotonated, adduct, dimer, common fragment, ...). \item \code{RTman}: the manually validated retention time for the standard (in minutes). \item \code{stdFile}: the complete path which points to the raw file of the injection of the standard. \end{itemize} \section{Database Construction} In order to create the database, the \code{stdFile} column in \code{exptable} has to be updated with the complete path pointing to the correct CDF file: <<>>= exptable$stdFile <- sapply(exptable$stdFile, grep, files, value = TRUE) exptable$stdFile @ The annotation database is constructed on the basis of \code{exptable} with the following workflow: \begin{enumerate} \item {\sc Peak picking}. The injections of the standards are processed to produce a feature list by using the parameters specified in the \code{PeakPicking} slot of the \code{settings} list. For each feature, the maximum value of the signal on the chromatographic peak (\code{maxo}) is extracted. For a detailed description of the single settings refer to the package documentation. <<>>= metaSetting(Synapt.RP, "PeakPicking") @ \item {\sc Feature grouping}. \code{CAMERA} with its default settings is used to group the features in pseudospectra and annotate them looking for isotopes and common adducts. These pseudospectra can be therefore considered as mass spectrometric fingerprints of compounds eluting at a specific retention time. It is important to remember that co-eluting compounds are likely to be grouped together, in particular where chromatographic separation is not optimal (e.g. at the extreme ends of a chromatographic run). \item {\sc Ref. Table Matching}. The full list of feature is matched with \code{exptable} looking for features compatible with \code{M.ref} and \code{RTman}. The {\it m/z} and retention time tolerances are fixed and specified in the \code{DBconstruction} element of the \code{settings} list. <<>>= metaSetting(Synapt.RP, "DBconstruction") @ The retention time tolerance is specified in minutes, the {\it m/z} one in dalton. The \code{minfeat} parameter is used to prevent the inclusion in the database of standards with a very low number of associated features. This unfortunate situation can happen, for example, when the signal is too low either because a chemical is not efficiently ionized or because it has been injected with low concentration. The absence of a good matching for a specific compound is notified on the output on the screen. \item {\sc Database Creation} The features assigned to each standard are collected into a dataframe which can be used for annotation. At this stage an additional filter on the feature intensities is applied: only the ones with an intensity bigger than \code{Ithr} are kept. This is done to avoid inserting in the DB low intensity features coming from the noise. \end{enumerate} Consider as an example the construction of the database included in the \code{metaMS} package. It contains four chemical standards: <<>>= exptable$compound @ The reference table has been already described. The data can be loaded as follows: <<>>= library(metaMSdata) cdfpath <- system.file("extdata", package = "metaMSdata") files <- list.files(cdfpath, "_RP_", full.names=TRUE) exptable$stdFile <- sapply(exptable$stdFile, function(x) files[grep(x,files)]) @ For this example the \code{minfeat} parameter is set to 2: <<>>= metaSetting(Synapt.RP, "DBconstruction.minfeat") <- 2 @ The database is constructed by using the\code{createSTDdbLC} function: <>= LCDBtest <- createSTDdbLC(stdInfo=exptable, settings = Synapt.RP, polarity = "positive", Ithr = 20) @ The messages on the screen can be used to monitor the progress of the analysis. In practice, the \code{createSTDdbLC} function is all that users need to use. The example DB is also available as data object \code{LCDBtest} and can be loaded with <<>>= data(LCDBtest) @ The example database is a list of three elements: <<>>= names(LCDBtest) @ The first contains the reference table: <<>>= head(LCDBtest$Reftable) @ The second contains the settings and the date of creation of the DB: <<>>= names(LCDBtest$Info) @ The third is the true database: <<>>= head(LCDBtest$DB) @ Each line of this \code{data.frame} is a feature detected at \code{mz} and \code{rt} with an intensity \code{maxo}. The output of \pkg{CAMERA} annotation are presented in the \code{adduct} and \code{isotopes} fields. \code{ChemSpiderID} and \code{compound} identify the compound which a feature is associated to. The \code{validate} column is set to \code{automatic} to indicate that the feature has been assigned to the neutral by using an automatic algorithm, without performing any manual validation of the results. It is interesting to see how many features are included in the DB and their association to the four chemical standards included in the reference table: <<>>= table(LCDBtest$DB$compound) @ \section{Annotation} In the previous section, we have illustrated how to create a database from a series of injections of standards. This DB can then be used to annotate the results of the analysis of a complete metabolomic experiment by passing it to the main \code{runLC} function. It is important to remember that this type of annotation relies very much on the retention time, so it gives its best results when the standards and the samples have been analyzed under the same chromatographic and mass-spectrometric conditions. Considering that in LC-MS experiments co-elution of different compounds is the rule rather than the exception, the annotation is performed feature-wise (each feature is independently matched with the DB) and a subsequent validation step is performed. The idea is to retain annotations only if more than one feature associated to a specific compound is found in the peak list. How many ``validation" features are requested is an adjustable parameter included in the (\code{minfeat} slot of the settings object). For matching and validation it is necessary to specify mass and {\it m/z} tolerances, accounting for mass and retention time shifts. For retention time the tolerance is fixed and it is specified in the settings. For {\it m/z}, the package implements either a {\it fixed tolerance} or an {\it adaptive tolerance}. The use of an {\it adaptive tolerance} to optimize the results of the annotation for Q-TOF spectrometers has been proposed by the authors in \cite{Shahaf2013}. With this approach, the optimal {\it m/z} tolerance is calculated taking into account the {\it m/z} value for a specific ion and its intensity: these two parameters are indeed affecting the accuracy of this specific class of analyzers. A more detailed description of the approach can be found in the manual pages of the package and in the specific reference. The implemented annotation strategy can be broken down in the following steps: \begin{enumerate} \item {\sc Feature wise Annotation} Each feature detected by \code{runLC} is matched against the database. If the mass error function is provided, the appropriate {\it m/z} tolerance is calculated, otherwise a fixed tolerance is used (\code{mzdiff}). The retention time tolerance is fixed and should be selected on the bases of the characteristics of each chromatographic method (\code{rtdiff}). Multiple annotations - i.e. features which are associated to more than one compound - are possible. This outcome does not indicate a problem {\it per se}, but is an inherent drawback of co-elution. \item {\sc Annotation Validation} The annotated features are organized in ``pseudospectra" collecting all the experimental features which are assigned to a specific compound. A specific annotation is confirmed only if more than \code{minfeat} features which differ in retention time less than \code{rtval} are present in a pseudospectrum. As a general rule \code{rtval} should be narrower than \code{rtdiff}. The latter, indeed, accounts for shifts in retention time between the injection of the standards and the metabolomics experiment under investigation. This time can be rather long, considering that the standards are not commonly re-analyzed each time. On the other hand,\code{rtval} represents the shift between the ions of the same compound within the same batch of injections and therefore it has only to account for the smaller shifts occurring during peak picking and alignment. \end{enumerate} To illustrate the procedure consider the results of the annotation of the example data included in \code{metaMSdata} with the \code{LCDBtest} db. Here we set a fixed mass tolerance and we use the settings for a Reverse Phase chromatography. <>= LC <- runLC(files, settings = Synapt.RP, DB = LCDBtest$DB) @ The progress of the analysis can be followed from the messages on the screen. As before, the results from this vignette are available in data object \code{LCresults} -- this is used here to demonstrate the structure of the output without having to create it on the fly, which simply takes too much time. <<>>= data(LCresults) @ A summary of the results of the annotation can be found in the \code{Annotation} element of \code{LCresults}: <<>>= head(LCresults$Annotation$annotation.table) @ This dataframe contains the complete results of the annotation. \begin{itemize} \item \code{feature}: the index of the annotated feature in the peak table. \item \code{ChemSpiderID}: the Chem Spider ID of the neutral the feature is associated to. \item \code{dbposition}: the position inside the DB of the matching entry. \item \code{mz, rt, I}: mass, retention time and intensity of the feature. \item \code{compound}: the (human) readable name of the standard. \item \code{db\_mz, db\_rt, db\_I, db\_ann}: information relative to the corresponding DB entry. \item \code{mz.err}: the {\it m/z} error used in the matching. \item \code{clid}: the results of a hierarchical clustering of the retention times of the annotated features. This can be used to identify the presence of sub groupings of the features which are assigned to the same standard, thus suggesting the presence of co-eluting isomers/compounds. The HC tree is cut at a hight of \code{rtval}. \end{itemize} <<>>= LCresults$Annotation$compounds LCresults$Annotation$ChemSpiderIDs @ These are the list of standards found in the peaklist. <<>>= LCresults$Annotation$multiple.annotations @ This is the list of features which show multiple annotations <<>>= LCresults$Annotation$ann.features @ The list of the features with annotation. Inside the results,the outputs of the annotation are also included in a more compact form as part of the peak table in the \code{ChemSpiderID} and \code{compound} columns: <<>>= head(LCresults$PeakTable) @ \clearpage \bibliographystyle{unsrt} \bibliography{LC} \end{document}