% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{LC/MS Preprocessing and Analysis with xcms} %\VignetteDepends{faahKO} %\VignetteKeywords{preprocess, analysis, alignment} %\VignettePackage{xcms} \documentclass[12pt]{article} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \begin{document} \title{LC/MS Preprocessing and Analysis with xcms} \author{Colin A. Smith} \maketitle \section*{Introduction} This document describes how to use \Rpackage{xcms} to preprocess LC/MS data for relative quantitation and statistical analysis. It gives examples of how visualization can be used throughout the process and to display final results. An overview of the preprocessing/analysis methodology, along with the function names associated with each step, is shown in Figure~\ref{flowchart}. \begin{figure} \begin{center} \includegraphics{FlowChart} \end{center} \caption{\label{flowchart} Flow chart showing a high-level overview of the preprocessing/analysis methodology employed by \Rpackage{xcms}. Function/method names corresponding to each step are also given.} \end{figure} <>= library(multtest) library(xcms) library(faahKO) @ \section{Raw Data File Preparation} The \Rpackage{xcms} package reads full-scan LC/MS data from AIA/ANDI format NetCDF, mzXML, and mzData files. All data to be analyzed by \Rpackage{xcms} must first be converted to one of those file formats. Software packages for many instruments are be able to export to NetCDF. For information about how to export to NetCDF, please consult the documentation that came with your instrument's software. The online help of most packages frequently use the terms ``CDF'' or ``AIA'' when referring to NetCDF format. In addition to NetCDF, mzXML exporters for a number of instruments are also available.\footnote{\url{http://sashimi.sourceforge.net/software_glossolalia.html}} After exporting all files to NetCDF/mzXML/mzData format, they should be put in a location that will remain the same throughout the analysis. That is because \Rpackage{xcms} records the location of the raw data files and refers back to them a number of times during preprocessing and analysis. In most cases, LC/MS files that were acquired under different conditions should not be compared. For instance, positive and negative ionization mode files will have no ions in common and should thus be preprocessed separately. Similarly, data files acquired with different elution gradients should not be processed together. Another important consideration is the directory structure in which the files are organized. \Rpackage{xcms} uses sample class information during preprocessing to help decide which groups of peaks are significant. If organized into subdirectories, samples will automatically be assigned to separate classes based on their location. Samples may be separated into class based on tissue type, mutation, gender, disease, or time. For example, if you are analyzing the longitudinal effect of a drug in two patient groups, you may wish to put the groups into two directories ``GroupA'' and ``GroupB''. Within each of those directories, you could further separate the samples by the time they were taken, such as ``Day1'', ``Day2'', etc. In \Rpackage{xcms}, they will be automatically assigned class names ``GroupA/Day1'', ``GroupA/Day2'', etc. For the purposes of demonstration, we will use a subset of the data from \cite{Saghatelian04} examining the metabolic consequences of knocking out the fatty acid amide hydrolase (FAAH) gene in mice. The raw data files are contained in the \texttt{cdf} directory of the \Rpackage{faahKO} data package. There are samples from the spinal cords of 6 knockout mice and 6 wild-type mice placed in two subdirectories. Each file contains centroided data acquired in positive ion mode from 200-600 m/z and 2500-4500 seconds. To access the NetCDF files, we first locate the \textit{cdf} directory in the \Rpackage{faahKO} package. <>= cdfpath <- system.file("cdf", package = "faahKO") list.files(cdfpath, recursive = TRUE) @ \section{Filtration and Peak Identification} The class of objects used for preprocessing analyte data from multiple LC/MS files is \Rclass{xcmsSet}. It stores peak lists and provides methods for grouping and aligning those peaks. To create an \Rclass{xcmsSet} object from a set of NetCDF files, use the \Rfunction{xcmsSet()} constructor function. It handles batch peak picking and generation of the \Rclass{xcmsSet} object. There are a number of ways you can specify the files it should read. By default, it will recursively search through the current directory for NetCDF/mzXML/mzData files. Alternatively, you can manually specify the files you are interested in, as shown below. During peak identification, \Rpackage{xcms} uses a sepearate line for each sample to report on the status of processing. It outputs out pairs of numbers separated by a colon. The first number is the m/z it is currently processing. The second number is the number of peaks that have been identified so far. It is imporant to note that the number may be significantly larger than the final number of peaks as a vicinity elimination posprocessing step removes duplicate peaks corresponding to the same ion. <>= library(xcms) cdffiles <- list.files(cdfpath, recursive = TRUE, full.names = TRUE) xset <- xcmsSet(cdffiles) #xset <- faahko xset @ The default arguments for \Rfunction{xcmsSet} should work acceptably in most cases. However, there are a number of parameters that may need to be optimized for a particular instrument or group of samples. The full set of parameters can be seen by viewing the documentation for the \Rfunction{xcmsSet} function and \Rmethod{findPeaks} method. The method \Rmethod{findPeaks} can make use of different algorithms for peak detection. Currently \Rmethod{findPeaks.matchedFilter} and \Rmethod{findPeaks.centWave} are available, \Rmethod{findPeaks.matchedFilter} is used by default. First, several of the most important parameters of \Rmethod{findPeaks.matchedFilter} will be discussed. \\ \textbf{findPeaks.matchedFilter} \\ One parameter to consider is the Gaussian model peak width used for matched filtration, an integral part of the peak detection algorithm. For a discussion of how model peak width affects the signal to noise ratio, see \cite{Danielsson02}. It can be specified as either the standard deviation (\Rfunarg{sigma}) or full width at half maximum (\Rfunarg{fwhm}). By default, a FWHM of 30 seconds is used. Depending on the type of chromatography, the correct model peak width can be quite different. One means of determining the peak width is to fit the Gaussian function to one or more peaks in representative samples produced with your experimental protocol. Functionality for doing so is provided in the \Rmethod{plotChrom} method with the \Rfunarg{fitgauss} argument set to TRUE. Several parameters depend on the resolution your mass spectrometer. Prior to matched filtration, the peak detection algorithm creates extracted ion base peak chromatograms (EIBPC) on a fixed step size defined by the \Rfunarg{step} argument (default 0.1 m/z). To take into account uncertainties in scan to scan mass accuracy, the peak identification algorithm combines a given number of EIBPCs prior to filtration and peak detection, as defined by the \Rfunarg{steps} argument. The default value, 2, combines EIBPCs 1-2, 2-3, 3-4, etc. If the peak width is significantly greater than the step size, you may wish to turn off combination using a value of 1. If your scan to scan accuracy is worse, you may wish to increase the number of scans combined. For example, a value of 3 would combine EIBPCs 1-3, 2-4, 3-5, etc. Another factor to consider is the algorithm by which EIBPCs are produced. One way of thinking about that process is as a transformation of the data from being separate lists of mass/intensity pairs (one list for each scan) to a matrix with rows representing equally spaced masses and a column for each scan. Data transformed into such a matrix is usually referred to as being in profile mode. To do so, each scan of unequally spaced masses must be mapped onto a column of the final matrix. The algorithm used to do so is selected using the \Rfunarg{profmethod} argument and can be either ``bin'', ``binlin'', ``binlinbase'', or ``intlin''. The simplest algorithm, ``bin'', simply bins the intensity into the matrix cell closest to it in mass. If more than one intensity value is assigned to the same cell, then the greatest intensity is used. All other matrix cells are left at zero. It is the default and is especially useful for centroided data. ``binlin'' does the same thing except that it uses linear interpolation to fill in cells that otherwise would have been left at zero. It works well for sparsely populated continuum data. Some mass spectrometry software allows the user to set an intensity threshold below which no mass/intensity values are recorded in continuum mode. When the mass spectral signal falls below that threshold, simple linear interpolation will not provide a good approximation of the original signal, instead creating artificially high background. To address that, the ``binlinbase'' method uses linear interpolation between data points within 0.15 m/z of each other, and otherwise inserts a basal intensity value set to half of the minimum intensity. Those specific parameters can be changed using the \Rfunarg{profparam} argument. See documentation for the function \Rfunction{profBinLinBase} for more details. The last method, ``intlin'', uses integration and linear interpolation between mass/intensity pairs to determine the equally spaced intensity values. This has the advantage of being numerically stable regardless of the mass step size. However, it is more useful for visualization than peak identification and is generally not recommended as such. \\ \textbf{findPeaks.centWave} \\ The method \Rmethod{findPeaks.centWave} follows a different approach. This algorithm is most suitable for high resolution LC/\{TOF,OrbiTrap,FTICR\}-MS data in centroid mode. Due to the fact that peak centroids are used, a binning step is not necessary. In the first phase of the method mass traces (characterised as regions with less than \Rfunarg{ppm} m/z deviation in consecutive scans) in the LC/MS map are located. In the second phase these mass traces are further analysed. Continuous wavelet transform (CWT) is used to locate chromatographic peaks on different scales. Accordingly, two parameters have to be adjusted. The \Rfunarg{ppm} parameter has to be set according to the machine accuracy, e.g. \Rfunarg{ppm=25}. The peak width range (\Rfunarg{peakwidth=c(min,max)}) has to be set according to the chromatographic peak width range, e.g. \Rfunarg{peakwidth=c(20,50)} seconds for HPLC and \Rfunarg{peakwidth=c(5,12)} seconds for UPLC chromatography. The method is capable of detecting close-by-peaks and also overlapping peaks. Some efforts are made to detect the exact peak boundaries to get precise peak integrals. The peak attributes \Robject{sn} (Signal/Noise Ratio) and \Robject{egauss} (root-mean-square-error of the gaussian fit) can be used to assess the peak quality. \section{Matching Peaks Across Samples} After peak identification, peaks representing the same analyte across samples must be placed into groups. That is accomplished with the \Rmethod{group} method, which returns a new \Rclass{xcmsSet} object with the additional group information. The grouping process is non-destructive and does not affect the other data stored in the \Rclass{xcmsSet} object. Therefore, we can safely replace the \Robject{xset} object with the grouped version. The grouping algorithm processes the peak lists in order of increasing mass and will regularly output the mass it is currently working on. <>= xset <- group(xset) @ There are several grouping parameters to consider optimizing for your chromatography and mass spectrometer. Please consult the \Rmethod{group} documentation for more details. To see what the algorithm is doing while running, use the \Rfunarg{sleep} argument to specify a time (in seconds) to pause and plot each iteration. That can be quite useful for visualizing parameter effects. \section{Retention Time Correction} After matching peaks into groups, \Rpackage{xcms} can use those groups to identify and correct correlated drifts in retention time from run to run. The aligned peaks can then be used for a second pass of peak grouping which will be more accurate than the first. The whole process can be repeated in an iterative fashion, although we will only demonstrate a single pass of retention time alignment here. Not all peak groups will be helpful for identifying retention time drifts. Some groups may be missing peaks from a large fraction of samples and thus provide an incomplete picture of the drift at that time point. Still others may contain multiple peaks from the same sample, which is a sign of impropper grouping. \Rpackage{xcms} ignores those groups by only considering ``well-behaved'' peak groups which are missing at most one sample and have at most one extra peak. (Those values can be changed with the \Rfunarg{missing} and \Rfunarg{extra} arguments.) For each of those well-behaved groups, the algorithm calculates a median retention time and, for every sample, a deviation from that median. Within a sample, the observed deviation generally changes over time in a nonlinear fashion. Those changes are approximated using a local polynomial regression technique implemented in the \Rfunction{loess} function. By default, the curve fitting is done using least-squares on all data points. However, it is possible to enable outlier detection and removal by setting the \Rfunarg{family} argument to \texttt{"symmetric"}, as shown here. Retention time correction is performed by the \Rmethod{retcor} method, which returns an \Rclass{xcmsSet} object with corrected retention times. Because it changes the retention times of all peaks, it is important to store the new object under a new variable name. That will allow you to backtrack and repeat retention time correction if necessary. <>= xset2 <- retcor(xset, family = "symmetric", plottype = "mdevden") @ The above command uses the \Rfunarg{plottype} argument to produce a plot, shown in Figure~\ref{rtcorrection}, which is useful for supervising the algorithm. It includes the data points used for loess regression and the resulting deviation profiles. It additionally shows the distribution of peak groups across retention time. \begin{figure} \begin{center} \includegraphics{xcmsPreprocess-RTCorrection} \end{center} \caption{\label{rtcorrection} Retention time deviation profiles used for aligning the samples. The data points used for generating each profile are also shown. All times are in seconds. A negative number indicates a sample was eluting before most of the others, and vice versa. Samples that were acquired on the same day are colored similarly and have correlated deviation profiles, as expected. Below, kernel density estimation is used to show the distribution of all peaks and those peaks used as standards for retention time correction. Examples of two peaks before and after alignment are shown in Figure~\ref{eicalign}.} \end{figure} After retention time correction, the initial peak grouping becomes invalid and is discarded. Therefore, the resulting object needs to be regrouped. Here, we decrease the inclusiveness of the grouping using the \Rfunarg{bw} argument (default 30 seconds). <>= xset2 <- group(xset2, bw = 10) @ \section{Filling in Missing Peak Data} After the second pass of peak grouping, there will still be peak groups which are missing peaks from some of the samples. That can occur because peaks were missed during peak identification or because an analyte was not present in a sample. In any case, those missing data points can be filled in by rereading the raw data files and integrating them in the regions of the missing peaks. That is performed using the \Rmethod{fillPeaks} method, which returns a \Rclass{xcmsSet} object with the filled in peak data. While running, it outputs the name of the sample it is currently processing. <>= xset3 <- fillPeaks(xset2) xset3 @ \section{Analyzing and Visualizing Results} A report showing the most statistically significant differences in analyte intensities can be generated with the \Rmethod{diffreport} method. It will automatically generate extracted ion chromatograms for a given number of them, in this case 10. Several of those chromatograms are shown in Figure~\ref{eic}. \begin{figure} \begin{center} \begin{tabular}{cc} \includegraphics[width=0.49\textwidth]{example_eic/001}& \includegraphics[width=0.49\textwidth]{example_eic/002}\\ \includegraphics[width=0.49\textwidth]{example_eic/003}& \includegraphics[width=0.49\textwidth]{example_eic/004}\\ \end{tabular} \end{center} \caption{\label{eic} Auto-generated extracted ion chromatograms for the top four differentially regulated ions. Darkened lines indicate where the peaks were integrated for quantitation. The top two plots show the primary and secondary isotopic peaks of an N-acyl ethanolamine (NAE) with a 16 carbon acyl chain. The lower left plot shows the primary isotopic peak of an NAE with a 16 carbon, monounsaturated acyl chain. The lower right plot shows another potential FAAH substrate of unknown identity. Its peaks are not aligned because it is showing a different elution profile than the majority of the other metabolites. Compare it with peaks in the top two plots, which are also eluting at the same time but are correctly aligned.} \end{figure} <>= reporttab <- diffreport(xset3, "WT", "KO", "example", 10, metlin = 0.15, h=480, w=640) reporttab[1:4,] @ If the \Rfunarg{metlin} argument is set to a numeric value, the report will include links to the Metlin Metabolite Database (\url{http://metlin.scripps.edu/}) showing potential metabolite identities. A positive value indicates the data was acquired in positive ion mode and the neutral mass is calculated assuming all ions are M+H. A negative value does the opposite. The value itself indicates the uncertainty in mass accuracy. For instance, the first and third metabolites in the report produce the following URLs: \begin{itemize} \item \small{ <>= cat("\\url{", as.character(reporttab[1,"metlin"]), "}", sep = "") @ } \item \small{ <>= cat("\\url{", as.character(reporttab[3,"metlin"]), "}", sep = "") @ } \end{itemize} \section{Selecting and Visualizing Peaks} It is also possible to generate extracted ion chromatograms for arbitrary peak groups selected using various criteria. Here we generate EICs for two analytes eluting at different times. They are shown using both unaligned and aligned retention times. The resulting plots are shown in Figure~\ref{eicalign}. <>= gt <- groups(xset3) colnames(gt) groupidx1 <- which(gt[,"rtmed"] > 2600 & gt[,"rtmed"] < 2700 & gt[,"npeaks"] == 12)[1] groupidx2 <- which(gt[,"rtmed"] > 3600 & gt[,"rtmed"] < 3700 & gt[,"npeaks"] == 12)[1] eiccor <- getEIC(xset3, groupidx = c(groupidx1, groupidx2)) eicraw <- getEIC(xset3, groupidx = c(groupidx1, groupidx2), rt = "raw") @ <>= plot(eicraw, xset3, groupidx = 1) @ <>= plot(eicraw, xset3, groupidx = 2) @ <>= plot(eiccor, xset3, groupidx = 1) @ <>= plot(eiccor, xset3, groupidx = 2) @ <>= cat("These are the warning") warnings() @ \begin{figure} \begin{center} \begin{tabular}{cc} \includegraphics[width=0.49\textwidth]{xcmsPreprocess-EICRaw1}& \includegraphics[width=0.49\textwidth]{xcmsPreprocess-EICRaw2}\\ \includegraphics[width=0.49\textwidth]{xcmsPreprocess-EICCor1}& \includegraphics[width=0.49\textwidth]{xcmsPreprocess-EICCor2}\\ \end{tabular} \end{center} \caption{\label{eicalign} Unaligned (top) and aligned (bottom) extracted ion chromatograms from two analytes eluting at 2624 and 3678 seconds. Darkened lines indicate where the peaks were integrated for quantitation. A plot illustrating the retention time correction is shown in Figure~\ref{rtcorrection}.} \end{figure} \bibliographystyle{plainnat} \bibliography{xcms} \end{document}