\documentclass{article} \usepackage{amsmath} \usepackage{amscd} \usepackage[tableposition=top]{caption} \usepackage{ifthen} \usepackage[utf8]{inputenc} \topmargin 0in \headheight 0in \headsep 0in \oddsidemargin 0in \evensidemargin 0in \textwidth 176mm \textheight 215mm \begin{document} %\VignetteIndexEntry{\texttt{flagme}: Fragment-level analysis of \\ GC-MS-based metabolomics data} \title{\texttt{flagme}: Fragment-level analysis of \\ GC-MS-based metabolomics data} \author{Mark Robinson \\ \texttt{mrobinson@wehi.edu.au} \\ Riccardo Romoli \\ \texttt{riccardo.romoli@unifi.it}} \maketitle \section{Introduction} \noindent This document gives a brief introduction to the \texttt{flagme} package, which is designed to process, visualise and statistically analyze sets of GC-MS samples. The ideas discussed here were originally designed with GC-MS-based metabolomics in mind, but indeed some of the methods and visualizations could be useful for LC-MS data sets. The {\em fragment-level analysis} though, takes advantage of the rich fragmentation patterns observed from electron interaction (EI) ionization. There are many aspects of data processing for GC-MS data. Generally, algorithms are run separately on each sample to detect features, or {\em peaks} (e.g. AMDIS). Due to retention time shifts from run-to-run, an alignment algorithm is employed to allow the matching of the same feature across multiple samples. Alternatively, if known standards are introduced to the samples, retention {\em indices} can be computed for each peak and used for alignment. After peaks are matched across all samples, further processing steps are employed to create a matrix of abundances, leading into detecting differences in abundance. Many of these data processing steps are prone to errors and they often tend to be black boxes. But, with effective exploratory data analysis, many of the pitfalls can be avoided and any problems can be fixed before proceeding to the downstream statistical analysis. The package provides various visualizations to ensure the methods applied are not black boxes. The \texttt{flagme} package gives a complete suite of methods to go through all common stages of data processing. In addition, R is especially well suited to the downstream data analysis tasks since it is very rich in analysis tools and has excellent visualization capabilities. In addition, it is freely available (\texttt{www.r-project.org}), extensible and there is a growing community of users and developers. For routine analyses, graphical user interfaces could be designed. \section{Reading and visualizing GC-MS data} To run these examples, you must have the \texttt{gcspikelite} package installed. This data package contains several GC-MS samples from a spike-in experiment we designed to interrogate data processing methods. So, first, we load the packages: <>= require(gcspikelite) library(flagme) @ To load the data and corresponding peak detection results, we simply create vectors of the file-names and create a \texttt{peakDataset} object. Note that we can speed up the import time by setting the retention time range to a subset of the elution, as below: <>= gcmsPath <- paste(find.package("gcspikelite"), "data", sep="/") data(targets) cdfFiles <- paste(gcmsPath, targets$FileName, sep="/") eluFiles <- gsub("CDF", "ELU", cdfFiles) pd <- peaksDataset(cdfFiles, mz=seq(50,550), rtrange=c(7.5,8.5)) pd <- addAMDISPeaks(pd, eluFiles) pd @ Here, we have added peaks from AMDIS, a well known and mature algorithm for deconvolution of GC-MS data. For GC-TOF-MS data, we have implemented a parser for the \texttt{ChromaTOF} output (see the analogous \texttt{addChromaTOFPeaks} function). The \texttt{addXCMSPeaks} allows to use all the XCMS peak-picking algorithms; using this approach it is also possible to elaborate the raw data file from within R instead of using an external software. %% Support for XMCS or MzMine may be added in the future. Ask the author %% if another detection result format is desired as the parsers are %% generally easy to design. In particular the function reads the raw data using XCMS, group each extracted ion according to their retention time using CAMERA and attaches them to an already created \texttt{peaksDataset} object: <>= pd.2 <- peaksDataset(cdfFiles[1:3], mz = seq(50, 550), rtrange = c(7.5, 8.5)) mfp <- xcms::MatchedFilterParam(fwhm = 10, snthresh = 5) pd.2 <- addXCMSPeaks(cdfFiles[1:3], pd.2, settings = mfp) pd.2 @ The possibility to work using computer cluster will be added in the future. Regardless of platform and peak detection algorithm, a useful visualization of a set of samples is the set of total ion currents (TIC), or extracted ion currents (XICs). To view TICs, you can call: <>= plotChrom(pd, rtrange=c(7.5,8.5), plotPeaks=TRUE, plotPeakLabels=TRUE, max.near=8, how.near=0.5, col=rep(c("blue","red","black"), each=3)) @ Note here the little {\em hashes} represent the detected peaks and are labelled with an integer index. One of the main challenges is to match these peak detections across several samples, given that the appear at slightly different times in different runs. For XICs, you need to give the indices (of \texttt{pd@mz}, the grid of mass-to-charge values) that you want to plot through the \texttt{mzind} argument. This could be a single ion (e.g. \texttt{mzind=24}) or could be a range of indices if multiple ions are of interest (e.g. \texttt{mzind=c(24,25,98,99)}). There are several other features within the \texttt{plot} command on \texttt{peaksDataset} objects that can be useful. See \texttt{?plot} (and select the flagme version) for full details. Another useful visualization, at least for individual samples, is a 2D heatmap of intensity. Such plots can be enlightening, especially when peak detection results are overlaid. For example (with detected fragment peaks from AMDIS shown in white): <>= r <- 1 plotImage(pd, run=r, rtrange=c(7.5,8.5), main="") v <- which(pd@peaksdata[[r]] > 0, arr.ind=TRUE) # find detected peaks abline(v=pd@peaksrt[[r]]) points(pd@peaksrt[[r]][v[,2]], pd@mz[v[,1]], pch=19, cex=.6, col="white") @ \section{Pairwise alignment with dynamic programming algorithm} One of the first challenges of GC-MS data is the matching of detected peaks (i.e. metabolites) across several samples. Although gas chromatography is quite robust, there can be some drift in the elution time of the same analyte from run to run. We have devised a strategy, based on dynamic programming, that takes into account both the similarity in spectrum (at the apex of the called peak) and the similarity in retention time, without requiring the identity of each peak; this matching uses the data alone. If each sample gives a `peak list' of the detected peaks (such as that from AMDIS that we have attached to our \texttt{peaksDataset} object), the challenge is to introduce gaps into these lists such that they are best aligned. From this a matrix of retention times or a matrix of peak abundances can be extracted for further statistical analysis, visualization and interpretation. For this matching, we created a procedure analogous to a multiple {\em sequence} alignment. To highlight the dynamic programming-based alignment strategy, we first illustrate a pairwise alignment of two peak lists. This example also illustrates the selection of parameters necessary for the alignment. From the data read in previously, let us consider the alignment of two samples, denoted \texttt{0709\_468} and \texttt{0709\_474}. First, a similarity matrix for two samples is calculated. This is calculated based on a scoring function and takes into account the similarity in retention time and in the similarity of the apex spectra, according to: \[ S_{ij}(D) = \frac{\sum_{k=1}^K x_{ik} y_{jk}}{\sqrt{ \sum_{k=1}^K x_{ik}^2 \cdot \sum_{k=1}^K y_{jk}^2 } } \cdot \exp \left( - \frac{1}{2} \frac{(t_i-t_j)^2}{D^2} \right) \] \noindent where $i$ is the index of the peak in the first sample and $j$ is the index of the peak in the second sample, $\mathbf{x}_i$ and $\mathbf{y}_j$ are the spectra vectors and $t_i$ and $t_j$ are their respective retention times. As you can see, there are two components to the similarity: spectra similarity (left term) and similarity in retention time (right term). Of course, other metrics for spectra similarity are feasible. Ask the author if you want to see other metrices implemented. We have some non-optimized code for a few alternative metrics. The peak alignment algorithm, much like sequence alignments, requires a \texttt{gap} parameter to be set, here a number between 0 and 1. A high gap penalty discourages gaps when matching the two lists of peaks and a low gap penalty allows gaps at a very low {\em cost}. We find that a gap penalty in the middle range (0.4-0.6) works well for GC-MS peak matching. Another parameter, \texttt{D}, modulates the impact of the difference in retention time penalty. A large value for \texttt{D} essentially eliminates the effect. Generally, we set this parameter to be a bit larger than the average width of a peak, allowing a little flexibility for retention time shifts between samples. Keep in mind the \texttt{D} parameter should be set on the scale (i.e. seconds or minutes) of the \texttt{peaksrt} slot of the \texttt{peaksDataset} object. The next example shows the effect of the \texttt{gap} and \texttt{D} penalty on the matching of a small ranges of peaks. <>= Ds <- c(0.1, 10, 0.1, 0.1) gaps <- c(0.5, 0.5, 0.1, 0.9) par(mfrow=c(2,2), mai=c(0.8466,0.4806,0.4806,0.1486)) for(i in 1:4){ pa <- peaksAlignment(pd@peaksdata[[1]], pd@peaksdata[[2]], pd@peaksrt[[1]], pd@peaksrt[[2]], D=Ds[i], gap=gaps[i], metric=1, type=1, compress = FALSE) plotAlignment(pa, xlim=c(0, 17), ylim=c(0, 16), matchCol="yellow", main=paste("D=", Ds[i], " gap=", gaps[i], sep="")) } @ You might ask: is the flagme package useful without peak detection results? Possibly. There have been some developments in alignment (generally on LC-MS proteomics experiments) without peak/feature detection, such as Prince et al. 2006, where a very similar dynamic programming is used for a pairwise alignment. We have experimented with alignments without using the peaks, but do not have any convincing results. It does introduce a new set of challenges in terms of highlighting differentially abundant metabolites. However, in the \texttt{peaksAlignment} routine above (and those mentioned below), you can set \texttt{usePeaks=FALSE} in order to do {\em scan}-based alignments instead of {\em peak}-based alignments. In addition, the \texttt{flagme} package may be useful simply for its bare-bones dynamic programming algorithm. \subsection{Normalizing retention time score to drift estimates} In what is mentioned above for pairwise alignments, we are penalizing for differences in retention times that are non-zero. But, as you can see from the TICs, some differences in retention time are consistent. For example, all of the peaks from sample \texttt{0709\_485} elute at later times than peaks from sample \texttt{0709\_496}. We should be able to estimate this drift and normalize the time penalty to that estimate, instead of penalizing to zero. That is, we should replace $t_i-t_j$ with $t_i-t_j-\hat{d}_{ij}$ where $\hat{d}_{ij}$ is the expected drift between peak $i$ of the first sample and peak $j$ of the second sample. More details coming soon. \subsection{Imputing location of undetected peaks} One goal of the alignment leading into downstream data analyses is the generation of a table of abundances for each metabolite across all samples. As you can see from the TICs above, there are some low intensity peaks that fall below detection in some but not all samples. Our view is that instead of inserting arbitrary low constants (such as 0 or half the detection limit) or imputing the intensities post-hoc or having missing data in the data matrices, it is best to return to the area of the where the peak should be and give some kind of abundance. The alignments themselves are rich in information with respect to the location of undetected peaks. We feel this is a more conservative and statistically valid approach than introducing arbitrary values. More details coming soon. \section{Multiple alignment of several experimental groups} Next, we discuss the multiple alignment of peaks across many samples. With replicates, we typically do an alignment within replicates, then combine these together into a summarized form. This cuts down on the computational cost. For example, consider 2 sets of samples, each with 5 replicates. Aligning first within replicates requires 10+10+1 total alignments whereas an all-pairwise alignment requires 45 pairwise alignments. In addition, this allows some flexibility in setting different gap and distance penalty parameters for the {\em within} alignment and {\em between} alignment. An example follows. <>= print(targets) ma <- multipleAlignment(pd, group=targets$Group, wn.gap=0.5, wn.D=.05, bw.gap=.6, bw.D=0.05, usePeaks=TRUE, filterMin=1, df=50, verbose=FALSE, metric = 1, type = 1) # bug ma @ If you set \texttt{verbose=TRUE}, many nitty-gritty details of the alignment procedure are given. Next, we can take the alignment results and overlay it onto the TICs, allowing a visual inspection. <>= plotChrom(pd, rtrange=c(7.5,8.5), runs=ma@betweenAlignment@runs, mind=ma@betweenAlignment@ind, plotPeaks=TRUE, plotPeakLabels=TRUE, max.near=8, how.near=.5, col=rep(c("blue","red","black"), each=3)) @ % \section{Correlation Alignment algorithm} % Another approach, represented by the \texttt{correlationAlignment} % function, is to use a modified form of the Pearson correlation % algorithm. After the correlation between two samples is calculated, a % penalization coefficient, based on the retention time differences, is % applied to the result. It is also possible to set a retention time % range in which the penalization is 0, this because in gas % chromatography we can have a little deviation in the retention time of % the metabolite so, based on the experimental data, we can choose the % retention time window for the penalization coefficient being applied. <>= mp <- correlationAlignment(object=pd.2, thr=0.85, D=20, penality=0.2, normalize=TRUE, minFilter=1) mp @ % \noindent where \texttt{thr} represent correlation threshold from 0 % (min) to 1 (max); \texttt{D} represent the retention time window in % seconds; \texttt{penality} represent the penality inflicted to a match % between two peaks when the retention time difference exceed the % parameter \texttt{D}; \texttt{normalize} is about the peak % normalization-to-100 before the correlation is calculated; % \texttt{minFilter} give the opportunity to exclude from the resulting % correlation matrix each feature that in represented in our samples % less time than this value. The value of minFilter must be smaller than % the number of samples. % The correlation-based peak alignment for multiple GC-MS % peak lists uses a center-star technique to the alignment of the % peaks. The combination of the \texttt{D} and \texttt{penality} parameters % allow the users to force the algorithm to match the peaks close to the % reference. The \texttt{thr} parameter control the matching factor. \subsection{Gathering results} The alignment results can be extracted from the \texttt{multipleAlignment} object as: <>= ma@betweenAlignment@runs ma@betweenAlignment@ind @ \noindent This table would suggest that matched peak \texttt{8} (see numbers below the TICs in the figure above) corresponds to detected peaks \texttt{9, 12, 11} in runs \texttt{4, 5, 6} and so on, same as shown in the above plot. In addition, you can gather a list of all the merged peaks with the \texttt{gatherInfo} function, giving elements for the retention times, the detected fragment ions and their intensities. The example below also shows the how to construct a table of retention times of the matched peaks (No attempt is made here to adjust retention times onto a common scale. Instead, the peaks are matched to each other on their original scale). For example: <>= outList <- gatherInfo(pd,ma) outList[[8]] rtmat <- matrix(unlist(lapply(outList,.subset,"rt"), use.names=FALSE), nr=length(outList), byrow=TRUE) colnames(rtmat) <- names(outList[[1]]$rt); rownames(rtmat) <- 1:nrow(rtmat) round(rtmat, 3) @ \section{Future improvements and extension} There are many procedures that we have implemented in our investigation of GC-MS data, but have not made part of the package just yet. Some of the most useful procedures will be released, such as: \begin{enumerate} \item Parsers for other peak detection algorithms (e.g. % XCMS, MzMine) and parsers for other alignment procedures (e.g. SpectConnect) and perhaps retention indices procedures. \item More convenient access to the alignment information and abundance table. \item Statistical analysis of differential metabolite abundance. \item Fragment-level analysis, an alternative method to summarize abundance across all detected fragments of a metabolite peak. \end{enumerate} \section{References} See the following for further details: \begin{enumerate} \item Robinson MD. {\em Methods for the analysis of gas chromatography - mass spectrometry data.} {\bf Ph.D. Thesis}. October 2008. Department of Medical Biology (Walter and Eliza Hall Institute of Medical Research), University of Melbourne. \item Robinson MD, De Souza DP, Keen WW, Saunders EC, McConville MJ, Speed TP, Liki\'{c} VA. (2007) {\em A dynamic programming approach for the alignment of signal peaks in multiple gas chromatography-mass spectrometry experiments.} {\bf BMC Bioinformatics}. 8:419. \item Prince JT, Marcotte EM (2006) {\em Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping}. {\bf Anal Chem}. 78(17):6140-52. \end{enumerate} \section{This vignette was built with/at ...} <>= sessionInfo() date() @ \end{document}