%\VignetteIndexEntry{IPPD Manual} %\VignetteKeywords{Proteomics} %\VignetteDepends{IPPD} %\VignettePackage{IPPD} \input{preamble_paper.tex} \title{\texttt{IPPD} package vignette} \author{ \begin{tabular}{ll} Martin Slawski $^{1}$ & \url{ms@cs.uni-saarland.de} \\ Rene Hussong $^2$ & \url{rene.hussong@uni.lu} \\ Andreas Hildebrandt $^{3}$ & \url{andreas.hildebrandt@uni-mainz.de} \\ Matthias Hein $^1$ & \url{hein@cs.uni-saarland.de} \end{tabular} } \SweaveOpts{keep.source=TRUE} \begin{document} \date{\normalsize{\begin{tabular}{l} $^1$ Saarland University, Department of Computer Science, Machine Learning Group,\\ $\;$ Saarbr\"ucken, Germany \\ $^2$ Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Luxemburg \\ $^3$ Johannes Gutenberg-University Mainz, Institute for Computer Science, Mainz, Germany\\ \end{tabular}\\ \vspace{0.2cm} \large , }} \maketitle %\vspace{11pt} \normalsize \begin{abstract} This is the vignette of the Bioconductor add-on package \texttt{IPPD} which implements automatic isotopic pattern extraction from a raw protein mass spectrum. Basically, the user only has to provide mass/charge channels and corresponding intensities, which are automatically decomposed into a list of monoisotopic peaks. \texttt{IPPD} can handle several charge states as well as overlaps of peak patterns. \end{abstract} %<>= %library(packagename) %@ \section{Aims and scope of \texttt{IPPD}} A crucial challenge in the analysis of protein mass spectrometry data is to automatically process the raw spectrum to a list of peptide masses. \texttt{IPPD} is tailored to spectra where peptides emerge in the form of isotope patterns, i.e. one observes several peaks for each peptide mass at a given charge state due to the natural abundance of heavy isotopes. Datasets with a size of up to 100,000 mass/charge channels and the presence of isotope patterns at multiple charge states frequently exhibiting overlap make the manual annotation of a raw spectrum a tedious task. \texttt{IPPD} provides functionality to perform this task in a fully automatic, transparent and user-customizable way. Basically, one feeds the raw spectrum into one single function to obtain a list of monoisotopic peaks described by a mass/charge channel, a charge and an intensity. What makes our approach particularly user-friendly is its dependence on only a small set of easily interpretable parameters. We also offer a method to display the decomposition of the spectrum graphically, thereby facilitating a manual validation of the output. \section{Methodology} \subsection{Template model} In the context of this package, a protein mass spectrum is understood as a sequence of pairs $\{x_i, y_i \}_{i=1}^n$, where $x_i = m_i/z_i$ is a mass ($m_i$) per charge ($z_i$) value (measured in Thomson) and $y_i$ is the intensity, i.e. the abundance of a particular mass (modulo charge state), observed at $x_i$, $i=1,\ldots,n$, which are assumed to be in an increasing order. The $y_i$ are modeled as a linear combination of template functions representing prior knowledge about peak shapes and the composition of isotopic patterns. If our model were exact, we could write \begin{equation}\label{eq:model.basic} \bm{y} = \bm{\Phi} \bm{\beta}^{\ast}, \quad \bm{y} = (y_1,\ldots,y_n)^{\T}, \end{equation} where $\bm{\Phi}$ is a matrix template functions and $\bm{\beta}^{\ast}$ a vector of weights for each template. Only a small fraction of all templates are needed to fit the signal, i.e. $\bm{\beta}^{\ast}$ is highly sparse. Since $\bm{y} \geq 0$, where '$\geq$' is understood componentwise, all template functions are nonnegative and accordingly $\bm{\beta}^{\ast} \geq 0$. \begin{figure} \begin{tabular}{cc} \includegraphics[height = 7cm, width = 7cm]{templates} & \includegraphics[height = 7cm, width = 7cm]{templatedetail} \end{tabular} \caption{Illustration of the template construction as described in the text. The left panel depicts different templates of different charge states (1 to 4). The right panel zooms at the charge two template $\varphi_2$.}\label{fig:templates} \end{figure} Model \eqref{eq:model.basic} can equivalently be written as \begin{equation}\label{eq:model.basic.charge} \bm{y} = \left[ \begin{array}{ccc} \bm{\Phi}_1 & \ldots & \bm{\Phi}_C \end{array} \right] \left[ \begin{array}{c} \bm{\beta}^{\ast}_1 \\ \vdots \\ \bm{\beta}^{\ast}_C \end{array} \right] = \sum_{c=1}^C \bm{\Phi}_c \bm{\beta}^{\ast}_c, \end{equation} where $\bm{\Phi}_c, \bm{\beta}_c^{\ast}$ denote the matrix of template functions and weight vector to fit isotopic patterns of a particular charge state $c$, $c = 1,\ldots,C$. Each submatrix $\bm{\Phi}_c$ can in turn be divided into columns $\bm{\varphi}_{c,1}, \ldots, \bm{\varphi}_{c,p_c}$, where the entries of each column vector store the evaluations of a template $\varphi_{c_j}$, $j=1,\ldots,p_c,$ at the $x_i$, $i=1,\ldots,n$. Each template $\varphi_{c,j}$ depends on parameter $m_{c,j}$ describing the $m/z$ position at which $\varphi_{c,j}$ is placed. A template $\varphi_{c,j}$ is used to fit an isotopic pattern of peaks composed of several single peaks, which is modeled as \begin{equation}\label{eq:model.template} \varphi_{c,j} = \sum_{k \in Z_{c,j}} a_{c,j,k} \; \psi_{c,j,k, \bm{\theta}_{c,j}}, \quad Z_{c,j} \subset \mathbb{Z} \end{equation} where the $\psi_{c,j,k}$ are functions representing a peak of a single isotope within an isotopic pattern. They depend on $m_{c,j}$ and a parameter vector $\bm{\theta}_{c,j}$. The nonnegative weights $a_{c,j,k}$ reflect the relative abundance of the isotope indexed by $k$. The $a_{c,j,k}$ are computed according to the averagine model (\citet{Senko1995}) and hence are fixed in advance. Each $\psi_{c,j,k}$ is linked to a location $m_{c,j,k}$ at which it attains its maximum. The $m_{c,j,k}$ are calculated from $m_{c,j}$ as $m_{c,j,k} = m_{c,j} + \kappa \frac{k}{c}$, where $\kappa$ equals 1 Dalton ($\approx 1.003$). The rationale behind Eq. \eqref{eq:model.template} and the definitions that follow is the fact that the location of the most intense isotope is taken as characteristic location of the template, i.e. we set $m_{c,j,0} = m_{c,j}$ so that the remaining $m_{c,j,k}, \; k \neq 0,$ are computed by shifting $m_{c,j}$ in both directions on the $m/z$ axis. By 'most intense isotope', we mean that $a_{c,j,0} = \max_k a_{c,j,k} = 1$. The set $Z_{c,j}$ is a subset of the integers which depends on the averagine model and a pre-specified tolerance, i.e. we truncate summation in Eq. \eqref{eq:model.template} if the weights drop below that tolerance. Figure \ref{fig:templates} illustrates the construction scheme and visualizes our notation. \subsection{Peak shape}\label{sec:peakshape} In an idealized setting, the $\psi_{c,j,k}$ are delta functions at specific locations. In practice, however, the shape of a peak equals that of a bump which may exhibit some skewness. In the case of no to moderate skewness, we model peaks by Gaussian functions: \begin{equation}\label{eq:gaussian} \psi_{c,j,k}(x) = \exp \left( - \frac{(x - m_{c,j,k})^2}{\sigma_{c,j}}\right). \end{equation} The parameter to be determined is $\bm{\theta}_{c,j} = \sigma_{c,j} > 0$. In the case of considerable skewness, peaks are modeled by exponentially modified Gaussian (EMG) functions, see for instance \citet{Grushka1972}, \citet{MarcoBombi2001}, and \citet{SchulzTrieglaff2007} in the context of protein mass spectrometry: \begin{align}\label{eq:emg} \begin{split} &\psi_{c,j,k}(x) = \frac{1}{\alpha_{c,j}} \exp \left(\frac{\sigma_{c,j}^2}{2 \alpha_{c,j}^2} + \frac{\mu_{c,j} - (x - m_{c,j,k})}{\alpha_{c,j}} \right) \left(1 - F\left(\frac{\sigma_{c,j}}{\alpha_{c,j}} + \frac{\mu_{c,j} - (x - m_{c,j,k})}{\sigma_{c,j}} \right) \right), \\ &F(t) = \int_{-\infty}^t \frac{1}{\sqrt{2 \pi}} \exp \left(-\frac{u^2}{2} \right) \; du. \end{split} \end{align} The EMG function involves a vector of three parameters $\bm{\theta}_{c,j} = (\alpha_{c,j}, \sigma_{c,j}, \mu_{c,j})^{\T} \in \R^+ \times \R^+ \times \R$. The parameter $\alpha_{c,j}$ controls the additional length of the right tail as compared to a Gaussian. For $\alpha_{c,j} \downarrow 0$, the EMG function becomes a Gaussian. For our fitting approach as outlined in Section \ref{sec:templatefitting}, it is crucial to estimate the $\bm{\theta}_{c,j}$, which are usually unknown, from the data as good as possible. To this end, we model each component $\theta_l$ of $\bm{\theta}$ as a linear combination of known functions $g_{l,m}$ of $x = m/z$ and an error component $\varepsilon_l$, i.e. \begin{equation}\label{eq:parametermodel} \theta_l(x) = \sum_{m = 1}^{M_l} \nu_{l,m} g_{l,m}(x) + \varepsilon_l(x). \end{equation} In the case of no prior knowledge about the $g_{l,m}$, we model $\theta_l$ as a constant independent of $x$. In most cases, it is sensible to assume a linear trend, i.e. $\theta_{l}(x) = \nu_{l,1} + \nu_{l,2} x$. In order to fit a model of the form \eqref{eq:parametermodel}, we have to collect information from the data $\{x_i, y_i \}_{i=1}^n$. To be precise, we proceed according to the following steps. \begin{enumerate} \item We apply a simple peak detection algorithm to the spectrum to identify disjoint regions $\mc{R}_r \subset \{1,\ldots, n \}, \; \; r=1,\ldots,R$, of well-resolved peaks. \item For each region $r$, we fit the chosen peak shape to the data $\{x_i, y_i \}_{i \in \mc{R}_r}$ using nonlinear least squares: \begin{equation} \min_{\bm{\theta}} \sum_{i \in \mc{R}_r} (y_i - \psi_{\bm{\theta}}(x_i))^2, \end{equation} yielding an estimate $\wh{\bm{\theta}}_r(\wh{x}_r)$, where $\wh{x}_r$ denotes an estimation for the mode of the peak in region $\mc{R}_r$. \item The sequence $\{ \wh{x}_r, \wh{\bm{\theta}}_r \}_{r=1}^R$ is then used as input for the estimation of the parameters $\nu_{l,m}$ in model \eqref{eq:parametermodel}. \end{enumerate} Step 2. is easily solved by the general purpose nonnegative least squares routine \texttt{nls} in \texttt{R:::stats} for a Gaussian peak shape. For the EMG, we have to perform a grid search over all three parameters to find a suitable starting value, which is then passed to the general purpose optimization routine \texttt{optim} in \texttt{R:::stats} with the option \texttt{method = "BFGS"} and a specification of a closed form expression of the gradient via the argument \texttt{gr}. For step 3., we use least absolute deviation regression because of the presence of outliers arising from less well-resolved, wiggly or overlapping peaks. The whole procedure is performed by the function \texttt{fitModelParameters} as demonstrated below. After loading the package, we access the real world dataset \texttt{myo500} and extract $m/z$ channels (\texttt{x}) and the corresponding intensities (\texttt{y}). For computational convenience and since they contain very few relevant information, we discard all channels above $2500$. <>= options(prompt = "R> ", continue = " ", warn = -1, width = 90) @ <>= library(IPPD) data(myo500) x <- myo500[,"mz"] y <- myo500[,"intensities"] y <- y[x <= 2500] x <- x[x <= 2500] @ To have a look at the data, we plot the first $1000$ (\texttt{x},\texttt{y}) pairs: <>= layout(matrix(c(1,2), 1, 2)) plot(x[1:1000], y[1:1000], xlab = expression(x[1]~~ldots~~x[1000]), cex.lab = 1.5, cex.axis = 1.25, ylab = expression(y)) plot(x[x >= 804 & x <= 807], y[x >= 804 & x <= 807], xlab = "x: 804 <= x <= 807", cex.lab = 1.5, cex.axis = 1.25, ylab = expression(y), type = "b") layout(matrix(1)) @ In the plot, one identifies a prominent peak pattern beginning at about $804$, which is zoomed at in the right panel.\\ We now apply \texttt{fitModelParameters} to fit model \eqref{eq:parametermodel} for the width parameter $\sigma$ of a Gaussian function \eqref{eq:gaussian}. For simplicity, we take $g_1(x) = 1$, $g_2(x) = x$. The model is specified by using an \texttt{R} formula interface. <>= fitGauss <- fitModelParameters(mz = x, intensities = y, model = "Gaussian", fitting = "model", formula.sigma = formula(~mz), control = list(window = 6, threshold = 200)) @ An analogous command for the EMG \eqref{eq:emg} with the model formulae $\alpha(x) = \nu_{1,1} + \nu_{1,2} x$, $\sigma(x) = \nu_{2,1} + \nu_{2,2} x$, $\mu(x) = \nu_{3,1}$ is given by <>= fitEMG <- fitModelParameters(mz = x, intensities = y, model = "EMG", fitting = "model", formula.alpha = formula(~mz), formula.sigma = formula(~mz), formula.mu = formula(~1), control = list(window = 6, threshold = 200)) @ Inspecting the results, we find that $R = 55$ peak regions are used to fit an EMG parameter model. Moreover, it turns out that the EMG model is a more appropriate peak model for the data when visually comparing the list of mean residual sums of squares of the EMG fits and the Gauss fits extracted from \texttt{slot(fitEMG, "peakfitresults")} and \texttt{slot(fitGauss, "peakfitresults")}, respectively. The figure shows an example where the EMG shape comes relatively close to the observed data. A long right tail indicates that a Gaussian would yield a rather poor fit here. <>= show(fitEMG) mse.EMG <- data.frame(mse = slot(fitEMG,"peakfitresults")[,"rss"] / slot(fitEMG,"peakfitresults")[,"datapoints"], peakshape = rep("EMG", nrow( slot(fitEMG,"peakfitresults")))) mse.Gauss <- data.frame(mse = slot(fitGauss,"peakfitresults")[,"rss"] / slot(fitGauss,"peakfitresults")[,"datapoints"], peakshape = rep("Gaussian", nrow( slot(fitGauss,"peakfitresults")))) mses <- rbind(mse.EMG, mse.Gauss) with(mses, boxplot(mse ~ peakshape, cex.axis = 1.5, cex.lab = 1.5, ylab = "MSE")) @ <>= visualize(fitEMG, type = "peak", cex.lab = 1.5, cex.axis = 1.25) @ To assess the fit of the two linear models for the EMG parameters $\alpha$ and $\sigma$, we use again the function \texttt{visualize} as follows: <>= visualize(fitEMG, type = "model", modelfit = TRUE, parameters = c("sigma", "alpha"), cex.lab = 1.5, cex.axis = 1.25) @ While the fit for $\sigma$ seems to be reasonable except for some extreme outliers, the fit for $\alpha$ is not fully convincing. Nevertheless, in the absence of further knowledge, the fit produces good results in the template matching step detailed in the next section. \subsection{Template fitting}\label{sec:templatefitting} Once all necessary parameters have been determined, the positions at which the templates are placed have to be fixed. In general, one has to choose positions from the interval $[x_1, x_n]$. We instead restrict us to a suitable subset of the finite set $\{ x_i \}_{i=1}^n$. The deviations from the true positions is then at least in the order of the sampling rate, but this can be improved by means of a postprocessing step described in \ref{sec:postprocessing}. Using the whole set $\{ x_i \}_{i=1}^n$ may be computationally infeasible if $n$ is large. Such an approach would be at least computationally wasteful, since 'genuine' peaks patterns occur very sparsely in the spectrum. Therefore, we apply a pre-selection step on the basis of what we term 'local noise level' (LNL). The LNL is defined as a quantile (typically the median) of the intensities $y_i$ falling into a sliding window of fixed width around a specific position. Given the LNL, we place templates on an $x_i$ (one for each charge state) if and only if the corresponding $y_i$ exceeds the LNL at $x_i$ by a factor \texttt{factor.place}, which typically equals three or four and has to be specified by the user. Given the positions of the templates, we compute the matrix $\bm{\Phi}$ according to Eqs. \eqref{eq:model.basic} and \eqref{eq:model.template}. It then remains to estimate the coefficient vector $\bm{\beta}^{\ast}$ on the basis of two structural assumptions, sparsity and nonnegativity of all quantities involved. Related approaches in the literature (\citet{Du2006}, \citet{Renard2008}) account for sparsity of $\bm{\beta}^{\ast}$ by using $\ell_1$-regularized regression (\citet{Tib1996}). We here argue empirically that $\ell_1$ regularization is not the best to do, since it entails the selection of a tuning parameter which is difficult to choose in our setting, and secondly the structural constraints concerning nonnegativity turn out to be so strong that sparsity is more conveniently achieved by fitting followed by hard thresholding. We first determine \begin{align}\label{eq:fitting} \begin{split} &\wh{\bm{\beta}} \in \argmin_{\bm{\beta}} \norm{\bm{y} - \bm{\Phi} \bm{\beta}}_{q}^q, \; \; q=1 \; \text{or} \; q = 2, \\ & \text{subject to} \; \; \bm{\beta} \geq 0. \end{split} \end{align} The optimization problem \eqref{eq:fitting} is a quadratic ($q=2$) or linear ($q = 1$) program and is solved using standard techniques (\citet{BoydVandenberghe2004}); we omit further details here. We remark that in the presence of high noise, it is helpful to subtract the LNL from $\bm{y}$. Concerning the choice of $q$, we point out that $q = 1$ can cope better with deviations from model assumptions, i.e. deviations from the averagine model or from the peak model and thus may lead to a reduction of the number of false positives. \subsection{Postprocessing}\label{sec:postprocessing} Given an estimate $\wh{\bm{\beta}}$, we define $\mc{M}_c = \{m_{c,j}: \; \wh{\beta}_{c,j} > 0 \} \subset \{x_i \}_{i=1}^n, \; \; c = 1,\ldots,C,$ as the set of all template locations where the corresponding coefficient exceeds $0$, separately for each charge. Due to a limited sampling rate, different sources of noise and model misfit, the locations in the sets $\{ \mc{M}_c \}_{c=1}^C$ may still deviate considerably from the set of true peak pattern locations. Specifically, the sets $\{\mc{M}_c \}_{c=1}^C$ tend to be too large, mainly caused by what we term 'peak splitting': for the reasons just mentioned, it frequently occurs that several templates are used to fit the same peak. This can at least partially be corrected by means of the following merging procedure. \begin{enumerate} \item Separately for each $c$, divide the sets $\mc{M}_c$ into groups $\mc{G}_{c,1},\ldots,\mc{G}_{c,G_c}$ of 'adjacent' positions. Positions are said to be adjacent if their distance on the $m/z$ scale is below a certain tolerance as specified via a parts per million (ppm) value. \item For each $c = 1,\ldots,C$ and each group $g_c = 1,\ldots,G_c$, we solve the following optimization problem. \begin{equation}\label{eq:merging} (\wt{m}_{c, g}, \wt{\beta}_{c, g}) = \min_{m_{c, g}, \beta_{c, g}} \norm{\sum_{m_{c, j} \in \mc{G}_{c, g}} \wh{\beta}_{c, j} \psi_{m_{c, j}} - \beta_{c, g} \psi_{m_{c, g}} }_{L^2}^2 \end{equation} In plain words, we take the fitted function resulting from the functions $\{ \psi_{m_{c, j}} \}$ representing the most intense peak of each peak pattern in the same group and then determine a function $\psi_{\wt{m}_{c, g}}$ placed at location $\wt{m}_{c, g}$ and weighted by $\wt{\beta}_{c, g}$ such that $\wt{\beta}_{c, g} \psi_{\wt{m}_{c, g}}$ approximates the fit of multiple functions $\{ \psi_{m_{c, j}} \}$ best (in a least squares sense). \item One ends up with sets $\wt{\mc{M}}_{c} = \{ \wt{m}_{c, g} \}_{g=1}^{G_c}$ and coefficients $\{ \wt{\beta}_{c, g} \}_{g = 1}^{G_c}, \; \; c=1,\ldots,C$. \end{enumerate} The additional benefit of step 2. as compared to the selection of the function with the largest coefficient as proposed in \citet{Renard2008} is that, in the optimal case, we are able to determine the peak pattern location even more accurate as predetermined by a limited sampling rate. The integral in \eqref{eq:merging} can be solved analytically for a Gaussian function, and we resort to numeric approximations for the EMG function.\\ The sets $\{ \mc{M}_{c} \}$ tend to be too large in the sense that they still contain noise peak patterns. Therefore, we apply hard thresholding to the $ \{ \wt{\beta}_{c, g} \}_{g=1}^{G_c}, \; c=1,\ldots,C$, discarding all positions where the corresponding coefficients is less than a significance level times the LNL, where the signficance level has to be specified by the user. \section{Case study}\label{sec:casestudy} We continue the data analysis starting in Section \ref{sec:peakshape}. The methodology of the Sections \ref{sec:templatefitting} and \ref{sec:postprocessing} is implemented in the function \texttt{getPeaklist}. For the computation of the template functions, we recycle the object \texttt{fitEMG} obtained in Section \ref{sec:peakshape}. <>= EMGlist <- getPeaklist(mz = x, intensities = y, model = "EMG", model.parameters = fitEMG, loss = "L2", trace = FALSE, control.localnoise = list(factor.place = 2), control.basis = list(charges = c(1, 2)), control.postprocessing = list(ppm = 200)) show(EMGlist) @ The argument list can be summarized as follows: we compute EMG templates for charges $1$ and $2$; templates are placed on all $m/z$-positions in the spectrum where the intensity is at least two times the LNL; the fit is least squares (\texttt{loss = L2}); postprocessing is performed by merging peaks within a tolerance of 200 ppm. Subsequently, only the patterns with signal-to-noise ratio bigger than three are maintained. The result is of the following form. %\scriptsize{ <>= threshold(EMGlist, threshold = 3, refit = TRUE, trace = FALSE) @ %} The results can be examined in detail graphically. We finally present some selected regions to demonstrate that our method performs well. The pre-defined method \texttt{visualize} can be used display the template fitting at several stages for regions within selected $m/z$ intervals as specified by the arguments \texttt{lower} and \texttt{upper}. <>= visualize(EMGlist, x, y, lower= 963, upper = 973, fit = FALSE, fittedfunction = TRUE, fittedfunction.cut = TRUE, localnoise = TRUE, quantile = 0.5, cutoff.functions = 3) @ <>= visualize(EMGlist, x, y, lower= 1502, upper = 1510, fit = FALSE, fittedfunction = TRUE, fittedfunction.cut = TRUE, localnoise = TRUE, quantile = 0.5, cutoff.functions = 2) @ In the $m/z$ range $[963,973]$ a charge-1 peak overlaps with a more intense charge two peak. A further overlap occurs in the interval $[1502,1510]$, and it is correctly resolved.\\ An even more challenging problem, in which it is already difficult to unravel the overlap by visual inspection, is displayed in the following plot. <>= visualize(EMGlist, x, y, lower= 1360, upper = 1364, fit = FALSE, fittedfunction = TRUE, fittedfunction.cut = TRUE, localnoise = TRUE, quantile = 0.5, cutoff.functions = 2) @ \section{Extension to process LC-MS runs}\label{sec:lcms} In the preceding sections, it has been demonstrated how \texttt{IPPD} can be used to process single spectrums. For LC-MS, multiple spectra, one for a sequence of retention times $\{ t_l \}_{l=1}^L$, have to be processed. In this context, a single spectrum is referred to as scan. The resulting data can be displayed as in Figure \ref{fig:lcmsdata} by plotting intensities over the plane defined by retention times and $m/z$-values. \texttt{IPPD} offers basic functionality to process this kind of data. Support for \texttt{mzXML} format as well as an implementation of the sweep line scheme as suggested in \citet{SchulzTrieglaff2008} is provided, which is briefly demonstrated in the sequel. % clear all data that are currently in the workspace <>= rm(list = ls()) @ \begin{center} \begin{figure} \includegraphics[height = 0.35\textheight]{lcms} \caption{Graphical display of the sample \texttt{mzXML} file used in the code.}\label{fig:lcmsdata} \end{figure} \end{center} % load mzXML file % directory <- system.file("data", package = "IPPD") %filename <- file.path(directory, "CytoC_1860-2200_500-600.mzXML") <>= directory <- system.file("data", package = "IPPD") download.file("http://www.ml.uni-saarland.de/code/IPPD/CytoC_1860-2200_500-600.mzXML", destfile = paste(directory, "/samplefile", sep = ""), quiet = TRUE) data <- read.mzXML(paste(directory, "/samplefile", sep = "")) @ The sweep line scheme aggregates the peaklists of multiple scans by looking for blocks of consecutive retention times at which there is signal at nearby $m/z$-positions. The output is a quadruple consisting of a retention time interval, a $m/z$-position, a charge state and a quantification of the intensity. The intervals are found by sequentially processing the results of \texttt{getPeaklist}, where the results of each peaklist will lead to extensions of existing interval of preceding lists or to the creation of new intervals; intervals are closed once they have not been extended after processing more than \texttt{gap} additional peaklists, where \texttt{gap} is a parameter to be specified by the user. For more details, we refer to \citet{SchulzTrieglaff2008}. The function \texttt{analyzeLCMS} runs \texttt{getPeaklist} for each scan and then calls the function \texttt{sweepline}, which can as well be run independently from \texttt{analyzeLCMS} to aggregate the results. While there is a default setting, parameters can be changed by passing appropriate arguments. <>= processLCMS <- analyzeLCMS(data, arglist.getPeaklist = list(control.basis = list(charges = c(1,2,3))), arglist.threshold = list(threshold = 10), arglist.sweepline = list(minboxlength = 20)) boxes <- processLCMS$boxes @ <>= directory <- system.file("data", package = "IPPD") filename <- file.path(directory, "examples_boxes.txt") boxes <- as.matrix(read.table(filename, header = TRUE)) @ The output can be displayed as follows. The retention time intervals are given by the two columns \verb?rt_begin? and \verb?rt_end?, the corresponding $m/z$-positions are given by the column \verb?loc?. Quantitive information is contained in the column \verb?quant?. The output is visualized by means of a contour plot, where the contour lines depict intensities over the plane defined by $m/z$-positions and retention times. The intervals of the output are drawn as red lines. <>= print(boxes) rtlist <- lapply(data$scan, function(x) as.numeric(sub("([^0-9]*)([0-9|.]+)([^0-9]*)", "\\2", x$scanAttr))) rt <- unlist(rtlist) nscans <- length(rt) npoints <- length(data$scan[[1]]$mass) Y <- matrix(unlist(lapply(data$scan, function(x) x$peaks)), nrow = nscans, ncol = npoints, byrow = TRUE) contour(rt, data$scan[[1]]$mass, Y, xlab = "t", ylab = "mz", levels = 10^(seq(from = 5, to = 6.75, by = 0.25)), drawlabels = FALSE) for(i in 1:nrow(boxes)) lines(c(boxes[i,"rt_begin"], boxes[i,"rt_end"]), rep(boxes[i,"loc"], 2), col = "red") @ %In the absence of any kind of noise, the $\psi_{c,j,k}$ %would simply be delta functions at the locations $m_{c,j,k}$. %%% -> futher remarks here %In the %absence of noise, one would have %\begin{equation*} %\bm{y} = \bm{\Phi} \bm{\beta} = \sum_{j=1}^p \bm{\phi}_j \beta_j %\end{equation*} %only small subset of the \{ \beta_j \} are nonzero, all positive. %\bm{\Phi} = [\bm{\phi}_1 | \ldots | \bm{\phi}_p ] %\bm{\phi}_j = (\phi(x_1;\bm{\theta}_1),\ldots, \phi(x_n; \bm{\theta}_j))^{\T}, %$j = 1,\ldots,p$ %the set of $\phi_j$ dictionary %\phi = template function %\bm{\theta}_j is assigned charge state and location %\bm{\theta}_j = (c_j, m_j, \bm{\vartheta}_j) %where $\bm{\vartheta}_j$, possibly depending on $m_j$ is %a further parameter vector %each $\phi$ template can be written as a finite sum of isotope contributions %\phi_j = \sum_k^K a_{jk} \psi_{jk}(\bm{\theta}_j) %the \psi_{jk} describe the (shape of) a single peak %use initial-peak-parametrization ? %$m_{0k} = $ %$m_{jk} = m_{0k} + \iota_{jk} \kappa/c_j$ %$m_{0k}$ most intense peak location within pattern %$\iota_{jk} \in \{ \}$ %$K$ number of isotopic contributions %$a_{jk}$ amplitudes/weights \vspace{-1cm} \bibliographystyle{plainnat} \bibliography{references} \end{document}