%\VignetteIndexEntry{mzR, Ramp, mzXML, mzData, mzML} %\VignetteKeywords{mzXML, mzData, mzML, Ramp, mass spectrometry, proteomics, metabolomics} %\VignettePackage{mzR} \documentclass[10pt,a4paper]{article} \usepackage[authoryear,round]{natbib} \RequirePackage{amsfonts,amsmath,amstext,amssymb,amscd} \usepackage{graphicx} \usepackage{verbatim} \usepackage{hyperref} \usepackage{color} \definecolor{darkblue}{rgb}{0.2,0.0,0.4} %% \topmargin -1.5cm %% \oddsidemargin -0cm % read Lamport p.163 %% \evensidemargin -0cm % same as oddsidemargin but for left-hand pages %% \textwidth 17cm %% \textheight 24.5cm %% \parindent0em \newcommand{\lib}[1]{{\mbox{\normalfont\textsf{#1}}}} \newcommand{\file}[1]{{\mbox{\normalfont\textsf{'#1'}}}} \newcommand{\R}{{\mbox{\normalfont\textsf{R}}}} \newcommand{\Rfunction}[1]{{\mbox{\normalfont\texttt{#1}}}} \newcommand{\Robject}[1]{{\mbox{\normalfont\texttt{#1}}}} \newcommand{\Rpackage}[1]{{\mbox{\normalfont\textsf{#1}}}} \newcommand{\Rclass}[1]{{\mbox{\normalfont\textit{#1}}}} \newcommand{\code}[1]{{\mbox{\normalfont\texttt{#1}}}} \newcommand{\email}[1]{\mbox{\href{mailto:#1}{\textcolor{darkblue}{\normalfont{#1}}}}} \newcommand{\web}[2]{\mbox{\href{#2}{\textcolor{darkblue}{\normalfont{#1}}}}} \SweaveOpts{keep.source=TRUE,eps=FALSE} \begin{document} \title{A Parser for \texttt{mzXML}, \texttt{mzData} and \texttt{mzML} files} \author{Bernd Fischer\footnote{\email{bernd.fischer@embl.de}} \\ Steffen Neumann\footnote{\email{sneumann@ipb-halle.de}} \\ Laurent Gatto\footnote{\email{lg390@cam.ac.uk}}} \maketitle \tableofcontents \section{Introduction} The \Rpackage{mzR} \citep{Chambers2012} package aims at providing a common interface to several mass spectrometry data formats, namely \texttt{mzData} \citep{Orchard2007}, \texttt{mzXML} \citep{Pedrioli2004} and the latest \texttt{mzML} \citep{Martens2010}, somewhat similar to the Bioconductor package affyio for affymetrix raw data. No processing is done in mzR, which is left to packages such as \Rpackage{XCMS}\footnote{\url{http://www.bioconductor.org/packages/release/bioc/html/xcms.html}} or \Rpackage{MSnbase}\footnote{\url{http://www.bioconductor.org/packages/release/bioc/html/MSnbase.html}}. \bigskip Most importantly, access to the data should be fast and memory efficient. This is made possible by allowing random file access, i.e. retrieving specific data of interest without having to sequentially browser the full content. The actual work of reading and parsing the data files is handled by the included C/C++ libraries or ``backends''. The RAMP parser, written at the Institute for Systems Biology (ISB) is a fast and lightweight parser in pure C. Later, it gained support for the \texttt{mzData} format. The C++ reference implementation for the \texttt{mzML} is the proteowizard library \citep{Kessner08} (pwiz in short), which in turn makes use of the boost C++ (\url{http://www.boost.org/}) library. RAMP is able to access \texttt{mzML} files by calling pwiz methods. \bigskip The \Rpackage{mzR} package is in essence a collection of wrappers to the C++ code, and benefits from the C++ interface provided through the Rcpp package \citep{Rcpp11}. %% Let's not mention this yet, as only one backend is available. %% Where possible \Rpackage{mzR} selects the appropriate backend for the %% requested file type. If several backends provide access, the user can %% specify a backend different than the default. \section{Mass spectrometry raw data} All the mass spectrometry file formats are organized similarly, where a set of metadata nodes about the run is followed by a list of spectra with the actual masses and intensities. In addition, each of these spectra has its own set of metadata, such as the retention time and acquisition parameters. \subsection{Spectral data access} Access to the spectral data is done via the peaks function. The return value is a list of two-column mass-to-charge and intensity matrices or a single matrix if one spectrum is queried. \subsection{Metadata access} \paragraph{Run metadata} is available via several functions such as \texttt{instrumentInfo()} or \texttt{runInfo()}. The individual fields can be accessed via e.g. detector() etc. \paragraph{Spectrum metadata} is available via \texttt{header()}, which will return a list (for single scans) or a dataframe with information such as the \texttt{basePeakMZ}, \texttt{peaksCount}, \ldots or, for higher-order MS the \texttt{msLevel} and precursor information. \bigskip The availability of this metadata can not always be guaranteed, and depends on the MS software which converted the data. \section{Example} A short example sequence to read data from a mass spectrometer. First open the file. <>= library(mzR) library(msdata) mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML", package = "msdata") aa <- openMSfile(mzxml) @ We can obtain different kind of header information. <>= runInfo(aa) instrumentInfo(aa) header(aa,1) @ Read a single spectrum from the file. <>= pl <- peaks(aa,10) peaksCount(aa,10) head(pl) plot(pl[,1], pl[,2], type="h", lwd=1) @ One should always close the file when not needed any more. This will release the memory of cached content. <>= close(aa) @ \section{Future plans} Right on the heels of the initial RAMP backend release, supporting little metadata beyond e.g. the instrument model, is support to the full pwiz functionality to access the full metadata stored in an \texttt{mzML} files, or the chromatograms (which store e.g. MRM data). Other file formats supported by pwiz, such as \texttt{mzIdentML} for protein identification results are also possible in the future. \section{Session information}\label{sec:sessionInfo} <>= toLatex(sessionInfo()) @ \bibliographystyle{plainnat} \bibliography{mzR} \end{document}