%\VignetteIndexEntry{mzR, Ramp, mzXML, mzData, mzML}
%\VignetteKeywords{mzXML, mzData, mzML, Ramp, mass spectrometry, proteomics, metabolomics}
%\VignettePackage{mzR}

\documentclass[10pt,a4paper]{article}
\usepackage[authoryear,round]{natbib}

\RequirePackage{amsfonts,amsmath,amstext,amssymb,amscd}
\usepackage{graphicx}
\usepackage{verbatim}
\usepackage{hyperref}
\usepackage{color}
\definecolor{darkblue}{rgb}{0.2,0.0,0.4}

%% \topmargin -1.5cm
%% \oddsidemargin -0cm   % read Lamport p.163
%% \evensidemargin -0cm  % same as oddsidemargin but for left-hand pages
%% \textwidth 17cm
%% \textheight 24.5cm
%% \parindent0em

\newcommand{\lib}[1]{{\mbox{\normalfont\textsf{#1}}}}
\newcommand{\file}[1]{{\mbox{\normalfont\textsf{'#1'}}}}
\newcommand{\R}{{\mbox{\normalfont\textsf{R}}}}
\newcommand{\Rfunction}[1]{{\mbox{\normalfont\texttt{#1}}}}
\newcommand{\Robject}[1]{{\mbox{\normalfont\texttt{#1}}}}
\newcommand{\Rpackage}[1]{{\mbox{\normalfont\textsf{#1}}}}
\newcommand{\Rclass}[1]{{\mbox{\normalfont\textit{#1}}}}
\newcommand{\code}[1]{{\mbox{\normalfont\texttt{#1}}}}

\newcommand{\email}[1]{\mbox{\href{mailto:#1}{\textcolor{darkblue}{\normalfont{#1}}}}}
\newcommand{\web}[2]{\mbox{\href{#2}{\textcolor{darkblue}{\normalfont{#1}}}}}

\SweaveOpts{keep.source=TRUE,eps=FALSE}

\begin{document}

\title{A Parser for \texttt{mzXML}, \texttt{mzData} and \texttt{mzML} files}

\author{Bernd Fischer\footnote{\email{bernd.fischer@embl.de}} \\
  Steffen Neumann\footnote{\email{sneumann@ipb-halle.de}} \\
  Laurent Gatto\footnote{\email{lg390@cam.ac.uk}}}

\maketitle

\tableofcontents

\section{Introduction}

The \Rpackage{mzR} \citep{Chambers2012} package aims at providing a common interface to
several mass spectrometry data formats, namely \texttt{mzData}
\citep{Orchard2007}, \texttt{mzXML} \citep{Pedrioli2004} and the
latest \texttt{mzML} \citep{Martens2010}, somewhat similar to the
Bioconductor package affyio for affymetrix raw data. No processing is
done in mzR, which is left to packages such as 
\Rpackage{XCMS}\footnote{\url{http://www.bioconductor.org/packages/release/bioc/html/xcms.html}} 
or \Rpackage{MSnbase}\footnote{\url{http://www.bioconductor.org/packages/release/bioc/html/MSnbase.html}}.

\bigskip

Most importantly, access to the data should be fast and memory efficient. This is made possible by 
allowing random file access, i.e. retrieving specific data of interest without having to sequentially browser 
the full content. 

The actual work of reading and parsing the data files is handled by
the included C/C++ libraries or ``backends''. The RAMP parser, written at the
Institute for Systems Biology (ISB) is a fast and lightweight parser
in pure C. Later, it gained support for the \texttt{mzData} format.

The C++ reference implementation for the \texttt{mzML} is the proteowizard
library \citep{Kessner08} (pwiz in short), which in turn makes use of the boost C++ 
(\url{http://www.boost.org/}) library. 
RAMP is able to access \texttt{mzML} files by calling pwiz methods.

\bigskip

The \Rpackage{mzR} package is in essence a collection of wrappers to the C++
code, and benefits from the C++ interface provided through the Rcpp
package \citep{Rcpp11}. 
%% Let's not mention this yet, as only one backend is available.
%% Where possible \Rpackage{mzR} selects the appropriate backend for the 
%% requested file type. If several backends provide access, the user can 
%% specify a backend different than the default.

\section{Mass spectrometry raw data}

All the mass spectrometry file formats are organized similarly,
where a set of metadata nodes about the run is followed by a list of
spectra with the actual masses and intensities. In addition, each of
these spectra has its own set of metadata, such as the retention time
and acquisition parameters.

\subsection{Spectral data access}

Access to the spectral data is done via the peaks function. 
The return value is a list of two-column mass-to-charge and intensity 
matrices or a single matrix if one spectrum is queried.

\subsection{Metadata access}

\paragraph{Run metadata} is available via several functions such as
\texttt{instrumentInfo()} or \texttt{runInfo()}. The individual fields can be accessed
via e.g. detector() etc.

\paragraph{Spectrum metadata} is available via \texttt{header()}, which will
return a list (for single scans) or a dataframe with information such as the 
\texttt{basePeakMZ}, \texttt{peaksCount}, \ldots or, for higher-order MS 
the \texttt{msLevel} and precursor information.

\bigskip

The availability of this metadata can not always be guaranteed, and
depends on the MS software which converted the data.

\section{Example}

A short example sequence to read data from a mass spectrometer. 
First open the file.

<<open the file>>=
library(mzR)
library(msdata)

mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML", 
                     package = "msdata")
aa <- openMSfile(mzxml)
@ 

We can obtain different kind of header information.
<<get header information>>=
runInfo(aa)
instrumentInfo(aa)
header(aa,1)
@ 

Read a single spectrum from the file.
<<plotspectrum, fig = TRUE, eps = FALSE>>=
pl <- peaks(aa,10)
peaksCount(aa,10)
head(pl)
plot(pl[,1], pl[,2], type="h", lwd=1)
@ 

One should always close the file when not needed any more. This will release the memory of cached content.
<<close the file>>=
close(aa)
@ 

\section{Future plans}

Right on the heels of the initial RAMP backend release, 
supporting little metadata beyond e.g. the instrument model, 
is support to the full pwiz functionality to access the full metadata 
stored in an \texttt{mzML} files, 
or the chromatograms (which store e.g. MRM data). Other file formats
supported by pwiz, such as \texttt{mzIdentML} for protein identification
results are also possible in the future. 

\section{Session information}\label{sec:sessionInfo} 

<<label=sessioninfo,results=tex,echo=FALSE,cache=FALSE>>=
  toLatex(sessionInfo())
@ 


\bibliographystyle{plainnat}
\bibliography{mzR}

\end{document}