%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Combining HDMSe/MSe data using 'synapter' to optimise identification and quantitation}
%\VignetteKeywords{Mass Spectrometry, Proteomics, Bioinformatics, quantitative, Ion mobility, label-free}
%\VignettePackage{synapter}

\documentclass[12pt, oneside]{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\bioctitle[The \Biocpkg{synapter} package]{Optimising identification
  and quantitation by combining data using the \Biocpkg{synapter}
  package}


\author{Laurent Gatto%
  \footnote{\email{lg390@cam.ac.uk}}~,
    Pavel V. Shliaha and 
    Sebastian Gibb
}


\begin{document}


\maketitle

%% Abstract and keywords %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vskip 0.3in minus 0.1in
\hrule
\begin{abstract}
  This vignette describes the functionality implemented in the
  \Biocpkg{synapter} package. It allows to re-analyse label-free
  quantitative proteomics data obtained on a Synapt G2 instrument to
  optimise quantitation and identification. Several combination
  strategies are possible and described. Typically, a user can combine
  identification-optimised data (HDMS$^E$ data using ion mobility
  separation) and quantitation-optimised data (MS$^E$
  data). Additionally, a method to combine several data files into a
  \emph{master} set while controlling the false discovery rate, is
  presented.
\end{abstract}

\textit{Keywords}: Mass Spectrometry (MS), proteomics, bioinformatics,
IMS, ion mobility separation ion mobility, Synapt, label free, data
independent acquisition.

\vskip 0.1in minus 0.05in
\hrule
\vskip 0.2in minus 0.1in
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newpage

\tableofcontents

\newpage

<<knitr, include=FALSE, cache=FALSE>>=
library(knitr)
opts_chunk$set(fig.align = 'center',
               fig.show = 'hold',
               stop_on_error = 1L,
               par = TRUE,
               prompt = TRUE,
               comment = NA)
options(replace.assign = TRUE,
        width = 70)
@

<<setup, echo=FALSE>>=
suppressPackageStartupMessages(library(qvalue))
library(xtable)
suppressPackageStartupMessages(library(synapter))
library(synapterdata)
## preparing some data
data(ups25b) ## from synapterdata
res <- ups25b ## for synergise and plotRt
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{sec:intro}

\subsection{Background}\label{sec:bkg}

The main functionality of \Biocpkg{synapter} is to combine proteomics
data acquired under different mass spectrometry settings or with
different samples to (i) optimise the respective qualities of the two
data sets or (ii) increase the number of identifications, thereby
decreasing missing values. Besides synapter offers other functionality
inaccessible in the default pipeline, like peptide FDR estimation and
filtering on peptide match type and peptide uniqueness.

The example that motivated the development of this package was to
combine data obtained on a Synapt G2 instrument:

\begin{enumerate}
\item HDMS$^E$ data, acquired with additional peptide separation using
  an ion mobility cell, thus leading to better (both in number and in
  quality) identification and
\item standard MS$^E$ data (acquired without ion mobility separation),
  providing better data quantitation.
\end{enumerate}

The former is data is called \emph{identification peptides} and the
latter \emph{quantitation peptides}, irrespective of the acquisition
mode (HDMS$^E$ or MS$^E$). This HDMS$^E$/MS$^E$ design is used in this
document to illustrate the \Biocpkg{synapter} package.

However, although HDMS$^E$ mode possesses superior identification and
MS$^E$ mode superior quantitation capabilities and transferring
identifications from HDMS$^E$ to MS$^E$ is a priori the most efficient
setup, identifications can be transferred between any runs,
independently of the acquisition mode. This allows to reduce the
number of missing values, one of the primary limitation of label-free
proteomics. Thus users will benefit from \Biocpkg{synapter}'s
functionality even if they run their instruments in a single mode
(HDMS$^E$ or MS$^E$ only).

However, as will be shown in section \ref{sec:analysis}, transferring
identifications from multiple runs to each other increases analysis
time and peptide FDR within the analysis. \Biocpkg{synapter} allows
to minimise these effects to acceptable degree by choosing runs to
transfer identifications from and merging them in the \emph{master}
HDMS$^E$ file.

This data processing methodology is described in section
\ref{sec:synapter} and the analysis pipeline is described in section
\ref{sec:pipelines}.

To maximise the benefit of combining better identification and
quantitation data, it is also possible to combine several, previously
merged identification data files into one \emph{master} set. This
functionality is described in section \ref{sec:master}.

Finally, section \ref{sec:msnbase} illustrates a complete pipeline
including \Biocpkg{synapter} and \Biocpkg{MSnbase} \cite{Gatto2012}
packages to perform protein label-free quantitation: how to combine
multiple \Biocpkg{synapter} results to represent the complete
experimental design under study and further explore the data,
normalise it and perform robust statistical data analysis inside the
\R environment.

\bigskip

The rationale underlying \Biocpkg{synapter}'s functionality are
described in \cite{Shliaha2013} and \cite{Bond2013}.  The first
reference describes the benefits of ion mobility separation on
identification and the effects on quantitation, that led to the
development of \Biocpkg{synapter}, which in described and
demonstrated in \cite{Bond2013}.

\bigskip

\Biocpkg{synapter} is written for \R~\cite{R}, an open source, cross
platform, freely available statistical computing environment and
programming
language\footnote{\url{http://www.r-project.org/}}. Functionality
available in the \R environment can be extended though the usage of
packages. Thousands of developers have contributed packages that are
distributed via the Comprehensive R Archive Network (CRAN) or through
specific initiatives like the
Bioconductor\footnote{\url{http://www.bioconductor.org/}} project
\cite{Gentleman2004}, focusing on the analysis and comprehension of
high-throughput biological data.

\Biocpkg{synapter} is such an \R package dedicated to the analysis of
label-free proteomics data. To obtain detailed information about any
function in the package, it is possible to access it's documentation
by preceding it's name with a question mark at the command line
prompt. For example, to obtain information about the
\Biocpkg{synapter} package, one would type \Rfunction{?synapter}.

\subsection{Installation}\label{sec:install}

\Biocpkg{synapter} is available through the Bioconductor
project. Details about the package and the installation procedure can
be found on its
page\footnote{\url{http://bioconductor.org/packages/devel/bioc/html/synapter.html}}. Briefly,
installation of the package and all its dependencies should be done
using the dedicated Bioconductor infrastructure as shown below:

<<install, echo=TRUE, eval=FALSE>>=
source("http://bioconductor.org/biocLite.R")
## or, if you have already used the above before
library("BiocInstaller")
## and to install the package
biocLite("synapter")
@

After installation, \Biocpkg{synapter} will have to be explicitly loaded with

<<library, echo=TRUE, eval=FALSE>>=
library(synapter)
@

so that all the package's functionality is available to the user.


\subsection{Getting help}\label{sec:help}

Question should be asked on the Bioconductor support
forum\footnote{\url{https://support.bioconductor.org}}. Bugs and
feature requests can also be reported on the github tracker
\footnote{\url{https://github.com/lgatto/synapter/issues}}.

% There is a general mailing
% list\footnote{\url{https://stat.ethz.ch/mailman/listinfo/bioconductor}}
% for Bioconductor packages. There is also a on-line
% form\footnote{\url{http://bioconductor.org/help/mailing-list/mailform/}}
% if you do not wish to subscribe to the list. You are also welcome to
% contact Laurent Gatto (\url{lg390@cam.ac.uk}) for general questions,
% bugs, comments and suggestions.

\Biocpkg{synapter} is an open source initiative and contributions,
whether new code, documentation bug fixes and new use cases are much
appreciated. The official source code is available on the Bioconductor
svn
server\footnote{\url{http://bioconductor.org/developers/source-control/}}. A
testing version and easily fork-able source tree is available on
github\footnote{\url{https://github.com/lgatto/synapter/}}.
% which also allows to report issues (bugs, feature requests, \ldots).

\clearpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Data analysis using \Biocpkg{synapter}}\label{sec:analysis}

\subsection{Preparing the input data}\label{sec:input}

Preparation of the data for \Biocpkg{synapter} requires the
\texttt{.raw} data first to be processed with Waters' ProteinLynx
Global Serve (PLGS) software. The PLGS result is then exported as
\texttt{csv} spreadsheet files in user specified folders. These
\texttt{csv} files can then be used as input for \Biocpkg{synapter}.

We also highly recommend users to acquaint themselves with the PLGS
search algorithm for data independent acquisitions \cite{PLGS2009}.

First the user has to specify the output folders for files to be used
in synapter analysis as demonstrated in figure \ref{fig:plgs1}. After
the folders are specified ignore the message that appears requiring
restarting PLGS.

\begin{figure}[!hbt]
  \begin{center}
    \includegraphics[width=0.3\linewidth]{./Figures/PLGS_output_files1.jpg}
    \includegraphics[width=0.3\linewidth]{./Figures/PLGS_output_files2.jpg}
    \includegraphics[width=0.3\linewidth]{./Figures/PLGS_output_files3.jpg}
    \caption{Specifying PLGS output folders. The last message can be ignored.}
    \label{fig:plgs1}
  \end{center}
\end{figure}

At the first stage PLGS performs noise reduction and centroiding,
based on user specified preferences called \textit{processing
  parameters}. These preferences determine thresholds in intensity for
discriminating between noise peaks and peptide and fragment ion peaks
in high and low energy functions of an acquisition. The optimal value
of these parameters is sample dependant and different between MS$^E$
and HDMS$^E$ modes. For \Biocpkg{synapter} to function properly all
acquisitions in the analysis have to be processed with the same
thresholds, optimal for the mode identifications are transferred from
(typically HDMS$^E$ mode). The user is expected to identify optimal
parameters himself for every new sample type by repeatedly analysing a
representative acquisition with different thresholds.

After the ions peaks have been determined and centroided, the ions
representing charge states and isotopes of a peptide are collapsed
into a single entity called EMRT (exact mass retention time pair). The
EMRTs in low energy function represent unidentified peptides and are
assigned peptides sequences during database search. The total list of
EMRTs can be found in the \texttt{pep3DAMRT.csv} file and it is one of
the \Biocpkg{synapter} input files for the runs used for quantitation
(typically MS$^E$ mode)

Prior to the database search, randomised entries are added to the
database to allow PLGS to compute \textit{protein false positive
  rate}. The randomised entries can either be added automatically or
manually, using the \textit{Randomise Databank} function in the
\textit{Databank admin tool}. To properly prepare the files for
\Biocpkg{synapter}, the user has to add randomised entries manually
via \textit{Databank admin tool}, since only then randomised entries
identified in the database search will be displayed in the
\textit{csv} output. Figure \ref{fig:plgs2} demonstrates how to create
a randomised databank manually using one randomised entry per regular
entry.

\begin{figure}[!hbt]
  \begin{center}
    \includegraphics[width=0.3\linewidth]{./Figures/PLGS_create_databank1.jpg}
    \includegraphics[width=0.3\linewidth]{./Figures/PLGS_create_databank2.jpg}
    \caption{Databank creation in PLGS.}
    \label{fig:plgs2}
  \end{center}
\end{figure}

\begin{figure}[!hbt]
  \begin{center}
    \includegraphics[width=0.3\linewidth]{./Figures/PLGS_db_search_options.jpg}
    \caption{Databank search options.}
    \label{fig:plgs3}
  \end{center}
\end{figure}

The user is also expected to use a minimum of 1 fragment per peptide,
3 fragments per protein and 1 peptide per protein identification
thresholds and 100\% \textit{False Positive Rate}\footnote{This is
  erroneously termed false positive rate in the software and
  manuscript and should be considered a false discovery rate.} for
protein identification during database search for all of the
acquisitions in the analysis as demonstrated in figure
\ref{fig:plgs3}. This allows to maximise the number of identified
peptides from the randomised part of the database, needed to estimate
peptide identifications statistics. The total list of identified
peptides is given in \texttt{final\_peptide.csv} files. A single
\texttt{final\_peptide.csv} file has to be supplied to
\Biocpkg{synapter} for every run in the analysis (for both
identification and quantitation runs).

More details and screenshots are available in a separate document
available at \url{http://lgatto.github.com/synapter/}.

\subsection{HDMS$^E$/MS$^E$ data analysis}\label{sec:synapter}

The analysis of pairs of HDMS$^E$ and MS$^E$ data files is based on
the following rationale -- combine strengths or each approach by
matching high quality HDMS$^E$ identifications to quantified MS$^E$
EMRTs applying the following algorithm:

\begin{enumerate}
\item Apply various peptide filters to HDMS$^E$ and MS$^E$ peptides to
  obtain two sets of reliably identified unique proteotypic peptides.
\item Use shared HDMS$^E$ and MS$^E$ peptides to model the deviations
  in retention time between the two mass spectrometer runs.
\item Optimise the parameters that will be used to optimally match all
  HDMS$^E$ peptides and quantified MS$^E$ EMRTs using a grid search.
\item Using the best parameters, match identified HDMS$^E$ peptides to
  quantified MS$^E$ EMRTs.
\end{enumerate}

\subsection{Different pipelines}\label{sec:pipelines}

Three different pipeline are available to the user:

\subsubsection{Graphical user interface}

\textbf{The graphical interface is deprecated and will be removed in
  future versions. We encourage users to utilise the
  \Rfunction{synergise} instead.}

A simple graphical interface (GUI) can be used to preform a complete
data analysis. This pipeline is the most accessible for users that do
not feel comfortable with command line interfaces (see below) and/or
for a limited number of analysis to be run manually.

The GUI is a graphical layer between the user and the
\Rfunction{synergise} function that will be described later. For more
details on the underlying data processing and parameters that can be
customised, read \Rfunction{?synergise}.

\begin{figure}[!hbt]
  \begin{center}
    \includegraphics[width=0.3\linewidth]{./Figures/GUI-Tab1.jpg}
    \includegraphics[width=0.3\linewidth]{./Figures/GUI-Tab2.jpg}
    \includegraphics[width=0.3\linewidth]{./Figures/GUI-Tab3.jpg}
    \caption{Screenshots of the 3 input tabs of the
      \Rfunction{synapterGUI} function. From left to right: (1) the
      data input tab, (2) the data filtering and retention time
      modelling tab and (3) the grid search tab.}
    \label{fig:gui}
  \end{center}
\end{figure}

The graphical interface is shown on figure \ref{fig:gui} and is
started by calling the \Rfunction{synapterGUI()} function. The
interface is composed of three tabs, that allow data input and
analyses parameters customisation. The \Rfunction{synapterGUI}
function itself takes one input parameter, \Robject{n}, that defines
the number of identification/quantitation sets of files that the user
wants to analyse. Each set is composed of one identification final
peptide file (typically HDMS$^E$), one quantitation final peptide file
(typically MS$^E$), one quantitation Pep3D file (also MS$^E$) and one
output directory. In addition, the user also needs to specify one
single fasta file that will be used to filter proteotypic peptide. To
perform 3 analysis, as illustrated on figure \ref{fig:gui}, the
function would be executed like \Rfunction{synapterGUI(n = 3)} or
simply \Rfunction{synapterGUI(3)}.

\begin{description}
\item[Data input] The first tab uses a tree structure to represent the
  input sets to be analysed. The unique fasta file is located at the
  very top of the hierarchy and each subsequent node (file set) can be
  opened and populated. Files are added by selecting the respective
  node, clicking the \texttt{Add} button and selecting the
  corresponding file using the file selection dialogue that opens. The
  file names then appear as new nodes (see for example the
  identification file in set 1) and can be removed with the
  \texttt{Remove} button.

  The \texttt{Master} box needs to be checked if the identification
  inputs are \emph{master} files (see section \ref{sec:master}).

\item[Filtering and modelling] The second tab allows to specify
  peptide filtering and retention time modelling parameters. Modelling
  accounts for systematic deviation in retention time for peptides
  between mass spectrometry runs (figure \ref{fig:rtmodel}) by fitting
  a curve through deviation of retention time vs retention time plot.

\begin{figure}[!hbt]
\centering
<<plotRt, dev='pdf', fig.width=5.5, fig.height=5>>=
plotRt(ups25b, what = "model", nsd = 1)
@
\caption{Figure illustrating retention time modelling between two runs
  in \Biocpkg{synapter}, as generated by the \Rfunction{plotRt}
  function.}
\label{fig:rtmodel}
\end{figure}

For a detailed description of the parameters and the processing
pipeline, see the documentation of the \Rfunction{synergise} function.

\item[Grid search] Matching peptides to quantified EMRTs is done in
  the two dimensional retention time vs. precursor mass space. The
  optimal tolerances in both dimensions are estimated by a grid search
  that uses common identification/quantitation peptides. The size of
  the grid, i.e. the range of the retention time (\texttt{nsd}, number
  of standard deviations in the retention time model) and mass
  tolerance (\texttt{ppm}) to be searched can be defined here. In
  addition, it is possible to select a subset of the data to reduce
  search time.  For a detailed description of the parameters, see the
  documentation of the \Rfunction{synergise} function.
\end{description}

Once all the input has been specified, pressing the \texttt{Run}
button in the lower left corner of the GUI starts the
\Biocpkg{synapter} run: all \Robject{n} analyses are executed one
after each other and a complete report in \texttt{html} as well as
several result files are created in the respective output folders.

\subsubsection{Wrapper function}

The \Rfunction{synergise} function is a high level wrapper that
implements a suggested analysis to combine two files (see next
paragraph for details). A set of parameters can be passed, although
sensible defaults are provided. While the analysis is executed, a html
report is created, including all result files in text spreadsheet
(\texttt{csv} format) and binary \R output. This level allows easy
scripting for automated batch analysis. Using data from the
\Biocexptpkg{synapterdata} package, the following code chunk illustrates
the \Rfunction{synergise} usage. An example report can be found online
at \url{http://lgatto.github.com/synapter/}.

%% <<rndseed,echo=FALSE,eval=TRUE>>=
%%   set.seed(123)
%% @

<<prepare-synergise, echo=TRUE, eval=TRUE, tidy = FALSE>>=
library(synapterdata)
hdmsefile <- getHDMSeFinalPeptide()[2]
basename(hdmsefile)
msefile <- getMSeFinalPeptide()[2]
basename(msefile)
msepep3dfile <- getMSePep3D()[2]
basename(msepep3dfile)
fas <- getFasta()
basename(fas)
## the synergise input is a (named) list of filenames
input <- list(identpeptide = hdmsefile,
              quantpeptide = msefile,
              quantpep3d = msepep3dfile,
              fasta = fas)
## a report and result files will be stored
## in the 'output' directory
output <- tempdir()
output
@

%% Not evaluating this bit because it increases
%% the vignette build/check over the 5 minutes
%% Bioconductor limit (build is 5 min, check is 12 min)

<<synergise, echo=TRUE, eval=FALSE>>=
res <- synergise(filenames = input, outputdir = output)
@

%% res is loaded in the
<<performance, echo=TRUE>>=
performance(res)
@

See \Rfunction{?synergise} for details.

\subsubsection{Detailed step-by-step analysis}

The user can have detailed control on each step of the analysis by
executing each low-level function manually. This pipeline, including
generation of data containers (class instances) and all available
operations are available in \Rfunction{?Synapter}. This strategy
allows the maximum flexibility to develop new unexplored approaches.

\subsection{Using \emph{master} peptide files}\label{sec:master}

While analysing one MS$^E$ file against one single HDMS$^E$ file
increased the total number of reliably identified and quantified
features compared to each single MS$^E$ analysis, a better procedure
can be applied when replicates are available. Consider the following
design with two pairs of files: HDMS$^E_1$, MS$^E_1$, HDMS$^E_2$ and
MS$^E_2$. The classical approach would lead to combining for example,
HDMS$^E_1$ and MS$^E_1$ and HDMS$^E_2$ and MS$^E_2$. However,
HDMS$^E_1$ -- MS$^E_2$ and HDMS$^E_2$ -- MS$^E_1$ would also be
suitable, possibly leading to new identified and quantified
features. Instead of repeating all possible combinations, which could
hardly be applied for more replicates, we allow to merge HDMS$^E_1$
and HDMS$^E_2$ into a new \emph{master} HDMS$^E_{12}$ and then using
it to transfer identification to both MS$^E$ runs. In addition to
leading to a simpler set of analyses, this approach also allows to
control the false positive rate during the HDMS$^E$ merging (see
section \ref{sec:whichhdmse}). Such \emph{master} HDMS$^E$ files can
be readily created with the \Rfunction{makeMaster} function, as
described in section \ref{sec:makeMaster}.

We will use data from the \Biocexptpkg{synapterdata} to illustrate how to
create \emph{master} files.


\subsubsection{Choosing which HDMS$^E$ files to combine}
\label{sec:whichhdmse}

In a more complex design, a greater number of HDMS$^E$ files might
need to be combined. When combining files, one also accumulates false
peptides assignments. The extent to which combining files increases
new reliable identification at the cost of accumulating false
assignments can be estimated with the \Rfunction{estimateMasterFdr}
function.

To illustrate how FDR is estimated for \emph{master} HDMS$^E$ files,
let's consider two extreme cases.

\begin{itemize}

\item In the first one, the two files (each with $1000$ peptides
  filtered at an FDR of $0.01$) to be combined are nearly identical,
  sharing $900$ peptides. The combined data will have $900 (shared) +
  2 \times 100 (unique)$ peptides and each file, taken separately is
  estimated to have $1000 \times 0.01 = 10$ false positive
  identifications. We thus estimate the upper FDR bound after merging
  the two files to be $\frac{20}{1100} = 0.0182$.

\item In the second hypothetical case, the two files (again each with
  $1000$ peptides filtered at a FDR of $0.01$) to be combined are very
  different and share only $100$ peptides. The combined data will have
  $100 (shared) + 2 \times 900 (unique)$ peptides and, as above, each
  file is estimated to have $10$ false discoveries. In this case, we
  obtain an upper FDR bound of $\frac{20}{1900} = 0.0105$.

\end{itemize}

In general, the final false discovery for two files will be
$$FDR_{master} = \frac{nfd_{1} + nfd_{2}}{union(peptides~HDMS^E_{1}, peptides~HDMS^E_{2})}$$

where $nfd_{i}$ is the number of false discoveries in HDMS$^E$ file
$i$. Note that we do not make any assumptions about repeated
identification in multiple files here.

\Rfunction{estimateMasterFdr} generalised this for any number of
HDMS$^E$ files and indicates the best combination at a fixed
user-specified \Robject{masterFdr} level. Mandatory input is a list of
HDMS$^E$ file names and a fasta database file name to filter
non-unique proteotypic peptides.

<<inputfiles, echo=FALSE, eval=TRUE>>=
inputfiles <- getHDMSeFinalPeptide()
@

The result of \Rfunction{estimateMasterFdr} stores the number of
unique proteotypic peptides and FDR for all possible
\Sexpr{sum(choose(length(inputfiles),2:length(inputfiles)))}
combinations of \Sexpr{length(inputfiles)} files. A summary can be
printed on the console or plotted with \Rfunction{plot(cmb)} (see
figure \ref{fig:master}).

<<masterFdr, echo=TRUE, eval=TRUE, cache=TRUE, tidy=FALSE>>=
## using the full set of 6 HDMSe files and a
## fasta database from the synapterdata package
inputfiles <- getHDMSeFinalPeptide()
fasta <- getFasta()
cmb <- estimateMasterFdr(inputfiles, fasta, masterFdr = 0.02,
                         verbose = FALSE)
cmb
@

\begin{figure}[!hbt]
\centering
<<estimateFdrPlot, dev='pdf', fig.width=5, fig.height=5>>=
plot(cmb)
@
\caption{Figure illustrating the relation between the number of unique
  peptides in the combined HDMS$^E$ file and the resulting false
  discovery rate. The symbols on the figure represent the number of
  files for that particular combination. The dotted line is the user
  defined threshold for the combined FDR (\Robject{masterFdr}
  parameter). The best combination, i.e the one that maximises the
  number of unique peptides while keeping the FDR below
  \Robject{masterFdr} is highlighted in red.}
\label{fig:master}
\end{figure}

The best combination can be extracted with the \Rfunction{bestComb}
function.

<<bestComb, echo=TRUE, eval=TRUE>>=
bestComb(cmb)
@

See \Rfunction{?estimateMasterFdr} and references therein for more
details about the function and the returned object.

\subsubsection{Generating a \emph{master} file}
\label{sec:makeMaster}

Now that we have identified which files should be used to create the
master file, we can directly pass the relevant identification files to
the \Rfunction{makeMaster} function to generate the \emph{master}
file.  The function has one mandatory input parameter,
\Robject{pepfiles}, a list oh identification file names to be
merged. The output is an object of class \Robject{MasterPeptides} that
stores the relevant peptides from the original input files. The result
can be saved to disk using \Rfunction{saveRDS} for further analysis,
as described in section \ref{sec:synapter}.


<<master, echo=TRUE, eval=TRUE, cache=TRUE>>=
master <- makeMaster(inputfiles[bestComb(cmb)], verbose = FALSE)
master
@

More details can be found in the function documentation accessible
with \Rfunction{?makeMaster}.

\subsection{Summary}

Two functions are needed to choose a set of IR files and create the
\emph{master} IR. One function enables to perform a complete
identification transfer, eiter through a command line or graphical
interface. Table \ref{tab:summary} summarises all there is to know to
utilise \Biocpkg{synapter}'s functionality.

\begin{table}[!h]
\centering
\begin{tabular}{|l|l|}
  \hline
  Function                      & Description \\ \hline
  \Rfunction{synapterGUI()}     & Opens the graphical user interface \\
  \Rfunction{synergise}         & Runs the complete identification transfer \\
  \Rfunction{estimateMasterFdr} & Chooses which files to be used to create the \emph{master} IR \\
  \Rfunction{makeMaster}        & Creates the \emph{master} IR \\ \hline

\end{tabular}
\caption{The \Biocpkg{synapter} functions.}
\label{tab:summary}
\end{table}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Analysing complete experiments}\label{sec:msnbase}


The functionality described in this section relies on the
\Biocpkg{MSnbase} package \cite{Gatto2012}, which is installed by
default with \Biocpkg{synapter}. Please refer to the
\Biocpkg{MSnbase} Bioconductor web
page\footnote{\url{http://bioconductor.org/packages/release/bioc/html/MSnbase.html}},
the associated vignettes and the respective manual pages for more
details.

The \Biocexptpkg{synapterdata} already provides preprocessed PLGS
data. Six \Robject{Synapter} instances are available: 3 replicates
(labelled \texttt{a}, \texttt{b} and \texttt{c}) of the Universal
Proteomics Standard (UPS1) 48 protein mix at 25 fmol and 3 replicates
at 50 fmol, in a constant \textit{E. coli} background. The 6 files can
be loaded in your working space with

<<loadups, eval=TRUE, echo=TRUE>>=
data(ups25a, ups25b, ups25c, ups50a, ups50b, ups50c)
@

\subsection{Applying the \emph{Top 3} approach}\label{sec:top3}

We will start by describing the analysis of \Robject{ups25a} in
details, and then show how to analyse all the runs using more compact
code. The first step of our analysis is to convert the
\Biocpkg{synapter} object output (a \Robject{Synapter} instance),
into a \Biocpkg{MSnbase}-compatible object, called an
\Robject{MSnSet}, that we will name \Robject{ms25a}. We can obtain a
description of the \Robject{MSnSet} object by typing its name.

<<setas,echo=TRUE,eval=TRUE>>=
ms25a <- as(ups25a, "MSnSet")
class(ups25a)[1]
class(ms25a)[1]
ms25a
@

It contains quantitation information about \Sexpr{nrow(ms25a)}
peptides for \Sexpr{ncol(ms25a)} sample. In the code chunk below, we
update the default sample name \texttt{Synapter1} with a more
meaningful one.

<<updatesamplename>>=
sampleNames(ms25a)
sampleNames(ms25a) <- "ups25a"
sampleNames(ms25a)
@

Quantitative data and meta-data, which has been acquired by
\Biocpkg{synapter}, can be extracted with the \Rfunction{exprs} and
\Rfunction{fData} methods.

<<accessor>>=
tail(exprs(ms25a))
tail(fData(ms25a)[, c(2,9)])
## all fetaure metadata
fvarLabels(ms25a)
@

We will describe a how to process the data using a \textit{Top 3}
approach, where the 3 most intense peptides of each protein are used
to compute the protein intensity, using the \Rfunction{topN} and
\Rfunction{combineFeatures} methods. The former allows to extract the
top most intense peptides (default \texttt{n} is 3) and remove all
other peptides from the \Robject{MSnSet} object. The latter than
aggregates the \texttt{n} most intense peptides per protein using a
user-defined function (\Rfunction{sum}, below). Finally, we also scale
protein intensity values depending on the actual number of peptides
that have summed. This number of quantified peptides can be calculated
(after \Rfunction{topN}, but before \Rfunction{combineFeatures}) with
\Rfunction{nQuants}.

<<dotop3, echo=TRUE, cache=TRUE>>=
ms25a <- topN(ms25a, groupBy = fData(ms25a)$protein.Accession, n = 3)
nPeps <- nQuants(ms25a, groupBy = fData(ms25a)$protein.Accession)
ms25a <- combineFeatures(ms25a, fData(ms25a)$protein.Accession,
                         "sum", na.rm = TRUE, verbose = FALSE)
head(exprs(ms25a))
head(nPeps)
## scale intensities
exprs(ms25a) <- exprs(ms25a) * (3/nPeps)
## NaN result from the division by 0, when
## no peptide was found for that protein
head(exprs(ms25a))
@

\subsection{Batch processing}\label{sec:batchprocessing}

The code chunk below repeats the above processing for the other 5 UPS1/\textit{E. coli} runs.

<<batch, echo=TRUE, tidy = FALSE>>=
nms <- c(paste0("ups", 25, c("b", "c")),
         paste0("ups", 50, c("a", "b", "c")))
tmp <- sapply(nms, function(.ups) {
  cat("Processing", .ups, "... ")
  ## get the object from workspace and convert to MSnSet
  x <- get(.ups, envir = .GlobalEnv)
  x <- as(x, "MSnSet")
  sampleNames(x) <- .ups
  ## extract top 3 peptides
  x <- topN(x, groupBy = fData(x)$protein.Accession, n = 3)
  ## calculate the number of peptides that are available
  nPeps <- nQuants(x, fData(x)$protein.Accession)
  ## sum top3 peptides into protein quantitation
  x <- combineFeatures(x, fData(x)$protein.Accession,
                       "sum", na.rm = TRUE, verbose = FALSE)
  ## adjust protein intensity based on actual number of top peptides
  exprs(x) <- exprs(x) * (3/nPeps)
  ## adjust feature variable names for combine
  x <- updateFvarLabels(x, .ups)
  ## save the new MSnExp instance in the workspace
  varnm <- sub("ups", "ms", .ups)
  assign(varnm, x, envir = .GlobalEnv)
  cat("done\n")
})
@

We now have 6 \Robject{MSnSet} instances, containing protein
quantitation for the 6 UPS/\textit{E. coli} runs.

\subsection{Combining data and filtering}

We now want to filter data out based on missing quantitation data,
retaining proteins that have been quantified in at a least two out of
three replicates. Filtering based on missing data can be done using
the \Rfunction{filterNA} method and a maximum missing data content as
defined by \Robject{pNA}. Multiple \Robject{MSnSet} instances can be
combined with the \Rfunction{combine} method, which is described in
details in the \texttt{MSnbase-demo} vignette\footnote{The vignette is
  accessible from within \R with \Rfunction{vignette("MSnbase-demo",
    package = "MSnbase")}.}. The 6 objects have appropriate distinct
sample names and common feature (protein) names, which will be used to
properly combine the quantitation data.

<<combine25, echo=TRUE>>=
ms25 <- combine(ms25a, ms25b)
ms25 <- combine(ms25, ms25c)
dim(ms25)
ms25 <- filterNA(ms25, pNA = 1/3)
dim(ms25)
@

Once combined and filtered, the 25 fmol group retains
\Sexpr{nrow(ms25)} entries with at least 2 out of 3 quantitation
values, out of the \Sexpr{nrow(ms25a)} total number of proteins.

<<combine50, echo=TRUE>>=
ms50 <- combine(ms50a, ms50b)
ms50 <- combine(ms50, ms50c)
dim(ms50)
ms50 <- filterNA(ms50, pNA = 1/3)
dim(ms50)
@

Similarly, the 50 fmol group retains \Sexpr{nrow(ms50)} entries with
at least 2 out of 3 quantitation values, out of the
\Sexpr{nrow(ms50a)} initial proteins.

We now combine the two subgroups into one \Robject{MSnSet} object that
contains all 6 samples and filter for proteins that are observed in
both groups, i.e retaining proteins with a maximum of 2/6 missing
values. We also compute a summary table with the number of protein
that have 4, 5, or 6 quantitation values across the 6 samples.

<<msUps>>=
msUps <- combine(ms25, ms50)
msUps <- filterNA(msUps, pNA = 2/6)
head(exprs(msUps))
table(apply(exprs(msUps), 1, function(.x) sum(!is.na(.x))))
@

We obtain a final data set containing \Sexpr{nrow(msUps)}
proteins. Finally, we normalise protein intensities in each sample to
correct for experimental loading biases and pipetting errors. To do
so, we compute 6 sample medians using all constant \textit{E. coli}
background proteins and divide each protein by its respective sample
mean.

<<scale>>=
ecoli <- -grep("ups$", featureNames(msUps))
meds <- apply(exprs(msUps)[ecoli, ], 2, median, na.rm=TRUE)
exprs(msUps) <- t(apply(exprs(msUps), 1, "/", meds))
@

This same procedure could be applied with a set of constant spikes to
estimate absolute protein quantities.

%% Maybe a section with a PCA plot...

\subsection{Statistical analysis of differentially expressed proteins}

The UPS1 spiked-in protein mix is composed of 48 proteins,
\Sexpr{length(grep("ups", featureNames(msUps)))} of which have been
observed and quantified in our final data object. In this section, we
will illustrate how to analyse the \Sexpr{nrow(msUps)} proteins to
extract those that show differences between the two groups and show
that these candidates represent the UPS1 spikes.

The \R environment and many of the available packages allow extremely
powerful statistical analysis. In this document, we will apply a
standard t-test on $log_2$ transformed data for convenience, to
calculate p-value for individual proteins (\Robject{pv} variable
below). For best performance with small number of samples and more
complex designs, we recommend the Bioconductor \Biocpkg{limma}
package
\cite{limma}\footnote{\url{http://www.bioconductor.org/packages/release/bioc/html/limma.html}}. We
then perform multiple comparison adjustment using the
\Rfunction{qvalue} from the \Biocpkg{qvalue} package, that implements
the method from \cite{Storey2003} (\Robject{qv} variable below). The
\Biocpkg{multtest} package provides several other p-value adjustment
methods. We will also compute $log_2$ fold-changes and illustrate the
results on a volcano plot (figure \ref{fig:volcano}). Figure
\ref{fig:hmap} illustrates the UPS1 proteins and samples on a
classical heatmap.

%% inspect p-value histogram
%% non-specific filtering

<<stats, tidy = FALSE>>=
## use log2 data for t-test
exprs(msUps) <- log2(exprs(msUps))
## apply a t-test and extract the p-value
pv <- apply(exprs(msUps), 1 ,
            function(x) t.test(x[1:3], x[4:6])$p.value)
## calculate q-values
library(qvalue)
qv <- qvalue(pv)$qvalues
## calculate log2 fold-changes
lfc <- apply(exprs(msUps), 1 ,
             function(x) mean(x[1:3], na.rm=TRUE) - mean(x[4:6], na.rm=TRUE))
## create a summary table
res <- data.frame(cbind(exprs(msUps), pv, qv, lfc))
## reorder based on q-values
res <- res[order(res$qv), ]
head(round(res, 3))
@

In the above example, quantitation values and statistics data are
saved in a summary dataframe (\Robject{res}), that can be exported to
a comma-separated spreadsheet with

<<writecsv, eval=FALSE>>=
write.csv(res, file = "upsResults.csv")
@

\begin{figure}[!hbt]
\centering
<<volcano, dev='pdf', fig.width=5, fig.height=5, tidy=FALSE>>=
plot(res$lfc, -log10(res$qv),
     col = ifelse(grepl("ups$", rownames(res)),
       "#4582B3AA",
       "#A1A1A180"),
     pch = 19,
     xlab = expression(log[2]~fold-change),
     ylab = expression(-log[10]~q-value))
grid()
abline(v = -1, lty = "dotted")
abline(h = -log10(0.1), lty = "dotted")
legend("topright", c("UPS", "E. coli"),
       col = c("#4582B3AA", "#A1A1A1AA"),
       pch = 19, bty = "n")
@
\caption{On the volcano plot, each protein is represented by a dot and
  positioned according to its $log_2$ fold-change along the x axis and
  $-log_{10}$ of its q-value along the y axis. Proteins with large
  fold-changes are positioned on the sides of the plot, while proteins
  with low q-values are at the top of the figure. The most promising
  candidates are this located on the top corners. In our case, the UPS
  proteins (in blue) have $log2$ fold-changes around -1 (vertical
  dotted line), as expected. The horizontal dotted line represents a
  q-value threshold of 0.10.}
\label{fig:volcano}
\end{figure}


\begin{figure}[!hbt]
<<hmap, dev='pdf', fig.width=4, fig.height=4, fig.keep='last'>>=
heatmap(exprs(msUps)[grep("ups",featureNames(msUps)), ])
@
\caption{A heatmap of all UPS1 proteins present in the final data
  set.}
\label{fig:hmap}
\end{figure}


%% <<falseneg, echo=FALSE>>=
%% fn <- which(grepl("ups",rownames(res)) & res$qv > 0.06)
%% @ %%$

<<falsepos, echo=FALSE>>=
sel1 <- (res$qv < 0.1)
signms <- rownames(res)[sel1]
i <- grep("ups", signms)
n1 <- length(i)
n2 <- length(signms) - n1
@


<<upstab, echo=FALSE, results='asis'>>=
k <- grep("ups", rownames(res))
print(xtable(round(res[k, ], 3),
             caption = "UPS1 proteins.",
             table.placement = "thb",
             label = "tab:ups"),
      size = "scriptsize")
@


A total \Sexpr{n1} proteins show a statistically different pattern
between the two groups, at a false discovery rate of 10\%.  Table
\ref{tab:ups} summarises the results for all UPS1 proteins.

<<echo=FALSE>>=
  dev.off()
@

\clearpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Session information}\label{sec:sessionInfo}

All software and respective versions used to produce this document are
listed below.

<<sessioninfo, results='asis', echo=FALSE, cache=FALSE>>=
toLatex(sessionInfo())
@

\bibliography{synapter}

\end{document}