---
title: "Prepare Peptide Spectrum Matches for Use in Targeted Proteomics"
author:
- name: Christian Panse
  affiliation:
    - &fgcz Functional Genomics Center Zurich - Swiss Federal Institute of Technology in Zurich
    - &sib Swiss Institute of Bioinformatics, Quartier Sorge - Batiment Amphipole, CH-1015 Lausanne, Switzerland
  email: cp@fgcz.ethz.ch
- name: Christian Trachsel
  affiliation:
    - *fgcz
- name: Jonas Grossmann
  email: jg@fgcz.ethz.ch
  affiliation:
    - *fgcz
    - *sib
- name: Witold E. Wolski
  affiliation:
    - *fgcz
    - *sib
  email: wew@fgcz.ethz.ch
date: "`r doc_date()`"
package: specL
abstract: |
  Targeted data extraction methods are attractive ways to obtain
  quantitative peptide information from a proteomics experiment.
  Sequential Window Acquisition of all Theoretical Spectra (SWATH) and
  Data Independent Acquisition (DIA) methods increase reproducibility of
  acquired data because the classical precursor selection is omitted and
  all present precursors are fragmented. However, especially for targeted
  data extraction, MS coordinates (retention time information precursor
  and fragment masses) are required for the particular entities (peptide
  ions). These coordinates are usually generated in a so-called discovery
  experiment earlier on in the project if not available in public spectral
  library repositories. The quality of the assay panel is crucial to
  ensure appropriate downstream analysis. For that, a method is needed to
  create spectral libraries and to export customizable assay panels.
vignette: |
  %\VignetteIndexEntry{Introduction to specL}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: specL.bib
output: 
  BiocStyle::html_document:
    toc_float: true
---


# Introduction

Targeted proteomics is a fast evolving field in proteomics science and 
was even elected as the method of the year in 2012 
\footnote{\url{http://www.nature.com/nmeth/journal/v10/n1/pdf/nmeth.2329.pdf}, 
2014-09-22}. Especially targeted methods like SWATH [@SWATH] open 
promising perspectives for for identifying and quantifying of 
peptides and proteins. All targeted
methods  have in common the need of precise MS coordinates composed 
of precursor mass, fragment masses, and retention time. The combination 
of this information is kept in so-called assays or spectra libraries. Here we 
present an R package able to produce such libraries out of peptide 
identification results (Mascot (dat), TPP (pep.xml and mzXMLs), 
ProteinPilot (group), Omssa (omx)).
`r BiocStyle::Biocpkg('specL')` [@specLBioInf] is an easy-to-use, 
versatile, and flexible function, 
which can be integrated into already existing commercial 
or non-commercial analysis pipelines for targeted proteomics 
data analysis. Some examples of today's pipelines are ProteinPilot 
combined with Peakview (AB Sciex), Spectronaut (Biognosys) or 
OpenSwath [@pmid24727770].

In the following vignette it is described how the `r BiocStyle::Biocpkg('specL')` package 
can be used for the included data sets `peptideStd` and 
`peptideStd.redundant`.


# Workflow

## Prologue - How to get the input for the specL package?

Since peptide identification (using, e.g., Mascot, Sequest, xTandem!,
Omssa, ProteinPilot) 
usually creates result files which are 
heavily redundant and therefore unsuited for spectral library building, 
the search results must first be filtered. To create non-redundant 
input files, we use the BiblioSpec [@pmid18428681] algorithm 
implemented in Skyline [@pmid20147306]. A given search result (e.g. 
Mascot result file) is loaded into the software Skyline and is redundancy 
filtered. The 'Skyline workflow step' provides two sqlite readable 
files\footnote{sqlite uses standart SQL as query language.} 
as output named  `*.blib` and `*.redundant.blib`. 
These files are used as ideal input for this packages.
Note here, that Skyline is very flexible when it comes to peptide
identification results. It means with Skyline you can build the spectrum
library files for almost all search engines (even from other spectrum
library files such as spectraST [@pmid18806791]).

The first step which has to be performed on the R shell is loading 
`r BiocStyle::Biocpkg('specL')` library.

```{r echo=FALSE, eval = TRUE, fig = FALSE, echo=FALSE}
options(prompt = "R> ",
        continue = "+  ",
        width = 60,
        useFancyQuotes = FALSE,
        warn = -1)
```

```{r load}
library(specL)
packageVersion('specL')
```

## Read from redundant plus non-redundant blib files

for demonstration, `r BiocStyle::Biocpkg('specL')` contains the two data sets, namely `peptideStd` and 
`peptideStd.redundant`. This data set 
comes from two standard-run experiments routinely
used to check if the liquid chromatographic system is still working
appropriately. The sample consists of a digest of the Fetuin protein 
(Bos taurus, uniprot id: P12763). 40 femtomole are loaded on the column.
Mascot was used to search and identify the respective peptides.

```{r summry}
summary(peptideStd)
```

For both `peptideStd`, `peptideStd.redandant` data sets the 
Skyline software was used to generate the bibliospec files which 
contain the peptide sequences with the respective peptide spectrum 
match (PSM). The `specL::read.bibliospec` function was used 
to read the blib files into R. 

The from `read.bibliospec` generated object has its own plot functions.
The LC-MS map graphs peptide mass versus retention time.

```{r peptideStd}
# plot(peptideStd)
plot(0,0, main='MISSING')
```


The individual peptide spectrum match (psm) is displayed by using the 
`r BiocStyle::CRANpkg('protViz')` `peakplot` function. 

```{r pepttidePlot}
demoIdx <- 40
# str(peptideStd[[demoIdx]])
#res <- plot(peptideStd[[demoIdx]], ion.axes=TRUE)
plot(0,0, main='MISSING')
```

## Read from Mascot result files

Alternatively, Mascot search result files (dat) can be used by applying 
`r BiocStyle::CRANpkg('protViz')` perl script 
`protViz\-\_mascotDat2RData.pl`.

The Perl script can be found in the exec directory of the
`r BiocStyle::CRANpkg('protViz')` package. 
The mascot mod\_file can be found in the configurations of the mascot server.
An example on our Linux shell looks as follows:

```
$ /usr/local/lib/R/site-library/protViz/exec/protViz_mascotDat2RData.pl \
-d=/usr/local/mascot/data/20130116/F178287.dat \
-m=mod_file
```

`mascotDat2RData.pl` requires the Mascot server `mod\_file` keeping 
all the configured modification.

Once the {erl script is finished, the resulting RData file can be read into the R session using `load`.

Next, the variable modifications, and the S3 psmSet object has to be generated. This can be done by using  `specL:::.mascot2psmSet`

```{r mascot2psmSet}
specL:::.mascot2psmSet
```

If you are processing Mascot result files, you can continue reading in the section `genSwathIonLib`.

However, please note due do the high potential redundancy of peptide spectrum matches in a database search approach, it might not result in useful ion library for targeted data extraction unless redundancy filtering is handled. 
However, in a future release, a redundancy filter algorithm might be proposed to resolve this problem.

## Annotate protein IDs using FASTA

The information to which protein a peptide-spectrum-match belongs (PSM) 
is not stored by BiblioSpec. Therefore `r BiocStyle::Biocpkg('specL')`  provides the `annotate.protein\_id` function which uses R's internal `grep`
to 'reassign' the protein information. Therefore a `fasta` object has 
to be loaded into the R system using `read.fasta` of the 
`r BiocStyle::CRANpkg('seqinr')` package. For this, not necessarily, the same `fasta` file needs to be provided as in the original database 
search.

The following lines demonstrate a simple sanity check with a single 
FASTA style formatted protein entry. Also it demonstrates the use case
how to identify entries in the R-object which are from one or a few proteins
of interest.

```{r irtFASTAseq}
irtFASTAseq <- paste(">zz|ZZ_FGCZCont0260|",
"iRT_Protein_with_AAAAK_spacers concatenated Biognosys\n",
"LGGNEQVTRAAAAKGAGSSEPVTGLDAKAAAAKVEATFGVDESNAKAAAAKYILAGVENS",
"KAAAAKTPVISGGPYEYRAAAAKTPVITGAPYEYRAAAAKDGLDAASYYAPVRAAAAKAD",
"VTPADFSEWSKAAAAKGTFIIDPGGVIRAAAAKGTFIIDPAAVIRAAAAKLFLQFGAQGS",
"PFLK\n")

Tfile <- file();  cat(irtFASTAseq, file = Tfile);
fasta.irtFASTAseq <- read.fasta(Tfile, as.string=TRUE, seqtype="AA")
close(Tfile)
```

As expected, the `peptideStd` data, e.g., our demo object, does not contain any protein information yet.

```{r}
peptideStd[[demoIdx]]$proteinInformation
```

The protein information can be added as follow:

```{r}
peptideStd <- annotate.protein_id(peptideStd, 
    fasta=fasta.irtFASTAseq)
```

The following lines now show the object indices of those entries which do
have protein information now.

```{r}
(idx <- which(unlist(lapply(peptideStd, 
    function(x){nchar(x$proteinInformation)>0}))))
```

As expected, there are now a number of peptide sequences
annotated with the protein ID. 

```{r}
peptideStd[[demoIdx]]$proteinInformation
```

Of note, that the default digest pattern is defined as

```{r}
digestPattern = "(([RK])|(^)|(^M))"
```

for tryptic peptides. For other enzymes, the pattern has to 
be adapted. For example, for semi-tryptic identifications, use
`digestPattern = ""`.

## Generate the spectral library (assay)

`genSwathIonLib` is the main contribution of the
`r BiocStyle::Biocpkg('specL')` package. It generates the spectral library used in a targeted data extraction workflow from a mass spectrometric 
measurement. Generating the ion library using iRT peptides is highly recommended as described.  However if you have no iRT peptide, continue reading in section noiRT.

Generation of the spec Library with default (see Table) settings.

```{r}
res.genSwathIonLib <- genSwathIonLib(data = peptideStd, 
   data.fit = peptideStd.redundant)
```

genSwathIonLib default settings

| parameter         |    description                 | value           |
|-------------------|--------------------------------|-----------------|
|max.mZ.Da.error    | max ms2 tolerance              | 0.1             | 
|topN               | the n most intense fragment ion| 10              |
|fragmentIonMzRange |mZ range filter of fragment ion | c(300, 1800)    |
|fragmentIonRange   |min/max number of fragment ions |c(5,100)         |
|fragmentIonFUN}    |desired fragment ion types|b1+,y1+,b2+,y2+,b3+,y3+|


```{r}
summary(res.genSwathIonLib)
```


The determined mass spec coordinates of the selected tandem mass spectrum 
`demoIdx` look like this:

```{r}
res.genSwathIonLib@ionlibrary[[demoIdx]]
```

It can be displayed using the \Rfunction{specL::plot} function.

```{r fig.retina=3}
plot(res.genSwathIonLib@ionlibrary[[demoIdx]])
```

The following code considers only the top five y ions.

```{r fig.retina=3}
# define customized fragment ions
# for demonstration lets consider only the top five singly charged y ions.

r.genSwathIonLib.top5 <- genSwathIonLib(peptideStd,
    peptideStd.redundant, topN=5,
    fragmentIonFUN=function (b, y) {
      return( cbind(y1_=y) )
      }
    )

             
plot(r.genSwathIonLib.top5@ionlibrary[[demoIdx]])
```

## Normalizing the retention time using iRT peptides

Retention time is an essential parameter in targeted data 
extraction. However, retention times are difficult to transfer between 
reverse phase columns or HPLC systems. To make transfer 
applicable and account for the inter-run shift in retention time Biognosys 
[@pmid22577012] invented the iRT normalization based on iRT / HRM 
peptides. For this, a set of well-behaving peptides (good flying 
properties, good fragmentation characteristics, completely artificial) 
which cover the whole rt-gradient and are spiked into each sample. 
For this set of peptides, an idependent retention time (dimension less) 
is suggested by Biognosys. With this at hand, the set of peptides can later 
be used to apply a linear regression model to adapt all measured 
retention times into an independent retention time 
scale.

If the identification results contain iRT peptides, the package 
supports the conversion to the iRT scale. For this (if the identification 
the outcome is based on multiple input files), the redundant BiblioSpec file 
is required where all iRT peptides from all measurements are stored. 
For the most representative spectrum in the non-redundant R-object the 
original filename is identified, and the respective linear model for this
one particular MS experiment is applied 
to normalize the retention time to the iRT scale.
The iRT peptides, as well as their independent retention times, are
stored in the `iRTpeptides` object.

`r BiocStyle::Biocpkg('specL')`  uses by default the iRT peptide table to
normalize
into the independent retention time but could also be extended or 
changed to custom iRT peptides if available.

```{r}
iRTpeptides
```

The method genSwathIonLib uses:

```{r fit, eval=FALSE}
fit <- lm(formula = rt ~ aggregateInputRT * fileName, data = m)
```

to build the linear models for each MS measurement individually. 
For defining `m` both data sets were aggregated over the attributes
`peptide` and `fileName` using the `mean` operator.

```{r eval=FALSE}
data <- aggregate(df$rt, by = list(df$peptide, df$fileName),
  FUN = mean)
data.fit <- aggregate(df.fit$rt, 
  by = list(df.fit$peptide, df.fit$fileName), 
  FUN = mean)
```

Afterwards the following join operator was applied.

```{r merge, eval=FALSE}
m <- merge(iRT, data.fit, by.x='peptide', by.y='peptide')
```


The following graph displays the normalized retention time versus 
the measured retention time after applying the calculated model 
to the two data sets.


```{r fig.retina=3}
# calls the plot method for a specLSet object
op <- par(mfrow=c(2,3)) 
plot(res.genSwathIonLib)
par(op)
```

Shown are the original retention time (in minutes) and iRT (dimensionless)
for two standard run experiments (color black and red). Indicated 
with black {\bf X} are the iRT peptides, which are the base for the 
regression.

## Generate the spectral library having no iRTs

If no iRT peptides are contained in the data, not iRT normalization is applied.
The scatter plot below shows on the y-axis that there was not iRT transformation.

```{r fig.retina=3}
idx.iRT <- which(unlist(lapply(peptideStd,
  function(x){
    if(x$peptideSequence %in% iRTpeptides$peptide){0}
    else{1}
  })) == 0)

# remove all iRTs and compute ion library
res.genSwathIonLib.no_iRT <- genSwathIonLib(peptideStd[-idx.iRT])
summary(res.genSwathIonLib.no_iRT)
op <- par(mfrow = c(2, 3)) 
plot(res.genSwathIonLib.no_iRT)
par(op)
```

## Write output to file

The output can be written as an ASCII text file.

```{r}
write.spectronaut(res.genSwathIonLib,
  file="specL-Spectronaut.txt")
```


# Epilogue

## What can I do with that library now?

The `r BiocStyle::Biocpkg('specL')` output text file can directly be used
as input (assay) for 
the Spectronaut software from Biognosys or with minimal reshaping for 
Peakview. Alternatively, it can be used as a basis for script-based 
construction of SRM/MRM assays.

# Benchmark

The benchmarks were processed on a 
12 core XEON Server (X5650 @ 2.67GHz) running Linux Debian wheezy
having
R version 3.1.1 (2014-07-10) , specL 1.1.2, and BiocParallel 1.0.0
installed. The default setting of BiocParallel uses eight cores.
As FASTA, we used a TAIR10 retrieved from \url{http://www.arabidopsis.org/} and Human Swissprot.

```
\begin{table}[h]
\centering
\resizebox{.99\textwidth}{!}{
\begin{tabular}{rrr|rr|rr}
\hline
fasta=TAIR10     &                    &           & blib  [unpublished]                 &           & runtime  &           \\
\#proteins & \#tryptic peptides & file size & \#specs &  file size & annotate & generate\\\hline \hline
71032      & 3423196            & 39M       & 39648/118268  & 51M       & 79min         &   19sec \\
71032      & 3423196            & 39M       & 65018/136963  & 120M      & 130min         &  30sec \\
\hline
fasta=HUMAN   &                    &           & blib  \cite[Rosenberger]{Rosenberger} &           &   &           \\
88969& 3997085   &   43M     &   256908/3060421 & 4.4G     & $\approx$7h &$\approx$5min \\

%HUMAN\footnote{Rosenberger et al. in Scientific Data (doi:10.1038/sdata.2014.31)} &&    & & & 256908/3060421&         & 4.4G       & & $\approx$5min\\
\hline
\end{tabular}
}
\end{table}

```

The following parameter settings were given to the `genSwathIonLib` function:

```{r eval=FALSE}
res <- genSwathIonLib(data, data.fit,
  topN=10, 
  fragmentIonMzRange=c(200,2000), 
  fragmentIonRange=c(2,100))
```


# Acknowledgement

The authors thank all colleagues of the Functional Genomics Center 
Zuerich (FGCZ), and especial thank goes to our test users
Sira Echevarr\'{i}a~Zome\~{n}o (ETHZ),
Tobias Kockmann (ETHZ),
Lukas von Ziegler (Brain Research Institute, UZH/ETH Zurich),
and Stephan~Michalik (Ernst-Moritz-Arndt-Universität Greifswald, Germany).

# TODO for next releases

- importer for peakview csv format; enable \Rcode{compare.specLSet(object0, object1)}

- new option for \Rfunction{specL::genSwathIonLib}; Exclude fragment 
ions from precursor \Rcode{window = TRUE, FALSE}

- new option for \Rfunction{specL::genSwathIonLib}; Predict 
transitions for heavy labeled peptides using information from light 
peptides \Rcode{predictHeavy = TRUE,FALSE, LabelFile = 
"fileWithHeavyAA"}

- new export function into TraML format for compatibility with OpenSWATH \cite{pmid24727770}

- replace \CRANpkg{seqinr} \Rfunction{read.fasta} by using \Biocpkg{Biostrings} \Rfunction{readAAStringSet} 
to handle fasta files 

- add varMods to specL class

- replace Mascot score by a generic score

- in-silico rt ion map plot (\Rfunction{plot.specLSet}) 
split window into SWATH windows (one plot per, e.g., 25Da window)

- assay refinement - replace contaminated fragment ion in library

# Session information

An overview of the package versions used to produce this document are 
shown below.

```{r sessionInfo, echo=FALSE}
sessionInfo()
```
  
# References