%\VignetteIndexEntry{Introduction to specL} \documentclass[12pt]{article} <>= BiocStyle::latex() @ \title{\Biocpkg{specL} - Prepare Peptide Spectrum Matches for Use in Targeted Proteomics} \author{ Christian Panse\thanks{\email{cp@fgcz.ethz.ch}}\\ Christian Trachsel\\ Jonas Grossmann\thanks{\email{jg@fgcz.ethz.ch}}\\ Witold Wolski\thanks{\email{wew@fgcz.ethz.ch}} } \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \newpage \tableofcontents \newpage \section{Introduction} Targeted proteomics is a fast evolving field in proteomics science and was even elected as method of the year in 2012 \footnote{\url{http://www.nature.com/nmeth/journal/v10/n1/pdf/nmeth.2329.pdf}, 2014-09-22}. Especially targeted methods like SWATH \cite{SWATH} open promising perspectives for the identification and quantification of peptides and proteins. All targeted methods have in common the need of precise MS coordinates composed of precursor mass, fragment masses, and retention time. The combination of this information is kept in so called assays or spectra libraries. Here we present an R package able to produce such libraries out of peptide identification results (Mascot (dat), TPP (pep.xml and mzXMLs), ProteinPilot (group), Omssa (omx)). \Biocpkg{specL} (see also \cite{specLBioInf}) is an easy to use, versatile and flexible function, which can be integrated into already existing commercial or non commercial analysis pipelines for targeted proteomics data analysis. Some examples of today's pipelines are ProteinPilot combined with Peakview (AB Sciex), Spectronaut (Biognosys) or OpenSwath \cite{pmid24727770}. In the following vignette it is described how the \Biocpkg{specL} package can be used for the included data sets \Rcode{peptideStd} and \Rcode{peptideStd.redundant}. \section{Workflow} \subsection{Prologue - How to get the input for the specL package?} Since peptide identification (using, e.g., Mascot, Sequest, xTandem!, Omssa, ProteinPilot) usually creates result files which are heavily redundant and therefore unsuited for spectral library building, the search results must first be filtered. To create non-redundant input files, we use the BiblioSpec \cite{pmid18428681} algorithm implemented in Skyline \cite{pmid20147306}. A given search result (e.g. Mascot result file) is loaded into the software Skyline and is redundancy filtered. The 'Skyline workflow step' provides two sqlite readable files\footnote{sqlite uses standart SQL as query language.} as output named \Rcode{'*.blib'} and \Rcode{'*.redundant.blib'}. These files are used as ideal input for this packages. Note here, that Skyline is very flexible when it comes to peptide identification results. It means with Skyline you can build the spectrum library files for almost all search engines (even from other spectrum library files such as spectraST \cite{pmid18806791}). The first step which has to be performed on the R shell is loading \Biocpkg{specL} library. <>= #rm(list=ls()) #options(width = 80) options(prompt = "R> ", continue = "+ ", width = 60, useFancyQuotes = FALSE, warn = -1) @ <>= library(specL) packageVersion('specL') @ \subsection{Read from redundant plus non-redundant blib files} for demonstration \Biocpkg{specL} contains the two data sets namely \Rcode{peptideStd} and \Rcode{peptideStd.redundant}. This is a data set which comes from two standard run experiments which are routinely used to check if the liquid chromatographic system is still working appropriate. The sample consists of a digest of the Fetuin protein (Bos taurus, uniprot id: P12763). 40 femtomole are loaded on column. Mascot was used to search and identify the respective peptides. <>= summary(peptideStd) @ For both \Rcode{peptideStd}, \Rcode{peptideStd.redandant} data sets the Skyline software was used to generate the bibliospec files which contain the peptide sequences with the respective peptide spectrum match (PSM). The \Rfunction{specL::read.bibliospec} function was used to read the blib files into R. The from \Rcode{read.bibliospec} generated object has its own plot functions. The LC-MS map graphs peptide mass versus retention time. <>= # plot(peptideStd) plot(0,0, main='MISSING') @ The individual peptide spectrum match (psm) is displayed by using the \CRANpkg{protViz} \Rfunction{peakplot} function. <>= demoIdx <- 40 # str(peptideStd[[demoIdx]]) #res <- plot(peptideStd[[demoIdx]], ion.axes=TRUE) plot(0,0, main='MISSING') @ \subsection{Read from Mascot result files} Alternatively Mascot search result files (dat) can be used by applying \CRANpkg{protViz} perl script \texttt{protViz\-\_mascotDat2RData.pl}. The perl script can be found in the exec directory of the \CRANpkg{protViz} package. The mascot mod\_file can be found in the configurations of the mascot server. An example on our Linux shell looks as follow: \begin{verbatim} $ /usr/local/lib/R/site-library/protViz/exec/protViz_mascotDat2RData.pl \ -d=/usr/local/mascot/data/20130116/F178287.dat \ -m=mod_file \end{verbatim} \Rcode{mascotDat2RData.pl} requires the Mascot server \Rcode{mod\_file} keeping all the configured modification. Once the perl script is finished, the resulting RData file can be read into the R session using \Rfunction{load}. Next, the variable modifications, and the S3 psmSet object has to be generated. This can be done by using \Rfunction{specL:::.mascot2psmSet} <>= specL:::.mascot2psmSet @ If you are processing Mascot result files you can continue reading in section \ref{sec:genSwathIonLib}. However, please note, due do the potential high redundancy of peptide spectrum matches in a database search approach, it might not result in useful ion library for targeted data extraction, unless redundancy filtering is handled. However in a future release a redundancy filter algorithm might be proposed to resolve this problem. \subsection{Annotate protein ids using FASTA} The information to which protein a peptide-spectrum-match belongs (PSM) is not stored by BiblioSpec. Therefore \Biocpkg{specL} provides the \Rfunction{annotate.protein\_id} function which uses R's internal \Rfunction{grep} to 'reassigning' the protein information. Therefore a \Robject{fasta} object has to be loaded into the R system using \Rfunction{read.fasta} of the \CRANpkg{seqinr} package. For this, not necessarily, the same \Robject{fasta} file needs to be provided as in the original database search. The following lines demonstrate a simple sanity check with a single FASTA style formated protein entry. Also it demonstrates the use case how to identify entries in the R-object which are from one or a few proteins of interest. <>= irtFASTAseq <- paste(">zz|ZZ_FGCZCont0260|", "iRT_Protein_with_AAAAK_spacers concatenated Biognosys\n", "LGGNEQVTRAAAAKGAGSSEPVTGLDAKAAAAKVEATFGVDESNAKAAAAKYILAGVENS", "KAAAAKTPVISGGPYEYRAAAAKTPVITGAPYEYRAAAAKDGLDAASYYAPVRAAAAKAD", "VTPADFSEWSKAAAAKGTFIIDPGGVIRAAAAKGTFIIDPAAVIRAAAAKLFLQFGAQGS", "PFLK\n") Tfile <- file(); cat(irtFASTAseq, file = Tfile); fasta.irtFASTAseq <- read.fasta(Tfile, as.string=TRUE, seqtype="AA") close(Tfile) @ As expected the \Rcode{peptideStd} data, e.g., our demo Object, does not contain any protein information yet. <>= peptideStd[[demoIdx]]$proteinInformation @ The protein information can be added as follow: <>= peptideStd <- annotate.protein_id(peptideStd, fasta=fasta.irtFASTAseq) @ The following lines show now the object indices of those entries which do hav a protein information now. <>= (idx<-which(unlist(lapply(peptideStd, function(x){nchar(x$proteinInformation)>0})))) @ As expected, there are now a number of peptide sequences annotated with the protein ID. <>= peptideStd[[demoIdx]]$proteinInformation @ Please note, that the default digest pattern is defined as <>= digestPattern = "(([RK])|(^)|(^M))" @ for tryptic peptides. For other enzymes, the pattern has to be adapted. For example, for semi-tryptic identifications use \Rcode{digestPattern = ""}. \subsection{Generate the spectral library (assay)} \label{sec:genSwathIonLib} \Rfunction{genSwathIonLib} is the main contribution of the \Biocpkg{specL} package. It generates the spectral library used in a targeted data extraction workflow from a mass spectrometric measurement. Generating the ion library using iRT peptides is highly recommended as described. However if you have no iRT peptide continue reading in section \ref{sec:noiRT}. Generation of the spec Library with default (see Table \ref{table:default}) settings. <>= res.genSwathIonLib <- genSwathIonLib(data = peptideStd, data.fit = peptideStd.redundant) @ \begin{table}[h] \centering \resizebox{.99\textwidth}{!}{ \begin{tabular}{l|l|r} \hline parameter & description& value \\ \hline \Rcode{max.mZ.Da.error} & max ms2 tolerance& \Rcode{0.1}\\ \Rcode{topN} &the n most intense fragment ion& 10\\ \Rcode{fragmentIonMzRange}&mZ range filter of fragment ion& c(300, 1800)\\ \Rcode{fragmentIonRange}&min/max number of fragment ions& \Rcode{c(5,100)}\\ \Rcode{fragmentIonFUN} &desired fragment ion types& \Rcode{b1+,y1+,b2+,y2+,b3+,y3+}\\ \hline \end{tabular} } \caption{\label{table:default}\Rfunction{genSwathIonLib} default settings} \end{table} <>= summary(res.genSwathIonLib) @ The determined mass spec coordinates of the selected tandem mass spectrum \Rcode{demoIdx} look like this: <>= res.genSwathIonLib@ionlibrary[[demoIdx]] @ It can be displayed using the \Rfunction{specL::plot} function. <>= plot(res.genSwathIonLib@ionlibrary[[demoIdx]]) @ The following code considers only the top five y ions. <>= # define customized fragment ions # for demonstration lets consider only the top five singly charged y ions. r.genSwathIonLib.top5 <- genSwathIonLib(peptideStd, peptideStd.redundant, topN=5, fragmentIonFUN=function (b, y) { return( cbind(y1_=y) ) } ) plot(r.genSwathIonLib.top5@ionlibrary[[demoIdx]]) @ \subsection{Normalizing the retention time using iRT peptides} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Retention time is a very important parameter in targeted data extraction. However retention times are not easy to transfer between different reverse phase columns or HPLC systems. To make transfer applicable and account for inter run shift in retention time Biognosys \cite{pmid22577012} invented the iRT normalization based on iRT / HRM peptides. For this, a set of well behaving peptides (good flying properties, good fragmention characteristics, completely artifical) which cover the whole rt-gradient are spiked into each sample. For this set of peptide an idependent retention time (dimension less) is suggested by Biognosys. With this at hand, the set of peptides can later be used to apply a linear regression model to adapt all measured retention times into an independent retention time scale. \\ If the identification results contain iRT peptides the package supports the conversion to the iRT scale. For this (if the identification result are based on multiple input files) the redundant BiblioSpec file is required where all iRT peptides from all measurements are stored. For the most representative spectrum in the non-redundant R-object the original filename is identified and the respective linear model for this one particular MS experiment is applied to normalize the retention time to the iRT scale. The iRT peptides as well as their independent retention times are stored in the \Rcode{iRTpeptides} object. \Biocpkg{specL} uses by default the iRT peptide table to normalize into the independent retention time but could also be extended or changed to custom iRT peptides if available. <>= iRTpeptides @ The method genSwathIonLib uses: <>= fit <- lm(formula = rt ~ aggregateInputRT * fileName, data=m) @ to build the linear models for each MS measurement individually. For defining \Rcode{m} both data sets were aggregated over the attributes \Rcode{peptide} and \Rcode{fileName} using the \Rcode{mean} operator. <>= data <- aggregate(df$rt, by = list(df$peptide, df$fileName), FUN = mean) data.fit <- aggregate(df.fit$rt, by = list(df.fit$peptide, df.fit$fileName), FUN = mean) @ Afterwards the following join operator was applied. <>= m <- merge(iRT, data.fit, by.x='peptide', by.y='peptide') @ % http://www.ncbi.nlm.nih.gov/pubmed/22577012 The following graph displays the normalized retention time versus the measured retention time after applying the calculated model to the two data sets. <>= # calls the plot method for a specLSet object op <- par(mfrow=c(2,3)) plot(res.genSwathIonLib) par(op) @ Shown are original retention time (in minutes) and iRT (dimensionless) for two standard run experiments (color black and red). Indicated with black {\bf X} are the iRT peptides which are the base for the regression. \subsection{Generate the spectral library having no iRTs\label{sec:noiRT}} If no iRT peptides are contained in the data not iRT normalization is applied. The scatter plot below shows on the y axis that there were not iRT transformation. <>= idx.iRT <- which(unlist(lapply(peptideStd, function(x){ if(x$peptideSequence %in% iRTpeptides$peptide){0} else{1} })) == 0) # remove all iRTs and compute ion library res.genSwathIonLib.no_iRT <- genSwathIonLib(peptideStd[-idx.iRT]) summary(res.genSwathIonLib.no_iRT) op <- par(mfrow = c(2, 3)) plot(res.genSwathIonLib.no_iRT) par(op) @ \subsection{Write output to file} The output can be written as an ASCII text file. <>= write.spectronaut(res.genSwathIonLib, file="specL-Spectronaut.txt") @ %\subsection{Best praxis parameter settings for library generation} %todo(jonas, ct) \subsection{Epilogue - What can I do with that library now?} The specL output text file can directly be used as input (assay) for the Spectronaut software from Biognosys or with minimal reshaping for Peakview. Alternatively it can be used as a basis for script based construction of SRM/MRM assays. \subsection{Benchmark} The benchmarks were processed on a 12 core XEON Server (X5650 @ 2.67GHz) running Linux Debian wheezy having R version 3.1.1 (2014-07-10) , specL 1.1.2, and BiocParallel 1.0.0 installed. The default setting of BiocParallel has used eight cores. As FASTA we used a TAIR10 retrived from \url{http://www.arabidopsis.org/} and Human Swissprot. \begin{table}[h] \centering \resizebox{.99\textwidth}{!}{ \begin{tabular}{rrr|rr|rr} \hline fasta=TAIR10 & & & blib [unpublished] & & runtime & \\ \#proteins & \#tryptic peptides & file size & \#specs & file size & annotate & generate\\\hline \hline 71032 & 3423196 & 39M & 39648/118268 & 51M & 79min & 19sec \\ 71032 & 3423196 & 39M & 65018/136963 & 120M & 130min & 30sec \\ \hline fasta=HUMAN & & & blib \cite[Rosenberger]{Rosenberger} & & & \\ 88969& 3997085 & 43M & 256908/3060421 & 4.4G & $\approx$7h &$\approx$5min \\ %HUMAN\footnote{Rosenberger et al. in Scientific Data (doi:10.1038/sdata.2014.31)} && & & & 256908/3060421& & 4.4G & & $\approx$5min\\ \hline \end{tabular} } \end{table} The following parameter settings were given to the \Rfunction{genSwathIonLib} function: <>= res <- genSwathIonLib(data, data.fit, topN=10, fragmentIonMzRange=c(200,2000), fragmentIonRange=c(2,100)) @ \section{Acknowledgement} The authors thank all colleagues of the Functional Genomics Center Zuerich (FGCZ), and especial thank goes to our test users Sira~Echevarr\'{i}a~Zome\~{n}o (ETHZ), Tobias~Kockmann (ETHZ), Lukas~von~Ziegler (Brain Research Institute, UZH\texttt{|}ETH Zurich), and Stephan~Michalik (Ernst-Moritz-Arndt-Universit{\"a}t Greifswald, Germany). \section{TODO for next releases} \begin{itemize} \item importer for peakview csv format; enable \Rcode{compare.specLSet(object0, object1)} \item new option for \Rfunction{specL::genSwathIonLib}; Exclude fragment ions from precursor \Rcode{window = TRUE, FALSE} %\item new option fo \Rfunction{specL::genSwathIonLib}; Calculate %\Rcode{q1.in-silico} as an alternative to \Rcode{q1} 1.1.8 \item new option for \Rfunction{specL::genSwathIonLib}; Predict transitions for heavy labeled peptides using information from light peptides \Rcode{predictHeavy = TRUE,FALSE, LabelFile = "fileWithHeavyAA"} \item new export function into TraML format for compatibility with OpenSWATH \cite{pmid24727770} %\item streamline modsequence, e.g. \texttt{AAAMASATTM$\left[16.0\right]$LTTK} %for compatibility with peakView V2.0 \item replace \CRANpkg{seqinr} \Rfunction{read.fasta} by using \Biocpkg{Biostrings} \Rfunction{readAAStringSet} to handle fasta files \item add varMods to specL class \item replace Mascot score by a generic score \item in-silico rt ion map plot (\Rfunction{plot.specLSet}) split window into SWATH windows (one plot per, e.g., 25Da window) \item assay refinement - replace contaminated fragment ion in library \end{itemize} \section{Session information} An overview of the package versions used to produce this document are shown below. <>= toLatex(sessionInfo()) @ \bibliography{specL} \end{document}