%\VignetteIndexEntry{The rTANDEM users guide}
%\VignetteKeywords{MassSpectrometry, Proteomics}
%\VignettePackage{rTANDEM}
\documentclass[11pt]{article}
\usepackage{hyperref}
\usepackage{url}
\usepackage{color, pdfcolmk}
\usepackage[super,square]{natbib}
\bibliographystyle{plainnat}

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\author{Frederic Fournier\footnote{frederic.fournier@crchuq.ulaval.ca}, Charles Joly Beauparlant\footnote{charles.joly-beauparlant@crchul.ulaval.ca}, Rene Paradis\footnote{rene.paradis@genome.ulaval.ca}, Arnaud Droit\footnote{arnaud.droit@crchuq.ulaval.ca}}
\begin{document}
\title{rTANDEM: An R encapsulation of X!Tandem}
\maketitle

\textnormal {\normalfont}
Introduction to rTANDEM

\tableofcontents
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage

\section{Licensing}
This package and the underlying X!Tandem code are distributed under the Artistic license 1.0. You are free to use and redistribute this software.

\section{Introduction}
X!Tandem\citep{beavis2003, beavis2004} is open source software for protein identification by tandem mass spectrometry, and rTANDEM encapsulates this software in R. rTANDEM provides basic encapsulation of X!Tandem: it has a function that takes as an argument the path to an X!Tandem-style parameter file and returns the path to an X!Tandem-style output file. The package also provides functions to transform parameter or result files into R objects and vice versa. Some functions are also available to examine the results within R. An associated package, shinyTANDEM, provides a graphical interface to visualize result objects.

\section{rTANDEM input/output overview}
The input of a standard X!Tandem analysis is composed of several files: three .xml files setting the parameters (usually named input.xml, default-input.xml and taxonomy.xml), one or more database files, and one or more files containing the spectra to be analyzed.
Detailed information is available on the \href{http://www.thegpm.org/TANDEM/index.html}{X!Tandem website}. rTANDEM can use the xml files just like X!Tandem, or R objects can be used in lieu of the .xml files. rTANDEM provides functions to create parameter or taxonomy objects from xml files, or to create the xml files from the R parameter or taxonomy objects.

The output of a standard X!Tandem analysis is a single xml file. Since this output can be quite large, rTANDEM follows X!Tandem's behaviour and writes the result of the analysis to an xml file instead of directly creating an R object with the result. (This behaviour allows you to launch a script that will run several analyses without worrying about running out of memory.) The package provides a function to parse a result file and create an R object from it.

\subsection{The input file}
The main input file is used to fix the major parameters of the search, namely:
\begin{enumerate}
\item The path of the output file (the file that will be used to store the results).
\item The path of the spectra file (the file that contains the data to be analysed).
\item The taxon used for the search.
\item The path of the taxonomy file (the file that provides information on the databases to be used).
\item The path of a default-input file (a file that sets parameters that are common to many experiments).
\end{enumerate}

\subsection{The output file}
rTANDEM output is an XML file. The format is based on the XML language BIOML (\url{http://www.bioml.com}) and a full description of the format is available at \url{http://www.thegpm.org/docs/X_series_output_form.pdf}. The function GetResultsFromXML() can be used to create an R object from the xml file.

\subsection{The spectra file}
The spectra file contains the data generated by the mass spectrometer. The supported formats for the data file are 'DTA', 'PKL' and 'MGF'.
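
To make the main input file described above more concrete, here is a minimal sketch of what such a file might look like. It assumes the usual X!Tandem convention of \begin{tt}note\end{tt} elements whose \begin{tt}label\end{tt} attribute holds the parameter category and name separated by a comma. All file names and the taxon are placeholders chosen for illustration; they are not files shipped with rTANDEM.

\begin{verbatim}
<?xml version="1.0"?>
<bioml>
  <!-- All paths and the taxon below are illustrative placeholders -->
  <note type="input" label="output, path">output.xml</note>
  <note type="input" label="spectrum, path">spectra.mgf</note>
  <note type="input" label="protein, taxon">yeast</note>
  <note type="input" label="list path, taxonomy information">taxonomy.xml</note>
  <note type="input" label="list path, default parameters">default_input.xml</note>
</bioml>
\end{verbatim}

The five entries follow the order of the five items listed for the input file: output path, spectra path, taxon, taxonomy file and default-input file.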
\subsection{The taxonomy file}
The taxonomy file, or object, is used to link together a taxon and one or multiple fasta files. This file is not usually modified for every experiment; rather, it provides the link between all taxa and their respective databases. It should be modified only when databases are updated or replaced.

\subsection{The database files}
The database files contain the information that will be used to analyse the spectra files. The most common database files contain the protein sequences of a given taxon in ASCII '.fasta' format. X!Tandem also supports the binary 'fasta.pro' format. (See \url{http://www.thegpm.org/TANDEM/api/fastapro.html}.) Lists of SAPs (single amino acid polymorphisms) or PTMs (post-translational modifications) can also be used to complement the fasta files.

\subsection{The default-parameter file}
The default-parameter file is used to set fine-grained parameters. For example, it is used to set the tolerated mass error, or to set which types of ions will be used for the calculation. This file is not usually modified for every experiment; rather, a default-parameter file can be created for every instrument used (the fine-grained parameters are often instrument-specific). The default-parameter file also provides links to formatting files (css and xsl) that are useful for consulting the output in a web browser.

\subsection{Parameters information}
The official API for X!Tandem parameters is described at \url{http://thegpm.org/TANDEM/api/}.

Some parameters are accepted even though they are not part of the official API. A list of such parameters can be found at \url{http://tools.proteomecenter.org/wiki/index.php?title=TPP:X!Tandem_the_TPP}.

The description of some parameters for the PTMTreeSearch scoring function can be found at \url{http://net.icgeb.org/ptmtreesearch/doc/ptmtreesearch_help.htm}.

\section{Dealing with parameters in R objects}
The parameters that can be passed to the TANDEM algorithm are numerous: over a hundred!
Dealing with this can be cumbersome, but rTANDEM features some functions to make it easier.\\
First, let's introduce a distinction between four levels of parameters: search-specific parameters, instrument-specific parameters, general parameters, and taxonomies. The search-specific parameters are information like the data file to be searched and the taxon to be used: those parameters will likely change with every new analysis. Instrument-specific parameters are settings like the mass error of your spectrometer at the MS1 and MS2 levels. They are unlikely to change for any search done on a given spectrometer. Then, there are more general parameters that concern anything from the enzyme used to the fine tuning of the algorithm or the output options. Those parameters rarely change. Finally, there are the parameters related to taxonomy: they link the taxon to be searched to one or more fasta files.\\
Given those levels of parameters, rTANDEM proposes a particular syntax and some functions to make it easier to launch analyses and deal with parameters. The launching syntax is: \begin{tt}rtandem(data.file, taxon, taxonomy, default.parameters)\end{tt}. This syntax puts the search-specific parameters directly in the function call, so you are not forced to create a full parameter object (or file) for every new search. Instead, you will need two objects, a taxonomy and a parameter set, that won't change very often.\\
To create those objects, rTANDEM features many functions:
\begin{enumerate}
\item \begin{tt}param <- getParamFromXML(parameter\_file.xml)\end{tt} lets you import parameters from an xml file, from a published method for example.
\item \begin{tt}param <- rTParam()\end{tt} lets you initialize a parameter object with no parameters set.
\item \begin{tt}param <- setParamDefault()\end{tt} lets you quickly set over 30 general parameters to sensible default values.
\item \begin{tt}param <- setParamOrbitrap()\end{tt} lets you quickly set instrument-specific parameters with sensible values for Orbitrap mass spectrometers (like Thermo Scientific's Orbitrap).
\item \begin{tt}param <- setParamQuadTof05Da()\end{tt} lets you quickly set instrument-specific parameters with sensible values for quad-TOF mass spectrometers (like AB-Sciex's QStar).
\item \begin{tt}param <- setParamQuadTof100ppm(paramObject)\end{tt} lets you quickly set instrument-specific parameters with sensible values for high-resolution quad-TOF mass spectrometers (like AB-Sciex's QStar Elite).
\item \begin{tt}param <- setParamIonTrap()\end{tt} lets you quickly set instrument-specific parameters with sensible values for ion trap spectrometers (like Thermo Scientific's LTQ).
\item \begin{tt}param <- setParamValue(param, category, parameter, value)\end{tt} lets you change the value of a specific parameter identified by its category and name.
\item \begin{tt}taxonomy <- getTaxoFromXML(taxonomy\_file.xml)\end{tt} lets you import taxonomy information from an xml file.
\item \begin{tt}taxonomy <- rTTaxo(taxon, format, URL)\end{tt} lets you initialize a new taxonomy object with optional data.
\item \begin{tt}taxonomy <- addTaxon(taxonomy, taxon, format, URL)\end{tt} lets you add a new taxon to your taxonomy object (or add a new URL to an existing taxon).
\end{enumerate}

\subsection{Example of parameter usage}
Let's say your lab owns two different mass spectrometers: a 5600 TripleTOF and an LTQ. It would be useful to have two different parameter objects with the most appropriate settings for those two instruments. To do this, you should first set the default parameters and check that their values correspond to what you need, then add instrument-specific values. \\
\begin{tt}default.params <- setParamDefault()\end{tt} will give you a parameter object with sensible default values.
But you should definitely check its content: \begin{tt}print.rTParam(default.params)\end{tt} will show the defined parameters with minimal formatting. Looking at the default values, you might realize that your lab functions differently. For example, the default cleavage site corresponds to trypsin (\begin{tt}'[RK]|\{P\}'\end{tt}). If your lab uses LysC instead, you will need to change this parameter: \begin{tt}default.params <- setParamValue(default.params, 'protein', 'cleavage site', '[K]|\{P\}')\end{tt}.\\
Once the default parameters correspond to your needs, you can create two parameter objects corresponding to your instruments by adding instrument-specific parameters to what you already have: \\
\begin{tt}ltq.params <- setParamIonTrap(default.params)\\
tof5600.params <- setParamQuadTof100ppm(default.params)
\end{tt}\\
(The second object is named \begin{tt}tof5600.params\end{tt} because an R object name cannot start with a digit.) You now have parameter objects with sensible values for your two instruments. You still need a taxonomy object. Let's say that these days you are working on human samples and have downloaded a recent version of UniProt's complete proteome set for Homo sapiens. Your taxonomy object can be created quickly: \begin{tt}taxonomy <- rTTaxo(taxon='human', format='peptide', URL='/path/to/your/fasta/file')\end{tt}. Other taxa can be added to this object later using the \begin{tt}addTaxon\end{tt} function.\\
Those objects now allow you to use the \begin{tt}rtandem\end{tt} syntax to launch analyses. For example, to analyse data obtained from your 5600, you would use: \begin{tt}result <- rtandem('path/to/your/data/file', 'human', taxonomy, tof5600.params)\end{tt}

\section{New Scoring functions}
In 2005, X!Tandem was modified to allow a 'plug-in' approach to refinement (see MacLean et al., 2005). It is thus possible to design new scoring functions for X!Tandem. Many such functions have been developed, and the following scoring functions are available in rTANDEM: the k-score function, the hrk-score function and the PTMTreeSearch score function.
\subsection{k-score function}
The k-score function was implemented by Brendan MacLean\citep{maclean} based on a scoring function described by Keller et al.\citep{keller2005}\\
To use the function, set the scoring algorithm to 'k-score' in your parameter object:\\
\begin{tt}par <- setParamValue(par, 'scoring', 'algorithm', value='k-score')\end{tt}\\
It is also recommended to disable spectrum conditioning and to set the minimum ion count to 1:\\
\begin{tt}
par <- setParamValue(par, 'scoring', 'minimum ion count', value=1)\\
par <- setParamValue(par, 'spectrum', 'use conditioning', value='no')
\end{tt}

\subsection{hrk-score function}
The hrk-score function, or high-resolution k-score function, was developed by the Trans-Proteomic Pipeline (TPP) team. It is directly based on the k-score function, but redefines some computations to fully benefit from high-resolution data.\\
To use the function, set the scoring algorithm to 'hrk-score'. It is also recommended to disable spectrum conditioning and to set the minimum ion count to 1:
\begin{tt}
par <- setParamValue(par, 'scoring', 'algorithm', value='hrk-score')\\
par <- setParamValue(par, 'scoring', 'minimum ion count', value=1)\\
par <- setParamValue(par, 'spectrum', 'use conditioning', value='no')
\end{tt}

\subsection{PTMTreeSearch}
The PTMTreeSearch function was developed by Attila Kertesz-Farkas. It is described in Kertesz-Farkas et al.\citep{PTMTS}, and some documentation can be found at the \href{http://net.icgeb.org/tandem/ptmtreesearch\_help.htm}{Tandem with PTMTreeSearch website}. Since PTMTreeSearch comprises many parameters, a function is supplied to set them all in a single step:\\
\begin{tt}param <- setParamPTMTreeSearch(param)\end{tt}\\
This will set the scoring algorithm to 'ptmtreesearch-score' with sensible default values, and use a small list of modifications installed with rTANDEM.
You can also use up-to-date modification lists from Unimod or UniProt and link to them using the parameters 'refine, PTMTreeSearch uniprot modif file' or 'refine, PTMTreeSearch unimod modif file'.

\section{rTANDEM typical usage}
First of all, we must load rTANDEM into the session:
<<>>=
library(rTANDEM)
@

To use rTANDEM and launch an analysis, we will need a couple of objects and files, including a taxonomy and a parameter set. The taxonomy is used to link taxa to fasta files. The rTANDEM package contains a partial fasta.pro file from the yeast proteome, so we can build a taxonomy object using it. (Note that we could have used a normal fasta file just as well.)
<<>>=
taxonomy <- rTTaxo(
  taxon="yeast",
  format="peptide",
  URL=system.file("extdata/fasta/scd.fasta.pro", package="rTANDEM")
)
taxonomy
@

Second, we will need to determine the parameters for the analysis. There are usually two different sets of parameters: one set of fine-grained parameters (often related to the mass spectrometer used to acquire the data), and a set of more basic parameters that specify which data file will be analysed and where the output should be written. A default 'fine-grained' parameter set is included in the package, as well as a small data file, so we can use them to create a complete parameter set.
We will start by creating an empty parameter object and then fill in the information:
<<>>=
param <- rTParam()
# Taxon to search and taxonomy object linking it to the fasta file
param <- setParamValue(param, 'protein', 'taxon', value="yeast")
param <- setParamValue(param, 'list path', 'taxonomy information', taxonomy)
# Fine-grained default parameters shipped with the package
param <- setParamValue(param, 'list path', 'default parameters',
  value=system.file("extdata/default_input.xml", package="rTANDEM"))
# Spectra to analyse, output style sheet and output location
param <- setParamValue(param, 'spectrum', 'path',
  value=system.file("extdata/test_spectra.mgf", package="rTANDEM"))
param <- setParamValue(param, 'output', 'xsl path',
  value=system.file("extdata/tandem-input-style.xsl", package="rTANDEM"))
param <- setParamValue(param, 'output', 'path',
  value=paste(getwd(), "output.xml", sep="/"))
@

Now that all the relevant parameters are recorded, we can launch the analysis using the param object:
<<>>=
result.path <- tandem(param)
result.path
@

The results are written in xml format to the specified output path, but we can load them into R for further processing.
<<>>=
result.R <- GetResultsFromXML(result.path)
@

Having the results within R, we can now use all the power of R to explore the identified proteins and the peptides used to identify these proteins. For example, to get the list of the proteins identified from at least two peptides and with an X!Tandem score corresponding to an expect value of 0.05 or better (a log expect value of about $-1.3$), we simply do:
<<>>=
proteins <- GetProteins(result.R, log.expect=-1.3, min.peptides=2)
proteins[, c(-4,-5), with=FALSE] # columns were removed for better display
@

GetProteins() returns a data.table, which is a kind of optimized data.frame. This object can be further manipulated to answer many common questions. For example, the first questions that are often asked about a new analysis are 'How many proteins have been identified with an appropriate confidence level?', 'What are the top proteins identified?', and 'Were proteins X, Y, and Z identified in the sample?'
We can quickly answer those questions:
<<>>=
# How many proteins have been identified with appropriate confidence?
length(proteins[['uid']])
# What are the top 5 proteins identified?
proteins[1:5, c("label", "expect.value"), with=FALSE]
# Were proteins YFR053C or P02267 identified in the sample?
c("YFR053C", "P02267") %in% proteins[,"label", with=FALSE][[1]]
@

At this point, we might want more information about protein YFR053C. For example, we might want to know which peptides from this protein were identified with an appropriate confidence level. To get the list of the peptides from this protein identified with an expect value < 0.05, we can simply do:
<<>>=
peptides <- GetPeptides(
  protein.uid=subset(proteins, label=="YFR053C", uid)[[1]],
  results    =result.R,
  expect     =0.05
)
peptides
@

Sometimes, a peptide sequence will be found in many different proteins. If we are planning to focus on a particular peptide, we should make sure that this is not the case. We can do this using the GetDegeneracy function, which returns the list of the proteins in which a given peptide is found. Let's look at the first peptide identified for YFR053C:
<<>>=
proteins.of.the.peptide <- GetDegeneracy(peptides[[1,"pep.id"]], result.R)
proteins.of.the.peptide[,label]
# Careful! This peptide belongs to 2 different proteins! It should not be
# used for quantification, for MRM or as a biomarker.
@

With our results being in R, we can harness the power of other R and Bioconductor packages to further analyze our data. For example, we can use biomaRt to get annotation information and cross-references. The reduced database that we used features Ensembl peptide IDs, but has no real description for the proteins.
Let's use biomaRt to find a description of protein YFR053C and, while we are there, let's also get a cross-reference for this protein in UniProt, as well as the GO terms associated with the protein:
<<>>=
options("width"=70)
@
<<>>=
library(biomaRt)
ensembl.mart <- useMart(biomart="ensembl", dataset="scerevisiae_gene_ensembl")
# Description of the protein
str(getBM(mart=ensembl.mart, filters="ensembl_peptide_id", values="YFR053C",
  attributes="description"), strict.width="wrap", nchar.max=500)
# Cross-reference in UniProt/Swiss-Prot
getBM(mart=ensembl.mart, filters="ensembl_peptide_id", values="YFR053C",
  attributes=c("ensembl_peptide_id", "uniprotswissprot"))
# Associated GO terms
getBM(mart=ensembl.mart, filters="ensembl_peptide_id", values="YFR053C",
  attributes=c("go_id", "name_1006"))
@

\begin{thebibliography}{9}

\bibitem{beavis2003}
Craig R, Beavis RC \textbf{[2003]} A method for reducing the time required to match protein sequences with tandem mass spectra. \textit{Rapid Commun Mass Spectrom.} 17(20):2310--6.

\bibitem{beavis2004}
Craig R, Beavis RC \textbf{[2004]} TANDEM: matching proteins with tandem mass spectra. \textit{Bioinformatics} 20(9):1466--7.

\bibitem{PTMTS}
Kertesz-Farkas A, Reiz B, Vera R, Myers MP, Pongor S \textbf{[2014]} PTMTreeSearch: a novel two-stage tree-search algorithm with pruning rules for the identification of post-translational modification of proteins in MS/MS spectra. \textit{Bioinformatics} 30(2):234--41.

\bibitem{maclean}
MacLean B, Eng JK, McIntosh M \textbf{[2006]} General framework for developing and evaluating database scoring algorithms using the TANDEM search engine. \textit{Bioinformatics} 22(22):2830--2.

\bibitem{keller2005}
Keller A, Eng J, Zhang N, Li XJ, Aebersold R \textbf{[2005]} A uniform proteomics MS/MS analysis platform utilizing open XML file formats. \textit{Mol Syst Biol} 1:2005.0017.

\end{thebibliography}
\end{document}