%\VignetteIndexEntry{isobar package for iTRAQ and TMT protein quantification} %\VignetteDepends{} %\VignetteKeywords{Documentation} %\VignettePackage{isobar} \documentclass[11pt]{article} \SweaveOpts{keep.source=TRUE} \SweaveOpts{prefix.string=graphics/plot} <>= require(distr) require(ggplot2) dir.create(file.path("graphics"), showWarnings = FALSE) @ \usepackage{times} \usepackage{hyperref} \usepackage[utf8]{inputenc} \usepackage{xcolor} \usepackage{listings} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\texttt{#1}}} \newcommand{\isobar}{\Rpackage{isobar}} \title{\emph{isobar} package for iTRAQ and TMT protein quantification} \author{Florian P.\ Breitwieser, Jacques Colinge} \begin{document} \maketitle \tableofcontents \section{Introduction} The \Rpackage{isobar} package is an extensible and interactive environment for the data analysis and exploration of iTRAQ and TMT data. \Rpackage{isobar} implements the theory presented in Breitwieser et al., Journal of Proteome Research 2011. Further extensions for the analysis of \small{PTM} data are presented in Breitwieser and Colinge, Journal of Proteomics 2013. \subsection{Installation} \Rpackage{isobar} is part of \href{http://www.bioconductor.org/packages/release/bioc/html/isobar.html}{Bioconductor}. To install the latest stable version available for your version of R, start R and enter: <>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("isobar") @ Bioconductor has semi-annual releases, and the releases are tied to specific to R versions. The latest development version of \Rpackage{isobar} are always available at GitHub (\url{https://github.com/fbreitwieser/isobar}). Use the \Rpackage{devtools} package to install it from the source: <>= library(devtools) install_github("fbreitwieser/isobar") @ \subsection{Loading package} The first thing you need to do is load the package. <>= library(isobar) ## load the isobar package @ Check the installed version of isobar: <>= packageVersion("isobar") @ or equivalently for older versions of R: <>= packageDescription("isobar")$Version @ \section{Loading data}\label{sec:loading-data} \Rpackage{isobar} can read identifications and quantifications from tab-separated and MGF files. Perl scripts are supplied to generate a tab-separated version from the vendor formats of Mascot and Phenyx, see appendix \ref{sec:dependencies}. The ``ID.CSV'' format is described in appendix \ref{app:fileformats}. Rockerbox and MSGF+ CSV file formats can be read, too, and are recognized by their file extension (\texttt{peptides.csv} and \texttt{msgfp.csv}). Experimental support for the mzIdentML format is also available - please contact the maintainer in case of problems. \begin{description} \item[\small ID.CSV] tab-separated file containing peptide-spectra matches and spectrum meta-information such as retention time, m/z and charge. Generated by parser scripts. \item[\small MGF] contains peak lists from which quantitative information on reporter tags are extracted. Must be centroided. \item[\small IBSPECTRA.CSV] tab-separated file containing the same columns as {\small ID.CSV} plus \emph{quantitative information} extracted from {\small MGF} file - that means the reporter tag masses and intensities as additional columns. \end{description} \Robject{readIBSpectra} is the primary function to generate a \Robject{IBSpectra} object. The first argument is one of {\small iTRAQ4plexSpectra}, {\small iTRAQ8plexSpectra}, {\small TMT2plexSpectra}, {\small TMT6plexSpectra} and {\small TMT10plexSpectra} denotes the tag type and therefore class. <>= ## generating IBSpectra object from ID.CSV and MGF ib <- readIBSpectra("iTRAQ4plexSpectra",list.files(pattern=".id.csv"), list.files(pattern=".mgf")) ## write in tabular IBSPECTRA.CSV format to file write.table(as.data.frame(ib),sep="\t",row.names=F, file="myexperiment.ibspectra.csv") ## generate from saved IBSPECTRA.CSV - MGF does not have to be supplied ib.2 <- readIBSpectra("iTRAQ4plexSpectra","myexperiment.ibspectra.csv") @ In case the MGF file is very big, it can be advanteguous to generate a smaller version containing only meta- and quantitative information before import in \texttt{R}. On Linux, the tool \texttt{grep} is readily available. \begin{verbatim} egrep '^[A-Z]|^1[12][0-9]\.' BIG.mgf > SMALL.mgf \end{verbatim} \subsection{\Robject{ibspiked} test samples} The examples presented are based on the dataset \Robject{ibspiked\_set1} which has been designed to test \Rpackage{isobar}'s functionality and searched against the Swissprot human database with Mascot and Phenyx. \Robject{ibspiked\_set1} is an iTRAQ 4-plex data set comprised of a complex background (albumin- and IgG-depleted human plasma) and spiked proteins. MS analysis was performed in ThermoFisher Scientific LTQ Orbitrap HCD instrument with 2D shotgun peptide separation (see original paper for more details). The samples used for each iTRAQ channel are as follows: \begin{itemize} \item Depleted human plasma background (>150 protein detected); \item Spiked-in proteins with the following ratios \begin{itemize} \item \texttt{CERU\_HUMAN (P00450)} at concentrations $1:1:1:1$; \item \texttt{CERU\_RAT (P13635)} at concentrations $1:2:5:10$; \item \texttt{CERU\_MOUSE (Q61147)} at concentrations $10:5:2:1$. \end{itemize} \end{itemize} A second data set with ratios 1:10:50:100 is available as \Robject{ibspiked\_set2} from \url{http://bininformatics.cemm.oeaw.ac.at/isobar}. The Ceruplasmins have been selected as the share peptides. Hereafter, we load the data pacakage and the ceru protein IDs are identified via the \Rfunction{protein.g} function, which provides a mean to retrieve data from \Robject{ProteinGroup} objects. \Robject{ProteinGroup} is a slot of \Robject{IBSpectra} objects and contains informations on proteins and their grouping. See \ref{ss:proteingrouping}. <>= data(ibspiked_set1) ceru.human <- protein.g(proteinGroup(ibspiked_set1),"CERU_HUMAN") ceru.rat <- protein.g(proteinGroup(ibspiked_set1),"CERU_RAT") ceru.mouse <- protein.g(proteinGroup(ibspiked_set1),"CERU_MOUSE") ceru.proteins <- c(ceru.human,ceru.rat,ceru.mouse) @ \subsection{Protein information and grouping in \Robject{ProteinGroup}} \label{ss:proteingrouping} When an \texttt{ibspectra.csv} is read, protein are grouped to identify proteins which have unique peptides. By default, only peptides with unique peptides are grouped. The algorithm to infer protein groups works as follows: \begin{enumerate} \item Group proteins together which have been seen with exactly the same peptides (\Robject{indistinguishableProteins}) - these are the \Robject{protein.g} identifiers. \item Create protein groups (\Robject{proteinGroupTable}): \begin{enumerate} \item Define proteins with specific peptides as reporters (\Robject{reporterProteins}) \item Get proteins which are contained \footnote{That means these proteins have a subset of the peptides of the reporter} by reporterProteins and group them below. \end{enumerate} \item Create protein groups for proteins without specific peptides as above. \end{enumerate} \subsection{Loading data from CID/HCD (or CID/MS3, etc) experiments} A combined CID/HCD approach, in which for each precursor two fragmentation spectra are acquired, has proven useful to increase the number of identified and quantified peptide-spectrum matches. Usually, the reporter intensity information is taken from the HCD spectrum, and the peptide is identified based on the fragment ions in the CID spectrum. To import these experiments, a comma-separated \emph{mapping file} is needed, which contains the association from the \texttt{identification} to the \texttt{quantification} spectrum title. \paragraph{Example mapping file (\texttt{mapping.csv}):} \begin{verbatim} "hcd","cid" "spectrum 1","spectrum 2" "spectrum 3","spectrum 4" \end{verbatim} By calling \Rfunction{readIBSpectra} with \texttt{mapping.file="mapping.csv"}, the spectra titles in the \texttt{id.file} are matched to those in the \texttt{peaklist.file}. If the column names for the quantification and identification spectrum are not \texttt{hcd} and \texttt{cid}, resp., they can be set with the argument \texttt{mapping}=\texttt{c(identification.spectrum="column name 1",quantification.spectrum="column name 2")}. \textbf{Example mapping file \texttt{mapping2.csv}:} \begin{verbatim} "quant-spectrum-ms3","id-spectrum-ms2" "spectrum 1","spectrum 2" "spectrum 3","spectrum 4" \end{verbatim} <>= readIBSpectra(..., mapping.file="mapping2.csv", mapping=c(identification.spectrum="id-spectrum-ms2", quantification.spectrum="quant-spectrum-ms3") ) @ The argument \texttt{mapping.file} can take multiple files as argument (which are read and concatenated), or a \texttt{data.frame}. \subsection{Loading data from CID/HCD experiments with full HCD spectrum} If a full HCD spectrum was acquired, and both the HCD and CID spectrum are searched against a protein database, the argument \texttt{id.file.domap} to \texttt{readIBSpectra} can be used to merge both CID and HCD identifications: <>= readIBSpectra("TMT6plexSpectra", id.file="cid.identifications.csv", peaklist.file="hcd.peaklist.mgf", id.file.domap="hcd.identifications.csv", mapping.file="mapping.csv", ... ) @ Here, the CID (\texttt{cid.identifications.csv}) and HCD identifications (\texttt{hcd.identifications.csv}) are combined and mapped according to \texttt{mapping.csv}. Diverging identifications are discarded. \subsection{Integrating and thresholding precursor purity measures} Savitzki et al., 2011 describe a measure for the precursor purity of fragment spectra. An implementation of the algorithm for the \texttt{multiplierz} platform is provided at \url{http://sourceforge.net/apps/trac/multiplierz/wiki/Precursors}. The script creates \texttt{\_precursors.txt} files for each \texttt{RAW} file. To integrate and threshold based on the \texttt{s2i} measure, use the following code: <>= precursors <- read.delim("file1_precursors.txt",header=TRUE,sep="\t",stringsAsFactors=FALSE) idfile <- read.delim("file1.id.csv",header=TRUE,sep="\t",stringsAsFactors=FALSE) idfile <- merge(idfile,precursors, by.x=c("fixed_spectrum","scans.from"), by.y=c("fixed_spectrum","SCAN"),all.x=TRUE) # FIXME see /work/analysis/gwinter/raw-spectra/analysis.R @ \subsection{\Rpackage{MSnbase} integration} NOTE: It has been reported that the conversion functions are not functional for current versions of the package. We will try to resolve this. \Rpackage{MSnbase} by Laurent Gatto provides data manipulation and processing methods for MS-based proteomics data. It provides import, representation and analysis of raw MS data stored in \texttt{mzXML}, \texttt{mzML} and \texttt{mzData} using the \Rpackage{mzR} package and centroided and un-centroided {\small \texttt{MGF}} peak lists. It allows to use and preprocess raw data whereas \Rpackage{isobar} requires centroided peak lists. In the future, the \Rpackage{isobar} class \Robject{IBSpectra} might be based on or replaced by \Robject{MSnbase}'s class \Robject{MSnSet}. For now, methods for coercion are implemented: <>= as(ibspectra,"MSnSet") as(msnset,"IBSpectra") @ \section{Data Analysis} \subsection{Reporter mass precision} The distribution of observed masses from the reporter tags can be used to visualize the precision of the MS setup on the fragment level and used to set the correct window for isolation. The expected masses of the reporter tags are in the slot \Rfunction{reporterTagMasses} of the implementations of the \Robject{IBSpectra} class. The experimental masses are in the matrix \Robject{mass} of \Robject{AssayData}; they can also be accessed by the method \Rfunction{reporterMasses(x)}. \subsection{\Rpackage{MSnbase} integration} NOTE: It has been reported that the conversion functions are not functional for current versions of the package. We will try to resolve this. \Rpackage{MSnbase} by Laurent Gatto provides data manipulation and processing methods for MS-based proteomics data. It provides import, representation and analysis of raw MS data stored in \texttt{mzXML}, \texttt{mzML} and \texttt{mzData} using the \Rpackage{mzR} package and centroided and un-centroided {\small \texttt{MGF}} peak lists. It allows to use and preprocess raw data whereas \Rpackage{isobar} requires centroided peak lists. In the future, the \Rpackage{isobar} class \Robject{IBSpectra} might be based on or replaced by \Robject{MSnbase}'s class \Robject{MSnSet}. For now, methods for coercion are implemented: <>= as(ibspectra,"MSnSet") as(msnset,"IBSpectra") @ \section{Data Analysis} \subsection{Reporter mass precision} The distribution of observed masses from the reporter tags can be used to visualize the precision of the MS setup on the fragment level and used to set the correct window for isolation. The expected masses of the reporter tags are in the slot \Rfunction{reporterTagMasses} of the implementations of the \Robject{IBSpectra} class. The experimental masses are in the matrix \Robject{mass} of \Robject{AssayData}; they can also be accessed by the method \Rfunction{reporterMasses(x)}. <>= sprintf("%.4f",reporterTagMasses(ibspiked_set1)) ## expected masses mass <- assayData(ibspiked_set1)[["mass"]] ## observerd masses apply(mass,2,function(x) sprintf("%.4f",quantile(x,na.rm=TRUE,probs=c(0.025,0.975)))) @ \Rfunction{reporterMassPrecision} provides a plot of the distribution. <>= print(reporterMassPrecision(ibspiked_set1)) @ \begin{center} \begin{figure}[htp] \includegraphics{graphics/plot-print-reporterMasses} \caption{Reporter mass precision plot.} \label{fig:reporterMassPrecision} \end{figure} \end{center} \subsection{Normalization and isotope impurity correction} \label{sec:norm-isot-impur} Isotope impurity correction factors are supplied by labelling reagent manufacturers. Default values that can be modified by the user are available in \Rpackage{isobar} and corrections are obtained by simple linear algebra. Due to differences between samples it is advisable to normalize data before further processing. By default, \Rfunction{normalize} corrects by a factor such that the median intensities in all reporter channels are equal. See figure ~\ref{fig:normalization}. <>= ib.old <- ibspiked_set1 ibspiked_set1 <- correctIsotopeImpurities(ibspiked_set1) ibspiked_set1 <- normalize(ibspiked_set1) @ <>= png("graphics/fig_maplot_normalize.png",width=6.6,height=3,units="in",res=300,pointsize=8) @ <>= par(mfrow=c(1,2)) maplot(ib.old,channel1="114",channel2="117",ylim=c(0.5,2), main="before normalization") abline(h=1,col="red",lwd=2) maplot(ibspiked_set1,channel1="114",channel2="117",ylim=c(0.5,2), main="after normalization") abline(h=1,col="red",lwd=2) @ <>= dev.off() @ \begin{center} \begin{figure}[htp] \includegraphics{graphics/fig_maplot_normalize} \caption{Ratio versus intensity plots ('MA plots') before and after applying normalization.} \label{fig:normalization} \end{figure} \end{center} \subsection{Fitting a noise model} A noise model is a approximation of the expected technical variation based on signal intensity. It is stable for a certain experimental setup and thus can be learned once. Noise is observed directly when comparing identical samples in multiple channels (1:1 iTRAQ/TMT sample) and we can use \Robject{ibspiked\_set1} background proteins as a 1:1 sample. Therefore we exclude the ceruplasmins before fitting a noise model using \Rfunction{NoiseModel}. See figure ~\ref{fig:maplots}. <>= ib.background <- subsetIBSpectra(ibspiked_set1,protein=ceru.proteins,direction="exclude") noise.model <- NoiseModel(ib.background) @ Though only recommended when sufficient data are available, a method exist for the estimation of a noise model without a 1:1 dataset. It takes longer time as it first computes all the protein ratios to shift spectrum ratios to 1:1. To examplify this procedure, we only take rat and mouse CERU proteins from \Robject{ibspiked\_set1}, see figure ~\ref{fig:maplots}. The resultant noise model is a rough approximation only because of the very limitted data, see Breitwieser et al. Supporting Information, submitted, for a real example. <>= ib.ceru <- subsetIBSpectra(ibspiked_set1,protein=ceru.proteins, direction="include", specificity="reporter-specific") nm.ceru <- NoiseModel(ib.ceru,one.to.one=FALSE,pool=TRUE) @ <>= png("graphics/fig_maplot_noisemodel.png",width=3.3,height=3,units="in",res=300,pointsize=8) @ <>= maplot(ib.background,noise.model=c(noise.model,nm.ceru), channel1="114",channel2="115",ylim=c(0.2,5), main="95% CI noise model") @ <>= dev.off() @ \begin{center} \begin{figure}[htp] \includegraphics{graphics/fig_maplot_noisemodel} \caption{Red lines denote the 95 \% confidence interval as estimated by the noise model on background proteins. The blue line is estimated as non 1:1 noise model based on only spectra of CERU proteins.} \label{fig:maplots} \end{figure} \end{center} \subsection{Protein and peptide ratio calculation} \Rfunction{estimateRatio} calculates the relative abundance of a peptide or protein in one tag compared to another. It calculates a weighted average (after outlier removal) of the spectrum ratios. The weights are the inverse of the spectrum ratio variances. It requires a \Robject{IBSpectra} and \Robject{NoiseModel} object and definitions of channel1, channel2, and the protein or peptide. The result is channel2/channel1. <>= ## Calculate ratio based on all spectra of peptides specific ## to CERU_HUMAN, CERU_RAT or CERU_MOUSE. Returns a named ## numeric vector. 10^estimateRatio(ibspiked_set1,noise.model, channel1="114",channel2="115", protein=ceru.proteins)['lratio'] ## If argument 'combine=FALSE', estimateRatio returns a data.frame ## with one row per protein 10^estimateRatio(ibspiked_set1,noise.model, channel1="114",channel2="115", protein=ceru.proteins,combine=FALSE)[,'lratio'] ## spiked material channel 115 vs 114: ## CERU_HUMAN (P00450): 1:1 ## CERU_RAT (P13635): 2:1 = 2 ## CERU_MOUSE (Q61147): 5:10 = 0.5 ## Peptides shared between rat and mouse pep.shared <- peptides(proteinGroup(ibspiked_set1), c(ceru.rat,ceru.mouse),set="intersect", columns=c('peptide','n.shared.groups')) ## remove those which are shared with other proteins pep.shared <- pep.shared$peptide[pep.shared$n.shared.groups==2] ## calculate ratio: it is between the rat and mouse ratios 10^estimateRatio(ibspiked_set1,noise.model, channel1="114",channel2="115", peptide=pep.shared)['lratio'] @ When examining the global differences and differences in between classes, \Rfunction{proteinRatios} can be used. It is also suitable to inspect sample variability. The argument \Robject{cl} can be used to define class labels. If \Robject{combn.method='interclass'} or \Robject{intraclass} and \Robject{summarize=TRUE}, \Rfunction{proteinRatios} return a single summarized ratio across and within classes, resp.. <>= protein.ratios <- proteinRatios(ibspiked_set1,noise.model,cl=c("1","0","0","0")) ## defined class 114 and 115 as class 'T', 116 and 117 as class 'C' classLabels(ibspiked_set1) <- c("T","T","C","C") proteinRatios(ibspiked_set1,noise.model,protein=ceru.proteins, cl=classLabels(ibspiked_set1),combn.method="interclass", summarize=T)[,c("ac","lratio","variance")] @ \subsection{Protein ratio distribution and selection} Protein ratio distributions can be calculated ideally on biological replicated. To examine differentially expressed proteins, both sample variability information (random protein ratios) as a \emph{fold-change} constraint, and ratio \emph{precision} can be used. For a experimental setup with biolgical replicates in the same experiment (but different channels), the distribution of biological variability can be learned by computing the ratios between the replicates. With no replicates available, one has the choice to (a) model the actual protein ratios and just select the most extreme ratios; (b) learn the distribution from a previous experiment; or (c) assume a standard Cauchy distribution with location $0$ and scale $0.1$, $0.05$, and $0.025$, which correspond with $\alpha=0.05$ roughly to fold changes of $4$, $2$, and $1.5$. A Cauchy distribution fits accurately this type of random protein ratio distribution: Cauchy is displayed in red, Gaussian in blue. In the case of \Robject{ibspiked\_set1}, the many 1:1 proteins provide us with adequate data to learn the random protein ratio distribution, however only of the \textit{technical} variation. <>= #protein.ratios <- proteinRatios(ibspiked_set1,noise.model) protein.ratiodistr.wn <- fitWeightedNorm(protein.ratios[,'lratio'], weights=1/protein.ratios[,'variance']) protein.ratiodistr.cauchy <- fitCauchy(protein.ratios[,"lratio"]) @ \begin{center} \begin{figure}[htp] <>= library(distr) # required library limits=seq(from=-0.5,to=0.5,by=0.001) curve.wn <- data.frame(x=limits,y=d(protein.ratiodistr.wn)(limits)) curve.cauchy<-data.frame(x=limits,y=d(protein.ratiodistr.cauchy)(limits)) g <- ggplot(data.frame(protein.ratios),aes(x=lratio)) + geom_histogram(colour = "darkgreen", fill = "white",aes(y=..density..), binwidth=0.02) + geom_rug() + geom_line(data=curve.wn,aes(x=x,y=y),colour="blue") + geom_line(data=curve.cauchy,aes(x=x,y=y),colour="red") print(g) @ \caption{Histogram of all protein ratios in \Robject{ibspiked\_set1}. A fit with a Gaussian and Cauchy probability density function is shown in blue and red, respectively.} \label{fig:ratiodistr} \end{figure} \end{center} Now, when supplying a \Robject{ratiodistr} parameter to \Rfunction{estimateRatio} and \Rfunction{proteinRatios}, sample and signal p-values are calculated, what we illustrate in the code below <>= rat.list <- estimateRatio(ibspiked_set1,noise.model=noise.model,channel1="114",channel2="115", protein=reporterProteins(proteinGroup(ibspiked_set1)),combine=F, ratiodistr=protein.ratiodistr.cauchy) rat.list[rat.list[,"is.significant"]==1,] @ \subsection{Detection of proteins with no specific peptides} It is well known that MS analysis only reveals the presence of so-called protein groups, defined as sets of proteins identified by the same set of peptides. The protein that contains all the peptides is the group reporter (there are possibly several group reporters) and if it has one specific peptide at least then its presence in the sample is certain. The status of the other proteins in the group is in general impossible to determine. When quantitative information is available, there is a potential to elucidate the structure of part of the protein groups. In the example below, a subset \Rfunction{IBSpectra} object is created, containing only peptides shared between CERU\_RAT and CERU\_MOUSE, and those specific to CERU\_RAT. <>= ## peptides shared between CERU_RAT and CERU_MOUSE have been computed before pep.shared ## peptides specific to CERU_RAT pep.rat <- peptides(proteinGroup(ibspiked_set1),protein=ceru.rat, specificity="reporter-specific") ## create an IBSpectra object with only CERU_RAT and shared peptides ib.subset <- subsetIBSpectra(ibspiked_set1, peptide=c(pep.rat,pep.shared),direction="include") ## calculate shared ratios sr <- shared.ratios(ib.subset,noise.model, channel1="114",channel2="117", ratiodistr=protein.ratiodistr.cauchy) sr @ \begin{center} \begin{figure}[htp] <>= ## plot significantly different protein groups where 90% CI does not overlap ## CERU_MOUSE and CERU_RAT is detected, as expected. shared.ratios.sign(sr,z.shared=1.282) @ \caption{Peptides of spiked ceruplasmins have significantly different ratios between groups. Group \textit{reporter} consists of peptides specific to \texttt{CERU\_RAT (P13635)}, group \textit{member} are peptides shared between \texttt{CERU\_RAT} and \texttt{CERU\_MOUSE (Q61147)}.} \label{fig:sharedratios} \end{figure} \end{center} \section{Report generation} \Rpackage{isobar} provides a rich interface for creating Excel and PDF reports for further analysis and quality control. The main entry function is \texttt{create.reports}. Alternatively the Rscript \texttt{create\_reports.R} can be used. It is located in the \texttt{report} folder of the \isobar{} installation, and reads the properties from a file in the working directory. The posible values are defined in the \texttt{report/properties.R} file in the isobar installation. To generate a report with standard properties the following code should do the trick: <>= create.reports(type="iTRAQ4plexSpectra", identifications="my.id.csv",peaklist="my.mgf") @ The properties can also be defined in a \texttt{properties.R} file which is located in the working directory. The properties are set in the following order: \begin{itemize} \item 'global' properties in \texttt{ISOBAR-DIRECTORY/report/properties.R}\footnote{located in \texttt{system.file('report','properties.R',package='isobar')}} \item 'local' properies in \texttt{WORKING-DIRECTORY/properties.R} \item command line arguments to \texttt{create\_reports.R} or \texttt{create.reports} function \end{itemize} Appendix \ref{sec:properties} provides a syntax-highlighted version of the properties file supplied with isobar, which sets the default parameters and provides some help in the comments. The number of paramters which can be set may seem a lot at first, however most times only a few are needed. For successful completion, \LaTeX - for the PDF reports - and Perl - for the Excel reports - need to be installed. \subsection{Files used for report generation} <>= ## execute to find the path and file location in your installation. system.file("report",package="isobar") ## path list.files(system.file("report",package="isobar")) ## files @ \begin{description} \item[create\_reports.R] R script which can be used to create QC and PDF reports It initializes the environment, reads properties and calls \Rfunction{Sweave} on QC and DA Sweave files. Additionally it generates a Excel data analysis report by calling \texttt{tab2xls.pl}. \item[isobar-qc.Rnw] Sweave file with quality control plots. \item[isobar-analysis.Rnw] Sweave file for generating a data analysis report with the list of all protein ratios and list of significantly different proteins. \item[properties.R] Default configuration for \texttt{create\_reports.R}. It is parsed as R code. \item[report-utils.R] Helper R functions used in Sweave documents. \item[report-utils.tex] Helper \LaTeX{} functions used in Sweave documents. \end{description} \begin{appendix} \section{File formats} \label{app:fileformats} \subsection{ID CSV file format} The Perl parsers create ID CSV files - identification information for all matched spectra without quantitative information. You can create your own parser, the resulting file should be tab-delimited and contain the following columns. Only bold columns are obligatory. The information is redundant - that means if a peptide may stem from two different proteins the information of the identification is repeated. \begin{tabular}{r|l} \textbf{accession} & Protein AC \\ \textbf{peptide} & Peptide sequence\\ modif & Peptide modification string\\ charge & Charge state\\ theo.mass & Theoretical peptide mass\\ exp.mass & Experimentally observed mass\\ parent.intens & Parent intensity\\ retention.time & Retention time\\ \textbf{spectrum} & Spectrum identifier\\ search.engine & Protein search engine and score\\ \end{tabular} \subsection{IBSpectra CSV file format} IBSpectra file format has the same columns as the ID CSV format and additionally columns containing the quantitation information, namely \texttt{X\textit{tagname}\_mass} and \texttt{X\textit{tagname}\_ions}, for mass and intensity of each tag \textit{tagname}. Below an example of the further columns for an \texttt{iTRAQ 4plex} IBSpectra. \begin{tabular}{r|l} \textbf{X114\_mass} & reporter ion mass\\ \textbf{X115\_mass} & reporter ion mass\\ \textbf{X116\_mass} & reporter ion mass\\ \textbf{X117\_mass} & reporter ion mass\\ \textbf{X114\_ions} & reporter ion intensity\\ \textbf{X115\_ions} & reporter ion intensity\\ \textbf{X116\_ions} & reporter ion intensity\\ \textbf{X117\_ions} & reporter ion intensity\\ \end{tabular} \section{\texttt{properties.R} for report generation} \label{sec:properties} \definecolor{comment}{HTML}{000065} \definecolor{string}{HTML}{DD0000} \definecolor{special}{HTML}{0A0AF0} { \lstset{ % language=R, % the language of the code basicstyle={\footnotesize \tt}, % the size of the fonts that are used for the code commentstyle=\color{comment}, stringstyle=\color{string}, linewidth=\textwidth } <>= #p <- readLines(system.file("report","properties.R",package="isobar")) #p <- sub("(#.*)","\\\\textcolor{comment}{\\1}",p) #for (i in c("TRUE","FALSE","NULL")) # p <- sub(i,paste("\\\\textcolor{special}{",i,"}",sep=""),p) #p <- sub('(".*?")',"\\\\textcolor{string}{\\1}",p) #p <- gsub("#","\\#",p,fixed=TRUE) #p <- gsub("_","\\_",p,fixed=TRUE) #cat("\\begin{lstlisting}") #cat(p,sep="\n") #cat("\\end{lstlisting}") cat(sprintf("\\lstinputlisting{%s}\n",system.file("report","properties.R",package="isobar"))) @ } \section{Dependencies}\label{sec:dependencies} \subsection{\LaTeX{} and PGF/TikZ} \LaTeX{} is a high-quality typesetting system; it includes features designed for the production of technical and scientific documentation. It is available as free software\footnote{\url{http://www.latex-project.org}}. PGF is a \TeX{} macro package for generating graphics It comes with a user-friedly syntax layer called TikZ\footnote{\url{http://sourceforge.net/projects/pgf}}. \LaTeX{} is used for creating PDF analysis reports, with the PGF package creating the graphics. Go to \url{http://www.latex-project.org} to get information on how to download and install a \LaTeX{} system and packages. \subsection{Perl}\label{app:installing-perl} Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl is required for two tasks: \begin{itemize} \item Conversion of Pidres XML and Mascot DAT files to ID CSV format; \item Creation of Microsoft Excel format data analysis report. \end{itemize} Go to \url{http://www.perl.org} to download and get help on the installation of Perl on your Operating System. For file format conversion, perl module \texttt{Statistics::Lite} is required. For Excel export \texttt{Spreadsheet::WriteExcel}. All Perl scripts are in the subdirectory \texttt{pl} of the isobar package installation. <>= ## execute to find the path and file location in your installation. system.file("pl",package="isobar") ## path list.files(system.file("pl",package="isobar")) ## files @ \texttt{mascotParser2.pl} and \texttt{pidresParser2.pl} convert from respective protein search outputfiles to a XML file format, which can be converted into a CSV file readable by \textit{isobar} by using \texttt{psx2tab2.pl}. \texttt{mascotParser2.pl} coverts from Mascot format, and requires the file \texttt{modifconv.csv} as a definition of modification names. \texttt{pidresParser2.pl} converts from Phenyx output and requires the file \texttt{parsersConfig.xml}. \texttt{tab2xls.pl} converts csv file to different sheets of an Excel spreadsheet. \label{app:itraqmascotparser-help} <>= ## execute on your system system(paste("perl",system.file("pl","mascotParser2.pl",package="isobar"), "--help")) print(paste("perl",system.file("pl","pidresParser2.pl",package="isobar"), "--help")) @ \section{Session Information} The version number of R and packages loaded for generating the vignette were: <<>= toLatex(sessionInfo()) @ \end{appendix} \end{document} \bibliographystyle{plainnat}