%\VignetteIndexEntry{sapFinder Vignette} %\VignetteDepends{sapFinder} %\VignetteEngine{utils::Sweave} %\VignetteKeywords{Mass Spectrometry (MS), SAP, bioinformatics} %\VignettePackage{sapFinder} \documentclass[12pt]{article} <>= BiocStyle::latex(use.unsrturl=FALSE) @ \usepackage{hyperref} \usepackage{url} \usepackage[numbers]{natbib} \usepackage{graphicx} \usepackage{float} \usepackage[section]{placeins} \bibliographystyle{plainnat} \author{Bo Wen, Shaohang Xu} \begin{document} \SweaveOpts{concordance=TRUE} \title{sapFinder User Guide} \maketitle \tableofcontents %------------------------------------------------------------------ \section{Introduction} %------------------------------------------------------------------ This vignette describes the functionality implemented in the \Rpackage{sapFinder} package. \Rpackage{sapFinder} is developed to automate \begin{enumerate} \item variation-associated database construction from public single nucleotide variations (SNVs) database or sample-specific genome-wide association studies (GWAS) and RNA-Seq data; \item database searching; \item post-processing; \item HTML-based report generation. \end{enumerate} %------------------------------------------------------------------ \section{Variation-associated database construction} %------------------------------------------------------------------ Currently, two kinds of variation-associated databases can be constructed by using \Rpackage{sapFinder} package. One is the sample-specific variation-associated database and the other is the aggregate database that is created from the public SNV repositories, such as dbSNP \citep{dbsnp} and COSMIC \citep{cosmic}. %------------------------------------------------------------------ \section{Based on sample-specific SNV data} %------------------------------------------------------------------ \subsection{Input data} \subsubsection{Input data} To construct sample-specific variation-associated database by using \Rpackage{sapFinder}, three files are required as input. One is a Variant Call Format (VCF) file which can be generated from a BAM file using single nucleotide polymorphism calling tools such as SAMtools \citep{samtool} and the Genome Analysis Toolkit (GATK) \citep{gatk}. The other two files are gene annotation file and FASTA format mRNA sequence file which can be downloaded by users from the University of California, Santa Cruz (UCSC) table browser. For non-model organisms, users can manually provide these files in the format of NCBI or ENSEMBL. \subsubsection{Preparing annotation files from UCSC table brower} To map variation information to the protein level, numerous pieces of genome annotation information are needed, such as exon region boundary, CDS region boundary, mRNA sequence et al. It is possible to manually download these data from The Table Browser of UCSC (\url{http://genome.ucsc.edu/cgi-bin/hgTables?command=start}). Currently,before the construction of variation-associated database,it requires users to download a tab-separated positional table annotation file and a corresponding mRNA sequence FASTA file from UCSC table brower. Since Refseq updates from time to time, we suggest generating those files in a same day as running. The bullet list below summarizes the steps to download RefSeq genes annotation file and mRNA Sequence file. \begin{itemize} \item Go to UCSC Table browser \item Choose genome (e.g. "Human") \item Choose assembly (e.g. "2009 GRCh37/hg19") \item Choose group (e.g. "Genes and Gene Predictions") \item Choose track (e.g. "RefSeq Genes") \item Choose table (e.g. "refGene") \item Choose region (e.g. "genome") \item Choose output format "all fields from selected table" (retrieves annotation file) \begin{itemize} \item Enter output filename(e.g. \verb|"hg19_refGene.txt"|) \item Press "get output" button \end{itemize} \item Choose output format "sequence" (retrieves mRNA file) \begin{itemize} \item Enter output filename(e.g. \verb|"hg19_refGeneMrna.fa"|) \item Press "get output" button \item Select "mRNA" sequence type,and press "submit" button \end{itemize} \end{itemize} The bullet list below summarizes the steps to download Ensembl genes annotation file and mRNA Sequence file. \begin{itemize} \item Go to UCSC Table browser \item Choose genome (e.g. "Human") \item Choose assembly (e.g. "2009 GRCh37/hg19") \item Choose group (e.g. "Genes and Gene Predictions") \item Choose track (e.g. "Ensembl Genes") \item Choose table (e.g. "ensGene") \item Choose region (e.g. "genome") \item Choose output format "all fields from selected table" (retrieves annotation file) \begin{itemize} \item Enter output filename(e.g. \verb|"hg19_ensGene.txt"|) \item Press "get output" button \end{itemize} \item Choose output format "sequence" (retrieves mRNA file) \begin{itemize} \item Enter output filename(e.g. \verb|"hg19_ensGeneMrna.fa"|) \item Press "get output" button \item Select "genomic" sequence type,and press "submit" button \item In Sequence Retrieval Region Options.Select three checkboxes("5'UTR Exons","CDS Exons" and "3'UTR Exons") and select one radiobutton("One FASTA record per gene.") \item In Sequence Formatting Options.Select "Exons in upper case, everything else in lower case." button \item Press "get sequence" button \end{itemize} \end{itemize} Users need only to choose one type of annotation files above (Refseq or Ensembl) to download. \subsubsection{The external cross reference file} The xref file is an optional input for sapFinder.You can obtain it from BioMart Central Portal (\url{http://central.biomart.org/martwizard/#!/Search_by_database_name?mart=Ensembl 75 Genes (WTSI, UK)}) or MartView (\url{http://biomart.intogen.org/biomart/martview/}),The following table shows the major features must be selected: \begin{figure}[H] \centering \includegraphics[width=0.35\textwidth]{biomart.png} \caption{Summary of biomart features} \label{fig:xref} \end{figure} \subsubsection{Example code} In additon to download the annotation files and xref file by yourself, you can also obtain them from the \verb|sapfinder_pipeline| repository in bitbucket (\url{https://bitbucket.org/xushaohang/sapfinder_pipeline}) in the \verb|"annotation_files"| directory. The example data is extracted from a recently publised study \citep{gloria2014}. <>= library(sapFinder) vcf <- system.file("extdata/sapFinder_test.vcf", package="sapFinder") annotation <- system.file("extdata/sapFinder_test_ensGene.txt", package="sapFinder") refseq <- system.file("extdata/sapFinder_test_ensGeneMrna.fa", package="sapFinder") xref <- system.file("extdata/sapFinder_test_BioMart.Xref.txt", package="sapFinder") outdir <- "db_dir" prefix <- "sapFinder_test" db.files <- dbCreator(vcf=vcf, annotation=annotation, refseq=refseq, outdir=outdir, prefix=prefix,xref=xref) @ Two files are outputed. One is a variation-associated database file which is written in FASTA format to the directory specified and contains the mutated peptides, the normal protein sequences and their reverse counterparts. The other is a tab-delimited file which contains the variant peptides information. Both files will be used in the following steps. \subsection{Based on public SNV database} The usage of creating variation-associated database from public SNV repositories is same as that based on sample-specific SNV data. Currently, \Rpackage{sapFinder} can be used to create variation-associated database based on the data from dbSNP \citep{dbsnp} and COSMIC \citep{cosmic}. The required VCF files can be downloaded from their ftp sites. %------------------------------------------------------------------ \section{MS/MS data searching} %------------------------------------------------------------------ After the variation-associated database constructed, \Biocpkg{rTANDEM} package \citep{rTANDEM} is adopted to search the database against tandem mass spectra to detect variant peptides. \Biocpkg{rTANDEM} package interfaces with the popular used open source search engine \software{X!Tandem} \citep{tandem} algorithm in R. <>= outdir<-"." mgf.path <- system.file("extdata/sapFinder_test.mgf", package="sapFinder") protein.db <- db.files[1] xml.path <- runTandem(spectra=mgf.path, fasta=protein.db, outdir = outdir,tol=10, tolu="ppm", itol=0.1, itolu="Daltons") @ The results are written in xml format to the directory specified and will be loaded for further processing. %------------------------------------------------------------------ \section{Post-processing} %------------------------------------------------------------------ After the MS/MS data searching, the function \Rfunction{tanparser} can be used to parse the search result. It calculates the q-value for each peptide spectrum matches (PSMs) and then utilizes the Occam's razor approach \citep{Nesvizhskii2003} to deal with degenerated wild peptides by finding a minimum subset of proteins that covered all of the identified wild peptides. <>= parserGear(file=xml.path, db=db.files[1], outdir='parser_outdir', prefix=prefix) @ It exports some tab-delimited files containing the peptide identification result and protein identification result. The annotated spectra for the identified variant peptides which pass the threshold are exported. This function also accepts the "raw" Mascot result file as input(dat format). For instance, <>= dat_file<-"mascot_raw.dat" parserGear(file=dat_file, db=db.files[1], outdir='parser_outdir', prefix=prefix) @ Unfortunately,we don't offer the wrapper function for Mascot search under current conditions. So you have to launch the independent identification by Mascot. %------------------------------------------------------------------ \section{HTML-based report generation} %------------------------------------------------------------------ The results are then summarised and compiled into an interactive HTML report. <>= reportCreator(indir="parser_outdir", db= db.files[1], varInfor=db.files[2],prefix=prefix) @ After the analysis has completed, the file \file{index.html} in the output directory can be opened in a web browser to access report generated. %------------------------------------------------------------------ \section{Integrated function \Rfunction{easyRun}} %------------------------------------------------------------------ The function \Rfunction{easyRun} automates the data analysis process. It will process the dataset in the following way: \begin{enumerate} \item Variation-associated database construction \item MS/MS searching \item Post-processing \item HTML-based report generation \end{enumerate} This function can be called as following: <>= vcf <- system.file("extdata/sapFinder_test.vcf", package="sapFinder") annotation <- system.file("extdata/sapFinder_test_ensGene.txt", package="sapFinder") refseq <- system.file("extdata/sapFinder_test_ensGeneMrna.fa", package="sapFinder") mgf.path <- system.file("extdata/sapFinder_test.mgf", package="sapFinder") xref <- system.file("extdata/sapFinder_test_BioMart.Xref.txt", package="sapFinder") easyRun(vcf=vcf,annotation=annotation,refseq=refseq, outdir="test",prefix="sapFinder_test", spectra=mgf.path,cpu=0,tol=10, tolu="ppm", itol=0.1,itolu="Daltons",xref=xref) @ After the analysis has completed, the file \file{index.html} in the output directory can be opened in a web browser to access report generated. %------------------------------------------------------------------ \section{Session Info} %------------------------------------------------------------------ Here is the output of \Rfunction{sessionInfo}: <>= toLatex(sessionInfo()) @ \bibliography{sapFinder} \end{document}