%\VignetteIndexEntry{Using the PAnnBuilder Package} %\VignetteKeywords{Annotation} %\VignetteDepends{PAnnBuilder} %\VignettePackage{PAnnBuilder} \documentclass[12pt,fullpage]{article} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \usepackage{amsmath,pstricks,fullpage} % With MikTeX 2.9, using pstricks also requires using auto-pst-pdf or running % pdflatex will fail. Note that using auto-pst-pdf requires to set environment % variable MIKTEX_ENABLEWRITE18=t on Windows, and to set shell_escape = t in % texmf.cnf for Tex Live on Unix/Linux/Mac. \usepackage{auto-pst-pdf} \usepackage{hyperref} \usepackage{url} \usepackage[authoryear,round]{natbib} \usepackage{multirow} \author{Hong Li$^\ddagger$\footnote{sysptm@gmail.com}} \begin{document} \title{Using the PAnnBuilder Package} \maketitle \begin{center}$^\ddagger$Key Lab of Systems Biology\\ Shanghai Institutes for Biological Sciences\\ Chinese Academy of Sciences, P. R. China \end{center} \tableofcontents \section{Overview} In genomic era, genome-scale experiments and data analyses require genes to be annotated from different sources, an R \cite{1} package AnnBuilder was developped for this purpose \cite{2}. In post genomic era, advances in proteomics highlight the urgence of understanding protein language \cite{3}. However, relative to genes, AnnBuilder is limited in protein annotation due to the complexity of proteins caused by 3-D strucutre, alternative splicing, modification, dynamic location and other features. The package PAnnBuilder focuses on assembling proteomic annotation data, which should be very useful for proteomic data interpretation. It not only inherits the good features of AnnBuilder such as automatically parsing protein annotation data and building R packages from selected sources, but also emphasizes specific features of proteins. PAnnBuilder currently supports 16 databases involving diverse aspects of proteomics, such as protein primary data, protein domain/family, subcellular location, protein interaction, post-translational modifications, body fluid proteomics, homolog/ortholog groups and so on. Additionally, PAnnBuilder allows annotation to unknown proteins based on sequence similarities to other well-annotated proteins. To extend the use of PAnnBuilder, 54 standard R annotation packages are produced from main protein databases, which are freely available for download via biocLite. \section{Getting Started} \subsection{Requirements} PAnnBuilder requires the support from the following items: \begin{enumerate} \item For PAnnBuilder >= 1.3.0, R >=2.7.0 is needed for building SQLite-based package. Dependent R packages are needed to be installed: \Rpackage{methods}, \Rpackage{utils}, \Rpackage{RSQLite}, \Rpackage{Biobase} (>= 1.17.0), \Rpackage{AnnotationDbi (>= 1.3.12)}. If you use the installation script "PAnnBuilder.R" to install PAnnBuilder (>=1.3.0), it will automatically check these dependent packages and install missing packages from CRAN or Bioconductor. \item Rtools is needed for Windows user. The matched version of Rtools with R can be downloaded via {\url{http://www.murdoch-sutherland.com/Rtools/}}. \item Perl is required to parse the rather large annotation source data files. It can be downloaded from {\url{http://www.perl.com/download.csp}}. \item Program formatdb and BLASTP is needed for function \Rfunction{crossBuilder}, \Rfunction{crossBuilder\_DB}, \Rfunction{pSeqBuilder}, \Rfunction{pSeqBuilder\_DB}. BLAST can be downloaded from http://www.ncbi.nlm.nih.gov/BLAST/download.shtml. \end{enumerate} Note: It is better to have enough space for the temporary directory. The path of the per-session temporary directory can be acquired by: <>= tempdir() @ \subsection{Installation} The biocLite script is used to install PAnnBuilder from within R: <>= source("http://bioconductor.org/biocLite.R") biocLite("PAnnBuilder") @ <>= library(PAnnBuilder) @ Note: Web Connection is needed to install PAnnBuilder and its depended packages. \subsection{Public Data Sources} PAnnBuilder parses proteomics annotation data from public sources and build R annotation packages. It also provides convenient functions to access these sources. For example, you can get all supported databases for "Homo sapiens" by: <>= library(PAnnBuilder) ## Load package getALLUrl("Homo sapiens") ## Get urls getALLBuilt("Homo sapiens") ## Get version/release @ Detail description of all supported public data sources in PAnnBuilder are as follows: \begin{description} \item[UniProt] The data {\url{ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase}} will be used to map protein UniProt identifiers to diverse annotation available in UniProt database. \item[IPI] The data {\url{ftp://ftp.ebi.ac.uk/pub/databases/IPI/current}} will be used to map protein IPI identifiers to diverse annotation available in International Protein Index database. \item[RefSeq] The data {\url{ftp://ftp.ncbi.nih.gov/refseq}} will be used to map protein RefSeq identifiers to diverse annotation available in NCBI RefSeq database. \item[Entrez Gene] The data {\url{ftp://ftp.ncbi.nih.gov/gene/DATA}} will be used to annotate genes after the Entrez Gene identifiers have been obtained. \item[Gene Ontology] The data {\url{ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes}} and {\url{http://archive.geneontology.org/latest-termdb}} will be used to obtain gene ontology information. \item[KEGG] Some data at {\url{ftp://ftp.genome.jp/pub/kegg}} will be used to obtain pathway information. \item[HomoloGene] A data file provided by {\url{ftp://ftp.ncbi.nlm.nih.gov/pub/HomoloGene/current/}} will be used to extract mappings between GeneID/ProteinGI and HomoloGene ids. \item[InParanoid] The data {\url{http://inparanoid.sbc.su.se}} will be used to obtain ortholog protein groups between two organisms. \item[Gene Interaction] The data {\url{ftp://ftp.ncbi.nih.gov/gene/GeneRIF}} will be used to extract protein-protein interactions between Entrez GeneID or Protein RefSeq ids. \item[IntAct] The data {\url{ftp://ftp.ebi.ac.uk/pub/databases/intact/current/psimitab}} will be used to extract protein-protein interactions between UniProt protein accession numbers. \item[MPPI] The data {\url{http://mips.gsf.de/proj/ppi/data/mppi.gz}} will be used to extract protein-protein interactions between UniProt protein accession numbers. \item[3DID] The data {\url{http://gatealoy.pcb.ub.es/3did/download}} will be used to extract domain-domain interactions between Pfam domain identifiers. \item[DOMINE] The data {\url{http://domine.utdallas.edu}} will be used to extract domain-domain interactions between Pfam domain identifiers. \item[DBSubLoc] The data {\url{http://www.bioinfo.tsinghua.edu.cn/~guotao}} will be used to obtain subcellular localization for protein from SWISS-PROT and PIR database. \item[BaCelLo] The data {\url{http://gpcr2.biocomp.unibo.it/bacello}} will be used to map SwissProt eukaryotes protein identifiers to subcellular localization. \item[SCOP] The data {\url{http://scop.mrc-lmb.cam.ac.uk/scop}} will be used to map PDB structure identifiers to SCOP domain identifiers. \item[PeptideAtlas] The data {\url{http://www.peptideatlas.org/builds}} will be used to obtain experimentally identified peptides and their coordinates on chromosomes. \item[SysPTM] The data {\url{http://www.biosino.org/papers/SysPTM}} will be used to obtain protein post-translational modifications information. \item[Sys-BodyFluid] The data {\url{http://www.biosino.org/papers/Sys-BodyFluid}} will be used to map IPI protein identifiers to body fluids. \end{description} \subsection{Annotation Packages Produced by PAnnBuilder} \subsubsection{Packages produced by PAnnBuilder} PAnnBuilder has powerful ability to build R package for assembling proteome annotation. However, the process of building new package may be time-consuming because of the downloading and parsing of large data files. To make PAnnBuilder useful for any users, we have built many frequently used annotation packages in advance. These pre-built package can be downloaded via biocLite. These packages are divided into two classes: enviroment-based packages built by "*Builder" functions (see Table~\ref{table:packages1}); SQLite based packages built by "*Builder\_DB" functions (see Table~\ref{table:packages2}). They are widely used methods for building Bioconductor meta-data packages. Each SQLite-based annotation package (identified by a ".db" suffix in the package name) contains a number of AnnDbBimap objects in place of the environment objects found in the old-style environment-based annotation packages. In future, SQLite-based annotation package will replace environment-based packages. The pre-built packages provide a quick start for R beginners. If one wants to analyze protein set in Human IPI database, the quickest way is to download and use SQLite based package "org.Hs.ipi.db". However, if the package one wants has not been built or a new-version database is released, new package should be built using functions in PAnnBuilder (See Section 4 for methods of building annotation packages). \begin{table} \footnotesize \caption{\label{table:packages2}SQLite-based Annotation Packges Produced by PAnnBuilder.} \begin{center} \begin{tabular}{|p{0.23\textwidth}|p{0.18\textwidth}|p{0.12\textwidth}|p{0.17\textwidth}|p{0.15\textwidth}|p{0.15\textwidth}|} \hline \bf{Description} & \bf{R Function} & \bf{Source} & \bf{Organism} & \bf{Package} \\ \hline \multirow{9}{0.25\textwidth}{complete and canonical annotaion for all proteins of a specific organism, including protein description, Entrez gene identifier, KEGG pathway, gene ontology, domain, and so on.} & \multirow{9}{0.2\textwidth}{pBaseBuilder\_DB} & \multirow{3}{0.15\textwidth}{IPI} & Homo sapiens & org.Hs.ipi.db \\ & & & Mus musculus & org.Mm.ipi.db \\ & & & Rattus norvegicus & org.Rn.ipi.db \\ & & \multirow{3}{0.15\textwidth}{Swiss-Prot} & Homo sapiens & org.Hs.sp.db \\ & & & Mus musculus & org.Mm.sp.db \\ & & & Rattus norvegicus & org.Rn.sp.db \\ & & \multirow{3}{0.15\textwidth}{RefSeq} & Homo sapiens & org.Hs.ref.db \\ & & & Mus musculus & org.Mm.ref.db \\ & & & Rattus norvegicus & org.Rn.ref.db \\ \hline \multirow{3}{0.25\textwidth}{protein indentifier mapping} & \multirow{3}{0.2\textwidth}{crossBuilder\_DB} & \multirow{3}{0.15\textwidth}{Swiss-Prot, IPI, RefSeq} & Homo sapiens & org.Hs.cross.db \\ & & & Mus musculus & org.Mm.cross.db \\ & & & Rattus norvegicus & org.Rn.cross.db \\ \hline \multirow{5}{0.25\textwidth}{protein-protein or domain-domain interaction data} & \multirow{5}{0.2\textwidth}{intBuilder\_DB} & IntAct & & int.intact.db \\ & & NCBI Gene & & int.geneint.db \\ & & MPPI & & int.mppi.db \\ & & 3did & & int.did.db \\ & & Domine & & int.domine.db \\ \hline \multirow{2}{0.25\textwidth}{protein subcell location} & \multirow{2}{0.2\textwidth}{subcellBuilder\_DB} & BaCelLo & & sc.bacello.db \\ & & DBSubLoc & & sc.dbsubloc.db \\ \hline protein structure classification & scopBuilder\_DB & SCOP & & scop.db \\ \hline protein post-translational modification & ptmBuilder\_DB & SysPTM & & sysptm.db \\ \hline body fluid protein & bfBuilder\_DB & Sys-BodyFluid & Homo sapiens & org.Hs.bf.db \\ \hline homolog protein group & HomoloGeneBuilder\_DB & HomoloGene & & homolog.db \\ \hline ortholog protein group & InParanoidBuilder\_DB & InParanoid & Homo sapiens, Mus musculus & org.HsMm.ortholog.db \\ \hline peptides identified by mass spectrometry & PeptideAtlasBuilder\_DB & PeptideAtlas & Homo sapiens & org.Hs.pep.db \\ \hline gene ontology & GOABuilder\_DB & GOA & Homo sapiens & org.Hs.goa.db \\ \hline identifier and name & dNameBuilder\_DB & & & dName.db \\ \hline \end{tabular} \end{center} \end{table} \subsubsection{Using annotation data package} SQLite-based *.db packages are capable of flexible data queries, reverse mapping, and data filtering. Vignette of "AnnotationDbi" detailedly described how to use SQLite based packages ({\url{http://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/AnnotationDbi.pdf}}). Here package "org.Hs.ipi.db" produced by PAnnBuilder is illustrated as an example in the following code chuck. \begin{enumerate} \item Install and load annotation package. Package "org.Hs.ipi" is used as an example, other packages can be derived accordingly. <>= biocLite("org.Hs.ipi.db") @ <>= library(org.Hs.ipi.db) # Load annotation package @ \item Browse data in the package. <>= ?org.Hs.ipiGENEID @ \item Convert the environment object into a "list" object, and get values by index or name. <>= xx <- as.list(org.Hs.ipiGENEID) xx[!is.na(xx)][1:3] xx[["IPI00792103.1"]] @ \item Specific utilities for SQLite-based *.db packages. <>= ## Access the data in table (data.frame) format via function "toTable". toTable(org.Hs.ipiPATH[1:3]) ## reverse the role of protein and pathway via ## function "revmap" or "reverseSplit". tmp1 <- revmap(org.Hs.ipiPATH) ## return a AnnDbBimap Object class(tmp1) as.list(tmp1)[1] tmp2 <- reverseSplit(as.list(org.Hs.ipiPATH)) ## return a list Object class(tmp2) tmp2[1] ## The left and right keys of the Bimap can be extracted ## using "Lkeys" and "Rkeys". Lkeys(org.Hs.ipiPATH)[1:3] Rkeys(org.Hs.ipiPATH)[1:3] ## Get the create table statements. org.Hs.ipi_dbschema() ## Use "SELECT" SQL query. selectSQL<-paste("SELECT ipi_id, de", "FROM basic", "WHERE de like '%histone%'") tmp3 <- dbGetQuery(org.Hs.ipi_dbconn(), selectSQL) tmp3[1:3,] @ \end{enumerate} \section{Function Description} \subsection{Getting URL and Version} To download data file from public database, the first step is getting its URL and release/version information. URLs of supported databases are stored in "data/sourceURLs.txt". Following functions are used to get the url: \begin{itemize} \item \Rfunction{getSrcUrl} - return url according given database. \item \Rfunction{getALLUrl} - return urls for all databases used in PAnnBuilder packages. \item \Rfunction{getSrcBuilt} - return release/version according given database. \item \Rfunction{getALLBuilt} - return release/version information for all databases used in PAnnBuilder packages. \end{itemize} \subsection{Parsing and Writing Data} Parsing is a key step to convert original data file to R object. Sometimes R is directlly used to parse and write data. But for large data file or complicated data format, perl is firstly employed to quickly process data, and then R function reads the result file into R objects. \subsubsection{Employing perl program to parse data} Segment of perl program is written into file in "inst/scripts". Name and function of these parser files are as follows: \begin{itemize} \item spParser - parse protein data from SwissProt or TrEMBL \item ipiParser - parse protein data from IPI \item refseqParser - parse protein data from NCBI RefSeq \item equalParser - find protein ID mapping with equal sequences \item mergeParser - merge different ID mapping files \item mppiParser - parse protein protein interaction data from MIPS \item paParser - parse data from PeptideAtlas \item dbsublocParser - parse data from DBSubLoc \item pfamNameParser - parse domain id and name from Pfam \item blastParser - filter the results of blast \end{itemize} Function \Rfunction{fileMuncher} and \Rfunction{fileMuncher\_DB} perl file based on given parser file and additional input data file, then perform this perl program via R. \subsubsection{Writing data using R} Besides using perl program, R functions also parse data from simple data file and store them as R environment objects or Bimap objects. \begin{itemize} \item \Rfunction{createEmptyDPkg} - create an empty R packge at given directory. \item \Rfunction{writeSQ} - write sequence data into R package. \item \Rfunction{writeName} - parse mulitple data file, and write the mapping of id and name into R package. It employs \Rfunction{writeGOName}, \Rfunction{writeKEGGName}, \Rfunction{writePFAMName}, \Rfunction{writeINTERPROName} and \Rfunction{writeTAXName} to respectively write data from GO, KEGG, Pfam, InterPro and TAX. \item \Rfunction{writeSCOPData} - parse structural classification of proteins from SCOP database. \item \Rfunction{writeSubCellData} - parse data from protein subcellular location databases, and write into R package. It employs \Rfunction{writeBACELLOData} and \Rfunction{writeDBSUBLOCData} to respectively write data from BaCelLo and DBSubLoc. \item \Rfunction{writeIntData} - parse data from protein-protein/domain-domain interaction databases, and write into R package. It employs \Rfunction{writeGENEINTData}, \Rfunction{writeINTACTData}, \Rfunction{writeMPPIData}, \Rfunction{write3DIDData} and \Rfunction{writeDOMINEData} to respectively write data from NCBI gene interaction data file, EBI intact, MIPS interaction data, 3DID database and DOMINE database. \item \Rfunction{writePtmData} - parse database involving protein post-translational modifications, and write into R package. It employs \Rfunction{writeSYSPTMData} to write data from SysPTM database. \item \Rfunction{writeBfData} - write data involving body fulids proteomics into R package. It employs \Rfunction{writeSYSBODYFLUIDData} to write data from Sys-BodyFluid database. \item \Rfunction{writeGOAData} - write gene ontology terms from GOA database into R package. \item \Rfunction{writeHomoloGeneData} - write homolog groups from NCBI HomoloGene into R package. \item \Rfunction{writeInParanoidData} - write paralog groups from InParanoid into R package. \item \Rfunction{writePeptideAtlasData} - write peptides identified by Mass Spectrometry from PeptideAtlas database into R package. \item \Rfunction{writeMeta\_DB} - write meta information about the annotation package into SQLite-based package. \item \Rfunction{writeData\_DB} - parse data from databases, and write data as tables into SQLite-based R package. It employs \Rfunction{writeSPData\_DB}, \Rfunction{writeIPIData\_DB}, \Rfunction{writeREFSEQData\_DB}, \Rfunction{writeGENEINTData\_DB}, \Rfunction{writeINTACTData\_DB}, \Rfunction{writeMPPIData\_DB}, \Rfunction{write3DIDData\_DB}, \Rfunction{writeDOMINEData\_DB}, \Rfunction{writeSYSBODYFLUIDData\_DB}, \Rfunction{writeSYSPTMData\_DB}, \Rfunction{writeSCOPData\_DB}, \Rfunction{writeBACELLOData\_DB}, \Rfunction{writeDBSUBLOCData\_DB}, \Rfunction{writeGOAData\_DB}, \Rfunction{writeHomoloGeneData\_DB}, \Rfunction{writeInParanoidData\_DB}, and \Rfunction{writePeptideAtlasData\_DB} to respectively write data from Swiss-Prot, IPI, NCBI RefSeq database, and so on. \item \Rfunction{writeName\_DB} - parse mulitple data file, and write the mapping of id and name into SQLite-based R package. It employs \Rfunction{writeGOName\_DB}, \Rfunction{writeKEGGName\_DB}, \Rfunction{writePFAMName\_DB}, \Rfunction{writeINTERPROName\_DB} and \Rfunction{writeTAXName\_DB} to respectively write data from GO, KEGG, Pfam, InterPro and TAX. \item \Rfunction{createSeeds} - define a list of AnnDbBimap objects which indicates key and value of . \item \Rfunction{createAnnObjs} - produce AnnDbBimap objects based on the definiation in \Rfunction{createSeeds}. \end{itemize} \subsection{Writing Help Documents} Help documents is an important part for new package. Diverse templates of help documents are stored in the "inst/templates" directory. When building new package, R functions use these templates to create "*.rd" help file in the "man" directory: \begin{itemize} \item \Rfunction{getRepList} - return a list which will replace the symbols in template file. \item \Rfunction{copyTemplates\_DB} - implement similar function with \Rfunction{copyTemplates}, and is specially developped for SQLite-based annotation package. \item \Rfunction{writeDescription\_DB} - implement similar function with \Rfunction{writeDescription}, and is specially developped for SQLite-based annotation package. \end{itemize} \subsection{Building Data Packages} Basic functions described above make it possible to build proteomic annotation data packages. Based on these, PAnnBuilder develops multiple sophisticated functions to assemble proteomic annotaion data. Each function is implemented by the "*Builder\_DB" R function. \begin{itemize} \item \Rfunction{pBaseBuilder\_DB} - build annotation data packages for primary protein database such as SwissProt, TREMBL, IPI or NCBI RefSeq protein data. \item \Rfunction{pSeqBuilder\_DB} - build annotation data packages for query protein sequences based on sequence similarity. \item \Rfunction{crossBuilder\_DB} - build annotation data packages for protein id mapping in SwissProt, Trembl, IPI and NCBI Refseq databases. \item \Rfunction{subcellBuilder\_DB} - build annotation data packages for protein subcellular location from BaCelLo or DBSubLoc database. \item \Rfunction{HomoloGeneBuilder\_DB} - build annotation data packages for homolog protein group from NCBI HomoloGene database. \item \Rfunction{InParanoidBuilder\_DB} - build annotation data packages for ortholog protein group between two given organisms from InParanoid database. \item \Rfunction{GOABuilder\_DB} - build annotation data packages for mapping proteins of UniProt to Gene Ontolgy from GOA database. \item \Rfunction{scopBuilder\_DB} - build annotation data packages for Structural Classification of Proteins. \item \Rfunction{intBuilder\_DB} - build annotation data packages for protein-protein or domain-domain interaction from IntAct, MPPI, 3DID, DOMINE or NCBI Gene interaction database. \item \Rfunction{PeptideAtlasBuilder\_DB} - build annotation data packages for experimentally identified peptides from PeptideAtlas database. \item \Rfunction{ptmBuilder\_DB} - build annotation data packages for post-translational modifications from SysPTM database. \item \Rfunction{bfBuilder\_DB} - build annotation data packages for proteins in body fluids from SysBodyFluid database. \item \Rfunction{dNameBuilder\_DB} - build annotation data packages for mapping between entry ID and name from GO, KEGG, Pfam, InterPro and NCBI Taxonomy databases. \end{itemize} \section{Building Annotation Data Packages} \begin{enumerate} \item The first thing you need to do is setting basic parameters such as "pkgpath", "version", and "author". <>= # Set path, version and author for the package. library(PAnnBuilder) pkgPath <- tempdir() version <- "1.0.0" author <- list() author[["authors"]] <- "Hong Li" author[["maintainer"]] <- "Hong Li " @ \item Then you can run diverse "*Builder\_DB" functions to build packages by yourselves. \Rfunction{pBaseBuilder\_DB}, \Rfunction{subcellBuilder\_DB}, and \Rfunction{pSeqBuilder\_DB} are taken as examples to build annotation packages. \begin{description} \item[pBaseBuilder\_DB] builds annotation data packages for proteins in three primary protein databases (SwissProt, IPI, RefSeq). It is a convenient way to obtain complete and canonical annotaion, including protein description, Entrez gene identifier, KEGG pathway, gene ontology, domain, coordinates on chromosomes and so on. For example, if you want to bulid annotation package for Mouse IPI database, you can use codes as follows: <>= ## Build SQLite based annotation package "org.Mm.ipi.db" ## for Mouse IPI database. ## Note: Perl is needed for parsing data file. ## Rtools is needed for Windows user. pBaseBuilder_DB(baseMapType = "ipi", organism = "Mus musculus", prefix = "org.Mm.ipi", pkgPath = pkgPath, version = version, author = author) @ After running, a subdirectory called "org.Mm.ipi" will be produced in the path given by "pkgPath". This directory contains all data and files, which can be used to build R package by "R CMD build" command. \item[subcellBuilder\_DB] builds annotation data Package which provides protein subcellular location information. <>= ## Build subcellular location annotation package "sc.bacello.db" ## from BaCelLo database. subcellBuilder_DB(src="BaCelLo", prefix="sc.bacello", pkgPath, version, author) ## List all files in created directory "sc.bacello.db". dir(file.path(pkgPath,"sc.bacello.db")) @ \item[pSeqBuilder\_DB] uses blast to calculate sequence similarity between query proteins and subject proteins, then assign annotation for query protein according to existing annotation of its similar proteins. pSeqBuilder\_DB is useful for proteins which have not well annotated. Following code chunk gives an example for annotation query proteins by pSeqBuilder\_DB. Needed R packages \Rpackage{org.Hs.sp.db}, \Rpackage{org.Hs.ipi.db}, can be downloaded via biocLite. <>= ## Read query sequence. tmp = system.file("extdata", "query.example", package="PAnnBuilder") tmp = readLines(tmp) tag = grep("^>",tmp) query <- sapply(1:(length(tag)-1), function(x){ paste(tmp[(tag[x]+1):(tag[x+1]-1)], collapse="") }) query <- c(query, paste(tmp[(tag[length(tag)]+1):length(tmp)], collapse="") ) names(query) = sub(">","",tmp[tag]) ## Set parameters for sequence similarity. blast <- c("blastp", "10.0", "BLOSUM62", "0", "-1", "-1", "T", "F") names(blast) <- c("p","e","M","W","G","E","U","F") match <- c(0.00001, 0.9, 0.9) names(match) <- c("e","c","i") ## Install ackages "org.Hs.sp", "org.Hs.ipi". if( !require("org.Hs.sp.db") ){ biocLite("org.Hs.sp.db") } if( !require("org.Hs.ipi.db") ){ biocLite("org.Hs.ipi.db") } ## Use packages "org.Hs.sp.db", "org.Hs.ipi.db" to produce annotation ## R package for query sequence. annPkgs = c("org.Hs.sp.db","org.Hs.ipi.db") seqName = c("org.Hs.spSEQ","org.Hs.ipiSEQ") pSeqBuilder_DB(query, annPkgs, seqName, blast, match, prefix="test1", pkgPath, version, author) @ \end{description} \item After the running of "*Builder\_DB" function has been finished, a subdirectory named "pkgName" will be produced in given "pkgPath". Then the command "R CMD build" can be used to build source R package, and "R CMD --binary build" can be used to build binary R package for Windows. \item Note: \begin{itemize} \item Web connection is needed to download files from public databases, and Perl is needed to parse data files. Additionally, Rtools is needed for Windows user. \item Users should be aware that downloading, parsing, and saving data may take a long time, in addition to requiring enough disk space to store temporary data files. \item "R CMD build" and "R CMD --binary build" should be used in command line, not in R. Detailed document about how to create your own packages can be found in the book "Writing R Extensions" ({\url{http://cran.r-project.org/doc/manuals/R-exts.pdf}}). For Windows users, "R CMD build" needs you to have installed the files for building source packages (which is the default), as well as the Windows toolset (see the "R Installation and Administration" manual at {\url{http://cran.r-project.org/doc/manuals/R-admin.pdf}}). \end{itemize} \end{enumerate} \section{Session Information} This vignette was generated using the following package versions: <>= sessionInfo() @ \bibliographystyle{plainnat} \bibliography{PAnnBuilder} \end{document}