%\VignetteIndexEntry{AnnBuilder ABPrimer}
%\VignetteKeyword{annotation}
%\VignettePackage{AnnBuilder}
\documentclass[12pt]{article}

\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\begin{document}
\author{Jianhua Zhang}

\title{How to use AnnBuilder}

\maketitle

\section{Overview}

AnnBuilder constructs annotation data packages for sets of genes with
known mappings to GenBank accession numbers, UniGene identifiers,
Image clone identifiers, or Entrez Gene identifiers. This vignette
describes how to build an annotation data package from a sample file
that maps probes to GenBank accession numbers. The process involves:

\begin{enumerate}
\item Map a given set of probes to Entrez Gene identifiers. There are
  two main components:
  \begin{itemize}
  \item Obtain mappings from different sources. For a given set of
    genes, especially for genes on an Affymetrix chip, quite a few
    sources of existing mappings are available from the web.
  \item Unify mappings from different sources. Mapping information
    from different sources may agree or disagree.
    \Rpackage{AnnBuilder} resolves conflicts using a voting mechanism
    to obtain unified mappings between probes and Entrez Gene ids.
  \end{itemize}
\item Based on the unified mappings, extract data from Entrez Gene and
  other sources such as Golden Path, GO, and KEGG.
\item Combine the data into an R data package.
\end{enumerate}

The capability of \Rpackage{AnnBuilder} goes well beyond what has been
described above. In theory, any set of genes can be annotated using
AnnBuilder as long as they can be mapped to an id used by a public
data repository. However, that requires some programming with the
existing functions and is covered in another vignette (Advanced).
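
The voting step mentioned above can be illustrated with a small,
self-contained sketch. The code below is not the implementation used
by \Rpackage{AnnBuilder}. The function \Rfunction{voteMapping}, the
source names, and the identifiers are hypothetical and only show how a
simple majority vote could pick one Entrez Gene id per probe when
sources disagree.

<<>>=
## Hypothetical candidate mappings: one row per probe, one column per source.
candidates <- cbind(srcA = c(probe1 = "100", probe2 = "200", probe3 = NA),
                    srcB = c("100", "201", NA),
                    srcC = c("101", "200", "300"))

## Return the id named by the most sources, or NA if no source has a mapping.
## Ties go to the first tied id in the sorted order used by table().
voteMapping <- function(ids) {
    ids <- ids[!is.na(ids)]
    if (length(ids) == 0) return(NA_character_)
    votes <- table(ids)
    names(votes)[which.max(votes)]
}

## Apply the vote to every probe (row) of the candidate matrix.
unified <- apply(candidates, 1, voteMapping)
unified
@

A real unification step could also weight sources by reliability. The
plain majority vote is used here only because it is the simplest rule
that matches the description above.
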
In this vignette, we discuss how to annotate a set of genes that are
mapped to GenBank accession numbers using a single function.

\section{Getting Started}

\subsection{Requirements}

AnnBuilder requires the following. A build will fail if any
requirement is missing.

\begin{itemize}
%\item {R package XML is required to support the functions dealing with
%  XML files. The package is available through
%  \url{http://cran.r-project.org}.}
\item Perl is required to process the potentially rather large
  annotation source data files.
\end{itemize}

\subsection{Function description}

For a set of genes that are mapped to GenBank accession numbers (or
UniGene identifiers, Image clone identifiers, Entrez Gene identifiers),
a function named \Rfunction{ABPkgBuilder} can be used to build an
annotation package. \Rfunction{ABPkgBuilder} takes the following
arguments:

\begin{description}
\item[baseName] A character string for the name of a file to be used
  as a base file to parse source data. The file should contain two
  columns, the first holding the target genes to be annotated and the
  second holding the corresponding mappings to GenBank accession
  numbers, UniGene identifiers, Image clone identifiers, or Entrez
  Gene identifiers. The second column should have either a value or
  "NA".
\item[srcUrls] A named vector of character strings for the URLs from
  which source data files will be obtained. Valid sources are Entrez
  Gene, UniGene, Golden Path, Gene Ontology, and KEGG. The names for
  the character strings should be EG, UG, GP, GO, and KEGG,
  respectively. EG and UG are required. For Windows users, the values
  should be the names of unzipped files downloaded from the sources.
  The function call \texttt{getSrcUrl("all", "Homo sapiens")} returns
  the URLs needed for building a package for human. Other valid
  organism names include the scientific names for mouse and rat.
\item[baseMapType] A character string to indicate whether target genes
  in {\Robject{baseName}} are mapped to GenBank accession numbers
  (gb), UniGene identifiers (ug), Image clone identifiers (image), or
  Entrez Gene identifiers (ll).
\item[otherSrc] A named vector of character strings for the names of
  files that contain mappings between target genes in
  {\Robject{baseName}} and Entrez Gene identifiers. These mappings
  will be unified to obtain more reliable mappings.
\item[pkgName] A character string for the name of the data package to
  be built (e.g. hgu95av2, rgu34a).
\item[pkgPath] A character string for the full path of an existing
  directory where the package to be built will be stored.
\item[organism] A character string for the name of the organism of
  concern (currently only "Homo sapiens", "Mus musculus", or "Rattus
  norvegicus"). See section \texttt{Extend AnnBuilder} if you have an
  organism other than these three species.
\item[version] A character string for the version number of the data
  package to be built.
\item[author] A list of character strings with an author element for
  the name of the maintainer of the data package and a maintainer
  element for the email address of the maintainer.
\end{description}

What we need to do is assign correct values to the above arguments
and then call \Rfunction{ABPkgBuilder} with these arguments.

\subsection{Datasets}

We have placed two data sets in the {\texttt{data}} directory of
\Rpackage{AnnBuilder} to demonstrate how to use
\Rfunction{ABPkgBuilder}. One of them is {\verb+hgu95av2_ID.txt+},
which contains Affymetrix probe ids and their mappings to GenBank
accession numbers for the HGU95Av2 array. The file looks like:

<<>>=
library(AnnBuilder)
read.table(file.path(.path.package("AnnBuilder"), "data", "hgu95av2_ID"),
           sep = "\t", header = FALSE, as.is = TRUE)[1:5,]
@

Now we set the file as the base file (\Robject{baseName}) and indicate
that the mappings in the base file are to GenBank accession numbers.
<<>>=
myBase <- file.path(.path.package("AnnBuilder"), "data", "hgu95av2_ID")
myBaseType <- "gb"
@

Many data sources can be used for annotation. We focus on the
following public data sources:

\begin{description}
\item[Entrez Gene] The data
  {\url{ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz}} will be
  used to map genes to Entrez Gene identifiers and also to annotate
  genes after the unified mappings have been obtained.
\item[UniGene] The data contained at
  {\url{ftp://ftp.ncbi.nih.gov/repository/UniGene/}} will be used to
  obtain mappings between genes and Entrez Gene identifiers. The exact
  data used depend on the organism.
\item[Golden Path] The data (refLink.txt and refGene.txt) at
  {\url{http://www.genome.ucsc.edu/goldenPath/14nov2002/database}}
  will be used to obtain the chromosomal location and orientation data
  for genes. The part 14nov2002 will differ for a different organism
  or when there is a new build of the data sets.
\item[Gene Ontology] The data
  {\url{http://www.godatabase.org/dev/database/archive/2003-03-01/go_200303-termdb.xml.gz}}
  will be used to obtain gene ontology information. The last part of
  the URL changes with builds.
\item[KEGG] Some data at
  {\url{ftp://ftp.genome.ad.jp/pub/kegg/pathway/organisms}} will be
  used to extract the pathway and enzyme information. Quite a few
  individual files will be used, and the system locates them using
  information available at the site given by the URL.
\item[HomoloGene] A data file provided by
  {\url{ftp://ftp.ncbi.nih.gov/pub/HomoloGene/}} will be used to
  extract mappings between Entrez Gene ids and HomoloGene ids.
\end{description}

%% removed by Ting: these info changes frequently. It is not a good idea to specify these in case users get confused
%We may assign the urls to \Robject{srcUrls}.
%
%<<>>=
%mySrcUrls <- c(EG = "ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz",
%   UG = "ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz",
%   GP = "http://www.genome.ucsc.edu/goldenPath/14nov2002/database/" ,
%   GO = "http://www.godatabase.org/dev/database/archive/2003-03-01/go_200303-termdb.xml.gz",
%   KEGG = "ftp://ftp.genome.ad.jp/pub/kegg/pathway/organisms")
%
%
%However,
\Rpackage{AnnBuilder} comes with URL information, which can be
accessed via the global option \verb+AnnBuilderSourceUrls+.

<<>>=
mySrcUrls <- getSrcUrl("all", "Homo sapiens")
mySrcUrls
@

%As \Rpackage{AnnBuilder} does not know how to handle
%{\texttt{.gz}} files under windows. Therefore, any of the source files
%that are of type {\texttt{.gz}} (namely LL, UG, GP, and GO) will have to be
%downloaded/unzipped by windows users and then stored locally before
%hand. The file names of the downloaded/unzipped files will be used to replace
%the urls for the corresponding source data files.

%In the vignette, we will use truncated versions of some of the files
%to reduce the length of time required to process the source data. The
%truncated files are stored at the Bioconductor web site. Note that EG is replaced by LL just to use the example data.
%For windows
%users, we have downloaded/unzipped the source files and stored them in
%the {\texttt{data}} directory of \Rpackage{AnnBuilder}.

%% FIXME: the LL and UG urls here are not correct.

If no other sources of mappings between the target genes and Entrez
Gene identifiers are available, the mappings provided by Entrez Gene
and UniGene will be unified. However, as an example, let us assume
that we have mappings from Affymetrix and from another unidentified
source that we would like to use as additional sources for the
unified mappings. The two source files are also stored in the
\Robject{data} directory of \Rpackage{AnnBuilder}.
<<>>=
read.table(file.path(.path.package("AnnBuilder"), "data", "hgu95av2_AFFY"),
           sep = "\t", header = FALSE, as.is = TRUE)[1:5, ]
read.table(file.path(.path.package("AnnBuilder"), "data", "srcb"),
           sep = "\t", header = FALSE, as.is = TRUE)
@

We assign the files to \Robject{otherSrc}:

<<>>=
myOtherSrc <- c(srcone = file.path(.path.package("AnnBuilder"), "data", "hgu95av2_AFFY"),
                srctwo = file.path(.path.package("AnnBuilder"), "data", "srcb"))
@

The other arguments are straightforward and will not be elaborated.

\subsection{Build annotation}

To build an annotation data package, we only have to call
\Rfunction{ABPkgBuilder} with correct argument values. However, the
code below (and hereafter) is not run under Windows, because manual
intervention is required on that system. Copying the code chunk and
pasting it into an R session under Windows should work.

<<>>=
myDir <- tempdir()
@

\begin{Sinput}
> ABPkgBuilder(baseName = myBase, srcUrls = mySrcUrls,
               baseMapType = myBaseType, otherSrc = myOtherSrc,
               pkgName = "hgu95av2", pkgPath = myDir,
               organism = "Homo sapiens", version = "1.1.0",
               author = list(authors = "myname",
                             maintainer = "myname@myemail.com"),
               fromWeb = TRUE)
\end{Sinput}

Please note that the build takes quite a while to finish. If you are
patient enough to wait until the end, you will have a data package
named "hgu95av2" in the directory defined by myDir. The data package
has data, man, and R subdirectories, each containing some files. The
created data package can be installed in the same way as a regular R
package.

\section{Extend AnnBuilder}

\Rpackage{AnnBuilder}

\section{Further note}

Function \Rfunction{ABPkgBuilder} works only if the data files are of
the correct format (e.g. delimiter-separated, two-column text files)
and the URLs for the source data and the information on their builds
remain unchanged.
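
The format requirement can be checked before starting a long build.
The following sketch is not part of \Rpackage{AnnBuilder}. It writes a
small made-up base file to a temporary location and verifies that
every line is tab-delimited with exactly two fields. The helper
\Rfunction{checkBaseFile} and the example probe ids are assumptions
made for illustration.

<<>>=
## Hypothetical helper: TRUE if every line of a tab-delimited file
## has exactly two fields.
checkBaseFile <- function(fileName) {
    nFields <- count.fields(fileName, sep = "\t")
    all(!is.na(nFields) & nFields == 2)
}

## Write a small made-up base file to a temporary location and check it.
tmpBase <- tempfile()
writeLines(c("probeA\tX00001", "probeB\tNA"), tmpBase)
checkBaseFile(tmpBase)
unlink(tmpBase)
@

The same check could be applied to the files passed through
\Robject{otherSrc}, assuming they use the same two-column layout.
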
When the URLs change, the function will fail, and users have little
control over the outcome because \Rfunction{ABPkgBuilder} makes
assumptions and then calls different functions based on those
assumptions. Another vignette, {\texttt{AnnBuilder}}, shows in detail
how to use the functions that \Rfunction{ABPkgBuilder} is built on,
all of which are available in \Rpackage{AnnBuilder}, to build data
packages. More coding is involved there, but users have much greater
control over the building process and can avoid the failures that may
occur when using {\texttt{ABPkgBuilder}}. Users are encouraged to read
that vignette once they are comfortable with
\Rfunction{ABPkgBuilder}.

\section{Session Information}

The versions of R and of the packages loaded for generating the
vignette were:

<<>>=
sessionInfo()
@

\end{document}