% \VignetteIndexEntry{AnnBuilder Basic} % \VignetteDepends{AnnBuilder} % \VignetteKeywords{Expression Analysis} % \VignettePackage{AnnBuilder} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \documentclass[12pt]{article} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \begin{document} \author{Jianhua Zhang} \title{Basic Functions of AnnBuilder} \maketitle \copyright{2003 Bioconductor} \section{Introduction} This vignette is an overview of some of the functions that can be used to build an annotation data package. The purpose of this vignette is to provide guidance for users who are comfortable using the data package building procedures described in {\texttt{ABPrimer}} but would like to have more freedom in building customized data packages. First time users of AnnBuilder are suggested to go through the {\texttt{ABPrimer}} vignette before trying the code here. Functions contained by \Rpackage{AnnBuilder} include: \begin{verbatim} library(AnnBuilder) pkgpath <- .find.package("AnnBuilder") docFiles <- file.path(pkgpath, c("TITLE", "DESCRIPTION", "INDEX")) headers <- c("", "Description:\n\n", "Index:\n\n") footers <- c("\n", "\n", "") for (i in which(file.exists(docFiles))) { writeLines(headers[i], sep = "") writeLines(readLines(docFiles[i]) ) writeLines(footers[i], sep = "") } \end{verbatim} \Rpackage{AnnBuilder} relies on these functions to build annotation data packages by extracting data from the following potential public data repositories. \begin{itemize} \item Entrez Gene (\url{ftp://ftp.ncbi.nlm.nih.gov/gene/DATA}) to obtain mappings to Gene IDs and annotation data. \item UniGene(\url{ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/Hs.data.gz}) to obtain mappings to LocusLink ids from ESTs \item GoldenPath(\url{http://www.genome.ucsc.edu/goldenPath/14nov2002/database}). Two data files (refLink.txt.gz and refGene.txt.gz) are used to obtain chromosomal location and orientation data \item Gene Ontology(\url{ "http://www.godatabase.org/dev/database/archive/latest/go_200506-termdb.rdf-xml.gz"}) to obtain gene ontology terms and relationships among terms. \item KEGG(\url{ftp://ftp.genome.ad.jp/pub/kegg/pathway/organisms}) to obtain pathway and enzyme data for genes. Several data files may be used depending on the organism of interest. \item[HomoloGene] A data file provided by {\url{ftp://ftp.ncbi.nih.gov/pub/old/HomoloGene/}} will be used to extract mappings between LocusLink ids and HomoloGene ids. \end{itemize} The urls with date components may change when the maintainers update the data. However, \Rpackage{AnnBuilder} has the ability to figure out the latest updates and use the corresponding data for annotation as long as the current path structure of the urls remain. Source data will be downloaded from the urls given. Each of the public data repositories is represented as an object of a S4 class. Common methods for an object include \Rfunction{readData} that reads in data line by line and \Rfunction{parseData} that parses data based on the instructions given in a segment of Perl code. In both cases, data are downloaded from the source url and then processed locally. As data from the aforementioned data sources are usually large, truncated versions of the corresponding data will be used to ensure reasonable speed. These files have already been stored in Bioconductor web site. Thus, the source urls will be different for a real annotation project. \subsection*{Getting Source Data} Suppose we are interested in annotating genes on Affymetrix HG\_U95av2 gene chip. A file containing a column for Affymetrix probe ids and another for mappings to GenBank accession numbers can be produced based on the data file provided by Affymetrix and then used as the base to extract annotation data from different data sources. The base file has to be saved as a text file with the two columns separated by a delimiter (e. g. a tab - "\t"). Here we just create a truncated one on the fly and store it in the current working directory. \begin{verbatim} geneNMap <- matrix(c("32468_f_at", "D90278;M16652", "32469_at", "L00693", "32481_at", "AL031663", "33825_at", "X68733", "35730_at", "X03350", "36512_at", "L32179", "38912_at", "D90042", "38936_at", "M16652", "39368_at", "AL031668"), ncol = 2, byrow = TRUE) write.table(geneNMap, file = "geneNMap", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE) \end{verbatim} We can see that the file has two columns for Affymetrix probe ids and the matching GenBank accession numbers: \begin{verbatim} geneNMap \end{verbatim} The first step to annotating these probe ids in the base file is to map them to Entrez Gene ids and then use mapped gene ids as the point of linkage to other annotation data provided by various data sources. As Affymetrix probe ids (probes for other platform as well) may be mapped to gene ids through Entrez Gene and UniGene (and other sources), each of which can be complementary to each other, we may want to get the mappings from all the available sources and then combine the results to ensure completeness. \Rpackage{Annbuilder} has a unifying mechanism that allows users to unify mapping information from different sources to obtained a combined result that is assumed to be more reliable. In this vignette, we first would like to map the probes to Entrez Gene ids using data from both Entrez Gene and UniGene. The following code creates objects for Entrez Gene and UniGene with parsers needed to parse the source data file for mapping Affymetrix probe ids in baseF to gene ids. %For windows users, we have downloaded/unzipped the %sample files and stored them in the {\texttt{data}} directory of %{\texttt{AnnBuilder}}. \begin{verbatim} makeSrcInfo() srcObjs <- list() egUrl <- "http://www.bioconductor.org/datafiles/wwwsources" ugUrl <- "http://www.bioconductor.org/datafiles/wwwsources/Ths.data.gz" eg <- EG(srcUrl = egUrl, parser = file.path(pkgpath, "scripts", "gbLLParser"), baseFile = "geneNMap", accession = "Tll_tmpl.gz", built = "N/A", fromWeb = TRUE) ug <- UG(srcUrl = ugUrl, parser = file.path(pkgpath, "scripts", "gbUGParser"), baseFile = "geneNMap", organism = "Homo sapiens", built = "N/A", fromWeb = TRUE) srcObjs[["eg"]] <- eg srcObjs[["ug"]] <- ug \begin{verbatim} Again, the urls used in the example are for demonstration purpose only. eg and ug objects also take a parser as an argument. A parser is a segment of a Perl script that contains instructions on how the data source will be parsed and how the output will be generated. Please refer to the documents for pubRepo for detailed information on parsers and the objects for various public data repositories. The parser for eg used in the example is only for the text source data only. Replace "gbLLParser" with "egAccParser" when parsing the source data (gene2accession.gz) from Entrez Gene. Each object has a function named parserData that can be invoked to obtain the parsed data. {\texttt{parserData}} has a parameter - {\texttt{fromWeb}} that should be set to FALSE if the source data has been stored locally. The following code needs human intervention under windows and is therefore turned off. Copying the code chunk and then pasting into an R session under windows should work. \begin{verbatim} # This portion only runs interactively under Windows (copy/paste) if(.Platform$OS.type != "windows"){ llMapping <- parseData(eg, eg@accession) colnames(llMapping) <- c("PROBE", "EG") ugMapping <- parseData(ug) colnames(ugMapping) <- c("PROBE", "UG") } \end{verbatim} The parsed data from LocusLink and UniGene are: \begin{verbatim} # This portion only runs after the previous code has been # executed under windows if(.Platform$OS.type != "windows"){ llMapping ugMapping } \end{verbatim} Please note the differences between the mappings from the two sources and some of the Affymetrix probe ids can be mapped to multiple Gene ids and ";" is used to separate multiple mappings in such cases. The mappings obtained from the two sources are then unified to obtain a comprehensive mapping between Affymetrix probe ids and gene ids. The unified mappings are saved in a file show below: \begin{verbatim} # This portion only runs interactively under Windows (copy/paste) base <- matrix(scan("geneNMap", what = "", sep = "\t", quote = "", quiet = TRUE), ncol = 2, byrow = TRUE) colnames(base) <- c("PROBE", "ACC") merged <- merge(base, llMapping, by = "PROBE", all.x = TRUE) merged <- merge(merged, ugMapping, by = "PROBE", all.x = TRUE) unified <- AnnBuilder:::resolveMaps(merged, trusted = c("EG", "UG"), srcs = c("EG", "UG")) unified \end{verbatim} In the above code, "EG" has been identified as the trusted source meaning that when the two sources provide conflicting mappings, the one from Entrez Gene will be used. The unified mapping has four columns with the first one for Affymetrix probe ids, second for GenBank accession numbers, third for mappings to gene ids, and forth for the number of sources that agreed with the mappings. \begin{verbatim} read.table(unified, sep = "\t", header = FALSE) \end{verbatim} The unified mappings can then be used as the base file to parse the data from Entrez Gene to obtain annotation data for each of the Affymetrix probe ids. To do so, we need to assign a new parser that processes the data from Entrez to get annotation data including gene name, chromosomal location, and so on. Again, the parser works for the example only. \begin{verbatim} # This portion only runs interactively under Windows (copy/paste) if(.Platform$OS.type != "windows"){ parser(eg) <- file.path(.path.package("AnnBuilder"), "scripts", "llParser") baseFile(eg) <- unified annotation <- parseData(eg, eg@accession, ncol = 14) colnames(annotation) <- c("PROBE", "ACCNUM", "LOCUSID", "UNIGENE", "GENENAME", "SYMBOL","CHR", "MAP", "PMID", "GRIF", "SUMFUNC", "GO", "OMIM", "REFSEQ") } \end{verbatim} The annotation data obtained has 12 columns for the elements indicated by the column names. Let us view the chromosomal number of the Affymetrix probe ids. \begin{verbatim} annotation[,c("PROBE", "LOCUSID")] \end{verbatim} Other annotation data can be obtained from other sources. In this vignette, we try to get data from GoldenPath for chromosomal location and orientation and Gene Ontology for ontology terms and relations among terms. As usual, we create the objects with truncated data from Bioconductor rather than the actual web site. Two source data files (Tlink.txt.gz and TGene.txt.gz) have to be downloaded/unzipped from GoldenPath (\url{http://www.genome.ucsc.edu/goldenPath/10april2003/database/}) in order to obtain the chromosome location data. We only have to provide the url under unix as the system knows how to get the latest version of the two files. \begin{verbatim} gpUrl <- "http://www.bioconductor.org/datafiles/wwwsources/" goUrl <- "http://www.bioconductor.org/datafiles/wwwsources/Tgo.xml" gp <- GP(srcUrl = gpUrl, organism = "Homo sapiens", fromWeb = TRUE) go <- GO(srcUrl = goUrl, fromWeb = TRUE) \end{verbatim} To get the chromosomal data from GoldenPath with the actual url, one only needs to call a function called getStrand by typing "strand <- getStrand(gp)" where gp is the object for goldenPath with correct url. In this vignette, however, we take a somewhat different approach to get the data as we are using a truncated set of data from a dummy. \begin{verbatim} strand <- getChroLocation(srcUrl(gp), gpLinkNGene(TRUE)) \end{verbatim} The data processed are then merged with the annotation we previously obtained. \begin{verbatim} annotation <- merge(annotation, strand, by = "LOCUSID", all.x = TRUE) \end{verbatim} To generate an R data package containing the annotation data, we first create an empty package and then populate the package with data and F fucntions. \begin{verbatim} pkgName <- "test" pkgPath <- getwd() createEmptyDPkg("test", getwd(), force = TRUE) annotation <- as.matrix(annotation) writeAnnData2Pkg(annotation, pkgName, pkgPath) revNames <- intersect(colnames(annotation), c("PMID", "PATH", "ENZYME")) if(length(revNames) != 0){ writeReverseMap(annotation[, c("PROBE", revNames)], pkgName, pkgPath) } \end{verbatim} The data package is stored in the current working directory under the name {texttt{test}}. \begin{verbatim} list.files(file.path(getwd(), "test")) \end{verbatim} As can be seen, the data package contains all the required elements of a normal R package and can be installed in the same way as an R package. The annotation data are all stored as rda files in the data directory. \begin{verbatim} list.files(file.path(getwd(), "test", "data")) \end{verbatim} Each of the rda files contains key and value pairs with the key being Affymetrix probe ids and value being the annotation element in this case. The last step is to write the needed documentations and statistic data for quality control purpose. The following code can be used to generate the required documentations for the data package. Some part of the code may fail as the urls used may subject to changes by maintainers of the web sits in the future. \begin{verbatim} repList <- getRepList("all", srcObjs) repList[["PKGNAME"]] <- pkgName chrLengths <- getChrLengths("Homo sapiens") writeOrganism(pkgName, pkgPath, "Homo sapiens") writeChrLength(pkgName, pkgPath, chrLengths) writeDocs("geneNMap", pkgName, pkgPath, "1.1.0", list(author = "annonymous", maintainer = "annonymous@net.com"), repList, "PKGNAME") \end{verbatim} Now, we can clean up the mess we have left. \begin{verbatim} unlink(c(unified, XMLOut, "geneNMap", "test.xml", "testByNum.xml")) unlink(file.path(getwd(), "test"), TRUE) \end{verbatim} \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= sessionInfo() @ \end{document}