--- title: "An introduction to biodbNcbi" author: "Pierrick Roger" date: "`r BiocStyle::doc_date()`" package: "`r BiocStyle::pkg_ver('biodbNcbi')`" abstract: | How to use the NCBI Gene, CCDS, Pubchem Comp and Pubchem Subst connectors and their methods. vignette: | %\VignetteIndexEntry{Introduction to the biodbNcbi package.} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: toc: yes toc_depth: 4 toc_float: collapsed: false BiocStyle::pdf_document: default bibliography: references.bib --- # Introduction biodbNcbi is a *biodb* extension package that implements a connector to the NCBI databases [@sayers2022_NCBI] Gene, CCDS [@pruitt2009_CCDS; @harte2012_CCDS; @farrell2014_CCDS], Pubchem Comp and Pubchem Subst [@kim2015_PubChem]. # Installation Install using Bioconductor: ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install('biodbNcbi') ``` # Initialization The first step in using *biodbNcbi*, is to create an instance of the biodb class `Biodb` from the main *biodb* package. This is done by calling the constructor of the class: ```{r, results='hide'} mybiodb <- biodb::newInst() ``` During this step the configuration is set up, the cache system is initialized and extension packages are loaded. We will see at the end of this vignette that the *biodb* instance needs to be terminated with a call to the `terminate()` method. # Creating a connector to Gene In *biodb* the connection to a database is handled by a connector instance that you can get from the factory. biodbNcbi implements a connector to a remote database. Here is the code to instantiate a connector: ```{r} gene <- mybiodb$getFactory()$createConn('ncbi.gene') ``` Creating other connectors follow the same process: ```{r} ccds <- mybiodb$getFactory()$createConn('ncbi.ccds') pubchem.comp <- mybiodb$getFactory()$createConn('ncbi.pubchem.comp') pubchem.subst <- mybiodb$getFactory()$createConn('ncbi.pubchem.subst') ``` # Accessing entries To get the number of entries stored inside the database, run: ```{r} gene$getNbEntries() ``` To get some of the first entry IDs (accession numbers) from the database, run: ```{r} ids <- gene$getEntryIds(2) ids ``` To retrieve entries, use: ```{r} entries <- gene$getEntry(ids) entries ``` To convert a list of entries into a dataframe, run: ```{r} x <- mybiodb$entriesToDataframe(entries) x ``` # Accessing efetch web service **efetch** web service is accessible through the `wsEfetch()` method, available on Entrez connectors: `ncbi.gene`, `ncbi.pubchem.comp` and `ncbi.pubchem.subst`. Get the a Gene entry as an XML object and print the `Entrezgene_prot` node: ```{r} entryxml <- gene$wsEfetch('2833', retmode='xml', retfmt='parsed') XML::getNodeSet(entryxml, "//Entrezgene_prot") ``` The object returned is an `XML::XMLInternalDocument`. # Accessing esearch web service **esearch** web service is accessible through the `wsEsearch()` method, available on Entrez connectors: `ncbi.gene`, `ncbi.pubchem.comp` and `ncbi.pubchem.subst`. Search for Gene entries by name and get the IDs of the matching entries (equivalent of running `gene$searchForEntries()`: ```{r} gene$wsEsearch(term='"chemokine"[Gene Name]', retmax=10, retfmt='ids') ``` The same result can be obtained with a call to `searchForEntries()`: ```{r} gene$searchForEntries(fields=list(name='chemokine'), max.results=10) ``` # Accessing einfo web service **einfo** web service is accessible through the `wsEinfo()` method, available on Entrez connectors: `ncbi.gene`, `ncbi.pubchem.comp` and `ncbi.pubchem.subst`. Get PubChem Comp database information as an XML object and print information on first field: ```{r} infoxml <- pubchem.comp$wsEinfo(retfmt='parsed') XML::getNodeSet(infoxml, "//Field[1]") ``` # Closing biodb instance When done with your *biodb* instance you have to terminate it, in order to ensure release of resources (file handles, database connection, etc): ```{r} mybiodb$terminate() ``` # Session information ```{r} sessionInfo() ``` # References