---
title: "Creating a new connector."
author: "Pierrick Roger"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('biodb')`"
vignette: |
  %\VignetteIndexEntry{Creating a new connector class for accessing a database.}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
abstract: |
  This vignette shows how to create a new connector class and the corresponding new entry class for accessing a remote database.
output:
  BiocStyle::html_document:
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: false
  BiocStyle::pdf_document: default
bibliography: references.bib
---

```{r, echo=FALSE}
source(system.file('vignettes_inc.R', package='biodb'))
```

# Introduction

*biodb* is a framework designed to help you implement new connectors for databases.
To illustrate this, we will walk through a practical example in which we create a connector for the [ChEBI](https://www.ebi.ac.uk/chebi/) database.

In this example, we present a small implementation of a *ChEBI* connector, and show how to declare it to your *biodb* instance.
A more complete and functional connector for accessing the *ChEBI* database is implemented in the [biodbChEBI](https://github.com/pkrog/biodbChebi) library.
See \@ref(tab:biodbChebiCapabilities) for a list of the capabilities of this official *biodb* connector.

Title / method name | Description
------------------- | -------------------------------------------
Fields parsing      | Formula, charge, InChI, InChIKey, molecular mass, monoisotopic mass, KEGG id, entity stars, SMILES.
getEntryPageUrl()   | Returns the URL of the website page of an entry.
getEntryImageUrl()  | Returns the URL to the molecule image of an entry.
wsWsdl()            | Returns the WSDL definition (i.e.: list of available web services and their parameters).
wsGetLiteEntity()   | Runs the getLiteEntity web service that returns database entries with their contents.
convIdsToChebiIds() | Converts a list of IDs (InChI, InChI Keys, CAS, ...) into a list of ChEBI IDs.
convInchiToChebi()  | Converts a list of InChI or InChI Keys into a list of ChEBI IDs.
convCasToChebi()    | Converts a list of CAS IDs into a list of ChEBI IDs.
searchForEntries()  | Searches for entries by mass and/or by name.

: (\#tab:biodbChebiCapabilities) Capabilities of the *biodbChebi* extension package.

# Generating a new extension package

When creating a new extension package, *biodb* can help you generate all the necessary files.
A call to `genNewExtPkg()` will generate the skeletons for the *biodb* connector class and the *biodb* entry class, along with the testthat files, the DESCRIPTION file, etc.

A simplified call might look like this:
```{r}
biodb::genNewExtPkg(path='biodbChebiEx', dbName='chebi.ex', connType='compound',
                    dbTitle='ChEBI connector example', entryType='xml',
                    remote=TRUE)
```

See \@ref(tab:generatorParameters) for a brief description of the parameters.
Other parameters exist for the author's email, the author's name, for generating a `Makefile`, or for configuring the package for writing C++ code with `Rcpp`.

Parameter | Description
--------- | --------------------------------
path      | The path to the package folder to create.
dbName    | The name of the connector to create.
dbTitle   | A short description of the database.
connType  | The type of connector.
entryType | The type of the entry.
remote    | Must be set to `TRUE` if a connection to a web server is needed.

: (\#tab:generatorParameters) A brief description of some parameters of `biodb::genNewExtPkg()`.

The files generated by the `genNewExtPkg()` function are the following:
```{r}
list.files('biodbChebiEx', all.files=TRUE, recursive=TRUE)
```

The `biodb_ext.yml` file stores the values of the parameters used with `biodb::genNewExtPkg()`.
This is in case you want to upgrade some of the generated files (`.gitignore`, `.travis.yml`, `Makefile`, etc) with newer versions from the *biodb* package.
You would then only need to call `biodb::upgradeExtPkg(path='biodbChebiEx')` and the `biodb_ext.yml` file would be read for parameter values.

The `inst/definitions.yml` file defines the new connector; we will fill in some values inside it.
Then we need to write implementations for the methods in the connector class `R/ChebiExConn.R`.
The entry class `R/ChebiExEntry.R`, on the other hand, needs no modification for our basic usage.

The test files in `tests/testthat` will be executed when running `R CMD check`, but they need to be edited first.
Generic tests need to be enabled inside `tests/testthat/test_100_generic.R`.
The files `tests/testthat/test_050_fcts.R` and `tests/testthat/test_200_example.R` contain only examples, and thus need to be modified or removed.

The test files in `tests/long` will not be executed when running `R CMD check`.
They can be run manually after installing the package locally, by calling `R -e "testthat::test_dir('tests/long')"`.

A skeleton vignette has also been generated (`vignettes/intro.Rmd`), and should be completed with specific examples for this package.

# Editing the generated skeleton

Starting from the skeleton files generated by `genNewExtPkg()`, we now need to fill in the blanks.
The first file to take care of is `inst/definitions.yml`, which contains the definition of the new connector.
Then we will look quickly at `R/ChebiExEntry.R`, which is rather empty in our case, and at `R/ChebiExConn.R`, which requires much more attention, since it has several methods that need an implementation.

The naming of the classes inside the R files is important.
They must be named `ChebiExEntry` and `ChebiExConn`, in order to match the name defined inside `inst/definitions.yml` (`chebi.ex`).
Fortunately the generator has taken care of this, and no special action is required, other than leaving the names unchanged.

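
The correspondence between the connector name `chebi.ex` and the class names `ChebiExConn` and `ChebiExEntry` can be illustrated with a few lines of base R. This is only a sketch of the naming pattern as it appears in this example: the helper `classNamesFor()` is hypothetical and is not a *biodb* function, and the rule it applies (split on dots, capitalize each word, append `Conn` or `Entry`) is inferred from the names above rather than taken from the *biodb* source code.

```r
# Hypothetical helper, NOT part of biodb: derive the class names expected
# for a connector identifier such as "chebi.ex".
classNamesFor <- function(dbName) {

    # Split the identifier on dots: "chebi.ex" gives c("chebi", "ex")
    words <- strsplit(dbName, ".", fixed=TRUE)[[1]]

    # Put the first letter of each word in uppercase
    words <- paste0(toupper(substring(words, 1, 1)), substring(words, 2))

    # Concatenate the words and append the class suffixes
    prefix <- paste(words, collapse="")
    c(conn=paste0(prefix, "Conn"), entry=paste0(prefix, "Entry"))
}

classNamesFor("chebi.ex")
```

Renaming either class by hand breaks this correspondence, which is why the generated names should be left untouched.
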
## Editing the YAML definition of the new connector

The content of the generated YAML file `inst/definitions.yml` is as follows:
```{r, eval=FALSE, highlight=FALSE, code=readLines('biodbChebiEx/inst/definitions.yml')}
```

It is mainly filled with examples.
This YAML file contains two main parts: `databases` and `fields`.
The `databases` part is where you list the new connectors you have created, and the `fields` part is where you define the new entry fields your new connectors need.

### Fields definition

We just have one new field to define: `chebi.ex.id`.
This is the accession field for our new connector.
All connector accession fields are formed from the connector name followed by `.id`.
This accession field is mainly used inside other databases, when they make references to other databases.
The field `accession`, which is used in all entries of *biodb* connectors, contains the same value as the connector accession field (`chebi.ex.id` in our case) and is preferable when accessing an entry.

The definition of the new field is quite simple. See \@ref(tab:fielddecl) for explanations of the different parameters.

Parameter            | Description
-------------------- | --------------------------------
`description`        | A free description of your field.
`type`               | The type of the field. Here we declare that this is an accession (identifier) field: `id`.
`card`               | The cardinality of the field: `one` if the field accepts only one value, or `many` if multiple values can be stored inside the field.
`forbids.duplicates` | If `TRUE` then duplicates are forbidden. This only makes sense when multiple values can be stored inside this field (i.e.: cardinality is set to `many`).
`case.insensitive`   | If `TRUE` then values will be compared in case insensitive mode. This is mostly useful when looking for duplicates.

: (\#tab:fielddecl) Field's parameters. Description of the parameters used when declaring a new entry field.

### Database definition

The main part is the declaration of the new connector.
This is done in the `databases` section, under the key `chebi.ex`, which is the database identifier.
See \@ref(tab:conndecl) for explanations of the different parameters.

Parameter                | Description
------------------------ | --------------------------------
`name`                   | The full name of your new connector.
`urls`                   | A list (key/values) of URLs of the remote database. The common URLs to define are `base.url` to access pages of the database website, and `ws.url` for web service URLs. Those URLs are just prefixes and are used inside the connector class for building real URLs. You can define as many URLs as the remote database requires, like a second base URL (`base2.url`) or a second web service URL (`ws2.url`), or any other URL with the key name you want.
`xml.ns`                 | This parameter defines namespaces for XML documents returned by the remote database. This is thus only useful for databases that return data in XML format.
`scheduler.n`            | The maximum number of queries to send to the remote database, each T (stored as `scheduler.t`) seconds.
`scheduler.t`            | The time (in seconds) during which a maximum of N (stored as `scheduler.n`) queries is allowed.
`entry.content.type`     | The type of content sent by the database for an entry. Here we have specified `xml`. Allowed values are: `html`, `sdf`, `txt`, `xml`, `csv`, `tsv`, `json`, `list`. This is mainly used to add an extension to the file saved inside *biodb* cache.
`entry.content.encoding` | The text encoding used inside the entry's content by the database.
`parsing.expr`           | This is the most important part of the declaration. It lists the different expressions to use in order to parse the values of the entry fields. The format is a key/value list, the key being the *biodb* field name, and the value the expression to run. Since the entry content type is XML, we have to use XPath expressions here. See this [XPath Tutorial](https://www.w3schools.com/xml/xpath_intro.asp), for instance, to get an introduction to XPath. Note that, with XPath, we can define multiple expressions for one field, as for the `formula` field. If the first expression fails, then the next expressions will be tried.
`searchable.fields`      | A list of *biodb* entry fields that are searchable when calling a search function like `searchCompound()`.

: (\#tab:conndecl) Connector declaration's parameters. Description of the parameters used when declaring a new connector.

### Final version of the YAML file

After setting some parsing expressions, the URLs and the searchable fields, we get a complete definition file, which you can find at:
```{r}
defFile <- system.file("extdata", "chebi_ex.yml", package='biodb')
```

Its content is as follows:
```{r, eval=FALSE, highlight=FALSE, code=readLines(system.file("extdata", "chebi_ex.yml", package='biodb'))}
```

## The entry class

The entry class represents an entry from the database.
Each instance of an entry contains the values parsed from the content downloaded from the database.

The entry class of our example extension package has been generated inside `R/ChebiExEntry.R`.
Here is its content:
```{r, eval=FALSE, highlight=TRUE, code=readLines('biodbChebiEx/R/ChebiExEntry.R')}
```

The class inherits from `BiodbXmlEntry` since we have set the `entryType` parameter to `"xml"`.

An entry class must inherit from the `BiodbEntry` class and define some methods.
To simplify this step, several generic entry classes have been defined in *biodb* (see \@ref(tab:entryClasses)), depending on the type of content downloaded from the database.
To use one of these classes for your entry class, you only have to make your class inherit from the desired generic class.

Entry class      | Content type handled
---------------- | --------------------------------
`BiodbCsvEntry`  | CSV file.
`BiodbHtmlEntry` | HTML, the parsing will be done using XPath expressions.
`BiodbJsonEntry` | JSON.
`BiodbListEntry` | R list.
`BiodbSdfEntry`  | SDF file (chemical data file format).
`BiodbTxtEntry`  | Text file, the parsing will be done using regular expressions.
`BiodbXmlEntry`  | XML file, the parsing will be done using XPath expressions.

: (\#tab:entryClasses) Provided abstract entry classes. These are the entry classes already defined inside the *biodb* package that facilitate the parsing of the corresponding content type.

Two methods are defined that can be used to enhance our implementation.
The method `doCheckContent()` can be used to further check the parsed content of an entry, for instance to detect inconsistencies between fields.
The method `doParseFieldsStep2()` lets you run custom code for complex parsing of the entry's content.
This method is run after `doParseFieldsStep1()`, which is defined inside the mother class (here `BiodbXmlEntry`) and executes the parsing expressions defined inside `inst/definitions.yml`.

Note: *biodb* uses [R6](https://adv-r.hadley.nz/r6.html) as OOP (Object Oriented Programming) model. Please see vignette
```{r, echo=FALSE, results='asis'}
make_vignette_ref('details')
```
, for more explanations.

## The connector class

The generator has generated the full class, and thus has taken care of the inheritance part, as well as the declaration of the required methods.
See \@ref(tab:chebiExMethods) for a description of these methods.
What is left to us is the implementation of those methods.

Here is the generated skeleton:
```{r, eval=FALSE, highlight=TRUE, code=readLines('biodbChebiEx/R/ChebiExConn.R')}
```

### Inheritance

The connector class is responsible for the connection to the database.
In our case, the database is a compound database.

### Methods to implement

Method                        | Description
----------------------------- | --------------------------------
`doGetEntryPageUrl()`         | This method returns the official URL of the entry page on the database website, for each accession number passed. The return type is thus a list. If no entry pages are available for the database, the method must return a list of `NULL` values, the same length as the input vector.
`doGetEntryImageUrl()`        | This method returns the official URL of the entry picture on the database website, for each accession number passed. The picture returned must be a visual representation of the entry (a molecule 3D model, a mass spectrum, ...). The return type is thus a list. If no entry pictures are available for the database, the method must return a list of `NULL` values, the same length as the input vector.
`doGetEntryContentRequest()`  | This method is called by `getEntryContentRequest()`, and must return a list of URLs used to retrieve entry contents. If the `concatenate` parameter is `FALSE`, the list returned must be the same length as the vector `id` and each URL must point to one entry content only. If the `concatenate` parameter is `TRUE`, then it is permitted (but not compulsory) to return URLs that get more than one entry at a time.
`doGetEntryIds()`             | This method, called by `getEntryIds()`, should return the full list of accession numbers of the entries contained in the database, or a subset if `max.results` is set. This method is used for testing, in order to get a sample of existing entries, but may also be useful for users when developing.
`doSearchForEntries()`        | This method implements the search of entries by filtering on some field values. For our example, we have kept it simple by implementing only the search by name (field `"name"`), because a full implementation with mass search would require much more code with complex calls to the *ChEBI* API. You can however see a real implementation inside [biodbChebi](https://github.com/pkrog/biodbChebi), the package that implements the *ChEBI* connector.

: (\#tab:chebiExMethods) Methods to implement inside the chebi.ex connector.

See the help inside R about `BiodbConn` for details about the parameters of those functions.

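
The return shape required from `doGetEntryPageUrl()` and `doGetEntryImageUrl()` (a list with one element per accession number, or one `NULL` per accession number when the database has no such page) can be sketched in plain R, outside of any connector class. Everything in this sketch is an assumption made for illustration: the function names `buildEntryPageUrls()` and `noEntryPageUrls()`, the `https://example.org/entry` prefix and the `?id=` query format are hypothetical and do not come from *biodb* or from *ChEBI*.

```r
# Hypothetical stand-alone functions, NOT biodb methods. They only
# illustrate the "one list element per accession number" contract.

# Build one URL per accession number from a URL prefix.
# The "?id=" query format is made up for this sketch.
buildEntryPageUrls <- function(id, base.url) {

    # Encode special characters of each accession number
    enc <- vapply(as.character(id), utils::URLencode, character(1),
                  reserved=TRUE, USE.NAMES=FALSE)

    # sprintf() returns a zero-length vector for zero-length input, so the
    # output list always has the same length as `id`
    as.list(sprintf("%s?id=%s", base.url, enc))
}

# Case of a database without entry pages: one NULL per accession number
noEntryPageUrls <- function(id) {
    rep(list(NULL), length(id))
}

urls <- buildEntryPageUrls(c("17001", "a b"), "https://example.org/entry")
length(urls)
length(noEntryPageUrls(c("17001", "a b")))
```

Inside a real connector the URL prefix would come from the connector definition rather than from a function argument, and the URL would normally be assembled with the `BiodbUrl` class instead of `sprintf()`.
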
### Remote connection methods

The remote methods serve three different goals.
First, to build URLs that access the web site: the URL of an entry page (`doGetEntryPageUrl()`) or the URL of an entry picture (`doGetEntryImageUrl()`), such as a molecule representation.
Second, to get a list of database entry identifiers (`doGetEntryIds()`).
Third, to get the content of an entry (`doGetEntryContentRequest()`).

In our implementations of `doGetEntryPageUrl()`, `doGetEntryImageUrl()` and `doGetEntryContentRequest()` (see below), you may notice the use of the `getPropValSlot()` method to get some base URLs (`"base.url"`, `"ws.url"`).
These values are defined inside the connector YAML definition file described above.
Also, in those methods, we use the `BiodbUrl` class to build the URLs.
`BiodbUrl` handles the building of the URL parameters, as well as the encoding of special characters.

### Method for searching for entries

The implemented method (`doSearchForEntries()`) is a generic method used to search for entries inside the database by name, mass, or any other field.
For our example we have decided to implement only the search by name in order to keep the code as simple and short as possible.
To see a full implementation of this method, look at the official *biodb* *ChEBI* connector at [biodbChebi](https://github.com/pkrog/biodbChebi).

Inside the method's code you will see that the implementation of the call to the *ChEBI* web service API has been left to the dedicated method `wsGetLiteEntity()`.

### Prototype to respect for web service methods

In *biodb* official implementations of remote connectors, the calls to web services are implemented in separate dedicated methods that share some common principles.
These principles are important, because they ensure uniformity between *biodb* extension packages, allowing users to immediately identify a web service method and recognize the *biodb* generic parameters inside it.

Example of a web service method, taken from the official *biodb* *ChEBI* extension package:
```{r, eval=FALSE}
wsGetLiteEntity=function(search=NULL, search.category='ALL', stars='ALL',
                         max.results=10,
                         retfmt=c('plain', 'parsed', 'request', 'ids')) {
}
```

A web service method name must start with the prefix `ws`, which stands for *web service*, and be followed by the database API name of the web service written in Java style (i.e.: an uppercase letter for the start of each word and lowercase letters for the rest).

The first parameters of the method are the database web service parameters.

The last parameters (`max.results` and `retfmt`) are *biodb* specific.
`max.results` controls the maximum number of results wanted, and must have a default value (usually `10`).
`retfmt`, which stands for *return format*, controls the format of the method's returned value.
The default value of `retfmt` is set to a vector and then processed inside the method with the `match.arg()` function.
Thus the "real" default value is the first value of the vector, which must always be `"plain"`.
The set of possible values for `retfmt` varies from one web service method to another.
However some of the values are compulsory.
See \@ref(tab:retfmtValues) for a full list of `retfmt` possible values officially accepted by *biodb*.

Value        | Compulsory | Description
------------ | ---------- | --------------------------------
`plain`      | yes        | Results are returned verbatim, without any change on the data returned by the server.
`parsed`     | yes        | Results are parsed according to the data format expected from the server (JSON, CSV, ...) before being returned.
`request`    | yes        | Instead of returning the results of the query, the query is returned as a `BiodbRequest` object. The query is only built, and is never sent to the server.
`ids`        | no         | Results are returned as a character vector of entry identifiers.
`queryid`    | no         | This value is used when dealing with an asynchronous web service. The value returned is the ID of the asynchronous query extracted from the parsed results returned by the server. This query ID is then used to request the status of the query and to retrieve its results, usually with two other web services.
`status`     | no         | When dealing with an asynchronous web service query, this value asks for the current status of the query.
`data.frame` | no         | Results are formatted into a data frame.

: (\#tab:retfmtValues) `retfmt` accepted values. The list of values of `retfmt` officially accepted by *biodb*.

You may want to look into some of the *biodb* implementations of connectors to official remote databases, and see how the calls to web services have been implemented in dedicated web service methods. See \@ref(tab:biodbOfficialRemoteConns).

Package                                               | Official database site
----------------------------------------------------- | --------------------------------
[biodbChebi](https://github.com/pkrog/biodbChebi)     | [ChEBI](https://www.ebi.ac.uk/chebi/)
[biodbHmdb](https://github.com/pkrog/biodbHmdb)       | [HMDB](https://hmdb.ca/)
[biodbKegg](https://github.com/pkrog/biodbKegg)       | [KEGG](https://www.kegg.jp/)
[biodbUniprot](https://github.com/pkrog/biodbUniprot) | [UniProt](https://www.uniprot.org/)

: (\#tab:biodbOfficialRemoteConns) *biodb* connectors to remote databases. Some of the *biodb* packages implementing connectors to official remote databases.

### Implementation

```{r, echo=FALSE, results='hide'}
connClass <- system.file("extdata", "ChebiExConn.R", package='biodb')
entryClass <- system.file("extdata", "ChebiExEntry.R", package='biodb')
source(connClass)
source(entryClass)
```

Here is our implementation of the connector class:
```{r, code=readLines(connClass)}
```

Here is our implementation of the entry class:
```{r, code=readLines(entryClass)}
```

## Using the new connector

To use the new connector, we first need to load the YAML definition file inside our *biodb* instance.
To start we create an instance of the `BiodbMain` class:
```{r}
mybiodb <- biodb::newInst()
```

The loading of the definitions is done with a call to `loadDefinitions()`:
```{r}
mybiodb$loadDefinitions(defFile)
```

Now our *biodb* instance is aware of our new connector, and is ready to create instances of it.

To create an instance of our new connector class, we proceed as usual in *biodb*, by calling `createConn()` on the factory instance, using our connector identifier:
```{r}
conn <- mybiodb$getFactory()$createConn('chebi.ex')
```

Now we can retrieve a *ChEBI* entry from the remote database:
```{r}
entry <- conn$getEntry('17001')
entry$getFieldsAsDataframe()
```

Do not forget to terminate your biodb instance once you are done with it:
```{r Closing of the biodb instance}
mybiodb$terminate()
```

## Other types of connectors and entries

We describe here the other types of connectors and entries that *biodb* provides.
The generator that we have used to generate the package skeleton for `chebi.ex` can also be used to generate skeletons for all the types described here.

### Connector for a local database

With *biodb* we can also write a connector for a local database.
As a matter of fact, all the connectors included in the *biodb* base package are local connectors only: `mass.csv.file`, `comp.csv.file` and `mass.sqlite`.

See \@ref(tab:connMethods) for a list of methods to implement when writing a local connector.

Method                       | Description
---------------------------- | --------------------------------
`doGetNbEntries()`           | Must return the number of entries contained in the database.
`doGetEntryContentFromDb()`  | Returns the content(s), as strings, of one or more entries from the database.
`doDefineParsingExpressions()` | May be overridden in order to define parsing expressions dynamically (see `CsvFileConn` class for an example).
`doGetEntryIds()`            | This method, called by `getEntryIds()`, should return the full list of accession numbers of the entries contained in the database, or a subset if `max.results` is set. This method is used for testing, in order to get a sample of existing entries, but may also be useful for users when developing.

: (\#tab:connMethods) `BiodbConn` methods to implement. The list of methods to implement when inheriting from the `BiodbConn` class.

### Connector for a mass spectra database

In the example above, we have implemented a compound database.
Another type of database is a mass spectra database.
The following connectors included in the *biodb* package are mass spectra database connectors: `mass.csv.file` and `mass.sqlite`.

See \@ref(tab:massdbConnMethods) for a list of methods to implement when writing a mass spectra database connector.

Method                       | Description
---------------------------- | --------------------------------
`doGetChromCol()`            | Returns a data frame containing the description of the chromatographic columns.
`doGetNbPeaks()`             | Returns the total number of MS peaks contained in the database.
`doGetMzValues()`            | Returns a list of M/Z values contained inside the database, with the possibility of filtering on MS mode, MS level, and some other variables.
`doSearchMzRange()`          | Searches for spectra using an M/Z range and optional filtering on some other variables.

: (\#tab:massdbConnMethods) Methods to implement when defining a connector to a mass spectra database.

### Connector for a downloadable database

Some database servers do not offer web services or any other connection to the database, but allow the whole database to be downloaded for local processing.
*biodb* can handle the connection to such database servers, by setting `downloadable` to `TRUE` inside the definition of the database connector.
See \@ref(tab:downloadableMethods) for a list of methods to implement inside your connector when writing a downloadable database connector.

Method                       | Description
---------------------------- | --------------------------------
`doesRequireDownload()`      | This method must return `TRUE` if the connector requires files to be downloaded locally with the `BiodbDownloadable` interface.
`doDownload()`               | This method must implement the download of the database file.
`doExtractDownload()`        | This method must implement the extraction of the database files (e.g.: from a zip).

: (\#tab:downloadableMethods) Methods to implement when defining a downloadable connector class.

### How to implement other types of entry classes

We have seen in the example how to parse XML entries by writing an entry class that inherits from the `BiodbXmlEntry` class.
As stated before, *biodb* provides other types of abstract entry classes that facilitate the parsing of diverse entry content formats.
Here is a review of those formats.

#### HTML content

To parse HTML content, your entry class should inherit from `BiodbHtmlEntry`.
The parsing expressions must be written in the *XPath* language, as for XML content, but a special parsing algorithm is used since HTML is less strict than XML and allows some "illegal" constructs.

Example of a parsing expression:
```
path: //input[@id='DATA']
```

#### JSON content

To parse JSON content, your entry class should inherit from `BiodbJsonEntry`.
The parsing expressions are written in the form of lists of keys to follow as a path inside the JSON tree.

Here is an example:
```
chrom.col.id:
  - liquidChromatography
  - columnCode
```

#### List content

If your connector gets entry contents directly as an R list object, as in the case of `MassSqliteConn`, it is in your interest to make your entry class inherit from the `BiodbListEntry` abstract class.
With this class, the entry content is provided as a flat named R list object, although it is also possible to pass a JSON string containing flat key/value pairs instead.

The parsing expressions are the names used inside the list object.
Here is an example:
```
accession: id
compound.id: comp_id
formula: chem_form
```

#### CSV content

The `BiodbCsvEntry` class helps you handle entry content in CSV format (using the comma or any other character as separator).

When declaring the constructor for your own entry class, do not forget to call the mother class constructor to pass it your separator and/or the string values that have to be converted to `NA`:
```{r}
MyEntryClass <- R6::R6Class("MyEntryClass",
    inherit=biodb::BiodbCsvEntry,
    public=list(
        initialize=function() {
            super$initialize(sep=';', na.strings=c('', 'NA'))
        }
))
```

The parsing expressions are the column names of the CSV file:
```
accession: id
name: fullname
```

#### SDF content

If your entry content is in the SDF (Structure Data File) chemical file format, make your entry class inherit from the `BiodbSdfEntry` abstract class.

Since the SDF format is an official standard format, no parsing expressions are needed in this case; your class only has to inherit from `BiodbSdfEntry`.

#### Text content

The `BiodbTxtEntry` abstract class allows you to handle any text file content for entries.

Parsing expressions are defined as regular expressions, using the [stringr](https://stringr.tidyverse.org/) package, hence in [ICU Regular Expressions](https://unicode-org.github.io/icu/userguide/strings/regexp.html) format.

Here is an example:
```
accession: ^ENTRY\s+(\S+)\s+Compound
exact.mass: ^EXACT_MASS\s+(\S+)$
formula: ^FORMULA\s+(\S+)$
```

#### Implementing your own parsing

If none of the predefined formats fits your needs, your class has to inherit directly from `BiodbEntry`.

Two methods have to be implemented in this case.
The first is `doParseContent()`, which parses a string into a format acceptable to the second method, `doParseFieldsStep1()`.

Look for instance at the code of the `BiodbTxtEntry` class for a good example.
Here is an excerpt:
```{r, eval=FALSE}
doParseContent=function(content) {

    # Get lines of content
    lines <- strsplit(content, "\r?\n")[[1]]

    return(lines)
},

doParseFieldsStep1=function(parsed.content) {

    # Get parsing expressions
    parsing.expr <- .self$getParent()$getPropertyValue('parsing.expr')

    .self$.assertNotNull(parsed.content)
    .self$.assertNotNa(parsed.content)
    .self$.assertNotNull(parsing.expr)
    .self$.assertNotNa(parsing.expr)
    .self$.assertNotNull(names(parsing.expr))

    # Loop on all parsing expressions
    for (field in names(parsing.expr)) {

        # Match whole content
        g <- stringr::str_match(parsed.content, parsing.expr[[field]])

        # Get positive results
        results <- g[ ! is.na(g[, 1]), , drop=FALSE]

        # Any match ?
        if (nrow(results) > 0)
            .self$setFieldValue(field, results[, 2])
    }
}
```

#### Extending the parsing of an existing class

When inheriting from one of the abstract classes listed above (`BiodbTxtEntry`, `BiodbJsonEntry`, `BiodbXmlEntry`, ...), you also have the opportunity to write some custom parsing code by implementing `doParseFieldsStep2()`.
This method will be called just after `doParseFieldsStep1()`, which is implemented by the abstract class.

See the `HmdbMetabolitesEntry` class inside the [biodbHmdb](https://github.com/pkrog/biodbHmdb) extension package for an example.
Here is an extract:
```{r, eval=FALSE}
doParseFieldsStep2=function(parsed.content) {

    # Remove fields with empty string
    for (f in .self$getFieldNames()) {
        v <- .self$getFieldValue(f)
        if (is.character(v) && ! is.na(v) && v == '')
            .self$removeField(f)
    }

    # Correct InChIKey
    if (.self$hasField('INCHIKEY')) {
        v <- sub('^InChIKey=', '', .self$getFieldValue('INCHIKEY'), perl=TRUE)
        .self$setFieldValue('INCHIKEY', v)
    }

    # Synonyms
    synonyms <- XML::xpathSApply(parsed.content, "//synonym", XML::xmlValue)
    if (length(synonyms) > 0)
        .self$appendFieldValue('name', synonyms)
}
```

# Session information

```{r}
sessionInfo()
```