biodb 1.2.2
biodb is a framework designed to help you implement new connectors for databases. To illustrate this, we will show you a practical example where we create a connector for the ChEBI database. In this example, we will present you a small implementation of a ChEBI connector, and show you how to declare it to your biodb instance.
A more complete and functional connector for accessing ChEBI database is implemented in biodbChEBI library. See 1 for a list of the capabilities of this official biodb connector.
Title / method name | Description |
---|---|
Fields parsing | Formula, charge, InChI, InChIKey, molecular mass, monoisotopic mass, KEGG id, entity stars, SMILES. |
getEntryPageUrl() | Returns the URL of the website page of an entry. |
getEntryImageUrl() | Returns the URL to the molecule image of an entry. |
wsWsdl() | Returns the WSDL definition (i.e.: list of available web services and their parameters). |
wsGetLiteEntity() | Runs the getLiteEntity web service that returns database entries with their contents. |
convIdsToChebiIds() | Converts a list of IDs (InChI, InChI Keys, CAS, …) into a list of ChEBI IDs. |
convInchiToChebi() | Converts a list of InChI or InChI KEYs into a list of ChEBI IDs. |
convCasToChebi() | Converts a list of CAS IDs into a list of ChEBI IDs. |
searchForEntries() | Searches for entries by mass and/or by name. |
When creating a new extension package, biodb can help you generate all the necessary files.
A call to genNewExtPkg()
will generate the skeletons for the biodb
connector class and the biodb entry class, along with the testthat files, the
DESCRIPTION file, etc.
A simplified call might look like this:
biodb::genNewExtPkg(path='biodbChebiEx', dbName='chebi.ex', connType='compound',
dbTitle='ChEBI connector example', entryType='xml', remote=TRUE)
See 2 for a brief description of the parameters.
Other parameters exist for the author’s email, the author’s name, for
generating a Makefile
, or configuring for writing C++ code with Rcpp
.
Parameter | Description |
---|---|
path | The path to the package folder to create. |
dbName | The name of the connector to create. |
dbTitle | A short description of the database. |
connType | The class name of the connector. |
entryType | The class name of the entry. |
remote | Must be set to if a connection to a web server is needed. |
The files generated by the genNewExtPkg()
function are the following ones:
list.files('biodbChebiEx', all.files=TRUE, recursive=TRUE)
## [1] ".BBSoptions"
## [2] ".Rbuildignore"
## [3] ".gitignore"
## [4] ".travis.yml"
## [5] "DESCRIPTION"
## [6] "LICENSE"
## [7] "R/ChebiExConn.R"
## [8] "R/ChebiExEntry.R"
## [9] "R/package.R"
## [10] "README.md"
## [11] "biodb_ext.yml"
## [12] "inst/definitions.yml"
## [13] "inst/testref/entry-chebi.ex-0001.json"
## [14] "longtests/testthat.R"
## [15] "longtests/testthat/test_long_001_init_logging.R"
## [16] "longtests/testthat/test_long_100_generic.R"
## [17] "longtests/testthat/test_long_200_example.R"
## [18] "tests/testthat.R"
## [19] "tests/testthat/test_001_init_logging.R"
## [20] "tests/testthat/test_050_fcts.R"
## [21] "tests/testthat/test_100_generic.R"
## [22] "tests/testthat/test_200_example.R"
## [23] "vignettes/intro.Rmd"
Inside the biodb_ext.yml
file are stored the values of the parameters used
with biodb::genNewExtPkg()
.
This is in case you want to upgrade some the generated files (.gitignore
,
.travis.yml
, Makefile
, etc) with newer versions from biodb package.
You would then only need to call biodb::upgradeExtPkg(path='biodbChebiEx')
and the biodb_ext.yml
file would be read for parameter values.
The inst/definitions.yml
file defines the new connector, we will fill in some
values inside it.
Then we need to write implementations for the methods in the connector class
R/ChebiExConn.R
.
On the other side, R/ChebiExEntry.R
, the entry class, needs no modification
for our basic usage.
The test files in tests/testthat
will be executed when running R CMD check
,
they need to be edited first though.
Generic tests need to enabled inside tests/testthat/test_100_generic.R
.
The files tests/testthat/test_050_fcts.R
and
tests/testthat/test_200_example.R
contain only examples, thus they need to
be modified or removed.
The test files in tests/long
will not be executed when running R CMD check
.
They can be run manually after installing the package locally, by calling
R -e "testthat::test_dir('tests/long')"
.
A skeleton vignette has also been generated (vignettes/intro.Rmd
), and should
be completed with specific examples for this package.
Starting from the skeleton files generated by genNewExtPkg()
, we need now to
fill in the blanks.
The first file to take care of is inst/definitions.yml
, which contains the
definition of the new connector.
Then we will look quickly at R/ChebiExEntry.R
, which is rather empty in our
case, and R/ChebiExConn.R
, which requires much more attention, having several
methods that need implementation.
The naming of the classes inside the R files is important.
They must be named ChebiExEntry
and ChebiExConn
, in order to match the name
defined inside inst/definitions.yml
(chebi.ex
).
Hopefully the generator has taken care of this, and no special action is
required on this aspect, except not modifying the names.
The content of the generated YAML file inst/definitions.yml
is as follow:
# biodb example definitions file for extensions packages, version 1.0.0
databases:
chebi.ex:
name: ChEBI connector example
description: Write here the description of this database.
compound.db: true
entry.content.type: xml
parsing.expr:
accession: substring-after(//dbns:return/dbns:accessionId,'ACCESSION:')
name:
- //dbns:name
- //dbns:synonyms/dbns:data
mass: //dbns:mass
monoisotopic.mass: //dbns:monoisotopicMass
smiles: //dbns:return/dbns:smiles
inchi: //dbns:return/dbns:inchi
inchikey: //dbns:return/dbns:inchiKey
formula:
- //dbns:Formulae/dbns:source[text()='MyDatabase']/../dbns:data
- (//dbns:Formulae/dbns:data)[1]
xml.ns:
dbns: https://my.database.org/webservices/v1
xsd: http://www.w3.org/2001/XMLSchema
searchable.fields:
- name
- monoisotopic.mass
- molecular.mass
- average.mass
- nominal.mass
remote: true
# Length in seconds of the connection sliding window
scheduler.t: 1
# Number of connections allowed inside the connection sliding window
scheduler.n: 3
urls:
# Base URL of the database server, where to find entry pages
base.url: https://my.database.org/mydb/
# Webservice URL to use to contact web services
ws.url: https://my.database.org/webservices/mydb/3.2/
# Add any other URL you need for the development of your connector
# Inside your code, you can get each of these URLs with a call like the following one:
# .self$getPropValSlot('urls', 'ws.url')
fields:
chebi.ex.id:
description: ChEBI connector example ID
case.insensitive: true
forbids.duplicates: true
type: id
card: many
It is mainly filled with examples.
This YAML file contains two main parts: databases
and fields
.
The databases
part is where you list the new connectors you’ve created, and
the fields
part is where you define the new entry fields your new connectors
need.
We just have one new field to define: chebi.ex.id
.
This is the accession field for our new connector.
All connector accession fields are in the form <connector_class_id>.id
.
This accession field is mainly used inside other databases, when they make
references to other databases.
The field accession
, which is used in all entries of biodb connectors,
contains the same value as the connector accession field (chebi.ex.id
in our
case) and is preferable when accessing an entry.
The definition of the new field is quite simple, See 3 for
explanations of the different parameters.
Parameter | Description |
---|---|
description |
A free description of your field. |
type |
The type of the field. Here we declare that this is an accession (identifier) field: id . |
card |
The cardinality of the field: one if field accepts only one value, or many if multiple values can be stored inside the field. |
forbids.duplicates |
If TRUE then duplicates are forbidden. This supposes that we allow to store multiple values inside this field (i.e.: cardinality is set to many ). |
case.insensitive |
If TRUE then values will be compared in case insensitive mode. This is mostly useful when looking for duplicates. |
The main part is the declaration of the new connector.
This is done in the databases
section, under the key chebi.id
, which is the
database identifier.
See 4 for explanations of the different parameters.
Parameter | Description |
---|---|
name |
The full name of your new connector. |
urls |
A list (key/values) of URLs of the remote database. The common URLs to define are base.url to access pages of the database website, and ws.url for web service URLs. Those URLs are just “prefix” and are used inside the connector class for building real URLs. You can define as much URLs as the remote database requires, like a second base URL (base2.url ) or a second web service URL (ws2.url ), or any other URL with the key name you want. |
xml.ns |
This parameter defines namespaces for XML documents returned by the remote database. This is thus only useful for databases that return data in XML format. |
scheduler.n |
The maximum number of queries to send to the remote database, each T (stored as scheduler.t ) seconds. |
scheduler.t |
The time (in seconds) during which a maximum of N (stored as scheduler.n ) queries is allowed. |
entry.content.type |
The type of content sent by the database for an entry. Here we have specified xml . Allowed values are: html , sdf , txt , xml , csv , tsv , json , list . This is mainly used to add an extension to the file saved inside biodb cache. |
entry.content.encoding |
The text encoding used inside the entry’s content by the database. |
parsing.expr |
This is the most important part of the declaration. It is lists the different expressions to use in order to parse the values of the entry fields. The format is a key/value list, the key being the biodb field name, and the value the expression to run. Since the entry content type is XML, we have to use XPath expressions here. See this XPath Tutorial, for instance, to get an introduction to XPath. Note that we can define multiple expressions, like for formula field, in case of XPath expressions. If the first expression fails, then next expressions will be tried. |
searchable.fields |
A list of biodb entry fields that are searchable when calling a search function like searchCompound() . |
After setting some parsing expressions, the URLs and the searchable fields, we get a complete definition file, that you can find at:
defFile <- system.file("extdata", "chebi_ex.yml", package='biodb')
Its content is as follow:
databases:
chebi.ex:
name: ChEBI example connector
description: An example connector for ChEBI.
compound.db: true
entry.content.encoding: UTF-8
entry.content.type: xml
parsing.expr:
accession: substring-after(//chebi:return/chebi:chebiId,'CHEBI:')
formula:
- //chebi:Formulae/chebi:source[text()='ChEBI']/../chebi:data
- (//chebi:Formulae/chebi:data)[1]
inchi: //chebi:return/chebi:inchi
inchikey: //chebi:return/chebi:inchiKey
mass: //chebi:mass
monoisotopic.mass: //chebi:monoisotopicMass
name:
- //chebi:chebiAsciiName
smiles: //chebi:return/chebi:smiles
searchable.fields:
- name
- monoisotopic.mass
- molecular.mass
remote: true
scheduler.t: 1
scheduler.n: 3
urls:
base.url: https://www.ebi.ac.uk/chebi/
ws.url: https://www.ebi.ac.uk/webservices/chebi/2.0/
xml.ns:
chebi: https://www.ebi.ac.uk/webservices/chebi
xsd: http://www.w3.org/2001/XMLSchema
fields:
chebi.ex.id:
description: ChEBI ID
type: id
card: many
forbids.duplicates: true
case.insensitive: true
The entry class represents an entry from the database. Each instance of an entry contains the values parsed from the database downloaded content.
The entry class of our example extension package has been generated inside
R/ChebiExEntry.R
.
Here is its content:
#' ChEBI connector example entry class.
#'
#' Entry class for ChEBI connector example.
#'
#' @seealso
#' \code{\link{BiodbXmlEntry}}.
#'
#' @examples
#' # Create an instance with default settings:
#' mybiodb <- biodb::Biodb()
#'
#' # Get a connector that inherits from ChebiExConn:
#' conn <- mybiodb$getFactory()$createConn('chebi.ex')
#'
#' # Get the first entry
#' e <- conn$getEntry(conn$getEntryIds(1L))
#'
#' # Terminate instance.
#' mybiodb$terminate()
#'
#' @import biodb
#' @import R6
#' @export
ChebiExEntry <- R6::R6Class("ChebiExEntry",
inherit=
biodb::BiodbXmlEntry
,
public=list(
initialize=function(...) {
super$initialize(...)
}
,doCheckContent=function(content) {
# You can do some more checks of the content here.
return(TRUE)
}
,doParseFieldsStep2=function(parsed.content) {
# TODO Implement your custom parsing processing here.
}
))
The class inherits from BiodbXmlEntry
since we have set the entryType
parameter to "xml"
.
An entry class must inherit from the BiodbEntry
class and define some
methods.
To simplify this step, several generic entry classes have been defined in
biodb (see 5), depending on the type of content
downloaded from the database.
To use one of these classes for your entry class, you only have to make your
class inherit from the desired generic class.
Entry class | Content type handled |
---|---|
BiodbCsvEntry |
CSV file. |
BiodbHtmlEntry |
HTML, the parsing will be done using XPath expressions. |
BiodbJsonEntry |
JSON. |
BiodbListEntry |
R list. |
BiodbSdfEntry |
SDF file (chemical data file format). |
BiodbTxtEntry |
Text file, the parsing will be done using regular expressions. |
BiodbXmlEntry |
XML file, the parsing will be done using XPath expressions. |
Two methods are defined that can be used to enhance our implementation.
The method doCheckContent()
can be used to further check the parsed
content of an entry, for instance for some incoherence between fields.
The method doParseFieldsStep2()
allows to run some custom code for complex
parsing of the entry’s content.
This method is run after doParseFieldsStep1()
, which is defined inside the
mother class (here BiodbXmlEntry
) and executes the parsing expression defined
inside inst/definitions.yml
.
Note: biodb uses R6 as OOP (Object Oriented Programming) model. Please see vignette Details on biodb , for more explanations.
The generator has generated the full class, and thus has taken care of the inheritance part, as well as the declaration of the required methods. See 6 for a description of these methods. What is left to us is the implementation of those methods.
Here is the generated skeleton:
#' ChEBI connector example connector class.
#'
#' Connector class for ChEBI connector example.
#'
#' @seealso \code{\link{BiodbConn}}.
#'
#' @examples
#' # Create an instance with default settings:
#' mybiodb <- biodb::Biodb()
#'
#' # Get a connector:
#' conn <- mybiodb$getFactory()$createConn('chebi.ex')
#'
#' # Get the first entry
#' e <- conn$getEntry(conn$getEntryIds(1L))
#'
#' # Terminate instance.
#' mybiodb$terminate()
#'
#' @import biodb
#' @import R6
#' @export
ChebiExConn <- R6::R6Class("ChebiExConn",
inherit=biodb::BiodbConn,
public=list(
initialize=function(...) {
super$initialize(...)
}
,wsFind=function(name="", retfmt=c('plain', 'parsed', 'ids', 'request')) {
# This is the implementation of a fictive web service called "find" that
# search for entries by name.
# Use it as an example for implementing your own web services.
retfmt <- match.arg(retfmt)
# Build request
params <- list(name=name)
url <- BiodbUrl(url=c(.self$getPropValSlot('urls', 'ws.url'), 'find'),
params=params)
request <- .self$makeRequest(method='get', url=url)
# Return request
if (retfmt == 'request')
return(request)
# Send request
# This the line that should be run for sending the request and getting the
# results:
#results <- .self$getBiodb()$getRequestScheduler()$sendRequest(request)
# Instead, for this example, we just generate the results of this fictive
# web service:
results <- paste('{"0001": {"name": "name1"},',
' "0198": {"name": "name2"},',
' "9834": {"name": "name3"}}')
# Parse
if (retfmt != 'plain') {
# Parse JSON
results <- jsonlite::fromJSON(results, simplifyDataFrame=FALSE)
# Get IDs
if (retfmt == 'ids')
results <- names(results)
}
return(results)
}
),
private=list(
doGetNbEntries=function(count=FALSE) {
# Replace the call below if you have a direct way (specific web service for
# a remote database, provided method or information for a local database)
# to count entries for your database.
return(callSuper(count=count))
}
,doGetEntryContentFromDb=function(id) {
# Initialize return values
content <- rep(NA_character_, length(id))
# TODO Implement retrieval of entry contents.
# Some debug message
if (length(content) > 0)
.self$message('debug', paste("Content of first entry:", content[[1]]))
return(content)
}
,doGetEntryIds=function(max.results=NA_integer_) {
# Overrides super class' method.
ids <- NA_character_
# TODO Implement retrieval of accession numbers.
return(ids)
}
,doSearchForEntries=function(fields=NULL, max.results=NA_integer_) {
# Overrides super class' method.
ids <- character()
# TODO Implement search of entries by filtering on values of fields.
return(ids)
}
,doGetEntryContentRequest=function(id, concatenate=TRUE) {
# TODO Modify the code below to build the URLs to get the contents of the
# entries.
# Depending on the database, you may have to build one URL for each
# individual entry or may be able to write just one or a few URL for all
# entries to retrieve.
u <- c(.self$getPropValSlot('urls', 'base.url'), 'entries',
paste(id, 'xml', sep='.'))
url <- BiodbUrl(url=u)$toString()
return(url)
}
,doGetEntryPageUrl=function(id) {
# TODO Modify this code to build the individual URLs to the entry web pages
fct <- function(x) {
u <- c(.self$getPropValSlot('urls', 'base.url'), 'entries', x)
BiodbUrl(url=u)$toString()
}
return(vapply(id, fct, FUN.VALUE=''))
}
,doGetEntryImageUrl=function(id) {
# TODO Modify this code to build the individual URLs to the entry images
fct <- function(x) {
u <- c(.self$getPropValSlot('urls', 'base.url'), 'images', x,
'image.png')
BiodbUrl(url=u)$toString()
}
return(vapply(id, fct, FUN.VALUE=''))
}
))
The connector class is responsible for the connection to the database. In our case, the database is a compound database.
Method | Description |
---|---|
doGetEntryPageUrl() |
This method returns the official URL of the entry page on the database website, for each each accession number passed. The return type is thus a list. If no entry pages are available for the database, the method must return a list of NULL values, the same length as the input vector. |
doGetEntryImageUrl() |
This method returns the official URL of the entry picture on the database website, for each each accession number passed. The picture returned must be visual representation of the entry (a molecule 3D model, a mass spectrum, …). The return type is thus a list. If no entry pages are available for the database, the method must return a list of NULL values, the same length as the input vector. |
doGetEntryContentRequest() |
This method is called by getEntryContentRequest() , and must return a list of URLs used to retrieve entry contents. If concatenate parameter is FALSE , the list returned must be the same length as the vector id and each URL must point to one entry content only. If concatenate parameter is TRUE , then it is permitted (but not compulsory) to return URLs that get more than one entry at a time. |
doGetEntryIds() |
This method, called by getEntryIds() , should return the full list of accession numbers of the entries contained in the database, or a subset if max.results is set. This method is used for testing, in order to get a sample of existing entries, but may also be useful for users when developing. |
doSearchForEntries() |
This method implements the search of entries by filtering on some field values. For our example, we have kept it simple by implementing only the search by name (field "name" ), because a full implementation with mass search would require much more code with complex calls to ChEBI API. You can however see a real implementation inside biodbChebi, the package that implements the ChEBI connector. |
See the help inside R about BiodbConn
for details
about the parameters of those functions.
The remote methods are used for three different goals.
First to build URLs that access the web site, to get the URL of an entry page
(doGetEntryPageUrl()
) or to get the URL of an entry picture
(doGetEntryImageUrl()
) like a molecule representation.
Second to get a list of database entry identifiers (doGetEntryIds()
).
Third to Get the content of an entry (doGetEntryContentRequest()
).
In our implementations of doGetEntryPageUrl()
, doGetEntryImageUrl()
and
doGetEntryContentRequest()
(see below), you may notice the use of the
getPropValSlot()
method to get some base URLs ("base.url"
, "ws.url"
).
These values are defined inside the connector YAML definition file that we will
detail below.
Also, in those methods, we use the BiobdUrl
class to build the URLs.
BiodbUrl
handles the building of the URL parameters, as well as the encoding
of special characters.
The implemented method (doSearchForEntries()
) is a generic method used to
search for entries inside the database by name, mass, or any other field.
For our example we have decided to implement only the search by name in order
to keep the code as simple and short as possible.
To see a full implementation of this method, look at the official biodb
ChEBI connector at biodbChebi.
Inside the method’s code you will see that the implementation of the call to
the ChEBI web service API has been left to the dedicated method
wsGetLiteEntity()
.
In biodb official implementations of remote connectors, the implementations of calls to web services are done in separate dedicated methods having in common some principles.
These principles are important, because they assure a uniformity between biodb extension packages, allowing users to identify immediately a web service method and recognize the biodb generic parameters inside it.
Example of a web service method, taken from official biodb ChEBI extension package:
wsGetLiteEntity=function(search=NULL, search.category='ALL', stars='ALL',
max.results=10,
retfmt=c('plain', 'parsed', 'request', 'ids')) {
}
A web service method name must start with the prefix ws
, which stands for
web service, and be followed by the database API name of the web service
written in Java style (i.e.: an uppercase letter for the start of each word and
lowercase letters for the rest).
The first parameters of the method are the database web service parameters.
The last parameters (max.results
and retfmt
) are biodb specific.
max.results
controls the maximum number of results wanted, and must have a
default value (usually 10
).
retfmt
, which stands for return format, controls the format of the method’s
returned value.
The default value of retfmt
is set to a vector and then processed inside the
method with the match.arg()
method.
Thus the “real” default value is the first value of the vector, which must
always be "plain"
.
The set of possible values for retfmt
is variable from one web service method
to another.
However some of the values are compulsory.
See 7 for a full list of retfmt
possible values
officially accepted by biodb.
Value | Compulsory | Description |
---|---|---|
plain |
yes | Results are returned verbatim, without any change on the data returned by the server. |
parsed |
yes | Results are parsed according to the data format expected from the server (JSON, CSV, …) before being returned. |
request |
yes | Instead of returning the results of the query, the query is returned as a BiodbRequest object. The query is only built, and is never sent to the server. |
ids |
no | Results are returned as a character vector of entry identifiers. |
queryid |
no | This value is used when dealing with an asynchronous web service. The value returned is the ID of the asynchronous query extracted from the parsed results returned by the server. This query ID is then used to query the query status and to query the query results, usually with two other web services. |
status |
no | When dealing with an asynchronous web service query, this value asks for the current status of the query. |
data.frame |
no | Results are formatted into a data frame. |
You may want to look into some of biodb implementations of connectors to official remote databases, and see how the calls to web services have been implemented in dedicated web service methods. See 8.
Package | Official database site |
---|---|
biodbChebi | ChEBI |
biodbHmdb | HMDB |
biodbKegg | KEGG |
biodbUniprot | UniProt |
Here is our implementation of the connector class:
ChebiExConn <- R6::R6Class("ChebiExConn",
inherit=biodb::BiodbConn,
public=list(
initialize=function(...) {
super$initialize(...)
},
wsGetLiteEntity=function(search=NULL, search.category='ALL', stars='ALL',
max.results=10,
retfmt=c('plain', 'parsed', 'request', 'ids')) {
# Check parameters
chk::chk_string(search)
chk::chk_in(search.category, self$getSearchCategories())
chk::chk_number(max.results)
chk::chk_gte(max.results, 0)
chk::chk_in(stars, self$getStarsCategories())
retfmt <- match.arg(retfmt)
# Build request
params <- c(search=search,
searchCategory=search.category,
maximumResults=max.results,
starsCategory=stars)
url <- c(self$getPropValSlot('urls', 'ws.url'), 'test/getLiteEntity')
request <- self$makeRequest(method='get', url=BiodbUrl$new(url=url,
params=params),
encoding='UTF-8')
if (retfmt == 'request')
return(request)
# Send request
results <- self$getBiodb()$getRequestScheduler()$sendRequest(request)
# Parse
if (retfmt != 'plain') {
# Parse XML
results <- XML::xmlInternalTreeParse(results, asText=TRUE)
if (retfmt == 'ids') {
ns <- self$getPropertyValue('xml.ns')
results <- XML::xpathSApply(results, "//chebi:chebiId",
XML::xmlValue, namespaces=ns)
results <- sub('CHEBI:', '', results)
if (length(grep("^[0-9]+$", results)) != length(results))
self$error("Impossible to parse XML to get entry IDs.")
}
}
return(results)
}
),
private=list(
doSearchForEntries=function(fields=NULL, max.results=0) {
ids <- character()
if ( ! is.null(fields)) {
# Search by name
if ('name' %in% names(fields))
ids <- self$wsGetLiteEntity(search=fields$name,
search.category="ALL NAMES",
max.results=0, retfmt='ids')
}
# Cut
if (max.results > 0 && max.results < length(ids))
ids <- ids[seq_len(max.results)]
return(ids)
},
doGetEntryContentRequest=function(id, concatenate=TRUE) {
url <- c(self$getPropValSlot('urls', 'ws.url'), 'test',
'getCompleteEntity')
urls <- vapply(id, function(x) BiodbUrl$new(url=url,
params=list(chebiId=x))$toString(),
FUN.VALUE='')
return(urls)
},
doGetEntryIds=function(max.results=NA_integer_) {
return(NULL)
},
doGetEntryPageUrl=function(id) {
# Overrides super class' method
url <- c(self$getPropValSlot('urls', 'base.url'), 'searchId.do')
fct <- function(x) {
BiodbUrl$new(url=url, params=list(chebiId=x))$toString()
}
urls <- vapply(id, fct, FUN.VALUE='')
return(urls)
},
doGetEntryImageUrl=function(id) {
# Overrides super class' method
url <- c(self$getPropValSlot('urls', 'base.url'), 'displayImage.do')
fct <- function(x) {
BiodbUrl$new(url=url, params=list(defaultImage='true', imageIndex=0,
chebiId=x, dimensions=400))$toString()
}
urls <- vapply(id, fct, FUN.VALUE='')
return(urls)
}
))
Here is our implementation of the entry class:
ChebiExEntry <- R6::R6Class("ChebiExEntry",
inherit=BiodbXmlEntry,
public=list(
initialize=function(...) {
super$initialize(...)
}
),
private=list(
doCheck=function(content) {
# You can do some more checks of the content here.
return(TRUE)
}
,doParseFieldsStep2=function(parsed.content) {
# TODO Implement your custom parsing processing here.
}
))
To use the new connector, we first need to load the YAML definition file inside our biodb instance.
To start we create an instance of the BiodbMain
class:
mybiodb <- biodb::newInst()
## INFO [04:00:36.016] Loading definitions from package biodb version 1.2.2.
The loading of the definitions is done with a call to loadDefinitions()
:
mybiodb$loadDefinitions(defFile)
Now our biodb instance is aware of our new connector, and is ready to create instances of it.
To create an instance of our new connector class, we proceeds as usual in
biodb, by calling createConn()
on the factory instance, using our connector
identifier:
conn <- mybiodb$getFactory()$createConn('chebi.ex')
Now we can retrieve a ChEBI entry from the remote database:
entry <- conn$getEntry('17001')
## INFO [04:00:36.145] Create cache folder "/home/biocbuild/.cache/R/biodb/chebi.ex-0c5076ac2a43d16dbce503a44b09f649" for "chebi.ex-0c5076ac2a43d16dbce503a44b09f649".
entry$getFieldsAsDataframe()
## accession formula
## 1 17001 C9H13N5O4
## inchi
## 1 InChI=1S/C9H13N5O4/c10-9-13-7-5(8(18)14-9)12-3(1-11-7)6(17)4(16)2-15/h4,6,15-17H,1-2H2,(H4,10,11,13,14,18)/t4-,6+/m1/s1
## inchikey molecular.mass monoisotopic.mass
## 1 YQIFAMYNGGOTFB-XINAWCOVSA-N 255.2308 255.0967
## name smiles chebi.ex.id
## 1 7,8-dihydroneopterin Nc1nc2NCC(=Nc2c(=O)[nH]1)[C@H](O)[C@H](O)CO 17001
Do not forget to terminate your biodb instance once you are done with it:
mybiodb$terminate()
## INFO [04:00:37.313] Closing BiodbMain instance...
## INFO [04:00:37.316] Connector "chebi.ex" deleted.
We describe here the other types of connectors and entries that biodb
provide.
The generator that we have used to generate the package skeleton for chebi.ex
can also be used to generate skeleton for all the types described here.
With biodb we can also write a connector for a local database.
As a matter of fact, all the connectors included in biodb base package are
local connectors only: mass.csv.file
, comp.csv.file
and mass.sqlite
.
See 9 for a list of methods to implement when writing a local connector.
Method | Description |
---|---|
doGetNbEntries() |
Must return the number of entries contained in the database. |
doGetEntryContentFromDb() |
Return the content(s), as strings, of one or more entries from the database. |
doDefineParsingExpressions() |
May be overriden in order to define parsing expressions dynamically (see CsvFileConn class for an example). |
doGetEntryIds() |
This method, called by getEntryIds() , should return the full list of accession numbers of the entries contained in the database, or a subset if max.results is set. This method is used for testing, in order to get a sample of existing entries, but may also be useful for users when developing. |
In the example above, we have implemented a compound database.
Another type of database is a mass spectra database.
The following connectors included in biodb package are mass spectra database
connectors: mass.csv.file
and mass.sqlite
.
See 10 for a list of methods to implement when
writing a mass spectra database connector.
Method | Description |
---|---|
doGetChromCol() |
Returns a data frame containing the description of the chromatographic columns. |
doGetNbPeaks() |
Returns the total number of MS peaks contained in the database. |
doGetMzValues() |
Returns a list of M/Z values contained inside the database, with the possibility of filtering on MS mode, MS level, and some other variables. |
doSearchMzRange() |
Searches for spectra using an M/Z range and optional filtering on some other variables. |
Some database servers do not propose web services, or other connection to the database, but propose to download the whole database for local processing.
biodb offers the possibility to handle the connection to such database
servers, by setting downloadable
to TRUE
inside the definition of the
database connector.
See 11 for a list of methods to implement inside your connector when writing a downloadable database connector.
Method | Description |
---|---|
doesRequireDownload() |
This method must return TRUE if the connector requires to download files locally with the BiodbDownloadable interface. |
doDownload() |
This method must implement the download of the database file. |
doExtractDownload() |
This method must implement the extraction of the database files (e.g.: from a zip). |
We have seen in the example how to parse XML entries by writing an entry class
that inherits from the BiodbXmlEntry
class.
As stated before, biodb provides other types of abstract entry classes, that
facilitate the parsing of diverse entry content formats.
Here is a review of those formats.
To parse HTML content, your entry class should inherit from BiodbHtmlEntry
.
The parsing expressions must be written in XPath language, as for XML
content, but it uses a special parsing algorithm since HTML is less strict than
XML and allows some “illegal” constructs.
Example of a parsing expression:
path: //input[@id='DATA']
To parse JSON content, your entry class should inherit from BiodbJsonEntry
.
The parsing expressions are written in the form of lists of keys to follow as a
path inside the JSON tree.
Here is an example:
chrom.col.id:
- liquidChromatography
- columnCode
If your connector gets entry contents directly as an R list object, like in the
case of MassSqliteConn
, you have interest in making your entry class inherit
from BiodbListEntry
abstract class.
With this class, the entry content is provided as a flat named R list object,
although it is also possible to pass a JSON string containing flat key/value
pairs instead.
The parsing expressions are the names used inside the list object.
Here is an example:
accession: id
compound.id: comp_id
formula: chem_form
The BiodbCsvEntry
class helps you handle entry content in CSV (using comma
separator or any other character) format.
When declaring the constructor for your own entry class, do not forget to call
the mother class constructor to pass it your separator and/or the string values
that have to be converted to NA
:
MyEntryClass <- R6::R6Class("MyEntryClass", inherit=biodb::BiodbCsvEntry,
public=list(
initialize=function() {
super$initialize(sep=';', na.strings=c('', 'NA'))
}
))
The parsing expressions are the column names of the CSV file:
accession: id
name: fullname
If your entry content is in SDF (Structure Data File) chemical file format,
make you entry class inherit from BiodbSdfEntry
abstract class.
Since the SDF format is an official standard format, the parsing expressions
are useless in this case, your class only has to inherit from BiodbSdfEntry
.
The BiodbTxtEntry
abstract class allows you to handle any text file content
for entries.
Parsing expressions are defined as regular expressions, using the
stringr package, hence in ICU Regular
Expressions
format.
Here is an example:
accession: ^ENTRY\s+(\S+)\s+Compound
exact.mass: ^EXACT_MASS\s+(\S+)$
formula: ^FORMULA\s+(\S+)$
If none of the predefined formats fits your needs, your class have to inherit directly from BiodbEntry
.
Two methods have to be implemented in this case.
The first is doParseContent()
, which parses a string into the acceptable
format for the second function, doParseFieldsStep1()
.
Look for instance at the code of BiodbTxtEntry
class for a good example.
Here is an excerpt:
doParseContent=function(content) {
# Get lines of content
lines <- strsplit(content, "\r?\n")[[1]]
return(lines)
},
doParseFieldsStep1=function(parsed.content) {
# Get parsing expressions
parsing.expr <- .self$getParent()$getPropertyValue('parsing.expr')
.self$.assertNotNull(parsed.content)
.self$.assertNotNa(parsed.content)
.self$.assertNotNull(parsing.expr)
.self$.assertNotNa(parsing.expr)
.self$.assertNotNull(names(parsing.expr))
# Loop on all parsing expressions
for (field in names(parsing.expr)) {
# Match whole content
g <- stringr::str_match(parsed.content, parsing.expr[[field]])
# Get positive results
results <- g[ ! is.na(g[, 1]), , drop=FALSE]
# Any match ?
if (nrow(results) > 0)
.self$setFieldValue(field, results[, 2])
}
}
When inheriting from one of the abstract class listed above (BiodbTxtEntry
,
BiodbJsonEntry
, BiodbXmlEntry
, …), you also have the opportunity to write
some custom parsing code by implementing doParseFieldsStep2()
.
This method will be called just after doParseFieldsStep1()
, which is
implemented by the abstract class.
See HmdbMetabolitesEntry
class inside biodbHmdb extension package for an example.
Here is an extract:
doParseFieldsStep2=function(parsed.content) {
# Remove fields with empty string
for (f in .self$getFieldNames()) {
v <- .self$getFieldValue(f)
if (is.character(v) && ! is.na(v) && v == '')
.self$removeField(f)
}
# Correct InChIKey
if (.self$hasField('INCHIKEY')) {
v <- sub('^InChIKey=', '', .self$getFieldValue('INCHIKEY'), perl=TRUE)
.self$setFieldValue('INCHIKEY', v)
}
# Synonyms
synonyms <- XML::xpathSApply(parsed.content, "//synonym", XML::xmlValue)
if (length(synonyms) > 0)
.self$appendFieldValue('name', synonyms)
}
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biodb_1.2.2 BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] progress_1.2.2 tidyselect_1.1.1 xfun_0.28
## [4] bslib_0.3.1 purrr_0.3.4 vctrs_0.3.8
## [7] generics_0.1.1 htmltools_0.5.2 BiocFileCache_2.2.0
## [10] yaml_2.2.1 utf8_1.2.2 blob_1.2.2
## [13] XML_3.99-0.8 rlang_0.4.12 jquerylib_0.1.4
## [16] pillar_1.6.4 withr_2.4.3 glue_1.5.1
## [19] DBI_1.1.1 rappdirs_0.3.3 bit64_4.0.5
## [22] dbplyr_2.1.1 lifecycle_1.0.1 plyr_1.8.6
## [25] stringr_1.4.0 memoise_2.0.1 evaluate_0.14
## [28] knitr_1.36 fastmap_1.1.0 curl_4.3.2
## [31] fansi_0.5.0 highr_0.9 Rcpp_1.0.7
## [34] openssl_1.4.5 filelock_1.0.2 BiocManager_1.30.16
## [37] cachem_1.0.6 jsonlite_1.7.2 bit_4.0.4
## [40] hms_1.1.1 chk_0.7.0 askpass_1.1
## [43] digest_0.6.29 stringi_1.7.6 bookdown_0.24
## [46] dplyr_1.0.7 bitops_1.0-7 tools_4.1.2
## [49] magrittr_2.0.1 sass_0.4.0 RCurl_1.98-1.5
## [52] RSQLite_2.2.9 tibble_3.1.6 crayon_1.4.2
## [55] pkgconfig_2.0.3 ellipsis_0.3.2 prettyunits_1.1.1
## [58] assertthat_0.2.1 rmarkdown_2.11 httr_1.4.2
## [61] lgr_0.4.3 R6_2.5.1 compiler_4.1.2