---
title: "Details on biodb"
author: "Pierrick Roger"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('biodb')`"
vignette: |
  %\VignetteIndexEntry{Details on general *biodb* usage and principles}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
abstract: |
  Provides some more advanced knowledge on *biodb*: OOP model, configuration,
  request scheduler, persistent cache, logging...
output:
  BiocStyle::html_document:
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: false
  BiocStyle::pdf_document: default
bibliography: references.bib 
---

```{r, echo=FALSE}
source(system.file('vignettes_inc.R', package='biodb'))
```

# Introduction

In this document are presented different aspects of *biodb* in more details:
the object oriented programming model adopted,
its architecture,
how to configure the package,
how to create connectors and delete them,
understanding the request scheduler,
how to tune the logging system.

# Object oriented programming (OOP) model

In R, you may already know the two classical OOP models *S3* and *S4*.
*S4* is certainly the most used model, and is based on an original system of
generic methods that can be specialized for each class.

Two more recent OOP models exist in R: *Reference Classes* (aka *RC* or *R5*
from package *methods*) and *R6* (from package *R6*).
These two models are very similar.
The *biodb* package uses a mix of *RC* and *R6*, but will switch completely to
*R6* in a near future.

They both implement an object model and based on references.
This means that each object is unique, never copied and accessed through
references.
In this system, the created objects are not copied, but their reference are
copied.
Any modification of an object in one part of the code, will be visible from all
other parts of the code.
This means that when you pass an instance to a function, that function is able
to modify the instance.

Also, in *RC* and *R6*, the functions are attached to the object.
In other words, the mechanism to call a function on a object is different from
*S4*.
The calling mechanism is thus slightly different, in *RC* or *R6* we write
`myObject$myFunction()` instead of `myFunction(myObject)` in S4.

See
[Reference Classes](https://www.rdocumentation.org/packages/methods/versions/3.6.0/topics/ReferenceClasses) chapter from package *methods*
and this [introduction to R6](https://r6.r-lib.org/articles/Introduction.html)
for more details.

# Initialization & termination

*biodb* uses an initialization/termination scheme. You must first initialize
the library by creating an instance of the main class `BiodbMain`:
```{r}
mybiodb <- biodb::newInst()
```
And when you are done with the library, you have to terminate the instance
explicitly:
```{r}
mybiodb$terminate()
```

We will need a *biodb* instance for the rest of this vignette. Let us call the
constructor again:
```{r}
mybiodb <- biodb::newInst()
```

# Management classes

Several class instances are attached to the *biodb* instance for managing
different aspects of *biodb*: creating connectors, configuring *biodb*,
accessing the cache system, etc.

See table \@ref(tab:mngtClasses) for a list of these instances and their
purpose.

Class                 | Method to get the instance      | Description
--------------------- | ------------------------------- | --------------------------------
BiodbConfig           | `mybiodb$getConfig()`           | Access to configuration values, and modification.
BiodbDbsInfo          | `mybiodb$getDbsInfo()`          | Databases information (name, description, request frequency, etc).
BiodbEntryFields      | `mybiodb$getEntryFields()`      | Entry fields information (description, type, cardinality, etc).
BiodbFactory          | `mybiodb$getFactory()`          | Creation of connectors and entries.
BiodbPersistentCache  | `mybiodb$getPersistentCache()`  | Cache system on disk.
BiodbRequestScheduler | `mybiodb$getRequestScheduler()` | Send requests to web servers, respecting the frequency limit for each database server.
: (\#tab:mngtClasses) The management classes.
Only one instance is created for each of theses classes, and attached to the `BiodbMain` instance,

# Configuration

Several configuration values are defined inside the `definitions.yml` file of
*biodb*.
New configuration values can also be defined in extension packages.

o get a list of the existing configuration keys with their current value, run:
```{r}
mybiodb$getConfig()
```

To get a data frame of all keys with their title (short description), type and
default value, call (result in table \@ref(tab:keysDf)):
```{r}
x <- mybiodb$getConfig()$listKeys()
```
```{r keysDf, echo=FALSE, results='asis'}
knitr::kable(x, "pipe", caption="List of keys with some of their parameters")
```

To get the description of a key, run:
```{r}
mybiodb$getConfig()$getDescription('useragent')
```

To get a value, run:
```{r}
mybiodb$getConfig()$get('useragent')
```

To set a field value, run:
```{r}
mybiodb$getConfig()$set('useragent', 'My application ; wizard@of.oz')
mybiodb$getConfig()$get('useragent')
```

If the field is boolean, you can use the following methods instead:
```{r}
mybiodb$getConfig()$enable('offline')
mybiodb$getConfig()$disable('offline')
```

Configuration keys have default values. 
You can get a key's default value with this call:
```{r}
mybiodb$getConfig()$getDefaultValue('useragent')
```

Environment variables can be used to overwrite default values.
To get the name of the environment variable associated with a particular key,
call the following method:
```{r}
mybiodb$getConfig()$getAssocEnvVar('useragent')
```

# Databases information

Before creating any connector, you can information on the available databases
and their connector classes.

Getting the databases info instance will print you a list of all available
database connector classes:
```{r}
mybiodb$getDbsInfo()
```

If you want more information on one particular connector, run:
```{r}
mybiodb$getDbsInfo()$get('mass.csv.file')
```

# Available database connectors

This package is delivered with two connectors for local databasses:
MassCsvFile annd MassSqlite. However it is extendable, and in fact other
packages already exist or will soon be made available on Bioconductor or
GitHub for accessing other databases like ChEBI, Uniprot, HMDB, KEGG,
Massbank or Lipidmaps.  You may also write your own connector by extending
*biodb*. If you are interested, a vignette explains what you need to do in
details.

When creating the instance of the `BiodbMain` class you should have received a
message like "Loading definitions from package ..." if any extending package
has also been installed on your system. Connector definitions found in
extending packages are automatically loaded when instantiating `BiodbMain`, thus
you do not need to call `library()` to individually load each extending
package.

To get a list of available connectors, simply print information about your
`BiodbMain` instance:
```
mybiodb
```

# Creating connectors

Connectors are created through the factory instance.

To get the factory instance, run:
```{r}
mybiodb$getFactory()
```

Here is the creation of a Compound CSV File connector, using TSV file:
```{r}
chebi.tsv <- system.file("extdata", "chebi_extract.tsv", package='biodb')
conn <- mybiodb$getFactory()$createConn('comp.csv.file', url=chebi.tsv)
conn
```

The connector instance allows you to send requests to the database
to retrieve entries directly with `getEntry()`:
```{r}
conn$getEntry('1018')
```
or run more complex queries like a search:
```{r}
conn$searchForEntries(list(monoisotopic.mass=list(value=136.05, delta=0.1)))
```
See vignette
```{r, echo=FALSE, results='asis'}
make_vignette_ref('entries')
```
to learn how to access database entries and manipulate them.

All the connectors inherit from super class `BiodbConn`, hence they share a set
of common methods like `getEntryIds()`, `getEntry()` and `searchForEntries()`.
Moreover a connector may be specialized as a connector to a compound database
or to a mass spectra database, in which case they will inherit specific
methods.
In the case of a mass spectra database, it will be methods targeted toward mass
spectra like `getNbPeaks()`, `searchForMassSpectra()`, `msmsSearch()`, etc.
In the case of a compound database, it will be `annotateMzValues()`.
Those methods are generic and thus can be used with any connector inheriting
the super class.

When creating a connector, the factory keeps a reference to it in a list.
If you try to create again the same connector, the method `createConn()` will
throw an error.
However to can use the `getConn()` method to get back the connector instance
from its identifier or database name (if there is only one instance for the
database):
```{r}
conn <- mybiodb$getFactory()$getConn('comp.csv.file')
```

The factory is also responsible for creating the `BiodbEntry` objects, and as
for the connectors, it stores them into a list (called the "volatile cache").
When asking again for the same entry, the factory will return the reference is
has kept in the list.

To delete a connector, which is a good thing to do if you are not done with
*biodb* but you have finished using a particular connector, run:
```{r}
mybiodb$getFactory()$deleteConn(conn)
```
This will free all memory used for this connector, including the created
entries.
Be careful to do not keep those entry objects in some variable on your side,
otherwise the memory will not be released by R.

# Accessing a custom CSV file with a biodb connector

When creating a connector with `CompCsvFileConn` or `MassCsvFileConn`, if your
CSV file uses standard biodb field names as column names in its header line,
then everything will be fine and all values will read, recognized and set into
entry objects.

However if your CSV file uses custom column names, those values will be ignore
by biodb.
To tell biodb to use those columns, you must define a mapping between each
custom column with a valid biodb entry field, by using the `setField()` method.

Here we create a connector to a CSV file database (see table
\@ref(tab:compTable) for content) of chemical compounds that uses the semi-colon
as a separator:
```{r}
compUrl <- system.file("extdata", "chebi_extract_custom.csv", package='biodb')
compdb <- mybiodb$getFactory()$createConn('comp.csv.file', url=compUrl)
compdb$setCsvSep(';')
```
We use the `getUnassociatedColumns()` method to get a list of custom column
names:
```{r}
compdb$getUnassociatedColumns()
```
The method returns 3 column names that have not been automatically mapped.
However there is a little trick here, since `mass` field has been automatically
mapped but with the wrong biodb field `molecular.mass`, as you can see when
calling the method `getFieldsAndColumnsAssociation()`:
```{r}
compdb$getFieldsAndColumnsAssociation()
```
The `mass` column of the CSV file stores in fact the monoisotopic masses.
So we need to remap this column, and before that to reset the connector:
```{r}
mybiodb$getFactory()$deleteConn(compdb)
compdb <- mybiodb$getFactory()$createConn('comp.csv.file', url=compUrl)
compdb$setCsvSep(';')
compdb$setField('accession', 'ID')
compdb$setField('kegg.compound.id', 'kegg')
compdb$setField('monoisotopic.mass', 'mass')
compdb$setField('molecular.mass', 'molmass')
```

Now the connector works fine, and we can for instance get a list of all
accession numbers:
```{r}
compdb$getEntryIds()
```

And get whichever entry we want:
```{r}
compdb$getEntry('1018')$getFieldsAsDataframe()
```

```{r compTable, echo=FALSE, results='asis'}
compDf <- read.table(compUrl, sep=";", header=TRUE, quote="")
# Prevent RMarkdown from interpreting @ character as a reference:
compDf$smiles <- vapply(compDf$smiles, function(s) paste0('`', s, '`'), FUN.VALUE='')
knitr::kable(head(compDf), "pipe", caption="Excerpt from compound database CSV file.")
```

# Request scheduler

The `BiodbRequestScheduler` instance is responsible for sending requests to web
server, taking care of respecting the frequency specified by the `scheduler.n`
and `scheduler.t` parameters, and for receiving results and saving them to cache.

The cache is used to give back a result immediately without contacting the
server, in case the exact same request has already been run previously.

You do not have to interact with the request scheduler, it runs as a back end
component.

# Entry fields information

The `BiodbEntryFields` instance stores information about all entry fields
declared inside definitions YAML files.

Get the entry fields instance:
```{r}
mybiodb$getEntryFields()
```

Get a list of all defined fields:
```{r}
mybiodb$getEntryFields()$getFieldNames()
```

Get information about a field:
```{r}
mybiodb$getEntryFields()$get('monoisotopic.mass')
```
The object returned is a `BiodbEntryField` instance.
See the help page of this class to get a list of all methods you can call on
such an instance.

# Persistent cache system

The persistent cache system is responsible for saving entry contents and
results of web server requests onto disk, and reuse them later to avoid
recontacting the web server.

Run the following method to get the instance of the `BiodbPersistentCache`
class:
```{r}
mybiodb$getPersistentCache()
```

It is possible to delete files from the cache directly from the persistent
cache instance.
However it is a lot preferable to do it from the connector instance.
If we open an instance of the ChEBI example connector from the vignette
```{r, echo=FALSE, results='asis'}
make_vignette_ref('new_connector')
```
:
```{r}
source(system.file("extdata", "ChebiExConn.R", package='biodb'))
source(system.file("extdata", "ChebiExEntry.R", package='biodb'))
mybiodb$loadDefinitions(system.file("extdata", "chebi_ex.yml", package='biodb'))
conn <- mybiodb$getFactory()$createConn('chebi.ex')
```
And load some entry:
```{r}
entry <- conn$getEntry('17001')
```
The entry is now downloaded into the cache system.
We can check that with the following call:
```{r}
mybiodb$getPersistentCache()$fileExist(conn$getCacheId(), name='17001', ext=conn$getEntryFileExt())
```
And get the path to the cache file:
```{r}
mybiodb$getPersistentCache()$getFilePath(conn$getCacheId(), name='17001', ext=conn$getEntryFileExt())
```
If we delete the entry content from the cache:
```{r}
conn$deleteAllEntriesFromPersistentCache()
```
The file does not exist anymore:
```{r}
mybiodb$getPersistentCache()$fileExist(conn$getCacheId(), name='17001', ext=conn$getEntryFileExt())
```
But the entry object is still in memory.
We need to delete entry instances with the following call:
```{r}
conn$deleteAllEntriesFromVolatileCache()
```
Note that the results of web server requests are still inside the cache folder.
In order to force a new downloading of data, we need to erase those files too.
The following call will erase all cache files associated with a connector,
including the files deleted by `deleteAllEntriesFromPersistentCache()`:
```{r}
conn$deleteWholePersistentCache()
```

# Logging messages

*biodb* uses the *lgr* package for logging messages.
The *lgr* instance used by *biodb* can be gotten by calling:
```{r}
biodb::getLogger()
```
See the [lgr](https://s-fleck.github.io/lgr/) home page for demonstration on
how to use it.

You can use the following *biodb* short cuts to send messages of different
levels.
To send an information message:
```{r}
biodb::logInfo("%d entries have been processed.", 12)
```
To send a debug message:
```{r}
biodb::logDebug("The file %s has been written.", 'myfile.txt')
```
To send a trace message:
```{r}
biodb::logTrace("%d bytes written.", 1284902)
```

In addition *biodb* defines two methods to throw an error or a warning and log
this error or warning at the same time.
These are `biodb::error()` and `biodb::warn()`.

By default the *lgr* package displays information messages.
If you want to silence all messages, just run `lgr::lgr$remove_appender(1)`.
This is will remove the default appender and silence all messages from all
packages using *lgr*, including *biodb*.
However if you just want to silence *biodb* messages, run:
```{r}
biodb::getLogger()$set_threshold(300)
```
Information messages are now silenced:
```{r}
biodb::logInfo("hello")
```
For enabling again:
```{r}
biodb::getLogger()$set_threshold(400)
```
And messages are echoed again:
```{r}
biodb::logInfo("hello")
```

# Closing biodb instance

Do not forget to terminate your biodb instance once you are done with it:
```{r}
mybiodb$terminate()
```

# Session information

```{r}
sessionInfo()
```

# References