--- title: "In-house compound database" author: "Pierrick Roger" date: "`r BiocStyle::doc_date()`" package: "`r BiocStyle::pkg_ver('biodb')`" vignette: | %\VignetteIndexEntry{In-house compound database} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} abstract: | In this vignette you will learn how to load your in-house compounds CSV file into *biodb*, how to **search** for compounds by **name** and/or **mass**, and how to **annotate** your **mass spectrum** using that in-house database. You will also learn how to create a connector for a *biodb* SQLite database. output: BiocStyle::html_document: toc: yes toc_depth: 4 toc_float: collapsed: false BiocStyle::pdf_document: default bibliography: references.bib --- ```{r, code=readLines(system.file('vignettes_inc.R', package='biodb')), echo=FALSE} ``` # Introduction *biodb* is able to handle in-house compound databases either stored inside a CSV file, using the `comp.csv.file`, or inside an SQLite file, using `comp.sqlite` connectors. Both connectors accept the creation of new entries through *biodb* methods. The CSV file connector is able to read any CSV file in input, while the SQLite connector is only able to interact with a database file created by *biodb*. Inside vignette ```{r, echo=FALSE, results='asis'} make_vignette_ref('entries') ``` you can learn how to create a new empty SQLite of CSV file connector in order to copy entries into it entries from another connector, and thus create your own database. To start we create an instance of the `BiodbMain` class: ```{r} mybiodb <- biodb::newInst() ``` # CSV File connector In order to facilitate the loading of the file, you should use the **tabulation** character as columns separator and name the columns of your file with *biodb* **standard field names**. However, if your CSV file does not respect the *biodb* standard, you have also the possibility to declare a **custom column separator character** and the mapping between **column names of your file** and *biodb* field names just before loading the file. Once the connection to the file is defined, you can use the connector to your in-house file as any other compound database connector. ## Creating a connector In order to create a connector to a CSV database, you have to provide the path to your CSV file. This is done with the `url` parameter of the `createConn()` method. If your CSV file respects the *biodb* defaults (see below), then no further information is required. If your CSV file does not respect the *biodb* **standard**, then you will have to modify the defaults on the connector instance, before the CSV file is loaded. Thus it has to be done immediately after the connector creation. In the following sub-sections we are going to see how to load a *biodb* standard CSV file and a custom CSV file. ### Loading a biodb standard CSV file Here is a *biodb* standard CSV file containing an extract of the ChEBI database: ```{r} csvUrl <- system.file("extdata", "chebi_extract.tsv", package='biodb') ``` See table \@ref(tab:csvTable) for the content of this file. ```{r csvTable, echo=FALSE, results='asis'} csvDf <- read.table(csvUrl, sep="\t", header=TRUE, quote="") ## Prevent RMarkdown from interpreting @ character as a reference: csvDf$smiles <- vapply(csvDf$smiles, function(s) paste0('`', s, '`'), FUN.VALUE='') knitr::kable(head(csvDf), "pipe", caption="Excerpt from compound database TSV file.") ``` This CSV file respects the *biodb* defaults. The columns separator is the tabulation character. Column names use *biodb* standard entry field names. String values may be quoted with double quotes (`"`). We instantiate the connector by passing the URL to the factory: ```{r} conn <- mybiodb$getFactory()$createConn('comp.csv.file', url=csvUrl) ``` We will later use this connector to run the examples of this vignette. ### Loading a custom CSV file Here is a custom CSV file containing the same extract of the ChEBI database than with the *biodb* standard CSV file: ```{r} csvUrl2 <- system.file("extdata", "chebi_extract_custom.csv", package='biodb') ``` Only the columns separator character and some column names have been changed. We create now a connector for this custom CSV file: ```{r} conn2 <- mybiodb$getFactory()$createConn('comp.csv.file', url=csvUrl2) ``` At this step the file has not yet been loaded. We can thus customize the connector in order for the CSV file parsing to proceed correctly. The effective loading of the CSV file will happen when you run a method of the connector that requires the data. The first step to customize your connector is to set the separator character: ```{r} conn2$setCsvSep(';') ``` Then you may change the quote characters: ```{r} conn2$setCsvQuote('') ``` Here we specify with an empty string that this CSV file does not use quotes for character values. Finally you have to map each custom column name with the name of a *biodb* entry field. For this you call the `setField()` method for each column name, giving as first argument the *biodb* field name and as second argument the column name. In our case this gives: ```{r} conn2$setField('accession', 'ID') conn2$setField('kegg.compound.id', 'kegg') conn2$setField('monoisotopic.mass', 'mass') conn2$setField('molecular.mass', 'molmass') ``` You will notice that with the first call to `setField()` an information message tells you that the CSV file has been loaded. It is possible to associate **several column names** to a **single** *biodb* field, in which case you have to provide a character vector containing your column names. The values of the resulting *biodb* field will be the **concatenation** of the values of your selected columns, in the order specified. Because of the concatenation of your values, the type of the targeted *biodb* field must be **character**. This is particularly useful for the **accession** field, which must correspond to a unique entry inside your CSV file. Depending on your CSV file, you may need to associate several columns to create a valid accession value that identifies a unique entry. ## Retrieving entries Retrieving entries is done as with any other connector in *biodb*, using their accession numbers. The returned value is a list of `BiodbEntry` objects: ```{r} entries <- conn$getEntry(c('1018', '1456', '16750', '64679')) entries ``` From a list of entries, you can obtain a data frame with their values: ```{r} entriesDf <- mybiodb$entriesToDataframe(entries) ``` See table \@ref(tab:entriesTable) for the content of this data frame. ```{r entriesTable, echo=FALSE, results='asis'} ## Prevent RMarkdown from interpreting @ character as a reference: entriesDf$smiles <- vapply(entriesDf$smiles, function(s) paste0('`', s, '`'), FUN.VALUE='') knitr::kable(entriesDf, "pipe", caption="Some entries from the compound database.") ``` See vignette ```{r, echo=FALSE, results='asis'} make_vignette_ref('entries') ``` to know everything you can do with *biodb* entry objects and also the help page of the class `?biodb::BiodbEntry`. ## Searching for entries It is possible to search for entries by mass inside a compounds database: ```{r} conn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1))) ``` The function returns a list of accession numbers that you can use with the `getEntry()` method to retrieve full entry objects. The tolerance can also be expressed in PPM: ```{r} conn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, ppm=10))) ``` or with a range: ```{r} conn$searchForEntries(list(monoisotopic.mass=list(min=283.091, max=283.093))) ``` You can set a maximum to the number of entries returned with the `max.results` parameter: ```{r} conn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)), max.results=2) ``` To get a list of all possible mass fields in *biodb*, run: ```{r} mybiodb$getEntryFields()$getFieldNames(type='mass') ``` To get information on these fields run: ```{r} mybiodb$getEntryFields()$get(c('monoisotopic.mass', 'nominal.mass')) ``` To check if a connector is searchable by a field, use the following method: ```{r} conn$isSearchableByField('monoisotopic.mass') ``` To get the list of searchable fields for a connector, run: ```{r} conn$getSearchableFields() ``` Entries are also searchable by name: ```{r} conn$searchForEntries(list(name='deoxyguanosine')) ``` And it is possible to combine a search by mass with a search by name: ```{r} conn$searchForEntries(list(name='guanosine', monoisotopic.mass=list(value=283.0917, delta=0.1))) ``` ## Annotation of an MS file Your in-house chemical database can be used to annotate a mass spectrum, using a data frame or a vector as input. Annotation is done using the `annotateMzValues()` method, which is a generic method. It is thus available for all compound databases that allow search on masses. You will obtain a new data frame with appended columns taken from the chemical database. Here is an input data frame example with M/Z values in a column: ```{r} msTsv <- system.file("extdata", "ms.tsv", package='biodb') mzDf <- read.table(msTsv, header=TRUE, sep="\t") ``` See table \@ref(tab:mzTable) for the content of the input. ```{r mzTable, echo=FALSE, results='asis'} knitr::kable(mzDf, "pipe", caption="Input M/Z values.") ``` We run the annotation with the `annotateMzValues()` method: ```{r} annotDf <- conn$annotateMzValues(mzDf, mz.tol=1e-3, ms.mode='neg', mz.tol.unit='plain', fields=c('accession', 'name', 'formula', 'molecular.mass', 'monoisotopic.mass'), prefix='mydb.', fieldsLimit=1) ``` See table \@ref(tab:annotTable) for the results. Inside this table, the values coming from the database entry fields have been prefixed with the value provided inside the `prefix` parameter. The default value of this `parameter` would be the name of the database but you can set it to any value you like. The first parameter is the input, as a data frame or a numeric vector. In case of a data frame the column containing the M/Z values must be named `mz` or you have to specify its name using the `mz.col` parameter. The `mz.tol` and `mz.tol.unit` parameters are used to set the tolerance, see the manual of the class `BiodbCompounddbConn`. You can set the mass field to use in the database with the `mass.field` parameter (default is `monoisotopic.mass`). By default all entry fields from the database will be copied inside the output data frame, but you can restrict to a custom set of fields using the `fields` parameter. The `fieldsLimit` parameter is used to limit the number of values output for fields that may contain more than one value. Here it is used for the `'name'` field, which may content more than one name for each entry. By setting the parameter to `1` we select only the first name for each entry. You will find a complete description of this method and other compound methods by running `?biodb::BiodbCompounddbConn`. ```{r annotTable, echo=FALSE, results='asis'} knitr::kable(annotDf, "pipe", caption="The annotated mass spectrum. Columns prefixed with \"mydb.\" come from the compound database.") ``` See also vignette ```{r, echo=FALSE, results='asis'} make_vignette_ref('in_house_mass_db') ``` for annotation using a mass spectra database. # SQLite connector The SQLite connector operates in the same way as the CSV file connector, except for the instantiation step for which it needs an SQLite file as input instead of a CSV file. Moreover the SQLite file needs to be in *biodb* format (the relations need to be created by *biodb*). Here is a *biodb* SQLite compounds file, already filled with entries: ```{r} sqliteFile <- system.file("extdata", "generated", "chebi_extract.sqlite", package='biodb') ``` We create a connector from this file: ```{r} sqliteConn <- mybiodb$getFactory()$createConn('comp.sqlite', url=sqliteFile) ``` We can search inside this database the same way we have been searching inside the CSV file database: ```{r} sqliteConn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1))) ``` # Closing biodb instance Do not forget to terminate your biodb instance once you are done with it: ```{r} mybiodb$terminate() ```