---
title: "In-house mass spectra database"
author: "Pierrick Roger"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('biodb')`"
vignette: |
  %\VignetteIndexEntry{In-house LCMS database}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
abstract: |
  In this vignette you will learn how to load your in-house mass spectra CSV file database into *biodb*, how to **search** for LCMS spectra and MSMS spectra, how to find matching peaks, and how to annotate your **mass spectrum**. You will also learn how to create a connector for a *biodb* SQLite database.
output:
  BiocStyle::html_document:
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: false
  BiocStyle::pdf_document: default
bibliography: references.bib
---

```{r, code=readLines(system.file('vignettes_inc.R', package='biodb')), echo=FALSE}
```

# Introduction

*biodb* is able to handle in-house mass spectra databases stored either inside
a CSV file, using the `mass.csv.file` connector, or inside an SQLite file,
using the `mass.sqlite` connector.

Both connectors accept the creation of new entries through *biodb* methods.
The CSV file connector is able to read any CSV file as input, while the SQLite
connector is only able to interact with a database file created by *biodb*.

In the vignette
```{r, echo=FALSE, results='asis'}
make_vignette_ref('entries')
```
you can learn how to create a new empty SQLite or CSV file connector in order
to copy into it entries from another connector, and thus create your
own database.

To start, we create an instance of the `BiodbMain` class:
```{r}
mybiodb <- biodb::newInst()
```

# CSV File connector

We are going to use the `mass.csv.file` database connector to load an in-house
LCMS database.  This connector is able to access any in-house database CSV
file, containing LCMS entries and/or MSMS entries.

We will use an extract of the [Massbank](http://www.massbank.jp/) [@horai2010_massbank] database for our example:
```{r}
fileUrl <- system.file("extdata", "massbank_extract_lcms_2.tsv", package='biodb')
```

We create the connector from this file, and will use it in the examples of the following sections.
```{r}
conn <- mybiodb$getFactory()$createConn('mass.csv.file', url=fileUrl)
```

Two fields are missing inside this database: the MS level and the unit used for chromatographic retention times.
We define them using the `addField()` method:
```{r}
conn$addField('ms.level', 1)
conn$addField('chrom.rt.unit', 's')
```

# Searching for entries

Both `peak.mzexp` and `peak.mztheo` are defined inside the database.
Because of this ambiguity, the following call, which returns the field used for M/Z matching, emits a warning:
```{r}
conn$getMatchingMzField()
```

Setting the M/Z field to use:
```{r}
conn$setMatchingMzField('peak.mztheo')
```

Searching by M/Z:
```{r}
conn$searchForMassSpectra(mz.min=145.96, mz.max=147.13)
```

Searching with multiple M/Z values:
```{r}
conn$searchForMassSpectra(mz=c(86.000002, 146.999997), mz.tol=10, mz.tol.unit='ppm')
```
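
As a rough guide to what a ppm tolerance means, the sketch below converts it into an absolute M/Z window.
`ppmWindow()` is a hypothetical helper written for this illustration, not a *biodb* function, and it assumes the usual definition of ppm (the half-width of the window is `mz * tol * 1e-6`); the exact bounds computed internally by *biodb* may differ.
```{r}
# Hypothetical helper (not part of biodb): turn a ppm tolerance into an
# absolute M/Z window, assuming delta = mz * tol.ppm * 1e-6.
ppmWindow <- function(mz, tol.ppm) {
    delta <- mz * tol.ppm * 1e-6
    data.frame(mz=mz, mz.min=mz - delta, mz.max=mz + delta)
}

# Windows for the two M/Z values used in the search above
ppmWin <- ppmWindow(c(86.000002, 146.999997), tol.ppm=10)
ppmWin
```
At these masses, 10 ppm corresponds to a window of roughly one thousandth of an M/Z unit on each side.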

Searching by M/Z and RT:
```{r}
conn$searchForMassSpectra(mz.min=72.78, mz.max=73.12, rt=45.5, rt.tol=1, rt.unit='s')
```
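
The retention time window implied by `rt` and `rt.tol` can be sketched in base R.
`rtWindow()` is a hypothetical helper, not part of *biodb*. It assumes that the submitted retention time is first converted to seconds, that the accepted units are seconds (`'s'`) and minutes (`'min'`), and that `rt.tol` is a plain tolerance in seconds applied symmetrically; please check the help page of `BiodbMassdbConn` for the exact rules.
```{r}
# Hypothetical helper (not part of biodb): compute the RT window in seconds.
# Assumptions: rt is given in rt.unit, rt.tol is a plain tolerance in seconds.
rtWindow <- function(rt, rt.tol, rt.unit=c('s', 'min')) {
    rt.unit <- match.arg(rt.unit)
    rt.s <- if (rt.unit == 'min') rt * 60 else rt
    c(rt.min=rt.s - rt.tol, rt.max=rt.s + rt.tol)
}

# Window for the search above
rtWin <- rtWindow(rt=45.5, rt.tol=1, rt.unit='s')
rtWin
```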

Other parameters can be used in `searchForMassSpectra()`:

 * MS level.
 * MS mode.
 * Minimum relative intensity.
 * Matching of precursor peak.
 * Maximum number of results.

# Peak annotation

Peak annotation can be done with the `searchMsPeaks()` method.

We must first define the data frame of inputs:
```{r}
input <- data.frame(mz=c(73.01, 116.04, 174.2),
                    rt=c(79, 173, 79))
```

Then we run the annotation:
```{r}
conn$searchMsPeaks(input, mz.tol=0.1, rt.unit='s', rt.tol=10, match.rt=TRUE, prefix='match.')
```
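
To make the matching rule concrete, here is a minimal base R sketch of annotation by M/Z and RT tolerance.
It is not the *biodb* implementation: `annotateToy()` and the `toyRef` table are hypothetical, the reference values are invented for illustration only, and the sketch assumes plain (absolute) tolerances and keeps unmatched input peaks with an `NA` annotation.
```{r}
# Toy reference table: accessions and values are invented for illustration
toyRef <- data.frame(accession=c('TOY1', 'TOY2', 'TOY3'),
                     mz=c(73.05, 116.10, 200.00),
                     rt=c(75, 300, 79),
                     stringsAsFactors=FALSE)
toyInput <- data.frame(mz=c(73.01, 116.04, 174.2), rt=c(79, 173, 79))

# Hypothetical helper: for each input peak, keep the reference entries whose
# M/Z and RT both fall within the plain tolerances.
annotateToy <- function(input, ref, mz.tol, rt.tol) {
    rows <- lapply(seq_len(nrow(input)), function(i) {
        hit <- abs(ref$mz - input$mz[i]) <= mz.tol &
            abs(ref$rt - input$rt[i]) <= rt.tol
        acc <- if (any(hit)) ref$accession[hit] else NA_character_
        data.frame(mz=input$mz[i], rt=input$rt[i], match.accession=acc,
                   stringsAsFactors=FALSE)
    })
    do.call(rbind, rows)
}

toyAnnot <- annotateToy(toyInput, toyRef, mz.tol=0.1, rt.tol=10)
toyAnnot
```
In this toy example the second peak is close enough to `TOY2` in M/Z but not in RT, so it is left unannotated; this is the effect of requiring a match on retention time as well.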

# MS Annotation

## Annotation using an LCMS database

Here is the input data frame we want to annotate. It contains both M/Z values and RT values in seconds:
```{r Set a list of M/Z values}
ms.tsv <- system.file("extdata", "ms.tsv", package='biodb')
mzdf <- read.table(ms.tsv, header=TRUE, sep="\t")
```

We will annotate these peaks against an LCMS database.

For this example we use an in-house database saved as a TSV file:
```{r Instantiate an in-house LCMS database}
lcmsdb <- system.file("extdata", "massbank_extract_lcms_1.tsv", package="biodb")
massbank <- mybiodb$getFactory()$createConn('mass.csv.file', url=lcmsdb)
```
The database file was built using Massbank data. The accession numbers correspond to real Massbank entries.

Now that the database is loaded, we define some missing fields (MS level and RT unit):
```{r Set missing fields on database}
massbank$addField('ms.level', 1)
massbank$addField('chrom.rt.unit', 's')
```

We will now annotate our M/Z peaks, selecting a set of fields from the database to write in the output data frame:
```{r Annotate M/Z values with an in-house LCMS database}
massbank$searchMsPeaks(mzdf, mz.tol=1e-3, fields=c('accession', 'name', 'formula', 'chebi.id'), prefix='mydb.')
```

Annotation is also possible using retention times.
To do so, we choose the chromatographic columns on which we want to match, and the set of fields we want in the output:
```{r Set chromatographic columns and output fields}
chromColIds <- c('TOSOH TSKgel ODS-100V  5um Part no. 21456')
fields <- c('accession', 'name', 'formula', 'chebi.id', 'chrom.rt', 'chrom.col.id')
```

Finally we run the annotation on M/Z and RT values:
```{r Annotate using RT values}
massbank$searchMsPeaks(mzdf, mz.tol=1e-3, fields=fields, prefix='mydb.', chrom.col.ids=chromColIds, rt.unit='s', rt.tol=10, match.rt=TRUE)
```

For more details on the `searchMsPeaks()` method, please see the help page of `BiodbMassdbConn`.

# MS/MS

## Defining the database

We create the connector to a CSV file database built with MS2 spectra extracted from Massbank:
```{r}
db.tsv <- system.file("extdata", "massbank_extract_msms.tsv", package='biodb')
conn <- mybiodb$getFactory()$createConn('mass.csv.file', url=db.tsv)
```

## Searching for entries

We can search for entries containing certain M/Z values with the following method:
```{r}
conn$searchForMassSpectra(mz.min=115, mz.max=115.1, max.results=5)
```

It is also possible to call it with a single M/Z value and a tolerance:
```{r}
conn$searchForMassSpectra(mz=115, mz.tol=0.1, mz.tol.unit='plain', max.results=5)
```
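
When choosing between the `'plain'` and `'ppm'` units, it can help to express a plain tolerance in ppm at the mass of interest.
The one-line helper below is hypothetical (it is not part of *biodb*) and simply applies the usual ppm definition.
```{r}
# Hypothetical helper (not part of biodb): express a plain M/Z tolerance in ppm
plainToPpm <- function(mz, tol.plain) tol.plain / mz * 1e6

# A plain tolerance of 0.1 at M/Z 115, as in the call above
tolPpm <- plainToPpm(115, 0.1)
tolPpm
```
A plain tolerance of 0.1 is thus far more permissive at this mass than the 10 ppm tolerance used earlier in this vignette.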

## Spectrum matching

The spectrum to search for must be defined inside a data frame:
```{r}
spectrum <- data.frame(mz=c(286.1456, 287.1488, 288.1514), rel.int=c(100, 45, 18))
```

Then we call the `msmsSearch()` method:
```{r}
conn$msmsSearch(spectrum, precursor.mz=286.1438, mz.tol=0.1, mz.tol.unit='plain', ms.mode='pos')
```
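
The input spectrum above uses relative intensities, with the most intense peak set to 100.
If your own spectrum only has raw intensities, a data frame of the same shape can be built as sketched below; the raw intensity values are invented for illustration, and the scaling to the base peak is an assumption drawn from the example above rather than a documented requirement of `msmsSearch()`.
```{r}
# Raw intensities below are invented for illustration only
rawSpectrum <- data.frame(mz=c(286.1456, 287.1488, 288.1514),
                          intensity=c(2.0e5, 9.0e4, 3.6e4))

# Scale to the most intense peak (base peak = 100)
toySpectrum <- data.frame(
    mz=rawSpectrum$mz,
    rel.int=100 * rawSpectrum$intensity / max(rawSpectrum$intensity))
toySpectrum
```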

# SQLite connector

The SQLite connector operates in the same way as the CSV file connector, except
for the instantiation step for which it needs an SQLite file as input instead
of a CSV file.
Moreover, the SQLite file needs to be in *biodb* format: its relations (tables) must have been created by *biodb*.

Here is a *biodb* SQLite mass spectra file, already filled with entries:
```{r}
sqliteFile <- system.file("extdata", "generated", "massbank_extract_full.sqlite", package='biodb')
```

We create a connector from this file:
```{r}
sqliteConn <- mybiodb$getFactory()$createConn('mass.sqlite', url=sqliteFile)
```

We can search inside this database in the same way as we searched inside the CSV file database:
```{r}
sqliteConn$searchMsPeaks(mzdf, mz.tol=1e-3, fields=c('accession', 'name', 'formula', 'chebi.id'), prefix='mydb.')
```

# Closing biodb instance

Do not forget to terminate your *biodb* instance once you are done with it:
```{r}
mybiodb$terminate()
```

# References