---
title: "R-side access to published microbial signatures from BugSigDB"
author: "Ludwig Geistlinger, Jennifer Wokaty, and Levi Waldron"
output:
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('bugsigdbr')`"
vignette: >
  %\VignetteIndexEntry{R-side access to BugSigDB}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, include = FALSE}
## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
)
```

# BugSigDB: a comprehensive database of published microbial signatures

[BugSigDB](https://bugsigdb.org) is a manually curated database of microbial
signatures from the published literature of differential abundance studies of
human and other host microbiomes.

BugSigDB provides:

* standardized data on geography, health outcomes, host body sites, and
  experimental, epidemiological, and statistical methods using controlled
  vocabulary,
* results on microbial diversity,
* microbial signatures standardized to the NCBI taxonomy, and
* identification of published signatures where a microbe has been reported.

The [bugsigdbr](https://github.com/waldronlab/bugsigdbr) package implements
convenient access to BugSigDB from within R/Bioconductor. 
The goal of the package is to facilitate import of BugSigDB data into
R/Bioconductor, provide utilities for extracting microbe signatures, and enable
export of the extracted signatures to plain text files in standard file formats
such as [GMT][1].

[1]: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29

The [bugsigdbr](https://github.com/waldronlab/bugsigdbr) package is primarily a
data package. For descriptive statistics and comprehensive analysis of BugSigDB
contents, please see the 
[BugSigDBStats package](https://github.com/waldronlab/BugSigDBStats)
and [analysis vignette][2].

[2]: http://waldronlab.io/BugSigDBStats/articles/BugSigDBStats.html
 
We start by loading the package.

```{r "start", message=FALSE}
library(bugsigdbr)
```

## Obtaining published microbial signatures from BugSigDB

The function `importBugSigDB` can be used to import the complete collection of
curated signatures from BugSigDB. The dataset is downloaded once and
subsequently cached. Use `cache = FALSE` to force a fresh download of BugSigDB
and overwrite the local copy in your cache.  

```{r getBugSigDB, message = FALSE}
bsdb <- importBugSigDB()
dim(bsdb)
colnames(bsdb)
```

Each row of the resulting `data.frame` corresponds to a microbe signature from
differential abundance analysis, i.e. a set of microbes that has been found
with increased or decreased abundance in one sample group when compared to
another sample group (eg. in a case-vs.-control setup).
The curated signatures are richly annotated with additional metadata columns
providing information on study design, antibiotics exclusion criteria,
sample size, and experimental and statistical procedures, among others.

Subsetting the full dataset to certain conditions, body sites, or other
metadata columns of interest can be done along the usual lines for
subsetting `data.frame`s.

For example, the following `subset` command restricts the dataset to signatures
obtained from microbiome studies on obesity, based on fecal samples from
participants in the US.

```{r}
us.obesity.feces <- subset(bsdb,
                           `Location of subjects` == "United States of America" &
                           Condition == "obesity" &
                           `Body site` == "feces")
```

## Extracting microbe signatures

Given the full BugSigDB collection (or a subset of interest), the function
`getSignatures` can be used to obtain the microbes annotated to each signature.

Microbes annotated to a signature are returned following the
[NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) nomenclature per
default.

```{r getSignatures}
sigs <- getSignatures(bsdb)
length(sigs)
sigs[1:3]
```

It is also possible obtain signatures based on the full taxonomic
classification in [MetaPhlAn](https://github.com/biobakery/MetaPhlAn)
format ...

```{r getSignaturesMP}
mp.sigs <- getSignatures(bsdb, tax.id.type = "metaphlan")
mp.sigs[1:3]
```

... or using the taxonomic name only:

```{r getSignaturesTN}
tn.sigs <- getSignatures(bsdb, tax.id.type = "taxname")
tn.sigs[1:3]
```

As metagenomic profiling with 16S RNA sequencing or whole-metagenome shotgun
sequencing is typically conducted on a certain taxonomic level, it is also
possible to obtain signatures restricted to eg. the genus level ...

```{r getSignaturesGN}
gn.sigs <- getSignatures(bsdb, 
                         tax.id.type = "taxname",
                         tax.level = "genus")
gn.sigs[1:3]
```

... or the species level:

```{r getSignaturesSP}
gn.sigs <- getSignatures(bsdb, 
                         tax.id.type = "taxname",
                         tax.level = "species")
gn.sigs[1:3]
```

Note that restricting signatures to microbes given at the genus level, will per
default exclude microbes given at a more specific taxonomic rank such as
species or strain.

For certain applications, it might be desirable to not exclude microbes given
at a more specific taxonomic rank, but rather extract the more general
`tax.level` for microbes given at a more specific taxonomic level.

This can be achieved by setting the argument `exact.tax.level` to `FALSE`,
which will here extract genus level taxon names, for taxa given at the species
or strain level.

```{r getSignaturesExact}
gn.sigs <- getSignatures(bsdb, 
                         tax.id.type = "taxname",
                         tax.level = "genus",
                         exact.tax.level = FALSE)
gn.sigs[1:3]
```

## Writing microbe signatures to file in GMT format

Once signatures have been extracted using a taxonomic identifier type of
choice, the function `writeGMT` allows to write the signatures to plain text
files in [GMT format][3].

[3]: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29


```{r writeGMT}
writeGMT(sigs, gmt.file = "bugsigdb_signatures.gmt")
```

This is the standard file format for gene sets used by
[MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/) and
[GeneSigDB](https://pubmed.ncbi.nlm.nih.gov/22110038)
and is compatible with most enrichment analysis software.

```{r, echo = FALSE, results = "hide"}
file.remove("bugsigdb_signatures.gmt")
```

## Displaying BugSigDB signature and taxon pages

Leveraging BugSigDB's semantic MediaWiki web interface, we can also
programmatically access annotations for individual microbes and microbe
signatures.

The `browseSignature` function can be used to display BugSigDB signature pages
in an interactive session. For programmatic access in a non-interactive
setting, the URL of the signature page is returned.

```{r browseSig}
browseSignature(names(sigs)[1])
```

Analogously, the `browseTaxon` function displays BugSigDB taxon pages in an
interactive session, or the URL of the corresponding taxon page otherwise.

```{r browseTaxon}
browseTaxon(sigs[[1]][1])
```

# Ontology-based queries for experimental factors and body sites

The Semantic MediaWiki curation interface at [bugsigdb.org](https://bugsigdb.org)
enforces metadata annotation of signatures to follow established ontologies such
as the [Experimental Factor Ontology (EFO)](https://www.ebi.ac.uk/efo/) for condition,
and the [Uber-Anatomy Ontology (UBERON)](https://www.ebi.ac.uk/ols/ontologies/uberon)
for body site.

The `getOntology` function can be used to import both ontologies into R.
The result is an object of class `ontology_index` from the
[ontologyIndex](https://cran.r-project.org/web/packages/ontologyIndex/index.html)
package. 

```{r getEfo}
efo <- getOntology("efo")
efo
```

```{r getUberon}
uberon <- getOntology("uberon")
uberon
```

As demonstrated above, subsets of BugSigDB signatures can be obtained for signatures
associated with certain experimental factors or specific body sites of interest.
Higher-level queries can be performed with the `subsetByOntology` function, which
implements subsetting by more general ontology terms. This facilitates grouping
of signatures that semantically belong together.

More specifically, subsetting BugSigDB signatures by an EFO term then involves
subsetting the `Condition` column to the term itself and all descendants of that
term in the EFO ontology and that are present in the `Condition` column. Here,
we demonstrate the usage by subsetting to signatures associated with cancer. 

```{r subsetByEfo}
sdf <- subsetByOntology(bsdb,
                        column = "Condition",
                        term = "cancer",
                        ontology = efo)
dim(sdf)
table(sdf[,"Condition"])
```

And analogously, subsetting by an UBERON term involves subsetting the
`Body site` column to the term itself and all descendants of that term in the
UBERON ontology and that are present in the `Body site` column. For example,
we can use `subsetByOntology` to subset to signatures for which microbiome
samples have been obtained from parts of the digestive system.

```{r subsetByUberon}
sdf <- subsetByOntology(bsdb,
                        column = "Body site",
                        term = "digestive system",
                        ontology = uberon)
dim(sdf)
table(sdf[,"Body site"])
```

# Session info

```{r sessionInfo}
sessionInfo()
```