---
title: "CBEA: Competitive Balances for Taxonomic Enrichment Analysis"
author: 
  - name: Quang P.Nguyen
    affiliation:
    - Department of Epidemiology at Dartmouth College
    - Department of Biomedical Data Science at Dartmouth College  
    email: quangpmnguyen@gmail.com
  - name: Anne G. Hoen 
    affiliation: 
    - Department of Epidemiology at Dartmouth College
    - Department of Biomedical Data Science at Dartmouth College 
    - Department of Microbiology and Immunology at Dartmouth College 
    email: Anne.G.Hoen@Dartmouth.edu
  - name: H. Robert Frost
    affiliation:
    - Department of Biomedical Data Science at Dartmouth College 
    email: Hildreth.R.Frost@Dartmouth.edu  
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('CBEA')`"
vignette: >
  %\VignetteIndexEntry{Basic Usage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```


```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    phyloseq = citation("phyloseq")[1],
    treesummarizedexperiment = citation("TreeSummarizedExperiment")[1],
    tidyverse = citation("tidyverse")[1],
    mixtools = citation("mixtools")[1],
    fitdistrplus = citation("fitdistrplus")[1],
    CBEA = citation("CBEA")[1],
    broom = citation("broom")[1], 
    BiocParallel = citation("BiocParallel")[1]
)
```

# Introduction  

This package implements the CBEA approach for performing set-based enrichment analysis for microbiome relative abundance data. A preprint of the package can be found [on bioXriv](https://www.biorxiv.org/content/10.1101/2021.09.07.459294v1.full). In summary, CBEA (Competitive Balances for taxonomic Enrichment Analysis) provides an estimate of the activity of a set by transforming an input taxa-by-sample data matrix into a corresponding set-by-sample data matrix. The resulting output can be used for additional downstream analyses such as differential abundance, classification, clustering, etc. using set-based features instead of the original units.  

The transformation that CBEA applies is based on the isometric log ratio transformation:  
$$
CBEA_{i,\mathbb{S}} = \sqrt{\frac{|\mathbb{S}||\mathbb{S_c}|}{|\mathbb{S}| + |\mathbb{S_c}|}} \ln \frac{g(X_{i,j | j\in \mathbb{S}})}{g(X_{i,j | j \notin \mathbb{S}})}
$$
Where $\mathbb{S}$ is the set of interest, $\mathbb{S}_C$ is it's complement, $g()$ is the geometric mean operation, and $X$ is the original data matrix where $i$ is the index representing samples and $j$ is the index representing variables (or taxa).   

The inference procedure is performed through estimating the null distribution of the test statistic. This can be done either via permutations or a parametric fit of a distributional form on the permuted scores. Users can also adjust for variance inflation due to inter-taxa correlation. Please refer to the main manuscript for any additional details.  

# Usage guide    

## Install `CBEA`  

`r Biocpkg("CBEA")` is an `R` package available via the [Bioconductor](http://bioconductor.org) repository for packages. It requires installing the `R` open source statistical programming language, which can be accessed on any operating system from [CRAN](https://cran.r-project.org/). After which you can install `r Biocpkg("CBEA")` by using the following commands in your `R` session:

```{r "install", eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("CBEA")

## Check that you have a valid Bioconductor installation
BiocManager::valid()
```

If there are any issues with the installation procedure or package features, the best place would be to file an issue at the GitHub [repository](https://github.com/qpmnguyen/CBEA). For any additional support you can use the [Bioconductor support site](https://support.bioconductor.org/) and use the `CBEA` tag and check [the older posts](https://support.bioconductor.org/t/CBEA/). Please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.  

```{r "start", message=FALSE}
library("CBEA")
library(BiocSet)
library(tidyverse)
set.seed(1020)
```

## Loading sample data  
First, we can load some pre-packaged data sets contained in `r Biocpkg("CBEA")`. Here we're loading the data from the Human Microbiome Project (HMP) in `TreeSummarizedExperiment` data container from the `r Biocpkg("TreeSummarizedExperiment")`. This package does not support `phyloseq` from the `r Biocpkg("phyloseq")` package but users can leverage the `r Biocpkg("mia")` package to convert between the data types.  

In addition, users can also input raw matrices or data frames, however those require additional arguments. The `taxa_are_rows` argument requires users specify whether the data frame/matrix has taxa abundances as rows (or as columns). The `id_col` argument requires users to specify (for data frames only) a vector of names of row metadata that will be excluded from the analysis.  

```{r load_data}
data("hmp_gingival")
abun <- hmp_gingival$data
metab_sets <- hmp_gingival$set
abun # this is a TreeSummarizedExperiment object 
```

## Input sets 

CBEA accepts any type of sets, as long as it is in the `BiocSet` format where the elements in the sets can be matched to taxa names in the data set. The main function will check if these names match.   

```{r load_sets}
metab_sets
```

For more information on `BiocSet`, please refer to the documentation from `r Biocpkg("BiocSet")`. However, simply speaking, `BiocSet` acts similar to a list of three data frames and can be used in conjunction with `r CRANpkg("dplyr")`/`r CRANpkg("tidyr")`.  

## Applying CBEA   

After specifying the inputs, `cbea` is the main function to apply the method. If there are zeros in the abundance data, the `cbea` will add a pseudocount to avoid issues with the log-ratio transformation (but will throw a warning). If a different zero-handling approach is desired, users should pre-process the abundance data with the appropriate method. For parametric fits, `cbea` relies on the `r CRANpkg("fitdistrplus")` and `r CRANpkg("mixtools")` packages to estimate the parameters of the null. Specific arguments to control this fitting procedure can be provided as a named list in the `control` argument.     
Applying `cbea` is one command:  
```{r run_cbea, eval = TRUE}
results <- cbea(abun, set = metab_sets, abund_values = "16SrRNA",
              output = "cdf", distr = "mnorm", adj = TRUE, thresh = 0.05, n_perm = 10)
results
```

Some important arguments to control the behaviour of CBEA.  

* `output`: This controls what type of output is being returned. CBEA usually estimates a parametric null and users can specify what they want in return. If users want to perform downstream analysis with set-level features, they can return CDF values or z-scores of each raw score computed against that distribution (options `cdf` or `zscore`). Alternatively, users can just return the raw scores themselves (no distribution fitting will be performed) using `raw` as the option. Users can also use this distribution to estimate unadjusted p-values (option `pval`) to see whether a set is enriched at each sample. These unadjusted p-values can be converted based on a threshold (based on `thresh` which is default to be set at 0.05) into a dummy variable indicating enrichment (option `sig`). **Note**: CDF values and Z-scores are not available for non-parametric null estimations.  
* `parametric`: This is a logical argument to specify whether a the null distribution will be specified via parametric fit or via non-parametric permutation testing. If `parametric` is `TRUE`, users need to specify `distr` and `adj`. If `parametric` is `FALSE`, users need to increase `n_perm`.  
* `distr`: The form of the distribution if parametric fit is desired. As of now only supports `norm`, `mnorm`.  
* `adj`: Whether the distribution should be adjusted for variance inflation. This procedure is done by combining the mean estimate from scores computed from permuted data set and the variance estimate from raw scores (computed on the unpermuted data set).  
 
## Model output  

The output object is of class `CBEAout`, which is an S3 object. The underlying data structure is a list of lists, where the outer lists represent different aspects of the output. For example `R` represent the final scores while `diagnostic` represent certain goodness-of-fit statistics. 

```{r results_obj, eval = TRUE}
names(results)
```
Within each aspect, there is a list of size equivalent to the total number of sets evaluated. For example, the `results` object is of size 3 representing the evaluated sets.   

```{r results_obj_2, eval = TRUE}
str(results$R)
```

Users can use `tidy` and `glance` following the `r CRANpkg("broom")` to process `CBEAout` into nice objects. The `tidy` function returns a `tibble` of scores (samples by set). The `glance` function returns some diagnostics. There are two options for the `glance` function: `fit_comparison` allows users to compare the l-moments of the data, the permuted data, and the final fitted distribution; `fit_diagnostic` shows goodness-of-fit statistics of the distribution fitting procedure itself, with log-likelihoods and Anderson-Darling (column "ad") statistics.   

```{r tidy_functions, eval = TRUE}
tidy(results)
glance(results, "fit_comparison")
glance(results, "fit_diagnostic")
```

## Parallel computing  

`CBEA` has in-built capacity to perform calculations paralelled across the total number of sets. The engine for parallelization is `r Biocpkg("BiocParallel")`. If `NULL`, `SerialParam` backend will be used.   

```{r check_backend, eval= TRUE}
BiocParallel::registered()
```
```{r parallel_computing, eval=FALSE}
cbea(abun, set = metab_sets, abund_values = "16SrRNA",
     output = "cdf", distr = "mnorm", adj = TRUE, thresh = 0.05, n_perm = 10, 
     parallel_backend = MulticoreParam(workers = 2))
```


# Citing `CBEA`

We hope that `r Biocpkg("CBEA")` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

```{r "citation"}
## Citation info
citation("CBEA")
```

# Reproducibility

The `r Biocpkg("CBEA")` package `r Citep(bib[["CBEA"]])` was made possible thanks to:

* R `r Citep(bib[["R"]])`
* `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])`
* `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])`
* `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`
* `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])`
* `r CRANpkg("broom")` `r Citep(bib[["broom"]])`
* `r CRANpkg("sessioninfo")` `r Citep(bib[["sessioninfo"]])`
* `r CRANpkg("testthat")` `r Citep(bib[["testthat"]])`
* `r CRANpkg("mixtools")` `r Citep(bib[["mixtools"]])`
* `r CRANpkg("fitdistrplus")` `r Citep(bib[["fitdistrplus"]])`
* `r CRANpkg("tidyverse")` `r Citep(bib[["tidyverse"]])`
* `r Biocpkg("BiocSet")` `r Citep(bib[["BiocSet"]])`
* `r Biocpkg("phyloseq")` `r Citep(bib[["phyloseq"]])`
* `r Biocpkg("BiocParallel")` `r Citep(bib[["BiocParallel"]])`

This package was developed using `r BiocStyle::Biocpkg("biocthis")`.

Code for creating the vignette

```{r createVignette, eval=FALSE}
## Create the vignette
library("rmarkdown")
system.time(render("basic_usage.Rmd", "BiocStyle::html_document"))

## Extract the R code
library("knitr")
knit("basic_usage.Rmd", tangle = TRUE)
```

Date the vignette was generated.

```{r reproduce1, echo=FALSE}
## Date the vignette was generated
Sys.time()
```

Wallclock time spent generating the vignette.

```{r reproduce2, echo=FALSE}
## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)
```

`R` session information.

```{r reproduce3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 120)
session_info()
```


# Bibliography

This vignette was generated using `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])`
with `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` running behind the scenes.

Citations made with `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`.

```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```