---
title: "lefser: a Megatanomic biomarker discovery tool"
author: Asya Khleborodova, Sehyun Oh, Ludwig Geistlinger, and Levi Waldron
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Quickstart}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(comment = "#>", 
                      fig.align = "center",
                      collapse = FALSE, 
                      message = FALSE, 
                      warning = FALSE, 
                      eval = TRUE)
```

# Overview
## Background
*lefser* is the R implementation of the Linear discriminant analysis (LDA) 
Effect Size ([LEfSe][]), a Python package for metagenomic biomarker discovery 
and explanation. ([Huttenhower et al. 2011][]).

The original software utilizes standard statistical significance tests along 
with supplementary tests that incorporate biological consistency and the 
relevance of effects to identity the features (e.g., organisms, clades, OTU,
genes, or functions) that are most likely to account for differences between
the two sample classes of interest. While *LEfSe* is widely used and available
in different platform such as Galaxy UI and Conda, there is no convenient way
to incorporate it in R-based workflows. Thus, we re-implement *LEfSe* as
an R/Bioconductor package, *lefser*. Following the *LEfSe*‘s algorithm including
Kruskal-Wallis test, Wilcoxon-Rank Sum test, and Linear Discriminant Analysis,
with some modifications, *lefser* successfully reproduces and improves the
original statistical method and the associated plotting functionality.

[LEfSe]: https://huttenhower.sph.harvard.edu/galaxy/
[Huttenhower et al. 2011]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218848/


## Install and load pacakge
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("lefser")
```

```{r}
library(lefser)
```

## Citing *lefser*

Your citations are crucial in keeping our software free and open source. To
cite our package, please use this publication at the link [here][1].

[1]: https://doi.org/10.1093/bioinformatics/btae707

# Analysis example
## Prepare input

*lefser* package include the demo dataset, `zeller14`, which is the 
microbiome data from colorectal cancer (CRC) patients and controls 
([Zeller et al. 2014][]). 

In this vignette, we excluded the 'adenoma' condition and used control/CRC 
as the main classes and age category as sub-classes (adult vs. senior) with 
different numbers of samples: control-adult (n = 46), control-senior (n = 20), 
CRC-adult (n = 45), and CRC-senior (n = 46). 

[Zeller et al. 2014]: https://www.embopress.org/doi/full/10.15252/msb.20145645

```{r}
data(zeller14)
zeller14 <- zeller14[, zeller14$study_condition != "adenoma"]
```

The class and subclass information is stored in the `colData` slot under the
`study_condition` and `age_category` columns, respectively. 

```{r}
## Contingency table
table(zeller14$age_category, zeller14$study_condition)
```

If you try to run `lefser` directly on the 'zeller14' data, you will get the
following warning messages
```{r eval=FALSE}
lefser(zeller14, classCol = "study_condition", subclassCol = "age_category")
```

```
Warning messages:
1: In lefser(zeller14, classCol = "study_condition", subclassCol = "age_category") :
  Convert counts to relative abundances with 'relativeAb()'
2: In lda.default(x, classing, ...) : variables are collinear
```

### Terminal node
When working with taxonomic data, including both terminal and non-terminal 
nodes in the analysis can lead to collinearity problems. Non-terminal nodes 
(e.g., genus) are often linearly dependent on their corresponding terminal 
nodes (e.g., species) since the species-level information is essentially a 
subset or more specific representation of the genus-level information. This
collinearity can violate the assumptions of certain statistical methods, such
as linear discriminant analysis (LDA), and can lead to unstable or unreliable 
results. By using only terminal nodes, you can effectively eliminate this 
collinearity issue, ensuring that your analysis is not affected by linearly 
dependent or highly correlated variables. Additionally, you can benefit of 
avoiding redundancy, increasing specificity, simplifying data, and reducing
ambiguity, using only terminal nodes.

You can select only the terminal node using `get_terminal_nodes` function. 

```{r}
tn <- get_terminal_nodes(rownames(zeller14))
zeller14tn <- zeller14[tn,]
```


### Relative abundance
First warning message informs you that `lefser` requires relative abundance of
features. You can use `relativeAb` function to reformat your input. 

```{r}
zeller14tn_ra <- relativeAb(zeller14tn)
```


## Run `lefser`
The `lefser` function returns a `data.frame` with two columns - the names of 
selected features (the `features` column) and their effect size (the `scores` 
column).

There is a random number generation step in the `lefser` algorithm to ensure 
that more than half of the values for each features are unique. In most cases, 
inputs are sparse, so in practice, this step is handling 0s. So to reproduce
the identical result, you should set the seed before running `lefser`.

```{r}
set.seed(1234)
res <- lefser(zeller14tn_ra, # relative abundance only with terminal nodes
              classCol = "study_condition",
              subclassCol = "age_category")
head(res)
```


## Visualization using `lefserPlot`
```{r}
lefserPlot(res)
```


# Benchmarking againt other tools
The codes for benchmarking *lefser* against *LEfSe* and the other R
implementation of *LEfSe* is available [here][].

[here]: https://github.com/shbrief/lefserBenchmarking


# Interoperating with phyloseq
When using `phyloseq` objects, we recommend to extract the data and create a
`SummarizedExperiment` object as follows:

```{r warning=FALSE}
library(phyloseq)
library(SummarizedExperiment)

## Load phyloseq object
fp <- system.file("extdata", 
                  "study_1457_split_library_seqs_and_mapping.zip",
                  package = "phyloseq")
kostic <- microbio_me_qiime(fp)

## Split data tables
counts <- unclass(otu_table(kostic))
coldata <- as(sample_data(kostic), "data.frame")

## Create a SummarizedExperiment object
SummarizedExperiment(assays = list(counts = counts), colData = coldata)
```

You may also consider using `makeTreeSummarizedExperimentFromPhyloseq` from 
the `mia` package.
```{r}
mia::makeTreeSummarizedExperimentFromPhyloseq(kostic)
```


# Session Info

<details>
```{r}
sessionInfo()
```
</details>