---
title: 'An introduction to benchdamic'
author: 
    - Matteo Calgaro
    - Chiara Romualdi
    - Davide Risso
    - Nicola Vitulo
output: 
  BiocStyle::html_document:
    toc: yes
    number_section: yes
    fig_caption: yes
vignette: >
  %\VignetteEcoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Intro}
  \usepackage[utf8]{inputenc}
bibliography: bib_intro.json
csl: bioinformatics.csl
---

```{r options, include=FALSE, echo=FALSE}
knitr::opts_chunk$set(
    warning = FALSE, 
    error = FALSE, 
    message = FALSE)
```

# Introduction

This vignette provides an introductory example on how to work with the analysis framework firstly proposed in [@calgaro].

The package is named benchdamic, acronym for "BENCHmarking of Differential Abundance detection methods for MICrobial data". Not only does the package structure allow the users to test a variety of commonly used methods for differential abundance analysis, but it also enables them to set benchmarks including custom methods on their datasets. Performances of each method are evaluated with respect to i) suitability of distributional assumptions, ii) ability to control false discoveries, iii) concordance of the findings, and iv) enrichment of differentially abundant microbial species in specific conditions. Each step of the assessment is flexible when it comes to the choice of differential abundance methods, their parameters, and input data types. Various graphic outputs lead the users to an informed decision when evaluating the most suitable method to use for their data.

## Installation

To install this package, start R (version "4.2") and enter:

```{r, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("benchdamic")
```

or use:

```{r, eval=FALSE}
if (!require("devtools", quietly = TRUE))
    install.packages("devtools")

devtools::install_github("mcalgaro93/benchdamic")
```

Then, load some packages for basic functions and data:

```{r load_packs}
library(benchdamic)
# Parallel computation
library(BiocParallel)
# Generate simulated data
library(SPsimSeq)
# Data management
library(phyloseq)
library(SummarizedExperiment)
library(plyr)
# Graphics and tables
library(ggplot2)
library(cowplot)
library(kableExtra)
```

## Data loading

All datasets used in `benchdamic` are downloaded using the `HMP16SData` Bioconductor package [@HMP16SData].

For GOF and TIEC analyses a homogeneous group of samples (*e.g.*, only samples from a specific experimental condition, phenotype, treatment, body site, etc.) *ps_stool_16S* is used (use `help("ps_stool_16S")` for details). It contains 16S data from:

-   32 stool samples from participants of the Human Microbiome Project;

-   71 taxa, all features having the same genus-level taxonomic classification are collapsed together (a total of 71 taxa corresponding to 71 genera).

```{r dataloading_stool}
data("ps_stool_16S")
ps_stool_16S
```

For Concordance and Enrichment analyses the *ps_plaque_16S* dataset is used (use `help("ps_plaque_16S")` for details). It contains 16S data from:

-   30 participants of the Human Microbiome Project;

-   samples collected from subgingival plaque and supragingival plaque for each subject (a total of 60 samples);

-   88 taxa, all features having the same genus-level taxonomic classification are collapsed together (a total of 88 taxa corresponding to 88 genera).

```{r dataloading_plaque}
data("ps_plaque_16S")
ps_plaque_16S
```

# Goodness of Fit

**Assumption**: Many DA detection methods are based on parametric distributions.

**Research question**: Which are the parametric distributions that can fit both the proportion of zeros and the counts in your data?

## GOF structure

As different methods rely on different statistical distributions to perform DA analysis, the goodness of fit (GOF) of the statistical models underlying some of the DA methods on a 16S dataset is assessed. For each model, its ability to correctly estimate the average counts and the proportion of zeroes by taxon is evaluated.

Five distributions are considered: (1) the negative binomial (**NB**) used in `edgeR` and `DeSeq2` [@edger; @deseq2], (2) the zero-inflated negative binomial (**ZINB**) used in ZINB-WaVE [@zinbwave], (3) the truncated Gaussian Hurdle model of `MAST` [@mast], (4) the zero-inflated Gaussian (**ZIG**) mixture model of `metagenomeSeq` [@zig], and (5) the Dirichlet-Multinomial (**DM**) distribution underlying `ALDEx2` Monte-Carlo sampling [@aldex2] and multivariate extension of the beta-binomial distribution used by `corncob` [@corncob].

The relationships between the functions used in this section are explained by the diagram in Figure \@ref(fig:figGOF). To help with the reading: green boxes represent the inputs or the outputs, red boxes are the methods and blue boxes are the main parameters of those method.

```{r figGOF, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, fig.cap="Goodness of Fit diagram."}
knitr::include_graphics("./GOF_structure.svg")
```

## Parametric distributions

### Negative Binomial and Zero-Inflated Negative Binomial Models

For any $\mu \ge 0$ and $\theta > 0$, let $f_{NB}(\cdot;\mu,\theta)$ denote the probability mass function (PMF) of the negative binomial (NB) distribution with mean $\mu$ and inverse dispersion parameter $\theta$, namely:$$
f_{NB} = \frac{\Gamma(y+\theta)}{\Gamma(y+1)\Gamma(\theta)}\left(\frac{\theta}{\theta+1} \right)^\theta\left(\frac{\mu}{\mu+\theta} \right)^y, \forall y \in \mathbb{N}
$$Note that another parametrization of the NB PMF is in terms of the dispersion parameter $\psi = \theta^{-1}$ (although $\theta$ is also sometimes called dispersion parameter in the literature). In both cases, the mean of the NB distribution is $\mu$ and its variance is:$$
\sigma^2 = \mu + \frac{\mu^2}{\theta} = \mu+\psi\mu^2
$$In particular, the NB distribution boils down to a Poisson distribution when $\psi=0 \iff \theta=+ \infty$.

For any $\pi\in[0,1]$, let $f_{ZINB}(\cdot;\mu,\theta,\pi)$ be the PMF of the ZINB distribution given by:

$$
f_{ZINB}(\cdot;\mu,\theta,\pi) = \pi\delta_0(y)+(1-\pi)f_{NB}(y;\mu,\theta), \forall y\in\mathbb{N}
$$

where $\delta_0(\cdot)$ is the Dirac function. Here, $\pi$ can be interpreted as the probability that a 0 is observed instead of the actual count, resulting in an inflation of zeros compared to the NB distribution, hence the name ZINB.

To fit these distributions on real count data the `edgeR` [@edger] and `zinbwave` [@zinbwave] packages are used. In `benchdamic` they are implemented in the `fitNB()` and `fitZINB()` functions.

### Zero-Inflated Gaussian Model

The raw count for sample j and feature i is denoted by $c_{ij}$. The zero-inflated model is defined for the continuity-corrected logarithm of the raw count data: $y_{ij} = log_2(c_{ij}+1)$ as a mixture of a point mass at zero $I_{0}(y)$ and a count distribution $f_{count}(y;\mu,\sigma^2) \sim N(\mu,\sigma^2)$. Given mixture parameters $\pi_j$, we have that the density of the ZIG distribution for feature i, in sample j with $s_j$ total counts is: $$f_{ZIG}(y_{ij};s_j,\beta,\mu_i,\sigma^2_i) = \pi_j(s_j)\cdot I_{0}(y_{ij})+(1-\pi_j(s_j))\cdot f_{count}(y_{ij};\mu,\sigma^2)$$

The mean model is specified as:$$E(y_{ij})=\pi_{j} + (1-\pi_j)\cdot\left(b_{i0}+\eta_ilog_2\left( \frac{s_j^{\hat{l}}}{N}+1 \right) \right)$$

In this case, parameter $b_{i0}$ is the intercept of the model while the term including the logged normalization factor $log_2\left(\frac{s_j^{\hat{l}}}{N}+1 \right)$ captures feature-specific normalization factors through parameter $\eta_i$. In details, $s_j^{\hat{l}}$ is the median scaling factor resulted from the Cumulative Sum Scaling (CSS) normalization procedure. $N$ is a constant fixed by default at 1000 but it should be a number close to the scaling factors to be used as a reference, for this reason a good choice could be the median of the scaling factors (which is used instead of 1000). The mixture parameters $\pi_j(s_j)$ are modeled as a binomial process:

$$log\frac{\pi_j}{1-\pi_j} = \beta_0+\beta_1\cdot log(s_j)$$

To fit this distribution on real count data the `metagenomeSeq` package [@zig] is used. In `benchdamic` it is implemented in the `fitZIG()` function.

### Truncated Gaussian Hurdle Model

The original field of application of this method was the single-cell RNAseq data, where $y = log_2(TPM+1)$ expression matrix was modeled as a two-part generalized regression model [@mast]. In microbiome data that starting point translates to a $y_{ij} = log_2\left(counts_{ij}\cdot\frac{10^6}{libSize_{j}}+1 \right)$ or a $log_2\left(counts_{ij}\cdot\frac{ median(libSize)}{libSize_{j}}+1\right)$.

The taxon presence rate is modeled using logistic regression and, conditioning on a sample with the taxon, the transformed abundance level is modeled as Gaussian.

Given normalized, possibly thresholded, abundance $y_{ij}$, the rate of presence and the level of abundance for the samples were the taxon is present, are modeled conditionally independent for each gene $i$. Define the indicator $z_{ij}$, indicating whether taxon $i$ is expressed in sample $j$ (*i.e.*, $z_{ij} = 0$ if $y_{ij} = 0$ and $z_{ij} = 1$ if $y_{ij} > 0$). We fit logistic regression models for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y|Z=1)$ independently, as follows:

$$ logit(Pr(Z_{ij}=1))=X_j\beta_i^D $$

$$ P(Y_{ij}=y|Z_{ij}=1) \sim N(X_j\beta^C_i,\sigma^2_i)$$

To estimate this distribution on real count data the `MAST` package [@mast] is used. In `benchdamic` it is implemented in the `fitHURDLE()` function.

### Dirichlet-Multinomial Mixture Model

The probability mass function of a $n$ dimensional multinomial sample $y = (y_1,...,y_n)^T$ with library size $libSize = \sum_{i=1}^ny_i$ and parameter $p=(p_1,...,p_n)$ is:

$$
f(y;p)= {libSize\choose y}\prod_{i=1}^np_i^{y_i}
$$

The mean-variance structure of the MN model doesn't allow over-dispersion, which is common in real data. DM distribution models the probability parameter $p$ in the MN model by a Dirichlet distribution. The probability mass of a n-category count vector $y$ over $libSize$ trials under DM with parameter $\alpha=(\alpha_1,...,\alpha_n)$, $a_i>0$ and proportion vector $p \in \Delta_n=\{(p_1,...,p_n):p_i\ge0,\sum_ip_i=1 \}$ is:

$$
f(y|\alpha)={libSize\choose y}\frac{\prod_{i=1}^n(a_i)y_i}{(\sum_i\alpha_i)\cdot libSize}
$$

The mean value for the $i^{th}$ taxon and $j^{th}$ sample of the count matrix is given by $libSize_j\cdot \frac{\alpha_{ij}}{\sum_i a_{ij}}$.

To estimate this distribution on real count data the `MGLM` package [@MGLM; @MGLMpackage] is used. In `benchdamic` it is implemented in the `fitDM()` function.

## Comparing estimated and observed values

The goodness of fit for the previously described distributions is assessed comparing estimated and observed values. For each taxon the following measures are compared:

-   the Mean Difference (MD) *i.e.* the difference between the estimated mean and the observed mean abundance (log scale);

-   the Zero Probability Difference (ZPD) *i.e.* the difference between the probability to observe a zero and the observed proportion of samples which have zero counts.

To easily compare estimated and observed mean values the natural logarithm transformation, with the continuity correction ($log(counts+1)$), is well suited, indeed it reduces the count range making the differences more stable.

Except for the `fitHURDLE()` function, which performs a CPM transformation on the counts (or the one with the median library size), and the `fitZIG()` function which models the $log_2(counts+1)$, the other methods, `fitNB()`, `fitZINB()`, and `fitDM()`, model the $counts$ directly. For these reasons, `fitHURDLE()`'s output should not be compared directly to the observed $log(counts+1)$ mean values as for the other methods. Instead, the logarithm of the observed CPM (or the one with the median library size) should be used.

Here an example on how to fit a Truncated Gaussian hurdle model:

```{r fitHURDLE_alone}
example_HURDLE <- fitHURDLE(
    object = ps_stool_16S,
    scale = "median"
)
head(example_HURDLE)
```

The values above are those estimated by the `fitHURDLE()` function. Some *NA* values could be present due to taxa sparsity. The internally used function to prepare for the comparisons the observed counts is `prepareObserved()`, specifying the *scale* parameter if the HURDLE model is considered (if *scale = "median"*, the median library size is used to scale counts instead of $10^6$):

```{r observe_hurdle}
observed_hurdle <- prepareObserved(
    object = ps_stool_16S, 
    scale = "median")
head(observed_hurdle)
```

Which are different from the non-scaled observed values:

```{r prepareObserved_normal}
head(prepareObserved(object = ps_stool_16S))
```

The function to compute MD and ZPD values, is `meanDifferences()`:

```{r meanDifferences_hurdle}
head(meanDifferences(
    estimated = example_HURDLE,
    observed = observed_hurdle
))
```

A wrapper function to simultaneously perform the estimates and the mean differences is `fitModels()`:

```{r fitting}
GOF_stool_16S <- fitModels(
    object = ps_stool_16S,
    models = c("NB", "ZINB", "DM", "ZIG", "HURDLE"),
    scale_HURDLE = c("median", "default"),
    verbose = FALSE # TRUE is always suggested
)
```

Exploiting the internal structure of the `fitModels()`'s output the Root Mean Squared Error (RMSE) values for MD values can be extracted (the lower, the better):

```{r RMSE_MD}
plotRMSE(GOF_stool_16S, difference = "MD", plotIt = FALSE)
```

Similarly, they are extracted for ZPD values:

```{r RMSE_ZPD}
plotRMSE(GOF_stool_16S, difference = "ZPD", plotIt = FALSE)
```

## Visualization

### Mean Differences

To plot estimated and observed values the `plotMD()` function can be used (Figure \@ref(fig:plotGOFMD)). No systematic trend are expected, moreover, the closer the values to the dotted line are (representing equality between observed and estimated values), the better the goodness of fit relative to the model.

```{r plotGOFMD, fig.width=15, fig.height=4, fig.cap="MD plot. Mean-difference (MD) between the estimated and observed count values for each distribution."}
plotMD(
    data = GOF_stool_16S,
    difference = "MD",
    split = TRUE
)
```

If some warning messages are shown with this graph, they are likely due to sparse taxa. To address this, the number of NA values generated by each model can be investigated (which are 24 for each HURDLE model):

```{r NA_values, eval=FALSE}
plyr::ldply(GOF_stool_16S, function(model) 
    c("Number of NAs" = sum(is.na(model))), 
    .id = "Distribution")
```

To summarize the goodness of fit, the Root Mean Squared Error (RMSE) metric is also displayed for each model. For the *HURDLE_default* model, a quite different range of values of mean differences is displayed because of the excessive *default* scaling proposed (1 million). It is also possible to plot only a subset of the estimated models (Figure \@ref(fig:plotGOFMDnoHurdleDefault)).

```{r plotGOFMDnoHurdleDefault, warning=FALSE, fig.width=15, fig.height=4, fig.cap="MD plot reduced. Mean-difference (MD) between the estimated and observed count values for the first 5 distributions."}
plotMD(
    data = GOF_stool_16S[1:5],
    difference = "MD",
    split = TRUE
)
```

From the Figure \@ref(fig:plotGOFMDnoHurdleDefault), *DM* distribution slightly overestimates the logarithm of the average counts for low values, while the *HURDLE_median* distribution presents an overestimation that increases as the observed values increase. *ZIG*, but especially *NB* and *ZINB* distributions produce very similar estimated and observed values. Similarly, to plot the mean differences for Zero Probability/Proportion the `plotMD()` function is used (Figure \@ref(fig:plotGOFZPD)).

```{r plotGOFZPD, fig.width=15, fig.height=4, fig.cap="ZPD plot. Mean-difference between the estimated probability to observe a zero and the observed proportion of zero values (ZPD) for the first 5 distributions."}
plotMD(
    data = GOF_stool_16S[1:5],
    difference = "ZPD",
    split = TRUE
)
```

From the figure \@ref(fig:plotGOFZPD), *ZIG* and *NB* models underestimate the probability to observe a zero for sparse features, while the *HURDLE_median* model presents a perfect fit as the probability to observe a zero is the zero rate itself by construction. *DM* and *ZINB* models produce estimated values very similar to the observed ones. MDs and ZPDs are also available in the Figure \@ref(fig:plotGOFMDcollapsed) with a different output layout:

```{r plotGOFMDcollapsed, warning=FALSE, fig.width=10, fig.height=5, fig.cap="MD and ZPD plots. MD and ZPD plotted together for the first 5 distributions."}
plot_grid(plotMD(data = GOF_stool_16S[1:5], difference = "MD", split = FALSE),
    plotMD(data = GOF_stool_16S[1:5], difference = "ZPD", split = FALSE),
    ncol = 2
)
```

### RMSE

As already mentioned, to summarize the goodness of fit, the Root Mean Squared Error (RMSE) metric is used. The summary statistics for the overall performance are visible in Figure \@ref(fig:plotGOFRMSE):

```{r plotGOFRMSE, fig.width=10,fig.height=7, fig.cap="RMSE plot. Root Mean Squared Errors (RMSE) for both the MD and ZPD values for all the distributions."}
plot_grid(plotRMSE(GOF_stool_16S, difference = "MD"),
    plotRMSE(GOF_stool_16S, difference = "ZPD"),
    ncol = 2
)
```

The lower the RMSE value, the better the goodness of fit of the model.

## Discussion about GOF

The Goodness of Fit chapter is focused on some existing parametric models: NB, ZINB, HURDLE, ZIG, DM. The assumption of this analysis is that if a model estimates the data well, then a method based on that model may be a possibly good choice for studying the differential abundance. Other distributions could also be investigated (Poisson, Zero-Inflated Poisson...) but what about DA methods which are based on non-parametric models such as ANCOM? We can't use the GOF framework to compare the parametric models to non-parametric models. However, non-parametric methods may work well in real scenarios due to their added robustness and other evaluations are necessary in order not to favor one group of methods over another.

# DA methods

Differential abundance analysis is the core of `benchdamic`. DA analysis steps can be performed both directly, using the `DA_<name_of_the_method>()` methods, or indirectly, using the `set_<name_of_the_method>()` functions.

`set_<name_of_the_method>()` functions allow to create lists of instructions for DA methods which can be used by the `runDA()`, `runMocks()`, and `runSplits()` functions (more details in each chapter).

This framework grants a higher flexibility allowing users to set up the instructions for many DA methods only at the beginning of the analysis. If some modifications are needed, the users can re-set the methods or modify the list of instructions directly.

A list of the available methods is presented below (Table \@ref(tab:availableMethodsTable)). They are native to different application fields such as RNA-Seq, single-cell RNA-Seq, or Microbiome data analysis. Some basic information are reported for each DA method, for more details please refer to functions' manual.

```{r availableMethodsTable, echo=FALSE}
available_methods <- read.csv(file = "./benchdamic_methods.csv", sep = ";")
kable(x = available_methods, 
    caption = "DA methods available in benchdamic.",
    col.names = c("Method (package)", "Short description", "Test",
        "Normalization / Transformation", "Suggested input", 
        "Output", "Application"), booktabs = TRUE) %>%
    kable_styling(latex_options = "scale_down") %>%
    row_spec(0, bold = TRUE, color = "black") %>%
    column_spec(c(1,5,6), width = "3cm", color = "black") %>% 
    column_spec(2:4, width = "6cm", color = "black") %>%
    landscape()
```

Please remember that the data pre-processing, including QC analysis, filtering steps, and normalization, are not topics treated in `benchdamic`. In real life situations those steps precede the DA analysis and they are of extreme importance to obtain reliable results.

Some exceptions are present for the normalization step. In `benchdamic`, `norm_edgeR()`, `norm_DESeq2()`, `norm_CSS()`, and `norm_TSS()` are implemented functions to add the normalization/scaling factors to the `phyloseq` or `TreeSummarizedExperiment` objects, needed by DA methods. As for DA methods, normalization instructions list, including the previous functions, can be set using `set_<normalization_name>()` or `setNormalizations()` too. To run the normalization instructions the function `runNormalizations()` can be used (more examples will follow).

Many DA methods already contain options to normalize or transform counts. If more complex normalizations/transformations are needed, all the DA methods support the use of `TreeSummarizedExperiment` objects. In practice, users can put the modified count matrix in a named assay (the *counts* assay is the default one which contains the raw counts) and run the DA method on that assay using the parameter *assay_name = "assay_to_use"*.

## Add a custom DA method

To add a custom method to the benchmark, it must:

-   include a *verbose = TRUE* (or *FALSE*) parameter to let the user know what the method is doing;

-   return a *pValMat* matrix which contains the raw p-values and adjusted p-values in *rawP* and *adjP* columns respectively;

-   return a *statInfo* matrix which contains the summary statistics for each feature, such as the *logFC*, standard errors, test statistics and so on;

-   return a *name* which contains the complete name of the used method.

An example is proposed:

```{r customExample, eval=FALSE}
DA_yourMethod <- function(
        object, 
        assay_name = "counts",
        param1,
        param2,
        verbose = TRUE)
{
    if(verbose)
        message("Reading data")
    # Extract the data from phyloseq or TreeSummarizedExperiment
    counts_metadata <- get_counts_metadata(
        object = object, 
        assay_name = assay_name)
    counts <- counts_metadata[[1]] # First position = counts 
    metadata <- counts_metadata[[2]] # Second position = metadata
    
    ### your method's code 
    # Many things here
    if(verbose)
        message("I'm doing this step.")
    # Many other things here
    ### end of your method's code
    
    if(verbose)
        message("Extracting important statistics")
    # contains the p-values
    vector_of_pval <- NA 
    # contains the adjusted p-values
    vector_of_adjusted_pval <- NA 
    # contains the OTU, or ASV, or other feature names. 
    # Usually extracted from the rownames of the count data
    name_of_your_features <- NA 
    # contains the logFCs
    vector_of_logFC <- NA 
    # contains other statistics
    vector_of_statistics <- NA 
    
    if(verbose)
        message("Preparing the output")
    pValMat <- data.frame("rawP" = vector_of_pval,
                          "adjP" = vector_of_adjusted_pval)
    statInfo <- data.frame("logFC" = vector_of_logFC,
                           "statistics" = vector_of_statistics)
    name <- "write.here.the.name"
    # Be sure that the algorithm hasn't changed the order of the features. If it
    # happens, re-establish the original order.
    rownames(pValMat) <- rownames(statInfo) <- name_of_your_features

    # Return the output as a list
    return(list("pValMat" = pValMat, "statInfo" = statInfo, "name" = name))
} # END - function: DA_yourMethod
```

Once the custom method is set, it can be run by using the `DA_yourMethod()` function or manually, by setting a list of instructions of the custom method with the desired combination of parameters:

```{r customExampleInstances, eval=FALSE}
my_custom_method <- list(
    customMethod.1 = list( # First instance
        method = "DA_yourMethod", # The name of the function to call
        assay_name = "counts",
        param1 = "A", # Its combination of parameters
        param2 = "B"), # No need of verbose and object parameters
    customMethod.2 = list( # Second instance
        method = "DA_yourMethod", 
        assay_name = "counts",
        param1 = "C",
        param2 = "D")
    # Other istances
)
```

The *method* field, containing the name of the method to call, is mandatory, while the *verbose* parameter and the *object* are not needed.

# Type I Error Control

**Assumption**: Many DA methods do not control the number of false discoveries.

**Research question**: Which are the DA methods which can control the number of false positives in your data?

## TIEC structure

The Type I Error is the probability of a statistical test to call a feature DA when it is not, under the null hypothesis. To evaluate the Type I Error rate Control (TIEC) for each differential abundance detection method:

1.  using the `createMocks()` function, homogeneous samples (*e.g.*, only the samples from one experimental group) are randomly assigned to a group ('grp1' or 'grp2');

2.  DA methods are run to find differences between the two mock groups using `runMocks()`;

3.  the number of DA feature for each method is counted, these are False Positives (FP) by construction;

4.  points 1-3 are repeated many times (*N = 3*, but at least 1000 is suggested) and the results are averaged using the `createTIEC()` function.

In this setting, the p-values of a perfect test should be uniformly distributed between 0 and 1 and the false positive rate (FPR or observed $\alpha$), which is the observed proportion of significant tests, should match the nominal value (*e.g.*, $\alpha=0.05$).

The relationships between the functions used in this section are explained by the diagram in Figure \@ref(fig:figTIEC).

```{r figTIEC, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, fig.cap="Type I Error Control diagram."}
knitr::include_graphics("./TIEC_structure.svg")
```

## Create mock comparisons

Using `createMocks()` function, samples are randomly grouped, `N = 3` times. A higher `N` is suggested (at least 1000) but in that case a longer running time is required.

```{r createMocks}
set.seed(123)
my_mocks <- createMocks(
    nsamples = phyloseq::nsamples(ps_stool_16S),
    N = 3
) # At least N = 1000 is suggested
```

## Set up normalizations and DA methods

Once the mocks have been generated, DA analysis is performed. Firstly, some normalization factors, such as *TMM* from `edgeR` and *CSS* from `metagenomeSeq`, and some size factors such as *poscounts* from `DESeq2` are added to the `phyloseq` object (or `TreeSummarizedExperiment` object). This can be done, manually, using the `norm_edgeR()`, `norm_DESeq2()`, and `norm_CSS()` methods:

```{r normalizationManual, eval=FALSE}
ps_stool_16S <- norm_edgeR(
    object = ps_stool_16S,
    method = "TMM"
)
ps_stool_16S <- norm_DESeq2(
    object = ps_stool_16S,
    method = "poscounts"
)
ps_stool_16S <- norm_CSS(
    object = ps_stool_16S,
    method = "CSS"
)
```

Or automatically, using the `setNormalizations()` and `runNormalizations()` methods:

```{r setNormalization}
my_normalizations <- setNormalizations(
    fun = c("norm_edgeR", "norm_DESeq2", "norm_CSS"), 
    method = c("TMM", "poscounts", "CSS"))
ps_stool_16S <- runNormalizations(normalization_list = my_normalizations, 
    object = ps_stool_16S, verbose = TRUE)
```

Some messages "Found more than one"phylo" class in cache..." could be shown after running the previous functions. They are caused by duplicated class names between `phyloseq` and `tidytree` packages and can be ignored.

After the normalization/size factors have been added to the `phyloseq` or `TreeSummarizedExperiment` object, the user could decide to filter rare taxa which do not carry much information. In this example vignette a simple filter is applied to keep only features with a count in at least 3 samples:

```{r simple_filterTIEC, message=FALSE}
ps_stool_16S <- phyloseq::filter_taxa(
    physeq = ps_stool_16S, 
    flist = function(x) sum(x > 0) >= 3, prune = TRUE)
ps_stool_16S
```

Some zero-inflated negative binomial weights using the `weights_ZINB()` function are computed. They can be used as observational weights in the generalized linear model frameworks of `DA_edgeR()`, `DA_DESeq2()`, and `DA_limma()`, as described in [@zinbweights].

```{r weights}
zinbweights <- weights_ZINB(
    object = ps_stool_16S,
    K = 0,
    design = "~ 1",
)
```

For each row of the `mock_df` data frame a bunch of DA methods is run. In this demonstrative example the following DA methods are used:

-   basic *t* and *wilcox* tests;

-   `edgeR` with *TMM* scaling factors [@edger] with and without *ZINB* weights [@zinbwave; @zinbweights];

-   `DESeq2` with *poscounts* normalization factors [@deseq2] with and without *ZINB* weights [@zinbwave; @zinbweights];

-   `limma-voom` with *TMM* scaling factors [@limma; @voom; @limmarnaseq] with and without *ZINB* weights [@zinbwave; @zinbweights];

-   `ALDEx2` with *all* and *iqlr* data transformation (*denom* parameter) performing the *wilcox* test [@aldex2];

-   `metagenomeSeq` with *CSS* normalization factors using both the *fitFeatureModel* (for a zero-inflated log-normal distribution, mixture model, as suggested in the package vignette) and the *fitZig* (for a zero-inflated gaussian distribution, mixture model) algorithms [@zig];

-   `corncob` (temporarily removed from CRAN) with a focus on average differences (not dispersion, regulated by *phi.formula* and *phi.formula_null* parameters) using both *Wald* and *LRT* tests [@corncob];

-   `MAST` with both rescalings, *default* (*i.e.* $10^6$, for CPMs) and *median* [@mast];

-   `Seurat` with *LogNormalize* and *CLR* normalization/transformations, *t* and *wilcox* tests, and $10^5$ as scaling factor [@seurat];

-   `ANCOM` based on ANCOM-II algorithm with sampling fraction bias correction (*BC* parameter) [@ancom-bc; @ancom-ii];

-   `dearseq` with *asymptotic* test [@dearseq];

-   `linda` with winsorization for the outliers and pseudo-count addition to handle zeros [@linda];

-   `Maaslin2` with *TSS* normalization, *LOG* transformation, and *LM* analysis method [@maaslin2];

-   `ZicoSeq` with default parameters [@ZicoSeq].

Among the available methods, `NOISeq` [@noiseq] has not been used since it does not return p-values but only adjusted ones. Similarly, `mixMC` [@mixMC] has not been used since it does not return p-values. Many combination of parameters are still possible for all the methods.

```{r set_Methods}
my_basic <- set_basic(pseudo_count = FALSE,
    contrast = c("group", "grp2", "grp1"), 
    test = c("t", "wilcox"), 
    paired = FALSE, 
    expand = TRUE)

my_edgeR <- set_edgeR(
    pseudo_count = FALSE,
    group_name = "group",
    design = ~ group,
    robust = FALSE,
    coef = 2,
    norm = "TMM",
    weights_logical = c(TRUE, FALSE),
    expand = TRUE)

my_DESeq2 <- set_DESeq2(
    pseudo_count = FALSE,
    design = ~ group,
    contrast = c("group", "grp2", "grp1"),
    norm = "poscounts",
    weights_logical = c(TRUE, FALSE),
    alpha = 0.05,
    expand = TRUE)

my_limma <- set_limma(
    pseudo_count = FALSE,
    design = ~ group,
    coef = 2,
    norm = "TMM",
    weights_logical = c(FALSE, TRUE),
    expand = TRUE)

my_ALDEx2 <- set_ALDEx2(
    pseudo_count = FALSE,
    design = "group",
    mc.samples = 128,
    test = "wilcox",
    paired.test = FALSE, 
    denom = c("all", "iqlr"), 
    contrast = c("group", "grp2", "grp1"),
    expand = TRUE)

my_metagenomeSeq <- set_metagenomeSeq(
    pseudo_count = FALSE, 
    design = "~ group", 
    coef = "groupgrp2", 
    norm = "CSS", 
    model = c("fitFeatureModel", "fitZig"),
    expand = TRUE)

# Temporarily removed from CRAN
# my_corncob <- set_corncob(
#     pseudo_count = FALSE,
#     formula = ~ group,
#     formula_null = ~ 1,
#     phi.formula = ~ group,
#     phi.formula_null = ~ group,
#     test = c("Wald", "LRT"),
#     boot = FALSE,
#     coefficient = "groupgrp2")

my_MAST <- set_MAST(
    pseudo_count = FALSE,
    rescale = c("default", "median"),
    design = "~ 1 + group",
    coefficient = "groupgrp2",
    expand = TRUE)

my_Seurat <- set_Seurat(
    pseudo_count = FALSE,
    test = c("t", "wilcox"),
    contrast = c("group", "grp2", "grp1"),
    norm = c("LogNormalize", "CLR"),
    scale.factor = 10^5,
    expand = TRUE
)

my_ANCOM <- set_ANCOM(
    pseudo_count = FALSE,
    fix_formula = "group", 
    contrast = c("group", "grp2", "grp1"),
    BC = TRUE, 
    expand = TRUE   
)

my_dearseq <- set_dearseq(
    pseudo_count = FALSE,covariates = NULL,
    variables2test = "group", 
    preprocessed = FALSE, 
    test = "asymptotic",
    expand = TRUE)

my_linda <- set_linda(
    formula = "~ group", 
    contrast = c("group", "grp2", "grp1"),
    is.winsor = TRUE,
    zero.handling = "pseudo-count",
    alpha = 0.05,
    expand = TRUE)

my_Maaslin2 <- set_Maaslin2(
    normalization = "TSS",
    transform = "LOG",
    analysis_method = "LM",
    fixed_effects = "group",
    contrast = c("group", "grp2", "grp1"),
    expand = TRUE)

my_ZicoSeq <- set_ZicoSeq(contrast = c("group", "grp2", "grp1"),
    feature.dat.type = "count", 
    is.winsor = TRUE, outlier.pct = 0.03, winsor.end = "top",
    is.post.sample = TRUE, post.sample.no = 25, perm.no = 99,
    ref.pct = 0.5, stage.no = 6, excl.pct = 0.2, 
    link.func = list(function(x) sign(x) * (abs(x))^0.5))

my_methods <- c(my_basic, my_edgeR, my_DESeq2, 
    my_limma, my_metagenomeSeq, # my_corncob, 
    my_ALDEx2, my_MAST, my_Seurat, my_ANCOM, my_dearseq,
    my_linda, my_Maaslin2, my_ZicoSeq)
```

After concatenating all the DA instructions, they are run on the mock comparisons using the `runMocks()` function:

```{r runMocks}
bpparam <- BiocParallel::SerialParam()
# Random grouping each time
Stool_16S_mockDA <- runMocks(
    mocks = my_mocks, 
    method_list = my_methods, 
    object = ps_stool_16S, 
    weights = zinbweights, 
    verbose = FALSE, 
    BPPARAM = bpparam)
```

If some warnings are reported, *verbose = TRUE* can be used to obtain the method name and the mock comparison where the warnings occured.

The structure of the output in this example is the following:

-   *Comparison1* to *Comparison3* on the first level, which contains:

    -   *Methods*' output lists on the second level:

        -   *pValMat* which contains the matrix of raw p-values and adjusted p-values in *rawP* and *adjP* columns respectively;

        -   *statInfo* which contains the matrix of summary statistics for each feature, such as the *logFC*, standard errors, test statistics and so on;

        -   *dispEsts* which contains the dispersion estimates for methods like `edgeR` and `DESeq2`;

        -   *name* which contains the complete name of the used method.

The list of methods can be run in parallel leveraging the `BiocParallel` package. In details, parallelization is supported in Linux/Mac OS through the `MulticoreParam()` function as long as `ANCOM` functions are not included in the list of methods due to a different parallelization management of those functions.

```{r, eval=FALSE}
ancom_index <- which(grepl(pattern = "ANCOM", 
    names(my_methods)))
bpparam = BiocParallel::MulticoreParam()
Stool_16S_mockDA <- runMocks(
    mocks = my_mocks, 
    method_list = my_methods[-ancom_index], 
    object = ps_stool_16S, 
    weights = zinbweights, 
    verbose = FALSE, 
    BPPARAM = bpparam)
```

`ANCOM` based methods are usually the most time consuming. Parallel computing is still possible as long as it is directly managed by those methods (`n_cl` parameter). In the following example, each mock dataset is analyzed in serial mode but `ANCOM` is run in more than one core.

```{r, eval=FALSE}
# Modify the n_cl parameter
my_ANCOM_parallel <- set_ANCOM(
    pseudo_count = FALSE,
    fix_formula = "group", 
    contrast = c("group", "grp2", "grp1"),
    BC = TRUE, 
    n_cl = 2, # Set this number according to your machine
    expand = TRUE   
)

bpparam = BiocParallel::SerialParam()
Stool_16S_mockDA_ANCOM <- runMocks(
    mocks = my_mocks,
    method_list = my_ANCOM_parallel, # Only ANCOM
    object = ps_stool_16S,
    weights = zinbweights,
    verbose = FALSE,
    BPPARAM = bpparam)
```

### Add a new DA method later in the analysis

It may happen that at a later time the user wants to add to the results already obtained, the results of another group of methods. For example a new version of `limma`:

```{r new_methods}
my_new_limma <- set_limma(
    pseudo_count = FALSE,
    design = ~ group,
    coef = 2,
    norm = "CSS",
    weights_logical = FALSE)
```

Which returns a new set of `limma` instructions and a warning for using *CSS* normalization factors instead of those native to `edgeR`.

First of all, the same mocks and the same object must be used to obtain the new results. To run the new instructions the `runMocks()` function is used:

```{r runMocks_new_limma, message=FALSE}
Stool_16S_mockDA_new_limma <- runMocks(
    mocks = my_mocks, 
    method_list = my_new_limma, 
    object = ps_stool_16S, 
    verbose = FALSE, BPPARAM = bpparam)
```

To put everything together a `mapply()` function is used to exploit the output structures:

```{r merging_old_and_new}
Stool_16S_mockDA_merged <- mapply(
    Stool_16S_mockDA, # List of old results
    Stool_16S_mockDA_new_limma, # List of new results
    FUN = function(old, new){
        c(old, new) # Concatenate the elements
    }, SIMPLIFY = FALSE)
```

## Counting the False Positives

The `createTIEC()` function counts the FPs and evaluates the p-values distributions:

```{r createTIEC, warning=FALSE}
TIEC_summary <- createTIEC(Stool_16S_mockDA)
```

A list of 5 `data.frames` is produced:

1.  `df_pval` is a 5 columns and *number_of_features* x *methods* x *comparisons* rows data frame. The five columns are called *Comparison*, *Method*, *variable* (which contains the feature names), *pval* and *padj*;

2.  `df_FPR` is a 5 columns and *methods* x *comparisons* rows data frame. For each set of method and comparison, the proportion of FPs, considering 3 threshold (0.01, 0.05, 0.1) is reported;

3.  `df_FDR` is a 4 columns and number of *methods* rows data frame. For each method, the average False Discovery Rate is computed averaging the results across all comparisons (considering 3 threshold, 0.01, 0.05, and 0.1);

4.  `df_QQ` contains the average p-value for each theoretical quantile, *i.e.* the QQ-plot coordinates to compare the mean observed p-values distribution across comparisons, with the theoretical uniform distribution. Indeed, the observed p-values should follow a uniform distribution under the null hypothesis of no differential abundant features presence;

5.  `df_KS` is a 5 columns and *methods* x *comparisons* rows data frame. For each set of method and comparison, the Kolmogorov-Smirnov test statistics and p-values are reported in *KS* and *KS_pval* columns respectively.

## Visualization

### False Positive Rate

The false positive rate (FPR or observed $\alpha$), which is the observed proportion of significant tests, should match the nominal value because all the findings are FPs by construction. In this example `edgeR.TMM`, `edgeR.TMM.weighted`, `limma.TMM.weighted`, and `metagenomeSeq.CSS.fitZig` appear to be quite over all the thresholds (liberal behavior), differently `ALDEx2.all.wilcox.unpaired` and `basic_t` methods are below (conservative behavior) or in line with the thresholds (Figure \@ref(fig:FPRplot)).

```{r FPRplot, fig.cap="FPR plot. Boxplots of the proportion of raw p-values lower than the commonly used thresholds for the nominal $\\alpha$ (i.e. the False Positive Rate) for each DA method.", fig.width=8, fig.height=5}
cols <- createColors(variable = levels(TIEC_summary$df_pval$Method))
plotFPR(df_FPR = TIEC_summary$df_FPR, cols = cols)
```

### False Discovery Rate

The false discovery rate $FDR = E\left[\frac{FP}{FP+TP}\right]$ is the expected value of the ratio between the false positives and all the positives. By construction, mock comparisons should not contain any TPs and when all the hypotheses are null, FDR and FWER (Family Wise Error Rate) coincide. For each set of method and comparison, the FDR is set equal to 1 (if at least 1 DA feature is found) or 0 (if no DA features are found). Hence, the estimated FDR is computed by averaging the values across all the mock comparisons. As the number of mock comparisons increases, the more precise the estimated FDR will be. Just as alpha is set as a threshold for the p-value to control the FPR, a threshold for the adjusted p-value, which is the FDR analog of the p-value, can be set. FDR values should match the nominal values represented by the red dashed lines. In this example, the number of mock comparisons is set to 3, so the estimates are unprecise (Figure \@ref(fig:FDRplot)).

```{r FDRplot, fig.cap="FDR plot. Average nominal False Discovery Rate values for several commonly used thresholds for each DA method.", fig.width=8, fig.height=5}
plotFDR(df_FDR = TIEC_summary$df_FDR, cols = cols)
```

### QQ-Plot

The p-values distribution under the null hypothesis should be uniform. This is qualitatively summarized in the QQ-plot in Figure \@ref(fig:QQplot) where the bisector represents a perfect correspondence between observed and theoretical quantiles of p-values. For each theoretical quantile, the corresponding observed quantile is obtained averaging the observed p-values' quantiles from all mock datasets. The plotting area is zoomed-in to show clearly the area between 0 and 0.1.

Methods over the bisector show a conservative behavior, while methods below the bisector a liberal one.

The starting point is determined by the total number of features. In our example the starting point for the theoretical p-values is computed as 1 divided by the number of taxa, rounded to the second digit. In real experiments, where the number of taxa is higher, the starting point is closer to zero.

```{r QQplot, fig.cap="QQ plot. Quantile-quantile plot from 0 to 0.1 for each DA methods. Average curves are reported", fig.width=8, fig.height=7}
plotQQ(df_QQ = TIEC_summary$df_QQ, zoom = c(0, 0.1), cols = cols) + 
    guides(colour = guide_legend(ncol = 1))
```

As the number of methods increases, distinguishing their curves becomes more difficult. For this reason it is also possible to plot each method singularly (Figure \@ref(fig:QQplotsplit)).

```{r QQplotsplit, fig.cap="QQ plot. Quantile-quantile plots from 0 to 1 for each DA method are displayed separately. Average curves are reported", fig.height=12, fig.width=12}
plotQQ(df_QQ = TIEC_summary$df_QQ, zoom = c(0, 1), cols = cols, split = TRUE)
```

### Kolmogorov-Smirnov test

Departure from uniformity is quantitatively evaluated through the Kolmogorov-Smirnov test which is reported for each method across all mock datasets using the the `plotKS` function in Figure \@ref(fig:KSplot).

```{r KSplot, fig.cap="KS plot. Kolmogorov-Smirnov (KS) statistic boxplots for each DA methods where the raw p-value distribution is compared with a uniform distribution."}
plotKS(df_KS = TIEC_summary$df_KS, cols = cols)
```

High KS values indicates departure from the uniformity while low values indicates closeness. All the clues we had seen in the previous figures \@ref(fig:QQplot) and \@ref(fig:QQplotsplit) are confirmed by the KS statistics: `metagenomeSeq.CSS.fitZig`, which was very liberal and its distribution of p-values is the farthest from uniformity among the tested methods. Also ALDEx2 based methods show high KS values, indeed they showed a very conservative behaviour.

### Log distribution of p-values

Looking at the p-values' log-scale can also be informative. This is because behavior in the tail may be poor even when the overall p-value distribution is uniform, with a few unusually small p-values in an otherwise uniform distribution. Figure \@ref(fig:LogPplot) displays the distributions of all the p-values (in negative log scale) generated by each DA method across all the mock comparisons.

```{r LogPplot, fig.cap="-log10(p-value) plot. Negative logarithm distribution of p-values. Red-shaded vertical bars represent the 90, 95, and 99 percentiles for the negative log distribution of p-values for each method. They should align with the dotted lines which represent the percentiles of the IDEAL distribution.", fig.width=8, fig.height=10}
plotLogP(df_pval = TIEC_summary$df_pval, cols = cols)
```

Similarly, figure \@ref(fig:AveLogPplot) exploits the structure of the `df_QQ` data.frame generated by the `createTIEC()` function to display the distribution of the p-values (in negative log scale) generated by each DA method, averaged among mock comparisons (only two in this vignette). As this second graphical representation is only based on 1 averaged p-value for each quantile, it is also less influenced by anomalously large values.

```{r AveLogPplot, fig.cap="-log10(average p-value) plot ('average' refers to the average p-value computed for each quantile across mocks comparisons). Negative logarithm distribution of average p-values. Red-shaded vertical bars represent the 90, 95, and 99 percentiles for the negative log distribution of average p-values for each method. They should align with the dotted lines which represent the percentiles of the IDEAL distribution.", fig.width=8, fig.height=10}
plotLogP(df_QQ = TIEC_summary$df_QQ, cols = cols)
```

In the figure \@ref(fig:LogPplot) and \@ref(fig:AveLogPplot), the $-\log_{10}(p-value)$ IDEAL distribution is reported in red color as the first method. To highlight tail's behaviors, 3 percentiles (0.9, 0.95, 0.99) are reported using red-shaded vertical segments for each method. If the method's distribution of negative log-transformed p-values or average p-values is still uniform in the 3 selected quantiles of the tail, the 3 red vertical segments will align to the respective dotted line. Methods are ordered using the distances between the observed quantiles and the ideal ones. Usually, when a method has its red segments to the left of the IDEAL's ones is conservative (*e.g.*, `ALDEx2.iqlr.wilcox.unpaired` and `MAST.default`). Indeed, for those methods, little p-values are fewer than expected. On the contrary, methods with red segments to the right of the IDEAL's ones are liberal (*e.g.*, `edgeR.TMM`). Mixed results could be present: a method that has a lower quantile for one threshold and higher quantiles for the others (*e.g.*, `limma.TMM`).

## Discussion about TIEC

Putting all the previous graphical representations together gives a general overview of methods' ability to control FPs and p-values distribution under the null hypothesis (*i.e.* no differential abundance). It is clear that only methods that produce p-values can be included in this analysis. While figures \@ref(fig:QQplot) and \@ref(fig:QQplotsplit) have a main exploratory scope regarding the p-values distribution based on quantile-quantile comparison, figures \@ref(fig:FPRplot), \@ref(fig:KSplot), and \@ref(fig:AveLogPplot) are able to rank methods according to False Positive Rate, uniformity of p-values distribution, and departure from uniformity in the tail. The latter graphical representations could be used as a first tool to establish which DA method to consider for further analyses and which DA methods to exclude. Finally, the figure \@ref(fig:FDRplot) can be used to assess FDR control under the null scenario. This exposes the problem of a few extremely small p-values among a collection that looks roughly uniform. If that is the case, the Type I error would be under control but the FDR would be inflated.

# Concordance

**Assumption**: Applying different methods to the same data may produce different results.

**Questions**: How much do the methods agree with each other? How much does a method agree with itself?

## Concordance structure

To measure the ability of each method to produce replicable results from a dataset with two or more groups:

1.  samples are divided to obtain the Subset1 and Subset2 datasets using the `createSplits()` function;

2.  DA methods are run on both subsets using the `runSplits()` function;

3.  the Concordance At the Top metric (CAT) between the lists of p-values is computed to obtain the Between Methods Concordance (BMC) and the Within Method Concordance (WMC);

4.  steps 1-3 are repeated many times (*N = 2*, but at least 100 are suggested) and the results are averaged using the `createConcordance()` function.

The relationships between the functions used in this section are explained by the diagram in Figure \@ref(fig:figconcordance).

```{r figconcordance, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, fig.cap="Concordance diagram."}
knitr::include_graphics("./concordance_structure.svg")
```

## Split datasets

Using the `createSplits()` function, the *ps_plaque_16S* dataset is randomly divided by half. In this dataset, samples are paired: 1 sample for supragingival plaque and 1 sample for subgingival plaque are considered for each subject. The *paired* parameter is passed to the method (it contains the name of the variable which describes the subject IDs) so the paired samples are inside the same split. In this specific case, the two groups of samples are balanced between conditions, reflecting the starting dataset. However, if the starting dataset had been unbalanced, the *balanced* option would have allowed to keep the two splits unbalanced or not.

```{r createSplits}
set.seed(123)

# Make sure that groups and subject IDs are factors
sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE <- 
    factor(sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE)
sample_data(ps_plaque_16S)$RSID <- 
    factor(sample_data(ps_plaque_16S)$RSID)

my_splits <- createSplits(
  object = ps_plaque_16S,
  varName = "HMP_BODY_SUBSITE",
  paired = "RSID",
  balanced = TRUE,
  N = 2
) # At least 100 is suggested
```

The structure produced by `createSplits()` function consists in a list of two matrices: *Subset1* and *Subset2*. Each matrix contains the randomly chosen sample IDs. The number of rows of both matrices is equal to the number of comparisons/splits (2 in this example, but at least 100 are suggested).

## Set up normalizations and DA methods

For some of the methods implemented in this package it is possible to perform differential abundance testings for the repeated measurements experimental designs (*e.g.*, by adding the subject ID in the model formula of `DESeq2`).

Once again, to set the differential abundance methods to use, the `set_<name_of_the_method>()` methods can be exploited. For a faster demonstration, differential abundance methods without weighting are used:

```{r set_Methods_noweights}
my_edgeR_noWeights <- set_edgeR(
    group_name = "HMP_BODY_SUBSITE", 
    design = ~ 1 + RSID + HMP_BODY_SUBSITE, 
    coef = "HMP_BODY_SUBSITESupragingival Plaque", 
    norm = "TMM")

my_DESeq2_noWeights <- set_DESeq2(
    contrast = c("HMP_BODY_SUBSITE",
    "Supragingival Plaque", "Subgingival Plaque"), 
    design = ~ 1 + RSID + HMP_BODY_SUBSITE, 
    norm = "poscounts")

my_limma_noWeights <- set_limma(
    design = ~ 1 + RSID + HMP_BODY_SUBSITE, 
    coef = "HMP_BODY_SUBSITESupragingival Plaque", 
    norm = "TMM")

my_ALDEx2 <- set_ALDEx2(
    pseudo_count = FALSE,
    design = "HMP_BODY_SUBSITE",
    mc.samples = 128,
    test = "wilcox",
    paired.test = TRUE, 
    denom = "all", 
    contrast = c("HMP_BODY_SUBSITE", "Supragingival Plaque", "Subgingival Plaque"))

my_MAST <- set_MAST(
    pseudo_count = FALSE,
    rescale = "median",
    design = "~ 1 + RSID + HMP_BODY_SUBSITE",
    coefficient = "HMP_BODY_SUBSITESupragingival Plaque")

my_dearseq <- set_dearseq(
    pseudo_count = FALSE, 
    covariates = NULL,
    variables2test = "HMP_BODY_SUBSITE",
    sample_group = "RSID", 
    test = "asymptotic", 
    preprocessed = FALSE)

# Temporarily removed (v2.3.2 crashes R)
# # Very time consuming
# my_ANCOM <- set_ANCOM(
#     pseudo_count = FALSE,
#     adj_formula = NULL,
#     rand_formula = "(1|RSID)",
#     lme_control = lme4::lmerControl(),
#     contrast = c("HMP_BODY_SUBSITE",
#     "Supragingival Plaque", "Subgingival Plaque"),
#     BC = FALSE)

my_linda <- set_linda(
    formula = "~ HMP_BODY_SUBSITE + (1|RSID)",
    contrast = c("HMP_BODY_SUBSITE",
    "Supragingival Plaque", "Subgingival Plaque"),
    is.winsor = TRUE, 
    zero.handling = "pseudo-count",
    alpha = 0.05)

my_Maaslin2 <- set_Maaslin2(
    normalization = "TSS",
    transform = "LOG",
    analysis_method = "LM",
    fixed_effects = "HMP_BODY_SUBSITE",
    random_effects = "RSID",
    contrast = c("HMP_BODY_SUBSITE",
    "Supragingival Plaque", "Subgingival Plaque"))

# Temporarily removed due to issues with ID_variables
# my_mixMC <- set_mixMC(
#     pseudo_count = 1,
#     ID_variable = "RSID",
#     contrast = c("HMP_BODY_SUBSITE",
#     "Supragingival Plaque", "Subgingival Plaque"))

my_ZicoSeq <- set_ZicoSeq(
    contrast = c("HMP_BODY_SUBSITE",
        "Supragingival Plaque", "Subgingival Plaque"),
    strata = "RSID",
    feature.dat.type = "count", 
    is.winsor = TRUE, outlier.pct = 0.03, winsor.end = "top",
    is.post.sample = TRUE, post.sample.no = 25, perm.no = 99,
    ref.pct = 0.5, stage.no = 6, excl.pct = 0.2, 
    link.func = list(function(x) sign(x) * (abs(x))^0.5))

my_methods_noWeights <- c(
    my_edgeR_noWeights, 
    my_DESeq2_noWeights, 
    my_limma_noWeights,  
    my_ALDEx2,
    my_MAST,
    my_dearseq,
    # my_ANCOM, # Temporary bug with random effects
    my_linda,
    my_Maaslin2,
    # my_mixMC, # Temporary bug with multilevel
    my_ZicoSeq)
```

Similarly, to set the normalization methods, the `setNormalizations()` function can be used. In this case it has already been set up for the TIEC analysis:

```{r info_normalizations}
str(my_normalizations)
```

The `runSplits()` function generates the subsets and performs DA analysis on the features with at least 1 (*min_counts \> 0*) count in more than 2 samples (*min_samples \> 2*):

```{r runSplits}
# Set the parallel framework
# Remember that ANCOMBC based methods are compatible only with SerialParam()
bpparam <- BiocParallel::SerialParam()

# Make sure the subject ID variable is a factor
phyloseq::sample_data(ps_plaque_16S)[, "RSID"] <- as.factor(
    phyloseq::sample_data(ps_plaque_16S)[["RSID"]])

Plaque_16S_splitsDA <- runSplits(
    split_list = my_splits, 
    method_list = my_methods_noWeights, 
    normalization_list = my_normalizations, 
    object = ps_plaque_16S, 
    min_counts = 0, min_samples = 2, 
    verbose = FALSE,
    BPPARAM = bpparam)
```

Many warning messages could be shown after running the previous function. As suggested before, *verbose = TRUE* can be used to obtain the method name and the comparison where the warnings occured. In this case they are probably due to the low sample size of the example dataset but they should be carefully addressed in real data analysis.

The structure of the output in this example is the following:

-   *Subset1* and *Subset2* on the first level, which contains:

    -   *Comparison1* to *Comparison2* output lists on the second level:

        -   *Methods*' results on the third level: `edgeR` with *TMM* scaling factors, `DESeq2` with *poscounts* normalization factors, `limma-voom` with *TMM* scaling factors (all the 3 previous methods have the Subject identifier in the design formula), `ALDEx2` with paired *wilcox* test and denom equals to *all*, `MAST` with *median* scaling and the subject identifier in the design formula, `dearseq` for repeated measures with *asymptotic* test, `ANCOM` without bias correction (temporarily removed), `linDA` with winsorization and pseudo-count addition, `Maaslin2` with *TSS* normalization, *LOG* transformation and *LM* analysis method, `mixMC` (temporarily removed), and `ZicoSeq` with default parameters. Their outputs are organized as always:

            -   *pValMat* which contains the matrix of raw p-values and adjusted p-values in *rawP* and *adjP* columns respectively;

            -   `statInfo` which contains the matrix of summary statistics for each feature, such as the *logFC*, standard errors, test statistics and so on;

            -   *dispEsts* which contains the dispersion estimates for methods like *edgeR* and *DESeq2*;

            -   *name* which contains the complete name of the used method.

### Add a new DA method later in the analysis

Again, it may happen that at a later time the user wants to add to the results already obtained, the results of another group of methods. First of all, the same splits and the same object must be used to obtain the new results:

```{r second_round}
my_basic <- set_basic(
    pseudo_count = FALSE, 
    contrast = c("HMP_BODY_SUBSITE",
    "Supragingival Plaque", "Subgingival Plaque"), 
    test = "wilcox", 
    paired = TRUE)

Plaque_16S_splitsDA_basic <- runSplits(
    split_list = my_splits, 
    method_list = my_basic, 
    normalization_list = NULL, 
    object = ps_plaque_16S,
    min_counts = 0, min_samples = 2, 
    verbose = FALSE)
```

To put everything together, two nested `mapply`s can be used to exploit the output structures:

```{r add_a_second_round}
Plaque_16S_splitsDA_all <- mapply(
    Plaque_16S_splitsDA, # List of old results
    Plaque_16S_splitsDA_basic, # List of new results
    FUN = function(subset_old, subset_new){
        mapply( 
            subset_old, 
            subset_new,
            FUN = function(old, new){ 
                return(c(old, new))
        }, SIMPLIFY = FALSE)
    }, SIMPLIFY = FALSE)
```

## Comparing the concordances

For each pair of methods the concordance is computed by the `createConcordance()` function. It produces a long format data frame object with several columns:

-   *comparison* which indicates the comparison number;
-   *n_features* which indicates the total number of taxa in the comparison dataset;
-   name of *method1*;
-   name of *method2*;
-   *rank*;
-   *concordance* which is defined as the cardinality of the intersection of the top *rank* elements of each list, divided by *rank*, *i.e.*, $\frac{L_{1:rank} \bigcap M_{1:rank}}{rank}$, where *L* and *M* represent the lists of p-values of *method1* and *method2* respectively. A noise value ($<10^{-10}$) is added to each p-value (or statistic) in order to avoid duplicated values which could not be ordered.

```{r createConcordance}
concordance <- createConcordance(
    object = Plaque_16S_splitsDA_all,
    slot = "pValMat",
    colName = "rawP",
    type = "pvalue"
)

head(concordance)
```

The `createConcordance()` method is very flexible. In the example below the concordances are built using the log fold changes or other statistics instead of the p-values. To do so, it is necessary to know the column names generated by each differential abundance method in the *statInfo* matrix.

Firstly, the method order is extracted using the *name* slot:

```{r getNames}
names(Plaque_16S_splitsDA_all$Subset1$Comparison1)
```

Then, the column names of the *statInfo* slot are investigated:

```{r logFC_names}
cat("edgeR.TMM", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$edgeR.TMM$statInfo)
cat("DESeq2.poscounts", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$DESeq2.poscounts$statInfo)
cat("limma.TMM", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$limma.TMM$statInfo)
cat("ALDEx2.all.wilcox.paired", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$ALDEx2.all.wilcox.paired$
    statInfo)
cat("MAST.median", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$MAST.median$statInfo)
cat("dearseq.repeated.asymptotic", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$dearseq.repeated.asymptotic$
    statInfo)
# cat("ANCOM", "\n")
# names(Plaque_16S_splitsDA_all$Subset1$Comparison1$ANCOM$statInfo)
cat("linda", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$linda.win0.03.pc0.5$statInfo)
cat("Maaslin2", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$Maaslin2.TSSnorm.LOGtrans.LM$statInfo)
# cat("mixMC", "\n")
# names(Plaque_16S_splitsDA_all$Subset1$Comparison1$mixMC.pc1$statInfo)
cat("ZicoSeq", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$ZicoSeq.winsor0.03top.ref0.5.excl0.2$statInfo)
cat("basic.wilcox.paired", "\n")
names(Plaque_16S_splitsDA_all$Subset1$Comparison1$basic.wilcox.paired$statInfo)
```

All methods, except for `DESeq2`, `ALDEx2`, `dearseq`, `ANCOM`, `mixMC`, and `ZicoSeq` contain the log fold change values in the *logFC* column of `statInfo` matrix. Knowing this, the alternative concordance data frame can be built using:

```{r alternativeConcordance, eval=FALSE}
concordance_alternative <- createConcordance(
    object = Plaque_16S_splitsDA_all,
    slot = "statInfo",
    colName = c("logFC", "log2FoldChange", "logFC", 
        "effect", "logFC", "rawP", 
        # "W", # ANCOM
        "log2FoldChange", 
        "coef", # "importance", # mixMC
        "effect", "logFC"),
    type = c("logfc", "logfc", "logfc", "logfc", 
        "logfc", "pvalue", 
        # "logfc", # ANCOM
        "logfc", "logfc", 
        #"logfc", # mixMC
        "logfc", "logfc")
)
```

## Visualization

Starting from the table of concordances, the `plotConcordance()` function can produce 2 graphical results visible in Figure \@ref(fig:plotConcordance):

-   the dendrogram of methods, clustered by the area over the concordance bisector in *concordanceDendrogram* slot;

-   the heatmap of the between and within method concordances in *concordanceHeatmap* slot. For each tile of the symmetric heatmap, which corresponds to a pair of methods, the concordance from rank 1 to a threshold rank is drawn.

The area between the curve and the bisector is colored to highlight concordant methods (blue) and non-concordant ones (red). The two graphical results should be drawn together for the best experience.

```{r plotConcordance, fig.cap="BMC and WMC plot. Between-method concordance (BMC) and within-method concordance (WMC) (main diagonal) averaged values from rank 1 to 30.", fig.width=10, fig.height=8}
pC <- plotConcordance(concordance = concordance, threshold = 30)
cowplot::plot_grid(plotlist = pC, ncol = 2, align = "h", axis = "tb",
    rel_widths = c(1, 3))
```

The WMC and BMC from rank 1 to rank 30 are reported in the plot above. More than 40 (use `table(concordance$rank)` to find out) is the maximum rank obtained by all split comparisons, *i.e.* the number of taxa for which all methods have been able to calculate p-values (in all comparisons). However, a custom threshold of 30 was supplied.

It is common that WMC values (in red rectangles) are lower than BMC ones. Indeed, BMC is computed between different methods on the same data, while WMC is computed for a single method, run in different datasets (some taxa are dataset-specific).

`dearseq.repeated.asymptotic` and `Maaslin2.TSSnorm.LOGtrans.LM` methods show the highest BMC values but they are also concordant with `limma.TMM`, `edgeR.TMM`, `linda.win0.03.pc0.5`, and `basic.wilcox.paired`. Differently, `DESeq2.poscounts` and `MAST.median` are not concordant with the other methods.

Regarding the WMC, `ALDEx2.all.wilcox.paired` has the highest value while `MAST.median` has the lowest values, maybe because it has not been implemented properly for repeated measure designs. Other methods have comparable WMC values.

## Discussion about Concordance

Random splits allow to evaluate concordance between methods and within a method. These analyses do not assess the correctness of the discoveries. Even the method with the highest WMC could nonetheless consistently identify false positive DA taxa. For this reason, the concordance analysis framework should be used as a tool to detect groups of similar methods.

# Enrichment analysis

**Assumption**: Previous analyses did not assess the correctness of the discoveries.

**Question**: If some prior knowledge about the experiment is available, would the findings be coherent with that knowledge?

## Enrichment structure

While the lack of ground truth makes it challenging to assess the validity of DA results in real data, enrichment analysis can provide an alternative solution to rank methods in terms of their ability to identify, as significant, taxa that are known to be differentially abundant between two groups. To run methods, the `runDA()` function is used. Leveraging the prior knowledge (if present), the correctness of the findings is checked using the `createEnrichment()` and `createPositives()` functions. Many graphical outputs are available through the `plotContingency()`, `plotEnrichment()`, `plotMutualFindings()`, and `plotPositives()` functions.

The relationships between the functions used in this section are explained by the diagram in Figure \@ref(fig:figenrichment).

```{r figenrichment, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, fig.cap="Enrichment analysis diagram."}
knitr::include_graphics("./enrichment_structure.svg")
```

## *A priori* knowledge

Here, we leveraged the peculiar environment of the gingival site [@plaque_dynamics]:

-   the **supragingival biofilm** is directly exposed to the open atmosphere of the oral cavity, favoring the growth of aerobic species;

-   in the **subgingival biofilm**, the atmospheric conditions gradually become strict anaerobic, favoring the growth of anaerobic species.

From the comparison of the two sites, an abundance of **aerobic microbes in the supragingival plaque** and of **anaerobic bacteria in the subgingival plaque** is expected. DA analysis should reflect this difference by finding an enrichment of aerobic (anaerobic) bacteria among the DA taxa with a positive (negative) log-fold-change.

Firstly, the microbial metabolism information is necessary. These data comes from [@cigarettes] research article's github repository (<https://github.com/waldronlab/nychanesmicrobiome>), but they can be loaded using `data("microbial_metabolism")`:

```{r priorKnowledge}
data("microbial_metabolism")
head(microbial_metabolism)
```

The microbial genus and its type of metabolism are specified in the first and second column respectively. To match each taxon of the phyloseq object to its type of metabolism the next chunk of code can be used:

```{r exampleOfIntegration}
# Extract genera from the phyloseq tax_table slot
genera <- tax_table(ps_plaque_16S)[, "GENUS"]

# Genera as rownames of microbial_metabolism data.frame
rownames(microbial_metabolism) <- microbial_metabolism$Genus

# Match OTUs to their metabolism
priorInfo <- data.frame(genera, "Type" =  microbial_metabolism[genera, "Type"])
unknown_metabolism <- is.na(priorInfo$Type)
priorInfo[unknown_metabolism, "Type"] <- "Unknown"

# Relabel 'F Anaerobic' to 'F_Anaerobic' to remove space
priorInfo$Type <- factor(priorInfo$Type, 
    levels = c("Aerobic","Anaerobic","F Anaerobic","Unknown"), 
    labels = c("Aerobic","Anaerobic","F_Anaerobic","Unknown"))

# Add a more informative names column
priorInfo[, "newNames"] <- paste0(rownames(priorInfo), "|",
    priorInfo[, "GENUS"])
```

## Set up normalizations and DA methods

Both the normalization/scaling factors and the DA methods' instructions are available since the dataset is the same used in the previous section.

In concordance analysis, normalizations factor were added inside the `runSlits()` function, so the original object *ps_plaque_16S* does not contain the values. The normalization/scaling factors are added to the object:

```{r normalize_plaque}
ps_plaque_16S <- runNormalizations(my_normalizations, object = ps_plaque_16S)
```

A simple filter to remove rare taxa is applied:

```{r filter_plauque}
ps_plaque_16S <- phyloseq::filter_taxa(physeq = ps_plaque_16S, 
    flist = function(x) sum(x > 0) >= 3, prune = TRUE)
ps_plaque_16S
```

Differently from the Type I Error Control and Concordance analyses, the enrichment analysis rely on a single `phyloseq` or `TreeSummarizedExperiment` object (no mocks, no splits, no comparisons). For this reason many methods can be assessed without computational trade-offs (*e.g.*, `ANCOM` without sampling fraction bias correction and methods which use *ZINB* weights).

The observational weights are computed:

```{r plaque_weights}
plaque_weights <- weights_ZINB(object = ps_plaque_16S, design = ~ 1,
    zeroinflation = TRUE)
```

The existing instructions are concatenated with the instructions of methods which use observational weights:

```{r enrichment_methods}
my_edgeR <- set_edgeR(
    group_name = "HMP_BODY_SUBSITE", 
    design = ~ 1 + RSID + HMP_BODY_SUBSITE, 
    coef = "HMP_BODY_SUBSITESupragingival Plaque", 
    norm = "TMM", 
    weights_logical = TRUE)

my_DESeq2 <- set_DESeq2(
    contrast = c("HMP_BODY_SUBSITE",
    "Supragingival Plaque", "Subgingival Plaque"), 
    design = ~ 0 + RSID + HMP_BODY_SUBSITE, 
    norm = "poscounts",
    weights_logical = TRUE)

my_limma <- set_limma(
    design = ~ 1 + RSID + HMP_BODY_SUBSITE, 
    coef = "HMP_BODY_SUBSITESupragingival Plaque", 
    norm = "TMM",
    weights_logical = TRUE)

my_methods <- c(my_methods_noWeights, my_edgeR, my_DESeq2, my_limma)
```

All the ingredients are ready to run DA methods:

```{r runDA_enrichment, message=FALSE}
Plaque_16S_DA <- runDA(method_list = my_methods, 
    object = ps_plaque_16S, weights = plaque_weights, verbose = FALSE)
```

## Testing the enrichment

`Plaque_16_DA` object contains the results for 10 methods. In order to extract p-values, the optional direction of DA (DA vs non-DA, or UP Abundant vs DOWN Abundant), and to add any *a priori* information, the `createEnrichment()` function can be used.

In the *direction* argument, which is set to *NULL* by default, the column name containing the direction (*e.g.*, *logfc*, *logFC*, *logFoldChange*...) of each method's *statInfo* matrix can be supplied.

Firstly, the order of methods is investigated:

```{r info_DA}
names(Plaque_16S_DA)
```

Following the methods' order, the *direction* parameter is supplied together with other parameters:

-   *threshold_pvalue*, *threshold_logfc*, and *top* (optional), to set differential abundance thresholds;

-   *slot*, *colName*, and *type*, which specify where to apply the above thresholds;

-   *priorKnowledge*, *enrichmentCol*, and *namesCol*, to add enrichment information to DA analysis;

The `createEnrichment()` function, with the direction parameter for all method except `ANCOM` and `dearseq` (the first has the *W* statistic which is only positive, while the second has only p-values), is used:

```{r createEnrichment}
enrichment <- createEnrichment(
    object = Plaque_16S_DA[-c(6)], 
    priorKnowledge = priorInfo, 
    enrichmentCol = "Type",
    namesCol = "newNames", 
    slot = "pValMat", 
    colName = "adjP", 
    type = "pvalue",
    direction = c(
        "logFC", # edgeR
        "log2FoldChange", # DEseq2
        "logFC", # limma
        "effect", # ALDEx2
        "logFC", # MAST
        "log2FoldChange", # linda
        "coef", # Maaslin2
        # "importance", # mixMC
        "effect", # ZicoSeq
        "logFC", # edgeR with weights
        "log2FoldChange", # DESeq2 with weights
        "logFC"), # limma with weights
    threshold_pvalue = 0.1,
    threshold_logfc = 0,
    top = NULL,
    alternative = "greater",
    verbose = TRUE
)
```

The produced *enrichment* object consists in a list of elements as long as the number of tested methods:

-   the *data* slot contains information for each feature. P-values, adjusted p-values (or other statistics) in *stats* column, log fold changes (or other statistics, if specified) in *direction* column, differential abundance information in the *DA* column (according to the thresholds), the variable of interest for the enrichment analysis, and the name of the feature in the *feature* column;

-   in the *tables* slot a maximum of *2 x (levels of enrichment variable)* contingency tables (2x2) are present;

-   in the *tests* slot, the list of Fisher exact tests produced by the `fisher.test()` function are saved for each contingency table;

-   in the *summaries* slot, the first elements of the contingency tables and the respective p-values are collected for graphical purposes.

## Visualization

### Contingency tables

Considering one of the methods, `DESeq2.poscounts`, 8 contingency tables are obtained. Both *UP Abundant* and *DOWN Abundant* taxa are found and the enrichment variable has *Aerobic*, *Anaerobic*, *F_Anaerobic*, and *Unknown* levels. For each level, 2 contingency tables could be built: one for DOWN Abundant vs non-DOWN Abundant features and one for *UP Abundant* vs *non-UP Abundant* features. The enrichment is tested using Fisher exact test. The `plotContingency()` function summarize all these information (Figure \@ref(fig:plotContingency)).

```{r plotContingency, fig.cap="Contingency tables plot. Contingency tables for Aerobic and Anaerobic taxa found as differentially abundant by DESeq2.poscounts DA method. Fisher exact test has been performed on each contingency table. If the enrichment is signficantly present, the correspondent cell will be highlighted.", fig.height=4, fig.width=6}
plotContingency(enrichment = enrichment, 
    levels_to_plot = c("Aerobic", "Anaerobic"), 
    method = "DESeq2.poscounts")
```

### Enrichment plot

To summarize enrichment analysis for all the methods simultaneously, the `plotEnrichment()` function can be used. Only *Aerobic* and *Anaerobic* levels are plotted in Figure \@ref(fig:plotEnrichment):

```{r plotEnrichment, fig.width=5, fig.height=6, fig.cap="Enrichment plot. Number of differentially abundant features, colored by aerobic or anaerobic metabolism, and directed according to differential abundance direction (UP or DOWN abundant)."}
plotEnrichment(enrichment = enrichment, enrichmentCol = "Type", 
    levels_to_plot = c("Aerobic", "Anaerobic"))
```

Since *Subgingival Plaque* is the reference level for each method, the coefficients extracted from the methods are referred to the *Supragingival Plaque* class. The majority of methods identify, as expected, a statistically significant ($0.001 < p \le 0.05$) amount of DOWN Abundant Anaerobic features in Supragingival Plaque (Figure \ref{fig:plotEnrichment}). Moreover, many of them find an enriched amount of UP Abundant Aerobic genera in Supragingival Plaque. Unexpectedly, both `DESeq2.poscounts` and `DESeq2.poscounts.weighted` find many Anaerobic genera as UP Abundant, they could be FPs.

### Mutual Findings

To investigate the DA features, the `plotMutualFindings()` function can be used (Figure \@ref(fig:plotMutualFindings)). While *levels_to_plot* argument allows to choose which levels of the enrichment variable to plot, *n_methods* argument allows to extract only features which are mutually found as DA by more than 1 method.

```{r plotMutualFindings, fig.width=6, fig.height=6, fig.cap="Mutual Findings plot. Number of differentially abundant features mutually found by 1 or more methods, colored by the differential abundance direction and separated by aerobic and anaerobic metabolism."}
plotMutualFindings(enrichment, enrichmentCol = "Type", 
    levels_to_plot = c("Aerobic", "Anaerobic"), n_methods = 1)
```

In this example (Figure \@ref(fig:plotMutualFindings)), many Anaerobic genera and 6 Aerobic genera are found as DA by more than 1 method simultaneously. Among them, all methods find *Prevotella*, *Treponema*, *Fusobacterium*, and *Dialister* genera DOWN Abundant in Supragingival Plaque, while the *Actinomyces* genus UP Abundant, even if it has an aerobic metabolism. Similarly, all methods find *Corynebacterium*, *Leutropia,* and *Neisseria* aerobic genera UP abundant in Supragingival Plaque.

## True and False Positives

To evaluate the overall performances a statistic based on the difference between putative True Positives (TP) and the putative False Positives (FP) is used. To build the matrix to plot, the `createPositives()` can be used. In details, the correctness of the DA features is evaluated comparing the direction of the top ranked features to the expected direction supplied by the user in the *TP* and *FP* lists. The procedure is performed for several thresholds of *top* parameter in order to observe a trend, if present:

```{r createPositives}
positives <- createPositives(
    object = Plaque_16S_DA[-c(6)], 
    priorKnowledge = priorInfo, 
    enrichmentCol = "Type",
    namesCol = "newNames", 
    slot = "pValMat", 
    colName = "rawP", 
    type = "pvalue",
    direction = c(
        "logFC", # edgeR
        "log2FoldChange", # DEseq2
        "logFC", # limma
        "effect", # ALDEx2
        "logFC", # MAST
        "log2FoldChange", # linda
        "coef", # Maaslin2
        # "importance", # mixMC
        "effect", # ZicoSeq
        "logFC", # edgeR with weights
        "log2FoldChange", # DESeq2 with weights
        "logFC"), # limma with weights
    threshold_pvalue = 0.1,
    threshold_logfc = 0,
    top = seq.int(from = 0, to = 30, by = 3),
    alternative = "greater",
    verbose = FALSE,
    TP = list(c("DOWN Abundant", "Anaerobic"), c("UP Abundant", "Aerobic")),
    FP = list(c("DOWN Abundant", "Aerobic"), c("UP Abundant", "Anaerobic"))
)
head(positives)
```

The `plotPositives()` function can be used to summarize the methods' performances (Figure \@ref(fig:plotPositives)). Higher values usually represents better performances. In our example, all methods show similar values of the statistics for the top 10 ranked features.

```{r plotPositives, fig.cap="TP, FP differences plot. Differences between the number of True Positives and False Positives for several thresholds of the top ranked raw p-values (the top 3 lowest p-values, the top 6, 9, ..., 30) for each method.", fig.height=5, fig.width=7}
plotPositives(positives)
```

`MAST.median` and `ALDEx2.all.wilcox.paired` methods are very conservative and are located on the lower part of the Figure \@ref(fig:plotPositives). `mixMC` is also in the lower part of the figure probably because its aim is to find the minimum amount of features that have a discriminant power (sparse PLS-DA), so the findings depends on the dataset (sometimes few features are sufficient for perfectly separating the groups). The highest performances are of `limma.TMM.weighted`, `edgeR.TMM.weighted`, and `ZicoSeq.winsor0.03top.post25.ref0.5.excl0.2`. This means that their findings are in line with the *a priori* knowledge supplied by the user.

## Enrichment without direction

When the user have a custom method where the direction of the differential abundance is not returned (*e.g.*, NOISeq), or when the direction of DA is not of interest, the sole information about DA and not DA feature can be used. The `createEnrichment()` function is used without the direction parameter for all methods:

```{r createEnrichment_nodir}
enrichment_nodir <- createEnrichment(
    object = Plaque_16S_DA, 
    priorKnowledge = priorInfo, 
    enrichmentCol = "Type",
    namesCol = "newNames", 
    slot = "pValMat", 
    colName = "adjP", 
    type = "pvalue",
    threshold_pvalue = c(
        0.1, 0.1, 0.1, 0.1, 0.1, 0.1, # thresholds for methods
        # 0.4, # ANCOM threshold on 1-W/(ntaxa-1) 0.4 = liberal
        0.1, 0.1, 0.1, 0.1, 0.1, 0.1), # other methods
    threshold_logfc = 0,
    top = NULL,
    alternative = "greater",
    verbose = FALSE
)
```

To summarize enrichment analysis for all the methods simultaneously, the `plotEnrichment()` function is used. All levels are plotted in Figure \@ref(fig:plotEnrichmentnodir).

```{r plotEnrichmentnodir, fig.width=6, fig.height=5, fig.cap="Enrichment plot. Number of differentially abundant features, colored by metabolism."}
plotEnrichment(enrichment = enrichment_nodir, enrichmentCol = "Type")
```

The highest amount of DA features belongs to the *Anaerobic* metabolism, followed by *F_Anaerobic*, and *Aerobic*. The method that finds more DA features is `DESeq2.poscounts`, while `ANCOM` is the most conservative (even if the threshold value based on $1-\frac{W}{ntaxa-1}$ is set at liberal value of 0.4.

As for enrichment analysis with DA direction, the `plotMutualFindings()` function can be used here too (Figure \@ref(fig:plotMutualFindingsnodir)). While *levels_to_plot* argument allows to choose which levels of the enrichment variable to plot, *n_methods* argument allows to extract only features which are mutually found as DA by more than 1 method.

```{r plotMutualFindingsnodir, fig.width=5, fig.height=6, fig.cap="Mutual Findings plot. Number of differentially abundant features mutually found by 1 or more methods, separated by aerobic and anaerobic metabolism."}
plotMutualFindings(enrichment_nodir, enrichmentCol = "Type", 
    levels_to_plot = c("Aerobic", "Anaerobic"), n_methods = 1)
```

In this example (Figure \@ref(fig:plotMutualFindingsnodir)), many OTUs are found as DA in Supragingival Plaque by all methods.

## Enrichment analysis for simulated data

To enlarge the scope of the enrichment analysis, simulations could be used, *e.g.*, by using the user's dataset as a template to generate simulated data, in which to know the DA features and provide this information as prior knowledge.

As an example, the `SPsimSeq` package is used (the tool to use is up to the user) to simulate only a single dataset (*n.sim = 1*) from the *ps_plaque_16S* dataset where two body sub sites are available (without considering the paired design). The data are simulated with the following properties - 100 features (*n.genes = 100*) - 50 samples (*tot.samples = 50*) - the samples are equally divided into 2 groups each with 25 samples (*group.config = c(0.5, 0.5)*) - all samples are from a single batch (*batch.config = 1*) - 20% DA features (*pDE = 0.2*) - the DA features have a log-fold-change of at least 0.5.

```{r simulate_plaques}
data("ps_plaque_16S")
counts_and_metadata <- get_counts_metadata(ps_plaque_16S)
plaque_counts <- counts_and_metadata[["counts"]]
plaque_metadata <- counts_and_metadata[["metadata"]]

set.seed(123)

sim_list <- SPsimSeq(
    n.sim = 1, 
    s.data = plaque_counts,
    group = plaque_metadata[, "HMP_BODY_SUBSITE"],
    n.genes = 100, 
    batch.config = 1,
    group.config = c(0.5, 0.5), 
    tot.samples = 50, 
    pDE = 0.2, 
    lfc.thrld = 0.5, 
    model.zero.prob = FALSE,
    result.format = "list")
```

Simulated data are organised into a `TreeSummarizedExperiment` object:

```{r simulated_to_TSE}
sim_obj <- TreeSummarizedExperiment::TreeSummarizedExperiment(
    assays = list("counts" = sim_list[[1]][["counts"]]),
    rowData = sim_list[[1]]["rowData"],
    colData = sim_list[[1]]["colData"],
)
# Group as factor
SummarizedExperiment::colData(sim_obj)[, "colData.Group"] <- as.factor(
    SummarizedExperiment::colData(sim_obj)[, "colData.Group"])
```

The *apriori* informations are readily available from the `sim_list[[1]]["rowData"]`:

```{r simulated_priorInfo}
priorInfo <- sim_list[[1]][["rowData"]]
priorInfo$Reality <- ifelse(priorInfo[, "DE.ind"], "is DA", "is not DA")
```

Once again, normalization/scaling factors are added:

```{r normalization_simulated_data}
sim_obj <- runNormalizations(
    normalization_list = my_normalizations, 
    object = sim_obj, 
    verbose = TRUE)
```

Rare and low variance taxa are filtered:

```{r simulated_filter}
taxa_to_keep <- apply(assays(sim_obj)[["counts"]], 1, 
    function(x) sum(x > 0) >= 3 & sd(x) > 1)
sim_obj <- sim_obj[taxa_to_keep, ]
priorInfo <- priorInfo[taxa_to_keep, ]
```

Observational weights are computed:

```{r simulated_weights}
sim_weights <- weights_ZINB(
    object = sim_obj, 
    design = ~ 1, 
    zeroinflation = TRUE)
```

DA methods are set up. The paired design is not considered and all the methods are used. The *contrast*, *design*, *group*, *coef*, and all the other parameters involved in the experimental design definition are changed:

```{r simulated_set}
my_basic <- set_basic(pseudo_count = FALSE,
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"), 
    test = c("t", "wilcox"), 
    paired = FALSE, 
    expand = TRUE)

my_edgeR <- set_edgeR(
    pseudo_count = FALSE,
    group_name = "colData.Group",
    design = ~ colData.Group,
    robust = FALSE,
    coef = 2,
    norm = "TMM",
    weights_logical = c(TRUE, FALSE),
    expand = TRUE)

my_DESeq2 <- set_DESeq2(
    pseudo_count = FALSE,
    design = ~ colData.Group,
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    norm = "poscounts",
    weights_logical = c(TRUE, FALSE),
    alpha = 0.1,
    expand = TRUE)

my_limma <- set_limma(
    pseudo_count = FALSE,
    design = ~ colData.Group,
    coef = 2,
    norm = "TMM",
    weights_logical = c(FALSE, TRUE),
    expand = TRUE)

my_ALDEx2 <- set_ALDEx2(
    pseudo_count = FALSE,
    design = "colData.Group",
    mc.samples = 128,
    test = "wilcox",
    paired.test = FALSE, 
    denom = c("all", "iqlr"), 
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    expand = TRUE)

my_metagenomeSeq <- set_metagenomeSeq(
    pseudo_count = FALSE, 
    design = "~ colData.Group", 
    coef = "colData.GroupSupragingival Plaque", 
    norm = "CSS", 
    model = "fitFeatureModel",
    expand = TRUE)

# Temporarily removed from CRAN
# my_corncob <- set_corncob(
#     pseudo_count = FALSE,
#     formula = ~ colData.Group,
#     formula_null = ~ 1,
#     phi.formula = ~ colData.Group,
#     phi.formula_null = ~ colData.Group,
#     test = c("Wald", "LRT"),
#     boot = FALSE,
#     coefficient = "colData.GroupSupragingival Plaque")

my_MAST <- set_MAST(
    pseudo_count = FALSE,
    rescale = c("default", "median"),
    design = "~ 1 + colData.Group",
    coefficient = "colData.GroupSupragingival Plaque",
    expand = TRUE)

my_Seurat <- set_Seurat(
    pseudo_count = FALSE,
    test = c("t", "wilcox"),
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    norm = c("LogNormalize", "CLR"),
    scale.factor = 10^5,
    expand = TRUE
)

my_ANCOM <- set_ANCOM(
    pseudo_count = FALSE,
    fix_formula = "colData.Group", 
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    BC = c(TRUE, FALSE), 
    alpha = 0.1,
    expand = TRUE   
)

my_dearseq <- set_dearseq(
    pseudo_count = FALSE,
    covariates = NULL,
    variables2test = "colData.Group", 
    preprocessed = FALSE, 
    test = c("permutation", "asymptotic"),
    expand = TRUE)

my_NOISeq <- set_NOISeq(
    pseudo_count = FALSE, 
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    norm = c("rpkm", "tmm"), 
    expand = TRUE)

my_linda <- set_linda(
    formula = "~ colData.Group", 
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    is.winsor = TRUE,
    zero.handling = "pseudo-count",
    alpha = 0.1,
    expand = TRUE)

my_Maaslin2 <- set_Maaslin2(
    normalization = "TSS",
    transform = "LOG",
    analysis_method = "LM",
    fixed_effects = "colData.Group", 
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    expand = TRUE)

my_mixMC <- set_mixMC(
    pseudo_count = 1,
    contrast = c("colData.Group", "Supragingival Plaque",
        "Subgingival Plaque"),
    expand = TRUE
)

my_ZicoSeq <- set_ZicoSeq(
    contrast = c("colData.Group", "Supragingival Plaque", 
        "Subgingival Plaque"),
    feature.dat.type = "count", strata = NULL,
    is.winsor = TRUE, outlier.pct = 0.03, winsor.end = "top",
    is.post.sample = TRUE, post.sample.no = 25, perm.no = 99,
    ref.pct = 0.5, stage.no = 6, excl.pct = 0.2, 
    link.func = list(function(x) sign(x) * (abs(x))^0.5))

my_methods <- c(my_basic, my_edgeR, my_DESeq2, 
    my_limma, my_metagenomeSeq, #my_corncob, 
    my_ALDEx2, my_MAST, my_Seurat, my_ANCOM, 
    my_dearseq, my_NOISeq, my_linda, my_Maaslin2, my_mixMC, 
    my_ZicoSeq)
```

DA methods are run using the `runDA()` function:

```{r simulated_run}
sim_DA <- runDA(
    method_list = my_methods, 
    object = sim_obj, 
    weights = sim_weights, 
    verbose = FALSE)
```

The `createEnrichment()` without the direction parameter for all methods is used. A 0.1 threshold for the adjusted p-values is chosen to define DA and non-DA taxa for all methods, a 0.4 threshold is used for `ANCOM` instead:

```{r createEnrichment_nodir_sim}
enrichment_nodir <- createEnrichment(
    object = sim_DA, 
    priorKnowledge = priorInfo, 
    enrichmentCol = "Reality",
    namesCol = NULL, 
    slot = "pValMat", 
    colName = "adjP", 
    type = "pvalue",
    threshold_pvalue = c(
        rep(0.1,17), # adjP thresholds 
        0.4, # adjP threshold for ANCOM on 1-W/(ntaxa-1)
        rep(0.1,7), # adjP thresholds for other methods
        0.9, # adjP threshold for mixMC 1 - stability 
        0.1), # ZicoSeq
    threshold_logfc = 0,
    top = NULL,
    alternative = "greater",
    verbose = FALSE
)
```

To summarize enrichment analysis for all the methods simultaneously, the `plotEnrichment()` function can be used. Both the numbers of "is DA" and "is not DA" features are plotted in Figure \@ref(fig:plotEnrichmentnodirsim). Their interpretation is quite straightforward: *is DA* are the positives, while the *is not DA* the negatives. Positives reported in Figure \@ref(fig:plotEnrichmentnodirsim) are the True Positives, while negatives are the FPs.

```{r plotEnrichmentnodirsim, fig.width=6, fig.height=6, fig.cap="Enrichment plot for simulated data. Number of differentially abundant features, colored by DA reality."}
plotEnrichment(enrichment = enrichment_nodir, enrichmentCol = "Reality")
```

From this example, less than half of the methods are able to find an enriched amount of truly DA features without any false discovery: `dearseq.counts.permutation.1000` in the first position. On the contrary, `basic.t.counts`, `DESeq2.counts.poscounts.weighted`, `limma.counts.TMM.weighted`, `metagenomeSeq.counts.CSS.fitFeatureModel`, `ALDEx2.counts.iqlr.wilcox.unpaired`, `MAST.counts.default`, `MAST.counts.median`, `ANCOM.counts`, and `NOISeq.counts.tmm` methods do not find any DA feature. This could be strongly related to the template taxa chosen to simulate the DA features.

To further assess methods' power, the `createPositives()` function can be used specifying as TPs the resulting DA features created as real DA features and as FPs the resulting DA features created as not DA features (Figure \@ref(fig:plotPositivessim)).

We use a *threshold_pvalue = 0.1* (*0.4* for ANCOM, 0.9 for mixMC) to call a feature DA based on its adjusted p-value. We compute the difference between TPs and FPs for several *top* thresholds (from 5 to 30, by 5) in order to observe a trend:

```{r createPositives_sim}
positives_nodir <- createPositives(
    object = sim_DA, 
    priorKnowledge = priorInfo, 
    enrichmentCol = "Reality",
    namesCol = NULL, 
    slot = "pValMat", 
    colName = "adjP", 
    type = "pvalue",
    threshold_pvalue = c(
        rep(0.1,19), # adjP thresholds 
        0.4, # adjP threshold for ANCOM on 1-W/(ntaxa-1)
        rep(0.1,7), # adjP thresholds for other methods
        0.9, # adjP threshold for mixMC 1 - stability 
        0.1), # ZicoSeq
    threshold_logfc = 0,
    top = seq(5, 30, by = 5),
    alternative = "greater",
    verbose = FALSE,
    TP = list(c("DA", "is DA")),
    FP = list(c("DA", "is not DA"))
)
```

Since the number of simulated DA feature is 20, the maximum number of TPs is 20 and it is added as an horizontal line to the figure.

```{r plotPositivessim, fig.cap="TP, FP differences plot. Differences between the number of True Positives and False Positives for several thresholds of the top ranked adjusted p-values lower than 0.1 (the top 5 lowest adjusted p-values, the top 10, 15, ..., 30) for each method in simulated data. Red dotted line represents the total number of DA simulated features.", fig.width=10, fig.height=10}
plotPositives(positives = positives_nodir) +
    facet_wrap( ~ method) + 
    theme(legend.position = "none") +
    geom_hline(aes(yintercept = 20), linetype = "dotted", color = "red") +
    geom_hline(aes(yintercept = 0), color = "black") +
    ylim(NA, 21)
```

From figure \@ref(fig:plotPositivessim) it is clearly visible that `dearseq.counts.permutation.1000`, `dearseq.counts.asymptotic`, and `linda.win0.03.pc0.5` reach the highest values of the difference, followed by `corncob.counts.Wald` (temporarily removed) and `DESeq2.counts.poscounts`. As already mentioned the desired level of power which a methods should be able to reach is represented by the red dotted line, *i.e.* the total number of DA simulated features (20 in our case). These methods, in this specific example, have the highest power. Differently, methods characterized by flat lines have a fixed number of features with an adjusted p-value lower than the threshold. If their lines are above the zero line, it means that the number of True Positives is greater than the number of FPs. On the contrary, if their lines are below the zero line, it means that the number of FPs is greater (*e.g.*, `edgeR.counts.TMM.weighted` and `ANCOM.counts.BC` have negative values. Maybe, it is due to poor weight estimates for weight-based methods and/or specific characteristics of this template).

## Discussion about Enrichment

The enrichment analysis toolbox provides many methods to study DA in a dataset.

Firstly, when some prior knowledge is available, it allows to evaluate methods' power. Among the possible applications, it is especially useful to investigate conservative methods: are they calling only the most obvious taxa (also found by the other methods) or are they finding something new? The main drawback is that the availability of the prior knowledge is limited, especially for new datasets. For this reason, enrichment analysis could also be used in addition to simulation tools. Indeed, through parametric, semi-parametric, or non parametric assumptions it is possible to obtain an emulation of the prior knowledge.

Secondly, thanks to methods like `plotMutualFindings()` and `plotEnrichment()`, which produce graphical results like Figure \@ref(fig:plotEnrichmentnodir) and Figure \@ref(fig:plotMutualFindingsnodir), it is also possible to use the enrichment analysis to study the distribution of the findings across class of taxa (*e.g.*, by using as prior knowledge the phylum of the features, it would be possible to study if a phylum is characterized by an increased number of DA compared to another phylum), or more simply, drawing biological conclusions based only on taxa found as DA by the majority of the methods.

# Session Info

```{r sessionInfo}
sessionInfo()
```

# References