--- title: "Workflow with real data" output: rmarkdown::html_vignette description: | vignette: > %\VignetteIndexEntry{Workflow with real data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( message = FALSE, digits = 3, collapse = TRUE, comment = "#>", fig.align = "center", fig.width = 8 ) options(digits = 3) ``` To illustrate the functionality of the `dar` package, this study will use the data set from (Noguera-Julian, M., et al. 2016). The authors of this study found that men who have sex with men (MSM) predominantly belonged to the Prevotella-rich enterotype whereas most non-MSM subjects were enriched in Bacteroides, independently of HIV-1 status. This result highlights the potential impact of sexual orientation on the gut microbiome and emphasizes the importance of controlling for such variables in microbiome research. Using the `dar` package, we will conduct a differential abundance analysis to further explore this finding and uncover potential microbial biomarkers associated with this specific population. ## Load dar package and data ```{r setup} library(dar) # suppressPackageStartupMessages(library(plotly)) data("metaHIV_phy") metaHIV_phy ``` ## Recipe initialization To begin the analysis process with the `dar` package, the first step is to initialize a `Recipe` object, which is an S4 class. This recipe object serves as a blueprint for the data preparation steps required for the differential abundance analysis. The initialization of the recipe object is done through the function `recipe()`, which takes as inputs a `phyloseq` or `TreeSummarizedExperiment` (TSE) object, the name of the categorical variable of interest and the taxonomic level at which the differential abundance analyses are to be performed. As previously mentioned, we will use the data set from (Noguera-Julian, M., et al. 2016) and the variable of interest "RiskGroup2" containing the categories: men who have sex with men (msm), non-MSM (hts) and people who inject drugs (pwid) and we will perform the analysis at the species level. ```{r} ## Recipe initialization rec <- recipe(metaHIV_phy, var_info = "RiskGroup2", tax_info = "Species") rec ``` ## Recipe QC and preprocessing steps definition Once the recipe object has been initialized, the next step is to populate it with steps. Steps are the methods that will be applied to the data stored in the recipe. There are two types of steps: preprocessing (prepro) and differential abundance (da) steps. Initially, we will focus on the prepro steps which are used to modify the data loaded into the recipe, which will then be used for the da steps. The ‘dar’ package includes 3 main preprocessing functionalities: `step_subset_taxa`, which is used for subsetting columns and values in the taxon table connected to the phyloseq object, `step_filter_taxa`, which is used to filter the OTUs, and `step_rarefaction`, which is used to resample the OTU table to ensure that all samples have the same library size. These functionalities allow for a high level of flexibility and customization in the data preparation process before performing the differential abundance analysis. The `dar` package provides convenient wrappers for the `step_filter_taxa` function, designed to filter Operational Taxonomic Units (OTUs) based on specific criteria: prevalence, variance, abundance, and rarity. - `step_filter_by_prevalence`: Filters OTUs according to the number of samples in which the OTU appears. - `step_filter_by_variance`: Filters OTUs based on the variance of the OTU's presence across samples. - `step_filter_by_abundance`: Filters OTUs according to the OTU's abundance across samples. - `step_filter_by_rarity`: Filters OTUs based on the rarity of the OTU across samples. In addition to the preprocessing steps, the `dar` package also incorporates the function `phy_qc` which returns a table with a set of metrics that allow for informed decisions to be made about the data preprocessing that will be done. In our case, we decided to use the step_subset_taxa function to retain only those observations annotated within the realm of Bacteria and Archaea. We also used the `step_filter_by_prevalence` function to retain only those OTUs with at least 1% of the samples with values greater than 0. This approach ensured that we were working with a high-quality, informative subset of the data, which improved the overall accuracy and reliability of the differential abundance analysis. ```{r} ## QC phy_qc(rec) ## Adding prepro steps rec <- rec |> step_subset_taxa(tax_level = "Kingdom", taxa = c("Bacteria", "Archaea")) |> step_filter_by_prevalence() rec ``` ## Define Differential Analysis (DA) steps Once data is preprocessed and cleaned, the next step is to add the da steps. The `dar` package incorporates multiple methods to analyze the data, including: ALDEx2, ANCOM-BC, corncob, DESeq2, Lefse, MAaslin2, MetagenomeSeq, and Wilcox. These methods provide a range of options for uncovering potential microbial biomarkers associated with the variable of interest. To ensure consistency across methods, we decided not to use default parameters, but to set the `min_prevalence` parameter to 0 for MAaslin2, and the `rm_zeros` parameter to 0.01 for MetagenomeSeq, since it was observed that the pct_all_zeros value was not equal to 0 in some levels of the categorical variable in the results of `phy_qc()`. This approach ensured that the analysis was consistent across all methods and that the results were interpretable. Note: to reduce computation time, in this example we will only use the metagenomeSeq and MAaslin2 methods, that are the fastest ones. However, we recommend using all the methods available in the package to ensure a more robust analysis. ```{r} ## DA steps definition rec <- rec |> step_metagenomeseq(rm_zeros = 0.01) |> step_maaslin(min_prevalence = 0) rec ``` ## Prep recipe Once the recipe has been defined, the next step is to execute all the steps defined in the recipe. This is done through the function `prep()`. Internally, it first executes the preprocessing steps, which modify the phyloseq object stored in the recipe. Then, using the modified phyloseq, it executes each of the defined differential abundance methods. To speed up the execution time, the `prep()` function includes the option to run in parallel. The resulting object has class `PrepRecipe` and when printed in the terminal, it displays the number of taxa detected as significant in each of the methods and also the total number of taxa shared across all methods. This allows for a provisional overview of the results and a comparison between methods. ```{r} ## Execute in parallel da_results <- prep(rec, parallel = TRUE) da_results ``` ## Default results extraction At this point, we could extract the taxa shared across all methods using the function `bake()` to define a default consensus strategy and then `cool()` to extract the results. ```{r} ## Default DA taxa results results <- bake(da_results) |> cool() results ``` However, `dar` allows for complex consensus strategies based on the obtained results. To that end, the user has access to different functions to graphically represent different types of information. This feature allows for a more in-depth analysis of the results and a better understanding of the underlying patterns in the data. ## Exploration for consensus strategie definition For example, `intersection_plt()` gives an overview of the overlaps between methods by creating an upSet plot. In our case, this function has shown that 210 taxa are shared across all the methods used. ```{r, fig.height=5} ## Intersection plot intersection_plt(da_results, ordered_by = "degree", font_size = 1) ``` In addition to the `intersection_plt()` function, `dar` also has the function `exclusion_plt()` which provides information about the number of OTUs shared between methods. This function allows to identify the OTUs that are specific to each method and also the ones that are not shared among any method. ```{r} ## Exclusion plot exclusion_plt(da_results) ``` Besides to the previously mentioned functions, `dar` also includes the function `corr_heatmap()`, which allows for visualization of the overlap of significant OTUs between tested methods. This function can provide similar information to the previous plots, but in some cases it may be easier to interpret. comprehensive view of the results. ```{r, fig.height=6} ## Correlation heatmap corr_heat <- corr_heatmap(da_results, font_size = 10) corr_heat ``` Finally, `dar` also includes the function `mutual_plt()`, which plots the number of differential abundant features mutually found by a defined number of methods, colored by the differential abundance direction and separated by comparison. The resulting graph allows us to see that the features detected correspond mainly to the comparisons between hts vs msm and msm vs pwid. Additionally, the graph also allows us to observe the direction of the effect; whether a specific OTU is enriched or depleted for each comparison. ```{r, fig.height=6} ## Mutual plot mutual_plt( da_results, count_cutoff = length(steps_ids(da_results, type = "da")) ) ``` ## Define a consesus strategy using bake After visually inspecting the results from running all the differential analysis methods on our data, we have the necessary information to define a consensus strategy that fits our dataset. In our case, we will retain all the methods. However if one or more methods are not desired, the `bake()` function includes the `exclude` parameter, which allows to exclude specific methods. Additionally, the `bake()` function allows to further refine the consensus strategy through its parameters, such as `count_cutoff`, which indicates the minimum number of methods in which an OTU must be present, and `weights`, a named vector with the ponderation value for each method. However, for simplicity, these parameters are not used in this example. ```{r} ## Define consensus strategy da_results <- bake(da_results) da_results ``` ## Extract results To conclude, we can extract the final results using the `cool()` function. This function takes a `PrepRecipe` object and the ID of the bake to be used as input (by default it is 1, but if you have multiple consensus strategies, you can change it to extract the desired results). ```{r} ## Extract results for bake id 1 f_results <- cool(da_results, bake = 1) f_results ``` To further visualize the results, the `abundance_plt()` function can be utilized to visualize the differences in abundance of the differential abundant taxa. ```{r, fig.height=6} ## Ids for Bacteroide and Provotella species ids <- f_results |> dplyr::filter(stringr::str_detect(taxa, "Bacteroi.*|Prevote.*")) |> dplyr::pull(taxa_id) ## Abundance plot as boxplot abundance_plt(da_results, taxa_ids = ids, type = "boxplot") ``` ## Session info ```{r} devtools::session_info() ```