---
title: "Getting started with MetaProViz"
author:
  - name: Christina Schmidt
    affiliation:
      - Heidelberg University
output:
  BiocStyle::html_document:
    self_contained: true
    toc: true
    toc_float: true
    toc_depth: 5
    code_folding: show
vignette: >
  %\VignetteIndexEntry{Quick Overview}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
bibliography: bibliography.bib
editor_options:
  chunk_output_type: console
---

```{r chunk_setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.crop = FALSE
)
```

First, if you have not done so yet, install the required dependencies and load the libraries:

```{r load_libraries, message=FALSE, warning=FALSE}
# 1. Install MetaProViz from Bioconductor devel:
# if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# BiocManager::install(version = "devel")
# BiocManager::install("MetaProViz")

# 2. Install the latest development version from GitHub using remotes:
# remotes::install_github("saezlab/MetaProViz")
# Install Rtools if you haven’t done this yet, using the appropriate version (e.g. Windows or macOS).

library(MetaProViz)

# dependencies that need to be loaded:
library(magrittr)
library(dplyr)
library(rlang)
library(ggfortify)
library(tibble)
```
\
\

# 1. Loading the example data

\
Here we use an example dataset that is publicly available as [metabolomics workbench project PR001418](https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Project&ProjectID=PR001418). It includes metabolic profiles of human renal epithelial cells (HK2) and clear cell renal cell carcinoma (ccRCC) cell lines cultured in Plasmax cell culture media [@Sciacovelli_Dugourd2022]. We use the integrated raw peak data as example data, with the trivial metabolite name in combination with the KEGG ID as the metabolite identifiers.\
\
As part of the **MetaProViz** package you can load the example data into your global environment using the function `toy_data()`:\
Intracellular experiment **(Intra)** \
The raw data are available via [metabolomics workbench study ST002224](https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST002224&StudyType=MS&ResultType=1), where intracellular metabolomics of HK2 and the ccRCC cell lines 786-O, 786-M1A and 786-M2A was performed.\
We can access the built-in dataset `intracell_raw`, which includes columns with sample information and columns with the measured metabolite integrated peaks.\

```{r load_data}
data(intracell_raw)
Intra <- intracell_raw %>%
    column_to_rownames("Code")
```

```{r show_data_preview, echo=FALSE}
# https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
# Check what our data looks like:
Intra[1:5, c(1:4, 21, 44)] %>%
    kableExtra::kbl(caption = "Preview of the DF `Intra` including columns with sample information and metabolite ids with their measured values.") %>%
    kableExtra::kable_classic(full_width = FALSE, html_font = "Cambria", font_size = 12)
# %>%
# kableExtra::scroll_box(width = "100%", height = "200px")
```
\
\
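The layout of the input (sample identifiers as row names, a few columns with sample information, then one column per metabolite) can be sketched on a toy data frame. This is only an illustration in base R: the sample codes, metabolite names and values below are invented and are not part of `intracell_raw`.

```{r sketch_input_layout}
# Toy data frame mimicking the layout of the example data (all values are made up).
toy <- data.frame(
    Code = c("S1", "S2", "S3", "S4"),
    Conditions = c("HK2", "HK2", "786-O", "786-O"),
    Biological_Replicates = c(1, 2, 1, 2),
    alanine = c(10.5, 11.2, 20.1, 19.4),
    lactate = c(100.3, 98.7, 150.2, NA)
)

# Use the sample identifier as row names (base R equivalent of column_to_rownames()).
rownames(toy) <- toy$Code
toy$Code <- NULL

# Split into sample metadata and metabolite measurements.
meta_cols <- c("Conditions", "Biological_Replicates")
toy_metadata <- toy[, meta_cols]
toy_measurements <- toy[, setdiff(colnames(toy), meta_cols)]

dim(toy_metadata)
dim(toy_measurements)
```

Keeping the two parts in separate objects with identical row names mirrors how the functions used in this vignette receive `data` and `metadata_sample`.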

# 2. Pre-processing

**MetaProViz** includes a pre-processing module with the function `processing()`, which has multiple parameters to customize data processing.\
`featurefilt` applies the 80%-filtering rule on the metabolite features either on the whole dataset (="Standard") [@Bijlsma2006] or per condition (="Modified") [@Wei2018]. This means that metabolites are removed where more than 20% of the samples (all or per condition) have no detection. With the parameter `cutoff_featurefilt` the stringency of the filtering can be adapted to the experimental context. For instance, patient tumour samples can contain many unknown subgroups due to gender, age, stage etc., which leads to a metabolite being detected in only 50% (or even less) of the tumour samples; in this context one could consider changing `cutoff_featurefilt` from its default (=0.8). If `featurefilt = "None"`, no feature filtering is performed. In the context of `featurefilt` it is also noteworthy that the function `pool_estimation()` can be used to estimate the quality of the metabolite detection and will return a list of metabolites that are variable across the different pool measurements (pool = mixture of all experimental samples measured several times during the LC-MS run). Metabolites that are variable in the pool samples should be removed from the data.\
The parameter `tic` refers to Total Ion Count (tic) normalisation, which is often used with LC-MS derived metabolomics data. If `tic = TRUE`, each feature (=metabolite) in a sample is divided by the sum of all intensity values (= total number of ions) of that sample and finally multiplied by a constant (= the mean of the total number of ions across all samples).
Note that tic normalisation should not be used with a small number of features (= metabolites), since tic assumes that on “average” the ion count of each sample is equal if there were no instrument batch effects [@Wulff2018].\
The parameter `mvi` refers to Missing Value Imputation (mvi). If `mvi = TRUE`, half minimum (HM) missing value imputation is performed per feature (= per metabolite). Here it is important to mention that HM has been shown to perform well for missing values that are missing not at random (MNAR) [@Wei2018].\
Lastly, the function `processing()` performs outlier detection and adds a column "Outliers" to the DF, which can be used to remove outliers. The parameter `hotellins_confidence` sets the confidence interval used for the Hotelling's T2 outlier test [@Hotelling1931].\
\
If your data contain pool samples, you can run `pool_estimation()` before applying the `processing()` function. This is important, since features (=metabolites) that are too variable should be removed before any data transformation, such as tic, is performed as part of the `processing()` function. If there is high variability (high CVs), you should consider removing those features from the data. If you have used internal standards in your experiment, you should specifically check their CVs, as high values would indicate technical issues. You can find details on this in the extended vignettes:\
- [Standard metabolomics data](https://saezlab.github.io/MetaProViz/articles/standard-metabolomics.html)\
- [Consumption-Release (CoRe) metabolomics data from cell culture media](https://saezlab.github.io/MetaProViz/articles/core-metabolomics.html)\
\
Now we will apply the `processing()` function to the example data and have a look at the output produced. You will notice that all the chosen parameters and results are documented in messages.
All the result tables, the Quality Control (QC) plots and the outlier detection plots are returned and can be easily viewed.\

```{r processing, fig.width=6, fig.height=4.5, fig.align="left"}
PreprocessingResults <- processing(
    data = Intra[-c(49:58), -c(1:3)], # remove pool samples and columns with sample information
    metadata_sample = Intra[-c(49:58), c(1:3)], # remove pool samples and columns with metabolite measurements
    metadata_info = c(
        Conditions = "Conditions",
        Biological_Replicates = "Biological_Replicates"
    ),
    featurefilt = "Modified",
    cutoff_featurefilt = 0.8,
    tic = TRUE,
    mvi = TRUE,
    hotellins_confidence = 0.99, # we perform outlier testing using a 0.99 confidence interval
    core = FALSE,
    save_plot = "svg",
    save_table = "csv",
    print_plot = TRUE,
    path = NULL
)

# This is the results table:
Intra_Preprocessed <- PreprocessingResults[["DF"]][["Preprocessing_output"]]
```
\
\
\

```{r show_preprocessing_results, echo=FALSE}
# Check what our data looks like:
Intra_Preprocessed[29:32, 1:9] %>%
    kableExtra::kbl(caption = "Preview of the pre-processing results, which has an additional column `Outliers` including the results of Hotelling's T2.") %>%
    kableExtra::kable_classic(full_width = FALSE, html_font = "Cambria", font_size = 12)
# %>%
# kableExtra::scroll_box(width = "100%", height = "200px")
```
\
In the output table you can now see the column "Outliers". For the condition 786-M2A, one sample was detected as an outlier by Hotelling's T2 test in the first round of filtering.\
As part of the `processing()` function several plots are generated and saved. Additionally, the ggplots are returned in the list to enable further modification using the ggplot syntax.
These plots include plots showing the outliers for each filtering round and other QC plots.\
\
Before we proceed, we will remove the outlier:\

```{r remove_outliers}
Intra_Preprocessed <- Intra_Preprocessed %>%
    filter(Outliers == "no") # remove MS55_29
```
\
As you may have noticed, in this example dataset we have several biological replicates that were injected (=measured) several times; these repeated measurements are termed analytical replicates. The **MetaProViz** pre-processing module includes the function `replicate_sum()`, which combines the analytical replicates of each biological replicate and saves the results:\

```{r replicate_sum}
Intra_Preprocessed <- replicate_sum(
    data = Intra_Preprocessed[, -c(1:4)],
    metadata_sample = Intra_Preprocessed[, c(1:4)],
    metadata_info = c(
        Conditions = "Conditions",
        Biological_Replicates = "Biological_Replicates",
        Analytical_Replicates = "Analytical_Replicates"
    )
)
```
\
\
In case you have performed a Consumption-Release (core) metabolomics experiment, which usually refers to a cell culture experiment where metabolomics is performed on the cell culture media, you will also need to set the parameter `core=TRUE` in the `processing()` function. Additional data processing steps are then applied:\
1. Blank sample: This refers to media samples in which no cells have been cultured, which are used as blank. In detail, the mean of the blank samples of a feature (= metabolite) is subtracted from the values measured in each sample for the same feature. In the column “Condition” of the Experimental_design DF, you will need to label your blank samples with “blank”.\
2. Growth factor or growth rate: This refers to the different conditions and is either based on cell count or protein quantification at the start of the experiment (t0) and at the end of the experiment (t1), resulting in the growth factor (t1/t0). Otherwise, one can experimentally estimate the growth rate of each condition.
Ultimately, this measure is used to normalize the data, since the amount of growth impacts the consumption and release of metabolites from the media and hence needs to be accounted for. If you do not have this information, it will be set to 1, yet be aware that this may affect the results.\

For details see the extended vignette [Consumption-Release (CoRe) metabolomics data from cell culture media](https://saezlab.github.io/MetaProViz/articles/core-metabolomics.html).

## PCA plot and Heatmap

Using the processed data, we can now use the **MetaProViz** visualization module to generate overview heatmaps with `viz_heatmap()` or PCA plots with `viz_pca()`.

1. PCA\
Principal component analysis (PCA) is a dimensionality reduction method that reduces all the measured features (=metabolites) of one sample into a few features in the different principal components, whereby each principal component explains a certain percentage of the variance between the different samples. Hence, this enables interpretation of sample clustering based on the measured features (=metabolites).\
We can choose shape and color using the additional information of interest from our metadata. Especially for complex data, such as patient data, it can be valuable to use different demographics (e.g. age, gender, medication,...) for this.\
Here the different cell lines are either control or cancerous, which we display by shape. The cancerous cell lines can further be divided into metastatic or primary, which we display as colour.
The plot shows that primary and metastatic cell lines separate on the y-axis, which accounts for 30% of the variance, whereas the cell status (healthy versus cancer) separates the samples on the x-axis, which accounts for 64% of the variance.\

```{r pca_plot, fig.align="left", fig.width=6, fig.height=4.5, fig.cap="Figure: Do the samples cluster for the Cell type?"}
# Create the metadata file:
MetaData_Sample <- Intra_Preprocessed[, c(1:2)] %>%
    mutate(Celltype = case_when(
        Conditions == "HK2" ~ "Healthy",
        Conditions == "786-O" ~ "Primary Tumour",
        TRUE ~ "Metastatic Tumour"
    )) %>%
    mutate(Status = case_when(
        Conditions == "HK2" ~ "Healthy",
        TRUE ~ "Cancer"
    ))

# Make PCA plot
viz_pca(
    metadata_info = c(color = "Celltype", shape = "Status"),
    metadata_sample = MetaData_Sample,
    data = Intra_Preprocessed[, -c(1:5)],
    plot_name = "Cell type"
)
```
\
Similarly, we can use the data and sample information to make a heatmap:\

```{r heatmap_plot, fig.align="left", fig.cap="Colour for sample metadata."}
viz_heatmap(
    data = Intra_Preprocessed[, -c(1:4)],
    metadata_sample = MetaData_Sample,
    metadata_info = c(color_Sample = list("Conditions", "Biological_Replicates", "Celltype", "Status"))
)
```
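To make the three numerical pre-processing steps described in this section more tangible, here is a minimal base R sketch on a made-up intensity table. It is not the internal code of `processing()`: the order of the steps, the toy values, and the reading of the per-condition ("Modified") rule as "keep a metabolite if at least one condition reaches the cutoff" are assumptions made for illustration.

```{r sketch_preprocessing_steps}
# Toy intensity table: 6 samples (rows) x 3 metabolites (columns); values are made up.
conditions <- c("A", "A", "A", "B", "B", "B")
x <- data.frame(
    m1 = c(10, 12, 11, 20, 22, 21),
    m2 = c(5, 6, 7, NA, NA, 9),
    m3 = c(NA, NA, 100, NA, NA, NA)
)

# 1. Per-condition 80% rule (assumption: a metabolite is kept if it is detected
#    in at least `cutoff` of the samples of at least one condition).
cutoff <- 0.8
detected <- sapply(x, function(v) tapply(!is.na(v), conditions, mean))
keep <- apply(detected, 2, function(frac) any(frac >= cutoff))
x_filtered <- x[, keep, drop = FALSE]

# 2. Half-minimum imputation per metabolite: replace NA by half of the
#    smallest observed value of that metabolite.
x_imputed <- as.data.frame(lapply(x_filtered, function(v) {
    v[is.na(v)] <- min(v, na.rm = TRUE) / 2
    v
}))

# 3. Total ion count normalisation: divide each sample by its summed intensity
#    and multiply by the mean of the summed intensities across all samples.
tic_per_sample <- rowSums(x_imputed)
x_tic <- sweep(as.matrix(x_imputed), 1, tic_per_sample, "/") * mean(tic_per_sample)

colnames(x_filtered)
rowSums(x_tic)
```

After the last step every sample has the same summed intensity, which is exactly the assumption tic normalisation imposes on the data and the reason it is not advisable for a small number of features.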

# 3. Differential Metabolite Analysis

Differential Metabolite Analysis (`dma`) is used to compare two conditions (e.g. Tumour versus Healthy) by calculating the Log2FC, p-value, adjusted p-value and t-value. With the parameters `pval` and `padj` one can choose the statistical test, such as t-test, Wilcoxon test, limma, ANOVA, Kruskal-Wallis test, etc. (see function reference for more information).\
As input one can use the processed data we have generated using the `processing` module, but any DF including metabolite values can be used, even though we recommend normalizing the data and removing outliers prior to dma. Moreover, we require the information which condition a sample corresponds to.\
\
The numerator and denominator defined as part of the `metadata_info` parameter determine which comparisons are performed:\
1. **one_vs_one** (single comparison): numerator="Condition1", denominator ="Condition2"\
2. **all_vs_one** (multiple comparison): numerator=NULL, denominator ="Condition"\
3. **all_vs_all** (multiple comparison): numerator=NULL, denominator =NULL (=default)\
\
Note that if you have not performed missing value imputation and hence your data include NAs or 0 values for some features, the `dma()` function deals with this as follows:\

1. If you use the parameter `pval="lmFit"`, limma is performed. Limma does a Bayesian fit of the data and subtracts Mean(Condition1 fit) - Mean(Condition2 fit). As such, unless all values of a feature are NA, limma can deal with NAs.
2. Standard Log2FC: log2(Mean(Condition1)) - log2(Mean(Condition2))
    a. If all values of the replicates of one condition are NA/0 for a feature (=metabolite): Log2FC= Inf/-Inf and the statistics will be NA\
    b. If some values of the replicates of one condition are NA/0 for a feature (=metabolite): Log2FC= positive or negative value, but the statistics will be NA\

\
It is important to mention that in case of `pval="lmFit"`, we log2 transform the data prior to running limma to enable the calculation of the Log2FC, hence do not provide log2 transformed data.\
\
In the example data we have four different cell lines, healthy (HK2) and cancer (ccRCC: 786-M1A, 786-M2A and 786-O), hence we can perform multiple different comparisons. For simplicity, we will compare 786-M1A versus HK2. The results can be automatically saved and all the results are returned in a list with the different data frames. If parameter plot=TRUE, an overview Volcano plot is generated and saved.\

```{r dma, fig.width=7, fig.height=5, fig.align="left"}
# Perform a single comparison (one_vs_one) using a t-test:
DMA_Res <- dma(
    data = Intra_Preprocessed[, -c(1:3)], # we need to remove columns that do not include metabolite measurements
    metadata_sample = Intra_Preprocessed[, c(1:3)], # only maintain the information about condition and replicates
    metadata_info = c(
        Conditions = "Conditions",
        Numerator = "786-M1A",
        Denominator = "HK2"
    ), # we compare 786-M1A_vs_HK2
    pval = "t.test",
    padj = "fdr"
)

# Inspect the dma results tables:
DMA_786M1A_vs_HK2 <- DMA_Res[["dma"]][["786-M1A _vs_ HK2"]]
```
\
\
\

```{r show_dma_results, echo=FALSE}
# Check what our data looks like:
DMA_786M1A_vs_HK2[c(7, 9, 11:12, 14), ] %>%
    kableExtra::kbl(caption = "2. Preview of the dma results for the comparison of 786-M1A versus HK2 cells.", row.names = FALSE) %>%
    kableExtra::kable_classic(full_width = FALSE, html_font = "Cambria", font_size = 12)
```
\
In case you have performed a Consumption-Release (core) metabolomics experiment, which usually refers to a cell culture experiment where metabolomics is performed on the cell culture media, you will also need to set the parameter `core=TRUE` in the `dma()` function.
In a core experiment the normalized metabolite values can be either negative, if the metabolite has been consumed from the media, or positive, if the metabolite has been released from the cell into the culture media. Since we cannot calculate a Log2FC using negative values, we calculate the absolute difference between the mean of Condition 1 and the mean of Condition 2. The absolute difference is log2 transformed in order to make the values comparable between the different metabolites, resulting in the Log2Dist. The absolute difference only captures the magnitude of the difference, not which of the two means is larger. To reflect the direction of change between the two conditions, we multiply by -1 if C1 < C2. By setting the parameter `core = TRUE`, the Log2 Distance is calculated instead of the Log2FC. For details see the extended vignette [Consumption-Release (CoRe) metabolomics data from cell culture media](https://saezlab.github.io/MetaProViz/articles/core-metabolomics.html#dma).
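The two effect sizes described in this section can be written down in a few lines of base R. This is a sketch that follows the formulas as stated in the text, not the internal code of `dma()`; the helper name `log2_distance()` and all numbers are invented for illustration.

```{r sketch_log2fc_log2dist}
# Standard Log2FC from the condition means (requires positive values).
c1 <- c(200, 220, 210) # made-up intensities, Condition 1
c2 <- c(100, 110, 105) # made-up intensities, Condition 2
log2fc <- log2(mean(c1)) - log2(mean(c2))
log2fc

# Log2 Distance for core data, where values can be negative (consumed)
# or positive (released). Hypothetical helper following the text:
# log2 of the absolute difference of the means, multiplied by -1 if C1 < C2.
log2_distance <- function(core_c1, core_c2) {
    difference <- mean(core_c1) - mean(core_c2)
    # sign() is -1 if mean(C1) < mean(C2) and +1 if it is larger;
    # identical means give NaN in this sketch (0 * -Inf).
    sign(difference) * log2(abs(difference))
}

# Consumed in both conditions, more strongly in Condition 1:
log2_distance(c(-8, -10, -12), c(-2, -2, -2))
# Released in Condition 1, consumed in Condition 2:
log2_distance(c(3, 5, 4), c(-4, -4, -4))
```

Because the Log2 Distance is computed on a difference rather than a ratio, its range depends on the scale of the normalized core values, which is why the thresholds used later for clustering should be based on the observed distance ranges.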

## Volcano Plots

In general, we have three different `Plot_Settings`, which will also be used for other plot types such as lollipop graphs.\
`1.` `"Standard"` is the standard version of the plot, with one dataset being plotted.\
`2.` `"Conditions"` plots two or more datasets together.\
`3.` `"PEA"` stands for Pathway Enrichment Analysis and is used to plot the results of a GSE analysis, as here the figure legends will be adapted.\
\
Using the dma results, we can now use the **MetaProViz** visualization module to generate further customized Volcano plots with `viz_volcano()`:\
- To plot the metabolite names you can change the parameter `select_label` from its default (`select_label=""`) to NULL, in which case the metabolite names will be plotted randomly, or you can pass a vector with the metabolite names that should be labeled.\
- By providing additional feature or sample information, you can color code and/or shape the dots on the volcano plot.\
- Based on feature information (e.g. pathways), you can also create individual plots, one for each pathway.\

For detailed examples check out `3. Run MetaProViz Visualisation` in the vignettes [Standard metabolomics data](https://saezlab.github.io/MetaProViz/articles/standard-metabolomics.html#run-metaproviz-visualisation) or [Consumption-Release (CoRe) metabolomics data from cell culture media](https://saezlab.github.io/MetaProViz/articles/core-metabolomics.html#run-metaproviz-visualisation)\

# 3. Enrichment Analysis and Prior knowledge

Over Representation Analysis (ORA) is an enrichment method that determines if a set of features (e.g. metabolic pathways) is over-represented in the selection of features (=metabolites) from the data in comparison to all measured features (metabolites) using Fisher's exact test. The selected metabolites are usually the most altered metabolites in the data, which can be chosen by the top and bottom t-values. Before we can perform ORA on the dma results, we have to ensure that the metabolite names match the metabolite IDs of the prior knowledge (PK). \

## Match IDs with PK

As part of the **MetaProViz** package you can access metabolite prior knowledge with the collection of metabolite sets MetSigDB (Metabolite signature database) for pathway enrichment analysis and compound class enrichment analysis; specific PK databases can also be used to study the connection of metabolites and receptors or transporters. The many different PK databases and resources pose a problem: metabolite identifiers (e.g. KEGG, HMDB, PubChem, etc.) are not standardized across databases, and the same metabolite can have multiple identifiers in different databases. This is known as the many-to-many mapping problem. If you want to know more about how to translate IDs, quantify the mapping of your data to the prior knowledge resource, increase the mapping, etc., have a look at our dedicated vignette, [Prior Knowledge Access & Integration](https://saezlab.github.io/MetaProViz/articles/prior-knowledge.html).\
\
Here we will use the KEGG pathways [@Kanehisa2000], hence we have to ensure that the metabolite names match the KEGG IDs or KEGG trivial names.

```{r match_ids_kegg}
#--------Add metabolite IDs to our example data:
# 1. Load feature metainformation of our example data
data(cellular_meta)
MappingInfo <- cellular_meta

# 2. Merge with our differential results (FYI: you can also do this automatically
#    as part of the dma function using the parameter metadata_feature)
ORA_Input <- merge(DMA_786M1A_vs_HK2, MappingInfo, by = "Metabolite", all.x = TRUE) %>%
    dplyr::filter(!is.na(KEGGCompound)) %>% # remove features without KEGG ID
    tibble::column_to_rownames("KEGGCompound") %>%
    dplyr::select(-Metabolite)

#--------Load KEGG pathways:
KEGG_Pathways <- metsigdb_kegg()
```

## Run ORA

In general, the `input_pathway` requirements are the columns "term", "Metabolite" and "Description", and the `data` requirements are the columns "t.val" and "Metabolite".\

```{r run_ora}
# Perform ORA
DM_ORA_res <- standard_ora(
    data = ORA_Input, # input data requirements: column `t.val` and column `Metabolite`
    metadata_info = c(
        pvalColumn = "p.adj",
        percentageColumn = "t.val",
        PathwayTerm = "term",
        PathwayFeature = "Metabolite"
    ),
    # Pathway file requirements: column `term`, `Metabolite` and `Description`.
    # Above we loaded KEGG_Pathways using metsigdb_kegg()
    input_pathway = KEGG_Pathways,
    pathway_name = "KEGG"
)

# Let's check what the results look like:
DM_ORA_786M1A_vs_HK2 <- DM_ORA_res[["ClusterGosummary"]]
```

```{r show_ora_results, echo=FALSE}
# Check what our data looks like:
DM_ORA_786M1A_vs_HK2[c(1:5), -1] %>%
    kableExtra::kbl(caption = "Preview of the ORA results for the comparison of 786-M1A versus HK2 cells.", row.names = FALSE) %>%
    kableExtra::kable_classic(full_width = FALSE, html_font = "Cambria", font_size = 12)
```

## Volcano plot

If you have performed Pathway Enrichment Analysis (PEA) such as ORA or GSEA, you can also plot the results and add the information to the figure legends.\
For this we need to prepare the correct input data, including the pathways used to run the pathway analysis, the differential metabolite data used as input for the pathway analysis and the results of the pathway analysis:

```{r volcano_pea}
# Here we select only a few pathways to make only the most important plots:
InputPEA2 <- DM_ORA_786M1A_vs_HK2 %>%
    filter(!is.na(GeneRatio)) %>%
    filter(pvalue <= 0.1) %>%
    dplyr::rename("term" = "ID")

viz_volcano(
    plot_types = "PEA",
    metadata_info = c(
        PEA_Pathway = "term", # needs to be the same in both, metadata_feature and data2
        PEA_stat = "pvalue", # column of data2
        PEA_score = "GeneRatio", # column of data2
        PEA_Feature = "Metabolite" # column of metadata_feature (needs to be the same as row names in data)
    ),
    metadata_feature = KEGG_Pathways, # must be the pathways used for pathway analysis
    data = ORA_Input, # must be the data you have used as an input for the pathway analysis
    data2 = InputPEA2, # must be the results of the pathway analysis
    plot_name = "KEGG",
    select_label = NULL
)
```
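The statistical idea behind ORA, as introduced at the start of this section, can be illustrated for a single pathway with `stats::fisher.test()`. This sketch uses invented metabolite names and counts and is not the code that `standard_ora()` runs internally.

```{r sketch_ora_fisher}
# Made-up example: 100 measured metabolites (the background), 20 of them selected
# as most altered, and one pathway with 10 measured members.
background <- paste0("met", 1:100)
selected <- background[1:20]
pathway <- background[c(1:6, 51:54)] # 6 of the 10 members are among the selected

in_pathway <- background %in% pathway
in_selection <- background %in% selected

# 2x2 contingency table: pathway membership versus selection status.
contingency <- table(
    in_pathway = factor(in_pathway, levels = c(TRUE, FALSE)),
    in_selection = factor(in_selection, levels = c(TRUE, FALSE))
)
contingency

# One-sided Fisher's exact test for over-representation of the pathway
# among the selected metabolites.
ora_test <- fisher.test(contingency, alternative = "greater")
ora_test$p.value
```

In a real analysis this test is repeated for every pathway, which is why the resulting p-values are adjusted for multiple testing. The choice of background matters as well: here it is the set of all measured metabolites, in line with the description above.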

# 4. Metabolite Clustering Analysis

Metabolite Clustering Analysis (`MCA`) is a module that includes different functions to cluster metabolites into groups based on logical regulatory rules. This can be particularly useful if one has multiple conditions and aims to find patterns in the data.

## MCA-2Cond

This metabolite clustering method is based on the Regulatory Clustering method (RCM) that was developed as part of the Signature Regulatory Clustering (SiRCle) model [@Mora_Schmidt2024]. As part of the [SiRCleR package](https://github.com/ArianeMora/SiRCleR/tree/main), variations of the initial RCM method are also proposed, such as clustering based on two comparisons (e.g. KO versus WT in hypoxia and in normoxia).\
Here we set two different thresholds, one for the differential metabolite abundance (`Log2FC`) and one for the `significance` (e.g. p.adj). This will define if a feature (= metabolite) is assigned into:\
1. ***"UP"***, which means a metabolite is significantly up-regulated in the underlying comparison.\
2. ***"DOWN"***, which means a metabolite is significantly down-regulated in the underlying comparison.\
3. ***"No Change"***, which means a metabolite does not change significantly in the underlying comparison and/or is not defined as up-regulated/down-regulated based on the Log2FC threshold chosen.\
\
Thereby “No Change” is further subdivided into four states:\
1. ***“Not Detected”***, which means a metabolite is not detected in the underlying comparison.\
2. ***“Not Significant”***, which means a metabolite is not significant in the underlying comparison.\
3. ***“Significant positive”***, which means a metabolite is significant in the underlying comparison and the differential metabolite abundance is positive, yet does not meet the threshold set for "UP" (e.g. Log2FC >1 = "UP" and we have a significant Log2FC=0.8).\
4. ***“Significant negative”***, which means a metabolite is significant in the underlying comparison and the differential metabolite abundance is negative, yet does not meet the threshold set for "DOWN".\
\
This definition is done individually for each comparison and determines which metabolite cluster a metabolite is sorted into. \
Since we have two comparisons, we can choose between different background settings, which define which features will be considered for the clusters (e.g. you could include only features (= metabolites) that are detected in both comparisons, removing the rest of the features). The background methods `method_background` are the following, ordered from ***1.1. - 1.4.*** from most restrictive to least restrictive:\
***1.1. C1&C2***: Most stringent background setting, which will lead to a small number of metabolites.\
***1.2. C1***: Focus is on the metabolite abundance of Condition 1 (C1).\
***1.3. C2***: Focus is on the metabolite abundance of Condition 2 (C2).\
***1.4. C1|C2***: Least stringent background method, since a metabolite will be included in the input if it has been detected in at least one of the two conditions.\
\
Lastly, we will get clusters of metabolites that are defined by the metabolite change in the two conditions. For example, if Alanine is "UP" based on the thresholds in both comparisons it will be sorted into the cluster "core_UP". As each of the two comparisons has six possible states (6 × 6 transitions between the comparisons), the flows are summarised into a smaller number of metabolite clusters using different Regulation Groupings (RG):

1. RG1_All\
2. RG2_Significant takes into account metabolites that are significant (UP, DOWN, significant positive, significant negative)\
3. RG3_SignificantChange only takes into account metabolites that have significant changes (UP, DOWN).\
\

```{r load_mca_rules}
# Example of all possible flows:
data(mca_twocond_rules)
MCA2Cond_Rules <- mca_twocond_rules
```

```{r show_mca_2cond_rules, echo=FALSE}
# Check what our data looks like:
MCA2Cond_Rules %>%
    kableExtra::kbl(caption = "Metabolite Clustering Analysis: 2 Conditions.", row.names = FALSE) %>%
    kableExtra::kable_classic(full_width = FALSE, html_font = "Cambria", font_size = 12)

# easyalluvial::alluvial_wide(mca_2cond[,c(1:2,4)], fill_by = 'last_variable' )
# easyalluvial::alluvial_wide(mca_2cond[,c(1:2,5)], fill_by = 'last_variable' )
```
\
For a detailed example of the `mca_2cond()` function visit the extended vignette [Standard metabolomics data](https://saezlab.github.io/MetaProViz/articles/standard-metabolomics.html#mca).

## MCA-CoRe

This metabolite clustering method is based on logical regulatory rules to sort metabolites into metabolite clusters. Here you need intracellular metabolomics and corresponding consumption-release metabolomics.\
Here we will define if a feature (= metabolite) is assigned into:\
1. ***"UP"***, which means a metabolite is significantly up-regulated in the underlying comparison.\
2. ***"DOWN"***, which means a metabolite is significantly down-regulated in the underlying comparison.\
3. ***"No Change"***, which means a metabolite does not change significantly in the underlying comparison and/or is not defined as up-regulated/down-regulated based on the Log2FC threshold chosen.\
\
Thereby "No Change" is further subdivided into four states:\
1. ***"Not Detected"***, which means a metabolite is not detected in the underlying comparison.\
2. ***"Not Significant"***, which means a metabolite is not significant in the underlying comparison.\
3. ***"Significant positive"***, which means a metabolite is significant in the underlying comparison and the differential metabolite abundance is positive, yet does not meet the threshold set for "UP" (e.g. Log2FC >1 = "UP" and we have a significant Log2FC=0.8).\
4. ***"Significant negative"***, which means a metabolite is significant in the underlying comparison and the differential metabolite abundance is negative, yet does not meet the threshold set for "DOWN".\
\
Lastly, we also take into account the core direction, meaning if a metabolite was:\
1. ***"Released"***, which means it is released into the media in both conditions of the underlying comparison.\
2. ***"Consumed"***, which means it is consumed from the media in both conditions of the underlying comparison.\
3. ***"Released/Consumed"***, which means it is consumed/released in one condition, whilst the opposite occurs in the second condition of the underlying comparison.\
4. ***"Not Detected"***, which means a metabolite is not detected in the underlying comparison. \

This definition is done individually for each comparison and determines which metabolite cluster a metabolite is sorted into. \
Since we have two comparisons (intracellular and core), we can choose between different background settings, which define which features will be considered for the clusters (e.g. you could include only features (= metabolites) that are detected in both comparisons, removing the rest of the features). The background methods `method_background` are the following, ordered from ***1.1. - 1.4.*** from most restrictive to least restrictive:\
***1.1. Intra&core***: Most stringent background setting, which will lead to a small number of metabolites.\
***1.2. core***: Focus is on the metabolite abundance of the core.\
***1.3. Intra***: Focus is on the metabolite abundance of intracellular.\
***1.4. Intra|core***: Least stringent background method, since a metabolite will be included in the input if it has been detected in at least one of the two conditions.\
\
Lastly, we will get clusters of metabolites that are defined by the metabolite change in the two conditions. For example, if Alanine is "UP" based on the thresholds in both comparisons it will be sorted into the cluster "core_UP". As there are three layers (six states for intracellular, six states for core and four core directions, i.e. 6 × 6 × 4 transitions), the flows are summarised into a smaller number of metabolite clusters using different Regulation Groupings (RG):

1. RG1_All\
2. RG2_Significant takes into account metabolites that are significant (UP, DOWN, significant positive, significant negative)\
3. RG3_SignificantChange only takes into account metabolites that have significant changes (UP, DOWN).\
\
In order to define which group a metabolite is assigned to, we set two different thresholds. For intracellular data those are based on the differential metabolite abundance (`Log2FC`) and the `significance` (e.g. p.adj). For the core data this is based on the `Log2 Distance` and the `significance` (e.g. p.adj). For `Log2FC` we recommend a threshold of 0.5 or 1, whilst for the `Log2 Distance` one should check the distance ranges and base the threshold on this.\
\
Regulatory rules:\
\

```{r load_mca_core_rules}
# Example of all possible flows:
data(mca_core_rules)
MCA_CoRe_Rule <- mca_core_rules
```

```{r show_mca_core_rules, echo=FALSE}
# Check what our data looks like:
MCA_CoRe_Rule[, 1:6] %>%
    kableExtra::kbl(caption = "Metabolite Clustering Analysis: core.", row.names = FALSE) %>%
    kableExtra::kable_classic(full_width = FALSE, html_font = "Cambria", font_size = 12)
```
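The state assignment that precedes the lookup in the rule tables can be sketched as a small base R function. The helper `assign_state()`, its default thresholds, the handling of values that sit exactly on a threshold, and the toy table are assumptions for illustration; the package functions `mca_2cond()` and `mca_core()` may implement the details differently.

```{r sketch_mca_states}
# Hypothetical helper: map one Log2FC and one adjusted p-value to one of the
# six states described above.
assign_state <- function(log2fc, padj, threshold_log2fc = 1, threshold_padj = 0.05) {
    if (is.na(log2fc) || is.na(padj)) {
        return("Not Detected")
    }
    if (padj >= threshold_padj) {
        return("Not Significant")
    }
    if (log2fc > threshold_log2fc) {
        return("UP")
    }
    if (log2fc < -threshold_log2fc) {
        return("DOWN")
    }
    # Significant, but the Log2FC does not pass the threshold
    # (a Log2FC of exactly 0 is counted as positive in this sketch).
    if (log2fc >= 0) "Significant positive" else "Significant negative"
}

# Made-up differential results for five metabolites.
toy_dma <- data.frame(
    Metabolite = c("met_a", "met_b", "met_c", "met_d", "met_e"),
    Log2FC = c(1.8, -2.3, 0.8, 0.2, NA),
    p.adj = c(0.001, 0.01, 0.02, 0.6, NA)
)
toy_dma$State <- mapply(assign_state, toy_dma$Log2FC, toy_dma$p.adj)
toy_dma
```

Applying such a function to each comparison separately gives one state per metabolite and comparison; the combination of these states is what the rule tables shown above translate into metabolite clusters.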

\
\
\
For a detailed example of the `mca_core()` function visit the extended vignette [Consumption-Release (CoRe) metabolomics data from cell culture media](https://saezlab.github.io/MetaProViz/articles/core-metabolomics.html#mca).

## ORA on each metabolite cluster

As explained in detail above, Over Representation Analysis (ORA) is a pathway enrichment analysis (PEA) method. As ORA is based on Fisher's exact test, it is well suited to test if a set of features (=metabolic pathways) is over-represented in the selection of features (= clusters of metabolites) from the data in comparison to all measured features (all metabolites). In detail, `cluster_ora()` will perform ORA on each of the metabolite clusters we got as a result of performing `mca_2cond` or `mca_core`, using all metabolites as the background. For a detailed example of the `cluster_ora()` function visit the extended vignette [Consumption-Release (CoRe) metabolomics data from cell culture media](https://saezlab.github.io/MetaProViz/articles/core-metabolomics.html#ora-on-each-metabolite-cluster) or [Standard metabolomics data](https://saezlab.github.io/MetaProViz/articles/standard-metabolomics.html#ora-on-each-metabolite-cluster).

# Viz

The big advantages of the `MetaProViz` visualization module are its flexible and easy usage and the fact that the figures are saved in a publication-ready style and format. For instance, the x- and y-axis size will always be adjusted for the number of samples or features (=metabolites) plotted, and in the case of Volcano plots and PCA plots the axis size is fixed and not affected by figure legends or title. In this way, there is no need for many adjustments; the figures can be dropped into a presentation or paper and are all in the same style.\
\
All the `viz_*()` functions are constructed in the same way.
With the parameter `metadata_info` the user can pass a named vector with information about the metadata columns that should be used to customize the plot by colour, shape or by creating individual plots. The metadata DF itself is passed, depending on the plot type, for the samples and/or the features (=metabolites) via the parameters `metadata_sample` and `metadata_feature`.\
\
In each of those Plot_Settings, the user can set color and/or shape based on additional information (e.g. pathway information, cluster information or other demographics like gender). Moreover, where applicable, individual plots can be created based on those metadata (e.g. one plot for each metabolic pathway).\
\
We support the plot types:\

- PCA plot
- Superplots (Bar, Box and Violin plots)
- Heatmaps
- Volcano Plots

For a detailed example of the visualisation functions visit the extended vignette [Consumption-Release (CoRe) metabolomics data from cell culture media](https://saezlab.github.io/MetaProViz/articles/core-metabolomics.html#run-metaproviz-visualisation) or [Standard metabolomics data](https://saezlab.github.io/MetaProViz/articles/standard-metabolomics.html#run-metaproviz-visualisation).

# Session information

```{r session_info, echo=FALSE}
options(width = 120)
sessionInfo()
```

# Bibliography