--- title: "multiGSEA: an example workflow" author: - name: Sebastian Canzler affiliation: Helmholtz-Centre for Environmental Research - UFZ, Leipzig, Germany - name: Jörg Hackermüller affiliation: Helmholtz-Centre for Environmental Research - UFZ, Leipzig, Germany package: multiGSEA date: "March 11, 2020" output: BiocStyle::html_document: toc_float: true BiocStyle::pdf_document: default bibliography: references.bib vignette: > %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{multiGSEA: an example workflow} --- # Introduction The `multiGSEA` package was designed to run a robust GSEA-based pathway enrichment for multiple omics layers. The enrichment is calculated for each omics layer separately and aggregated p-values are calculated afterwards to derive a composite multi-omics pathway enrichment. Pathway definitions can be downloaded from up to eight different pathway databases by means of the `graphite` Bioconductor package [@graphite:2018]. Feature mapping for transcripts and proteins is supported towards Entrez Gene IDs, Uniprot, Gene Symbol, RefSeq, and Ensembl IDs. The mapping is accomplished through the `AnnotationDbi` package [@AnnotationDbi:2019] and currently supported for 11 different model organisms including human, mouse, and rat. ID conversion of metabolite features to Comptox Dashboard IDs (DTXCID, DTXSID), CAS-numbers, Pubchem IDs (CID), HMDB, KEGG, ChEBI, Drugbank IDs, or common metabolite names is accomplished through the AnnotationHub package `metabliteIDmapping`. This package provides a comprehensive ID mapping for more than 1.1 million entries. This vignette covers a simple example workflow illustrating how the `multiGSEA` package works. The omics data sets that will be used throughout the example were originally provided by Quiros _et al._ [@Quiros:2017]. In their publication the authors analyzed the mitochondrial response to four different toxicants, including Actinonin, Diclofenac, FCCB, and Mito-Block (MB), within the transcriptome, proteome, and metabolome layer. The transcriptome data can be downloaded from [NCBI GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84631), the proteome data from the [ProteomeXchange Consortium](http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD006293), and the non-targeted metabolome raw data can be found in the online supplement. ## Installation There are two different ways to install the `multiGSEA` package. First, the `multiGSEA` package is part of [Bioconductor](https://bioconductor.org/packages/devel/bioc/html/multiGSEA.html). Hence, the best way to install the package is via `BiocManager`. Start R (>=4.0.0) and run the following commands: ```{r bioconductor, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } # The following initializes usage of Bioc devel BiocManager::install(version = "devel") BiocManager::install("multiGSEA") ``` Alternatively, the `multiGSEA` package is hosted on our github page https://github.com/yigbt/multiGSEA and can be installed via the `devtools` package: ```{r devtools, eval=FALSE} install.packages("devtools") devtools::install_github("https://github.com/yigbt/multiGSEA") ``` Once installed, just load the `multiGSEA` package with: ```{r load_multiGSEA, eval=FALSE} library(multiGSEA) ``` # Example workflow A common workflow which is exemplified in this vignette is typically separated in the following steps: 1. Load required libraries, including the `multiGSEA` package, and omics data sets. 2. Create data structure for enrichment analysis. 3. Download and customize the pathway definitions. 4. Run the pathway enrichment for each omics layer. 5. Calculate the aggregated pathway enrichment. ## Load libraries and omics data At first, we need to load the necessary packages to run the multi-omics enrichment analysis. In our example data we have to deal with omics data that has been measured in human cell lines. We therefore need the `org.Hs.eg.db` package [@CarlsonHs:2019] for transcript and protein mapping. In case the omics data was measured in mouse or rat, we would need the packages `org.Mm.eg.db` [@CarlsonMm:2019] and `org.Rn.eg.db` [@CarlsonRn:2019], respectively. ```{r load_mapping_library, results='hide', warning=FALSE, message=FALSE} library("org.Hs.eg.db") ``` In principle, `multiGSEA` is able to deal with 11 different model organisms. A summary of supported organisms, their naming format within `multiGSEA` and their respective `AnnotationDbi` package is shown in Table \@ref(tab:organismsTable). ```{r organismsTable, results='asis', echo=FALSE} caption <- "Supported organisms, their abbreviations being used in `multiGSEA`, and mapping database that are needed for feature conversion. Supported abbreviations can be seen with `getOrganisms()`" df <- data.frame( Organisms = c( "Human", "Mouse", "Rat", "Dog", "Cow", "Pig", "Chicken", "Zebrafish", "Frog", "Fruit Fly", "C.elegans" ), Abbreviations = c( "hsapiens", "mmusculus", "rnorvegicus", "cfamiliaris", "btaurus", "sscrofa", "ggallus", "drerio", "xlaevis", "dmelanogaster", "celegans" ), Mapping = c( "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db", "org.Cf.eg.db", "org.Bt.eg.db", "org.Ss.eg.db", "org.Gg.eg.db", "org.Xl.eg.db", "org.Dr.eg.db", "org.Dm.eg.db", "org.Ce.eg.db" ) ) knitr::kable(df, caption = caption) ``` To run the analysis of this vignette, load the installed version of `multiGSEA`. ```{r load_multiGSEA_package, results='hide', message=FALSE, warning=FALSE} library(multiGSEA) library(magrittr) ``` Load the omics data for each layer where an enrichment should be calculated. The example data is provided within the package and already preprocessed such that we have log2 transformed fold changes and their associated p-values. ```{r load_omics_data} # load transcriptomic features data(transcriptome) # load proteomic features data(proteome) # load metabolomic features data(metabolome) ``` ### Cautionary note This example involves preprocessed omics data from public repositories, which means that the data might look different when you download and pre-process it with your own workflow. Therefore, we put our processed data as an example data in the R package. We here sketch out the pipeline described in the `multiGSEA` paper. We will not focus on the pre-processing steps and how to derive the necessary input format for the multi-omics pathway enrichment in terms of differentially expression analysis, since this is highly dependent on your experiment and analysis workflow. However, the required input format is quite simple and exactly the same for each input omics layer: A data frame with 3 mandatory columns, including feature IDs, the log2-transformed fold change (logFC), and the associated p-value. The header of the data frame can be seen in Table \@ref(tab:omicsTable): ```{r omicsTable, results='asis', echo=FALSE} caption <- "Structure of the necessary omics data. For each layer (here: transcriptome), feature IDs, log2FCs, and p-values are needed." knitr::kable( transcriptome %>% dplyr::arrange(Symbol) %>% dplyr::slice(1:6), caption = caption ) ``` ## Create data structure `multiGSEA` works with nested lists where each sublist represents an omics layer. The function `rankFeatures` calculates the pre-ranked features, that are needed for the subsequent calculation of the enrichment score. `rankFeatures` calculates the a local statistic `ls` based on the direction of the fold change and the magnitude of its significance: \begin{equation} ls = sign( log_2(FC)) * log_{10}( p-value) \end{equation} Please note, that any other rank metric will work as well and the choice on which one to chose is up to the user. However, as it was shown by Zyla _et al._ [@Zyla:2017], the choice of the applied metric might have a big impact on the outcome of your analysis. ```{r rank_features, results='hide'} # create data structure omics_data <- initOmicsDataStructure(layer = c( "transcriptome", "proteome", "metabolome" )) ## add transcriptome layer omics_data$transcriptome <- rankFeatures( transcriptome$logFC, transcriptome$pValue ) names(omics_data$transcriptome) <- transcriptome$Symbol ## add proteome layer omics_data$proteome <- rankFeatures(proteome$logFC, proteome$pValue) names(omics_data$proteome) <- proteome$Symbol ## add metabolome layer ## HMDB features have to be updated to the new HMDB format omics_data$metabolome <- rankFeatures(metabolome$logFC, metabolome$pValue) names(omics_data$metabolome) <- metabolome$HMDB names(omics_data$metabolome) <- gsub( "HMDB", "HMDB00", names(omics_data$metabolome) ) ``` The first elements of each omics layer are shown below: ```{r omics_short} omics_short <- lapply(names(omics_data), function(name) { head(omics_data[[name]]) }) names(omics_short) <- names(omics_data) omics_short ``` ## Download and customize pathway definitions Now we have to select the databases we want to query and the omics layer we are interested in before pathway definitions are downloaded and features are mapped to the desired format. ```{r calculate_enrichment, results='hide', message=FALSE, warning=FALSE} databases <- c("kegg", "reactome") layers <- names(omics_data) pathways <- getMultiOmicsFeatures( dbs = databases, layer = layers, returnTranscriptome = "SYMBOL", returnProteome = "SYMBOL", returnMetabolome = "HMDB", useLocal = FALSE ) ``` The first two pathway definitions of each omics layer are shown below: ```{r pathways_short} pathways_short <- lapply(names(pathways), function(name) { head(pathways[[name]], 2) }) names(pathways_short) <- names(pathways) pathways_short ``` ## Run the pathway enrichment At this stage, we have the ranked features for each omics layer and the extracted and mapped features from external pathway databases. In the following step we can use the `multiGSEA` function to calculate the enrichment for all omics layer separately. ```{r run_enrichment, results='hide', message=FALSE, warning=FALSE} # use the multiGSEA function to calculate the enrichment scores # for all omics layer at once. enrichment_scores <- multiGSEA(pathways, omics_data) ``` The enrichment score data structure is a list containing sublists named `transcriptome`, `proteome`, and `metabolome`. Each sublist represents the complete pathway enrichment for this omics layer. ## Calculate the aggregated p-values Making use of the Stouffers Z-method to combine multiple p-values that have been derived from independent tests that are based on the same null hypothesis. The function `extractPvalues` creates a simple data frame where each row represents a pathway and columns represent omics related p-values and adjusted p-values. This data structure can then be used to calculate the aggregated p-value. The subsequent calculation of adjusted p-values can be achieved by the function `p.adjust`. `multiGSEA` provided three different methods to aggregate p-values. These methods differ in their way how they weight either small or large p-values. By default, `combinePvalues` will apply the Z-method or Stouffer's method [@Stouffer:1949] which has no bias towards small or large p-values. The widely used Fisher's combined probability test [@Fisher:1932] can also be applied but is known for its bias towards small p-values. Edgington's method goes the opposite direction by favoring large p-values [@Edgington:1972]. Those methods can be applied by setting the parameter `method` to "fisher" or "edgington". ```{r combine_pvalues} df <- extractPvalues( enrichmentScores = enrichment_scores, pathwayNames = names(pathways[[1]]) ) df$combined_pval <- combinePvalues(df) df$combined_padj <- p.adjust(df$combined_pval, method = "BH") df <- cbind(data.frame(pathway = names(pathways[[1]])), df) ``` **Please note:** In former versions of `multiGSEA`, adjusted p-values of individual omics-layers have been combined. The correction for multiple testing prior to the p-value combination will, however, violate the assumptions about p-values of certain combination methods (e.g., Fisher's combined probability test). Therefore, **multiGSEA uses p-values for combination from RELEASE 3.18**. To keep previous results reproducible, we introduced the `col_pattern` parameter which can be set to `padj` to use adjusted p-values. Finally, print the pathways sorted based on their combined adjusted p-values. For displaying reasons, only the adjusted p-values are shown in Table \@ref(tab:resultTable). ```{r resultTable, results='asis', echo=FALSE} caption <- "Table summarizing the top 15 pathways where we can calculate an enrichment for all three layers . Pathways from KEGG, Reactome, and BioCarta are listed based on their aggregated adjusted p-value. Corrected p-values are displayed for each omics layer separately and aggregated by means of the Stouffer's Z-method." knitr::kable( df %>% dplyr::arrange(combined_padj) %>% dplyr::filter(!is.na(metabolome_padj)) %>% dplyr::select(c(pathway, transcriptome_padj, proteome_padj, metabolome_padj, combined_pval)) %>% dplyr::slice(1:15), caption = caption ) ``` # Customizable gene sets In principle, `multiGSEA` can also be run as single/multi omics analysis on custom gene sets. The `pathways` object storing the pathway features across multiple omics layers is a nested list, and hence can be designed manually to fit ones purposes. In the following example, HALLMARK gene sets are retrieved from `MSigDB` and used to create a transcriptomics input list: ```{r msigdbr, eval=FALSE} library(msigdbr) library(dplyr) hallmark <- msigdbr(species = "Rattus norvegicus", category = "H") %>% dplyr::select(gs_name, ensembl_gene) %>% dplyr::filter(!is.na(ensembl_gene)) %>% unstack(ensembl_gene ~ gs_name) pathways <- list("transcriptome" = hallmark) ``` **Please note**, feature sets across multiple omics layers have to be in the same order and their names have to be identical, see the example presented above. # Session Information Here is the output of `sessionInfo()` on the system on which this document was compiled: ```{r session, echo=FALSE} sessionInfo() ``` # References