--- title: "Introduction to gep2pep" date: author: "Francesco Napolitano" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Introduction to gep2pep} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} library(knitr) opts_chunk$set(collapse = TRUE) ``` ## About gep2pep Pathway Expression Profiles (PEPs) are based on the expression of all the pathways (or generic gene sets) belonging to a collection under a given experimental condition, as opposed to individual genes. `gep2pep` supports the conversion of gene expression profiles (GEPs) to PEPs and performs enrichment analysis of both pathways and conditions. `gep2pep` creates a local repository of gene sets, which can also be imported from the MSigDB database [1]. The local repository is in the `repo` format. When a GEP, defined as a ranked list of genes, is passed to `buildPEPs`, the stored database of pathways is used to convert the GEP to a PEP and permanently store the latter. One type of analysis that can be performed on PEPs and that is directly supported by `gep2pep` is the Drug-Set Enrichment Analysis (DSEA, see reference below). It finds pathways that are consistently dysregulated by a set of drugs, as opposed to a background of other drugs. Of course PEPs may refer to non-pharmacological conditions (genetic perturbations, disease states, etc.) for analogous analyses. See the `CondSEA` function. A complementary approach is that of finding conditions under which a set of pathways is consistently UP- or DOWN-regulated. This is the pathway-based version of the Gene Set Enrichment Analysis (GSEA). As an application example, this approach can be used to find drugs mimicking the dysregulation of a gene by looking for drugs dysregulating the pathways involving the gene (this has been published as the `gene2drug` tool [3]). See the `PathSEA` function. ### Toy and real-world examples This vignette uses toy data, as real data can be computationally expensive. Connectivity Map data [4] (drug induced gene expression profiles) pre-converted to PEPs can be downloaded from http://dsea.tigem.it in the `gep2pep` format. At the end of this vignette, a precomputed example is reported in which that data is used. ## Creating a repository of pathways In order to use the package, it must be loaded as follows (the GSEABase package will also be used in this vignette to access gene set data): ```{r} library(gep2pep) suppressMessages(library(GSEABase)) ``` The [MSigDB](http://software.broadinstitute.org/gsea/msigdb) is a curated database of gene set collections. The entire database can be downloaded as a single XML file and used by `gep2pep`. The following commented code would import the database once downloaded (gep2pep uses a slight variation of the `BroadCollection` type used by MSigDB, named `CategorizedCollection`, thus a conversion is necessary): ```{r} ## db <- importMSigDB.xml("msigdb_v6.1.xml") ## db <- as.CategorizedCollection(db) ``` However, for this vignette a small excerpt will be used. ```{r} db <- loadSamplePWS() ``` The database is in `GSEABase::GeneSetCollection` format and includes 30 pathways, each of which is included in one of 3 different collections. Following MSigDB conventions, each collection is identified by a "category" and "subCategory" fields. But the CategorizedCollection type allows them to be arbitrary strings, as opposed to MSigDB categories. This allows for the creation of custom collection with categories. `gep2pep` puts together into a single collection identifier using `makeCollectionIDs`. ```{r} colltypes <- sapply(db, collectionType) cats <- sapply(colltypes, attr, "category") subcats <- sapply(colltypes, attr, "subCategory") print(cats) print(subcats) makeCollectionIDs(db) ``` In order to build a local `gep2pep` repository containing pathway data, `createRepository` is used: ```{r, collapse=TRUE} repoRoot <- file.path(tempdir(), "gep2pep_data") rp <- createRepository(repoRoot, db) ``` The repository is in `repo` format (see later in this document how to access the repository directly). However, knowing `repo` is not necessary to use `gep2pep`. The following lists the contents of the repository, loads the `GeneSetCollection` object containing all the TFT database gene sets and finally shows the description of the pathway "E47_01": ```{r} rp TFTsets <- loadCollection(rp, "c3_TFT") TFTsets description(TFTsets[["E47_01"]]) ``` `rp$get` is a `repo` command, see later in this vignette. ## Creating Pathway Expression Profiles Pathway Expression Profiles (PEPs) are created from Gene Expression Profiles (GEPs) using pathway information from the repository. GEPs must be provided as a matrix with rows corresponding to genes and columns corresponding to conditions (*conditions*). Genes and conditions must be specified through row and column names respectively. The values must be ranks: for each condition, the genes must be ranked from that being most UP-regulated (rank 1) to that being most DOWN-regulated (rank equal to the number of rows of the matrix). One well known database that can be obtained in this format is for example the Connectivty Map. A small excerpt (after further processing) is included with the `gep2pep`. The excerpt must be considered as a dummy example, as it only includes 500 genes for 5 conditions. It can be loaded as follows: ```{r} geps <- loadSampleGEP() dim(geps) geps[1:5, 1:3] ``` The GEPs can be converted to PEPs using the `buildPEPs` function. They are stored as repository items by the names "category_subcategory". ```{r, collapse=TRUE} buildPEPs(rp, geps) ``` Each PEP is composed of an Enrichment Score (ES) -- p-value (PV) pair associated to each pathway. ESs and PVs are stored in two separated matrices. For each condition, the p-value reports wether a pathway is significantly dysregulated and the sign of the corresponding ES indicates the direction (UP- or DOWN-regulation). ```{r} loadESmatrix(rp, "c3_TFT")[1:3, 1:3] loadPVmatrix(rp, "c3_TFT")[1:3, 1:3] ``` ## Performing Analysis ### Performing Condition-Set Enrichment Analysis (CondSEA). Suppose the stored PEPs correspond to pharmacological perturbations. Then `gep2pep` can perform Drug-Set Enrichment Analysis (DSEA, see Napolitano et al., 2016, Bioinformatics). It finds pathways that are consistently dysregulated by a set of drugs, as opposed to a background of other drugs. Of course PEPs may refer to non-pharmacological conditions (genetic perturbations, disease states, etc.) for analogous analyses (Condition-Set Enrichment Analysis, CondSEA). Given a set `pgset` of drugs of interest, CondSEA (which in this case is a DSEA) is performed as follows: ```{r, collapse=TRUE} pgset <- c("(+)_chelidonine", "(+/_)_catechin") psea <- CondSEA(rp, pgset) ``` The result is a list of of 2 elements, named "CondSEA" and "details", the most important of which is the former. Per-collection results can be accessed as follows: ```{r} getResults(psea, "c3_TFT") ``` In this dummy example the statistical background is made of only 3 GEPs (we added 5 in total), thus, as expected, there are no significant p-values. For the c3_MIR collection, the pathway most UP-regulated by the chosen set of two drugs is M5012, while the most DOWN-regulated is M18759. They are respectively described as: ```{r} sets <- loadCollection(rp, "c3_MIR") wM5012 <- which(sapply(sets, setIdentifier)=="M5012") wM18759 <- which(sapply(sets, setIdentifier)=="M18759") description(sets[[wM5012]]) description(sets[[wM18759]]) ``` The analysis can be exported in XLS format as follows: ```{r, eval=FALSE} exportSEA(rp, psea) ``` Performing `CondSEA` using Pathway Expression Profiles derived from drug-induced gene expression profiles yields Drug Set Enrichment Analysis (DSEA [2]). ### Performing Pathway-Set Enrichment Analysis (PathSEA) A complementary approach to CondSEA is Pathway-Set Enrichment Analysis (PathSEA). PathSEA searches for conditions that consistently dysregulate a set of pathways. It can be seen as a pathway-based version of the popular Gene Set Enrichment Analysis (GSEA). The PathSEA is run independently in each pathway collection. ```{r, collapse=TRUE} pathways <- c("M11607", "M10817", "M16694", ## from c3_TFT "M19723", "M5038", "M13419", "M1094") ## from c4_CGN w <- sapply(db, setIdentifier) %in% pathways subdb <- db[w] psea <- PathSEA(rp, subdb) ``` ```{r} getResults(psea, "c3_TFT") ``` PathSEA results are analogous to those of CondSEA, but condition-wise. A set of pathways con also be obtained starting from a gene of interest, for example: ```{r} pathways <- gene2pathways(rp, "FAM126A") pathways ``` Using a gene to obtain the pathways and performing `PathSEA` with drug-induced Pathway Expression Profiles yields "gene2drug" analysis, see the following reference: * Napolitano F. et al, gene2drug: a Computational Tool for Pathway-based Rational Drug Repositioning, bioRxiv (2017) 192005; doi: https://doi.org/10.1101/192005 ## A real-world example Precomputed Pathway Expression Profiles of the Connectivity Map data in the gep2pep format can be downloaded, unpacked and opened as follows: ```{r, eval=F} download.file("http://dsea.tigem.it/data/Cmap_MSigDB_v6.1_PEPs.tar.gz", "Cmap_MSigDB_v6.1_PEPs.tar.gz") untar("Cmap_MSigDB_v6.1_PEPs.tar.gz") rpBig <- openRepository("Cmap_MSigDB_v6.1_PEPs") ``` Using these data, two kinds of analysis can be performed: 1. Drug Set Enrichment Analysis, which looks for commond pathways shared by a set of drugs. 2. Gene2drug analysis, which looks for drugs dysregulating a gene of interest. The analyses below are not built at runtime with this document and could become outdated. ### Drug Set Enrichment Analysis (DSEA) Drug Set Enrichment Analysis for a set of HDAC inhibitors using the Gene Ontology collections can be performed as follows: ```{r, eval=F} csea <- CondSEA(rpBig, c("scriptaid", "trichostatin_a", "valproic_acid", "vorinostat", "hc_toxin", "bufexamac"), collections=c("C5_BP", "C5_MF", "C5_CC")) ## [16:41:40] Working on collection: C5_BP ## [16:41:42] Common conditions removed from bgset ## [16:41:42] Row-ranking collection ## [16:41:48] Computing enrichments ## [16:41:58] done ## [16:41:58] Working on collection: C5_MF ## [16:41:58] Row-ranking collection ## [16:42:00] Computing enrichments ## [16:42:02] done ## [16:42:02] Working on collection: C5_CC ## [16:42:02] Row-ranking collection ## [16:42:03] Computing enrichments ## [16:42:04] done ``` The following code retrieves information about the top 10 pathways ranked by CondSEA in GO-MF. ```{r, eval=F} library(GSEABase) setids <- sapply(loadCollection(rpBig, "C5_MF"), setIdentifier) MFresults <- getResults(csea, "C5_MF") w <- match(rownames(MFresults)[1:10], setids) top10 <- loadCollection(rpBig, "C5_MF")[w] sapply(top10, setName) ## [1] "GO_TRANSCRIPTION_FACTOR_ACTIVITY_PROTEIN_BINDING" ## [2] "GO_TRANSCRIPTION_COACTIVATOR_ACTIVITY" ## [3] "GO_PHOSPHATIDYLCHOLINE_1_ACYLHYDROLASE_ACTIVITY" ## [4] "GO_RETINOIC_ACID_RECEPTOR_BINDING" ## [5] "GO_PRE_MRNA_BINDING" ## [6] "GO_N_ACETYLTRANSFERASE_ACTIVITY" ## [7] "GO_CYTOSKELETAL_PROTEIN_BINDING" ## [8] "GO_PEPTIDE_N_ACETYLTRANSFERASE_ACTIVITY" ## [9] "GO_ACETYLTRANSFERASE_ACTIVITY" ## [10] "GO_HYDROGEN_EXPORTING_ATPASE_ACTIVITY" ``` Note that 2 main effects of HDAC inhibitors have been correctly identified: regulation of transcription, and alteration of the acetylation/deacetylation homeostasis. The full analysis can be exported to the Excel format with: ```{r, eval=F} exportSEA(rpBig, csea) ``` ### Gene2drug analysis A Gene2drug analysis can be performed starting by a gene of interest, for example the TFEB gene. Pathways including the gene are found as follows: ```{r, eval=F} pws <- gene2pathways(rpBig, "TFEB") ``` The following code runs the PathSEA analysis on the pathways involving TFEB. Also in this case the analysis is performed on Gene Ontology collections. Note that a warning is thrown as the GO-CC category has no annotation for TFEB (this is ok). ```{r, eval=F} psea <- PathSEA(rpBig, pws, collections=c("C5_BP", "C5_MF", "C5_CC")) ## Warning: [17:17:13] There is at least one selected collections for ## which no pathway has been provided ## [17:17:13] Removing pathways not in specified collections ## [17:17:13] Working on collection: C5_BP ## [17:17:13] Common pathway sets removed from bgset ## [17:17:15] Column-ranking collection ## [17:17:22] Computing enrichments ## [17:17:29] done ## [17:17:29] Working on collection: C5_MF ## [17:17:29] Common pathway sets removed from bgset ## [17:17:29] Column-ranking collection ## [17:17:30] Computing enrichments ## [17:17:32] done ``` Thus the top 10 drugs causing (or mimicking) TFEB upregulation are: ```{r, eval=F} getResults(psea, "C5_BP")[1:10,] ## ES PV ## loperamide 0.7324720 1.504075e-11 ## proadifen 0.7278256 2.079448e-11 ## hydroquinine 0.7220082 3.110434e-11 ## bepridil 0.6904276 2.616027e-10 ## clomipramine 0.6891810 2.839879e-10 ## alexidine 0.6741085 7.574142e-10 ## digitoxigenin 0.6737685 7.741670e-10 ## lanatoside_c 0.6651556 1.342553e-09 ## helveticoside 0.6642112 1.425479e-09 ## ouabain 0.6631157 1.527949e-09 ``` Note that the top drug, loperamide, has been demonstrated to induce TFEB translocation at very low concentrations [3]. As before, the full analysis can be exported to the Excel format with: ```{r, eval=F} exportSEA(rpBig, psea) ``` ## Advanced access to the repository `gep2pep` users don't need to understand how to interact with a `gep2pep` repository, however it can be useful in some cases. `gep2pep` repositories are in the `repo` format (see the `repo` package), so they can be accessed as any other `repo` repository. However item tags should not be changed, as they are used by `gep2pep` to identify data types. Each `gep2pep` repository always contains a special *project* item including repository and session information, which can be shown as follows: ```{r} rp$info("gep2pep repository") ``` Poject name and description can be provided when creating the repository (`createRepository`), or edited with `rp$set("gep2pep repository", newname="my project name", description="my project description")`. The repository will also contain an item for each pathway collection, and possibly an item for each corresponding PEP collection, as in this example: ```{r} rp ``` In order to look at the space that the repository is using, the following command can be used: ```{r} rp$info() ``` `put`, `set` and other `repo` commands can be used to alter repository contents directly, however this could leave the repository in an inconsistent state. The following code checks a repository for possible problems: ```{r, collapse=TRUE} checkRepository(rp) ``` The last check (summary of commond conditions) ensures that the same conditions have been computed for all the pathway collections, which is however not mandatory. ## Further documentation Additional methodological help can be found at: 1. http://dsea.tigem.it 2. http://gene2drug.tigem.it 3. Open access papers [2] and [3]. ```{r include=FALSE} unlink(repoRoot, TRUE) ``` ## References [1] Subramanian A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102, 15545-15550 (2005). [2] Napolitano F. et al, Drug-set enrichment analysis: a novel tool to investigate drug mode of action. Bioinformatics 32, 235-241 (2016). [3] Napolitano F. et al, gene2drug: a Computational Tool for Pathway-based Rational Drug Repositioning, bioRxiv (2017) 192005; doi: https://doi.org/10.1101/192005 [4] Lamb, J. et al. The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science 313, 1929-1935 (2006).