--- title: "roastgsa vignette (gene set collections)" author: "Adria Caballe-Mestres " output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{roastgsa vignette (gene set collections)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- `r library("knitr")` `r opts_chunk$set(cache=FALSE, fig.width=9, message=FALSE, warning=FALSE, fig.align = "left")` # Installation ```{r, message = FALSE, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("roastgsa") ``` # Using gene set collections for battery testing ## Gene set collections Several gene set databases are publicly available and can be used for gene set analysis as part of the screening of bulk data for hypothesis generation. The Hallmark gene set database [1] provides a well-curated gene set list of biological states which can be used to obtain an overall characterization of the data. Other Human Molecular Signatures Database (MSigDB) collections including gene ontology pathways (Biological Process, Cellular Component, Molecular Function), immunologic signature gene sets or regulatory target gene sets can be employed for `roastgsa` analysis to identify the most relevant expression changes between experimental conditions for highly specific biological functions. These broad collections and sub-collections can be downloaded through https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp. Other public collections that can be employed for battery testing such as the KEGG pathway database (https://www.genome.jp/kegg/pathway.html) [2] or reactome (https://reactome.org/) [3] provide a broad representation of gene sets for human diseases, metabolism or cellular processes among others. These gene set collections can be considered for `roastgsa` either by (1) R loading the gene sets in a `list` object, each element containing the gene identifiers for the testing set or (2) saving a `.gmt` file with the whole collection in a specific folder, i.e., ```{r, message = FALSE} # DO NOT RUN # gspath = "path/to/folder/of/h.all.v7.2.symbols.gmt" # gsetsel = "h.all.v7.2.symbols.gmt" ``` ## Usage of Hallmarks in another example with real data We download publicly available arrays from GEO, accession 'GSE145603', which contain LGR5 and POLI double tumor cell populations in colorectal cancer [4] ```{r, message = FALSE} # DO NOT RUN # library(GEOquery) # data <- getGEO('GSE145603') # normdata <- (data[[1]]) # pd <- pData(normdata) # pd$group_LGR5 <- pd[["lgr5:ch1"]] ``` We are interested in screening the hallmarks with the largests changes between LGR5 negative and LGR5 positive samples. The hallmark gene sets are stored in `gspath` with the file name specified in `gsetsel` (see specifications above). The formula, design matrix and the corresponding contrast can be obtained as follows ```{r, message = FALSE} # DO NOT RUN # form <- "~ -1 + group_LGR5" # design <- model.matrix(as.formula(form),pd) # cont.mat <- data.frame(Lgr5.high_Lgr5.neg = c(1,-1)) # rownames(cont.mat) <- colnames(design) ``` In microarrays, the expression values for each gene can be measured in several probesets. To perform gene set analysis, we select the probeset with maximum variability for every gene: ```{r, message = FALSE} # DO NOT RUN # mads <- apply(exprs(normdata), 1, mad) # gu <- strsplit(fData(normdata)[["Gene Symbol"]], split=' \\/\\/\\/ ') # names(gu) <- rownames(fData(normdata)) # gu <- gu[sapply(gu, length)==1] # gu <- gu[gu!='' & !is.na(gu) & gu!='---'] # ps <- rep(names(gu), sapply(gu, length)) # gs <- unlist(gu) # pss <- tapply(ps, gs, function(o) names(which.max(mads[o]))) # psgen.mvar <- pss ``` The `roastgsa` function is used for competitive gene set analysis testing: ```{r, message = FALSE} # DO NOT RUN # roast1 <- roastgsa(exprs(normdata), form = form, covar = pd, # psel = psgen.mvar, contrast = cont.mat[, 1], # gspath = gspath, gsetsel = gsetsel, nrot = 1000, # mccores = 7, set.statistic = "maxmean") # # print(roast1) # ``` Summary tables can be presented following the `roastgsa::htmlrgsa` documentation. # sessionInfo ```{r, message = FALSE} sessionInfo() ``` # References [1] A. Liberzon, C. Birger, H. Thorvaldsdottir, M. Ghandi, J. P. Mesirov, and P. Tamayo. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems, 1(6):417-425, 2015. [2] M. Kanehisa et al. KEGG as a reference resource for gene and protein annotation, Nucleic Acids Research, Volume 44, Issue D1, 4 January 2016, D457–D462, https://doi.org/10.1093/nar/gkv1070 [3] M. Gillespie et al. The reactome pathway knowledgebase 2022, Nucleic Acids Research, 2021;, gkab1028, https://doi.org/10.1093/nar/gkab1028 [4] Morral C, Stanisavljevic J, Hernando-Momblona X, et al. Zonation of Ribosomal DNA Transcription Defines a Stem Cell Hierarchy in Colorectal Cancer. Cell Stem Cell. 2020;26(6):845-861.e12. doi:10.1016/j.stem.2020.04.012