--- title: "roastgsa vignette (RNAseq)" author: "Adria Caballe-Mestres " output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{roastgsa vignette (RNAseq)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- `r library("knitr")` `r opts_chunk$set(cache=FALSE, fig.width=9, message=FALSE, warning=FALSE, fig.align = "left")` # Installation ```{r, message = FALSE, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("roastgsa") ``` # roastgsa in RNA-seq data This vignette explains broadly the main functions for applying `roastgsa` in RNA-seq data. A more exhaustive example to explore the `roastgsa` functionality is presented in the "roastgsa vignette (main)". All the analyses explained in the main vignette can be reproduced for RNA-seq data, after undertaking the steps covered here in the section "Data normalization and filtering". ```{r load-packages, message=FALSE, warning=FALSE} library( roastgsa ) ``` # Data: RNA-seq experiment from GSEABenchmarkeR We consider the first dataset available in the `tcga` compendium from the `GSEABenchmarkeR` package [1], which consists of a RNA-seq study with 19 tumor Bladder Urothelial Carcinoma samples and 19 adjacent healthy tissues. ```{r, message = FALSE} #library(GSEABenchmarkeR) #tcga <- loadEData("tcga", nr.datasets=1,cache = TRUE) #ysel <- assays(tcga[[1]])$expr #fd <- rowData(tcga[[1]]) #pd <- colData(tcga[[1]]) data(fd.tcga) data(pd.tcga) data(expr.tcga) fd <- fd.tcga ysel <- expr.tcga pd <- pd.tcga N <- ncol(ysel) head(pd) cnames <- c("BLOCK","GROUP") covar <- data.frame(pd[,cnames,drop=FALSE]) covar$GROUP <- as.factor(covar$GROUP) colnames(covar) <- cnames print(table(covar$GROUP)) ``` # Data normalization and filtering To apply `roastgsa`, the expression data should be approximately normally distributed, at least in their univariate form. Depending on the user's preferred method for differential expression analysis, counts transformation methods such as `rlog` or `vst` (`DESeq2`) [2], `zscoreDGE` (`edgeR`) [3] or `voom` (`limma`) [4], can be applied. In the paper we explored the type I and type II errors when applying the `rlog` or `vst` transformation followed by `roastgsa`, showing both good control of type I errors and decent true discovery rates. In the example presented here we transform the expression data with `vst` function from `DESeq2` R package ```{r, message = FALSE} library(DESeq2) dds1 <- DESeqDataSetFromMatrix(countData=ysel,colData=pd, design= ~ BLOCK + GROUP) dds1 <- estimateSizeFactors(dds1) ynorm <- assays(vst(dds1))[[1]] colnames(ynorm) <- rownames(covar) <- paste0("s",1:ncol(ynorm)) ``` Another key step before using the roastgsa methods for enrichment analysis is to filter out low expressed genes, where coverage might be a limitation for detecting true differentially expressed genes. For the TCGA data considered here, the default filter employed by the authors when loading the data was to exclude genes with cpm < 2 in more than half of the samples. A short discussion about the relationship between gene coverage and statistical power for the `roastgsa` approach is available in our article presenting the `roastgsa` package. ```{r, message = FALSE} threshLR <- 10 dim(ysel) min(apply(ysel,1,mean)) ``` # Gene sets We consider a classic repository of general biological functions for battery gene set analysis such as broad hallmarks [5]. The gene sets for human are saved within the roastgsa package and can be loaded by ```{r, message = FALSE} data(hallmarks.hs) head(names(hallmarks.hs)) ``` In this case, `hallmarks.hs` contains gene symbols whereas the row names for `ynorm` are entrez identifiers. We can set the row names to symbols, which in this case presents a one-to-one relationship ```{r, message = FALSE} rownames(ynorm) <-fd[rownames(ynorm),1] ``` Other gene set databases that could be applied to these data for battery testing are presented in the `roastgsa` vignette (gene set collections). # Enrichment analysis The comparison of interest can be specified by a numeric vector with length matching the number of columns in the design. ```{r, message = FALSE} form <- as.formula(paste0("~ ", paste0(cnames, collapse = "+"))) design <- model.matrix(form , data = covar) terms <- colnames(design) contrast <- rep(0, length(terms)) contrast[length(colnames(design))] <- 1 ``` Below, there is the standard `roastgsa` instruction (under competitive testing) for `maxmean` and `mean` statistics. ```{r, message = FALSE} fit.maxmean <- roastgsa(ynorm, form = form, covar = covar, contrast = contrast, index = hallmarks.hs, nrot = 500, mccores = 1, set.statistic = "maxmean", self.contained = FALSE, executation.info = FALSE) f1 <- fit.maxmean$res rownames(f1) <- gsub("HALLMARK_","",rownames(f1)) head(f1) fit.mean <- roastgsa(ynorm, form = form, covar = covar, contrast = contrast, index = hallmarks.hs, nrot = 500, mccores = 1, set.statistic = "mean", self.contained = FALSE, executation.info = FALSE) f2 <- fit.mean$res rownames(f2) <- gsub("HALLMARK_","",rownames(f2)) head(f2) ``` ## Visualization of the results Several graphics can be obtained to complement the table results in `f1` and `f2`. Here we only show the heatmaps that summarize the expression patterns obtained for all tested hallmarks. Full description and usage of all graphical options available in the `roastgsa` package are considered in the `roastgsa` vignette for arrays data and the `roastgsa` manual ```{r, fig.height=7, fig.width=7, message = FALSE} hm1 <- heatmaprgsa_hm(fit.maxmean, ynorm, intvar = "GROUP", whplot = 1:50, toplot = TRUE, pathwaylevel = TRUE, mycol = c("orange","green", "white"), sample2zero = FALSE) ``` ```{r, fig.height=7, fig.width=7, message = FALSE} hm2 <- heatmaprgsa_hm(fit.mean, ynorm, intvar = "GROUP", whplot = 1:50, toplot = TRUE, pathwaylevel = TRUE, mycol = c("orange","green", "white"), sample2zero = FALSE) ``` ## sessionInfo ```{r, message = FALSE} sessionInfo() ``` # References [1] Geistlinger L, Csaba G, Santarelli M, Schiffer L, Ramos M, Zimmer R, Waldron L (2019). GSEABenchmarkeR: Reproducible GSEA Benchmarking. R package version 1.6.0, https://github.com/waldronlab/GSEABenchmarkeR. [2] Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550. doi:10.1186/s13059-014-0550-8. [3] Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140. doi:10.1093/bioinformatics/btp616. [4] M. E. Ritchie, B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. limma powers differential expression analyses for RNAsequencing and microarray studies. Nucleic acids research, 43(7):e47, 2015. [5] A. Liberzon, C. Birger, H. Thorvaldsdottir, M. Ghandi, J. P. Mesirov, and P. Tamayo. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems, 1(6):417-425, 2015.