--- title: "User Guide" output: BiocStyle::html_document: toc: true toc_depth: 2 highlight: pygments author: "Søren Helweg Dam" date: "Last updated: `r format(Sys.Date(), '%Y.%m.%d')`" bibliography: library.bib vignette: > %\VignetteIndexEntry{User Guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ```
# Introduction `pariedGSEA` is a user-friendly framework for paired differential gene expression and splicing analyses. Providing bulk RNA-seq count data, `pariedGSEA` combines the results of `r BiocStyle::Biocpkg("DESeq2")` [@deseq2] and `r BiocStyle::Biocpkg("DEXSeq")` [@dexseq], aggregates the p-values to gene level and allows you to run a subsequent gene set over-representation analysis using `r BiocStyle::Biocpkg("fgsea")`'s `fora` function [@fgsea]. Since version 0.99.2, you can also do the differential analyses using `r BiocStyle::Biocpkg("limma")` . `pairedGSEA` was developed to highlight the importance of differential splicing analysis. It was build in a way that yields comparable results between splicing and expression-related events. It, by default, accounts for surrogate variables in the data, and facilitates exploratory data analysis either through storing intermediate results or through plotting functions for the over-representation analysis. This vignette will guide you through how to use the main functions of `pairedGSEA`. `pariedGSEA` assumes you have already preprocessed and aligned your sequencing reads to transcript level. Before starting, you should therefore have a counts matrix and a metadata file. This data may also be in the format of a `r BiocStyle::Biocpkg("SummarizedExperiment")` [@se] or `DESeqDataSet`. *Importantly*, please ensure that the rownames have the format: `gene:transcript`. The metadata should, as a minimum, contain the `sample IDs` corresponding to the column names of the count matrix, a `group` column containing information on which samples corresponds to the baseline (controls) and the case (condition). Bear in mind, the column names can be as you wish, but the names must be provided in the `sample_col` and `group_col` parameters, respectively. To see an example of what such data could look like, please see `?pairedGSEA::example_se`. # Installation To install this package, start R (version "4.3") and enter: Bioconductor dependencies ``` {r install_dep, eval=FALSE} # Install Bioconductor dependencies if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(c( "SummarizedExperiment", "S4Vectors", "DESeq2", "DEXSeq", "fgsea", "sva", "BiocParallel")) ``` Install `pairedGSEA` from Bioconductor ```{r install_bioc, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("pairedGSEA") ``` Install development version from [GitHub](https://github.com/shdam/pairedGSEA) ``` {r install_github, eval=FALSE} # Install pairedGSEA from github devtools::install_github("shdam/pairedGSEA", build_vignettes = TRUE) ``` # Paired differential expression and splicing ## Preparing the experiment parameters Running a paired Differential Gene Expression (DGE) and Differential Gene Splicing (DGS) analysis is the first step in the `pairedGSEA` workflow. But before running the `paired_diff` function, we recommend storing the experiment parameters in a set of variables at the top of your script for future reference and easy access: ```{r setup} library("pairedGSEA") # Defining plotting theme ggplot2::theme_set(ggplot2::theme_classic(base_size = 20)) ## Load data # In this vignette we will use the included example Summarized Experiment. # See ?example_se for more information about the data. data("example_se") if(FALSE){ # Examples of other data imports # Example of count matrix tx_count <- read.csv("path/to/counts.csv") # Or other load function md_file <- "path/to/metadata.xlsx" # Can also be a .csv file or a data.frame # Example of locally stored DESeqDataSet dds <- readRDS("path/to/dds.RDS") # Example of locally stored SummarizedExperiment se <- readRDS("path/to/se.RDS") } ## Experiment parameters group_col <- "group_nr" # Column with the groups you would like to compare sample_col <- "id" # Name of column that specifies the sample id of each sample. # This is used to ensure the metadata and count data contains the same samples # and to arrange the data according to the metadata # (important for underlying tools) baseline <- 1 # Name of baseline group case <- 2 # Name of case group experiment_title <- "TGFb treatment for 12h" # Name of your experiment. This is # used in the file names that are stored if store_results is TRUE (recommended) ``` ## Check your metadata ```{r metadata_check} # Check if parameters above fit with metadaata SummarizedExperiment::colData(example_se) ``` ```{r sample_check} # Check that all data samples are in the metadata all(colnames(example_se) %in% SummarizedExperiment::colData(example_se)[[sample_col]]) ``` ## Running the analysis The paired DGE/DGS analysis is run with `paired_diff()`. `paired_diff()` is essentially a wrapper function around `DESeq2::DESeq` [@deseq2] and `DEXSeq::DEXSeq` [@dexseq], the latter takes in the ballpark of 20-30 minutes to run depending on the size of the data and computational resources. Please visit their individual vignettes for further information. ```{r paired_diff} set.seed(500) # For reproducible results diff_results <- paired_diff( object = example_se, metadata = NULL, # Use with count matrix or if you want to change it in # the input object group_col = group_col, sample_col = sample_col, baseline = baseline, case = case, experiment_title = experiment_title, store_results = FALSE # Set to TRUE (recommended) if you want # to store intermediate results, such as the results on transcript level ) ``` After running the analyses, `paired_diff` will aggregate the p-values to gene level using lancaster `r BiocStyle::CRANpkg("aggregation")` [@lancaster] and calculate the FDR-adjusted p-values (see `?pairedGSEA:::aggregate_pvalue` for more information). For the DGE transcripts, the log2 fold changes will be aggregated using a weighted mean, whereas the DGS log2 fold changes will be assigned to the log2 fold change of the transcript with the lowest p-value. Use the latter with a grain of salt. From here, feel free to analyse the gene-level results using your preferred method. If you set `store_results = TRUE`, you could extract the transcript level results found in the `results` folder under the names "\*_expression_results.RDS" and "\*_splicing_results.RDS" for the DGE and DGS analysis, respectively (The \* corresponds to the provided experiment title). ## Additional parameters There are a range of other parameters you can play with to tailor the experience. Here, the default settings are showed. See `?pairedGSEA::paired_diff` for further details. ```{r extra_settings, eval=FALSE} covariates = NULL, run_sva = TRUE, use_limma = FALSE, prefilter = 10, fit_type = "local", test = "LRT", quiet = FALSE, parallel = TRUE, BPPARAM = BiocParallel::bpparam(), ... ``` To highlight some examples of use: 1. You can use `limma::eBayes` and `limma::diffSplice` for the analyses with `use_limma = TRUE`. 2. You can use additional columns in your metadata as covariates in the model matrix by setting `covariates` to a character vector of the specific names. This will be used in SVA, DGE, and DGS. 3. The test should be kept at default settings, but advanced users may use the "wald" test instead, if they wish. 4. The `...` parameters will be fed to `DESeq2::DESeq`, see their manual for options. 5. If you want a stricter, more loose, or more advanced pre-filtering (generally not necessary, as `r BiocStyle::Biocpkg("DESeq2")` and `r BiocStyle::Biocpkg("DEXSeq")` performs their own pre-filtering), you can set the parameter to a different value or NULL and use your own pre-filtering method directly on the counts matrix.
# Over-Representation Analysis `pairedGSEA` comes with a wrapper function for `fgsea::fora` [@fgsea]. If you wish, feel free to use that directly or any other gene set analysis method - investigate the `diff_results` object before use to see what information it contains. The inbuilt wrapper also computes the relative risk for each gene set and an enrichment score (`log2(relative_risk + 0.06)`, the pseudo count is for plotting purposes). ### Running the inbuilt ORA function Before you get going, you will need a list of gene sets (aka. pathways) according to the species you are working with and the category of gene sets of interest. For this purpose, feel free to use the `r BiocStyle::CRANpkg("msigdbr")` [@msigdbr] wrapper function in `pairedGSEA`: `pairedGSEA::prepare_msigdb`. If you do, see `?pairedGSEA::prepare_msigdb` for further details. The inbuilt ORA function is called `paired_ora` and is run as follows ```{r paied_ora} # Define gene sets in your preferred way gene_sets <- pairedGSEA::prepare_msigdb( species = "Homo sapiens", category = "C5", gene_id_type = "ensembl_gene" ) ora <- paired_ora( paired_diff_result = diff_results, gene_sets = gene_sets, experiment_title = NULL # experiment_title - if not NULL, # the results will be stored in an RDS object in the 'results' folder ) ``` ## Additional parameters ```{r ora_settings, eval=FALSE} cutoff = 0.05, min_size = 25, quiet = FALSE ``` Please investigate the returned object to see the column names and what they contain.
# Analysing ORA results There are many options for investigating your ORA results. `pairedGSEA` comes with an inbuilt scatter plot function that plots the enrichment score of DGE against those of DGS. The function allows you to interactively look at the placement of the significant pathways using `r BiocStyle::CRANpkg("plotly")` [@plotly]. You can color specific points based on a regular expression for gene sets of interest. ## Scatter plot of enrichment scores ```{r scatter} # Scatter plot function with default settings plot_ora( ora, plotly = FALSE, pattern = "Telomer", # Identify all gene sets about telomeres cutoff = 0.05, # Only include significant gene sets lines = TRUE # Guide lines ) ``` As mentioned, this function can be utilized in a few different ways. The default settings will plot the enrichment scores of each analysis and draw dashed lines for the `y = x` line, `y = 0`, and `x = 0`. Remove those by setting `lines = FALSE`. Make the plot interactive with `plotly = TRUE`. Highlight gene sets containing a specific regex pattern by setting `pattern` to the regex pattern of interest. # Session Info ```{r session info} sessionInfo() ``` # References