%\VignetteIndexEntry{Gene set enrichment analysis of RNA-Seq data with the SeqGSEA package} %\VignetteKeywords{RNA-Seq GSEA DE DS} %\VignettePackage{SeqGSEA} \documentclass[a4paper]{article} \usepackage{Sweave} \usepackage{amsmath} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{graphicx, Rd} \usepackage{listings} \usepackage{enumitem} \setdescription{leftmargin=0pt} \SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=FALSE,png=TRUE,include=FALSE,width=5.6,height=5.2,resolution=180, concordance=TRUE} \setlength{\textheight}{8.8in} \setlength{\textwidth}{6in} \setlength{\topmargin}{-0.25in} \setlength{\oddsidemargin}{0.25in} \setlength{\evensidemargin}{0.25in} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Robject}[1]{{\small\texttt{#1}}} \newcommand{\Rfunction}[1]{\Robject{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\fixme}[1]{{\textbf{Fixme:} \textit{{#1}}}} \begin{document} \SweaveOpts{concordance=TRUE} \title{\bf Gene set enrichment analysis of RNA-Seq data with the \Rpackage{SeqGSEA} package} \author{Xi Wang$^1$$^,$$^2$ and Murray Cairns$^1$$^,$$^2$$^,$$^3$} %\date{22 Feb 2013} \maketitle \noindent $^1$School of Biomedical Sciences and Pharmacy, The University of Newcastle, Callaghan, New South Wales, Australia\\ \noindent $^2$Hunter Medical Research Institute, New Lambton, New South Wales, Australia\\ \noindent $^3$Schizophrenia Research Institute, Sydney, New South Wales, Australia\\ \noindent \begin{center} {\tt xi.wang@mdc-berlin.de} \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} %------------------------------------------------------------------ \subsection{Background} %------------------------------------------------------------------ Transcriptome sequencing (RNA-Seq) has become a key technology in transcriptome studies because it can quantify overall expression levels and the degree of alternative splicing for each gene simultaneously. Many methods and tools, including quite a few \R{}/{\tt Bioconductor} packages, have been developed to deal with RNA-Seq data for differential expression analysis and thereafter functional analysis aiming at novel biological and biomedical discoveries. However, those tools mainly focus on each gene's overall expression and may miss the opportunities for discoveries regarding alternative splicing or the combination of the two. \Rpackage{SeqGSEA} is novel \R{}/{\tt Bioconductor} package to derive biological insight by integrating differential expression (DE) and differential splicing (DS) from RNA-Seq data with functional gene set analysis. Due to the digital feature of RNA-Seq count data, the package utilizes negative binomial distributions for statistical modeling to first score differential expression and splicing in each gene, respectively. Then, integration strategies are applied to combine the two scores for integrated gene set enrichment analysis. See the publications \cite{Wang13} and \cite{Wang14} for more details. The \Rpackage{SeqGSEA} package can also give detection results of differentially expressed genes and differentially spliced genes based on sample label permutation. %------------------------------------------------------------------ \subsection{Getting started} %------------------------------------------------------------------ The \Rpackage{SeqGSEA} depends on \Rpackage{Biobase} for definitions of class \Rclass{ReadCountSet} and class \Rclass{SeqGeneSet}, \Rpackage{DESeq2} for differential expression analysis, \Rpackage{biomaRt} for gene IDs/names conversion, and \Rpackage{doParallel} for parallelizing jobs to reduce running time. Make sure you have these dependent packages installed before you install \Rpackage{SeqGSEA}. To load the \Rpackage{SeqGSEA} package, type {\tt library(SeqGSEA)}. To get an overview of this package, type {\tt ?SeqGSEA}. <>= library(SeqGSEA) @ <>= ? SeqGSEA @ In this Users' Guide of the \Rpackage{SeqGSEA} package, an analysis example is given in Section \ref{sec:example}, and detailed guides for DE, DS, and integrative GSEA analysis are given in Sections \ref{sec:DE}, \ref{sec:DS}, and \ref{sec:GSEA}, respectively. A guide to parallelize those analyses is given in Section \ref{sec:parallel}. %------------------------------------------------------------------ \subsection{Package citation} %------------------------------------------------------------------ To cite this package, please cite the article below: \\ Wang X and Cairns MJ (2014). \textbf{SeqGSEA: a Bioconductor package for gene set enrichment analysis of RNA-Seq data integrating differential expression and splicing.} \textit{Bioinformatics}, \textbf{30}(12):1777-9. To cite/discuss the method used in this package, please cite the article below: \\ Wang X and Cairns MJ (2013). \textbf{Gene set enrichment analysis of RNA-Seq data: integrating differential expression and splicing.} \textit{BMC Bioinformatics}, \textbf{14}(Suppl 5):S16. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Differential splicing analysis and DS scores}\label{sec:DS} %------------------------------------------------------------------ \subsection{The \Rclass{ReadCountSet} class} %------------------------------------------------------------------ To facilitate differential splicing (DS) analysis, \Rpackage{SeqGSEA} saves exon read count data using \Rclass{ReadCountSet} class, which is derived from \Rclass{eSet}. While below is an example showing the steps to create a new \Robject{ReadCountSet} object, creating a \Robject{ReadCountSet} object from your own data should refer to Section \ref{sec:example}. <>= rcounts <- cbind(t(sapply(1:10, function(x) {rnbinom(5, size=10, prob=runif(1))} )), t(sapply(1:10, function(x) {rnbinom(5, size=10, prob=runif(1))} ))) colnames(rcounts) <- c(paste("S", 1:5, sep=""), paste("C", 1:5, sep="")) geneIDs <- c(rep("G1", 4), rep("G2", 6)) exonIDs <- c(paste("E", 1:4, sep=""), paste("E", 1:6, sep="")) RCS <- newReadCountSet(rcounts, exonIDs, geneIDs) RCS @ %------------------------------------------------------------------ \subsection{DS analysis and DS scores} %------------------------------------------------------------------ To better illustrate DS analysis functions, we load an example \Robject{ReadCountSet} object from a real RNA-Seq data set as follows. <>= data(RCS_example, package="SeqGSEA") RCS_example @ This example \Robject{ReadCountSet} object is comprised of 20 samples and 5,000 exons, part of the prostate cancer RNA-Seq data set \citep{CancerData}. With the function \Rfunction{geneID} and the script below, we can easily check the number of genes involved in this data set. <>= length(unique(geneID(RCS_example))) @ Noticed that some exons are too short or not expressed, we should first filter out these exons from following analysis to secure the robustness of our analysis. By default, function \Rfunction{exonTestablity} marks exons with the sum of read counts across all samples less than {\tt cutoff} (default: 5) to be excluded in downstream analysis. Users can also exclude genes with no or low expression from downstream analysis by checking \Rfunction{geneTestability}. <>= RCS_example <- exonTestability(RCS_example, cutoff = 5) @ Then, the main DS analysis is executed using function \Rfunction{estiExonNBstat} for exon DS NB-statistics and function \Rfunction{estiGeneNBstat} for gene DS NB-statistics by averaging exon NB-statistics. Please refer to \cite{DSGseq} for detailed statistic analysis regarding differential splicing from exon count data. <>= RCS_example <- estiExonNBstat(RCS_example) RCS_example <- estiGeneNBstat(RCS_example) head(fData(RCS_example)[, c("exonIDs", "geneIDs", "testable", "NBstat")]) @ We run DS analysis on the permutation data sets as well. Here we set to run permutation 20 times for demonstration; however, in practice at least ~1,000 permutations are recommended. To do so, we first generate a permutation matrix, each column corresponding to each permutation; then run DS analysis on the permutation data sets, and updated \Robject{permute\_NBstat\_gene} slot for results. <>= permuteMat <- genpermuteMat(RCS_example, times=20) RCS_example <- DSpermute4GSEA(RCS_example, permuteMat) head(RCS_example@permute_NBstat_gene) @ The DS NB-statistics from the permutation data sets offer an empirical background of NB-statistics on the real data set. By normalizing NB-statistics against this background, we get the DS scores, which will be used in integrated GSEA runs (Section \ref{sec:GSEA}). <>= DSscore.normFac <- normFactor(RCS_example@permute_NBstat_gene) DSscore <- scoreNormalization(RCS_example@featureData_gene$NBstat, DSscore.normFac) DSscore.perm <- scoreNormalization(RCS_example@permute_NBstat_gene, DSscore.normFac) DSscore[1:5] DSscore.perm[1:5,1:10] @ %------------------------------------------------------------------ \subsection{DS permutation p-values} %------------------------------------------------------------------ Besides calculating DS scores, based on the NB statistics on the real data set and the permutation data sets, we can also calculate a permutation p-value for each gene's DS significance in the studied data set. \begin{center} <>= RCS_example <- DSpermutePval(RCS_example, permuteMat) head(DSresultGeneTable(RCS_example)) @ \end{center} The adjusted p-values accounting for multiple testings were given by the BH method \citep{BH}. Users can also apply function \Rfunction{topDSGenes} and function \Rfunction{topDSExons} to quickly get the most significant DS genes and exons, respectively. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Differential expression analysis and DE scores}\label{sec:DE} %------------------------------------------------------------------ \subsection{Gene read count data from \Rclass{ReadCountSet} class} %------------------------------------------------------------------ For gene DE analysis, read counts on each gene should be first calculated. With \Rpackage{SeqGSEA}, users usually analyze DE and DS simultaneously, so the package includes the function \Rfunction{getGeneCount} to facilitate gene read count calculation from a \Robject{ReadCountSet} object. <>= geneCounts <- getGeneCount(RCS_example) dim(geneCounts) # 182 20 head(geneCounts) @ This function results in a matrix of 182 rows and 20 columns, corresponding to 182 genes and 20 samples. %------------------------------------------------------------------ \subsection{DE analysis and DE scores} %------------------------------------------------------------------ DE analysis has been implemented in several \R{}/{\tt Bioconductor} packages, of which \Rpackage{DESeq2} \citep{DESeq2010} is mainly utilized in \Rpackage{SeqGSEA} for DE analysis. With \Rpackage{DESeq2}, we can model count data with negative binomial distributions for accounting biological variations and various biases introduced in RNA-Seq. Given the read count data on individual genes and sample grouping information, basic DE analysis based on \Rpackage{DESeq2} including size factor estimation and dispersion estimation, is encapsulated in the function \Rfunction{runDESeq}. \begin{center} <>= label <- label(RCS_example) dds <- runDESeq(geneCounts, label) @ \end{center} \noindent The function \Rfunction{runDESeq} returns a \Robject{DESeqDataSet} object, which is defined in the \Rpackage{DESeq2} package. The DE analysis in the \Rpackage{DESeq2} package continues with the output \Robject{DESeqDataSet} object and conducts negative-binomial-based statistical tests for DE genes (using \Rfunction{nbinomTest} or \Rfunction{nbinomGLMTest}). However, in this \Rpackage{SeqGSEA} package, we define NB statistics to quantify each gene's expression difference between sample groups. The NB statistics for DE can be achieved by the following scripts. <>= DEGres <- DENBStat4GSEA(dds) DEGres[1:5, "NBstat"] @ Similarly, we run DE analysis on the permutation data sets as well. The \Robject{permuteMat} should be the same as used in DS analysis on the permutation data sets. <>= DEpermNBstat <- DENBStatPermut4GSEA(dds, permuteMat) DEpermNBstat[1:5, 1:10] @ Once again, the DE NB-statistics from the permutation data sets offer an empirical background, so we can normalize NB-statistics against this background. By doing so, we get the DE scores, which will also be used in integrated GSEA runs (Section \ref{sec:GSEA}). <>= DEscore.normFac <- normFactor(DEpermNBstat) DEscore <- scoreNormalization(DEGres$NBstat, DEscore.normFac) DEscore.perm <- scoreNormalization(DEpermNBstat, DEscore.normFac) DEscore[1:5] DEscore.perm[1:5, 1:10] @ %------------------------------------------------------------------ \subsection{DE permutation p-values} %------------------------------------------------------------------ Similar to DS analysis, comparing NB-statistics on the real data set and those on the permutation data sets, we can get permutation p-values for each gene's DE significance. <>= DEGres <- DEpermutePval(DEGres, DEpermNBstat) DEGres[1:6, c("NBstat", "perm.pval", "perm.padj")] @ For a comparison to the nominal p-values from exact testing and forming comprehensive results, users can run \Rfunction{DENBTest} first and then \Rfunction{DEpermutePval}, which generates results as follows. <>= DEGres <- DENBTest(dds) DEGres <- DEpermutePval(DEGres, DEpermNBstat) DEGres[1:6, c("NBstat", "pval", "padj", "perm.pval", "perm.padj")] @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Integrative GSEA runs}\label{sec:GSEA} %------------------------------------------------------------------ \subsection{DE/DS score integration} %------------------------------------------------------------------ We have proposed two strategies for integrating normalized DE and DS scores \citep{Wang13}, one of which is the weighted summation of the two scores and the other is a rank-based strategy. The functions \Rfunction{geneScore} and \Rfunction{genePermuteScore} implement two methods for the weighted summation strategy: weighted linear combination and weighted quadratic combination. Scripts below show a linear combination of DE and DS scores with weight for DE equal to 0.3. Users should keep the weight for DE in \Rfunction{geneScore} and \Rfunction{genePermuteScore} the same, and the weight rangs from 0 (i.e., DS only) to 1 (i.e., DE only). Visualization of gene scores can be made by applying the \Rfunction{plotGeneScore} function. <>= gene.score <- geneScore(DEscore, DSscore, method="linear", DEweight = 0.3) gene.score.perm <- genePermuteScore(DEscore.perm, DSscore.perm, method="linear", DEweight=0.3) plotGeneScore(gene.score, gene.score.perm) @ \begin{figure} \centering \includegraphics[width=0.6\textwidth]{SeqGSEA-genescore_l} \caption{Gene scores resulted from linear combination. Scores are sorted from the largest to the smallest. Red, green, orange, blue dotted horizontal lines represent the maximum score, average score on the real data set, and the maximum score, average score on the permutation data sets.} \label{fig:gsl} \end{figure} The plot generated by the \Rfunction{plotGeneScore} function (Fig. \ref{fig:gsl}) can also be saved as a PDF file easily with the {\tt pdf} argument of \Rfunction{plotGeneScore}. The functions \Rfunction{geneScore} and \Rfunction{genePermuteScore} also implement one method for the rank-based integration strategy: using data-set-specific ranks. The plot for integrated gene scores is shown in Fig. \ref{fig:gsr}. <>= gene.score <- geneScore(DEscore, DSscore, method="rank", DEweight = 0.3) gene.score.perm <- genePermuteScore(DEscore.perm, DSscore.perm, method="rank", DEweight=0.3) plotGeneScore(gene.score, gene.score.perm) @ \begin{figure} \centering \includegraphics[width=0.6\textwidth]{SeqGSEA-genescore_r} \caption{Gene scores resulted from rank-based combination with data-set-specific ranks. Scores are sorted from the largest to the smallest. Red, green, orange, blue dotted horizontal lines represent the maximum score, average score on the real data set, and the maximum score, average score on the permutation data sets.} \label{fig:gsr} \end{figure} Rather than the above method to integrate scores with data-set-specific ranks, an alternative method is implemented with the \Rfunction{rankCombine} function, which takes only the ranks from the real data set for integrating DE and DS scores on both real and permutation data sets. This provides a method in a global manner. The plot of gene scores is shown in Fig. \ref{fig:gsgr}. <>= combine <- rankCombine(DEscore, DSscore, DEscore.perm, DSscore.perm, DEweight=0.3) gene.score <- combine$geneScore gene.score.perm <- combine$genePermuteScore plotGeneScore(gene.score, gene.score.perm) @ \begin{figure} \centering \includegraphics[width=0.6\textwidth]{SeqGSEA-genescore_gr} \caption{Gene scores resulted from rank-based combination with the same rank got from the real data set. Scores are sorted from the largest to the smallest. Red, green, orange, blue dotted horizontal lines represent the maximum score, average score on the real data set, and the maximum score, average score on the permutation data sets.} \label{fig:gsgr} \end{figure} Basically the integrated gene scores are distributed similarly with the three integration methods at DE weight 0.3 (Figs. \ref{fig:gsl}, \ref{fig:gsr}, and \ref{fig:gsgr}); however, according to the analysis in \cite{Wang13}, SeqGSEA can detect slightly more significant gene sets with rank-based integration strategy than with linear combination. %------------------------------------------------------------------ \subsection{Initialization of \Rclass{SeqGeneSet} objects} %------------------------------------------------------------------ To facilitate running gene set enrichment analysis, \Rpackage{SeqGSEA} implements a \Rclass{SeqGeneSet} class. The \Rclass{SeqGeneSet} class has several slots for accommodating a category of gene sets derived from any biological knowledge-based databases such as Kyoto Encyclopedia of Genes and Genomes (KEGG). However, we recommend to start with the formatted gene-set files from the well-maintained resource Molecular Signatures Database (MSigDB, \url{http://www.broadinstitute.org/gsea/msigdb/index.jsp}) \citep{GSEA2005}. After downloading a {\tt gmt} file from the above URL, users can use \Rfunction{loadGenesets} to initialize a \Robject{SeqGeneSet} object easily. Please note that with the current version of \Rpackage{SeqGSEA}, only gene sets with gene symbols are supported, though read count data's gene IDs can be either gene symbols or Ensembl Gene IDs. Below is shown an example of the \Robject{SeqGeneSet} object, which contains information such as how many gene sets in this object and the names/sizes/descriptions of each gene set. <>= data(GS_example, package="SeqGSEA") GS_example @ %------------------------------------------------------------------ \subsection{running GSEA with integrated gene scores} %------------------------------------------------------------------ With the initialized \Robject{SeqGeneSet} object and integrated gene scores as well as gene scores on the permutation data sets, the main \Rfunction{GSEnrichAnlyze} can be executed; and the \Rfunction{topGeneSets} allows users promptly access to the top significant gene sets. <>= GS_example <- GSEnrichAnalyze(GS_example, gene.score, gene.score.perm) topGeneSets(GS_example, 5) @ The main GSEA includes several steps detailed in \cite{Wang13} and its original paper \cite{GSEA2005}. In \Rpackage{SeqGSEA}, functions \Rfunction{calES}, \Rfunction{calES.perm}, \Rfunction{normES} and \Rfunction{signifES} are implemented to complete the analysis. Advanced users may set up customized pipelines with the functions above themselves. %------------------------------------------------------------------ \subsection{\Rpackage{SeqGSEA} result displays} %------------------------------------------------------------------ Several functions in \Rpackage{SeqGSEA} can be employed for visualization of gene set enrichment analysis running results. The \Rfunction{plotES} function is to plot the distribution of normalized enrichment scores (NES) of all gene sets in a \Robject{SeqGeneSet} object on the observed data set versus its empirical background provided by the NES on the permutation data sets (Fig. \ref{fig:es}). <>= plotES(GS_example) @ \begin{figure} \centering \includegraphics[width=0.6\textwidth]{SeqGSEA-plotES} \caption{Distribution of normalized enrichment scores (NES) on the observed and permutation (null) data sets. Blue: observed NES density; Orange: each for NES density on one permutation data set; Red: the average density on all permutation data sets; Black: observed NES values. } \label{fig:es} \end{figure} The \Rfunction{plotSig} function plots the distributions of permutation $p$-value, false discovery rate (FDR) and family-wise error rate (FWER) versus NES. The example plot is not shown in this vignette as the distributions can be far from the real ones due to the limited permutation times. <>= plotSig(GS_example) @ The \Rfunction{plotSigGS} function is to plot detailed results of a particular gene set that has been analyzed. Information in the plot includes running enrichment scores, null NES on the permutation data sets. See Fig. \ref{fig:siggs} for an example. <>= plotSigGeneSet(GS_example, 9, gene.score) # 9th gene set is the most significant one. @ \begin{figure} \centering \includegraphics[width=0.95\textwidth]{SeqGSEA-plotSigGS} \caption{Left: gene locations of a particular gene set according to the gene score rank and running enrichment scores; Right: null NES distribution and the relative position of the observed NES. } \label{fig:siggs} \end{figure} Besides the functions to generate plots, the \Rfunction{writeSigGeneSet} function can write the detailed information of any analyzed gene sets, including NES, p-values, FDR, and the leading set (see the definition in \cite{Wang13}). An example is shown below. <>= writeSigGeneSet(GS_example, 9, gene.score) # 9th gene set is the most significant one. @ The \Rfunction{GSEAresultTable} generates a summary table of the GSEA analysis, which can also be output with customized scripts. An example can be found in Section \ref{sec:example}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Running \Rpackage{SeqGSEA} with multiple cores}\label{sec:parallel} %------------------------------------------------------------------ \subsection{R-parallel packages} %------------------------------------------------------------------ There are many \R{} packages for facilitating users in running \R{} scripts in parallel, including \Rpackage{parallel}, \Rpackage{snowfall}, \Rpackage{multicore}, and many others. While experienced users may parallelize \Rpackage{SeqGSEA} runnings with the above packages themselves to reduce the running time, we provide with in the \Rpackage{SeqGSEA} package vignette an general way for users to parallelize their runnings utilizing the \Rpackage{doParallel} package (which depends on \Rpackage{parallel}). First, we show a toy example for a basic idea how \Rpackage{doParallel} works. Basically, \Rpackage{doParallel} is a $parallel$ $backend$ for the \Rpackage{foreach} package using \Rpackage{parallel}, which provides a mechanism to execute \Rpackage{foreach} loops in parallel. With the \Rfunction{foreach} function in the \Rpackage{foreach} package, we can specify which \Rfunction{foreach} loops need to be parallelized using the {\tt \%dopar\%} operator. However, without a registered parallel backend, the \Rfunction{foreach} loops will be executed sequentially even if the {\tt \%dopar\%} operator is used. In those cases, the \Rpackage{foreach} package will issue a warning that it is running sequentially. Below are two running examples showing how the task is running sequentially and in parallel, respectively. \begin{description} \item [Run sequentially] without parallel backend registered \begin{center} <>= library(doParallel) a <- matrix(1:16, 4, 4) b <- t(a) foreach(b=iter(b, by='col'), .combine=cbind) %dopar% (a %*% b) @ \end{center} Although the warning message didn't appear here, you would definitely see a warning message when you run the scripts above, like: \\ Warning message: \\ executing \%dopar\% sequentially: no parallel backend registered \item [Run in parallel] with two cores \begin{center} <>= library(doParallel) cl <- makeCluster(2) # specify 2 cores to be used in this computing registerDoParallel(cl) getDoParWorkers() # 2 a <- matrix(1:16, 4, 4) b <- t(a) foreach(b=iter(b, by='col'), .combine=cbind) %dopar% (a %*% b) @ \end{center} \end{description} The parallel backend registration was done with \Rfunction{registerdoParallel}. For more details please refer to \Rpackage{doParallel}'s vignette (\url{http://cran.r-project.org/web/packages/doParallel/index.html}). %------------------------------------------------------------------ \subsection{Parallelizing analysis on permutation data sets} %------------------------------------------------------------------ In \Rpackage{SeqGSEA}, the loops for analyzing permutation data sets are implemented by \Rfunction{foreach} with {\tt \%dopar\%} operator used. Those loops include DS, DE, and GSEA analyses, which are the most time consuming parts. Although there are three parts can take the advantage of parallel running, users only need to register parallel backend once at the beginning of all analyses. See an analysis example in the next section (Section \ref{sec:example}). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Analysis examples}\label{sec:example} %------------------------------------------------------------------ \subsection{Starting from your own RNA-Seq data} %------------------------------------------------------------------ With this \Rpackage{SeqGSEA} package, we provide complementary Python scripts for counting reads on exons of each genes from SAM/BAM files: two scripts {\tt prepare\_exon\_annotation\_refseq.py} and {\tt prepare\_exon\_annotation\_ensembl.py} for preparing (sub-)exon annotation, and {\tt count\_in\_exons.py} for counting reads. The scripts are based on the {\tt HTSeq} Python package (\url{http://www-huber.embl.de/users/anders/HTSeq/}). Please install it before using the Python scripts provided. The scripts can be found in the directory given by the following command. <>= system.file("extscripts", package="SeqGSEA", mustWork=TRUE) @ Simply by typing ``python'' + the file name of echo script in your shell console, the help documentation will be on your screen. Other than the Python scripts provided, users who prefer playing with \R{}/{\tt Bioconductor} packages can also use \Rfunction{easyRNASeq} in \Rpackage{easyRNASeq}, \Rfunction{summarizeOverlaps} in \Rpackage{GenomicRanges}, and \Rfunction{featureCounts} in \Rpackage{Rsubread} to count reads that mapped to each exon. Please refer to respective packages for detailed usage. For users who are not familiar with RNA-Seq data processing, the upstream steps of counting reads are (1) data preprocessing, including adapter removal, low-quality read filtering, data quality-control analysis, and (2) read mapping. \R{}/{\tt Bioconductor} users can apply \Rpackage{Rsubread} to map reads based on a seed-and-vote approach, as well as a few QC analysis. Users familiar with command-line can choose from a wide range of tools, such as already widely used ones including {\tt TopHat} (\url{http://tophat.cbcb.umd.edu}), {\tt START} (\url{http://code.google.com/p/rna-star}), and etc.. %------------------------------------------------------------------ \subsection{Exemplified pipeline for integrating DE and DS} %------------------------------------------------------------------ Below is shown a typical SeqGSEA running example with the data enclosed with the \Rpackage{SeqGSEA} package, which are a part of the prostate cancer data set \citep{CancerData}. We divide the process into five steps for a complete SeqGSEA run. \begin{description} \item [Step 0:] Initialization. (Users should change values in this part accordingly.) <>= rm(list=ls()) # input count data files data.dir <- system.file("extdata", package="SeqGSEA", mustWork=TRUE) case.pattern <- "^SC" # file name starting with "SC" ctrl.pattern <- "^SN" # file name starting with "SN" case.files <- dir(data.dir, pattern=case.pattern, full.names = TRUE) control.files <- dir(data.dir, pattern=ctrl.pattern, full.names = TRUE) # gene set file geneset.file <- system.file("extdata", "gs_symb.txt", package="SeqGSEA", mustWork=TRUE) # output file prefix output.prefix <- "SeqGSEA.test" # setup parallel backend library(doParallel) cl <- makeCluster(2) # specify 2 cores to be used in computing registerDoParallel(cl) # parallel backend registration # setup permutation times perm.times <- 20 # change the number to >= 1000 in your analysis @ \item [Step 1:] DS analysis <>= # load exon read count data RCS <- loadExonCountData(case.files, control.files) # remove genes with low expression RCS <- exonTestability(RCS, cutoff=5) geneTestable <- geneTestability(RCS) RCS <- subsetByGenes(RCS, unique(geneID(RCS))[ geneTestable ]) # get gene IDs, which will be used in initialization of gene set geneIDs <- unique(geneID(RCS)) # calculate DS NB statistics RCS <- estiExonNBstat(RCS) RCS <- estiGeneNBstat(RCS) # calculate DS NB statistics on the permutation data sets permuteMat <- genpermuteMat(RCS, times=perm.times) RCS <- DSpermute4GSEA(RCS, permuteMat) @ \item [Step 2:] DE analysis <>= # get gene read counts geneCounts <- getGeneCount(RCS) # calculate DE NB statistics label <- label(RCS) dds <-runDESeq(geneCounts, label) DEGres <- DENBStat4GSEA(dds) # calculate DE NB statistics on the permutation data sets DEpermNBstat <- DENBStatPermut4GSEA(dds, permuteMat) # permutation @ \item [Step 3:] score integration <>= # DE score normalization DEscore.normFac <- normFactor(DEpermNBstat) DEscore <- scoreNormalization(DEGres$NBstat, DEscore.normFac) DEscore.perm <- scoreNormalization(DEpermNBstat, DEscore.normFac) # DS score normalization DSscore.normFac <- normFactor(RCS@permute_NBstat_gene) DSscore <- scoreNormalization(RCS@featureData_gene$NBstat, DSscore.normFac) DSscore.perm <- scoreNormalization(RCS@permute_NBstat_gene, DSscore.normFac) # score integration gene.score <- geneScore(DEscore, DSscore, DEweight=0.5) gene.score.perm <- genePermuteScore(DEscore.perm, DSscore.perm, DEweight=0.5) # visilization of scores # NOT run in the example; users to uncomment the following 6 lines to run #plotGeneScore(DEscore, DEscore.perm, pdf=paste(output.prefix,".DEScore.pdf",sep=""), # main="Expression") #plotGeneScore(DSscore, DSscore.perm, pdf=paste(output.prefix,".DSScore.pdf",sep=""), # main="Splicing") #plotGeneScore(gene.score, gene.score.perm, # pdf=paste(output.prefix,".GeneScore.pdf",sep="")) @ \item [Step 4:] main GSEA <>= # load gene set data gene.set <- loadGenesets(geneset.file, geneIDs, geneID.type="ensembl", genesetsize.min = 5, genesetsize.max = 1000) # enrichment analysis gene.set <- GSEnrichAnalyze(gene.set, gene.score, gene.score.perm, weighted.type=1) # format enrichment analysis results GSEAres <- GSEAresultTable(gene.set, TRUE) # output results # NOT run in the example; users to uncomment the following 4 lines to run #write.table(GSEAres, paste(output.prefix,".GSEA.result.txt",sep=""), # quote=FALSE, sep="\t", row.names=FALSE) #plotES(gene.set, pdf=paste(output.prefix,".GSEA.ES.pdf",sep="")) #plotSig(gene.set, pdf=paste(output.prefix,".GSEA.FDR.pdf",sep="")) @ \end{description} For gene sets used in Step 4, while we recommend users directly download and use those already well-formatted gene sets from MSigDB (\url{http://www.broadinstitute.org/gsea/msigdb/index.jsp}), users can also feed whatever gene sets to SeqGSEA as long as they are in the GMT format. Please refer to the following URL for details: \url{http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats}. %------------------------------------------------------------------ \subsection{Exemplified pipeline for DE-only analysis} %------------------------------------------------------------------ For the demanding of DE-only analysis, such as for organisms without much alternative splicing annotated, here we show an exemplified pipeline for such analysis. It includes 4 steps as follows. \begin{description} \item [Step 0:] Initialization. (Users should change values in this part accordingly.) <>= rm(list=ls()) # input count data files data.dir <- system.file("extdata", package="SeqGSEA", mustWork=TRUE) count.file <- paste(data.dir,"geneCounts.txt",sep="/") # gene set file geneset.file <- system.file("extdata", "gs_symb.txt", package="SeqGSEA", mustWork=TRUE) # output file prefix output.prefix <- "SeqGSEA.test" # setup parallel backend library(doParallel) cl <- makeCluster(2) # specify 2 cores to be used in computing registerDoParallel(cl) # parallel backend registration # setup permutation times perm.times <- 20 # change the number to >= 1000 in your analysis @ \item [Step 1:] DE analysis <>= # load gene read count data geneCounts <- read.table(count.file) # speficify the labels of each sample label <- as.factor(c(rep(1,10), rep(0,10))) # calculate DE NB statistics dds <-runDESeq(geneCounts, label) DEGres <- DENBStat4GSEA(dds) # calculate DE NB statistics on the permutation data sets permuteMat <- genpermuteMat(label, times=perm.times) DEpermNBstat <- DENBStatPermut4GSEA(dds, permuteMat) # permutation @ \item [Step 2:] score normalization <>= # DE score normalization DEscore.normFac <- normFactor(DEpermNBstat) DEscore <- scoreNormalization(DEGres$NBstat, DEscore.normFac) DEscore.perm <- scoreNormalization(DEpermNBstat, DEscore.normFac) # score integration - DSscore can be null gene.score <- geneScore(DEscore, DEweight=1) gene.score.perm <- genePermuteScore(DEscore.perm, DEweight=1) # visilization of scores # NOT run in the example; users to uncomment the following 6 lines to run #plotGeneScore(DEscore, DEscore.perm, pdf=paste(output.prefix,".DEScore.pdf",sep=""), # main="Expression") #plotGeneScore(gene.score, gene.score.perm, # pdf=paste(output.prefix,".GeneScore.pdf",sep="")) @ \item [Step 3:] main GSEA <>= # load gene set data geneIDs <- rownames(geneCounts) gene.set <- loadGenesets(geneset.file, geneIDs, geneID.type="ensembl", genesetsize.min = 5, genesetsize.max = 1000) # enrichment analysis gene.set <- GSEnrichAnalyze(gene.set, gene.score, gene.score.perm, weighted.type=1) # format enrichment analysis results GSEAres <- GSEAresultTable(gene.set, TRUE) # output results # NOT run in the example; users to uncomment the following 4 lines to run #write.table(GSEAres, paste(output.prefix,".GSEA.result.txt",sep=""), # quote=FALSE, sep="\t", row.names=FALSE) #plotES(gene.set, pdf=paste(output.prefix,".GSEA.ES.pdf",sep="")) #plotSig(gene.set, pdf=paste(output.prefix,".GSEA.FDR.pdf",sep="")) @ \end{description} %------------------------------------------------------------------ \subsection{One-step SeqGSEA analysis} %------------------------------------------------------------------ While users can choose to run SeqGSEA step by step in a well-controlled manner (see above), the one-step SeqGSEA analysis with an all-in \Rfunction{runSeqGSEA} function enables users to run SeqGSEA in the easiest way. With the \Rfunction{runSeqGSEA} function, users can also test multiple weights for integrating DE and DS scores. DE-only analysis starting with exon read counts is also supported in the all-in function. Follow the example below to start your first SeqGSEA analysis now! \begin{description} <>= ### Initialization ### # input file location and pattern data.dir <- system.file("extdata", package="SeqGSEA", mustWork=TRUE) case.pattern <- "^SC" # file name starting with "SC" ctrl.pattern <- "^SN" # file name starting with "SN" # gene set file and type geneset.file <- system.file("extdata", "gs_symb.txt", package="SeqGSEA", mustWork=TRUE) geneID.type <- "ensembl" # output file prefix output.prefix <- "SeqGSEA.example" # analysis parameters nCores <- 8 perm.times <- 1000 # >= 1000 recommended DEonly <- FALSE DEweight <- c(0.2, 0.5, 0.8) # a vector for different weights integrationMethod <- "linear" ### one step SeqGSEA running ### # NOT run in the example; uncomment the following 4 lines to run # CAUTION: running the following lines will generate lots of files in your working dir #runSeqGSEA(data.dir=data.dir, case.pattern=case.pattern, ctrl.pattern=ctrl.pattern, # geneset.file=geneset.file, geneID.type=geneID.type, output.prefix=output.prefix, # nCores=nCores, perm.times=perm.times, integrationMethod=integrationMethod, # DEonly=DEonly, DEweight=DEweight) @ \end{description} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session information} \begin{center} <>= sessionInfo() @ \end{center} \section*{Cleanup} This is a cleanup step for the vignette on Windows; typically not needed for users. <<>= allCon <- showConnections() socketCon <- as.integer(rownames(allCon)[allCon[, "class"] == "sockconn"]) sapply(socketCon, function(ii) close.connection(getConnection(ii)) ) @ \noindent %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\newpage \bibliographystyle{apalike} \bibliography{SeqGSEA} \end{document}