%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{EGSEA vignette}
%\VignettePackage{EGSEA}

\documentclass[11pt]{article}

<>=
BiocStyle::latex()
@

\begin{document}

\bioctitle[EGSEA]{Ensemble of Gene Set Enrichment Analyses}
\author{Monther Alhamdoosh\footnote{m.hamdoosh@gmail.com}, Milica Ng and Matthew Ritchie\footnote{mritchie@wehi.edu.au}}
\maketitle
\tableofcontents
\newpage

\section{Introduction}
The \Biocpkg{EGSEA} package implements the Ensemble of Gene Set Enrichment Analyses (EGSEA) algorithm, which utilizes the analysis results of eleven prominent GSE algorithms from the literature to calculate collective significance scores for each gene set.
These methods include: \Rfunction{ora} \cite{Tavazoie1999}, \Biocpkg{globaltest} \cite{Goeman2004}, \Rfunction{plage} \cite{Tomfohr2005}, \Biocpkg{safe} \cite{Barry2005}, \Rfunction{zscore} \cite{Lee2008}, \Biocpkg{gage} \cite{Luo2009}, \Rfunction{ssgsea} \cite{Barbie2009}, \Rfunction{roast} \cite{Wu2010}, \Biocpkg{PADOG} \cite{Tarca2012}, \Rfunction{camera} \cite{Wu2012} and \Biocpkg{GSVA} \cite{Hanzelmann2013}.
The \Rfunction{ora}, \Rfunction{gage}, \Rfunction{camera} and \Rfunction{gsva} methods depend on a competitive null hypothesis while the remaining seven methods are based on a self-contained hypothesis.
Conveniently, \Biocpkg{EGSEA} is not limited to these eleven GSE methods and new GSE tests can be easily integrated into the framework.
The \Rfunction{plage}, \Rfunction{zscore} and \Rfunction{ssgsea} algorithms are implemented in the \Biocpkg{GSVA} package, and \Rfunction{camera} and \Rfunction{roast} are implemented in the \Biocpkg{limma} package.
\Biocpkg{EGSEA} was implemented with parallel computation enabled using the \Rpackage{parallel} package.
There are two levels of parallelism in EGSEA: (i) parallelism at the method level and (ii) parallelism at the experimental contrast level.
A wrapper function was written for each individual GSE method to utilize existing R packages and create a universal interface for all methods.
The \Rfunction{ora} method was implemented using the \Rfunction{phyper} function from the \Rpackage{stats} package, which evaluates the hypergeometric distribution function for a $2 \times 2$ contingency table.
\par
RNA-seq reads are first aligned to the reference genome and mapped reads are assigned to annotated genomic features to obtain a summarized \textit{count matrix}.
The \Biocpkg{EGSEA} package was developed so that it can accept a count matrix or a \Robject{voom} object.
Most of the GSE methods were designed to work with microarray expression values rather than RNA-seq counts, hence the \Rfunction{voom} transformation is applied to the count matrix to generate an expression matrix suitable for use with these methods \cite{Law2014}.
Since gene set tests are most commonly applied when two experimental conditions are compared, a design matrix and a contrast matrix are used to construct the experimental comparisons of interest.
The target collection of gene sets is indexed so that the gene identifiers can be substituted with the indices of genes in the rows of the count matrix.
The GSE analysis is then carried out by each of the selected methods independently and an FDR value is assigned to each gene set.
Lastly, the ensemble functions are invoked to calculate collective significance scores for each gene set.
\par
The \Biocpkg{EGSEA} package also supports over-representation analysis on the EGSEA gene set collections that were adopted from the MSigDB, KEGG and GeneSetDB databases.

\section{Citation}
\label{sec:citation}
\begin{itemize}
\item Monther Alhamdoosh, Milica Ng, Nicholas J. Wilson, Julie M. Sheridan, Huy Huynh, Michael J. Wilson and Matthew E. Ritchie. Combining multiple tools outperforms individual methods in gene set enrichment analyses.
\end{itemize}

\section{Installation instructions}
The \Biocpkg{EGSEA} package was developed so that it harmonizes with the existing R packages in the CRAN repository and the Bioconductor project.

\subsection{System prerequisites}
\Biocpkg{EGSEA} has no system-level prerequisites: no external software package or library needs to be installed beforehand, regardless of the operating system.

\subsection{R package dependencies}
The \Biocpkg{EGSEA} package depends on several R packages that are not in the Bioconductor project. These packages are listed below:
\begin{itemize}
\item \CRANpkg{HTMLUtils} facilitates automated HTML report creation, in particular framed HTML pages and dynamically sortable tables. It is used in \Biocpkg{EGSEA} to generate the stats tables. To install it, type in the R console
\par \texttt{install.packages("HTMLUtils")}
\item \CRANpkg{hwriter} has easy-to-use and versatile functions to output R objects in HTML format. It is used in this package to create the HTML pages of the EGSEA report. To install it, type
\par \texttt{install.packages("hwriter")}
\item \CRANpkg{ggplot2} is an implementation of the grammar of graphics in R. It is used in this package to create the summary plots. To install it, type
\par \texttt{install.packages("ggplot2")}
\item \CRANpkg{gplots} has various R programming tools for plotting data. It is used in \Biocpkg{EGSEA} to create heatmaps. To install it, type
\par \texttt{install.packages("gplots")}
\item \CRANpkg{stringi} allows for fast, correct, consistent, portable and convenient character string/text processing in every locale and any native encoding. It is used in generating the HTML pages. To install it, type
\par \texttt{install.packages("stringi")}
\item \Rpackage{parallel} handles running larger chunks of computations in parallel. It is used to carry out gene set tests in parallel. It is usually installed with R.
\end{itemize}

\subsubsection{Bioconductor packages}
The Bioconductor packages that need to be installed in order for \Biocpkg{EGSEA} to function properly are: \Biocpkg{PADOG}, \Biocpkg{GSVA}, \Biocpkg{AnnotationDbi}, \Biocpkg{topGO}, \Biocpkg{pathview}, \Biocpkg{gage}, \Biocpkg{globaltest}, \Biocpkg{limma}, \Biocpkg{edgeR}, \Biocpkg{safe}, \Biocpkg{org.Hs.eg.db}, \Biocpkg{org.Mm.eg.db}, \Biocpkg{org.Rn.eg.db}.
These packages can be installed from Bioconductor using the following commands in the R console:
<>=
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("PADOG", "GSVA", "AnnotationDbi", "topGO", "pathview", "gage",
    "globaltest", "limma", "edgeR", "safe", "org.Hs.eg.db", "org.Mm.eg.db",
    "org.Rn.eg.db"))
@

\subsubsection{Essential data package}
The gene set collections that are used by \Biocpkg{EGSEA} to perform gene set testing were preprocessed and converted into R data objects to be used by the EGSEA functions.
The data objects are stored in an R package named \Rpackage{EGSEAdata}, which can be installed from Bitbucket.
\par
To install packages from Bitbucket, \CRANpkg{devtools} should be installed.
The \CRANpkg{devtools} package is available from CRAN.
On Windows, it appears to depend on having Rtools for Windows installed.
Rtools can be downloaded and installed from \url{http://cran.r-project.org/bin/windows/Rtools/}.
\par
To install \CRANpkg{devtools}, run
<>=
install.packages("devtools")
@
\par
To install \Rpackage{EGSEAdata}, run the following commands in the R console:
<>=
library(devtools)
install_bitbucket("malhamdoosh/egseadata", ref="Stable_Release")
@

\subsection{Installation}
\subsubsection{Bioconductor}
In the R console, type
<>=
source("http://bioconductor.org/biocLite.R")
biocLite("EGSEA")
@

\subsubsection{Bitbucket}
To install EGSEA from Bitbucket, type in the R console
<>=
library(devtools)
install_bitbucket("malhamdoosh/egsea", ref="Stable_Release")
@

\section{Quick start}
\subsection{EGSEA gene set collections}
The Molecular Signatures Database (MSigDB) \cite{Subramanian2005} v5.0 was downloaded from \url{http://www.broadinstitute.org/gsea/msigdb} (05 July 2015, date last accessed) and the human gene sets were extracted for each collection (h, c1, c2, c3, c4, c5, c6, c7).
Mouse orthologous gene sets of these MSigDB collections were adopted from \url{http://bioinf.wehi.edu.au/software/MSigDB/index.html} \cite{Wu2012}.
EGSEA uses Entrez Gene identifiers \cite{Maglott2005}, so alternate gene identifiers must first be converted into Entrez IDs.
KEGG pathways \cite{Kanehisa2000} for mouse and human were downloaded using the \Biocpkg{gage} package.
To extend the capabilities of EGSEA, a third database of gene sets was downloaded from the GeneSetDB project \cite{Araki2012} (\url{http://genesetdb.auckland.ac.nz/sourcedb.html}).
In total, more than 20,000 gene sets have been collated along with annotation information for each set (where available).
\par
The \Biocpkg{EGSEA} package has four main functions that utilize the gene set collections of \Rpackage{EGSEAdata}.
They map the Entrez Gene IDs of the dataset to the available gene sets of each collection and create an index for each gene set collection.
They also compile annotation information for each gene set to be integrated with the final EGSEA report. These functions are:
\begin{itemize}
\item \Rfunction{buildKEGGIdxEZID} indexes the KEGG pathway gene sets and loads gene set annotation. Type \texttt{?buildKEGGIdxEZID} in the console to see how to use this function.
\item \Rfunction{buildMSigDBIdxEZID} indexes the MSigDB gene sets and loads gene set annotation. Type \texttt{?buildMSigDBIdxEZID} in the console to see how to use this function.
\item \Rfunction{buildGeneSetDBIdxEZID} indexes the GeneSetDB gene sets and loads gene set annotation. Type \texttt{?buildGeneSetDBIdxEZID} in the console to see how to use this function.
\item \Rfunction{buildIdxEZID} indexes the MSigDB and KEGG gene sets and loads gene set annotation. Type \texttt{?buildIdxEZID} in the console to see how to use this function.
\end{itemize}
\par
The above functions require a vector of Entrez Gene IDs and the species name.
To use the output of these functions with the \Biocpkg{EGSEA} functions, the order of gene IDs in the \textit{entrezIDs} parameter should match that of the row names of the count matrix or the \Robject{voom} object.

\subsection{A simple example}
The \Biocpkg{EGSEA} package performs gene set enrichment analysis on a \Robject{voom} object generated by the \Rfunction{voom} function from the \Biocpkg{limma} package.
It was primarily developed to extend the limma-voom RNA-seq analysis pipeline.
\par
To get started quickly with \Biocpkg{EGSEA}, an example analysis of a human IL-13 dataset is presented here.
This experiment aims to identify the biological pathways and diseases associated with the cytokine Interleukin 13 (IL-13) using gene expression measured in peripheral blood mononuclear cells (PBMCs) obtained from 3 healthy donors.
The expression profiles of \textit{in vitro} IL-13 stimulation were generated using RNA-seq technology for $3$ PBMC samples at $24$ hours.
The transcriptional profiles of PBMCs without IL-13 stimulation were also generated to be used as controls.
Finally, an IL-13R$\alpha$1 antagonist was introduced into IL-13 stimulated PBMCs and the gene expression levels after 24h were profiled to examine the neutralization of IL-13 signaling by the antagonist.
Only two samples were available for the last condition.
Single-end 100bp reads were obtained via RNA-seq from total RNA using a HiSeq 2000 Illumina sequencer.
TopHat was used to map the reads to the human reference genome (GRCh37.p10).
HTSeq was then used to summarize reads into a gene-level count matrix.
The TMM method from the \Biocpkg{edgeR} package was used to normalize the RNA-seq counts.
\par
To perform EGSEA analysis on this dataset, the \Biocpkg{EGSEA} package is first loaded:
<>=
library(EGSEA)
@
Then, the \Robject{voom} data object of this experiment is loaded from \Rpackage{EGSEAdata}:
<>=
library(EGSEAdata)
data(il13.data)
v = il13.data$voom
names(v)
v$design
contrasts = il13.data$contra
contrasts
@
Before the \Rfunction{egsea} function is called, the gene set collection(s) need to be preprocessed and indexed using the EGSEA indexing functions presented earlier.
For example, to use the MSigDB c5 collection together with the KEGG pathways, excluding the Metabolism pathways, type
<>=
# prepare gene set collections
gs.annots = buildIdxEZID(entrezIDs=rownames(v$E), species="human",
    msigdb.gsets="c5", kegg.exclude = c("Metabolism"))
names(gs.annots)
@
Finally, the EGSEA analysis can be invoked using the \Rfunction{egsea} function as follows
<>=
# perform the EGSEA analysis
# set display.top = 20 to display more gene sets; this takes longer to run.
gsa = egsea(voom.results=v, contrasts=contrasts, gs.annots=gs.annots,
    symbolsMap=v$genes, baseGSEAs=egsea.base()[-2], display.top = 3,
    sort.by="avg.rank", egsea.dir="./il13-egsea-report", num.threads = 4)
topSets(gsa, contrast=1, gs.label="kegg", number = 10)
topSets(gsa, contrast=1, gs.label="kegg", sort.by="ora", number = 10,
    names.only=FALSE)
topSets(gsa, contrast="comparison", gs.label="kegg", number = 10)
@
\par
The \Rfunction{egsea} function returns a list of elements, one for each gene set collection and one for the comparative analysis.
Each element is itself a list of two elements: \Robject{top.gene.sets}, which stores the top \textit{display.top} gene sets for each contrast, and \Robject{test.results}, which stores the EGSEA test results for all gene sets along with the ensemble and individual rankings.
\par
The EGSEA report of this experiment can be launched from \url{./il13-egsea-report/index.html}.
%normalizePath("./il13-egsea-report")
\par
Finally, to run the EGSEA analysis with all the gene set collections that are available in the \Rpackage{EGSEAdata} package, the indexes can be built and combined as follows
<>=
gs.annots = buildIdxEZID(entrezIDs=rownames(v$E), species="human")
gsetdb.annots = buildGeneSetDBIdxEZID(entrezIDs=rownames(v$E), species="human")
gs.annots = c(gs.annots, gsetdb.annots)
names(gs.annots)
@

\section{Ensemble of Gene Set Enrichment Analysis}
Given an RNA-seq dataset $D$ of samples from $N$ experimental conditions, $K$ annotated genes $g_k (k=1,\cdots,K)$, $L$ experimental comparisons of interest $C_l (l=1, \cdots, L)$, a collection of gene sets $\Gamma$ and $M$ methods for gene set enrichment analysis, the objective of a GSE analysis is to find the most relevant gene sets in $\Gamma$ which explain the biological processes and/or pathways that are perturbed in expression in individual comparisons and/or across multiple contrasts simultaneously.
Numerous statistical gene set enrichment analysis methods have been proposed in the literature over the past decade.
Each method has its own characteristics and assumptions about the analyzed dataset and the gene sets tested.
In principle, gene set tests calculate a statistic for each gene individually, $f(g_k)$, and then integrate these gene-level statistics within a framework to estimate a set significance score $h(\gamma_i)$.
\par
We propose seven statistics to combine the individual gene set statistics across multiple methods, and to rank and hence identify biologically relevant gene sets.
Assume a collection of gene sets $\Gamma$, a given gene set $\gamma_i \in \Gamma$, and that the GSE analysis results of $M$ methods on $\gamma_i$ for a specific comparison (represented by ranks $R_i^m$ and statistical significance scores $p_i^m$, where $m=1, \cdots, M$ and $i=1, \cdots, |\Gamma|$) are given.
The EGSEA scores can then be devised, for each experimental comparison, as follows:
\begin{itemize}
\item The $p$-value score is the average $p$-value assigned to $\gamma_i$.
\item The minimum $p$-value score is the smallest $p$-value calculated for $\gamma_i$.
\item The minimum rank score of $\gamma_i$ is the smallest rank assigned to $\gamma_i$.
\item The average ranking score is the mean rank across the $M$ ranks.
\item The median ranking score is the median rank across the $M$ ranks.
\item The majority voting score is the most commonly assigned bin ranking.
\item The significance score assigns high scores to the gene sets with strong fold changes and high statistical significance.
\end{itemize}
It is worth noting that the $p$-value score can only be calculated under the assumption that the individual gene set tests are independent, and thus it is not an accurate estimate of the ensemble gene set significance, but it can still be useful for ranking results.
The significance score is scaled to the range $[0, 100]$ for each gene set collection.
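\par
As a small worked illustration of the $p$-value and rank based scores listed above, suppose, purely hypothetically, that $M=3$ methods assign a gene set $\gamma_i$ the ranks $R_i^1=2$, $R_i^2=5$, $R_i^3=11$ and the $p$-values $p_i^1=0.01$, $p_i^2=0.04$, $p_i^3=0.20$.
These numbers are invented for illustration, and the expressions below are a direct reading of the verbal definitions in the list rather than the exact formulas used inside the package.
Under that reading,
\[
\begin{array}{lcl}
\mbox{$p$-value score} & = & \displaystyle \frac{1}{M}\sum_{m=1}^{M} p_i^m = \frac{0.01 + 0.04 + 0.20}{3} \approx 0.083, \\[2ex]
\mbox{minimum $p$-value score} & = & \displaystyle \min_{m} \, p_i^m = 0.01, \\[1ex]
\mbox{minimum rank score} & = & \displaystyle \min_{m} \, R_i^m = 2, \\[1ex]
\mbox{average ranking score} & = & \displaystyle \frac{1}{M}\sum_{m=1}^{M} R_i^m = \frac{2 + 5 + 11}{3} = 6, \\[2ex]
\mbox{median ranking score} & = & \displaystyle \mathrm{median}_{m} \, R_i^m = 5.
\end{array}
\]
In this example the average rank is pulled up by the single poor rank of 11, whereas the median rank is not, which is one reason the two scores can order gene sets differently.
The majority voting and significance scores are left out of the illustration because they additionally depend on the bin definitions and on the gene-level fold changes.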
To learn more about the calculation of each EGSEA score, see the original paper listed in Section~\ref{sec:citation}.

\section{EGSEA report}
\begin{figure}[!h]
\includegraphics[width=0.7\paperwidth]{{./il13-egsea-report/hm-top-gs-kegg/hsa05310.heatmap.multi}.pdf}
\caption{Asthma heatmap for the comparative analysis}
\label{fig:heatmaps_multi}
\end{figure}
\begin{figure}[!h]
\includegraphics[width=0.8\paperwidth]{{./il13-egsea-report/pv-top-gs-kegg/X24IL13-X24/hsa05310.pathview}.png}
\caption{Asthma pathway map for the contrast X24IL13-X24}
\label{fig:pathways_single}
\end{figure}
\begin{figure}[!h]
\includegraphics[width=0.8\paperwidth]{{./il13-egsea-report/go-graphs/X24IL13-X24-c5-top-BP}.pdf}
\caption{The top significant Biological Processes (BP) from GO terms.}
\label{fig:go_terms}
\end{figure}
Since the number of annotated gene set collections in public databases continuously increases and there is a growing trend towards generating dynamic analytical tools, our software tool was developed to enable users to interactively navigate through the analysis results by generating an HTML \textit{EGSEA Report}.
The report presents the results in different ways.
For example, the \textit{Stats table} displays the top $n$ gene sets (where $n$ is selected by the user) for each experimental comparison and includes all calculated statistics.
Hyperlinks are enabled wherever possible to access additional information on the gene sets, such as annotation information.
The gene expression fold changes can be visualized using heat maps for individual gene sets (Fig.~\ref{fig:heatmaps_multi}) or projected onto pathway maps where available (e.g. KEGG gene sets) (Fig.~\ref{fig:pathways_single}).
The most significant Gene Ontology (GO) terms for each comparison can be viewed in a GO graph that shows their relationships (Fig.~\ref{fig:go_terms}).
\par
Additionally, EGSEA creates summary plots for each gene set collection to visualize the overall statistical significance of gene sets (Fig.~\ref{fig:summary_dir_single}).
Two types of summary plots are generated: (i) a plot that emphasizes the gene regulation direction and the significance score of a gene set and (ii) a plot that emphasizes the set cardinality and its rank.
EGSEA also generates a multidimensional scaling (MDS) plot that shows how various GSE methods rank a collection of gene sets (Fig.~\ref{fig:summary_methods}).
This plot gives insights into the similarity of different methods on a given dataset.
Finally, the reporting capabilities of EGSEA can be applied to any existing or newly developed GSE method by running EGSEA with only that method selected.
\begin{figure}[!h]
\includegraphics[width=0.7\paperwidth]{{./il13-egsea-report/summary/X24IL13-X24-kegg-summary.dir}.pdf}
\caption{Summary plot for the contrast X24IL13-X24}
\label{fig:summary_dir_single}
\end{figure}
\begin{figure}[!h]
\includegraphics[width=0.7\paperwidth]{{./il13-egsea-report/summary/X24IL13-X24-kegg-methods}.pdf}
\caption{The performance of multiple GSE methods on the contrast X24IL13-X24.}
\label{fig:summary_methods}
\end{figure}
Similar reporting capabilities are also provided for the comparative analysis results of EGSEA (Fig.~\ref{fig:pathways_multi} and Fig.~\ref{fig:summary_dir_multi}).
\begin{figure}[!h]
\includegraphics[width=0.8\paperwidth]{{./il13-egsea-report/pv-top-gs-kegg/hsa05310.pathview.multi}.png}
\caption{Asthma pathway map for the comparative analysis}
\label{fig:pathways_multi}
\end{figure}
\begin{figure}[!h]
\includegraphics[width=0.7\paperwidth]{{./il13-egsea-report/summary/kegg-summary.dir}.pdf}
\caption{Summary plot for the comparative analysis}
\label{fig:summary_dir_multi}
\end{figure}

\subsection{Comparative analysis}
Unlike most GSE methods that calculate a gene set enrichment score for a given gene set under a single experimental contrast (e.g. disease vs. control), the comparative analysis proposed here allows researchers to estimate the significance of a gene set across multiple experimental contrasts.
This analysis helps to identify biological processes that are perturbed by multiple experimental conditions simultaneously.
Comparative significance scores are calculated for each gene set across the contrasts.
\par
An interesting application of the comparative analysis would be finding pathways or biological processes that are activated by stimulation with a particular cytokine yet are completely inhibited when the cytokine's receptor is blocked by an antagonist, revealing the functions uniquely associated with the signaling of that particular receptor, as in the IL-13 experiment described above.

\section{EGSEA on a non-human dataset}
Epithelial cells from the mammary glands of female virgin 8--10 week-old mice were sorted into three populations of basal, luminal progenitor (LP) and mature luminal (ML) cells.
Three independent samples from each population were profiled via RNA-seq on total RNA using an Illumina HiSeq 2000 to generate 100bp single-end read libraries.
The \Biocpkg{Rsubread} aligner was used to align these reads to the mouse reference genome (\textit{mm10}) and mapped reads were summarized into gene-level counts using \Rfunction{featureCounts} with default settings.
The raw counts were then normalized using the TMM method.
Data are available from the GEO database as series GSE63310.
\par
To perform EGSEA analysis on this dataset, the following commands are invoked in the R console
<>=
# load the mammary dataset
library(EGSEA)
library(EGSEAdata)
data(mam.data)
v = mam.data$voom
names(v)
v$design
contrasts = mam.data$contra
contrasts
# build the gene set collections
gs.annots = buildIdxEZID(entrezIDs=rownames(v$E), species="mouse",
    msigdb.gsets = "c2", kegg.exclude = "all")
names(gs.annots)
# create Entrez IDs - Symbols map
symbolsMap = v$genes[,c(1,3)]
colnames(symbolsMap) = c("FeatureID", "Symbols")
symbolsMap[, "Symbols"] = as.character(symbolsMap[, "Symbols"])
# replace NA Symbols with IDs
na.sym = is.na(symbolsMap[, "Symbols"])
symbolsMap[na.sym, "Symbols"] = symbolsMap[na.sym, "FeatureID"]
# perform the EGSEA analysis
# set report = TRUE to generate the EGSEA interactive report
gsa = egsea(voom.results=v, contrasts=contrasts, gs.annots=gs.annots,
    symbolsMap=symbolsMap, baseGSEAs=egsea.base()[-c(2,5,6,9)],
    display.top= 20, sort.by="med.rank", egsea.dir="./mam-egsea-report",
    num.threads=4, report=FALSE)
# show top 20 comparative gene sets in C2 collection
topSets(gsa, contrast="comparison", gs.label="c2", number = 20)
@

\section{EGSEA on a count matrix}
The EGSEA analysis can also be performed directly on a count matrix, without first creating a \Robject{voom} object.
The \Rfunction{egsea.cnt} function can be invoked on a count matrix provided that the group of each sample is supplied along with the design and contrast matrices, as illustrated in this example.
This function uses the \Rfunction{voom} function from the \Biocpkg{limma} package to convert the RNA-seq counts into expression values.
\par
Here, the IL-13 human dataset is reanalyzed using the count matrix.
<>=
# load the count matrix and other relevant data
library(EGSEAdata)
data(il13.data.cnt)
cnt = il13.data.cnt$counts
group = il13.data.cnt$group
group
design = il13.data.cnt$design
contrasts = il13.data.cnt$contra
genes = il13.data.cnt$genes
# build the gene set collections
gs.annots = buildIdxEZID(entrezIDs=rownames(cnt), species="human",
    msigdb.gsets="none", kegg.exclude = c("Metabolism"))
# perform the EGSEA analysis
# set report = TRUE to generate the EGSEA interactive report
gsa = egsea.cnt(counts=cnt, group=group, design=design, contrasts=contrasts,
    gs.annots=gs.annots, symbolsMap=genes, baseGSEAs=egsea.base()[-2],
    display.top = 5, sort.by="avg.rank",
    egsea.dir="./il13-egsea-cnt-report", num.threads = 4, report=FALSE)
@

\section{EGSEA on a list of genes}
Since simple over-representation analysis (ORA) on large collections of gene sets is not readily available in Bioconductor, ORA was added to the \Biocpkg{EGSEA} package so that all the reporting capabilities of EGSEA can be used with it.
\par
To perform ORA using the DE genes of the \textit{X24IL13-X24} contrast from the IL-13 dataset, cut-off thresholds of 0.05 on the adjusted $p$-value and 1 on the absolute logFC are used to select a subset of DE genes.
Then, the \Rfunction{egsea.ora} function is invoked as illustrated in the following example
<>=
# load IL-13 dataset
library(EGSEAdata)
data(il13.data)
voom.results = il13.data$voom
contrast = il13.data$contra
# find Differentially Expressed genes
library(limma)
vfit = lmFit(voom.results, voom.results$design)
vfit = contrasts.fit(vfit, contrast)
vfit = eBayes(vfit)
# select DE genes (Entrez IDs and logFC) at adjusted p-value <= 0.05
# and |logFC| >= 1
top.Table = topTable(vfit, coef=1, number=Inf, p.value=0.05, lfc=1)
deGenes = as.character(top.Table$FeatureID)
logFC = top.Table$logFC
names(logFC) = deGenes
# build the gene set collection index
gs.annots = buildIdxEZID(entrezIDs=deGenes, species="human",
    msigdb.gsets="none", kegg.exclude = c("Metabolism"))
# perform the ORA analysis
# set report = TRUE to generate the EGSEA interactive report
gsa = egsea.ora(entrezIDs=deGenes,
    universe= as.character(voom.results$genes[,1]), logFC =logFC,
    title="X24IL13-X24", gs.annots=gs.annots,
    symbolsMap=top.Table[, c(1,2)], display.top = 5,
    egsea.dir="./il13-egsea-ora-report", num.threads = 4, report=FALSE)
@

\section{Non-standard gene set collections}
Researchers often have their own lists of gene sets and are interested in finding which sets are significant in the investigated dataset.
Additional collections of gene sets can be easily added and tested using the EGSEA algorithm.
The \Rfunction{buildCustomIdxEZID} function indexes newly created gene sets and attaches gene set annotation if provided.
To illustrate the use of this function, assume a list of gene sets is available where each gene set is represented by a character vector of Entrez Gene IDs.
In this example, 50 gene sets are selected from the KEGG collection and used to build a custom gene set collection index.
<>=
library(EGSEAdata)
data(il13.data)
v = il13.data$voom
# load KEGG pathways
data(kegg.pathways)
# select 50 pathways
gsets = kegg.pathways$human$kg.sets[1:50]
gsets[1]
# build custom gene set collection using these 50 pathways
gs.annots = buildCustomIdxEZID(entrezIDs=rownames(v$E), gsets= gsets,
    species="human")
names(gs.annots)
colnames(gs.annots$anno)
@
The \Rfunction{buildCustomIdxEZID} function creates an annotation data frame for the gene set collection if the \textit{anno} parameter is not provided.
Once the gene set collection is indexed, it can be used with any of the \Biocpkg{EGSEA} functions: \Rfunction{egsea}, \Rfunction{egsea.cnt} or \Rfunction{egsea.ora}.

\section{Adding a new GSE method}
If you have an interesting gene set test method that you would like to add to the EGSEA framework, please contact us and we will be happy to add your method to the next release of \Biocpkg{EGSEA}.

\addcontentsline{toc}{section}{References}
\bibliography{references}
\end{document}
% R CMD Sweave --engine=knitr::knitr --pdf EGSEA.Rnw