%\VignetteIndexEntry{Generally Applicable Gene-set/Pathway Analysis} %\VignetteDepends{multtest, graph, gageData} %\VignetteSuggests{Rgraphviz, RBGL} %\VignettePackage{gage} \documentclass[11pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{geometry} \usepackage{longtable} \usepackage[pdftex]{graphicx} \usepackage[authoryear,round]{natbib} \SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=TRUE,prefix=TRUE} % R part \newcommand{\R}[1]{{\textsf{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Metas}[1]{{\texttt{#1}}} \begin{document} \setkeys{Gin}{width=0.8\textwidth} \title{Generally Applicable Gene-set/Pathway Analysis} \author{Weijun Luo} \maketitle \begin{abstract} In this vignette, we demonstrate the \Rpackage{gage} package for gene set or pathway analysis. The \Rpackage{gage} package implement the GAGE method. GAGE is generally applicable independent of microarray data attributes including sample sizes, experimental designs, microarray platforms, and other types of heterogeneity, and consistently achieves superior performance over other frequently used methods. We introduce functions and data for routine and advanced gene set analysis, as well as results presentation and interpretation. We also cover package installation, data preparation and common application errors. \end{abstract} \section{Introduction} Gene set analysis (GSA) is a powerful strategy to infer functional and mechanistic changes from high through microarray data. GSA focuses on sets of related genes and has established major advantages over per-gene based different expression analyses, including greater robustness, sensitivity and biological relevance. However, classical GSA methods only have limited usage to a small number of microarray studies as they cannot handle datasets of different sample sizes, experimental designs and other types of heterogeneity \citep{luo:etal:2009}. To address these limitations, we present a new GSA method called Generally Applicable Gene-set Enrichment (GAGE) \citep{luo:etal:2009}. We have validated GAGE method extensively against the most widely used GSA methods in both applicability and performance \citep{luo:etal:2009}. In this manual, we focus on the implementation and usage of the R package, \Rpackage{gage}. In \Rpackage{gage} package, we provide a series of functions for basic GAGE analysis, result processing and presentation. We have also built pipeline routines for of multiple GAGE analysis on different comparisons or samples, the comparison of parallel analysis results, and even the combined analysis of heterogeneous data from different sources/studies. This package also supplies an example microarray dataset. In GAGE paper \citep{luo:etal:2009} and the old versions of \Rpackage{gage} package, we used a BMP6 microarray experiment as the demo data. Here we choose to use another published microarray dataset from GEO, GSE16873 \citep{emery:etal:2009}, as our primary demo data, as to show more and advanced functionality and applications of GAGE with this update package. GSE16873 is a breast cancer study \citep{emery:etal:2009}, which covers twelve patient cases, each with HN (histologically normal), ADH (ductal hyperplasia), and DCIS (ductal carcinoma in situ) RMA samples. Hence, there are 12*3=36 microarray hybridizations or samples interesting to us plus 4 others less interesting in this full dataset. Due to the size limit of this package, we split this GSE16873 into two halves, each including 6 patients with their HN and DCIS but not ADH tissue types. Raw data for these two halves were processed using two different methods, FARMS \citep{hochreiter:etal:2006} and RMA \citep{iriz:etal:2003}, respectively to generate the non-biological data heterogeneity. This \Rpackage{gage} package, we only include the first half dataset for 6 patients (processed using FARMS). The second half dataset plus the full dataset (including all HN, ADH and DCIS samples of all 12 patients, processed together using FARMS) and the original BMP6 dataset is supplied with a data package, \Rpackage{gageData}. Most of our demo analyses will be done on the first half dataset, except for the advanced analysis where we use both halves datasets with all 12 patients. The \Rpackage{gage} package also includes useful human gene set data derived from KEGG pathway and Gene Ontology databases. We also supply extra gene set data for other species including mouse, rat and yeast in the data package, \Rpackage{gageData}, available from the \href{http://bioconductor.org/packages/release/data/experiment/}{bioconductor website}. In addition, the users may derive other own gene sets using tools like \Rpackage{GSEABase} package. The manual is written by assuming the user has minimal R/Bioconductor knowledge. Some descriptions and code chunks cover very basic usage of R. The more experienced users may simply omit these parts. \section{Installation} Assume R has been correctly installed and accessible under current directory. Otherwise, please contact your system admin or follow the instructions on \href{http://www.r-project.org/}{R website}. Start R: from Linux/Unix command line, type \verb@R@ (Enter); for Mac or Windows GUI, double click the R application icon to enter R console. End R: type in \verb@q()@ when you are finished with the analysis using R, but not now. Two options: Either install with Bioconductor installation script \verb@biocLite@ directly: <>= source("http://bioconductor.org/biocLite.R") biocLite(c("gage","gageData")) @ Or install without using Bioconductor, but require packages \Rpackage{graph} and \Rpackage{multtest} to be installed too: Download \Rpackage{gage} and \Rpackage{gageData} packages (make sure with proper version number and zip format) and save to \verb@/your/local/directory/@. <>= install.packages(c("/your/local/directory/gage_2.0.0.tar.gz", "/your/local/directory/gageData_1.0.0.tar.gz"), repos = NULL, type = "source") @ \section{Get Started} Under R, first we load the \Rpackage{gage} package: <>= options(width=80) @ <>= library(gage) @ To see a brief overview of the package: <>= library(help=gage) @ To get help on any function (say the main function, \verb@gage@), use the \verb@help@ command in either one of the following two forms: <>= help(gage) ?gage @ \section{Basic Analysis} GAGE \citep{luo:etal:2009} is a widely applicable method for gene set or pathway analysis. In such analysis, we check for coordinated differential expression over gene sets instead of changes of individual genes. Gene sets are pre-defined groups of genes, which are functionally related. Commonly used gene sets include those derived from KEGG pathways, Gene Ontology terms, gene groups that share some other functional annotations, including common transcriptional regulators (like transcription factors, small interfering RNA's or siRNA etc), genomic locations etc. Consistent perturbations over such gene sets frequently suggest mechanistic changes. GSA is much more robust and sensitive than individual gene analysis \citep{luo:etal:2009}, especially considering the functional redundancy of individual genes and the noise level of high throughput assay data. GAGE method implemented in this package makes GSA feasible to all microarray studies, irrespecitve of the sample size, experiment design, array platform etc. Let's start with the most basic gene set analysis. We load and look at the demo microarray data first. <>= data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =T) adh=grep('ADH',cn, ignore.case =T) dcis=grep('DCIS',cn, ignore.case =T) print(hn) print(dcis) @ Check to make sure your gene sets and your expression data are using the same ID system (Entrez Gene ID, Gene symbol, or probe set ID etc). Entrez Gene IDs are integer numbers, Gene symbols are characters, and probe set IDs (Affymetrix GeneChip) are characters with \verb@_at@ suffix. <>= data(kegg.gs) data(go.gs) lapply(kegg.gs[1:3],head) head(rownames(gse16873)) @ We normally do GAGE analysis using gene sets derived from on KEGG pathways or GO term groups. Here we use the human gene set data coming with this package with the human microarray dataset GSE16873. Note that the \verb@go.gs@ derived from human GO database only includes 1000 gene sets due to size limit. For full \verb@go.gs@ or gene sets data for other species, we may always use the \Rpackage{gageData} package or compile our own gene set data. Here we look at the expression changes towards one direction (either up- or down- regulation) in the gene sets. The results of such 1-directional analysis is a list of two matrices, corresponding to the two directions. Each result matrix contains multiple columns of test statistics and p-/q-values for all tested gene sets. Among them, p.val is the global p-value and q.val is the corresponding FDR q-value. Gene sets are ranked by significance. Type \verb@?gage@ for more information. <>= gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) #go.gs here only the first 1000 entries as a fast example. gse16873.go.p <- gage(gse16873, gsets = go.gs, ref = hn, samp = dcis) str(gse16873.kegg.p, strict.width='wrap') head(gse16873.kegg.p$greater[, 1:5], 4) head(gse16873.kegg.p$less[, 1:5], 4) head(gse16873.kegg.p$stats[, 1:5], 4) @ As described in GAGE paper \citep{luo:etal:2009}, it is worthy to run \Rfunction{gage} with \verb@same.dir=F@ option on KEGG pathways too to capture pathways perturbed towards both directions simultaneously. Note we normally do not use \verb@same.dir=F@ option on GO groups, which are mostly gene sets coregulated towards the same direction. <>= gse16873.kegg.2d.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, same.dir = F) head(gse16873.kegg.2d.p$greater[,1:5], 4) head(gse16873.kegg.2d.p$stats[,1:5], 4) @ We may also do PAGE analysis \citep{kim:etal:2005} with or without different combinations of GAGE style options. The key difference is 1) to use z-test instead of two-sample t-test and 2) a group-on-group comparison scheme (by setting arguement \verb@compare="as.group"@) instead of default one-on-one scheme (\verb@compare="paired"@) as in GAGE. Here we only used z-test option to compare the net effect of using differet gene set tests, as detailed in \cite{luo:etal:2009}. <>= gse16873.kegg.page.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, saaTest = gs.zTest) head(gse16873.kegg.page.p$greater[, 1:5], 4) @ We get much smaller p-/q-values with PAGE than with GAGE. As described in GAGE paper \citep{luo:etal:2009}, many significant calls made by PAGE are likely false positives. With \Rpackage{gage}, we have much more options to tweak than those related to PAGE for more customized GAGE analysis. Here we show a few other alternative ways of doing GAGE analysis by setting other arguements. Use t-test statistics instead of fold change as per gene score: <>= gse16873.kegg.t.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, use.fold = F) @ If you are very concerned about the normality of expression level changes or the gene set sizes are mostly smaller than 10, we may do the non-parametric Mann Whitney U tests (rank test without distribution assumption) instead of the parametric two-sample t-tests on gene sets: <>= gse16873.kegg.rk.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, rank.test = T) @ Do the non-parametric Kolmogorov-Smirnov tests (like rank test, used in GSEA) instead of the parametric two-sample t-tests on gene sets: <>= gse16873.kegg.ks.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, saaTest = gs.KSTest) @ Frequently, the samples are not one-on-one paired in the experiment design. In such cases, we take the samples as unpaired: <>= gse16873.kegg.unpaired.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, compare = "unpaired") @ Since version 2.2.0, gage package does more robust p-value summarization using Stouffer's method through argument \verb@use.stouffer=TRUE@. The original p-value summarization, i.e. negative log sum following a Gamma distribution as the Null hypothesis, may produce less stable global p-values for large or heterogenous datasets. In other words, the global p-value could be heavily affected by a small subset of extremely small individual p-values from pair-wise comparisons. Such sensitive global p-value leads to the "dual signficance" phenomenon. Dual-signficant means a gene set is called significant simultaneously in both 1-direction tests (up- and down-regulated). While not desirable in other cases, "dual signficance" could be informative in revealing the sub-types or sub-classes in big clinical or disease studies. If we preferred the original Gamma distribution based p-value summarization, we only need to set \verb@use.stouffer=FALSE@: <>= gse16873.kegg.gamma.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, use.stouffer=FALSE) @ Sometimes we have expression data where genes are labelled by symbols (or other IDs). Let's make such a dataset \Robject{gse16873.sym} from \Robject{gse16873}, which is originally lable by Entrez Gene IDs. <>= head(rownames(gse16873)) gse16873.sym<-gse16873 data(egSymb) rownames(gse16873.sym)<-eg2sym(rownames(gse16873.sym)) head(rownames(gse16873.sym)) @ If we want to do GAGE analysis, we can convert rownames of \Robject{gse16873.sym} back to Entrez Gene IDs or convert gene set definitions to gene symbols. <>= kegg.gs.sym<-lapply(kegg.gs, eg2sym) lapply(kegg.gs.sym[1:3],head) gse16873.kegg.p2 <- gage(gse16873.sym, gsets = kegg.gs.sym, ref = hn, samp = dcis) @ \section{Result Presentation and Intepretation} We may output full result tables from the analysis. <>= write.table(gse16873.kegg.2d.p$greater, file = "gse16873.kegg.2d.p.txt", sep = "\t") write.table(rbind(gse16873.kegg.p$greater, gse16873.kegg.p$less), file = "gse16873.kegg.p.txt", sep = "\t") @ Sort and count signficant gene sets based on q- or p-value cutoffs: <>= gse16873.kegg.sig<-sigGeneSet(gse16873.kegg.p, outname="gse16873.kegg") str(gse16873.kegg.sig, strict.width='wrap') gse16873.kegg.2d.sig<-sigGeneSet(gse16873.kegg.2d.p, outname="gse16873.kegg") str(gse16873.kegg.2d.sig, strict.width='wrap') write.table(gse16873.kegg.2d.sig$greater, file = "gse16873.kegg.2d.sig.txt", sep = "\t") write.table(rbind(gse16873.kegg.sig$greater, gse16873.kegg.sig$less), file = "gse16873.kegg.sig.txt", sep = "\t") @ \begin{figure} \begin{center} \includegraphics[width=10cm,type=pdf,ext=.pdf,read=.pdf,page=3]{gse16873.kegg.gs.heatmap} \includegraphics[width=10cm,type=pdf,ext=.pdf,read=.pdf,page=2]{gse16873.kegg.gs.2d.heatmap} \caption{Example heatmaps generated with \Rfunction{sigGeneSet} function as to show whole gene set perturbations. Upper panel: signficant KEGG pathways in 1-directional (either up or down) test; lower panel: signficant KEGG pathways in 2-directional test. Only gene set test statistics are shown, heatmaps for -log10(p-value) look similar.} \end{center} \end{figure} Heatmaps in Figure 1 visualize whole gene set perturbations, i.e. the gene set test statistics (or p-values) in pseudo-color. There are frequently multiple significant gene sets that share multiple member genes or represent the same regulatory mechanism. Such redundancy in signficant gene set could be misleading too. A pathway or functional group itself is not relevant, but may be called signficant because it share multiple perturbed genes with one really significant gene set. We derive the non-redundant signficant gene set lists, with result output as tab-delimited text files automatically below. Here, function \Rfunction{esset.grp} calls redundant gene sets to be those overlap heavily in their core genes. Core genes are those member genes that really contribute to the signficance of the gene set in GAGE analysis. Arguement \verb@pc@ is the cutoff p-value for the overlap between gene sets, default to \verb@10e-10@, may need trial-and-error to find the optimal value in minor cases. Note that we use small \verb@pc@ because redundant gene sets are not just signficant in overlap, but are HIGHLY significant so. <>= gse16873.kegg.esg.up <- esset.grp(gse16873.kegg.p$greater, gse16873, gsets = kegg.gs, ref = hn, samp = dcis, test4up = T, output = T, outname = "gse16873.kegg.up", make.plot = F) gse16873.kegg.esg.dn <- esset.grp(gse16873.kegg.p$less, gse16873, gsets = kegg.gs, ref = hn, samp = dcis, test4up = F, output = T, outname = "gse16873.kegg.dn", make.plot = F) names(gse16873.kegg.esg.up) head(gse16873.kegg.esg.up$essentialSets, 4) head(gse16873.kegg.esg.up$setGroups, 4) head(gse16873.kegg.esg.up$coreGeneSets, 4) @ We may extract the gene expression data for top gene sets, output and visualize these expression data for further intepretation. Although we can check expression data in each top gene set one by one, here we work on the top 3 up-regulated KEGG pathways in a batch. The key functions we use here are, \Rfunction{essGene} (for extract genes with substiantial or above-background expression changes in gene sets) and \Rfunction{geneData} (for output and visualization of expression data in gene sets). <>= rownames(gse16873.kegg.p$greater)[1:3] gs=unique(unlist(kegg.gs[rownames(gse16873.kegg.p$greater)[1:3]])) essData=essGene(gs, gse16873, ref =hn, samp =dcis) head(essData, 4) ref1=1:6 samp1=7:12 for (gs in rownames(gse16873.kegg.p$greater)[1:3]) { outname = gsub(" |:|/", "_", substr(gs, 10, 100)) geneData(genes = kegg.gs[[gs]], exprs = essData, ref = ref1, samp = samp1, outname = outname, txt = T, heatmap = T, Colv = F, Rowv = F, dendrogram = "none", limit = 3, scatterplot = T) } @ Figure 2 shows example heatmap and scatter plots generated by the code chunk above. \begin{figure} \begin{center} \includegraphics[width=10cm,type=pdf,ext=.pdf,read=.pdf]{Oxidative_phosphorylation.geneData.heatmap} \includegraphics[width=10cm,type=pdf,ext=.pdf,read=.pdf]{Oxidative_phosphorylation.geneData} \caption{Example heatmap and scatter plot generated with \Rfunction{geneData} function to show the gene expresion perturbations in specified gene set(s). Only HN (control) and DCIS (experiment) data from the first two patients are plotted in the scatter plot.} \end{center} \end{figure} \section{Advanced Analysis} We frequently need to do GAGE analysis repetitively on mulitple comparisons (with different samples vs references) in a study, or even with different analysis options (paired or unpaired samples, use or not use rank test etc). We can carry these different analyses with different sub-datasets all at once using a composite funciton, \Rfunction{gagePipe}. Different from \Rfunction{gage}, \Rfunction{gagePipe} accepts lists of reference/sample column numbers, with matching lists/vectors of other arguements. Function \Rfunction{gagePipe} runs multiple rounds of GAGE in a batch without interference, and outputs signficant gene set lists in text format and save the results in \verb@.RData@ format. <>= #introduce another half dataset library(gageData) data(gse16873.2) cn2=colnames(gse16873.2) hn2=grep('HN',cn2, ignore.case =T) dcis2=grep('DCIS',cn2, ignore.case =T) #batch GAGE analysis on the combined dataset gse16873=cbind(gse16873, gse16873.2) dataname='gse16873' #output data prefix sampnames=c('dcis.1', 'dcis.2') refList=list(hn, hn2+12) sampList=list(dcis, dcis2+12) gagePipe(gse16873, gsname = "kegg.gs", dataname = "gse16873", sampnames = sampnames, ref.list = refList, samp.list = sampList, comp.list = "paired") @ We may further loaded the \verb@.RData@ results, and do more analysis. For instance, we may compare the GAGE analysis results from the differen comparisons or different sub-datasets we have worked on. Here, the main function to use is \Rfunction{gageComp}. Comparison rsults between multiple GAGE analyses will be output as text files and optionally, venn diagram can be plotted in PDF format as shown in Figure 3. <>= load('gse16873.gage.RData') gageComp(sampnames, dataname, gsname = "kegg.gs", do.plot = TRUE) @ \begin{figure} \begin{center} \includegraphics[trim = 25mm 40mm 25mm 40mm, width=8cm,type=pdf,ext=.pdf,read=.pdf,page=2]{gse16873.gage.comp} \caption{An example venn diagram generated with \Rfunction{gageComp} function. Compared are KEGG pathway results for the two half datasets when \Rfunction{gagePipe} was applied above.} \end{center} \end{figure} GAGE with single array analysis design also provide a framework for combined analysis accross heterogeneous microarray studies/experiments. The combined dataset of \verb@gse16873@ and \verb@gse16873.2@ provids a good example of heterogeneous data. As mentioned above, the two half-datasets were processed using FARMS \citep{hochreiter:etal:2006} and RMA \citep{iriz:etal:2003} methods separately. Therefore, the expression values and distributions are dramatically different between the two halves. Using function \Rfunction{heter.gage} we can do some combined analysis on such heterogeneous dataset(s). \Rfunction{heter.gage} is similar to \Rfunction{gagePipe} in that \verb@ref.list@ and \verb@samp.list@ arguements need to be lists of column numbers with matching vector of the \verb@compare@ arguement. Different from \Rfunction{gagePipe}, \Rfunction{heter.gage} does one combined GAGE analysis on all data instead of multiple separate analyses on different sub-datasets/comparisons. Just to have an idea on how heterogeneous these two half datasets are, we may visualize the expression level distributions (Figure 4): <>= boxplot(data.frame(gse16873)) gse16873.kegg.heter.p <- heter.gage(gse16873, gsets = kegg.gs, ref.list = refList, samp.list = sampList) gse16873.kegg.heter.2d.p <- heter.gage(gse16873, gsets = kegg.gs, ref.list = refList, samp.list = sampList, same.dir = F) @ \begin{figure} \begin{center} \includegraphics[width=10cm,type=pdf,ext=.pdf,read=.pdf]{gage-heter.gage} \caption{Sample-wise gene expression level distributions for GSE16873 with the two differently processed half datasets.} \end{center} \end{figure} We may compare the results from this combined analysis of the combined dataset vs the analysis on the first half dataset above. As expected the top gene sets from this combined analysis are consistent yet with smaller p-values due to the use of more data. \section{Common Errors} \begin{itemize} \item Gene sets and expression data use different ID systems (Entrez Gene ID, Gene symbol or probe set ID etc). To correct, use functions like eg2sym or sym2eg, or if you used customized CDF file based on Entrez Gene to processed the raw data, do: \verb@rownames(gse16873)=gsub('_at', '', rownames(gse16873))@. \item Expression data have multiple probesets (as in Affymetrix GeneChip Data) for a single gene, but gene set analysis requires one entry per gene. You may pick up the most differentially expressed probeset for a gene and discard the rest, or process the raw intensity data using customized probe set definition (CDF file), where probes are re-mapped on a gene by gene base. Check the Methods section of GAGE paper \citep{luo:etal:2009} for details. \item Expression data have genes as columns and samples/chips as rows, i.e. in a transposed form. To correct, do: \verb@expdata=t(expdata)@. \end{itemize} \bibliographystyle{plainnat} \bibliography{gage} \end{document}