%\VignetteIndexEntry{An introduction to PROMISE}
%\VignetteDepends{Biobase,GSEABase}
%\VignetteKeywords{Microarray Association Pattern}
%\VignettePackage{PROMISE}
\documentclass[]{article}

\usepackage{times}
\usepackage{hyperref}

\newcommand{\Rpackage}[1]{{\textit{#1}}}

\title{An Introduction to \Rpackage{PROMISE}}
\author{Stan Pounds, Xueyuan Cao}
\date{June 24, 2014}

\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle

<<>>=
options(width=60)
@

\section{Introduction}

PROMISE (PROjection onto the Most Interesting Statistical Evidence) is a general procedure to identify genomic features that exhibit a specific, biologically interesting pattern of association with multiple phenotypic endpoint variables. Biological knowledge of the endpoint variables is used to define a vector that represents the biologically most interesting values for a set of association statistics. If prior biological knowledge of the endpoint variables is not available, a projection to point 0 is performed. PROMISE performs one hypothesis test for each genomic feature and is flexible enough to accommodate various types of endpoints. In this update, we also incorporate a fast permutation procedure based on a negative binomial sampling strategy, in which permutation for a gene stops once a fixed number of successes has been reached. In this document, we describe how to perform the PROMISE procedure using the hypothetical example data sets provided with the package.

\section{Requirements}

The PROMISE package depends on {\em Biobase} and {\em GSEABase}. An understanding of {\em ExpressionSet} and {\em GeneSetCollection} is a prerequisite for performing the PROMISE procedure. Because of the internal handling of multiple endpoints, the {\em ExpressionSet} and {\em GeneSetCollection} are assumed to be consistent with each other. The detailed requirements are illustrated below.

Load the PROMISE package and the example data sets (sampExprSet, sampGeneSet, and phPatt) into R.
<<>>=
library(PROMISE)
data(sampExprSet)
data(sampGeneSet)
data(phPatt)
@

The {\em ExpressionSet} should contain at least two components: {\em exprs} (array data) and {\em phenoData} (endpoint data). {\em exprs} is a matrix with column names representing the array identifiers (IDs) and row names representing the probe (genomic feature) IDs. {\em phenoData} is an {\em AnnotatedDataFrame} with column names representing the endpoint variables and row names representing the array IDs. The array IDs of {\em phenoData} and {\em exprs} must match.

<<>>=
arrayData <- exprs(sampExprSet)
ptData <- pData(phenoData(sampExprSet))
head(arrayData[, 1:4])
head(ptData)
# array IDs must agree, in the same order
all(colnames(arrayData) == rownames(ptData))
@

The {\em GeneSetCollection} should be extractable in the following way. Its probe IDs should be a subset of the probe IDs of {\em exprs}.

<<>>=
# build a two-column table: probe ID and gene set name
GS.data <- NULL
for (i in 1:length(sampGeneSet)) {
  tt <- sampGeneSet[i][[1]]
  this.name <- unlist(geneIds(tt))
  this.set <- setName(tt)
  this.data <- cbind.data.frame(
    featureID = as.character(this.name),
    setID = rep(as.character(this.set), length(this.name)))
  GS.data <- rbind.data.frame(GS.data, this.data)
}
# TRUE if every gene set probe ID is present in exprs
sum(!is.element(GS.data[, 1], rownames(arrayData))) == 0
@

The definition of the association pattern is critical. Prior biological knowledge is required to define the vector that represents the biologically most interesting values for the association statistics. If prior biological knowledge of the endpoint variables cannot be assumed, arbitrary \emph{stat.coef} values can be supplied, because they are ignored when \emph{proj0=TRUE}. In this hypothetical example, we are interested in identifying genomic features that are positively associated with active drug level, negatively associated with minimum disease, and positively associated with survival. The three endpoints are represented in three rows as shown below:

<<>>=
phPatt
@

\section{PROMISE Analysis}

As mentioned in Section 2, the PROMISE procedure requires the {\em ExpressionSet} and the pattern definition.
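
The fast permutation option used by the analyses in this section (\emph{nbperm=TRUE}, bounded by \emph{max.ntail} and \emph{nperms}) stops permuting a feature early. The chunk below is a minimal base R sketch of that idea for a single feature and a single endpoint, not the implementation in the package. The helper \texttt{nb.perm.pvalue} is hypothetical, the absolute Spearman correlation stands in for the PROMISE statistic, and the p-value estimator (tail count over permutations used after early stopping, otherwise the add-one estimate) is an assumption.

<<>>=
## Hypothetical helper for illustration only;
## not part of the PROMISE package.
nb.perm.pvalue <- function(x, y, max.ntail = 10,
                           nperms = 100, seed = 13) {
  set.seed(seed)    # reproducible permutations
  obs <- abs(cor(x, y, method = "spearman"))
  ntail <- 0        # permuted statistics >= observed
  nused <- 0        # permutations actually performed
  while (nused < nperms && ntail < max.ntail) {
    nused <- nused + 1
    perm <- abs(cor(x, sample(y), method = "spearman"))
    if (perm >= obs) ntail <- ntail + 1
  }
  ## Assumed estimator: ntail/nused after early stopping,
  ## (ntail + 1)/(nperms + 1) if all permutations were used.
  p <- if (ntail >= max.ntail) {
    ntail / nused
  } else {
    (ntail + 1) / (nperms + 1)
  }
  list(p.value = p, ntail = ntail, nperm.used = nused)
}

x <- 1:20
## perfectly associated endpoint
strong <- nb.perm.pvalue(x, x)
## weakly associated endpoint
weak <- nb.perm.pvalue(x, rep(c(1, 2), 10))
c(strong = strong$nperm.used, weak = weak$nperm.used)
@

Under this scheme, a feature with little association tends to reach \emph{max.ntail} well before \emph{nperms} permutations, so computing effort is concentrated on the features with small p-values.
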
{\em GeneSetCollection} is required only if a gene set enrichment analysis (GSEA) is to be performed within the PROMISE analysis.

The code below performs a PROMISE analysis without GSEA. As mentioned above, {\em GeneSetCollection} is not needed in this case, and the gene set result is NULL.

<<>>=
test1 <- PROMISE(exprSet=sampExprSet,
                 geneSet=NULL,
                 promise.pattern=phPatt,
                 strat.var=NULL,
                 proj0=FALSE,
                 nbperm=FALSE,
                 max.ntail=10,
                 seed=13,
                 nperms=100)
@

Gene level (genomic feature) result:

<<>>=
gene.res <- test1$generes
head(gene.res)
@

Gene set level result:

<<>>=
set.res <- test1$setres
head(set.res)
@

The code below performs a PROMISE analysis with GSEA, using fast permutation. As mentioned above, {\em GeneSetCollection} is required: sampGeneSet, a {\em GeneSetCollection}, is passed as an argument to PROMISE.

<<>>=
test2 <- PROMISE(exprSet=sampExprSet,
                 geneSet=sampGeneSet,
                 promise.pattern=phPatt,
                 strat.var=NULL,
                 proj0=FALSE,
                 nbperm=TRUE,
                 max.ntail=10,
                 seed=13,
                 nperms=100)
@

Gene level (genomic feature) result:

<<>>=
gene.res2 <- test2$generes
head(gene.res2)
@

Gene set level result:

<<>>=
set.res2 <- test2$setres
head(set.res2)
@

The code below performs a PROMISE analysis with GSEA, using fast permutation, without prior knowledge of the three endpoint variables (\emph{proj0=TRUE}). As before, sampGeneSet, a {\em GeneSetCollection}, is passed as an argument to PROMISE.

<<>>=
test3 <- PROMISE(exprSet=sampExprSet,
                 geneSet=sampGeneSet,
                 promise.pattern=phPatt,
                 strat.var=NULL,
                 proj0=TRUE,
                 nbperm=TRUE,
                 max.ntail=10,
                 seed=13,
                 nperms=100)
@

Gene level (genomic feature) result:

<<>>=
gene.res3 <- test3$generes
head(gene.res3)
@

Gene set level result:

<<>>=
set.res3 <- test3$setres
head(set.res3)
@

\end{document}