\name{GSNormalize} \alias{GSNormalize} \alias{identity} \alias{one} \title{Aggregating and calculating expression statistics by Gene Set} \description{ Provides an interface for producing aggregate gene-set statistics, for gene-set-enrichment analysis (GSEA). The function is best suited for mean or rescaled-mean GSEA approaches, but is hopefully generic enough to enable other approaches as well. } \usage{ GSNormalize(dataset, incidence, gseaFun = crossprod, fun1 = "/", fun2 = sqrt, removeShift=FALSE, removeStat=mean, ...) identity(x) one(x) } \arguments{ \item{dataset}{ a numeric matrix, typically of some gene-level statistics } \item{incidence}{ 0/1 incidence matrix indicating genes' membership in gene-sets} \item{gseaFun}{function name for the type of aggregation to take place, defaults to 'crossprod'. See 'Details' } \item{fun1}{function name for normalization, defaults to "/". See 'Details' } \item{fun2}{function name for scaling, defaults to 'sqrt'. See 'Details'} \item{removeShift}{logical: should normalization begin with a column-wise removal of the mean shift?} \item{removeStat}{(if above is TRUE) the column-wise statistic to be swept out of 'dataset'.} \item{...}{Additional arguments optionally passed on to 'gseaFun'.} \item{x}{any numerical value} } \details{ In gene-set-enrichment analysis (GSEA), the core step is aggregating (or calculating) gene-set-level statistics from gene-set statistics. This utility achieves the feat. It is tailored specifically for rescaled-sums of the type suggested by Jiang and Gentleman (2007), but is designed as a generic template that should other GSEA approaches. In such cases, at this moment users should provide their own version of 'gseaFun'. The default will generate sums of gene-level values divided by the square-root of the gene-set size (in other words, gene-set means multiplied by the square-root of gene-set size). The arithmetic works like this: gene-set stat = gseaFun(t(incidence),dataset),...) 'fun1' fun2(gene-set size). In case there is a known (or suspected) overall baseline shift (i.e., the mass of gene-level stats is not centered around zero) it may be scientifically more meaningful to look for gene-set deviating from this baseline rather than from zero. In this case, you can set 'removeShift=TRUE'. Also provided are the 'identity' function (identity = function(x) x), so that leaving 'gseaFun' and 'fun1' at their default and setting 'fun2 = identity' will generate gene-set means -- and the 'one' function to neutralize the effect of both 'fun1' and 'fun2' (see note below). } \value{ 'GSNormalize' returns a matrix with the same number of rows as 'incidence' and the same number of columns as 'dataset' (if 'dataset' is a vector, the output will be a vector as well). The respective row and column names will carry through from 'dataset' and 'incidence' to the output. 'identity' simply returns x. 'one' returns the number 1. } \references{ Z. Jiang and R. Gentleman, "Extensions to Gene Set Enrichment Analysis",Bioinformatics (23),306-313, 2007.} \author{Assaf Oron } \note{ If you want to create your own GSEA function for 'gseaFun', note that it should receive the transposed incidence matrix as its first argument, and the gene-level stats as its second argument. In other words, both should have genes as rows. also, you can easily neutralize the effect of 'fun1', 'fun2' by setting "fun2 = one". } \seealso{ \code{\link{gsealmPerm}}, which relies heavily on this function. The function \code{\link[Category]{applyByCategory}} from the \code{\link[Category]{Category}} package has similar functionality and is preferable when the applied function is complicated. \code{\link{GSNormalize}} is better optimized for matrix operations.} \examples{ data(sample.ExpressionSet) lm1 = lmPerGene(sample.ExpressionSet,~sex+type) ### Generating random pseudo-gene-sets fauxGS=matrix(sample(c(0,1),size=50000,replace=TRUE,prob=c(.9,.1)),nrow=100) ### "tau-stats" for gene-SET-level type effect, adjusting for sex fauxEffects=GSNormalize(lm1$coefficients[3,]/sqrt(lm1$coef.var[3,]),incidence=fauxGS) qqnorm(fauxEffects) ### diagonal line represents zero-shift null; note that it doesn't fit abline(0,1,col=2) ### a better option may be to run a diagonal through the middle of the ### data (nonzero-shift null, i.e. type may have an effect but it is the ### same for all gene-sets); note that if any outlier shows, it is a purely random one! abline(median(fauxEffects),1,col=4) #### Now try with baseline-shift removal fauxEffects=GSNormalize(lm1$coefficients[3,]/sqrt(lm1$coef.var[3,]),incidence=fauxGS,removeShift=TRUE) qqnorm(fauxEffects) abline(0,1,col=2) } \keyword{ methods }