%\VignetteIndexEntry{Global Test}
%\VignetteDepends{globaltest, Rgraphviz, golubEsets, hu6800.db, vsn, lungExpression, GO.db, Biobase, org.Hs.eg.db, annotate, KEGGREST}
%\VignetteKeywords{Hypothesis testing}
%\VignettePackage{globaltest}
\documentclass{report}
\usepackage{natbib, Sweave, times}
\bibliographystyle{apalike}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}
\sloppy
\title{The Global Test \\ and the \Rpackage{globaltest} \texttt{R} package}
\author{Jelle Goeman \and Jan Oosting \and Livio Finos \and Aldo Solari \and Dominic Edelmann}
\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle
\tableofcontents
\newpage

\chapter{Introduction}

This vignette explains the use of the \Rpackage{globaltest} package. Chapter \ref{general chapter} describes the use of the test and the package from a general statistical perspective. Later chapters explain how to use the \Rpackage{globaltest} package for specific applications.

\section{Citing \Rpackage{globaltest}}

When using the \Rpackage{globaltest} package, please cite one or more of the following papers, as appropriate.
\begin{itemize}
\item \cite{Goeman2004} is the original paper describing the global test for linear and logistic regression, and its application to gene set testing.
\item \cite{Goeman2005} extends the global test to survival data and explains how to deal with nuisance (null) covariates.
\item \cite{Goeman2006} proves the local optimality of the global test and explores its general theoretical properties. This is the core paper of the global test methodology.
\item \cite{Goeman2008} develops the Focus Level method for multiple testing correction in the Gene Ontology graph.
\item \cite{Goeman2009} derives the asymptotic distribution of the global test for generalized linear models.
\item \cite{Jelier2011} describes the weighted test based on concept profiles (Section \ref{concept}).
\item \cite{Goeman2012} describes the inheritance multiple testing procedure that is used in the \Rfunction{covariates} plot.
\end{itemize}

\section{Package overview}

The global test is meant for data sets in which many covariates (or features) have been measured for the same subjects, together with a response variable, e.g.\ a class label, a survival time or a continuous measurement. The global test can be used on a group (or subset) of the covariates, testing whether that group of covariates is associated with the response variable. The null hypothesis of the global test is that none of the covariates in the tested group is associated with the response. The alternative is that at least one of the covariates has such an association. However, the global test is designed in such a way that it is especially directed against the alternative that most of the covariates are associated with the response in a small way. In fact, against such an alternative the global test is the optimal test to use \citep{Goeman2006}.

The global test is based on regression models in which the distribution of the response variable is modeled as a function of the covariates. The type of regression model depends on the response.
Currently implemented models are
\begin{itemize}
\item linear regression (continuous response),
\item logistic regression (binary response),
\item multinomial logistic regression (multi-class response),
\item Poisson regression (count response),
\item the Cox proportional hazards model (survival response).
\end{itemize}
Modeling in terms of a regression model makes it easy to adjust the test for the confounding effect of nuisance covariates: covariates that are known to have an effect on the response, that are correlated with (some of) the covariates of interest, and that may, if not adjusted for, lead to spurious associations.

The \Rpackage{globaltest} package implements the global test along with additional functionality. Several diagnostic plots can be used to visualize the test result and to decompose it to see the influence of individual covariates and subjects. Multiple testing procedures are offered for the situation in which a user wants to perform many global tests on the same data, e.g.\ when testing many alternative subsets. In that case, relationships between the test results that arise due to subset relationships among the tested sets may be exploited. The package also offers some functions that are tailored to specific applications of the global test. In the current version, the only application supported in this way is gene set testing (see Chapter \ref{gene set testing}). Tailored functions for other applications (goodness-of-fit testing, prediction/classification pre-testing, testing for the presence of a random effect) are under development.

\section{Comparison with the likelihood ratio test}

In its most general form, the global test is a score test for nested parametric models, and as such it is a competitor of the likelihood ratio test. It can be used in every situation in which a likelihood ratio test may also be used, but the global test's properties are different from those of the likelihood ratio test. We summarize the differences briefly from a theoretical statistical perspective. For more details, see \cite{Goeman2006}.

It is well known that the likelihood ratio test is invariant to the parametrization of the alternative model. The global test does not have this property: it depends on the model's precise parametrization. Therefore, there is not a single global test for a given pair of null and alternative hypotheses, but a multitude of tests: one for each possible parametrization of the alternative hypothesis. In return for giving up this parametrization invariance, the global test gains an optimality property that depends on the parametrization of the model. As detailed in \cite{Goeman2006}, the global test is optimal (among all possible tests) on average in a neighborhood of the null hypothesis. The shape of this neighborhood is determined by the parametrization of the alternative hypothesis. In practice, this means that in situations in which a ``natural'' parametrization of the alternative model exists, the global test for that parametrization is often more powerful than the likelihood ratio test \citep[examples in][]{Goeman2006}.

A second important property of the global test is that it may still be used in situations in which the alternative model cannot be fitted to the data, which may happen, for example, if the alternative model is overparameterized, or in high-dimensional situations in which there are more parameters than observations.
In such cases the likelihood ratio test usually breaks down, but the global test still functions, often with good power. Being a score test, the global test is most focused on alternatives close to the null hypothesis. This means that the global test is good at detecting alternatives that have many small effects (in terms of the chosen parametrization), but that it may not be the optimal test to use if the effects are very large.

\chapter{The global test}
\label{general chapter}

\section{Global test basics}
\label{basics}

We illustrate most of the features of the \Rpackage{globaltest} package and its functions with a very simple application on simulated data using a linear regression model. More extensive real examples relating to specific areas of application can be found in later chapters of this vignette.

\subsection{Example data}
\label{example data}

We simulate some data
<<>>=
set.seed(1)
Y <- rnorm(20)
X <- matrix(rnorm(200), 20, 10)
X[,1:3] <- X[,1:3] + Y
colnames(X) <- LETTERS[1:10]
@
This generates a data matrix \Robject{X} with 10 covariates called \Robject{A}, \Robject{B}, \ldots, \Robject{J}, and a response \Robject{Y}. In truth, the covariates \Robject{A}, \Robject{B}, and \Robject{C} are associated with \Robject{Y}, and the rest are not.

We load the \Rpackage{globaltest} package
<<>>=
library(globaltest)
@

\subsection{Options}

The \Rpackage{globaltest} package has a \Rfunction{gt.options} function, which can be used to set some global options of the package. We use it in this vignette to switch off the progress information, which is useful if the functions are used interactively, but does not combine well with \Rfunction{Sweave}, which was used to make this vignette. We also set the \verb:max.print: option in \Rpackage{globaltest}, which abbreviates long Gene Ontology terms in Chapter \ref{gene set testing}.
<<>>=
gt.options(trace=FALSE, max.print=45)
@

\subsection{The test}
\label{basic test}

The main workhorse function of the \Rpackage{globaltest} package is the \Rfunction{gt} function, which performs the actual test. There are several alternative ways to call this function, depending on the user's preference for working with \Rclass{formula} objects or matrices. We start with the \Rclass{formula}-based way, because this is closest to the statistical theory. Matrix-based calls are detailed in Section \ref{alternative calls}.

In the data set of Section \ref{example data}, if we are interested in testing for association between the group of variables \verb:A:, \verb:B: and \verb:C: and the response \verb:Y:, we can test the null hypothesis \verb:Y ~ 1: that the response depends on none of the variables in the group, against the alternative hypothesis \verb:Y ~ A + B + C: that \verb:A:, \verb:B: and \verb:C: may have an influence on the response. We test this with
<<>>=
gt(Y~1, Y~A+B+C, data = X)
@
Unlike in \Rfunction{anova}, the order of the models matters in this call: the second argument must always be the alternative hypothesis. The output lists the p-value of the test and the test statistic with its expected value and standard deviation under the null hypothesis. The \verb:#Cov: column gives the number of covariates in the alternative model that are not in the null model. In the linear model the test statistic is scaled in such a way that it takes values between 0 and 100. The test statistic can be interpreted as 100 times a weighted average (partial) correlation between the covariates of the alternative and the residuals of the response.
In other models, the test statistic has a roughly similar scaling and interpretation.

\subsection{Nuisance covariates}

A similar syntax can be used to correct the test for nuisance covariates. To correct the test of the previous section for the possible confounding influence of the covariate \verb:D:, we specify the null hypothesis \verb:Y ~ D: versus the alternative \verb:Y ~ A + B + C + D:. Note that the nuisance covariate occurs in both the null and the alternative model.
<<>>=
gt(Y~D, Y~A+B+C+D, data = X)
@

\subsection{The \Rclass{gt.object} object: extracting information}

The \Rfunction{gt} function returns a \Rclass{gt.object} object, which stores some useful information, for example the information needed to make diagnostic plots. Many methods have been defined for this object. One useful function is the \Rfunction{summary} method
<<>>=
summary(gt(Y~A, Y~A+B+C, data = X))
@
Other functions are available to extract useful information from a \Rclass{gt.object}. For example,
<<>>=
res <- gt(Y~A, Y~A+B+C, data = X)
p.value(res)
z.score(res)
result(res)
size(res)
@
The \Rfunction{z.score} function returns the test statistic standardized by its expectation and standard deviation under the null hypothesis; \Rfunction{result} returns a \Rclass{data.frame} with the test result; \Rfunction{size} returns the number of alternative covariates.

\subsection{Alternative function calls}
\label{alternative calls}

The call to \Rfunction{gt} is quite flexible, and the null and alternative hypotheses can be specified using either \Rclass{formula} objects or design matrices. We illustrate both types of calls, starting with the \Rclass{formula}-based ones.

As the global test always tests nested models, there is no need to repeat the response and the null covariates when specifying the alternative model, so we may abbreviate the call of the previous section by specifying only those alternative covariates that do not already appear in the null model. Therefore,
<<>>=
gt(Y~A, ~B+C, data = X)
@
also tests the null hypothesis \verb:Y ~ A: versus the alternative \verb:Y ~ A + B + C:.

If only a single model is specified, \Rfunction{gt} will test a null model with only an intercept against the specified model. So, to test the null hypothesis \verb:Y ~ 1: against the alternative \verb:Y ~ A + B + C:, we may write
<<>>=
gt(Y~A+B+C, data = X)
@
The dot (\verb:.:) argument for \Rclass{formula} objects can often be useful. To test \verb:Y ~ A: against the global alternative that all covariates are associated with \Robject{Y}, we can test
<<>>=
gt(Y~A, ~., data = X)
@
Using the information from the column names in the \Rfunarg{data} argument, the \verb:~.: argument is automatically expanded to \verb:~ A + B + C + D + E + F + G + H + I + J:.

In some applications it is more natural to work with design matrices directly, rather than to specify them through a \Rclass{formula}. To perform the test of \verb:Y ~ 1: against \verb:Y ~ .:, we may write
<<>>=
gt(Y, X)
@
Similarly, the null hypothesis may be specified as a design matrix. The call
<<>>=
designA <- cbind(1, X[,"A"])
gt(Y, X, designA)
@
gives the same result as \verb:gt(Y~A, ~., data = X):, except for the \verb:#Cov: output: the function cannot detect that some of the null covariates are also present in the alternative design matrix, only that the latter contains exactly correlated ones. Note that when specified in this way the null design matrix must be a complete design matrix, i.e.\ with any intercept term included in the matrix.
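As a quick check of this equivalence (a minimal sketch using only functions introduced above), the formula-based and the matrix-based specification of the same null and alternative hypotheses give the same p-value, even though the reported \verb:#Cov: differs:
<<>>=
# formula-based and matrix-based calls for the same pair of hypotheses
p.value(gt(Y~A, ~., data = X))
p.value(gt(Y, X, designA))
@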
\subsection{Models}
\label{models}

The \Rfunction{gt} function can work with the following models: linear regression, logistic regression, multinomial logistic regression, Poisson regression and the Cox proportional hazards model. The model to be used can be specified by the \Rfunarg{model} argument.
<<>>=
P <- rpois(20, lambda=2)
gt(P~A, ~., data=X, model = "Poisson")
gt(P~A, ~., data=X, model = "linear")
@
If the null model has no covariates (i.e.\ \verb:~0: or \verb:~1:), the logistic and Poisson model results are identical to the linear model results.

If the \Rfunarg{model} argument is missing, the function will try to determine the model from the input. If the response is a \Rclass{factor} with two levels or a \Rclass{logical}, it uses a logistic model; if the response is a factor with more than two levels, a multinomial logistic model; if the response is a \Rclass{Surv} object, it uses a Cox model (for examples, see Section \ref{survival}). In all other cases the default is linear regression. Use \Rfunction{summary} to check which model was used.

\subsection{Stratified Cox model and competing risks survival analysis}
\label{survival}

When applying \Rfunction{gt} for the Cox proportional hazards model, the user may also specify strata. For this purpose, the null hypothesis should be given as a \Rclass{formula} object. The strata can then be specified as in the \Rpackage{survival} package. We first simulate some survival data via
<<>>=
time <- rexp(20,1/100)
status <- rbinom(20,1,0.5)
str <- rbinom(20,1,0.5)
@
To test the alternative \verb:Surv(time,status) ~ A + B + strata(str): against the null hypothesis \verb:Surv(time,status) ~ strata(str):, one can use the call
<<>>=
gt(Surv(time,status), ~A+B+strata(str), ~strata(str), data=X)
@
As already described above, one could also use the shorter call:
<<>>=
gt(Surv(time,status), ~A+B, ~strata(str), data=X)
@
All strata terms must be part of both the alternative and the null hypothesis; testing the alternative \verb:Surv(time,status) ~ A + B + strata(C): against the null hypothesis \verb:Surv(time,status) ~ A + B: is not supported. If a strata term is only specified in the alternative, a warning is printed and the corresponding strata term is ignored.

The \Rfunction{gt} function for the stratified Cox model can also be used to apply the global test for the cause-specific hazards model. Let us assume that there are two different event types, indicated by the values $1$ and $2$ in the status variable:
<<>>=
status <- status * (rbinom(20,1,0.5) + 1)
@
Using the \Rpackage{mstate} package, we first transform the variables into long format with event type-specific covariates:
<<>>=
library(mstate)
survdat <- data.frame(X, "time.01" = time, "time.02" = time,
  "status.01" = ifelse(status == 1, 1, 0),
  "status.02" = ifelse(status == 2, 1, 0))
survdat <- msprep(time = c(NA, "time.01","time.02"),
  status = c(NA, 'status.01','status.02'), data = survdat,
  trans = trans.comprisk(2), keep=c("A","B"))
survdat <- expand.covs(survdat, c("A","B"))
@
Now testing the alternative \verb:Surv(time,status) ~ A + B: against the null hypothesis \verb:Surv(time,status) ~ 1: in the cause-specific hazards model with event types $1$ and $2$ can e.g.\ be performed using
<<>>=
gt(Surv(time,status), ~A.1+B.1+A.2+B.2, ~strata(trans), data=survdat)
@

\subsection{Null distribution: asymptotic or permutations}
\label{permhist}

By default the global test uses an analytic null distribution to calculate the p-values of the test.
This analytic distribution is exact in the case of the linear model with normally distributed errors, and asymptotic in all other models. The distribution that is used is described in \cite{Goeman2009} for linear and generalized linear models, and in \cite{Goeman2005} for the Cox proportional hazards model. The assumption underlying the asymptotic distribution is that the sample size is (much) larger than the number of covariates of the null hypothesis; the dimensionality of the alternative is not an issue. For the linear, logistic and Poisson models, the reported p-values are numerically reliable up to at least two decimal places down to values of around $10^{-12}$. Lower reported p-values are less reliable (although they can be trusted to be below $10^{-12}$).

In situations in which the assumptions underlying the asymptotics are questionable, or in which an exact alpha level of the test is necessary, it is possible to calculate the p-value using permutations instead. Because permutations require an exchangeable null hypothesis, such a permutation p-value is only available for the linear model and for the exchangeable null hypotheses \verb:~1: and \verb:~0: in other models. To calculate permutation p-values, specify the number of permutations with the \Rfunarg{permutations} argument. The default, \verb:permutations = 0:, selects the asymptotic distribution. If the number of permutations specified in \Rfunarg{permutations} is larger than the total number of possible permutations, all possible permutations are used; otherwise the function draws permutations at random. Use \Rfunction{summary} to see which variant was actually used. Compare
<<>>=
gt(Y,X)
gt(Y,X, permutations=1e4)
@
The distribution of the permuted test statistic can be visualized using the \Rfunction{hist} function.
\begin{figure}[!ht]
<<fig=TRUE>>=
hist(gt(Y,X, permutations=1e4))
@
\end{figure}

\subsection{Intercept terms}

If \Rfunarg{null} is given as a \Rclass{formula} object, intercept terms are automatically included in the model unless this term is explicitly removed with \verb:~0+...: or \verb:~...-1:, as is usual in \Rclass{formula} objects. This automatic addition of an intercept does not happen if \Rfunarg{null} is specified as a design matrix. Therefore, the calls
<<>>=
A <- X[,"A"]
gt(Y,X,A)
gt(Y,X,~A)
@
test different null hypotheses: \verb:Y ~ 0 + A: and \verb:Y ~ 1 + A:, respectively.

In contrast, in the alternative model the intercept term is always suppressed, even if \Rfunarg{alternative} is a \Rclass{formula} and an intercept is not present in the null model. If a user wants to include an intercept term in the alternative model but not in the null model, he must explicitly construct an intercept variable. The reason for this is that the test result is not invariant to the scaling of variables in the alternative, and therefore also not invariant to the relative scaling of the intercept to the other variables. The user must therefore choose and construct an appropriately scaled intercept. The call
<<>>=
gt(Y~0+A, ~ B+C, data = X)
@
suppresses the intercept in both the null and the alternative hypothesis. To include an intercept in the alternative, we must say something like
<<>>=
IC <- rep(1, 20)
gt(Y~0+A, ~ IC+B+C, data = X)
@
Note that setting \verb:IC <- rep(2, 20): gives a different result.
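This can be verified directly (a minimal sketch; the object \verb:IC2: is introduced here only for illustration):
<<>>=
# the same test with the intercept column rescaled by a factor 2
IC2 <- rep(2, 20)
gt(Y~0+A, ~ IC2+B+C, data = X)
@
Comparing this output with the previous one illustrates that the test result depends on the relative scaling of the intercept column.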
\subsection{Covariates of class \Rclass{factor}}

Another consequence of the fact that the global test is not invariant to the parametrization of the alternative model is that one must carefully consider the choice of contrasts for \Rclass{factor} covariates. We distinguish nominal (unordered) factors and ordinal (ordered) factors.

The usual coding of nominal factors with a reference category and dummy variables that describe the difference between each category and the reference is usually not appropriate for the global test, as this parametrization (and therefore the test result) depends on the choice of the reference category, which is often arbitrary. It is more appropriate to use a symmetric parametrization with a dummy variable for each category. This works even if multiple factors are considered, because the global test is not adversely affected by overparametrization. If \Rfunction{gt} was called with the argument \Rfunarg{x} set to \Robject{TRUE}, we can use \Rfunction{model.matrix} on the \Rclass{gt.object} to check the design matrix.
<<>>=
set.seed(1234)
YY <- rnorm(6)
FF <- factor(rep(letters[1:2], 3))
GG <- factor(rep(letters[3:5], 2))
model.matrix(gt(YY ~ FF + GG, x = TRUE))$alternative
@
This choice of contrasts guarantees that the test result does not depend on the order of the levels of any factors.

For ordered factors it is often reasonable to make contrasts between adjacent categories. In a model without an intercept term the frequently used \emph{split coding} scheme allows the parameters $\beta_i$ to be interpreted as the increase in the transition from category $i-1$ to category $i$, which is intuitively appropriate for ordinal data. In our example this yields
<<>>=
GG <- ordered(GG)
model.matrix(gt(YY ~ GG, x = TRUE))$alternative
@
If an intercept term is now included in \Rfunarg{null} (i.e.\ if it is not explicitly removed), this choice of contrasts is equivalent to taking an arbitrary category to be the reference category and, starting from that, assuming that the effects of categories further apart are more diverse than the effects of categories close by. More explicitly, choosing the first, second, and third category as the reference would theoretically result in the design matrices
<<>>=
R1 <- matrix(c(0,1,1,0,1,1,0,0,1,0,0,1),6,2,dimnames=list(1:6,c("GGd","GGe")))
R2 <- matrix(c(-1,0,0,-1,0,0,0,0,1,0,0,1),6,2,dimnames=list(1:6,c("GGc","GGe")))
R3 <- matrix(c(-1,0,0,-1,0,0,-1,-1,0,-1,-1,0),6,2,dimnames=list(1:6,c("GGc","GGd")))
@
It can be shown that the global test statistic --- and hence the test result --- is invariant to the choice of the reference category. In the \Rfunction{gt} function we can check this easily with
<<>>=
gt(YY ~ GG)
gt(YY, alternative=R1)
@
The same results are obtained for \Robject{R2} and \Robject{R3}, respectively. The choice of contrasts therefore guarantees that, in a model with an intercept term, the test result does not depend on the choice of the reference category. The difference in the number of covariates included in the alternative (3 vs.\ 2 in the above outputs) is due to the additional vector of ones in the split coding, which, however, does not have any effect on the test result. (Strictly speaking, the effective number of covariates included in the alternative is the number given by the output minus the number of ordinal factors.) Note that otherwise, if the intercept term is removed from \Rfunarg{null}, the test result will depend on the choice of the reference category, which may not be desirable.
The implementation used protects us from such situations and, most notably, leads to more interpretable results if an intercept is excluded from the null model. If a user nevertheless wants to test alternatives that do depend on the choice of the reference category, he must explicitly specify a corresponding design matrix in \Rfunarg{alternative} (such as \Robject{R1}, \Robject{R2}, and \Robject{R3} from above). This, for example, gives
<<>>=
gt(YY, alternative=R1, null=~0)
@
In contrast, the variant that \Rfunction{gt} is based on leads to
<<>>=
gt(YY ~ GG, null=~0)
@

\subsection{Directing the test: weights}
\label{weighting}

The global test assigns relative weights to each covariate in the alternative, which determine the contribution of each covariate to the test result. The default weighting, which follows from the theory of the test \citep{Goeman2006}, is proportional to the residual variance of each of the covariates, after orthogonalizing them with respect to the null covariates. The weights that \Rfunction{gt} uses internally can be retrieved with the \Rfunction{weights} function.
<<>>=
res <- gt(Y, X)
weights(res)
@
Only the ratios between weights are relevant. The weights that are returned are scaled so that the maximum weight is 1.

In some applications the default weighting is not appropriate, for example if the covariates are all measured in different units and the relative scaling of the units is arbitrary. In that case it is better to standardize all covariates to unit standard deviation before performing the test. This can be done using the \Rfunarg{standardize} argument.
<<>>=
res <- gt(Y,X, standardize=TRUE)
weights(res)
@
Alternatively, the function can work with user-specified weights, given in the \Rfunarg{weights} argument. These weights are multiplied by the default weights, unless the \Rfunarg{standardize} argument is set to \verb:TRUE:. The following two calls give the same test result.
<<>>=
gt(Y, X[,c("A","A","B")], weights=c(.5,.5,1))
gt(Y, X[,c("A","B")])
@

\subsection{Directing the test: directional}

The power of the global test does not depend on the sign of the true regression coefficients. However, in some applications the regression coefficients of different covariates are a priori expected to have the same sign. Using the \Rfunarg{directional} argument, the test can be directed to be more powerful against the alternative that the regression coefficients under the alternative all have the same sign.
<<>>=
gt(Y, X, directional = TRUE)
@
In the hierarchical model formulation of the test, this is achieved by making the random regression coefficients a priori positively correlated. The value \verb:directional = TRUE: corresponds to an a priori correlation between regression coefficients of $\sqrt{1/2}$. If desired, the \Rfunarg{directional} argument can be set to a value other than \verb:TRUE:. Setting \Rfunarg{directional} to a value of $d$ corresponds to an a priori correlation of $\sqrt{d/(1+d)}$. If some covariates are a priori expected to have regression coefficients with opposite signs, the corresponding covariates can be given negative weights.

\subsection{Offset terms and testing values other than zero}

By default, the global test tests the null hypothesis that the regression coefficients of all covariates of the alternative hypothesis are zero. It is also possible to test the null hypothesis that these coefficients have a value other than zero, specified by the user. This can be done using the \Rfunarg{test.value} argument.
<<>>=
gt(Y~A+B+C,data=X, test.value=c(.2,.2,.2))
@
The \Rfunarg{test.value} argument is always applied to the original alternative design matrix, i.e.\ before any standardization or weighting. Specifying \Rfunarg{test.value} in this way is equivalent to adding an \Rfunction{offset} term $X\mathbf{v}$ to the null hypothesis, where $X$ is the design matrix of the alternative hypothesis and $\mathbf{v}$ is the specified \Rfunarg{test.value}.
<<>>=
os <- X[,1:3]%*%c(.2,.2,.2)
gt(Y~offset(os), ~A+B+C, data=X)
@
Offset terms are not implemented for the multinomial logistic model.

\section{Diagnostic plots}

Aside from the permutation histogram already mentioned in Section \ref{permhist}, there are two main diagnostic plots that can help users to interpret a test result. Both plots are based on a decomposition of the test result into component test statistics that only use part of the information that the full test uses.

\subsection{The \Rfunction{covariates} plot}
\label{covariates}

As shown in \cite{Goeman2004}, the global test statistic on a collection of alternative covariates can be seen as a weighted average of the global test statistics for each individual alternative covariate.
<<>>=
gt(Y~A+B, data=X)
gt(Y~A, data=X)
gt(Y~B, data=X)
@
The test statistic of the test against \verb:~A+B: is between the test statistics against the alternatives \verb:~A: and \verb:~B:, even though the cumulative evidence of \Robject{A} and \Robject{B} may make the p-value of the combined test smaller than that of each individual one. This is because the global test statistic for an alternative hypothesis is always a weighted average of the test statistics for tests of the component single-covariate alternatives.

The \Rfunction{covariates} plot is based on this decomposition of the test statistic into the contributions made by each of the covariates in the alternative hypothesis. The contribution of each such covariate is itself a test. It can be useful to make a plot of these test results to find those covariates or groups of covariates that contribute most to a significant test result.
\begin{figure}[!ht]
<<fig=TRUE>>=
covariates(gt(Y,X))
@
\end{figure}
The \Rfunction{covariates} plot by default plots the p-values of the tests of the individual component covariates of the alternative. Other characteristic values of the component tests may be plotted using the \Rfunarg{what} argument: specifying \verb:what = "z": plots standardized test statistics (compare the \Rfunction{z.score} method for \Rclass{gt.object} objects); specifying \verb:what = "s": gives the unstandardized test statistics; and \verb:what = "w": gives the unstandardized test statistics weighted for the relative weights of the covariates in the test (compare the \Rfunction{weights} method for \Rclass{gt.object} objects). If (weighted or unweighted) test statistics are plotted, bars and stripes appear that signify the mean and standard deviation of the bars under the null hypothesis.
\begin{figure}[!ht]
<<fig=TRUE>>=
covariates(gt(Y,X), what="w")
@
\end{figure}
The plotted covariates are ordered in a hierarchical clustering graph. The distance measure used for the graph is absolute correlation distance if the \Rfunarg{directional} argument of \Rfunction{gt} was \Robject{FALSE} (the default), or correlation distance otherwise.
(Absolute) correlation distance is appropriate here because the test results for the individual covariates can be expected to be similar if the covariates are strongly correlated, and because the sign of the correlation matters only if a directional test was used. The default clustering method is average linkage. This can be changed if desired, using the \Rfunarg{cluster} argument. Clustering can also be turned off by setting \verb:cluster = FALSE:.

The hierarchical clustering graph induces a collection of subsets of the tested covariates between the full set that is the top of the clustering graph and the single covariates that are the leaves. There are $2k-1$ such sets for a graph with $k$ leaf nodes, including the top and the leaves. It is possible to do a multiple testing procedure on all $2k-1$ sets, controlling the family-wise error rate while taking the structure of the graph into account. The \Rfunction{covariates} function performs such a procedure, called the \emph{inheritance} procedure, which is an adaptation of the method of \cite{Meinshausen2008}: see Section \ref{inheritance}. By coloring the part of the clustering graph that has a significant multiplicity-corrected p-value in black, the user can get an impression of which covariates and clusters of covariates are most clearly associated with the response variable. The significance threshold at which a multiplicity-corrected p-value is called significant can be adjusted with the \Rfunarg{alpha} argument (default 0.05). In some situations the significant branches do not reach all the way to the leaf nodes. The interpretation of this is that the multiple testing procedure can infer with confidence that at least one of the covariates below the last significant branch is associated with the response, but it cannot pinpoint with enough confidence which one(s).

The result of the \Rfunction{covariates} function can be stored to access the information in the graph. The \Rfunction{covariates} function returns a \Rclass{gt.object} containing all tests on all subsets induced by the clustering graph, with their familywise error adjusted p-values.
<<>>=
res <- covariates(gt(Y,X))
res[1:10]
@
The names of the subsets should be read as follows. ``O'' refers to the origin or root, and each ``[1'' refers to a first (or left) branch, whereas each ``[2'' refers to a second (or right) branch. Leaf nodes are also referred to by name. To get the leaf nodes of the subgraph that is significant after multiple testing correction, use the \Rfunction{leafNodes} function
<<>>=
leafNodes(res, alpha=0.10)
@
To get a nice table of only the information on the single covariates, including their direction of association, use the \Rfunction{extract} function.
<<>>=
extract(res)
@
The function \Rfunction{covariates} tries to sort the bars in such a way that the most significant covariates appear on the left. This sorting is, of course, constrained by the dendrogram if present. Set the \Rfunarg{sort} argument to \verb:FALSE: to keep the bars in the original order as much as possible under the same constraints. An additional option \Rfunarg{zoom} is available that ``zooms in'' on the significant branches by discarding the non-significant ones. If the whole graph is non-significant, \Rfunarg{zoom} has no effect.
\begin{figure}[!ht]
<<fig=TRUE>>=
covariates(gt(Y,X), zoom=TRUE)
@
\end{figure}
The default colors, legend and labels in the plot can be adjusted with the \Rfunarg{colors}, \Rfunarg{legend} and \Rfunarg{alias} arguments.
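For example, the bars can be relabeled through the \Rfunarg{alias} argument (a minimal sketch; see the help page of \Rfunction{covariates} for the exact form that these arguments accept):
\begin{figure}[!ht]
<<fig=TRUE>>=
# same plot as above, but with alternative labels for the covariates
covariates(gt(Y,X), alias = letters[1:10])
@
\end{figure}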
The \Rfunction{covariates} function invisibly returns the test results for all tests it performs as a \Rclass{gt.object}. The \Rfunction{leafNodes} function can be used to extract useful information from this object. Using \Rfunction{leafNodes} with the same value of \Rfunarg{alpha} that was used in the \Rfunction{covariates} function extracts the test results for the leaves of the significant subgraph. Using \verb:alpha = 1: extracts the test results for the leaves of the full graph, i.e.\ for the individual covariates.

By default, the \Rfunction{covariates} function can only make a plot for a single test result, even if the \Rclass{gt.object} contains multiple test results (see Section \ref{manytests}). However, by providing a filename in the \Rfunarg{pdf} argument of the \Rfunction{covariates} function it is possible to make multiple plots, writing them to a pdf file as separate pages.

Those who like a more machine-learning oriented terminology can use the \Rfunction{features} function, which is identical to \Rfunction{covariates} in all respects.

\subsection{The \Rfunction{subjects} plot}
\label{subjects}

Alternatively, it is possible to visualize the influence of the subjects, rather than of the covariates, on the test result. This can be useful in order to look for subjects that have an overly large influence on the test result, or to find subjects that deviate from the main pattern.

Visualizing the test result in terms of the contributions of the subjects can be done using a different decomposition of the test result. In the linear model the test statistic $Q$ can be viewed as a weighted sum of the quantities
\[ Q_i = \mathrm{sign}(Y_i-\mu_i) \sum_{j=1}^n \sum_{k=1}^p X_{ik} X_{jk} (Y_j-\mu_j), \]
where $Y_i$ is the response variable of subject $i$, $\mu_i$ that person's expected response under the null hypothesis, and $X$ the design matrix of the alternative. We subtract $\hat{\mathrm{E}}(Q_i) = \mathrm{sign}(Y_i-\mu_i) \sum_{k=1}^p X_{ik} X_{ik} (Y_i-\mu_i)$ as a crude estimate of the expectation of $Q_i$. An estimate of the variance of $Q_i$ is $\mathrm{Var}\{Q_i - \hat{\mathrm{E}}(Q_i)\} = \sigma^2 \sum_{j=1}^n \sum_{k=1}^p X_{ik}^2 X_{jk}^2$. The quantities are asymptotically normally distributed. A similar decomposition can be made for the test statistic in models other than the linear one.

The resulting quantity $Q_i - \hat{\mathrm{E}}(Q_i)$ can be interpreted as the contribution of the $i$-th subject to the test statistic in the sense that it is proportional to the difference between the test statistic for the full sample and the test statistic of a reduced sample in which subject $i$ has been removed. It can also be interpreted as an alternative test statistic for the same null hypothesis as the global test, but one which uses only part of the information that the full global test uses. The contribution $Q_i - \hat{\mathrm{E}}(Q_i)$ of individual $i$ takes a large value if other subjects who are similar to subject $i$ in terms of their covariates $X$ (measured in correlation distance) also tend to be similar in terms of their residuals $Y_j-\mu_j$ (i.e.\ have the same sign). This contribution $Q_i - \hat{\mathrm{E}}(Q_i)$ can, therefore, be viewed as a partial global test statistic that rejects if individuals that are similar to individual $i$ in terms of their alternative covariates tend to deviate from the null model in the same direction as individual $i$ with their response variable. The \Rfunction{subjects} function plots the p-values of these partial test statistics.
As in the \Rfunction{covariates} function, other values may be plotted using the \Rfunarg{what} argument. Specifying \verb:what = "z": plots test statistics standardized by their expectation and standard deviation; specifying \verb:what = "s": gives the unstandardized test statistics $Q_i$; and \verb:what = "w": gives the unstandardized test statistics weighted for the relative weights of the subjects in the test (proportional to $|Y_i|$). If (weighted or unweighted) test statistics are plotted, bars and stripes appear that signify the mean and standard deviation of the bars under the null hypothesis.
\begin{figure}[!ht]
<<fig=TRUE>>=
subjects(gt(Y,X))
@
\end{figure}
An additional argument \Rfunarg{mirror} (default: \verb:TRUE:) can be used to plot the unsigned version $\bar Q_i = \sum_{j=1}^n \sum_{k=1}^p X_{ik} X_{jk} (Y_j-\mu_j)$ (no effect if \verb:what = "p":). Combined with \verb:what = "s":, this gives the first partial least squares component of the data, which can be interpreted as a first-order approximation of the estimated linear predictor under the alternative. In the resulting plot, large positive values correspond to subjects that have a much higher predicted value under the alternative hypothesis than under the null, whereas large negative values correspond to subjects with a much lower expected value under the alternative than under the null.
\begin{figure}[!ht]
<<fig=TRUE>>=
subjects(gt(Y,X), what="s", mirror=FALSE)
@
\end{figure}
As in the \Rfunction{covariates} plot, the subjects in the \Rfunction{subjects} plot are ordered in a hierarchical clustering graph. The distance measure used for the clustering graph is correlation distance. Correlation distance is appropriate because the test results for subjects can be expected to be similar if their measurements are close in terms of correlation distance. The default clustering method is average linkage. This can be changed if desired, using the \Rfunarg{cluster} argument. Clustering can also be turned off by setting \verb:cluster = FALSE:. Unlike in the \Rfunction{covariates} plot, no multiple testing is done on the clustering graph.

The function tries to sort the bars in such a way that the most significant partial tests appear on the left. This sorting is, of course, constrained by the dendrogram if present. Set the \Rfunarg{sort} argument to \verb:FALSE: to keep the bars in the original order as much as possible under the same constraints. The default colors, legend and labels in the plot can be adjusted with the \Rfunarg{colors}, \Rfunarg{legend} and \Rfunarg{alias} arguments.

By default, the \Rfunction{subjects} function can only make a plot for a single test result, even if the \Rclass{gt.object} contains multiple test results (see Section \ref{manytests}). However, by providing a filename in the \Rfunarg{pdf} argument of the \Rfunction{subjects} function it is possible to make multiple plots, writing them to a pdf file as separate pages.

\section{Doing many tests: multiple testing}

In high-dimensional data, when the dimensionality of the design matrix of the alternative is very high, it is often interesting to study subsets of the covariates, or to compare alternative weighting options. The \Rpackage{globaltest} package facilitates this by making it possible to perform tests for many alternatives at once, and to perform various algorithms for multiple testing correction.
\subsection{Many subsets or many weights}
\label{manytests}

To test one or many subsets of covariates of the alternative design matrix, use the \Rfunarg{subsets} argument. If a single subset is to be tested, the \Rfunarg{subsets} argument can be presented as a vector of covariate names or of covariate indices in the alternative design matrix.
<<>>=
set <- LETTERS[1:3]
gt(Y,X, subsets = set)
@
To test many subsets, \Rfunarg{subsets} can be a (named) list of such vectors.
<<>>=
sets <- list(one=LETTERS[1:3], two=LETTERS[4:6])
gt(Y,X, subsets = sets)
@
Duplicate identifiers in the subset vectors are not removed, but lead to increased weight for the duplicated covariates in the resulting test, unless the \verb:trim: option was set to \verb:TRUE: (see Section \ref{trim}). To retrieve the subsets from a \Rclass{gt.object}, use the \Rfunction{subsets} method.
<<>>=
res <- gt(Y,X, subsets = sets)
subsets(res)
@
Weighting was already discussed in Section \ref{weighting}. To test many different weights simultaneously, the \Rfunarg{weights} argument can also be given as a (named) list, similar to the \Rfunarg{subsets} argument.
<<>>=
wts <- list(up = 1:10, down = 10:1)
gt(Y,X,weights=wts)
@
Weights can also be used as an alternative way of specifying subsets, by giving weight 1 to included covariates and 0 to others. Weights and subsets can also be combined. Either specify a single weights vector for many subsets
<<>>=
gt(Y,X, subsets=sets, weights=1:10)
@
or specify a separate weights vector for each subset. In the latter case each weights vector may either be a vector of the same length as the number of covariates in the alternative design matrix, or, alternatively, be equal in length to the corresponding subset.
<<>>=
gt(Y,X, subsets=sets, weights=wts)
gt(Y,X, subsets=sets, weights=list(1:3,7:5))
@
Note that in case of a name conflict between the \Rfunarg{subsets} and \Rfunarg{weights} arguments, the names of the \Rfunarg{weights} argument are returned under ``alias''. In general, the alias is meant to store additional information on each test performed. Unlike the name, the alias does not have to be unique. An alias for the test result may be provided with the \Rfunarg{alias} argument, or added or changed later using the \Rfunction{alias} method.
<<>>=
res <- gt(Y,X, weights=wts, alias = c("one", "two"))
alias(res)
alias(res) <- c("ONE", "TWO")
@
To take a subset of the test results, a \Robject{gt.object} can be subsetted using \verb:[: or \verb:[[: as with other \verb:R: objects. There is no distinction between \verb:[: and \verb:[[:. A \Robject{gt.object} can be sorted to increasing p-values with the \Rfunction{sort} command. In case of equal p-values, which may happen e.g.\ when doing permutation testing, the tests with the same p-values are sorted to decreasing z-scores.
<<>>=
res[1]
sort(res)
@

\subsection{Unstructured multiple testing procedures}
\label{multtest}

When doing many tests, it is important to correct for multiple testing. The \Rpackage{globaltest} package offers different methods for correcting for multiple testing of unstructured collections of tests, in which the tests are simply considered as an exchangeable list with no inherent structure. These methods are described in the help file of the \Rfunction{p.adjust} function (\Rpackage{stats} package).
The three most important ones are
\begin{itemize}
\item[Holm] The procedure of \cite{Holm1979} for control of the family-wise error rate
\item[BH] The procedure of \cite{Benjamini1995} for control of the false discovery rate
\item[BY] The procedure of \cite{Benjamini2001} for control of the false discovery rate
\end{itemize}
The procedures of Holm and \cite{Benjamini2001} are valid for any dependency structure between the null hypotheses, but the procedure of \cite{Benjamini1995} is only valid for independent or positively correlated test statistics \citep[see][for details]{Benjamini2001}.

Multiplicity-corrected p-values can be calculated with the \Rfunction{p.adjust} function. The default procedure is Holm's procedure.
<<>>=
p.adjust(res)
p.adjust(res, "BH")
p.adjust(res, "BY")
@

\subsection{Graph-structured hypotheses 1: the focus level method}
\label{focus level}

Sometimes the sets of covariates that are to be tested are structured in such a way that some sets are subsets of other sets. Such a structure can be exploited to gain improved power in a multiple testing procedure. The \Rpackage{globaltest} package offers two procedures that make use of the structure of the sets when controlling the familywise error rate. These procedures are the focus level procedure of \cite{Goeman2008}, and the inheritance procedure, a variant of the procedure of \cite{Meinshausen2008}. We treat both of these methods in turn.

Sets of covariates can be viewed as nodes in a graph, with the subset relationships forming the directed edges. Viewed in this way, any collection of sets of covariates forms a directed acyclic graph. The inheritance procedure is restricted to tree-structured graphs. The focus level method is not so restricted, and can work with any directed acyclic graph.

To illustrate the focus level method, let's make some covariate sets of interest.
<<>>=
level1 <- as.list(LETTERS[1:10])
names(level1) <- letters[1:10]
level2 <- list(abc = LETTERS[1:3], cde = LETTERS[3:5], fgh = LETTERS[6:8], hij = LETTERS[8:10])
level3 <- list(all = LETTERS[1:10])
dag <- c(level1, level2, level3)
@
This gives one top node, 10 leaf nodes and 4 intermediate nodes. The structure is a directed acyclic graph rather than a tree because leaf nodes ``C'' and ``H'' both have more than one parent.

The focus level method requires the choice of a \emph{focus level}. This is the level in the graph at which the procedure starts testing. If significant nodes are found at this level, the procedure will fan out to find significant ancestors and offspring of each significant node. A focus level can be specified as a character vector of node identifiers, or it can be generated in an automated way using the \Rfunction{findFocus} function.
<<>>=
fl <- names(level2)
fl <- findFocus(dag, maxsize=8)
@
The \Rfunction{findFocus} function chooses the focus level in such a way that each focus level node has at most \Rfunarg{maxsize} non-redundant offspring nodes, where a redundant node is a node that can be constructed as a union of other nodes. An optional argument \Rfunarg{atoms} (default: \verb:TRUE:) first decomposes all nodes into \emph{atoms}: small sets from which all offspring sets can be reconstructed as unions of atoms. Making use of these atoms often reduces computation time considerably, although it may, in theory, result in some loss of power.

To apply the focus level method, first create a \Rclass{gt.object} that contains all the covariates under the alternative, e.g.\ the \Rclass{gt.object} that uses the full alternative design matrix.
<<>>=
res <- gt(Y,X)
res <- focusLevel(res, sets = dag, focus=fl)
sort(res)
@
Like the \Rfunction{p.adjust} function, the \Rfunction{focusLevel} function reports familywise error rate adjusted p-values. It is a property of both the inheritance and the focus level method that the adjusted p-value of a node can never be smaller than a p-value of an ancestor node. The significant graph at a certain significance level is therefore always a coherent graph, which always contains all ancestor nodes of any rejected node. Such a graph can be succinctly summarized by reporting only its leaf nodes. This can be done using the \Rfunction{leafNodes} function.
<<>>=
leafNodes(res)
@
The \Rfunarg{alpha} argument of the \Rfunction{leafNodes} function can be used to specify the rejection threshold for the familywise error of the significant graph.

To visualize the test result as a graph, use the \Rfunction{draw} function. By default, this function draws the graph with the significant nodes in black and the non-significant ones in gray. The \Rfunarg{alpha} argument can be used to change the significance threshold. Alternatively, it is possible to draw only the significant subgraph, setting the \Rfunarg{sign.only} argument to \verb:TRUE:.
\begin{figure}[!ht]
<<fig=TRUE>>=
draw(res, names=TRUE)
@
\end{figure}
The \Rfunarg{names} argument (default \verb:FALSE:) forces the use of names in the nodes. This can quickly become unreadable even for small graphs if the names of the nodes are long. By default, therefore, \Rfunction{draw} numbers the nodes, returning a legend to interpret the numbers.
<<>>=
legend <- draw(res)
@
The \Rfunarg{interactive} argument can be used to make the plot interactive. In an interactive plot, click on a node to see the node label. Exit the interactive plot by pressing escape.

\subsection{Graph-structured hypotheses 2: the inheritance method}
\label{inheritance}

An alternative method for multiple testing of graph-structured hypotheses is the inheritance method. This procedure is based on the work of \cite{Meinshausen2008}. \Rfunction{inheritance} reports familywise error rate adjusted p-values, as the \Rfunction{p.adjust} and \Rfunction{focusLevel} functions do. Compared with the focus level method, the inheritance procedure is less computationally intensive, and does not require the definition of any (focus) level. However, it requires that the graph has a tree structure, rather than the more general directed acyclic graph structure that the focus level method works with.

To illustrate the inheritance method, we make use of the example data. However, we cannot make use of the \verb:dag: object created in Section \ref{focus level}, since it does not have a tree structure. For example, \verb:c: in \verb:dag: is a descendant of both \verb:abc: and \verb:cde:. We modify the commands of the previous section to make sure that each element of \verb:dag: has at most one parent; this guarantees that it is a tree-structured graph.
<<>>=
level1 <- as.list(LETTERS[1:10])
names(level1) <- letters[1:10]
level2 <- list(ab = LETTERS[1:2], cde = LETTERS[3:5], fg = LETTERS[6:7], hij = LETTERS[8:10])
level3 <- list(all = LETTERS[1:10])
tree <- c(level1, level2, level3)
@
Now we can apply the \Rfunction{inheritance} method. The syntax of the function is very similar to that of the \Rfunction{focusLevel} function.
<<>>=
res <- gt(Y,X)
resI <- inheritance(res, tree)
resI
@
The inheritance procedure has two variants: one with and one without the \Rfunarg{Shaffer} improvement \citep{Meinshausen2008}.
Setting the argument \verb:Shaffer = TRUE: allows a uniform improvement of the power of the procedure, but familywise error rate control is then guaranteed only if the hypothesis tested in each node of the graph that has only leaf nodes as offspring is precisely the intersection hypothesis of its child nodes. When doing the inheritance procedure in combination with the global test, this condition is fulfilled if the set of covariates at each node with only leaf nodes as offspring is precisely the union of the sets of covariates of its offspring leaf nodes. This condition is fulfilled for the \verb:tree: graph above, but if we had set \verb:level1 <- as.list(LETTERS[1:9]):, the node \verb:hij: would contain a covariate (\verb:J:) that is not present in any of its child nodes, so that the condition for the Shaffer improvement is not fulfilled, and setting \verb:Shaffer = TRUE: does not control the familywise error rate. If \Rfunarg{test} is a \Rclass{gt.object}, the procedure checks whether the structure of \Rfunarg{sets} allows for a Shaffer improvement, and sets \Rfunarg{Shaffer} to the correct default. In other cases, checking the validity of the Shaffer improvement is left to the user. Note that setting \verb:Shaffer = FALSE: always gives a correct procedure.

The tree structure of the hypotheses may be fixed a priori, based on prior knowledge rather than on the data. However, in some situations a data-driven definition of the structure is allowed. \cite{Meinshausen2008} suggests using hierarchical clustering with a distance matrix based on the (correlation) distance between the explanatory covariates. This is valid for the global test, and may in some cases also be valid if other tests are performed. In \Rfunction{inheritance}, the tree-structured graph \Rfunarg{sets} can be an object of class \Rclass{hclust} or \Rclass{dendrogram}. If \Rfunarg{sets} is missing and \Rfunarg{test} is a \Rclass{gt.object}, the structure is derived from the structure of \Rfunarg{test}.
<<>>=
hc <- hclust(dist(t(X)))
resHC <- inheritance(res, hc)
resHC
@
It is a property of both the inheritance and the focus level method that the adjusted p-value of a node can never be smaller than a p-value of an ancestor node. The significant graph at a certain significance level is therefore always a coherent graph, which always contains all ancestor nodes of any rejected node. Such a graph can be succinctly summarized by reporting only its leaf nodes. This can be done using the \Rfunction{leafNodes} function.
<<>>=
leafNodes(resI)
leafNodes(resHC)
@
The \Rfunarg{alpha} argument of the \Rfunction{leafNodes} function can be used to specify the rejection threshold for the familywise error of the significant graph. As for \Rfunction{focusLevel}, the \Rfunction{draw} function can be used to visualize the test result as a graph:
\begin{figure}[!ht]
<<fig=TRUE>>=
draw(resHC, names=TRUE)
@
\end{figure}
However, in most cases the \Rfunction{covariates} function does a better graphical job. \Rfunction{covariates} performs \Rfunction{hclust} on the covariates and calls the \Rfunction{inheritance} function using this data-driven structure.
<<>>=
covariates(res)
@

\chapter{Gene Set Testing}
\label{gene set testing}

\section{Introduction}

One important application of the global test is in gene set testing in gene expression microarray data \citep{Goeman2004, Goeman2005}. Such data consist of simultaneous gene expression measurements of many thousands of probes across the genome, performed for a number of biological samples.
The typical goal of a microarray experiment is to find associations between the expression of genes and a phenotype variable. Gene set testing is a common name for a type of analysis of microarray data that considers groups of genes that share a common annotation, e.g.\ genes that are all annotated to the same Gene Ontology term, that are all members of the same KEGG pathway, or that have a similar chromosomal location. Gene set testing methods test such gene sets together to investigate whether the genes in the gene set have a higher association with the response than expected by chance. These methods provide a single p-value for the gene set, rather than a p-value for each gene.

The global test is well suited for gene set testing; in fact, the global test was initially designed specifically with this application in mind \citep{Goeman2004}. The model that the global test uses for gene set testing is a regression model, such as might also be used to predict the response based on the gene expression measurements: in this model the gene expression measurements correspond to the covariates and the phenotype corresponds to the response. The null hypothesis that the global test tests is the hypothesis that all regression coefficients of the genes in the gene set are zero, i.e.\ that the genes in the gene set have no ability to predict the response. The global test can therefore be seen as a method that looks for differentially expressed gene sets. The global test tests gene sets in a single step, based on the full data, without an intermediate step of finding individual differentially expressed genes.

In the classification scheme for gene set testing methods of \cite{Goeman2007}, the global test is a \emph{self-contained} method rather than a \emph{competitive} one: it tests the null hypothesis that no gene in the gene set is associated with the phenotype rather than the null hypothesis that the genes in the gene set are not more associated with the phenotype than random genes on the microarray. The latter approach is followed by enrichment methods such as GSEA and methods based on Fisher's exact test. The global test is also a \emph{subject-sampling} rather than a \emph{gene-sampling} method. This means that when determining whether the genes in the gene set have a higher association with the phenotype than expected by chance, the method looks at the random biological variation between subjects, rather than comparing the gene set with random sets of genes. The latter approach is used by gene set testing methods based on Fisher's exact test. Unlike the validity of gene-sampling methods, the validity of subject-sampling methods does not depend on the unrealistic assumption that gene expression measurements are independent.

As shown by \cite{Goeman2006}, the global test is designed to have optimal power in the situation in which the gene set has many small non-zero regression coefficients. This means that the test is especially directed to find gene sets for which many genes are associated with the phenotype in a small way. This behavior is appropriate for gene set testing, because the situation that many genes are associated with the phenotype is usually the most interesting from a gene set perspective. Still, it is true that the null hypothesis that the global test tests is false even if only a single gene in the gene set is associated with the phenotype; especially smaller gene sets may therefore become significant as a result of only a single significant gene.
However, because the test is directed especially against the alternative that there are many associated genes, such examples are rare among larger gene sets.

\section{Data format}

The \Rpackage{globaltest} package uses the usual statistical orientation of data matrices, in which the columns of the data matrix correspond to covariates and the rows correspond to subjects. In gene set testing and in other genomics applications it is more common to use the reverse orientation, in which the columns of the data matrix correspond to the subjects and the rows to the covariates. The \Rfunction{gt.options} function can be used to change the default orientation expected by \Rfunction{gt} for the \Rfunarg{alternative} argument.
<>=
gt.options(transpose=TRUE)
@
Note that this option is only relevant if \Rfunarg{alternative} is given as a matrix. A \Rclass{formula} or \Rclass{ExpressionSet} (Section \ref{ExpressionSet}) input for \Rfunarg{alternative} is automatically interpreted correctly.

\subsection{Using \Rclass{ExpressionSet} data} \label{ExpressionSet}

We illustrate gene set testing using the \cite{Golub1999} data set, a famous data set which was one of the first to use microarray data in a classification context. This data set is available from Bioconductor as the \Rpackage{golubEsets} package. We load the \verb:Golub_Train: data set, consisting of 38 leukemia patients for which 7129 gene expression measurements were taken.
<>=
library(golubEsets)
data(Golub_Train)
@
The \verb:Golub_Train: data are in \Rclass{ExpressionSet} format, which is the standard format in Bioconductor for storing gene expression data. An \Rclass{ExpressionSet} object contains the gene expression data, the phenotype data, and annotation information about the genes and the experiment, all in a single \verb:R: object. The data have to be properly normalized and log- or otherwise transformed, as is usual for microarray data. We keep the normalization simple and use only \Rpackage{vsn}.
<>=
library(vsn)
exprs(Golub_Train) <- exprs(vsn2(Golub_Train))
@
The phenotype of interest is the leukemia subtype, coded as the variable \verb:ALL.AML:, with values \verb:"ALL": and \verb:"AML":, in \verb:pData(Golub_Train):.

It is generally a good idea to start by testing the overall expression profile to see whether that is notably different between AML and ALL patients. We supply the \Rclass{ExpressionSet} \verb:Golub_Train: in the \Rfunarg{alternative} argument of \Rfunction{gt}. Because the \Rfunarg{alternative} argument is of class \Rclass{ExpressionSet}, the function now uses \verb:t(exprs(Golub_Train)): as the \Rfunarg{alternative} argument and \verb:pData(Golub_Train): as the \Rfunarg{data} argument.
<>=
gt(ALL.AML, Golub_Train)
@
From the test result we conclude that the overall expression profile of ALL patients and AML patients differs markedly in this experiment. This is not very surprising, as this data set has been used in many papers as an example of a data set that can be classified very easily. From this result we may expect to find many genes and gene sets to be differentially expressed. If the overall test is not significant or only marginally significant, it can be difficult to find many genes or pathways that are differentially expressed.
In this case it is usually not a good idea to perform a broad untargeted data mining type analysis of the data, e.g.\ by testing complete pathway databases, because the signal of the genes and gene sets that are differentially expressed is then likely to be drowned in the noise of the genes that are not differentially expressed. A more targeted approach, focused on a limited number of candidate gene sets, may be more opportune in such a situation.

Adjustment of the test result for confounders, such as batch effects or clinical and phenotypic covariates, can be achieved by specifying these variables as covariates under the null hypothesis, as described in Section \ref{basic test}. When using \Rclass{ExpressionSet} data, the easiest way to do this is with a \Rclass{formula}. The terms of such a \Rclass{formula} are automatically interpreted in terms of the \Rfunction{pData} slot of the \Rclass{ExpressionSet}. Missing data are not allowed in phenotype variables, so we illustrate the adjustment for confounders by correcting for the data source in the Golub data (the DFCI and CALGB centers).
<>=
gt(ALL.AML ~ Source, Golub_Train)
@
In this specific case we see that the association between gene expression and disease subtype is completely confounded by the source variable. In fact, all ALL patients came from DFCI, and all AML patients from CALGB. In this case we cannot distinguish the effect of disease subtype from the center effect: the design of this study is, unfortunately, broken.

\subsection{Other input formats}

The formula- or matrix-based input described in Section \ref{basics} may also be used instead of the \Rclass{ExpressionSet}-based one. For matrix-based input, \Rfunction{gt} expects the usual statistical data format in which the subjects correspond to the rows of the data matrix and the covariates (probes or genes) to the columns. The option \Rfunarg{transpose} in \Rfunction{gt.options} can be used to change this. Setting
<>=
gt.options(transpose=TRUE)
@
changes the default behavior of \Rfunction{gt} to expect the transposed format that is usual in genomics, with the rows of the data matrix corresponding to the genes and the columns to the subjects. The \Rfunction{gtKEGG}, \Rfunction{gtGO} and \Rfunction{gtBroad} functions (Section \ref{databases}) always expect the genomics data format rather than the usual statistical format.

\subsection{The \Rfunarg{trim} option} \label{trim}

A second useful option to set when doing gene set testing is the \Rfunarg{trim} option. This option governs the way \Rfunction{gt} handles covariate names that appear in the \Rfunarg{subsets} argument but are not present in the expression data matrix. The default behavior of \Rfunction{gt} is to return an error when this happens. However, in gene set testing covariates may easily be missing from the expression data, for example because the subsets are based on the annotation of the complete microarray, while some genes have been removed from the expression data matrix, perhaps due to poor measurement quality. Setting
<>=
gt.options(trim=TRUE)
@
makes \Rfunction{gt} silently remove such missing covariates from the \Rfunarg{subsets} argument. Additionally, if \verb:trim = TRUE:, duplicate covariate names in \Rfunarg{subsets} are automatically removed.

\section{Testing gene set databases} \label{databases}

The most common approach to gene set testing is to test gene sets from public databases.
The \Rpackage{globaltest} package provides utility functions for three such databases: Gene Ontology, KEGG and the pathway databases maintained by the Broad Institute. In all cases, these functions make heavy use of the annotation packages available in Bioconductor. If the microarray that was used does not have an annotation package, the Entrez-based organism annotation packages (e.g.\ \Rpackage{org.Hs.eg.db} for human) can be used instead.

\subsection{KEGG}

The function \Rfunction{gtKEGG} can be used to test KEGG pathways. To test a single KEGG id, e.g.\ cell cycle (KEGG id \verb"04110"), use
<>=
gtKEGG(ALL.AML, Golub_Train, id = "04110")
@
The function automatically finds the right KEGG information from the \Rpackage{KEGGREST} package, and the right set of genes belonging to the KEGG id from the annotation package of the \Rpackage{hu6800} Affymetrix chip; the reference to this annotation package is contained in the \verb:Golub_Train: \Rclass{ExpressionSet} object. If \Rclass{ExpressionSet} objects are not used, the name of the annotation package can be supplied in the \Rfunarg{annotation} argument of \Rfunction{gtKEGG}.

Annotation packages are not always available for all microarray types. Therefore, a general Entrez-based annotation package is available for many organisms, e.g.\ \Rpackage{org.Hs.eg.db} for human. See \verb:www.bioconductor.org: for the names of the organism specific packages. This general Entrez-based annotation package may be substituted for a specific probe annotation package if a mapping from probe(set) ids to Entrez Gene ids is given (as a list or as a vector) in the \Rfunarg{probe2entrez} argument. For the Golub data we find such a mapping in the \Rpackage{hu6800.db} package.
<>=
eg <- as.list(hu6800ENTREZID)
gtKEGG(ALL.AML, Golub_Train, id="04110", probe2entrez = eg, annotation="org.Hs.eg.db")
@
If more than one KEGG id is tested, multiple testing corrected p-values are automatically provided. The default multiple testing method is Holm's, but others are available through the \Rfunarg{multtest} argument. See also the \Rfunction{p.adjust} function, described in Section \ref{multtest}. The results are sorted in order of increasing p-values (using the \Rfunction{sort} method), unless the \Rfunarg{sort} argument of \Rfunction{gtKEGG} is set to \verb:FALSE:.
<>=
gtKEGG(ALL.AML, Golub_Train, id=c("04110","04210"), multtest="BH")
@
If the \Rfunarg{id} argument is not specified, the function \Rfunction{gtKEGG} will test all KEGG pathways.
<>=
gtKEGG(ALL.AML, Golub_Train)
@

\subsection{Gene Ontology}

To test Gene Ontology terms the special function \Rfunction{gtGO} is available. This function accepts the same arguments as \Rfunction{gt}, except the \Rfunarg{subsets} argument, which is replaced by a collection of options to create gene sets from Gene Ontology. To test a single Gene Ontology term, e.g.\ cell cycle (\verb"GO:0007049"), we say
<>=
gtGO(ALL.AML, Golub_Train, id="GO:0007049")
@
The function automatically finds the right Gene Ontology information from the \Rpackage{GO.db} package, and the right set of genes belonging to the Gene Ontology term from the annotation package of the \Rpackage{hu6800} Affymetrix chip; the reference to this annotation package is contained in the \verb:Golub_Train: \Rclass{ExpressionSet} object. If \Rclass{ExpressionSet} objects are not used, the name of the annotation package can be supplied in the \Rfunarg{annotation} argument of \Rfunction{gtGO}. Annotation packages are not always available for all microarray types.
Therefore, a general Entrez-based annotation package is available for many organisms, e.g.\ \Rpackage{org.Hs.eg.db} for human. See \verb:www.bioconductor.org: for the names of the organism specific packages. This general Entrez-based annotation package may be substituted for a specific probe annotation package if a mapping from probe(set) ids to Entrez Gene ids is given (as a list or as a vector) in the \Rfunarg{probe2entrez} argument. For the Golub data we find such a mapping in the \Rpackage{hu6800.db} package.
<>=
eg <- as.list(hu6800ENTREZID)
gtGO(ALL.AML, Golub_Train, id="GO:0007049", probe2entrez = eg, annotation="org.Hs.eg.db")
@
It is also possible to test all terms in one or more of the three ontologies: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). A minimum and/or a maximum number of genes may be specified for each term.
<>=
gtGO(ALL.AML, Golub_Train, ontology="BP", minsize = 10, maxsize = 500)
@
If more than one Gene Ontology term is tested, multiple testing corrected p-values are automatically provided. The default multiple testing method is Holm's, but others are available through the \Rfunarg{multtest} argument. See also the \Rfunction{p.adjust} function, described in Section \ref{multtest}. The results are sorted in order of increasing p-values (using the \Rfunction{sort} method), unless the \Rfunarg{sort} argument of \Rfunction{gtGO} is set to \verb:FALSE:.
<>=
gtGO(ALL.AML, Golub_Train, id=c("GO:0007049","GO:0006915"), multtest="BH")
@
A multiple testing method that is very suitable for Gene Ontology is the focus level method, described in more detail in Section \ref{focus level}. This multiple testing method presents a coherent significant subgraph of the Gene Ontology graph. It is a relatively computationally intensive method. To keep this vignette light, we shall only demonstrate the focus level method on the subgraph consisting of the ``cell cycle'' GO term and all its descendants.
<>=
descendants <- get("GO:0007049", GOBPOFFSPRING)
res <- gtGO(ALL.AML, Golub_Train, id = c("GO:0007049", descendants), multtest = "focus")
leafNodes(res)
@
The leaf nodes can be seen as a summary of the significant GO terms: they present the most specific terms that have been declared significant at a specified significance level \Rfunarg{alpha} (default 0.05).

The graph can be drawn using the \Rfunction{draw} function. In the interactive mode of this function, click on the nodes to see the GO id and term. The default of this function is to draw the full graph, with the non-significant nodes greyed out. It is also possible to draw only the significant graph by setting the \Rfunarg{sign.only} argument to \verb:TRUE:. The \Rfunction{draw} function returns a legend to the graph, relating the numbers appearing in the plot to the GO terms. This is useful when using \Rfunction{draw} in non-interactive mode.
\begin{figure}[!ht]
<>=
draw(res, interactive=TRUE)
legend <- draw(res)
@
<>=
draw(res)
@
\end{figure}

\subsection{The Broad gene sets}

A third frequently used database is the collection of curated gene sets maintained by the Broad Institute. The sets are only available after registration at \verb&http://www.broad.mit.edu/gsea/downloads.jsp#msigdb&. To use the Broad gene sets, download the file \verb:msigdb_v.2.5.xml:, which contains all sets. A convenient function to read the xml file into \verb:R: is \Rfunction{getBroadSets} from the \Rpackage{GSEABase} package.
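The call below assumes that the \Rpackage{GSEABase} package has been installed from Bioconductor and loaded first, e.g.\ with
<>=
library(GSEABase)
@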
Once downloaded and read, the \Rfunction{gtBroad} function can be used to analyze these gene sets using the global test.
<>=
broad <- getBroadSets("your/path/to/msigdb_v.2.5.xml")
@
The examples in this vignette are displayed without results, because we cannot include the \verb:msigdb_v.2.5.xml: file in the \Rpackage{globaltest} package. To test a single Broad set, e.g.\ the chromosomal location chr5q33, use
<>=
gtBroad(ALL.AML, Golub_Train, id = "chr5q33", collection=broad)
@
The function automatically maps the gene set to the probe identifiers from the annotation package of the \Rpackage{hu6800} Affymetrix chip; the reference to this annotation package is contained in the \verb:Golub_Train: \Rclass{ExpressionSet} object. If \Rclass{ExpressionSet} objects are not used, the name of the annotation package can be supplied in the \Rfunarg{annotation} argument of \Rfunction{gtBroad}.

Annotation packages are not always available for all microarray types. Therefore, a general Entrez-based annotation package is available for many organisms. This general annotation package may be substituted for a specific annotation package if a mapping from probe(set) ids to Entrez Gene ids is given (as a list or as a vector). For the Golub data we obtain this mapping from the \Rpackage{hu6800.db} package.
<>=
eg <- as.list(hu6800ENTREZID)
gtBroad(ALL.AML, Golub_Train, id = "chr5q33", collection=broad, probe2entrez = eg, annotation="org.Hs.eg.db")
@
See \verb:www.bioconductor.org: for the names of the organism specific packages.

If more than one Broad set is tested, multiple testing corrected p-values are automatically provided. The default multiple testing method is Holm's, but others are available through the \Rfunarg{multtest} argument. See also the \Rfunction{p.adjust} function, described in Section \ref{multtest}. The results are sorted in order of increasing p-values (using the \Rfunction{sort} method), unless the \Rfunarg{sort} argument of \Rfunction{gtBroad} is set to \verb:FALSE:.
<>=
gtBroad(ALL.AML, Golub_Train, id=c("chr5q33","chr5q34"), multtest="BH", collection=broad)
@
The Broad collection contains five categories
\begin{itemize}
\item[c1] positional gene sets
\item[c2] curated gene sets
\item[c3] motif gene sets
\item[c4] computational gene sets
\item[c5] GO gene sets
\end{itemize}
To test all gene sets from a certain category, use
<>=
gtBroad(ALL.AML, Golub_Train, category="c1", collection=broad)
@

\section{Concept profiles} \label{concept}

A drawback of the three gene set databases above is that they have a hard criterion for membership: each gene either belongs to the set or it does not. In reality, however, the association of genes with biological concepts is gradual. Some genes are more central to a certain biological process than others, and for some genes the association with a process is more certain or well-documented than for others. To take this into account, databases can be used that contain associations between genes and concepts, rather than simply gene sets. One of these is the Anni tool, available from \verb&http://www.biosemantics.org/anni&. A function to test concepts from Anni is given in the function \Rfunction{gtConcept}. Like \Rfunction{gtBroad}, the function \Rfunction{gtConcept} requires the user to download files that are not available within \verb:R:, but can be found on \verb:www.biosemantics.org/weightedglobaltest:.
The examples for \Rfunction{gtConcept} in this vignette are displayed without results, because the concept files are too large to be included in the \Rpackage{globaltest} package. To test a certain collection, for example \verb:Body System.txt:, we say
<>=
gtConcept(ALL.AML, Golub_Train, conceptmatrix="Body System.txt")
@
This automatically tests all concepts included in the file. Note that the files \verb"conceptID2name.txt" and \verb"entrezGeneToConceptID.txt" must also be downloaded from the same website for the function to work. The function automatically maps the gene set to the probe identifiers from the annotation package of the \Rpackage{hu6800} Affymetrix chip; the reference to this annotation package is contained in the \verb:Golub_Train: \Rclass{ExpressionSet} object. If \Rclass{ExpressionSet} objects are not used, the name of the annotation package can be supplied in the \Rfunarg{annotation} argument of \Rfunction{gtConcept}.

Annotation packages are not always available for all microarray types. Therefore, a general Entrez-based annotation package is available for many organisms. This general annotation package may be substituted for a specific annotation package if a mapping from probe(set) ids to Entrez Gene ids is given (as a list or as a vector). For the Golub data we obtain this mapping from the \Rpackage{hu6800.db} package.
<>=
eg <- as.list(hu6800ENTREZID)
gtConcept(ALL.AML, Golub_Train, conceptmatrix="Body System.txt", probe2entrez = eg, annotation="org.Hs.eg.db")
@
The \Rfunction{gtConcept} function uses the weighted version of the global test (see also Section \ref{weighting}), with weights given by each gene's association with a concept. An argument \Rfunarg{threshold} sets weights below the given threshold to zero, which limits computation time. The \verb:#Cov: column in the results output gives the number of probes with non-zero weight. A further argument, \Rfunarg{share}, determines what to do with genes that have multiple probes. If \Rfunarg{share} is set to \Robject{TRUE}, the weight for each probe is set to the weight of the gene divided by the number of probes of that gene, making the probes share the total weight allocated to the gene. If \Rfunarg{share} is set to \Robject{FALSE}, each probe gets the full weight allocated to the gene.

Multiple testing corrected p-values are automatically provided by \Rfunction{gtConcept}. The default multiple testing method is Holm's, but others are available through the \Rfunarg{multtest} argument. See also the \Rfunction{p.adjust} function, described in Section \ref{multtest}. The results are sorted in order of increasing p-values (using the \Rfunction{sort} method), unless the \Rfunarg{sort} argument of \Rfunction{gtConcept} is set to \verb:FALSE:.
<>=
gtConcept(ALL.AML, Golub_Train, conceptmatrix="Body System.txt", multtest="BH")
@

\section{Gene and sample plots}

\subsection{Visualizing features}

The covariates (or ``features'') plot may be used to great effect to investigate to which individual probes or genes, or to which subsets of the gene set, a significant result for a gene set may be attributed.
\begin{figure}[!ht]
<>=
res <- gtKEGG(ALL.AML, Golub_Train, id = "04110")
features(res, alias=hu6800SYMBOL)
@
\end{figure}
The details of the \Rfunction{features} plot are described in Section \ref{covariates}. The \Rfunarg{alias} argument is useful to replace the probe identifiers with more familiar gene symbols. The black line in the \Rfunction{features} plot represents the significant subgraph of the clustering tree.
To find the leaf nodes that characterize the graph, use the function \Rfunction{leafNodes}.
<>=
ft <- features(res, alias=hu6800SYMBOL)
leafNodes(ft)
@
It may happen that the leaf nodes of the significant graph are not individual features, but sets of features higher up in the clustering graph. Use the \Rfunction{subsets} method to find which features belong to such a node.
<>=
subsets(leafNodes(ft))
@
It is possible to plot only the significant subtree with the \Rfunarg{zoom} argument. This is especially useful if the set of features is large.
\begin{figure}[!ht]
<>=
res <- gtKEGG(ALL.AML, Golub_Train, id = "04110")
features(res, alias=hu6800SYMBOL, zoom=TRUE)
@
\end{figure}
The \Rfunction{extract} function can be useful to get information on the individual features, and the \Rfunarg{plot} argument can be used to suppress plotting.
<>=
ft <- features(res, alias=hu6800SYMBOL, plot=FALSE)
extract(ft)
@
When testing many GO or KEGG terms it can be convenient to make features plots for all tested gene sets at once, writing all plots to a pdf file.
<>=
res_all <- gtKEGG(ALL.AML, Golub_Train)
features(res_all[1:5], pdf="KEGGcov.pdf", alias=hu6800SYMBOL)
@

\subsection{Visualizing subjects}

Similarly, the subjects plot, described in Section \ref{subjects}, can be used to investigate which subjects are similar in terms of their expression signature to other subjects with the same response variable, and which deviate from that pattern. In the \Rfunction{subjects} diagnostic plot, the subjects associated with strong evidence for the association between the response and the gene expression profile of the pathway have low p-values (tall bars), whereas the subjects with high p-values have weak or even contrary evidence. The most interesting subjects plot to look at is usually the subjects plot for the global test on all genes. From the figure, in this case, we note that the expression profile of the AML subjects seems more homogeneous than that of the ALL subjects: the latter group tends to be less coherent overall, and to contain more outlying subjects.
\begin{figure}[!ht]
<>=
res <- gt(ALL.AML, Golub_Train)
subjects(res)
@
\end{figure}
Just as with the \Rfunction{covariates} plot, \Rfunction{subjects} plots can be called on many gene sets at once, e.g.\ the top 25 pathways, and the results written to a pdf file.
<>=
res_all <- gtKEGG(ALL.AML, Golub_Train)
subjects(res_all[1:25], pdf="KEGGsubj.pdf")
@

\section{Survival data} \label{survival}

The examples in this chapter so far were all in a classification context, in which the response category had two possible values, and the logistic regression model was used. The \Rpackage{globaltest} package is not limited to this response type, but can also handle multi-category response (using a multinomial logistic regression model), continuous response (using a linear regression model), count data (using a Poisson regression model), and survival data (using the Cox proportional hazards model). See Section \ref{models} for more details. The multi-category, linear and count data versions are called in exactly the same way as the two-category one. The \Rfunction{gt} function will try to determine the model from the input, but the user can override any automatic choice by specifying the \Rfunarg{model} argument. For survival data, the input format is similar to the one used by the \Rpackage{survival} package.
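As a minimal sketch (using simulated data that are not part of this vignette; the object names \verb:Xsim:, \verb:time: and \verb:status: are only illustrative), a survival response is supplied to \Rfunction{gt} through \Rfunction{Surv} from the \Rpackage{survival} package. Note that the matrix is given in the genomics orientation (genes in the rows), because \verb:gt.options(transpose=TRUE): was set above.
<>=
library(survival)
set.seed(1)
## simulated toy data: 100 "genes" (rows) measured on 30 subjects (columns)
Xsim <- matrix(rnorm(100 * 30), nrow = 100,
               dimnames = list(paste0("gene", 1:100), NULL))
time <- rexp(30)               # follow-up times
status <- rbinom(30, 1, 0.7)   # 1 = event, 0 = censored alive
gt(Surv(time, status), Xsim, model = "cox")
@
The real-data example below follows the same pattern, with the time and status variables taken from the phenotype data of an \Rclass{ExpressionSet}.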
In the \verb:michigan: data set \citep{Beer2002} from the \Rpackage{lungExpression} package, for example, the survival time is coded in a time variable \verb:TIME..months.:, and a status variable \verb:death:, for which 1 indicates that the patient passed away at the recorded time point, and 0 that the patient was withdrawn alive. To test for an overall association between the gene expression profile and survival, we use
<>=
library(lungExpression)
data(michigan)
gt(Surv(TIME..months., death==1), michigan)
@

\section{Comparative proportions}

In some cases it can be of interest not only to know whether a certain gene set is significantly associated with a phenotype, but also whether it is exceptionally associated with the phenotype for a gene set of its size in the data set under study. This is a so-called competitive view on gene set testing. See \cite{Goeman2007} for issues involved with this competitive view.

It is possible to use \Rpackage{globaltest} for such competitive gene set exploration using the function \Rfunction{comparative}. For each gene set tested, this function calculates the proportion of randomly sampled gene sets of the same size as the tested gene set that have an equal or larger global test p-value. This comparative proportion can be used as a diagnostic for the test results. Gene sets with small comparative proportions are exceptionally significant in comparison to random gene sets of the same size in the data set. The comparative proportion is a diagnostic that conveys additional information. It should not be interpreted as a p-value in the usual sense.
<>=
res <- gtKEGG(ALL.AML, Golub_Train, id = "04110")
comparative(res)
@
The number of random gene sets of each size that are sampled can be controlled with the argument \Rfunarg{N} (default 1000). The argument \Rfunarg{zscores} (default: \verb:TRUE:) controls whether the comparison between the test results of the gene set and its random competitors is based on the p-values or on the z-scores of the test.

\chapter{Goodness-of-Fit Testing}\label{gof testing}

\section{Introduction}

Another application of the global test is in goodness-of-fit testing for regression models. Generalized linear models, while flexible in terms of the supported response distributions, rely on rather strong assumptions, such as linearity of the effect of the covariates on the predictor and independence of the observations. The Cox regression model, even though it leaves the baseline hazard unspecified, relies on the quite restrictive proportional hazards assumption. Therefore, in practical regression problems, lack of fit comes in all shapes and sizes:
\begin{itemize}
\item Unit- or cluster-specific heterogeneity may exist;
\item The effect of continuous covariates on the predictor may be of non-linear form;
\item Interactions between covariates may be missed or be more complex;
\item The proportional hazards assumption may not hold.
\end{itemize}
Distinguishing the different types of lack of fit is of practical importance: if we find evidence against the model, we generally also want to know why the model does not fit. In this chapter we introduce a goodness-of-fit testing approach based on the global test. It requires the specification of an alternative model, which identifies the type of lack of fit the test is directed against. Various types of lack of fit are treated within the same framework and many existing tests can be considered as special cases.
Suppose that we are concerned with the adequacy of some regression model \verb:Y ~ X:, where \verb:X: represents the null design matrix. The alternative model can be cast into the generic form \verb:Y ~ X + Z:, where the choice of \verb:Z: depends on the type of lack of fit under investigation. Once \verb:Z: has been chosen, the global test is applied for testing \verb:Y ~ X: against \verb:Y ~ X + Z:. Sometimes a reparameterization of the alternative model is necessary. The required parameterization is either a penalized regression model with a ridge penalty on the coefficients associated with \verb:Z: or a mixed effect model where the coefficients associated with \verb:Z: are i.i.d.\ random effects. The examples listed below illustrate testing against several types of lack of fit. We have not attempted to write an exhaustive list, but rather to show how different choices of \verb:Z: accommodate various types of lack of fit.

\section{Heterogeneity}

The data set \verb"cloth" from the \Rpackage{boot} package gives the number of flaws in rolls of textile fabric of varying length \citep{Bissell1972}. We consider a Poisson log-linear model with the logarithm of the roll length as covariate. However, we may allow for the possibility of extra-Poisson variation by using a mixed model with i.i.d.\ random effects, one for each observation. Here \Robject{Z} is specified as the identity matrix. For testing against overdispersion, use
<>=
require("boot")
data(cloth)
Z <- matrix(diag(nrow(cloth)), ncol = nrow(cloth), dimnames = list(NULL, 1:nrow(cloth)))
gt(y ~ log(x), alternative = Z, data = cloth, model = "poisson")
@
The null hypothesis of no overdispersion can be rejected at the significance level of 5\%.

The data set \verb"rats" comes from a carcinogen experiment using 150 female rats, 3 each from 50 litters \citep{Mantel1977}. One rat from each litter was injected with a powerful carcinogen, and the time to tumor development, measured in weeks, was recorded. It is conceivable that the risk of tumor formation may depend on the genetic background or the early environmental conditions shared by siblings within litters, but differing between litters. Thus, intra-litter correlation in time to tumor appearance may exist. An alternative model allowing intra-litter correlations is a mixed model with i.i.d.\ random intercepts representing the litter effect. Here the matrix \Robject{Z} is specified as an indicator matrix in which each row is a vector of zeros except for a 1 in the position that indicates which litter the rat is from. For testing the hypothesis of no intra-litter correlation, use
<>=
library("survival")
data(rats)
nlitters <- length(unique(rats$litter))
Z <- matrix(NA, dim(rats)[1], nlitters, dimnames=list(NULL, 1:nlitters))
for (i in 1:nlitters) Z[,i] <- (rats[,1]==i)*1
gt(Surv(time,status)~rx, alternative=Z, data=rats, model="cox")
@
The null hypothesis of no intra-litter correlation cannot be rejected at the significance level of 5\%.

\section{Non-linearity}

In many applications, the assumption of a linear dependence of the response on covariates is inappropriate. Semiparametric models provide a flexible alternative for detecting non-linear covariate effects. For a single continuous covariate \verb:X:, the model \verb:Y~X: is extended to \verb:Y~X+s(X):, where \verb:s(X): is an unspecified smooth function.

\subsection{P-Splines}

One increasingly popular idea to represent \verb:s(X): is the P-splines approach, introduced by \cite{Eilers1996}.
In this approach \verb:s(X): is replaced by a B-spline basis \verb:Z:, giving the alternative model \verb:Y~X+Z:, where the coefficients associated with \verb:Z: are penalized to guarantee sufficient smoothness. The function \Rfunction{gtPS} can be used to define P-splines. We need to specify the following arguments: i) \Rfunarg{bdeg}, the degree of the B-spline basis, ii) \Rfunarg{nint}, the number of intervals determined by equally-spaced knots placed on the \verb:X:-axis, and iii) \Rfunarg{pord}, the order of the differences indicating the type of penalty imposed on the coefficients. The \Rfunarg{bdeg} and \Rfunarg{nint} arguments are used to construct a B-spline basis \verb:Z: (default values are \verb:bdeg=3: and \verb:nint=10:). The order of differences \Rfunarg{pord} deserves more attention (default value is \verb:pord=2:). Remember that we should obtain a ridge penalty on the coefficients associated with \verb:Z:. This is true with \verb:pord=0:, but in the world of P-splines it is common to use a roughness penalty based on differences of adjacent B-spline coefficients, for instance second order differences (\verb:pord=2:). In this case we have to reparameterize the alternative model by decomposing \verb:Z: into \verb:U: and \verb:P:. This gives the alternative model \verb:Y~X+U+P:, where the coefficients associated with \verb:U: are unpenalized whereas the coefficients associated with \verb"P" are penalized by a ridge penalty. However, the global test is not meant for testing the unpenalized coefficients; it is concerned only with the penalized coefficients. To get around this and test only for the penalized coefficients to be zero, we have to make sure that the columns of \verb:U: span a subspace of the column space of \verb:X:, so that \verb:U: can be absorbed into \verb:X:. Otherwise, we are inadvertently changing the null hypothesis, or equivalently, we are considering the null model \verb:Y~X+U:.

We can best illustrate this with a simple example: we add some Gaussian noise to the second data set reported in \cite{Anscombe1973}, where \verb:Y: has a quadratic relation with \verb:X:. To test \verb:Y~X: against \verb:Y~X+s(X): with default values, use:
<>=
data(anscombe)
set.seed(0)
X <- anscombe$x2
Y <- anscombe$y2 + rnorm(length(X), 0, 3)
gtPS(Y~X)
@
The same result can be obtained by using \Rfunction{bbase} to construct the B-spline basis \verb:Z: and \Rfunction{reparamZ} to get the penalized part \verb:P: to be plugged into \Rfunction{gt}:
<>=
Z <- bbase(X, bdeg=3, nint=10)
P <- reparamZ(Z, pord=2)
gt(Y~X, alternative=P)
@
A quick way to check whether \verb:U: is absorbed into the null design matrix or not is to fit the augmented null model and see if all the coefficients associated with \verb:U: are not defined because of singularities:
<>=
U <- reparamZ(Z, pord=2, returnU=TRUE)$U
lm(Y~X+U)$coefficients
@
The function \Rfunction{gtPS} allows multiple values to be specified for its arguments:
<>=
gtPS(Y~X, pord=list(Z=0, P=2))
@
However, the result is not conclusive, because \verb:pord=2: detects the deviation from linearity at the significance level of 5\%, whereas \verb:pord=0: does not.
To assess the global significance, set \verb"robust=TRUE":
<>=
rob <- gtPS(Y~X, pord=list(Z=0, P=2), robust=TRUE)
rob@result
@
Another way to obtain a global result is to combine the matrices \verb:Z: (corresponding to \verb:pord=0:) and \verb:P: (corresponding to \verb:pord=2:) into one overall matrix:
<>=
comb <- gt(Y~X, alternative=cbind(Z, P))
comb@result
@
However, this may not be satisfactory because the component matrices \verb:Z: and \verb:P: do not contribute equally to the test statistic. In contrast, the \Rfunarg{robust} argument assigns equal weight to each component:
<>=
colrange <- list(Z=1:ncol(Z), P=(ncol(Z)+1):(ncol(Z)+ncol(P)))
sapply(list(combined=comb, robust=rob), function(x){sapply(colrange, function(y){sum(weights(x)[y])/sum(weights(x))})})
@

\subsection{Generalized additive models}

With multiple covariates, generalized additive models (GAMs) augment the linear predictor \verb:~X1+X2+...: by a sum of smooth terms \verb:s(X1)+s(X2)+...:. A classic example data set for GAMs is \verb:kyphosis:, representing observations on 81 children undergoing corrective surgery of the spine. For testing against non-linearity, the logistic model \verb:Kyphosis~Age+Number+Start: is compared to the GAM \verb:Kyphosis~...+s(Age)+s(Number)+s(Start):
<>=
require("rpart")
data("kyphosis")
fit0 <- glm(Kyphosis~., data = kyphosis, family="binomial")
res <- gtPS(fit0)
res@result
@
From the test result we can suspect that there is a non-linear effect in at least one covariate. To list the smooth terms specified in the alternative model, use:
<>=
sterms(res)
@
A follow-up question is which covariates exhibit non-linearity. An advantage of having a specified alternative is that the same alternative model that was used in the test can be fitted to the data, which helps to decide what modifications of the model may be appropriate. We use the package \Rpackage{penalized} to perform ridge regression estimation with the amount of shrinkage determined by the tuning parameter \Rfunarg{lambda2}. We set \Rfunarg{lambda2} equal to 0.086, the value maximizing the cross-validated likelihood. To get the alternative design matrix used in the test, set the argument \Rfunarg{returnZ} to \verb"TRUE":
<>=
require("penalized")
Z <- gtPS(fit0, returnZ=TRUE)$Z
fit1 <- penalized(Kyphosis, penalized=~ Z, unpenalized=~Age+Number+Start, data = kyphosis, model="logistic", lambda2 = 0.086, trace=FALSE)
@
\begin{figure}[h!]
\begin{center}
<>=
bd=3
ni=10
po=2
covnames <- names(kyphosis)[-1]
d <- bd+ni-po
gammas <- fit1@penalized
betas <- fit1@unpenalized
l <- length(covnames)
cd <- c(0, cumsum(rep(d, l)))
op <- par(mfrow = c(2, 2))
for (i in 1:3){
  x <- kyphosis[,covnames[i]]
  sx <- sort(x, index.return=T)
  ind <- vector("list", l)
  ind[[i]] <- (cd[i]+1):(cd[i+1])
  plot(sx$x, (Z[sx$ix, ind[[i]]]%*%gammas[ind[[i]]]), type="b", ylim=c(-1.7,0.7),
       xlab=covnames[i], ylab=paste("s(", covnames[i], ")", sep=""))
  lines(abline(h=0, lty="dotted"))
  rug(x)
}
par(op)
@
\end{center}
\caption{Kyphosis data: component smooth terms.} \label{fig:kyphosis}
\end{figure}
Figure \ref{fig:kyphosis} shows the component smooth terms of the fitted GAM. From the plots it seems that all the covariates have a quadratic pattern, though for \verb:Number: it is much less pronounced than for the other two variables.
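As an informal follow-up (a sketch that goes beyond the original analysis; the object \verb:fit0q: is introduced here only for illustration), the quadratic impression from Figure \ref{fig:kyphosis} can be checked by comparing the linear null model with a model that adds quadratic terms:
<>=
## hedged sketch: compare AIC of the linear null model and a quadratic extension
fit0q <- glm(Kyphosis ~ Age + I(Age^2) + Number + I(Number^2) + Start + I(Start^2),
             data = kyphosis, family = "binomial")
AIC(fit0, fit0q)
@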
The argument \Rfunarg{covs} can be used to select a subset of the covariates, and testing for non-linearity is done for that subset only:
<>=
gtPS(fit0, covs=c("Age","Start"))
@
Because \verb:Number: and \verb:Start: contain many ties, one can modify the number of intervals for those covariates:
<>=
gtPS(fit0, covs=c("Age","Number","Start"), nint=list(a=5, b=c(5,1,1)), pord=0)
@
With \verb:pord=0:, the choice of \verb:nint: is crucial: a value that is too small may not be flexible enough to capture the variability of the data, while one that is too large tends to overfit the data. In contrast, higher-order penalties guarantee sufficient smoothness and are less affected by the choice of \verb:nint:.

An alternative GAM construction is to build and concatenate each model component like building blocks:
<>=
covs = c("Age","Number","Start")
bd = c(3,3,3); ni = c(10,10,10); po = c(2,2,2); cs <- c(0, cumsum(bd+ni-po))
X0 <- model.matrix(fit0)[,]
combZ <- do.call(cbind, lapply(1:length(covs), function(x){reparamZ(bbase(kyphosis[,covs[x]], nint=ni[x], bdeg=bd[x]), pord=po[x])}))
comb <- gt(Kyphosis~., alternative=combZ, data = kyphosis, model="logistic")
comb@result
@
However, the model components may not contribute equally to the test statistic:
<>=
range <- lapply(1:length(covs), function(x){(cs[x]+1):(cs[x+1])})
names(range) <- covs
sapply(range, function(x){sum(weights(comb)[x])/sum(weights(comb))})
@
To assign equal weight to each component, as \Rfunction{gtPS} does, use the function \Rfunction{reweighZ}:
<>=
rwgtZ <- do.call(cbind, lapply(1:length(covs), function(x){reweighZ(reparamZ(bbase(kyphosis[,covs[x]], nint=ni[x], bdeg=bd[x]), pord=po[x]), fit0)}))
rwgt <- gt(Kyphosis~., alternative=rwgtZ, data = kyphosis, model="logistic")
sapply(range, function(x){sum(weights(rwgt)[x])/sum(weights(rwgt))})
@

\section{Non-linear and missed interactions}

Suppose we are modeling the dependence of the response on several covariates, expressed as \verb:Y~X1+X2+...:. For testing against the alternative that any non-linearities or interaction effects have been missed, one can consider the model \verb:Y~X1+X2+...+s(X1,X2,...):, where \verb:s(): is an unspecified multi-dimensional smooth function.

\subsection{Kernel smoothers}

Kernel smoothers have advantages over P-splines for constructing multi-dimensional smooth terms, even though tensor products of B-splines can still be used for low dimensions. The data \verb:LakeAcidity: concern 112 lakes in the Blue Ridge Mountains area. Of interest is the dependence of the water acidity on the geographic location (latitude and longitude) and the calcium concentration (on the log10 scale). To test \verb:ph~log10(cal)+lat+lon: against \verb:ph~...+s(cal,lat,lon):, use:
<>=
library(gss)
data(LakeAcidity)
fit0 <- lm(ph~log10(cal)+lat+lon, data=LakeAcidity)
res <- gtKS(fit0)
res@result
sterms(res)
@
The smoothing matrix \verb:Z: is defined by a distance measure \Rfunarg{metric}, a kernel shape \Rfunarg{kernel} and a bandwidth \Rfunarg{quant}, expressed as the percentile of the distribution of distances between observations, which controls the amount of smoothing. If the argument \Rfunarg{termlabels} is set to \verb:TRUE:, the smoothing term \verb:s(log10(cal),lat,lon): is obtained.
<>=
gtKS(fit0, quant=seq(.01,.99,.02), data=LakeAcidity, termlabels=TRUE, robust=T)
@
\begin{figure}[h!]
\begin{center}
<>=
p <- sapply(seq(.01,.99,.02), function(x){gtKS(fit0, termlabels=T, quant=x, data=LakeAcidity)@result[,1]})
plot(seq(.01,.99,.02), p, type="s", xlab="quant", ylab="p-value", ylim=c(0,.5), xlim=c(0,1))
abline(h=gtKS(ph~log10(cal)+lat+lon, quant=seq(.01,.99,.02), data=LakeAcidity, robust=T, termlabels=T)@result[,1], lty="dotted")
@
\end{center}
\caption{Lake Acidity data: significance trace.} \label{fig:trace}
\end{figure}
The choice of the bandwidth may be crucial: the plot in Figure \ref{fig:trace} illustrates the influence of \verb"quant" (from .01 to .99 with increments of .02) on the significance of the test. The result seems conclusive, since the p-value stays mostly under the conventional 5\% level. \cite{Azzalini1993} considered the same plot, which they refer to as the `significance trace'. The global significance (dotted line) is obtained by setting \verb:robust=TRUE:.

Latitude and longitude are included in the model to allow for geographical effects in the pattern of water acidity. However, it is less natural to include these terms separately, since they define a two-dimensional coordinate system. For testing whether the interaction between latitude and longitude is linear or is of a more complex non-linear form, a two-dimensional interaction surface \verb"s(lat,lon)" can be constructed by a tensor product of univariate P-splines penalized by a Kronecker sum of penalties. To test \verb"ph~lat+lon+lat:lon" against \verb"ph~...+s(lat,lon)", use:
<>=
fit0 <- lm(ph~lat*lon, data=LakeAcidity)
res <- gtPS(fit0, covs=c("lat","lon"), interact=TRUE, data=LakeAcidity)
res@result
sterms(res)
@
Figure \ref{fig:lake} displays the fitted alternative model, which suggests a non-linear interaction between latitude and longitude.
\begin{figure}[h!]
\begin{center}
<>=
Z <- gtPS(fit0, returnZ=TRUE, interact=TRUE)$Z
fit1 <- penalized(ph, penalized=Z, unpenalized=~lat*lon, data = LakeAcidity, lambda2 = 1, trace=FALSE)
lon <- seq(79,85,length=50)
lat <- seq(33,38,length=50)
new <- data.frame(expand.grid(lon=lon, lat=lat))
Znew <- btensor(new, bdeg=c(3,3), nint=c(10,10), pord=c(2,2))
fitted <- matrix(predict(fit1, unpenalized=~lat*lon, penalized=Znew, data=new)[,1], 50, 50)
persp(lon, lat, fitted, theta=-25)
@
\end{center}
\caption{Lake Acidity data: fitted alternative model.} \label{fig:lake}
\end{figure}
To test against non-linear main effects or non-linear interactions, we can consider the alternative \verb"ph~...+s(lat)+s(lon)+s(lat,lon)". Each model component can be constructed and combined like building blocks. The function \Rfunction{bbase} in combination with \Rfunction{reparamZ} can be used for constructing \verb"s(lat)" and \verb"s(lon)", whereas \Rfunction{btensor} can be used for constructing \verb"s(lat,lon)" as a tensor product of P-splines (reparameterized according to a Kronecker sum of penalties). Finally, \Rfunction{reweighZ} can be used to give each component the same contribution to the test statistic:
<>=
Z1 <- reweighZ(reparamZ(bbase(LakeAcidity$lat, bdeg=3, nint=10), pord=2), fit0)
Z2 <- reweighZ(reparamZ(bbase(LakeAcidity$lon, bdeg=3, nint=10), pord=2), fit0)
Z12 <- reweighZ(btensor(cbind(LakeAcidity$lat, LakeAcidity$lon), bdeg=c(3,3), nint=c(10,10), pord=c(2,2)), fit0)
gt(ph~lat*lon, alternative=cbind(Z1,Z2,Z12), data=LakeAcidity)
@

\subsection{Varying-coefficients models}

Sometimes the linear interaction \verb"X:F" between a continuous covariate \verb"X" and a factor \verb"F" is not appropriate, and a non-linear interaction \verb"s(X):F" may be preferred, letting the effect of \verb"F" vary smoothly over the range of \verb"X".
Let us look at the \verb:nox: data as an example. Ethanol fuel was burned in a single cylinder engine. For various settings of the engine compression \verb"comp" and the equivalence ratio \verb"equi", the emissions of nitrogen oxides \verb"nox" were recorded. To test whether the model \verb"nox~equi+comp+equi:comp" requires a non-linear form in \verb:equi:, that is, to test against the varying-coefficients alternative model \verb"nox~...+s(equi)+s(equi):comp", use:
<>=
data(nox)
sE <- bbase(nox$equi, bdeg=3, nint=10)
sEbyC <- model.matrix(~0+sE:factor(comp), data=nox)[,]
gt(nox~equi*factor(comp), alternative=cbind(sE, sEbyC), data=nox)
@

\subsection{Missed interactions}

Consider the \verb:Boston: data for 506 census tracts of Boston from the 1970 census. Suppose we want to predict the price of a house based on various attributes like number of rooms, distance to employment, and neighborhood type. These covariates may interact, e.g.\ the number of rooms might not be as important if the neighborhood has lots of crime. For checking whether any two-way linear interaction effect has been missed, use:
<>=
library(MASS)
data(Boston)
res <- gtLI(medv~., data=Boston)
res@result
round(weights(res)/sum(weights(res)), 4)
@
To prevent very unbalanced contributions of the interaction terms to the test statistic, we recommend rescaling the covariates to unit standard deviation by setting \verb"standardize=TRUE" or centering and scaling the data:
<<>>=
gtLI(medv~., data=Boston, standardize=T)
gtLI(medv~., data=scale(Boston))
@

\section{Non-proportional hazards}

Different extensions of the Cox model have been proposed to deal with non-proportional hazards. One possibility is the addition of an interaction term of the covariates with a time function, leading to time-varying effects of the covariates. This allows the effect of the covariates to change over time, such as the effect of a treatment that might wash away. However, such time-varying effects are not yet implemented in the function \Rfunction{gt} (but are likely to be in the future).

\addcontentsline{toc}{chapter}{References} \label{references}
\bibliography{GlobalTest}

\end{document}