\documentclass[a4paper,12pt]{article}
%\VignetteIndexEntry{Manual for the gaga library}
%\VignettePackage{gaga}
\usepackage{amsmath}   % need for subequations
\usepackage{amssymb}   % useful mathematical symbols
\usepackage{bm}        % needed for bold greek letters and math symbols
\usepackage{graphicx}  % need for PS figures
%\usepackage{verbatim} % useful for program listings
%\usepackage{color}    % use if color is used in text
\usepackage{hyperref}  % use for hypertext links, including those to external documents and URLs
\usepackage{natbib}    % number and author-year style referencing
%\usepackage{epsf}
%\usepackage{lscape}
%\bibpunct{(}{)}{;}{a}{,}{,}
\pagestyle{empty}      % use if page numbers not wanted

\begin{document}

\title{Manual for the R \texttt{gaga} package}
\author{David Rossell \\
\small{Department of Bioinformatics \& Biostatistics} \\
\small{IRB Barcelona} \\
\small{Barcelona, Spain.} \\
\texttt{rosselldavid@gmail.com}}
\date{} %comment to include current date

\maketitle

Here we illustrate several uses of the package \texttt{gaga}, including simulation, differential expression analysis, class prediction and sample size calculations.

In Section \ref{sec:intro} we review the GaGa and MiGaGa models. In Section \ref{sec:simulate} we simulate gene expression data, which we use to fit the GaGa model in Section \ref{sec:modelfit}. Diagnostics for model goodness-of-fit are presented in Section \ref{sec:gof}. Section \ref{sec:degenes} shows how to find differentially expressed genes, Section \ref{sec:fc} how to obtain fold change estimates, and Section \ref{sec:classpredict} how to classify samples into groups. Finally, in Section \ref{sec:design} we perform fixed and sequential sample size calculations.

\section{Introduction}
\label{sec:intro}

\cite{newton:2001} and \cite{kendziorski:2003} introduced the Gamma-Gamma model to analyze microarray data, an elegant and parsimonious hierarchical model that allows for the borrowing of information between genes. \cite{rossell:2009} showed that the assumptions of this model are too simplistic, which resulted in a rather poor fit to several real datasets, and developed two extensions of the model: GaGa and MiGaGa. The \texttt{gaga} library implements the GaGa and MiGaGa generalizations, which can be used both to find differentially expressed genes and to predict the class of a future sample ({\it e.g.} given the mRNA measurements for a new patient, predict whether the patient has cancer or is healthy).

We now briefly outline the \texttt{GaGa} and \texttt{MiGaGa} models. Let $x_{ij}$ be the expression measurement for gene $i$ in array $j$, and let $z_j$ indicate the group to which array $j$ belongs ({\it e.g.} $z_j=0$ for normal cells and $z_j=1$ for cancer cells). The GaGa model envisions the observations as arising from a gamma distribution, {\it i.e.} $x_{ij} \sim \mbox{Ga}(\alpha_{i,z_j},\alpha_{i,z_j}/\lambda_{i,z_j})$ ($\lambda_{i,z_j}$ is the mean), where $\alpha_{i,z_j}$ and $\lambda_{i,z_j}$ arise from a gamma and an inverse gamma distribution, respectively:
\begin{align}
\lambda_{i,k} | \delta_i,\alpha_0,\nu &\sim \mbox{IGa}(\alpha_0,\alpha_0/\nu) \mbox{, indep. for }i=1 \ldots n \nonumber \\
\alpha_{i,k} | \delta_i,\beta,\mu & \sim \mbox{Ga}(\beta,\beta/\mu) \mbox{, indep. for }i=1 \ldots n \nonumber \\
\delta_i|\bm{\pi} & \sim \mbox{Mult}(1,\bm{\pi}) \mbox{, indep. for } i=1 \ldots n.
\label{eq:modelgengg}
\end{align}

$\delta_1 \ldots \delta_n$ are latent variables indicating what expression pattern each gene follows (see Section \ref{sec:modelfit} for more details). For example, if there are only two groups, $\delta_i$ indicates whether gene $i$ is differentially expressed or not. In principle, both the shape and mean parameters are allowed to vary between groups, and $\delta_i$ compares both parameters between groups ({\it i.e.} the GaGa model allows one to compare not only mean expression levels but also the shape of the distribution between groups). However, the \texttt{gaga} library also implements a particular version of the model which assumes that the shape parameter is constant across groups, {\it i.e.} $\alpha_{i,k}=\alpha_i$ for all $k$. The coefficient of variation of the gamma distribution is equal to the inverse square root of the shape parameter (for $x \sim \mbox{Ga}(\alpha,\alpha/\lambda)$ we have $\mbox{E}(x)=\lambda$ and $\mbox{Var}(x)=\lambda^2/\alpha$, so that $\mbox{CV}=\sqrt{\mbox{Var}(x)}/\mbox{E}(x)=1/\sqrt{\alpha}$), and hence assuming constant $\alpha_{i,k}$ is equivalent to assuming a constant CV across groups. In most routines the user can choose the constant CV model with the option \texttt{equalcv=TRUE} (the default), and the varying CV model with the option \texttt{equalcv=FALSE}.

The Bayesian model is completed by specifying priors on the hyper-parameters that govern the hierarchy:
\begin{align}
\alpha_0 &\sim \mbox{Ga}(a_{\alpha_0},b_{\alpha_0}); \nu \sim \mbox{IGa}(a_{\nu},b_{\nu}) \nonumber \\
\beta &\sim \mbox{Ga}(a_{\beta},b_{\beta}); \mu \sim \mbox{IGa}(a_{\mu},b_{\mu}) \nonumber \\
\bm{\pi} &\sim \mbox{Dirichlet}({\bf p}).
\label{eq:priorgaga}
\end{align}

The \texttt{gaga} library provides some default values for the prior parameters that are a reasonable choice when the data have been normalized via the function \texttt{just.rma} from the R library \texttt{affy} or \texttt{just.gcrma} from the R library \texttt{gcrma}. The MiGaGa model extends GaGa by specifying a mixture of inverse gammas for $\nu$.

Both models are fit using the routine \texttt{fitGG}: the argument \texttt{nclust} indicates the number of components in the mixture (\texttt{nclust=1} corresponds to the GaGa model).

\section{Simulating the data}
\label{sec:simulate}

We start by loading the library and simulating mRNA expression levels for \texttt{n=100} genes and 2 groups, each with 6 samples. We set the seed for random number generation so that you can reproduce the results presented here. We use the parameter estimates obtained from the Armstrong dataset (Armstrong, 2002), as described in Rossell (2009). As we shall see in the following sections, we use the first five samples from each group to fit the model. We will then use the model to predict the class for the sixth sample.

<>=
library(gaga)
set.seed(10)
n <- 100; m <- c(6,6)
a0 <- 25.5; nu <- 0.109
balpha <- 1.183; nualpha <- 1683
probpat <- c(.95,.05)
xsim <- simGG(n,m=m,p.de=probpat[2],a0,nu,balpha,nualpha,equalcv=TRUE)
@

The object \texttt{xsim} is an \texttt{ExpressionSet}. The simulated expression values are accessible through \texttt{exprs(xsim)}, the parameters through \texttt{featureData(xsim)} and the group each observation belongs to through \texttt{pData(xsim)}. We save in \texttt{a} a matrix containing the gene-specific $\alpha$ parameters (\texttt{a[,1]} contains parameters for the first group, \texttt{a[,2]} for the second). Similarly, we save the gene-specific means $\lambda$ in \texttt{l} and the expression values in \texttt{x}.
<>=
xsim
featureData(xsim)
phenoData(xsim)
a <- fData(xsim)[,c("alpha.1","alpha.2")]
l <- fData(xsim)[,c("mean.1","mean.2")]
x <- exprs(xsim)
@

Figure \ref{fig:fig1}(a) shows the marginal distribution (kernel density estimate) of the simulated gene expression levels. Figure \ref{fig:fig1}(b) plots the simulated mean and coefficient of variation for group 1. The plots can be obtained with the following syntax:

<>=
plot(density(x),xlab='Expression levels',main='')
@

<>=
plot(l[,1],1/sqrt(a[,1]),xlab='Group 1 Mean',ylab='Group 1 CV')
@

\setkeys{Gin}{width=0.5\textwidth}
\begin{figure}
\begin{center}
\begin{tabular}{cc}
(a) & (b) \\
<>=
<>
@
&
<>=
<>
@
\end{tabular}
\end{center}
\caption{(a): marginal density of the simulated data; (b): plot of the simulated $(\alpha,\lambda)$ pairs}
\label{fig:fig1}
\end{figure}

\section{Model fit}
\label{sec:modelfit}

To fit the model we use the function \texttt{fitGG}. First, we define the vector \texttt{groups}, which indicates the group each sample belongs to. Second, we specify the gene expression patterns or hypotheses that we wish to entertain. In our example, since we have two groups there really are only two possible expression patterns:

\begin{description}
\item Pattern 0 (null hypothesis): group 1 = group 2
\item Pattern 1 (alternative hypothesis): group 1 $\neq$ group 2.
\end{description}

More precisely, under pattern 0 we have that $\alpha_{i1}=\alpha_{i2}$ and $\lambda_{i1}=\lambda_{i2}$, while under pattern 1 $\alpha_{i1} \neq \alpha_{i2}$ and $\lambda_{i1} \neq \lambda_{i2}$. We specify the patterns with a matrix with as many rows as patterns and as many columns as groups. For each row of the matrix ({\it i.e.} each hypothesis), we indicate that two groups are equal by assigning the same number to their corresponding columns. The column names of the matrix must match the group codes indicated in \texttt{groups}, otherwise the routine returns an error. For example, in our two-hypothesis case we would specify:

<>=
groups <- pData(xsim)$group[c(-6,-12)]
groups
patterns <- matrix(c(0,0,0,1),2,2)
colnames(patterns) <- c('group 1','group 2')
patterns
@

For illustration, suppose that instead we had 3 groups and 4 hypotheses, as follows:

\begin{description}
\item Pattern 0: CONTROL = CANCER A = CANCER B
\item Pattern 1: CONTROL $\neq$ CANCER A = CANCER B
\item Pattern 2: CONTROL = CANCER A $\neq$ CANCER B
\item Pattern 3: CONTROL $\neq$ CANCER A $\neq$ CANCER B
\end{description}

In this case we would specify

<>=
patterns <- matrix(c(0,0,0,0,1,1,0,0,1,0,1,2),ncol=3,byrow=TRUE)
colnames(patterns) <- c('CONTROL','CANCER A','CANCER B')
patterns
@

That is, the second row indicates that under Pattern 1 cancers of type A and B present the same expression levels, since they both have a \texttt{1} in their entries. The last row indicates that they are all different by specifying a different number in each entry.

Now, to fit the GaGa model to our simulated data we use \texttt{fitGG}, with \texttt{nclust=1} (to fit the MiGaGa model we would set \texttt{nclust} to the number of components that we want in the mixture). We remove columns 6 and 12 from the dataset, {\it i.e.} we do not use them for the fit so that we can evaluate the out-of-sample behavior of the classifier built in Section \ref{sec:classpredict}. Here we use the option \texttt{trace=FALSE} to prevent iteration information from being printed. There are several available methods to fit the model.
\texttt{method=='EM'} implements an Expectation-Maximization algorithm which seeks to maximize the expected likelihood. \texttt{method=='quickEM'} (the default) is a quicker version that uses only 1 maximization step. \texttt{quickEM} usually provides reasonably good hyper-parameter estimates at a low computational cost. In practice we have observed that the inference derived from the GaGa and MiGaGa models ({\it e.g.} lists of differentially expressed genes) is robust to slight hyper-parameter mis-specifications, so we believe \texttt{quickEM} to be a good default option for large datasets. \texttt{method=='SA'} implements a Simulated Annealing scheme which searches for a hyper-parameter value with high posterior density.

The three above-mentioned methods (\texttt{EM}, \texttt{quickEM}, \texttt{SA}) only provide point estimates. We can obtain credibility intervals with \texttt{method=='Gibbs'} or \texttt{method=='MH'}, which fit a fully Bayesian model via Gibbs or Metropolis-Hastings MCMC posterior sampling, respectively. Of course, obtaining credibility intervals comes at a higher computational cost. In our experience the five methods typically deliver similar results.

<>=
patterns <- matrix(c(0,0,0,1),2,2)
colnames(patterns) <- c('group 1','group 2')
gg1 <- fitGG(x[,c(-6,-12)],groups,patterns=patterns,nclust=1,method='Gibbs',B=1000,trace=FALSE)
gg2 <- fitGG(x[,c(-6,-12)],groups,patterns=patterns,method='EM',trace=FALSE)
gg3 <- fitGG(x[,c(-6,-12)],groups,patterns=patterns,method='quickEM',trace=FALSE)
@

We can obtain iteration plots to visually assess the convergence of the chain. The component \texttt{mcmc} of \texttt{gg1} contains an object of type \texttt{mcmc}, as defined in the library \texttt{coda} (a brief example is given at the end of this section).

To obtain parameter estimates and the posterior probability that each gene is differentially expressed we use the function \texttt{parest}. We discard the first 100 MCMC iterations with \texttt{burnin=100}, and we ask for 95\% posterior credibility intervals with \texttt{alpha=.05}. The slot \texttt{ci} of the returned object contains the credibility intervals (this option is only available for \texttt{method=='Gibbs'} and \texttt{method=='MH'}).

<>=
gg1 <- parest(gg1,x=x[,c(-6,-12)],groups,burnin=100,alpha=.05)
gg2 <- parest(gg2,x=x[,c(-6,-12)],groups,alpha=.05)
gg3 <- parest(gg3,x=x[,c(-6,-12)],groups,alpha=.05)
gg1
gg1$ci
gg2
gg3
@

Although the parameter estimates obtained from the three methods are similar to each other, some differences remain. This is due to some extent to our having a small dataset with only 100 genes. For the larger datasets encountered in practice the three methods typically deliver very similar results. In Section \ref{sec:degenes} we assess whether the lists of differentially expressed genes obtained with each method are actually the same.

The slot \texttt{pp} in \texttt{gg1} and \texttt{gg2} contains a matrix with the posterior probability of each expression pattern for each gene. For example, to find the probability that the first gene follows pattern 0 ({\it i.e.} is equally expressed) and pattern 1 ({\it i.e.} is differentially expressed) we do as follows.

<>=
dim(gg1$pp)
gg1$pp[1,]
gg2$pp[1,]
@
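As mentioned above, convergence of the MCMC chain can be assessed visually with trace plots. The following is a minimal sketch using the \texttt{coda} library; it assumes that the posterior draws are stored in the \texttt{mcmc} component of \texttt{gg1}, as described earlier (plots not shown here).

<>=
# Minimal convergence-check sketch (assumes gg1$mcmc is a 'coda' mcmc object
# holding the Gibbs draws for the hyper-parameters; plots not shown)
library(coda)
traceplot(gg1$mcmc)
@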
\section{Checking the goodness of fit}
\label{sec:gof}

To graphically assess the goodness of fit of the model, one can use prior-predictive or posterior-predictive checks. The latter, implemented in the function \texttt{checkfit}, are based on drawing parameter values from the posterior distribution for each gene, possibly using them to generate data values, and then comparing the simulated values to the observed data. The data generated from the posterior predictive is compared to the observed data in Figure \ref{fig:fig4}(a). Figure \ref{fig:fig4}(b)-(d) compares draws from the posterior of $\alpha$ and $\lambda$ with their method of moments estimates, which are model-free. All plots indicate that the model has a reasonably good fit. The figures were generated with the following code:

\setkeys{Gin}{width=0.5\textwidth}
<>=
checkfit(gg1,x=x[,c(-6,-12)],groups,type='data',main='')
@

\setkeys{Gin}{width=0.5\textwidth}
<>=
checkfit(gg1,x=x[,c(-6,-12)],groups,type='shape',main='')
@

\setkeys{Gin}{width=0.5\textwidth}
<>=
checkfit(gg1,x=x[,c(-6,-12)],groups,type='mean',main='')
@

\setkeys{Gin}{width=0.5\textwidth}
<>=
checkfit(gg1,x=x[,c(-6,-12)],groups,type='shapemean',main='',xlab='Mean',ylab='1/sqrt(CV)')
@

\setkeys{Gin}{width=0.5\textwidth}
\begin{figure}
\begin{center}
\begin{tabular}{cc}
(a) & (b) \\
<>=
<>
@
&
<>=
<>
@
\\
<>=
<>
@
&
<>=
<>
@
\end{tabular}
\end{center}
\caption{Assessing the goodness of fit. (a): compares samples from the posterior predictive to the observed data; (b): compares samples from the posterior of $\alpha$ to the method of moments estimate; (c): compares samples from the posterior of $\lambda$ to the method of moments estimate; (d): as (b) and (c) but plots the pairs $(\alpha,\lambda)$ instead of the kernel density estimates}
\label{fig:fig4}
\end{figure}

It should be noted, however, that posterior-predictive plots can fail to detect departures from the model, since there is a double use of the data. Prior-predictive checks can be easily implemented using the function \texttt{simGG} and setting the hyper-parameters to their posterior mean.

\section{Finding differentially expressed genes}
\label{sec:degenes}

The function \texttt{findgenes} finds differentially expressed genes, {\it i.e.} assigns each gene to an expression pattern. The problem is formalized as minimizing the false negative rate, subject to an upper bound on the false discovery rate, say \texttt{fdrmax=0.05}. In a Bayesian sense, this is achieved by assigning to pattern 0 (null hypothesis) all genes for which the posterior probability of following pattern 0 is above a certain threshold (Mueller, 2004). The problem is then to find the optimal threshold, which can be done parametrically or non-parametrically through the use of permutations (for details see Rossell, 2009). Here we explore both options, specifying \texttt{B=1000} permutations for the non-parametric option.

<>=
d1 <- findgenes(gg1,x[,c(-6,-12)],groups,fdrmax=.05,parametric=TRUE)
d1.nonpar <- findgenes(gg1,x[,c(-6,-12)],groups,fdrmax=.05,parametric=FALSE,B=1000)
dtrue <- (l[,1]!=l[,2])
table(d1$d,dtrue)
table(d1.nonpar$d,dtrue)
@

We set the variable \texttt{dtrue} to indicate which genes were actually differentially expressed (easily achieved by comparing the columns of \texttt{l}). Both the parametric and non-parametric versions declare 4 genes to be DE, all of them true positives. They both fail to find one of the DE genes. To obtain an estimated frequentist FDR for each Bayesian FDR one can plot \texttt{d1.nonpar\$fdrest}.
The result, shown in Figure \ref{fig:fig5}, reveals that setting the Bayesian FDR at a 0.05 level results in an estimated frequentist FDR around 0.015. That is, calling \texttt{findgenes} with the option \texttt{parametric=TRUE} results in a slightly conservative procedure from a frequentist point of view.

\setkeys{Gin}{width=0.5\textwidth}
<>=
plot(d1.nonpar$fdrest,type='l',xlab='Bayesian FDR',ylab='Estimated frequentist FDR')
@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Estimated frequentist FDR vs. Bayesian FDR}
\label{fig:fig5}
\end{figure}

Finally, we compare the list of differentially expressed genes with those obtained when using the other fitting criteria explained in Section \ref{sec:modelfit}.

<>=
d2 <- findgenes(gg2,x[,c(-6,-12)],groups,fdrmax=.05,parametric=TRUE)
d3 <- findgenes(gg3,x[,c(-6,-12)],groups,fdrmax=.05,parametric=TRUE)
table(d1$d,d2$d)
table(d1$d,d3$d)
@

Despite the existence of small differences in the hyper-parameter estimates between methods, the final list of differentially expressed genes is the same for all of them. This suggests that the GaGa model is somewhat robust to the hyper-parameter specification.

\section{Obtaining fold change estimates}
\label{sec:fc}

The GaGa and MiGaGa models can be used to obtain fold change estimates, by computing the posterior expected expression values for each group. As these posterior expectations are derived from a hierarchical model, they are obtained by borrowing information across genes. Therefore, in small sample situations they are preferable to simply using the group sample means.

The function \texttt{posmeansGG} computes posterior expected values under any expression pattern. The expression pattern is indicated with the argument \texttt{underpattern}. In our example (as in most microarray experiments) pattern 0 corresponds to the null hypothesis that no genes are differentially expressed. Therefore, specifying \texttt{underpattern=0} would result in obtaining identical expression estimates for both groups. Instead, one is commonly interested in computing a different mean for each group, which in our case corresponds to pattern 1. As the expression measurements were simulated to be in log2 scale, the log-fold change can be computed by taking the difference between the two group means (if the data were in the original scale, we would divide instead). The code below computes posterior means and log-fold changes, and prints out the log-fold change for the first five genes.

<>=
mpos <- posmeansGG(gg1,x=x[,c(-6,-12)],groups=groups,underpattern=1)
fc <- mpos[,1]-mpos[,2]
fc[1:5]
@

\section{Class prediction}
\label{sec:classpredict}

We now use the fitted model to predict the class of arrays 6 and 12, neither of which was used to fit the model. We assume that the prior probability is 0.5 for each group, though in most settings this will not be realistic. For example, if \texttt{groups==2} indicates individuals with cancer, one would expect the prior probability to be well below 0.5, say around 0.1. But if the individual had a positive result in some test that was administered previously, this probability would have increased, say to 0.4.

Class prediction is implemented in the function \texttt{classpred}. The argument \texttt{xnew} contains the gene expression measurements for the new individuals, \texttt{x} is the data used to fit the model and \texttt{ngene} indicates the number of genes that should be used to build the classifier.
It turns out that array 6 is correctly assigned to group 1 and array 12 is correctly assigned to group 2. \texttt{classpred} also returns the posterior probability that the sample belongs to each group. We see that for the dataset at hand the posterior probability of belonging to the wrong group is essentially zero. Similarly good results are obtained when setting \texttt{ngene} to either 1 (the minimum value) or 100 (the maximum value). The fact that adding more genes to the classifier does not change its performance is not surprising, since the classifier assigns little weight to genes with small probability of being DE. We have observed a similar behavior in many datasets. The fact that the classifier works so well with a single gene is typically not observed in real datasets, where it is rare to have a gene with such a high discrimination power.

<>=
pred1 <- classpred(gg1,xnew=x[,6],x=x[,c(-6,-12)],groups,ngene=50,prgroups=c(.5,.5))
pred2 <- classpred(gg1,xnew=x[,12],x=x[,c(-6,-12)],groups,ngene=50,prgroups=c(.5,.5))
pred1
pred2
@

\section{Designing high-throughput experiments}
\label{sec:design}

The \texttt{gaga} package incorporates routines which can be used for fixed and sequential sample size calculation in high-throughput experiments. For details on the methodology see \cite{rossell:2011}.

We start by simulating some data from a GaGa model. Since the computations can be intensive, here we simulate data for 20 genes only. The question is, given the observed data, how many more samples should we collect, if any?

<>=
set.seed(1)
x <- simGG(n=20,m=2,p.de=.5,a0=3,nu=.5,balpha=.5,nualpha=25)
gg1 <- fitGG(x,groups=1:2,method='EM')
gg1 <- parest(gg1,x=x,groups=1:2)
@

The function \texttt{powfindgenes} evaluates the (posterior) expected number of new true gene discoveries if one were to obtain an additional batch of data with \texttt{batchSize} new samples per group. For our simulated data, we expect that obtaining 1 more sample per group would provide no new gene discoveries. For 2 and 3 more samples per group we still expect to discover less than one new gene (which seems reasonable for our simulated data with only 20 genes).

<>=
pow1 <- powfindgenes(gg1, x=x, groups=1:2, batchSize=1, fdrmax=0.05, B=1000)
pow2 <- powfindgenes(gg1, x=x, groups=1:2, batchSize=2, fdrmax=0.05, B=1000)
pow3 <- powfindgenes(gg1, x=x, groups=1:2, batchSize=3, fdrmax=0.05, B=1000)
pow1$m
pow2$m
pow3$m
@

As illustrated, calling \texttt{powfindgenes} for different values of \texttt{batchSize} can be used to determine the sample size. We refer to this approach as fixed sample size calculation, since the number of additional samples is fixed from now on, regardless of the evidence provided by new data.

Function \texttt{forwsimDiffExpr} provides a sequential sample size alternative. The idea is that, every time that we observe new data, we can use \texttt{powfindgenes} to estimate the expected number of new gene discoveries for an additional data batch. As long as this quantity is large enough, we keep adding new samples. When this quantity drops below some threshold we stop experimentation. \texttt{forwsimDiffExpr} uses forward simulation to determine reasonable values for this threshold. In short, the function simulates \texttt{batchSize} new samples per group from the GaGa posterior predictive distribution, and for each of them evaluates the expected number of new discoveries via \texttt{powfindgenes} (estimated via \texttt{Bsummary} Monte Carlo samples).
Then \texttt{batchSize} more samples are added, and \texttt{powfindgenes} is called again, up to a maximum number of batches \texttt{maxBatch}. The whole process is repeated \texttt{B} times. Notice that, although not illustrated here, parallel processing can be used to speed up computations ({\it e.g.} see \texttt{mclapply} from package \texttt{parallel}).

<>=
fs1 <- forwsimDiffExpr(gg1, x=x, groups=1:2, maxBatch=3, batchSize=1, fdrmax=0.05, B=100, Bsummary=100, randomSeed=1)
head(fs1)
@

\texttt{forwsimDiffExpr} returns a \texttt{data.frame} indicating, for each simulation and stopping time (number of batches), the posterior expectation of the number of true positives (\texttt{u}), false discovery and false negative rates (\texttt{fdr}, \texttt{fnr}), and power ({\it e.g.} number of detected DE genes at a Bayesian FDR \texttt{fdrmax} divided by overall number of DE genes). It also returns the (posterior predictive) expected new DE discoveries for one more data batch in \texttt{summary}. Since experimentation is always stopped at \texttt{time==maxBatch}, it is not necessary to evaluate \texttt{summary} at this time point and \texttt{NA} is returned.

The output of \texttt{forwsimDiffExpr} can be used to estimate the expected number of new true discoveries for each sample size, as well as to estimate the expected utility by subtracting a sampling cost. As illustrated above, these results can also be obtained with \texttt{powfindgenes}, which is much faster computationally.

<>=
tp <- tapply(fs1$u,fs1$time,'mean')
tp
samplingCost <- 0.01
util <- tp - samplingCost*(0:3)
util
@

Again, for our simulated data we expect to find very few DE genes with 1, 2 or 3 additional data batches. For a sampling cost of 0.01, the optimal fixed sample design is to obtain 3 more data batches. Here we set a very small sampling cost for illustration purposes, although in most applications both the number of genes and the sampling cost will be much larger. For instance, \texttt{samplingCost=50} would indicate that the experimenter considers it worthwhile to obtain one more batch of samples as long as it allows finding at least 50 new DE genes.

In order to find the optimal sequential design, we define a grid of intercept and slope values for the linear stopping boundaries. The function \texttt{seqBoundariesGrid} returns the expected utility for each intercept and slope in the grid in the element \texttt{grid}. The element \texttt{opt} contains the optimum and the expected utility, FDR, FNR, power and stopping time ({\it i.e.} the expected number of data batches).

<>=
b0seq <- seq(0,20,length=200); b1seq <- seq(0,40,length=200)
bopt <- seqBoundariesGrid(b0=b0seq,b1=b1seq,forwsim=fs1,samplingCost=samplingCost,powmin=0)
names(bopt)
bopt$opt
head(bopt$grid)
@

The expected utility for the optimal boundary is slightly larger than for a fixed sample size of 3 batches per group (see above), and perhaps more importantly it is achieved with a smaller average sample size. The optimal intercept and slope are equal to 0. Recall that experimentation at time $t=1,\ldots,$\texttt{maxBatch}$-2$ continues as long as \texttt{summary} is greater than or equal to the stopping boundary. Therefore the optimal rule implies never stopping at $t=1$. At time \texttt{maxBatch}$-1$ ($t=2$ in our example) the optimal decision is to continue whenever the one-step ahead expected new true discoveries (\texttt{summary}) is greater than \texttt{samplingCost}, regardless of the stopping boundary.
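To make the rule concrete, we can evaluate the optimal linear boundary $b_0+b_1 t$ at the interim analysis times. The sketch below simply assumes the intercept and slope are those stored in \texttt{bopt\$opt}, the same values used in the plot that follows.

<>=
# Sketch: value of the optimal linear stopping boundary b0 + b1*t at the
# interim times t=1,2 (with maxBatch=3, experimentation always stops at t=3).
# As discussed above, sampling continues at t=1 while 'summary' lies on or
# above this boundary; at t=maxBatch-1 the comparison is against samplingCost.
bnd <- bopt$opt['b0'] + bopt$opt['b1']*(1:2)
bnd
@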
We produce a plot to visualize the results (Figure \ref{fig:forwsim}). The plot shows the simulated trajectories for the summary statistic, and the optimal stopping boundary.

\setkeys{Gin}{width=0.5\textwidth}
<>=
plot(fs1$time,fs1$summary,xlab='Additional batches',ylab='E(newly discovered DE genes)')
abline(bopt$opt['b0'],bopt$opt['b1'])
text(.2,bopt$opt['b0'],'Continue',pos=3)
text(.2,bopt$opt['b0'],'Stop',pos=1)
@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Forward simulations and optimal sequential rule}
\label{fig:forwsim}
\end{figure}

\bibliographystyle{plainnat}
\bibliography{references}

\end{document}