\documentclass[a4paper,12pt]{article}
%\VignetteIndexEntry{Manual for the gaga library}
%\VignettePackage{gaga}
\usepackage{amsmath}   % need for subequations
\usepackage{amssymb}   % useful mathematical symbols
\usepackage{bm}        % needed for bold greek letters and math symbols
\usepackage{graphicx}  % need for PS figures
%\usepackage{verbatim} % useful for program listings
%\usepackage{color}    % use if color is used in text
\usepackage{hyperref}  % use for hypertext links, including those to external documents and URLs
\usepackage{natbib}    % number and author-year style referencing
%\usepackage{epsf}
%\usepackage{lscape}
%\bibpunct{(}{)}{;}{a}{,}{,}
\pagestyle{empty}      % use if page numbers not wanted

\begin{document}

\title{Manual for the R \texttt{gaga} package}
\author{David Rossell \\
\small{Department of Bioinformatics \& Biostatistics} \\
\small{IRB Barcelona} \\
\small{Barcelona, Spain.} \\
\texttt{rosselldavid@gmail.com}}
\date{} %comment to include current date

\maketitle

Here we illustrate several uses of the package \texttt{gaga}, including simulation, differential expression analysis, class prediction and sample size calculations.

In Section \ref{sec:intro} we review the GaGa and MiGaGa models. In Section \ref{sec:simulate} we simulate gene expression data, which we use to fit the GaGa model in Section \ref{sec:modelfit}. Diagnostics for model goodness-of-fit are presented in Section \ref{sec:gof}. Section \ref{sec:degenes} shows how to find differentially expressed genes, Section \ref{sec:fc} how to obtain fold change estimates, and Section \ref{sec:classpredict} how to classify samples into groups. Finally, in Section \ref{sec:design} we perform fixed and sequential sample size calculations.

\section{Introduction}
\label{sec:intro}

\cite{newton:2001} and \cite{kendziorski:2003} introduced the Gamma-Gamma model to analyze microarray data, an elegant and parsimonious hierarchical model that allows for the borrowing of information between genes. \cite{rossell:2009} showed that the assumptions of this model are too simplistic, which resulted in a rather poor fit to several real datasets, and developed two extensions of the model: GaGa and MiGaGa. The \texttt{gaga} library implements the GaGa and MiGaGa generalizations, which can be used both to find differentially expressed genes and to predict the class of a future sample ({\it e.g.} given the mRNA measurements for a new patient, predict whether the patient has cancer or is healthy).

We now briefly outline the \texttt{GaGa} and \texttt{MiGaGa} models. Let $x_{ij}$ be the expression measurement for gene $i$ in array $j$, and let $z_j$ indicate the group to which array $j$ belongs ({\it e.g.} $z_j=0$ for normal cells and $z_j=1$ for cancer cells). The GaGa model envisions the observations as arising from a gamma distribution, {\it i.e.} $x_{ij} \sim \mbox{Ga}(\alpha_{i,z_j},\alpha_{i,z_j}/\lambda_{i,z_j})$ ($\lambda_{i,z_j}$ is the mean), where $\alpha_{i,z_j}$ and $\lambda_{i,z_j}$ arise from a gamma and an inverse gamma distribution, respectively:
\begin{align}
\lambda_{i,k} | \delta_i,\alpha_0,\nu &\sim \mbox{IGa}(\alpha_0,\alpha_0/\nu) \mbox{, indep. for }i=1 \ldots n \nonumber \\
\alpha_{i,k} | \delta_i,\beta,\mu & \sim \mbox{Ga}(\beta,\beta/\mu) \mbox{, indep. for }i=1 \ldots n \nonumber \\
\delta_i|\bm{\pi} & \sim \mbox{Mult}(1,\bm{\pi}) \mbox{, indep. for } i=1 \ldots n.
\label{eq:modelgengg}
\end{align}

$\delta_1 \ldots \delta_n$ are latent variables indicating what expression pattern each gene follows (see Section \ref{sec:modelfit} for more details). For example, if there are only two groups, $\delta_i$ indicates whether gene $i$ is differentially expressed or not. In principle, both the shape and mean parameters are allowed to vary between groups, and $\delta_i$ compares both parameters between groups ({\it i.e.} the GaGa model allows one to compare not only mean expression levels but also the shape of the distribution between groups). However, the \texttt{gaga} library also implements a particular version of the model which assumes that the shape parameter is constant across groups, {\it i.e.} $\alpha_{i,k}=\alpha_i$ for all $k$. The coefficient of variation of the gamma distribution is equal to the inverse square root of the shape parameter (for $x \sim \mbox{Ga}(\alpha,\alpha/\lambda)$ we have $\mbox{E}(x)=\lambda$ and $\mbox{Var}(x)=\lambda^2/\alpha$, so that $\mbox{CV}=\sqrt{\mbox{Var}(x)}/\mbox{E}(x)=1/\sqrt{\alpha}$), and hence assuming constant $\alpha_{i,k}$ is equivalent to assuming a constant CV across groups. In most routines the user can choose the constant CV model with the option \texttt{equalcv=TRUE} (the default), and the varying CV model with the option \texttt{equalcv=FALSE}.

The Bayesian model is completed by specifying priors on the hyper-parameters that govern the hierarchy:
\begin{align}
\alpha_0 &\sim \mbox{Ga}(a_{\alpha_0},b_{\alpha_0}); \nu \sim \mbox{IGa}(a_{\nu},b_{\nu}) \nonumber \\
\beta &\sim \mbox{Ga}(a_{\beta},b_{\beta}); \mu \sim \mbox{IGa}(a_{\mu},b_{\mu}) \nonumber \\
\bm{\pi} &\sim \mbox{Dirichlet}({\bf p}).
\label{eq:priorgaga}
\end{align}

The \texttt{gaga} library provides some default values for the prior parameters that are a reasonable choice when the data have been normalized via the function \texttt{just.rma} from the R library \texttt{affy} or \texttt{just.gcrma} from the R library \texttt{gcrma}. The MiGaGa model extends GaGa by specifying a mixture of inverse gammas for $\nu$.

Both models are fit using the routine \texttt{fitGG}: the argument \texttt{nclust} indicates the number of components in the mixture (\texttt{nclust=1} corresponds to the GaGa model).

\section{Simulating the data}
\label{sec:simulate}

We start by loading the library and simulating mRNA expression levels for \texttt{n=100} genes and 2 groups, each with 6 samples. We set the seed for random number generation so that you can reproduce the results presented here. We use the parameter estimates obtained from the Armstrong dataset (Armstrong, 2002), as described in Rossell (2009). As we shall see in the following sections, we use the first five samples from each group to fit the model. We will then use the model to predict the class for the sixth sample.

<>=
library(gaga)
set.seed(10)
n <- 100; m <- c(6,6)
a0 <- 25.5; nu <- 0.109
balpha <- 1.183; nualpha <- 1683
probpat <- c(.95,.05)
xsim <- simGG(n,m=m,p.de=probpat[2],a0,nu,balpha,nualpha,equalcv=TRUE)
@

The object \texttt{xsim} is an \texttt{ExpressionSet}. The simulated expression values are accessible through \texttt{exprs(xsim)}, the parameters through \texttt{featureData(xsim)} and the group each observation belongs to through \texttt{pData(xsim)}. We save in \texttt{a} a matrix containing the gene-specific $\alpha$ parameters (\texttt{a[,1]} contains parameters for the first group, \texttt{a[,2]} for the second). Similarly, we save the gene-specific means $\lambda$ in \texttt{l} and the expression values in \texttt{x}.
<>=
xsim
featureData(xsim)
phenoData(xsim)
a <- fData(xsim)[,c("alpha.1","alpha.2")]
l <- fData(xsim)[,c("mean.1","mean.2")]
x <- exprs(xsim)
@

Figure \ref{fig:fig1}(a) shows the marginal distribution (kernel density estimate) of the simulated gene expression levels. Figure \ref{fig:fig1}(b) plots the simulated mean and coefficient of variation for group 1. The plots can be obtained with the following syntax:

<>=
plot(density(x),xlab='Expression levels',main='')
@

<>=
plot(l[,1],1/sqrt(a[,1]),xlab='Group 1 Mean',ylab='Group 1 CV')
@

\setkeys{Gin}{width=0.5\textwidth}
\begin{figure}
\begin{center}
\begin{tabular}{cc}
(a) & (b) \\
<>=
<>
@
&
<>=
<>
@
\end{tabular}
\end{center}
\caption{(a): marginal density of the simulated data; (b): plot of the simulated $(\alpha,\lambda)$ pairs}
\label{fig:fig1}
\end{figure}

\section{Model fit}
\label{sec:modelfit}

To fit the model we use the function \texttt{fitGG}. First, we define the vector \texttt{groups}, which indicates the group each sample belongs to. Second, we specify the gene expression patterns or hypotheses that we wish to entertain. In our example, since we have two groups there really are only two possible expression patterns:

\begin{description}
\item Pattern 0 (null hypothesis): group 1 = group 2
\item Pattern 1 (alternative hypothesis): group 1 $\neq$ group 2.
\end{description}

More precisely, under pattern 0 we have that $\alpha_{i1}=\alpha_{i2}$ and $\lambda_{i1}=\lambda_{i2}$, while under pattern 1 $\alpha_{i1} \neq \alpha_{i2}$ and $\lambda_{i1} \neq \lambda_{i2}$. We specify the patterns with a matrix with as many rows as patterns and as many columns as groups. For each row of the matrix ({\it i.e.} each hypothesis), we indicate that two groups are equal by assigning the same number to their corresponding columns. The column names of the matrix must match the group codes indicated in \texttt{groups}, otherwise the routine returns an error. For example, in our two-hypothesis case we would specify:

<>=
groups <- pData(xsim)$group[c(-6,-12)]
groups
patterns <- matrix(c(0,0,0,1),2,2)
colnames(patterns) <- c('group 1','group 2')
patterns
@

For illustration, suppose that instead we had 3 groups and 4 hypotheses, as follows:

\begin{description}
\item Pattern 0: CONTROL = CANCER A = CANCER B
\item Pattern 1: CONTROL $\neq$ CANCER A = CANCER B
\item Pattern 2: CONTROL = CANCER A $\neq$ CANCER B
\item Pattern 3: CONTROL $\neq$ CANCER A $\neq$ CANCER B
\end{description}

In this case we would specify

<>=
patterns <- matrix(c(0,0,0,0,1,1,0,0,1,0,1,2),ncol=3,byrow=TRUE)
colnames(patterns) <- c('CONTROL','CANCER A','CANCER B')
patterns
@

That is, the second row indicates that under Pattern 1 cancers of type A and B present the same expression levels, since they both have a \texttt{1} in their entries. The last row indicates that they are all different by specifying a different number in each entry.

Now, to fit the GaGa model to our simulated data we use \texttt{fitGG}, with \texttt{nclust=1} (to fit the MiGaGa model we would set \texttt{nclust} to the number of components that we want in the mixture). We remove columns 6 and 12 from the dataset, {\it i.e.} we do not use them for the fit so that we can evaluate the out-of-sample behavior of the classifier built in Section \ref{sec:classpredict}. Here we use the option \texttt{trace=FALSE} to prevent iteration information from being printed. There are several available methods to fit the model.
\texttt{method=='EM'} implements an Expectation-Maximization algorithm which seeks to maximize the expected likelihood. \texttt{method=='quickEM'} (the default) is a quicker version that uses only 1 maximization step. \texttt{quickEM} usually provides reasonably good hyper-parameter estimates at a low computational cost. In practice we have observed that the inference derived from the GaGa and MiGaGa models ({\it e.g.} lists of differentially expressed genes) is robust to slight hyper-parameter mis-specifications, so we believe \texttt{quickEM} to be a good default option for large datasets. \texttt{method=='SA'} implements a Simulated Annealing scheme which searches for a hyper-parameter value with high posterior density.

The three above-mentioned methods (\texttt{EM}, \texttt{quickEM}, \texttt{SA}) only provide point estimates. We can obtain credibility intervals with \texttt{method=='Gibbs'} or \texttt{method=='MH'}, which fit a fully Bayesian model via Gibbs or Metropolis-Hastings MCMC posterior sampling, respectively. Of course, obtaining credibility intervals comes at a higher computational cost. In our experience the five methods typically deliver similar results.

<>=
patterns <- matrix(c(0,0,0,1),2,2)
colnames(patterns) <- c('group 1','group 2')
gg1 <- fitGG(x[,c(-6,-12)],groups,patterns=patterns,nclust=1,method='Gibbs',B=1000,trace=FALSE)
gg2 <- fitGG(x[,c(-6,-12)],groups,patterns=patterns,method='EM',trace=FALSE)
gg3 <- fitGG(x[,c(-6,-12)],groups,patterns=patterns,method='quickEM',trace=FALSE)
@

We can obtain iteration plots to visually assess the convergence of the chain. The component \texttt{mcmc} of \texttt{gg1} contains an object of type \texttt{mcmc}, as defined in the library \texttt{coda} (a brief example is given at the end of this section).

To obtain parameter estimates and the posterior probability that each gene is differentially expressed we use the function \texttt{parest}. We discard the first 100 MCMC iterations with \texttt{burnin=100}, and we ask for 95\% posterior credibility intervals with \texttt{alpha=.05}. The slot \texttt{ci} of the returned object contains the credibility intervals (this option is only available for \texttt{method=='Gibbs'} and \texttt{method=='MH'}).

<>=
gg1 <- parest(gg1,x=x[,c(-6,-12)],groups,burnin=100,alpha=.05)
gg2 <- parest(gg2,x=x[,c(-6,-12)],groups,alpha=.05)
gg3 <- parest(gg3,x=x[,c(-6,-12)],groups,alpha=.05)
gg1
gg1$ci
gg2
gg3
@

Although the parameter estimates obtained from the three methods are similar to each other, some differences remain. This is due to some extent to our having a small dataset with only 100 genes. For the larger datasets encountered in practice the three methods typically deliver very similar results. In Section \ref{sec:degenes} we assess whether the lists of differentially expressed genes obtained with each method are actually the same.

The slot \texttt{pp} in \texttt{gg1} and \texttt{gg2} contains a matrix with the posterior probability of each expression pattern for each gene. For example, to find the probability that the first gene follows pattern 0 ({\it i.e.} is equally expressed) and pattern 1 ({\it i.e.} is differentially expressed) we do as follows.

<>=
dim(gg1$pp)
gg1$pp[1,]
gg2$pp[1,]
@
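As mentioned above, convergence of the MCMC chain can be assessed visually with trace plots. The following is a minimal sketch using the \texttt{coda} library; it assumes that the posterior draws are stored in the \texttt{mcmc} component of \texttt{gg1}, as described earlier (plots not shown here).

<>=
# Minimal convergence-check sketch (assumes gg1$mcmc is a 'coda' mcmc object
# holding the Gibbs draws for the hyper-parameters; plots not shown)
library(coda)
traceplot(gg1$mcmc)
@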
\section{Checking the goodness of fit}
\label{sec:gof}

To graphically assess the goodness of fit of the model, one can use prior-predictive or posterior-predictive checks. The latter, implemented in the function \texttt{checkfit}, are based on drawing parameter values from the posterior distribution for each gene, possibly using them to generate data values, and then comparing the simulated values to the observed data. The data generated from the posterior predictive is compared to the observed data in Figure \ref{fig:fig4}(a). Figure \ref{fig:fig4}(b)-(d) compares draws from the posterior of $\alpha$ and $\lambda$ with their method of moments estimates, which are model-free. All plots indicate that the model has a reasonably good fit. The figures were generated with the following code:

\setkeys{Gin}{width=0.5\textwidth}
<>=
checkfit(gg1,x=x[,c(-6,-12)],groups,type='data',main='')
@

\setkeys{Gin}{width=0.5\textwidth}
<>=
checkfit(gg1,x=x[,c(-6,-12)],groups,type='shape',main='')
@

\setkeys{Gin}{width=0.5\textwidth}
<>=
checkfit(gg1,x=x[,c(-6,-12)],groups,type='mean',main='')
@

\setkeys{Gin}{width=0.5\textwidth}
<>=
checkfit(gg1,x=x[,c(-6,-12)],groups,type='shapemean',main='',xlab='Mean',ylab='1/sqrt(CV)')
@

\setkeys{Gin}{width=0.5\textwidth}
\begin{figure}
\begin{center}
\begin{tabular}{cc}
(a) & (b) \\
<>=
<>
@
&
<>=
<>
@
\\
<>=
<>
@
&
<>=
<>
@
\end{tabular}
\end{center}
\caption{Assessing the goodness of fit. (a): compares samples from the posterior predictive to the observed data; (b): compares samples from the posterior of $\alpha$ to the method of moments estimate; (c): compares samples from the posterior of $\lambda$ to the method of moments estimate; (d): as (b) and (c) but plots the pairs $(\alpha,\lambda)$ instead of the kernel density estimates}
\label{fig:fig4}
\end{figure}

It should be noted, however, that posterior-predictive plots can fail to detect departures from the model, since there is a double use of the data. Prior-predictive checks can be easily implemented using the function \texttt{simGG} and setting the hyper-parameters to their posterior mean.

\section{Finding differentially expressed genes}
\label{sec:degenes}

The function \texttt{findgenes} finds differentially expressed genes, {\it i.e.} assigns each gene to an expression pattern. The problem is formalized as minimizing the false negative rate, subject to an upper bound on the false discovery rate, say \texttt{fdrmax=0.05}. In a Bayesian sense, this is achieved by assigning to pattern 0 (null hypothesis) all genes for which the posterior probability of following pattern 0 is above a certain threshold (Mueller, 2004). The problem is then to find the optimal threshold, which can be done parametrically or non-parametrically through the use of permutations (for details see Rossell, 2009). Here we explore both options, specifying \texttt{B=1000} permutations for the non-parametric option.

<>=
d1 <- findgenes(gg1,x[,c(-6,-12)],groups,fdrmax=.05,parametric=TRUE)
d1.nonpar <- findgenes(gg1,x[,c(-6,-12)],groups,fdrmax=.05,parametric=FALSE,B=1000)
dtrue <- (l[,1]!=l[,2])
table(d1$d,dtrue)
table(d1.nonpar$d,dtrue)
@

We set the variable \texttt{dtrue} to indicate which genes were actually differentially expressed (easily achieved by comparing the columns of \texttt{l}). Both the parametric and non-parametric versions declare 4 genes to be DE, all of them true positives. They both fail to find one of the DE genes. To obtain an estimated frequentist FDR for each Bayesian FDR one can plot \texttt{d1.nonpar\$fdrest}.
The result, shown in Figure \ref{fig:fig5}, reveals that setting the Bayesian FDR at a 0.05 level results in an estimated frequentist FDR around 0.015. That is, calling \texttt{findgenes} with the option \texttt{parametric=TRUE} results in a slightly conservative procedure from a frequentist point of view.

\setkeys{Gin}{width=0.5\textwidth}
<>=
plot(d1.nonpar$fdrest,type='l',xlab='Bayesian FDR',ylab='Estimated frequentist FDR')
@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Estimated frequentist FDR vs. Bayesian FDR}
\label{fig:fig5}
\end{figure}

Finally, we compare the list of differentially expressed genes with those obtained when using the other fitting criteria explained in Section \ref{sec:modelfit}.

<>=
d2 <- findgenes(gg2,x[,c(-6,-12)],groups,fdrmax=.05,parametric=TRUE)
d3 <- findgenes(gg3,x[,c(-6,-12)],groups,fdrmax=.05,parametric=TRUE)
table(d1$d,d2$d)
table(d1$d,d3$d)
@

Despite the existence of small differences in the hyper-parameter estimates between methods, the final list of differentially expressed genes is the same for all of them. This suggests that the GaGa model is somewhat robust to the hyper-parameter specification.

\section{Obtaining fold change estimates}
\label{sec:fc}

The GaGa and MiGaGa models can be used to obtain fold change estimates, by computing the posterior expected expression values for each group. As these posterior expectations are derived from a hierarchical model, they are obtained by borrowing information across genes. Therefore, in small sample situations they are preferable to simply using the group sample means.

The function \texttt{posmeansGG} computes posterior expected values under any expression pattern. The expression pattern is indicated with the argument \texttt{underpattern}. In our example (as in most microarray experiments) pattern 0 corresponds to the null hypothesis that no genes are differentially expressed. Therefore, specifying \texttt{underpattern=0} would result in obtaining identical expression estimates for both groups. Instead, one is commonly interested in computing a different mean for each group, which in our case corresponds to pattern 1. As the expression measurements were simulated to be in log2 scale, the log-fold change can be computed by taking the difference between the two group means (if the data were in the original scale, we would divide instead). The code below computes posterior means and log-fold changes, and prints out the log-fold change for the first five genes.

<>=
mpos <- posmeansGG(gg1,x=x[,c(-6,-12)],groups=groups,underpattern=1)
fc <- mpos[,1]-mpos[,2]
fc[1:5]
@

\section{Class prediction}
\label{sec:classpredict}

We now use the fitted model to predict the class of arrays 6 and 12, neither of which was used to fit the model. We assume that the prior probability is 0.5 for each group, though in most settings this will not be realistic. For example, if \texttt{groups==2} indicates individuals with cancer, one would expect the prior probability to be well below 0.5, say around 0.1. But if the individual had a positive result in some test that was administered previously, this probability would have increased, say to 0.4.

Class prediction is implemented in the function \texttt{classpred}. The argument \texttt{xnew} contains the gene expression measurements for the new individuals, \texttt{x} is the data used to fit the model and \texttt{ngene} indicates the number of genes that should be used to build the classifier.
It turns out that array 6 is correctly assigned to group 1 and array 12 is correctly assigned to group 2. \texttt{classpred} also returns the posterior probability that the sample belongs to each group. We see that for the dataset at hand the posterior probability of belonging to the wrong group is essentially zero. Similarly good results are obtained when setting \texttt{ngene} to either 1 (the minimum value) or 100 (the maximum value). The fact that adding more genes to the classifier does not change its performance is not surprising, since the classifier assigns little weight to genes with small probability of being DE. We have observed a similar behavior in many datasets. The fact that the classifier works so well with a single gene is typically not observed in real datasets, where it is rare to have a gene with such a high discrimination power.

<>=
pred1 <- classpred(gg1,xnew=x[,6],x=x[,c(-6,-12)],groups,ngene=50,prgroups=c(.5,.5))
pred2 <- classpred(gg1,xnew=x[,12],x=x[,c(-6,-12)],groups,ngene=50,prgroups=c(.5,.5))
pred1
pred2
@

\section{Designing high-throughput experiments}
\label{sec:design}

The \texttt{gaga} package incorporates routines which can be used for fixed and sequential sample size calculation in high-throughput experiments. For details on the methodology see \cite{rossell:2011}.

We start by simulating some data from a GaGa model. Since the computations can be intensive, here we simulate data for 20 genes only. The question is, given the observed data, how many more samples should we collect, if any?

<>=
set.seed(1)
x <- simGG(n=20,m=2,p.de=.5,a0=3,nu=.5,balpha=.5,nualpha=25)
gg1 <- fitGG(x,groups=1:2,method='EM')
gg1 <- parest(gg1,x=x,groups=1:2)
@

The function \texttt{powfindgenes} evaluates the (posterior) expected number of new true gene discoveries if one were to obtain an additional batch of data with \texttt{batchSize} new samples per group. For our simulated data, we expect that obtaining 1 more sample per group would provide no new gene discoveries. For 2 and 3 more samples per group we still expect to discover less than one new gene (which seems reasonable for our simulated data with only 20 genes).

<>=
pow1 <- powfindgenes(gg1, x=x, groups=1:2, batchSize=1, fdrmax=0.05, B=1000)
pow2 <- powfindgenes(gg1, x=x, groups=1:2, batchSize=2, fdrmax=0.05, B=1000)
pow3 <- powfindgenes(gg1, x=x, groups=1:2, batchSize=3, fdrmax=0.05, B=1000)
pow1$m
pow2$m
pow3$m
@

As illustrated, calling \texttt{powfindgenes} for different values of \texttt{batchSize} can be used to determine the sample size. We refer to this approach as fixed sample size calculation, since the number of additional samples is fixed from now on, regardless of the evidence provided by new data.

Function \texttt{forwsimDiffExpr} provides a sequential sample size alternative. The idea is that, every time that we observe new data, we can use \texttt{powfindgenes} to estimate the expected number of new gene discoveries for an additional data batch. As long as this quantity is large enough, we keep adding new samples. When this quantity drops below some threshold we stop experimentation. \texttt{forwsimDiffExpr} uses forward simulation to determine reasonable values for this threshold. In short, the function simulates \texttt{batchSize} new samples per group from the GaGa posterior predictive distribution, and for each of them evaluates the expected number of new discoveries via \texttt{powfindgenes} (estimated via \texttt{Bsummary} Monte Carlo samples).
Then \texttt{batchSize} more samples are added, and \texttt{powfindgenes} is called again, up to a maximum number of batches \texttt{maxBatch}. The whole process is repeated \texttt{B} times. Notice that, although not illustrated here, parallel processing can be used to speed up computations ({\it e.g.} see \texttt{mclapply} from package \texttt{parallel}).

<>=
fs1 <- forwsimDiffExpr(gg1, x=x, groups=1:2, maxBatch=3, batchSize=1, fdrmax=0.05, B=100, Bsummary=100, randomSeed=1)
head(fs1)
@

\texttt{forwsimDiffExpr} returns a \texttt{data.frame} indicating, for each simulation and stopping time (number of batches), the posterior expectation of the number of true positives (\texttt{u}), false discovery and false negative rates (\texttt{fdr}, \texttt{fnr}), and power ({\it e.g.} number of detected DE genes at a Bayesian FDR \texttt{fdrmax} divided by overall number of DE genes). It also returns the (posterior predictive) expected new DE discoveries for one more data batch in \texttt{summary}. Since experimentation is always stopped at \texttt{time==maxBatch}, it is not necessary to evaluate \texttt{summary} at this time point and \texttt{NA} is returned.

The output of \texttt{forwsimDiffExpr} can be used to estimate the expected number of new true discoveries for each sample size, as well as to estimate the expected utility by subtracting a sampling cost. As illustrated above, these results can also be obtained with \texttt{powfindgenes}, which is much faster computationally.

<>=
tp <- tapply(fs1$u,fs1$time,'mean')
tp
samplingCost <- 0.01
util <- tp - samplingCost*(0:3)
util
@

Again, for our simulated data we expect to find very few DE genes with 1, 2 or 3 additional data batches. For a sampling cost of 0.01, the optimal fixed sample design is to obtain 3 more data batches. Here we set a very small sampling cost for illustration purposes, although in most applications both the number of genes and the sampling cost will be much larger. For instance, \texttt{samplingCost=50} would indicate that the experimenter considers it worthwhile to obtain one more batch of samples as long as it allows finding at least 50 new DE genes.

In order to find the optimal sequential design, we define a grid of intercept and slope values for the linear stopping boundaries. The function \texttt{seqBoundariesGrid} returns the expected utility for each intercept and slope in the grid in the element \texttt{grid}. The element \texttt{opt} contains the optimum and the expected utility, FDR, FNR, power and stopping time ({\it i.e.} the expected number of data batches).

<>=
b0seq <- seq(0,20,length=200); b1seq <- seq(0,40,length=200)
bopt <- seqBoundariesGrid(b0=b0seq,b1=b1seq,forwsim=fs1,samplingCost=samplingCost,powmin=0)
names(bopt)
bopt$opt
head(bopt$grid)
@

The expected utility for the optimal boundary is slightly larger than for a fixed sample size of 3 batches per group (see above), and perhaps more importantly it is achieved with a smaller average sample size. The optimal intercept and slope are equal to 0. Recall that experimentation at time $t=1,\ldots,$\texttt{maxBatch}$-2$ continues as long as \texttt{summary} is greater than or equal to the stopping boundary. Therefore the optimal rule implies never stopping at $t=1$. At time \texttt{maxBatch}$-1$ ($t=2$ in our example) the optimal decision is to continue whenever the one-step ahead expected new true discoveries (\texttt{summary}) is greater than \texttt{samplingCost}, regardless of the stopping boundary.
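To make the rule concrete, we can evaluate the optimal linear boundary $b_0+b_1 t$ at the interim analysis times. The sketch below simply assumes the intercept and slope are those stored in \texttt{bopt\$opt}, the same values used in the plot that follows.

<>=
# Sketch: value of the optimal linear stopping boundary b0 + b1*t at the
# interim times t=1,2 (with maxBatch=3, experimentation always stops at t=3).
# As discussed above, sampling continues at t=1 while 'summary' lies on or
# above this boundary; at t=maxBatch-1 the comparison is against samplingCost.
bnd <- bopt$opt['b0'] + bopt$opt['b1']*(1:2)
bnd
@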
We produce a plot to visualize the results (Figure \ref{fig:forwsim}). The plot shows the simulated trajectories for the summary statistic, and the optimal stopping boundary.

\setkeys{Gin}{width=0.5\textwidth}
<>=
plot(fs1$time,fs1$summary,xlab='Additional batches',ylab='E(newly discovered DE genes)')
abline(bopt$opt['b0'],bopt$opt['b1'])
text(.2,bopt$opt['b0'],'Continue',pos=3)
text(.2,bopt$opt['b0'],'Stop',pos=1)
@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Forward simulations and optimal sequential rule}
\label{fig:forwsim}
\end{figure}

\bibliographystyle{plainnat}
\bibliography{references}

\end{document}