\name{samplesize}
\alias{samplesize}
\title{FDR as a function of sample size}
\description{
This function tabulates the false discovery rate (FDR) for selecting differentially expressed genes as a function of sample size and cutoff level. Optionally, the same information is displayed as a plot.
}
\usage{
samplesize(n = seq(5, 50, by = 5), p0 = 0.99, sigma = 1, D, F0, F1,
           paired = FALSE, crit, crit.style = c("top percentage", "cutoff"),
           plot = TRUE, local.show = FALSE, nplot = 100, ylim = c(0, 1), main,
           legend.show = FALSE, grid.show = FALSE, ...)
}
\arguments{
  \item{n}{sample size (as subjects per group)}
  \item{p0}{the proportion of non-differentially expressed genes}
  \item{sigma}{the standard deviation for the log expression values}
  \item{D}{assumed average log fold change (in units of \code{sigma}), by default 1; this is a shortcut for specifying a simple symmetrical alternative hypothesis through \code{F1}.}
  \item{F0}{the distribution of the log2 expression values under the null hypothesis; by default, this is normal with mean zero and standard deviation \code{sigma}, but mixtures of normals can be specified, see Details and Examples.}
  \item{F1}{the distribution of the log2 expression values under the alternative hypothesis; by default, this is an equal mixture of two normals with means \code{D} and -\code{D} and standard deviation \code{sigma}; mixtures of normals are again possible, see Details and Examples.}
  \item{paired}{logical value indicating whether the samples are paired; the default \code{FALSE} is the independent sample case.}
  \item{crit}{a vector of cutoff values for selecting differentially expressed genes; the interpretation depends on \code{crit.style}.}
  \item{crit.style}{indicates how differentially expressed genes are selected: either by a fixed cutoff level for the absolute value of the t-statistic or as a fixed top percentage of the largest absolute t-statistics.}
  \item{plot}{logical value indicating whether to draw the plot}
  \item{local.show}{logical value indicating whether to show the local instead of the global false discovery rate (default: global).}
  \item{nplot}{number of points at which the curves are evaluated}
  \item{ylim}{the usual limits on the vertical axis}
  \item{main}{the main title of the plot}
  \item{legend.show}{logical value indicating whether to show a legend for the types of gene selection in the plot}
  \item{grid.show}{logical value indicating whether to draw grid lines in the plot at the sample sizes \code{n} that are tabulated}
  \item{\dots}{the usual graphical parameters, passed to \code{plot}}
}
\details{
This function plots the FDR as a function of the sample size when comparing the expression of multiple genes between two groups of subjects. This is based on a model assuming that a proportion \code{p0} of genes is not differentially expressed (regulated) between groups, and that the remaining proportion 1-\code{p0} is.

The logarithmized gene expression values of regulated and non-regulated genes are assumed to be generated by mixtures of normal distributions; these mixtures can be specified through the parameters \code{F0}, \code{F1} or \code{D}, and \code{sigma}; please see \code{TOC} for details on the model and the specification of the mixtures. By default, the null distribution of the log expression values is a normal centered on zero, and the alternative is an equal mixture of normals centered at \code{+D} and \code{-D}.

The list of nominally differentially expressed genes can be selected in two ways:
\itemize{
\item all genes with an absolute t-statistic larger than the specified critical cutoff values (\code{cutoff}),
\item all genes that make up the specified critical top percentage of the largest absolute t-statistics (\code{top percentage}).
Multiple critical values correspond to multiple curves, each labeled by its critical value, but only one value can be specified for the proportion of non-regulated genes \code{p0} and for the standard deviation \code{sigma}.
}
}
\value{
A matrix with rows corresponding to the elements of \code{n} and columns corresponding to the specified critical values is returned. The matrix has the attribute \code{param} that contains the specified arguments, see Examples.
}
\references{
Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A (2005) False Discovery Rate, Sensitivity and Sample Size for Microarray Studies. \emph{Bioinformatics}, 21, 3017-3024.

Jung SH (2005) Sample size for FDR-control in microarray data analysis. \emph{Bioinformatics}, 21, 3097-3104.
}
\author{Y. Pawitan and A. Ploner}
\note{Both the curve labels and the legend may be squashed if the plotting device is too small. Increasing the size of the device and re-plotting should improve readability.}
\seealso{\code{\link{FDR}}, \code{\link{TOC}}, \code{\link{EOC}}}
\examples{
# Default assumes a proportion of 0.01 regulated genes equally split
# between two-fold up- and down-regulated
# We select the top 1, 2 and 3 percent of the largest absolute t-statistics
samplesize(crit=c(0.03,0.02, 0.01))

# Same model, but using a hard cutoff for the t-statistics
samplesize(crit=2:4, crit.style="cutoff")

# Paired test of the same size has slightly better FDR (as expected)
samplesize(paired=TRUE)

# Compare the effect of p0 and effect size
par(mfrow=c(2,2))
samplesize(crit=c(0.03,0.02, 0.01), p0=0.95, D=1)
samplesize(crit=c(0.03,0.02, 0.01), p0=0.99, D=1)
samplesize(crit=c(0.03,0.02, 0.01), p0=0.95, D=2)
samplesize(crit=c(0.03,0.02, 0.01), p0=0.99, D=2)

# An asymmetric alternative distribution: 20 percent of the regulated genes
# are expected to be (at least) four-fold up-regulated
# NB, no graphical output
ret = samplesize(F1=list(D=c(-1,1,2), p=c(2,2,1)), p0=0.95, crit=0.05, plot=FALSE)
ret

# Look at the parameters
attr(ret, "param")
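
# A minimal sketch of reading sample sizes off the returned matrix; it relies
# only on the layout described under Value (rows follow n, columns follow crit).
# The names n.grid and fdr and the 10 percent FDR target are illustrative
# assumptions, not part of the package
n.grid = seq(5, 50, by=5)
fdr = samplesize(n=n.grid, crit=0.01, plot=FALSE)
# Tabulated sample sizes whose FDR falls below the target (possibly none)
n.grid[fdr[,1] < 0.10]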

# A wide null distribution that allows disregarding genes with small effects
# Here: |log2 fold change| < 0.25, i.e. fold change of less than 19 percent
samplesize(F0=list(D=c(-0.25,0,0.25)), grid.show=TRUE)

# This is close to Example 3 in Jung's paper (see References):
# p0=0.99 and sensitivity=0.6, so we want a rejection rate of
# around 0.006 from the top list.
# Here we require around 40 arrays/group, compared to
# around 37 in Jung's paper, most likely because we use
# the t-distribution instead of the normal. Jung's alternative
# is only one-sided, so the exact correspondence is
samplesize(p0=0.99, crit.style="top percentage", crit=0.006, F1=list(D=1, p=1), grid.show=TRUE)
abline(h=0.01)

# The result is very close to the symmetric alternative:
samplesize(p0=0.99, crit=0.006, D=1, grid.show=TRUE, ylim=c(0,0.9))
}
\keyword{hplot}
\keyword{design}