\name{estimateDispersions} \Rdversion{1.1} \alias{estimateDispersions,CountDataSet-method} \alias{estimateDispersions} \title{ Estimate and fit dispersions for a CountDataSet. } \description{ This function obtains dispersion estimates for a count data set. For each condition (or collectively for all conditions, see 'method' argument below) it first computes for each gene an empirical dispersion value (a.k.a. a raw SCV value), then fits by regression a dispersion-mean relationship and finally chooses for each gene a dispersion parameter that will be used in subsequent tests from the empirical and the fitted value according to the 'sharingMode' argument. } \usage{ \S4method{estimateDispersions}{CountDataSet}( object, method = c( "pooled", "pooled-CR", "per-condition", "blind" ), sharingMode = c( "maximum", "fit-only", "gene-est-only" ), fitType = c("parametric", "local"), locfit_extra_args=list(), lp_extra_args=list(), modelFrame = NULL, modelFormula = count ~ condition, ... ) } \arguments{ \item{object}{ a \code{CountDataSet} with size factors. } \item{method}{ There are three ways how the empirical dispersion can be computed: \itemize{ \item \code{pooled} - Use the samples from all conditions with replicates to estimate a single pooled empirical dispersion value, called "pooled", and assign it to all samples. \item \code{pooled-CR} - Fit models according to \code{modelFormula} and estimate the dispersion by maximizing a Cox-Reid adjusted profile likelihood (CR-APL). This method is much slower than \code{method=="pooled"} but works also with crossed factors (as may occur, e.g., in designs with paired samples). Usually, you will need to specify the model formula, which should be the same as the one used later in the call to \code{nbinomFitGLMs} for fitting the full model. Note: The method of using CR-APL maximization for this application has been developed by McCarthy, Chen and Smyth [Nucl. Acid Res., 2012 and been first implemented in edgeR (in 2010). DESeq optimizes the expression for the CR-APL given in McCarthy et al.'s paper, but does not use the weigthed maximum likelihood scheme proposed there. \item \code{per-condition} - For each condition with replicates, compute a gene's empirical dispersion value by considering the data from samples for this condition. For samples of unreplicated conditions, the maximum of empirical dispersion values from the other conditions is used. If \code{object} has a multivariate design (i.e., if a data frame was passed instead of a factor for the \code{condition} argument in \code{\link{newCountDataSet}}), this method is not available. (Note: This method was called ``normal'' in previous versions.) \item \code{blind} - Ignore the sample labels and compute a gene's empirical dispersion value as if all samples were replicates of a single condition. This can be done even if there are no biological replicates. This method can lead to loss of power; see the vignette for details. The single estimated dispersion condition is called "blind" and used for all samples. } } \item{sharingMode}{ After the empirical dispersion values have been computed for each gene, a dispersion-mean relationship is fitted for sharing information across genes in order to reduce variability of the dispersion estimates. After that, for each gene, we have two values: the empirical value (derived only from this gene's data), and the fitted value (i.e., the dispersion value typical for genes with an average expression similar to those of this gene). The \code{sharingMode} argument specifies which of these two values will be written to the \code{featureData}'s \code{disp_} columns and hence will be used by the functions \code{\link{nbinomTest}} and \code{\link{fitNbinomGLMs}}. \itemize{ \item \code{fit-only} - use only the fitted value, i.e., the empirical value is used only as input to the fitting, and then ignored. Use this only with very \emph{few} replicates, and when you are not too concerned about false positives from dispersion outliers, i.e. genes with an unusually high variability. \item \code{maximum} - take the maximum of the two values. This is the conservative or prudent choice, recommended once you have at least three or four replicates and maybe even with only two replicates. \item \code{gene-est-only} - No fitting or sharing, use only the empirical value. This method is preferable when the number of replicates is large and the empirical dispersion values are sufficiently reliable. If the number of replicates is small, this option may lead to many cases where the dispersion of a gene is accidentally underestimated and a false positive arises in the subsequent testing. } } \item{fitType}{ \itemize{ \item \code{parametric} - Fit a dispersion-mean relation of the form \code{dispersion = asymptDisp + extraPois / mean} via a robust gamma-family GLM. The coefficients \code{asymptDisp} and \code{extraPois} are given in the attribute \code{coefficients} of the \code{dispFunc} in the \code{fitInfo} (see below). } \itemize{ \item \code{local} - Use the locfit package to fit a dispersion-mean relation, as described in the DESeq paper. } } \item{locfit_extra_args, lp_extra_args}{ (only for \code{fitType=local}) Options to be passed to the \code{locfit} and to the \code{lp} function of the locfit package. Use this to adjust the local fitting. For example, you may pass a value for \code{nn} different from the default (0.7) if the fit seems too smooth or too rough by setting \code{lp_extra_agrs=list(nn=0.9)}. As another example, you can set \code{locfit_extra_args=list(maxk=200)} if you get the error that locfit ran out of nodes. See the documentation of the \code{locfit} package for details. In most cases, you will not need to provide these parameters, as the defaults seem to work quite well.} \item{modelFrame}{ By default, the information in \code{conditions(object)} or \code{pData(object)} is used to determine which samples are replicates (see \code{\link{newCountDataSet}}). For \code{method="pooled"}, a data frame can be passed here, and all rows that are identical in this data frame are considered to indicate replicate samples in \code{object}. For \code{method="pooled-CR"}, the data frame is used in the fits. For the other methods, this argument is ignored. } \item{modelFormula}{ For \code{method="pooled-CR"}, this is the formual used for the dispersion fits. For all other methods, this argument is ignored. } \item{...}{extra arguments are ignored} } \details{ Behaviour for \code{method="per-condition"}: For each replicated condition, a list, named with the condition's name, is placed in the environment \code{object@fitInfo}. This list has five named elements: The vector \code{perGeneDispEsts} contains the empirical dispersions. The function \code{dispFunc} is the fitted function, i.e., it takes as its argument a normalized mean expression value and returns the corresponding fitted dispersion. The values fitted according to this function are in the third element \code{fittedDispEst}, a vector of the same length as \code{perGeneDispEsts}. The fourt element, \code{df}, is an integer, indicating the number of degrees of freedom of the per-gene estimation. The fifth element, \code{sharingMode}, stores the value of the \code{sharingMode} argument to \code{esimateDispersions}. Behaviour for \code{method="blind"} and \code{method="pooled"}: Only one list is produced, named \code{"blind"} or \code{"pooled"} and placed in \code{object@fitInfo}. For each list in the \code{fitInfo} environment, the dispersion values that are intended to be used in subsequent testing are computed according to the value of \code{sharingMode} and are placed in the \code{featureData} data frame, in a column named with the same name, prefixed with "\code{disp_}". Then, the \code{\link{dispTable}} (see there) is filled to assign to each condition the appropriate dispersion column in the phenoData frame. Note: Up to DESeq version 1.4.x (Bioconductor release 2.8), this function was called \code{estimateVarianceFunctions}, stored its result differently and had did not yet have the arguments \code{sharingMode} and \code{fitType}. \code{estimatevarianceFunction}'s behaviour corresponded to the settings \code{sharingMode="fit-only"} and \code{fitType="local"}. Note that these are not the default, because the new functionalities sharingMode="maximum" and fitType="local" tend to give better results in standard cases as they are more robust } \value{ The \code{CountDataSet} cds, with the slots \code{fitInfo} and \code{featureData} updated as described in Details. } \author{ Simon Anders, sanders@fs.tum.de } \examples{ cds <- makeExampleCountDataSet() cds <- estimateSizeFactors( cds ) cds <- estimateDispersions( cds ) str( fitInfo( cds ) ) head( fData( cds ) ) }