\name{nsFilter}
\alias{nsFilter}
\alias{varFilter}
\alias{featureFilter}
\alias{nsFilter,ExpressionSet-method}

\title{Non-Specific Filtering of an ExpressionSet}

\description{
  This function identifies and removes probe sets that are unlikely to be
  of use when modeling the data. No phenotype variables are used in the
  filtering process, so the result can be used with any downstream
  analysis.
}

\usage{
nsFilter(eset, require.entrez = TRUE,
    require.GOBP = FALSE, require.GOCC = FALSE,
    require.GOMF = FALSE, remove.dupEntrez = TRUE,
    var.func = IQR, var.cutoff = 0.5,
    var.filter = TRUE,
    filterByQuantile=TRUE, feature.exclude="^AFFX", ...)

varFilter(eset, var.func = IQR, var.cutoff = 0.5, filterByQuantile=TRUE)

featureFilter(eset, require.entrez=TRUE,
    require.GOBP=FALSE, require.GOCC=FALSE,
    require.GOMF=FALSE, remove.dupEntrez=TRUE,
    feature.exclude="^AFFX")
}

\arguments{
  \item{eset}{an \code{ExpressionSet} object}

  \item{require.entrez}{If \code{TRUE}, require that all probe sets have an
    Entrez Gene ID annotation. Probe sets without such an annotation
    will be filtered out.}

  \item{require.GOBP}{If \code{TRUE}, require that all probe sets have an
    annotation to at least one GO ID in the BP ontology. Probe sets
    without such an annotation will be filtered out.}

  \item{require.GOCC}{If \code{TRUE}, require that all probe sets have an
    annotation to at least one GO ID in the CC ontology. Probe sets
    without such an annotation will be filtered out.}

  \item{require.GOMF}{If \code{TRUE}, require that all probe sets have an
    annotation to at least one GO ID in the MF ontology. Probe sets
    without such an annotation will be filtered out.}

  \item{remove.dupEntrez}{If \code{TRUE} and there are multiple probe sets
    mapping to the same Entrez Gene ID, then the probe set with the
    largest value of \code{var.func} will be retained and the others
    removed.}

  \item{var.func}{A \code{function} that will be used to assess the
    variance of a probe set across all samples.
    This function should return a numeric vector of length one when given
    a numeric vector as input. Probe sets whose \code{var.func} value
    falls below the cutoff determined by \code{var.cutoff} will be
    removed. The default is \code{IQR}.}

  \item{var.cutoff}{A numeric value to use in filtering out probe sets with
    small variance across samples. See the \code{var.func} argument and
    the details section below.}

  \item{var.filter}{A logical indicating whether or not to perform
    variance-based filtering. The default is \code{TRUE}.}

  \item{filterByQuantile}{Logical: whether the variance-filter cutoff
    threshold should be interpreted as a quantile. Defaults to
    \code{TRUE}; if set to \code{FALSE} the cutoff value is used
    directly ``as is''.}

  \item{feature.exclude}{A character vector of regular expressions. Any
    probe set identifiers (return value of \code{featureNames(eset)})
    that match one of the specified patterns will be filtered out. The
    default value is intended to filter out Affymetrix quality control
    probe sets.}

  \item{...}{Unused, but available for specializing methods.}
}

\details{
  A first step in many microarray analysis procedures is to carry out
  non-specific filtering. The goal is to remove uninteresting probe
  sets without regard to the phenotype data and reduce the number of
  probe sets that will be included in further analysis.

  \emph{Annotation Based Filtering} Arguments \code{require.entrez},
  \code{require.GOBP}, \code{require.GOCC}, and \code{require.GOMF} turn
  on a filter based on available annotation data. The annotation
  package is determined by calling \code{annotation(eset)}.

  \emph{Duplicate Probe Removal} If \code{remove.dupEntrez=TRUE}, probe
  sets that your annotation maps to the same gene will be compared, and
  only the probe set with the highest \code{var.func} value will be
  retained.

  \emph{Variance Based Filtering} The \code{var.filter}, \code{var.func},
  \code{var.cutoff} and \code{filterByQuantile} arguments control
  numerical cutoff-based filtering.
  The intention is to remove uninformative probe sets, representing genes
  that were not expressed at all. The default \code{var.func} is
  \code{IQR}; this choice is motivated by the observation that
  unexpressed genes are detected most reliably through their low
  variability across samples. Additionally, \code{IQR} is robust to
  outliers (see note below). The default \code{var.cutoff} is
  \code{0.5} and is motivated by the rule of thumb that in many tissues
  only 40\% of genes are expressed. Of course, if you prefer a
  different approach to numerical filtering you can choose another
  function as \code{var.func}, or turn off numerical filtering by
  setting \code{var.filter=FALSE}.

  Note that by default the numerical-filter cutoff is interpreted as a
  quantile, so leaving the default values intact would filter out 50\%
  of the genes remaining at this stage. If you prefer to set the cutoff
  at some absolute threshold, change the value of
  \code{filterByQuantile} to \code{FALSE}, and modify \code{var.cutoff}
  accordingly.

  Note also that variance filtering is now performed last, so that (if
  \code{filterByQuantile=TRUE} and \code{remove.dupEntrez=TRUE}) the
  final number of genes does indeed exclude precisely the
  \code{var.cutoff} fraction of unique genes remaining after all other
  filters were passed.

  The stand-alone function \code{varFilter} does only numerical
  filtering, and returns an \code{ExpressionSet}.
  \code{featureFilter} does only feature-based filtering and duplicate
  removal, and returns an \code{ExpressionSet} as well. In
  \code{featureFilter}, duplicate removal is hard-coded to retain the
  highest-IQR probe set for each gene.
}

\value{
  For \code{nsFilter} a list consisting of:
  \item{eset}{the filtered \code{ExpressionSet}}
  \item{filter.log}{a list giving details of how many probe sets were
    removed for each filtering step performed.}

  For both \code{varFilter} and \code{featureFilter} the filtered
  \code{ExpressionSet}.
}

\author{Seth Falcon (somewhat revised by Assaf Oron)}

\note{\code{IQR} is a reasonable variance-filter choice when the dataset
  is split into two roughly equal and relatively homogeneous phenotype
  groups. If your dataset has important groups smaller than 25\% of the
  overall sample size, or if you are interested in unusual
  individual-level patterns, then \code{IQR} may not be sensitive enough
  for your needs. In such cases, you should consider using less robust
  and more sensitive measures of variance (the simplest of which would
  be \code{sd}).}

\examples{
  library("hgu95av2.db")
  data(sample.ExpressionSet)
  ans <- nsFilter(sample.ExpressionSet)
  ans$eset
  ans$filter.log

  ## skip variance-based filtering
  ans <- nsFilter(sample.ExpressionSet, var.filter=FALSE)

  a1 <- varFilter(sample.ExpressionSet)
  a2 <- featureFilter(sample.ExpressionSet)
}

\keyword{manip}
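\section{Sketch of the variance cutoff}{
  The interaction between \code{var.func}, \code{var.cutoff} and
  \code{filterByQuantile} described in the details section can be
  sketched in base R roughly as follows. This is only an illustration
  and not the package's implementation: a plain numeric matrix stands in
  for the expression values, \code{quantileFilterSketch} is a
  hypothetical helper name, and keeping rows whose value equals the
  cutoff is an assumption.

  \preformatted{
## Hypothetical helper (not part of the package): rows play the role of
## probe sets, columns the role of samples.
quantileFilterSketch <- function(x, var.func = IQR, var.cutoff = 0.5,
                                 filterByQuantile = TRUE) {
    ## one dispersion value per row
    vars <- apply(x, 1, var.func)
    if (filterByQuantile) {
        ## interpret var.cutoff as a quantile of the observed values
        cutoff <- quantile(vars, probs = var.cutoff, na.rm = TRUE)
    } else {
        ## use var.cutoff directly as an absolute threshold
        cutoff <- var.cutoff
    }
    ## assumption: rows exactly at the cutoff are kept
    keep <- !is.na(vars) & vars >= cutoff
    x[keep, , drop = FALSE]
}

set.seed(1)
x <- matrix(rnorm(200 * 10), nrow = 200,
            dimnames = list(paste0("probe", 1:200), NULL))

## default: cutoff at the median IQR, so about half of the rows remain
kept <- quantileFilterSketch(x)
nrow(kept)

## absolute threshold: keep rows whose IQR is at least 1
kept_abs <- quantileFilterSketch(x, var.cutoff = 1,
                                 filterByQuantile = FALSE)
nrow(kept_abs)
  }

  With \code{filterByQuantile = TRUE} the number of retained rows depends
  only on \code{var.cutoff} and the number of rows, whereas with
  \code{filterByQuantile = FALSE} it depends on the scale of the data.
}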