%
% NOTE -- ONLY EDIT howtogenefilter.Rnw!!!
% howtogenefilter.tex file will get overwritten.
%
%\VignetteIndexEntry{Using the genefilter function to filter genes from a microarray dataset}
%\VignetteDepends{Biobase, genefilter, class}
%\VignetteKeywords{Expression Analysis}
%\VignettePackage{genefilter}
\documentclass{article}

\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\classdef}[1]{%
  {\em #1}
}

\begin{document}
\title{Using the genefilter function to filter genes from a microarray dataset}

\maketitle

\section*{Introduction}

The {\em genefilter} package can be used to filter (select) genes from
a microarray dataset according to a variety of different
filtering mechanisms.
Here, we will consider the example dataset
in the \verb+sample.ExpressionSet+ example from the {\em Biobase} package.
This experiment has 26 samples, and there are 500 genes and 3
covariates. The covariates are named \verb+sex+, \verb+type+ and
\verb+score+. The first two have two levels and the last one is
continuous.

<<>>=
library("Biobase")
library("genefilter")
data(sample.ExpressionSet)
varLabels(sample.ExpressionSet)
table(sample.ExpressionSet$sex)
table(sample.ExpressionSet$type)
@
%$

One dichotomy that can be of interest for subsequent analyses is whether the filter is
\emph{specific} or \emph{non-specific}. Here, specific means that we are
filtering with reference to sample metadata, for example, \texttt{type}. For example, if
we want to select genes that are differentially expressed in the two
groups defined by \texttt{type}, that is a specific filter.
If on the other hand we want to select genes that are expressed in more
than 5 samples, that is an example of a non--specific filter.

First, let us see how to perform a non--specific filter.
Suppose we want to select genes that have an expression measure above 200 in at
least 5 samples. To do that we use the function \verb+kOverA+.

There are three steps that must be performed.
\begin{enumerate}
\item Create function(s) implementing the filtering criteria.
\item Assemble it (them) into a (combined) filtering function.
\item Apply the filtering function to the expression matrix.
\end{enumerate}

<<>>=
f1 <- kOverA(5, 200)
ffun <- filterfun(f1)
wh1 <- genefilter(exprs(sample.ExpressionSet), ffun)
sum(wh1)
@

Here \verb+f1+ is a function that implies our ``expression measure above 200 in at
least 5 samples'' criterion, the function \verb+ffun+ is the filtering
function (which in this case consists of only one criterion), and we apply it using \verb+genefilter+.
There were \Sexpr{sum(wh1)} genes that satisfied the criterion and passed the filter.

As an example for a specific filter, let us select genes that are differentially
expressed in the groups defined by \verb+type+.

<<>>=
f2 <- ttest(sample.ExpressionSet$type, p=0.1)
wh2 <- genefilter(exprs(sample.ExpressionSet), filterfun(f2))
sum(wh2)
@
%$
Here, \texttt{ttest} is a function from the \texttt{genefilter}
package which provides a suitable wrapper around \texttt{t.test} from
package \textit{stats}. Now we see that there are \Sexpr{sum(wh2)}
genes that satisfy the selection criterion. 

Suppose that we want to combine the two filters. We want those genes
for which at least 5 have an expression measure over 200 \emph{and} which also are differentially
expressed between the groups defined by \verb+type+.

<<>>=
ffun_combined <- filterfun(f1, f2)
wh3 <- genefilter(exprs(sample.ExpressionSet), ffun_combined)
sum(wh3)
@

Now we see that there are only \Sexpr{sum(wh3)} genes  that satisfy both conditions.

%%FIXME: need to replace this with something else
%Our last example is to select genes that are
%differentially expressed in at least one of the three groups defined
%by \verb+cov3+.
%To do that we use an Anova filter. This filter uses an analysis of
%variance appraoch (via the \verb+lm+) function to test the hypothesis
%that at least one of the three group means is different from the other
%%two. The test is applied, then the $p$--value computed. We select
%those genes that have a low $p$--value.
%
%<<>>=
%Afilter <- Anova(eset$cov3)
%aff <- filterfun(Afilter)
%wh4 <- genefilter(exprs(eset), aff)
%sum(wh4)
%
%@
%%$
%We see that there are 14 genes that pass this filter and that are
%candidates for further exploration.


\section*{Selecting genes that appear useful for prediction}

The function \texttt{knnCV} defined below performs $k$--nearest neighbour classification
using leave--one--out cross--validation.
At the same time it aggregates the genes that were selected. The
function returns the predicted classifications as its returned
value. However, there is an additional side effect. The number of
times that each gene was used (provided it was at least one) are
recorded and stored in the environment of the aggregator \verb+Agg+.
These can subsequently be retrieved and used for other purposes.

<<aggregate>>=

 knnCV <- function(EXPR, selectfun, cov, Agg, pselect = 0.01, Scale=FALSE) {
   nc <- ncol(EXPR)
   outvals <- rep(NA, nc)
   for(i in 1:nc) {
      v1 <- EXPR[,i]
      expr <- EXPR[,-i]
      glist <- selectfun(expr, cov[-i], p=pselect)
      expr <- expr[glist,]
      if( Scale ) {
        expr <- scale(expr)
        v1 <- as.vector(scale(v1[glist]))
      }
      else
         v1 <- v1[glist]
      out <- paste("iter ",i, " num genes= ", sum(glist), sep="")
      print(out)
      Aggregate(row.names(expr), Agg)
      if( length(v1) == 1)
         outvals[i] <- knn(expr, v1, cov[-i], k=5)
      else
          outvals[i] <- knn(t(expr), v1, cov[-i], k=5)
    }
    return(outvals)
  }
@
%$

<<aggregate>>=
 gfun <- function(expr, cov, p=0.05) {
    f2 <- ttest(cov, p=p)
    ffun <- filterfun(f2)
    which <- genefilter(expr, ffun)
  }

@

Next we show how to use this function on the dataset
\verb+geneData+.

<<aggregate, results=hide>>=
  library("class")

  ##scale the genes
  ##genescale is a slightly more flexible "scale"
  ##work on a subset -- for speed only
  geneData <- genescale(exprs(sample.ExpressionSet)[1:75,], 1)

  Agg <- new("aggregator")

  testcase <- knnCV(geneData, gfun, sample.ExpressionSet$type, 
         Agg, pselect=0.05)
@ 
<<aggregate>>=
sort(sapply(aggenv(Agg), c), decreasing=TRUE)
@
%$
The environment \verb+Agg+ contains, for each gene,
the number of times it was selected in the cross-validation.


\section*{Session Information}

The version number of R and packages loaded for generating the vignette were:

<<echo=FALSE,results=tex>>=
toLatex(sessionInfo())
@

\end{document}