%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{DEDS Overview}
% \VignetteDepends{DEDS}
% \VignetteKeywords{Differential Expression Analysis}
% \VignettePackage{DEDS}
\documentclass[11pt]{article}

\usepackage{amsmath,epsfig,fullpage}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}

\parindent 0in

%\bibliographystyle{abbrvnat}

%------------------------------------------------------------
% newcommand
%------------------------------------------------------------
\newcommand{\code}[1]{{\tt #1}}
\newcommand{\Rfunc}[1]{{\tt #1}}

\newcommand{\myincfig}[3]{%
  \begin{figure}[htbp]
    \begin{center}
      \includegraphics[width=#2]{#1}
      \caption{\label{#1}#3}
    \end{center}
  \end{figure}
}

\begin{document}

\title{\bf Bioconductor's DEDS package}

\author{Yuanyuan Xiao$^1$ and Yee Hwa Yang$^2$}

\maketitle


\begin{center}
Departments of $^1$Biopharmaceutical Sciences and $^2$Medicine \\
University of California, San Francisco\\ 
{\tt yxiao@itsa.ucsf.edu}\\
{\tt jean@biostat.ucsf.edu}
\end{center}

\tableofcontents


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Overview}

This document provides a tutorial for the \code{DEDS} package for assessment of
differential expression (DE) in microarray data. \\

{\bf Introduction to DEDS.} There are numerous statistics
in the microarray literature that rank genes in evidence of DE, to name a few, 
fold change ($FC$), $t$ statistic, SAM (\cite{Tusheretal}).  selecting a best statistic or ordering 
statistics in terms of merit has been problematic.  No characterizations of 
microarray data that indicate desirability of a specific choice exist and, 
likewise, no comparisons across a sufficiently wide range of benchmark datasets 
have been undertaken. To avoid making fairly arbitrary choices when deciding which
ranking statistic to use and to borrow strength across related measures, we 
apply a novel ranking scheme that assesses DE via distance synthesis (DEDS) of
different related measures. Further details on the packages are given 
in \cite{DEDS}.\\  

{\bf Functionalities in \code{DEDS}.} The \code{DEDS} package implements the DEDS 
procedure and several common statisics, such as, 
$FC$, $t$ statistics, SAM, F statistics, B statistics (\cite{Ingrid}) and moderated 
F and $t$ statisitcs (\cite{limma}), for the analysis of DE in microarrys. \\

{\bf Case study.} We demonstrate the functionality of the {\tt DEDS} package using 
two microarray experiments: Affymetrix spike-in (\cite{Irizarryetal03}) and ApoA1
(\cite{Dudoitetal02}). \\

{\bf Related packages in Bioconductor.} The Bioconductor packages \code{marrayClasses},
\code{marrayInput} and \code{marrayNorm}  provide functions for reading and
normalizing spotted microarray data. The package  \code{affy} provides functions
for reading and normalizing Affymetrix microarray data.\\

{\bf Help files.}  As with any R package, detailed information on functions, classes and 
methods can be obtained in the help files. For instance, to view the help file for the 
function \Rfunc{comp.FC} in a browser, use \code{help.start()} followed by \code{?comp.FC}.\\


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Case study 1: Affymetrix Spike-in Experiment}
We demonstrate the functionality of this package using gene expression
data from the Affymetrix spike-in experiment. To load the dataset, use
\code{data(affySpikeIn)}, and to view a description of the experiments
and data, type \code{?affySpikeIn}.

% begin code ----------------------------------------
<<loadaffy,results=hide,eval=TRUE,echo=TRUE>>= 
library(DEDS)
data(affySpikeIn) 
@  
% end code -----------------------------------------

\subsection{Data}
The spike-in experiment represents a portion of the data used by Affymetrix
to develop their MAS 5.0 preprocessing algorithm.  The whole dataset 
features 14 human genes spiked-in at a series of 14 known concentrations 
($0, 2^{-2}, 2^{-1}, \ldots, 2^{10}$ pM) according to 
a Latin square design among 12612 null genes.  Each ``row'' of the Latin square
(given spike-in gene at a given concentration) was replicated (typically 3 
times, two rows 12 times, 59 arrays in total for the whole dataset).  
Further details are available 
at \url{http://www.affymetrix.com/analysis/download_center2.affx}.  
Here we showcase a portion of this dataset that presents a two-group
comparison problem with 12 replicates in each group.
Therefore, \code{affySpikeIn} contains the gene expression data for the 24
samples and 12,626 genes retained after RMA probe level summaries. The dataset 
includes 

\begin{itemize}
\item
{\code{affySpikeIn}:} a $12,626 \times 24 $ matrix of expression levels;
\item
{\code{affySpikeIn.gnames}:} a vector of gene identifiers of length $12,626$;
\item
{\code{affySpikeIn.L}:} a vector of class labels (0 for class 1, 1 for class 2). 
\item
{\code{spikegene}:} a vector that shows that location and identities of the 14 spiked genes. 
\end{itemize}

% begin code ----------------------------------------
<<showaffy,eval=TRUE,echo=TRUE>>= 
dim(affySpikeIn) 
affySpikeIn.L
spikedgene
@  
% end code -----------------------------------------
% begin code ----------------------------------------


\subsection{The \Rfunc{deds.stat.linkC} and \Rfunc{deds.stat} functions}
The \Rfunc{deds.stat.linkC} and \Rfunc{deds.stat} functions are the main 
functions that carry out the DEDS procedure. The former wraps around a
C function and is therefore quicker than the latter; the latter does the 
computation solely in R and is slower but is more flexible in fine-tuning parameters
for statistical measures. The user is recommended to use \Rfunc{deds.stat.linkC}
for efficiency purpose. We describe the most important arguments in the function
\Rfunc{deds.stat.linkC} below (see also \code{?deds.stat.linkC}):
\begin{description}
  \item{\code{X}: }{A  matrix, in the case of gene expression data, 
    rows correspond to $N$ genes and columns to $p$ mRNA samples.}
  \item{\code{L}: }{A vector of integers corresponding to observation (column)
    class labels. For $k$ classes, the labels must be integers
    between 0 and $k-1$.}
  \item{\code{B}: }{The number of permutations.}
  \item{\code{tests}: }{A character vector specifying the statistics
      for synthesis of DEDS. \code{test} could be any of the
      following: \code{``t''} ($t$ statistics), \code{``f''} (F statistics),
      \code{``fc''} ($FC$), \code{``sam''} (SAM), \code{``modt''} (moderated 
      $t$ statistics), \code{``modF''} (moderated F statistics) and 
      \code{``B''} (B statistics). As a default, DEDS synthesizes $t$ statistics, 
      $FC$ and SAM.}
 \item{\code{tail}: }{ A character string specifying the type of rejection
    region; choices include \code{``abs''}, \code{``higher''} and \code{``lower''}.}
 \item{\code{adj}: }{ A character string specifying the type of multiple testing
    adjustment; choices include \code{``fdr''} for returning $q$ values controlling False 
    Discovery Rate (FDR; see \cite{Benjamini&Hochberg95}) and \code{``adjp''}
    for adjusted $p$ values (see \cite{Dudoit&Shaffer02}) controlling family wise type I error rate.} 
 \item{\code{nsig}: }{ If \code{adj = ``fdr''}, \code{nsig} specifies the number of top
    differentially expressed genes whose $q$ values will be calculated; we recommend 
    setting \code{nsig < N}, as the computation of $q$ values will be extensive. $q$ values
    for the rest of genes will be approximated to 1. If \code{adj = ``adjp''}, the 
    calculation of the adjusted $p$ values will be for the whole dataset.}
\end{description}

We apply \Rfunc{deds.stat.linkC} on the \code{affySpikeIn} dataset using
400 permutations and evaluating the $q$ values for the top 100 genes. Here, 
as a default, DEDS synthesizes $t$ statistics,$FC$ and SAM. The information 
of the top 20 genes can be printed out using the function \code{topgenes};
note that the rankings of genes by DEDS balance among the three measures it 
synthesizes, $t$ statistics, $FC$ and SAM.

% start code -----------------------------------------
<<dedsaffy,eval=TRUE,echo=TRUE, results=hide>>= 
deds.affy <- deds.stat.linkC(affySpikeIn, affySpikeIn.L, B=400, nsig=100) 
@

<<showdedsaffy,eval=TRUE,echo=TRUE>>=
topgenes(deds.affy, number=20, genelist=affySpikeIn.gnames)
@
% end code -----------------------------------------

\subsection{The plotting functions \Rfunc{pairs.DEDS} and \Rfunc{hist.deds}}
We next illustrate the usage of the function \code{pairs.DEDS}, which is 
a S3 method for \code{pairs}. It displays a scatter matrix plot for 
individual statistics that DEDS synthesizes and highlights the top genes
according to a user specified threshold (\code{thresh}); see also
\code{?pairs.DEDS}. Plots on the diagonal panels are QQ-plots as the
default, but can be set as \code{``histogram''},\code{``boxplot''}, 
\code{``density''} or \code{``none''}. To display only qq-plots, the function 
\code{qqnorm.DEDS} can be use.

% begin code ------------------------------------------------------
%<<pairsaffy,include=FALSE,fig=TRUE,width=8,height=8>>=
<<pairsaffy,eval=FALSE,result=FALSE,echo=TRUE>>= 
pairs(deds.affy, subset=c(2:12626), thresh=0.01, legend=F)
@

%<<qqaffy,include=FALSE,fig=TRUE,width=8,height=8>>=
<<qqaffy,eval=FALSE,result=FALSE,echo=TRUE>>=
qqnorm(deds.affy, subset=c(2:12626), thresh=0.01)
@
% end code ------------------------------------------------------------

%\myincfig{DEDS-pairsaffy}{\textwidth}{Pairs plots for the Affymetrix spike-in data}
%\myincfig{DEDS-qqaffy}{\textwidth}{QQ- plots for the Affymetrix spike-in data} 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Case study 2: ApoA1 Experiment}
The next example we demonstrate is a set of cDNA microarray data from a study of a
mouse model with very low HDL cholesterol levels described in
\cite{Dudoitetal02}. To load the dataset, use
\code{data(ApoA1)}, and to view a description of the experiments
and data, type \code{?ApoA1}.

% begin code ----------------------------------------
<<loadApoA1,results=hide,eval=TRUE,echo=TRUE>>= 
data(ApoA1) 
@  
% end code -----------------------------------------

\subsection{Data}
The goal of the ApoA1 experiment to identify DE genes in apolipoprotein A1
(apo A1) knock-out mice. The treatment group consists of eight knock-out mice
and the control group consists of eight normal mice. The dataset 
includes 

\begin{itemize}
\item
{\code{ApoA1}:} a $6,384 \times 16 $ matrix of expression levels;
\item
{\code{ApoA1.L}:} a vector of class labels (0 for the control group, 
1 for the treatment group);
\end{itemize}

% begin code ----------------------------------------
<<showApoA1,eval=TRUE,echo=TRUE>>= 
dim(ApoA1) 
ApoA1.L
@  
% end code -----------------------------------------

\subsection{Application of the \Rfunc{deds.stat.linkC} function}
We apply \Rfunc{deds.stat.linkC} on the \code{ApoA1} dataset using
400 permutations and evaluating the adjusted$p$. Here, 
as a default, DEDS synthesizes $t$ statistics, $FC$ and SAM.

% begin code -----------------------------------------
<<dedsApoA1,eval=TRUE,echo=TRUE, results=hide>>= 
deds.ApoA1 <- deds.stat.linkC(ApoA1, ApoA1.L, B=400, adj="adjp" ) 
@

<<eval=TRUE,echo=TRUE>>= 
sum(deds.ApoA1$p<=0.01)
sum(deds.ApoA1$p<=0.05)
@

<<showdedsApoA1,eval=TRUE,echo=TRUE>>=
topgenes(deds.ApoA1, number=9)
@

<<pairsApoA1,include=FALSE,fig=TRUE,width=8,height=8>>=
pairs(deds.ApoA1, legend=F)
@
% end code ------------------------------------------------------------

\myincfig{DEDS-pairsApoA1}{\textwidth}{Pairs plots for the ApoA1 data}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Statistical functions}

\subsection{The \Rfunc{comp.t} and other related functions}

The \code{DEDS} package provides the following functions, \Rfunc{comp.FC},
\Rfunc{comp.t}, \Rfunc{comp.SAM}, \Rfunc{comp.F}, \Rfunc{comp.B},
\Rfunc{comp.modt} and \Rfunc{comp.modF} for the computation
of $FC$, $t$-statistics, SAM, F statistics, B statistics, moderated $t$- and
$F$- statistics respectively. \\

There are two steps in applying the above functions to obtain corresponding 
statistics:
\begin{enumerate}
\item Create the statistic function.
\item Apply the function to the microarray expression matrix.
\end{enumerate}

We illustrate the usage with \Rfunc{comp.t} and other functions follow the 
same rules. The function \Rfunc{comp.t} has 
three arguments (see also {\tt ?comp.t}):
\begin{description}
  \item 
    {{\tt L}: }{ A vector of integers corresponding to observation (column)
    class labels. For $k$ classes, the labels must be integers
    between 0 and $k-1$.}
  \item
    {{\tt mu}: }{ A number indicating the true value of the mean (or
    difference in means if two sample statistics are calculated; default set 
    at 0.}
  \item
    {{\tt var.equal}: }{ a logical variable indicating whether to treat the
     two variances as being equal.}
\end{description}
{\tt comp.t} returns a function of one argument with bindings for
{\tt L}, {\tt mu} and {\tt var.equal}. This function accepts a
microarray data matrix as its single argument, when evaluated, computes
$t$ statistic for each row of the matrix.

<<tstat,echo=TRUE,eval=FALSE,results=hide>>=
t <- comp.t(L=affySpikeIn.L)
t.affy <- t(affySpikeIn)
@

\subsection{The \Rfunc{comp.stat} function}
A simple wrapper function \Rfunc{comp.stat} is provided for users interested 
in applying a standard set of statistical measures using default parameters. 
The most important arguments for \Rfunc{comp.stat} are elaborated below:

\begin{description}
\item
  {\code{X}: }{ A  matrix, with rows correspond to genes and columns
          to mRNA samples. }
\item
  {\code{L}: }{ A vector of integers corresponding to observation (column)
          class labels. For $k$ classes, the labels must be integers
          between 0 and $k-1$. }
\item
  {\code{test}: }{ A character string specifying the statistic to be applied.\\
    \code{``t''} -- $t$ statistics;\\
    \code{``fc''} -- $FC$;\\
    \code{``sam''} -- SAM statistics;\\
    \code{``F''} -- F statistics;\\
    \code{``modt''} -- moderated $t$ statistics;\\
    \code{``modF''} -- moderated F statistics;\\
    \code{``B''} -- B statistics;}
\end{description}

To compute $t$ statistics on the \code{affySpikeIn} data, instead of using
\Rfunc{comp.t}, the users can also use \Rfunc{comp.stat} by specifying 
the \code{test} as \code{``t''}. However, \Rfunc{comp.stat} computes $t$ 
statistics assuming unequal variance (if it is a two-sample comparison); if
the user desires to use an equal variance option, the function \Rfunc{comp.t}
has to be applied instead.

<<tstat2,echo=TRUE,eval=FALSE,results=hide>>=
t.affy <- comp.stat(affySpikeIn, affySpikeIn.L, test='t')
@


\begin{thebibliography}{}

\bibitem[Benjamini and Hochberg, 1995]{Benjamini&Hochberg95}
Benjamini, Y. and Hochberg, Y. (1995).
\newblock Controlling the false discovery rate: a practical and powerful
  approach to multiple testing.
\newblock {\em J. R. Statist. Soc. B}, 57:289--300.

\bibitem[Dudoit et~al., 2002a]{Dudoit&Shaffer02}
Dudoit, S., Shaffer, J.~P., and Boldrick, J.~C. (2002a).
\newblock Multiple hypothesis testing in microarray experiments.
\newblock Submitted.

\bibitem[Dudoit et~al., 2002b]{Dudoitetal02}
Dudoit, S., Yang, Y.~H., Callow, M.~J., and Speed, T.~P. (2002b).
\newblock Statistical methods for identifying differentially expressed genes in
  replicated c{DNA} microarray experiments.
\newblock {\em Statistica Sinica}, 12(1):111--139.

\bibitem[Irizarry et~al., 2003]{Irizarryetal03}
Irizarry, R.~A., Bolstad, B.~M., Collin, F., Cope, L., Hobbs, B., and Speed,
  T.~P. (2003).
\newblock Summaries of affymetrix genechip probe level data.
\newblock {\em Nucleic Acids Research}, 31:e15.

\bibitem[L\"{o}nnstedt and Speed, 2001]{Ingrid}
L\"{o}nnstedt, I. and Speed, T.~P. (2001).
\newblock Replicated microarray data.
\newblock {\em Statistica Sinica}, 12(1):31--46.

\bibitem[Smyth et~al., 2003]{limma}
Smyth, G., Ritchie, M., Wettenhall, J., and Thorne, N. (2003).
\newblock Linear models for microarray data.
\newblock {\tt R} package, {\tt http://lib.stat.cmu.edu/R/CRAN/}.

\bibitem[Tusher et~al., 2001]{Tusheretal}
Tusher, V.~G., Tibshirani, R., and Chu, G. (2001).
\newblock Significance analysis of microarrays applied to transcriptional
  responses to ionizing radiation.
\newblock {\em Proc. Natl. Acad. Sci.}, 98:5116--5121.

\bibitem[Yang et~al, 2004]{DEDS}
Yang, Y.~H., Xiao, Y., and Segal, M. (2004).
\newblock Identifying differentially expressed genes from microarray experiments via 
  statistic synthesis.
\newblock {\em Manuscript in preparation.}

\end{thebibliography}


\end{document}