% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
%\VignetteIndexEntry{LMGene User's Guide}
%\VignetteDepends{Biobase, tools, multtest, survival, splines}

\documentclass[11pt]{article}
\usepackage{amsmath,fullpage}
\usepackage{hyperref}


\parindent 0in  % Left justify


\begin{document}

\title{\bf LMGene User's Guide}
\author{Geun-Cheol Lee, John Tillinghast, and David M. Rocke}

\maketitle

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

This article introduces usage of the \texttt{LMGene} package. 
\texttt{LMGene} has been developed for analysis of microarray data using 
a linear model and glog data transformation in the R statistical package.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Data preparation}
\texttt{LMGene} takes objects of class \texttt{ExpressionSet}, which is the standard
data structure of the \texttt{Biobase} package. Therefore, the user who already has data that is of class
\texttt{ExpressionSet} can jump to further steps, such as g-log transformation or looking for differentially expressed genes. Otherwise, the user needs to generate new objects
of class \texttt{ExpressionSet}. For more detail, please see the vignette,
`An Introduction to Biobase and ExpressionSets' in the \texttt{Biobase} package.

{\bf Note: ExpressionSet.} In this package, an object of class \texttt{ExpressionSet}
must produce proper data using the commands \texttt{exprs(object)} and \texttt{phenoData(object)}.

{\tt Example.} \texttt{LMGene} includes sample array data which is of class
{\tt ExpressionSet}. Let's take a look this sample data.

\begin{enumerate}

\item First, load the necessary packages in your R session.
<<eval=TRUE, echo=TRUE>>=
library(LMGene)
library(Biobase)
library(tools)
@

\item Load the sample {\tt ExpressionSet} class data in the package {\tt LMGene}.
<<eval=TRUE, echo=TRUE>>=
data(sample.eS)
@

\item View the data structure of the sample data and 
the details of {\tt exprs} and {\tt phenoData} slots in the data.
<<eval=TRUE, echo=TRUE>>=
slotNames(sample.eS)
dim(exprs(sample.eS))
exprs(sample.eS)[1:3,]
phenoData(sample.eS)
slotNames(phenoData(sample.eS))
@

\end{enumerate}

{\tt Data generation.} If you don't have {\tt ExpressionSet} class data, you need to make some.
\texttt{LMGene} provides a function that can generate an object of class {\tt ExpressionSet},
assuming that there are array data of {\tt matrix} class and experimental data of {\tt list} class.

\begin{enumerate}

\item The package includes sample array and experimental/phenotype data, {\tt sample.mat} and {\tt vlist}.
<<eval=TRUE, echo=TRUE>>=
data(sample.mat)
dim(sample.mat)
data(vlist)  
vlist
@

\item Generate {\tt ExpressionSet} class data using {\tt neweS} function.
<<eval=TRUE, echo=TRUE>>=
test.eS<-neweS(sample.mat, vlist)
class(test.eS)
@


\end{enumerate}


\section{G-log transformation}
\begin{enumerate}
\item {\tt Estimating parameters for g-log transformation.} 
In \texttt{LMGene}, the linear model is not intended to be applied to the raw data, but to transformed and normalized data. Many people use a log transform. \texttt{LMGene} uses a log-like transform involving two parameters. We estimate the parameters ${\lambda}$ and ${\alpha}$ of the generalized log
transform ${\log{(y - {\alpha} + {\sqrt{(y - {\alpha})^2 + {\lambda}}})} = \sinh^{-1}(\frac{y-\alpha}{\lambda})}+\log(\lambda)$ using the
function tranest as follows:

<<eval=TRUE, echo=TRUE>>=
tranpar <- tranest(sample.eS)
tranpar
@

The optional parameter {\tt ngenes} controls how many genes are used in
the estimation. The default is all of them (up to 100,000), but this option allows the use of less. A typical call using this parameter
would be

<<eval=TRUE, echo=TRUE>>=
tranpar <- tranest(sample.eS, 100)
tranpar
@

In this case, 100 genes are chosen at random and used to estimate
the transformation parameter. The function returns a list
containing values for lambda and alpha.

   
\item {\tt G-log transformation.} Using the obtained two parameters, the g-log transformed
expression set can be calculated as follows.
<<eval=TRUE, echo=TRUE>>=

trsample.eS <- transeS(sample.eS, tranpar$lambda, tranpar$alpha)
exprs(sample.eS)[1:3,1:8]
exprs(trsample.eS)[1:3,1:8]
@


\item {\tt Tranest options: multiple alpha, lowessnorm, model}

Rather than using a single alpha for all samples, we can estimate a separate alpha for each sample. This allows for differences in chips, in sample concentration, or exposure conditions. 

<<eval=TRUE, echo=TRUE>>=
tranparmult <- tranest(sample.eS, mult=TRUE) 
tranparmult
@


For vector alphas, transeS uses exactly the same syntax:
<<eval=TRUE, echo=TRUE>>=
trsample.eS <- transeS (sample.eS, tranparmult$lambda, tranparmult$alpha)
exprs(trsample.eS)[1:3,1:8]
@

It's also possible to estimate the parameters using the more accurate lowess normalization (as opposed to uniform normalization):

<<eval=TRUE, echo=TRUE>>=
tranparmult <- tranest(sample.eS, ngenes=100, mult=TRUE, lowessnorm=TRUE) 
tranparmult
@


One may also specify a model other than the default no-interaction model. For example, if we think that the interaction of variables in \texttt{vlist} is important, we can add interaction to the model:

<<eval=TRUE, echo=TRUE>>=
tranpar <- tranest(sample.eS, model='patient + dose + patient:dose')
tranpar
@

The model is always specified in the same way as the right-hand side of an \texttt{lm} model. In the example above, we set the parameters to minimize the mean squared error for a regression of transformed gene expression against patient, log dose, and their interaction.

Be very careful of using interactions between factor variables. If you do not have enough replicates, you can easily overfit the data and have no degrees of freedom left for error.

Naturally, it's possible to use \texttt{mult}, \texttt{lowessnorm}, and \texttt{model} all together.


\end{enumerate}   
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Finding differentially expressed genes}

\begin{enumerate}
\item {\tt Transformation and Normalization.} Before finding differentially expressed genes,
the array data needs to be transformed and normalized. 

<<eval=TRUE, echo=TRUE>>=
trsample.eS <- transeS (sample.eS, tranparmult$lambda, tranparmult$alpha)
ntrsample.eS <- lnormeS (trsample.eS) 
@
   
\item {\tt Finding differentially expressed genes} 
The \texttt{LMGene} routine computes significant probes/genes by calculating gene-by-gene p-values for each factor in the model and adjusting for the specified false discovery rate (FDR).  A typical call would be

<<eval=TRUE, echo=TRUE>>=
sigprobes <- LMGene(ntrsample.eS)
@

There is an optional argument, \texttt{level}, which is the FDR (default 5 percent). A call using this optional parameter would look like

<<eval=TRUE, echo=TRUE>>=
sigprobes <- LMGene(ntrsample.eS,level=.01)
@

The result is a list whose components have the names of the effects in the
model. The values are the significant genes for the test of that effect or else the
message "No significant genes".

As with \texttt{tranest}, it's possible to specify a more complex model to \texttt{LMGene}:

<<eval=TRUE, echo=TRUE>>=
sigprobes <- LMGene(ntrsample.eS, model='patient+dose+patient:dose')
sigprobes
@


\end{enumerate}


\bibliographystyle{plain}
\begin{thebibliography}{99}

\bibitem{1}
Benjamini, Y. and Hochberg, Y. (1995) ``Controlling the false discovery rate: a practical and powerful approach to multiple testing,'' \emph{Journal of the Royal Statistical Society, Series B}, {\bf57}, 289--300.

\bibitem{2}  
Durbin, B.P., Hardin, J.S., Hawkins, D.M., and Rocke, D.M. (2002) ``A
variance-stabilizing transformation for gene-expression microarray
data,'' \emph{Bioinformatics}, {\bf18}, S105--S110.

\bibitem{3} 
Durbin, B. and Rocke, D. M. (2003a) ``Estimation of transformation
parameters for microarray data,'' \emph{Bioinformatics}, {\bf19}, 1360--1367.

\bibitem{4} 
Durbin, B. and Rocke, D. M. (2003b) ``Variance-stabilizing transformations for two-color microarrays,''
\emph{Bioinformatics}, {\bf20}, 660--667.

\bibitem{5} 
Geller, S.C., Gregg, J.P., Hagerman, P.J., and Rocke, D.M. (2003)
``Transformation and normalization of oligonucleotide microarray
data,'' \emph{Bioinformatics}, {\bf19}, 1817--1823.

\bibitem{6}
Huber W., Von Heydebreck A., S\"{u}ltmann H., Poustka A. and Vingron M. (2002) ``Variance stabilization applied to microarray data calibration and to the quantification of differential expression,''
\emph{Bioinformatics}, {\bf18}, S96--S104.

\bibitem{7} 
Rocke, David M. (2004) ``Design and Analysis of Experiments with High Throughput Biological Assay Data,'' 
\emph{Seminars in Cell and Developmental Biology },
{\bf15}, 708--713.

\bibitem{8} 
Rocke, D., and Durbin, B. (2001) ``A model for measurement error for
gene expression arrays,'' \emph{Journal of Computational Biology},
{\bf8}, 557--569.

\bibitem{9} 
Rocke, D. and Durbin, B. (2003) ``Approximate
variance-stabilizing transformations for gene-expression microarray
data,'' \emph{Bioinformatics}, {\bf19}, 966--972.

\end{thebibliography}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}