% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
%\VignetteIndexEntry{LMGene User's Guide}
%\VignetteDepends{Biobase, tools, multtest, survival, splines}
\documentclass[11pt]{article}

\usepackage{amsmath,fullpage}
\usepackage{hyperref}

\parindent 0in  % Left justify

\begin{document}

\title{\bf LMGene User's Guide}
\author{Geun-Cheol Lee, John Tillinghast, and David M. Rocke}
\maketitle

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

This article introduces usage of the \texttt{LMGene} package.
\texttt{LMGene} has been developed for analysis of microarray data
using a linear model and glog data transformation in the R statistical package.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Data preparation}

\texttt{LMGene} takes objects of class \texttt{ExpressionSet}, which is the standard
data structure of the \texttt{Biobase} package. A user who already has
data of class \texttt{ExpressionSet} can therefore jump to further steps,
such as g-log transformation or looking for differentially expressed genes.
Otherwise, the user needs to generate new objects of class \texttt{ExpressionSet}.
For more detail, please see the vignette `An Introduction to Biobase and ExpressionSets'
in the \texttt{Biobase} package.

{\bf Note: ExpressionSet.} In this package, an object of class \texttt{ExpressionSet}
must produce proper data using the commands \texttt{exprs(object)} and \texttt{phenoData(object)}.

{\tt Example.} \texttt{LMGene} includes sample array data which is of class {\tt ExpressionSet}.
Let's take a look at this sample data.

\begin{enumerate}
\item First, load the necessary packages in your R session.

<<>>=
library(LMGene)
library(Biobase)
library(tools)
@

\item Load the sample {\tt ExpressionSet} class data in the package {\tt LMGene}.
<<>>=
data(sample.eS)
@

\item View the data structure of the sample data and the details of the {\tt exprs} and {\tt phenoData} slots in the data.

<<>>=
slotNames(sample.eS)
dim(exprs(sample.eS))
exprs(sample.eS)[1:3,]
phenoData(sample.eS)
slotNames(phenoData(sample.eS))
@

\end{enumerate}

{\tt Data generation.} If you don't have {\tt ExpressionSet} class data, you need to make some.
\texttt{LMGene} provides a function that can generate an object of class {\tt ExpressionSet},
assuming that there are array data of {\tt matrix} class and experimental data of {\tt list} class.

\begin{enumerate}
\item The package includes sample array and experimental/phenotype data, {\tt sample.mat} and {\tt vlist}.

<<>>=
data(sample.mat)
dim(sample.mat)
data(vlist)
vlist
@

\item Generate {\tt ExpressionSet} class data using the {\tt neweS} function.

<<>>=
test.eS<-neweS(sample.mat, vlist)
class(test.eS)
@

\end{enumerate}

\section{G-log transformation}

\begin{enumerate}
\item {\tt Estimating parameters for g-log transformation.}
In \texttt{LMGene}, the linear model is not intended to be applied to the raw data,
but to transformed and normalized data. Many people use a log transform.
\texttt{LMGene} uses a log-like transform involving two parameters.
We estimate the parameters ${\lambda}$ and ${\alpha}$ of the generalized log transform
\[
\log\left(y - \alpha + \sqrt{(y - \alpha)^2 + \lambda}\right)
  = \sinh^{-1}\left(\frac{y-\alpha}{\sqrt{\lambda}}\right) + \log\left(\sqrt{\lambda}\right)
\]
using the function \texttt{tranest} as follows:

<<>>=
tranpar <- tranest(sample.eS)
tranpar
@

The optional parameter {\tt ngenes} controls how many genes are used in the estimation.
The default is to use all of them (up to 100,000), but this option allows fewer to be used.
A typical call using this parameter would be

<<>>=
tranpar <- tranest(sample.eS, 100)
tranpar
@

In this case, 100 genes are chosen at random and used to estimate the transformation parameters.
The function returns a list containing values for \texttt{lambda} and \texttt{alpha}.
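
The sense in which this transform is `log-like' can be checked directly from its definition.
The short calculation below is an illustrative aside, not taken from the package documentation;
the name $g$ is introduced here only for convenience, and $\lambda > 0$ is assumed.
Writing $g(y) = \log\left(y - \alpha + \sqrt{(y - \alpha)^2 + \lambda}\right)$, we have
\begin{align*}
g'(y) &= \frac{1}{\sqrt{(y-\alpha)^2 + \lambda}}, \\
g(y) &\approx \log\bigl(2\,(y-\alpha)\bigr) \quad \text{when } y - \alpha \gg \sqrt{\lambda}.
\end{align*}
For large intensities the transform therefore behaves like an ordinary logarithm
(up to the additive constant $\log 2$), while near $y = \alpha$ it is approximately linear
with slope $1/\sqrt{\lambda}$. It also remains finite for $y - \alpha \le 0$, where a plain log is undefined.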
\item {\tt G-log transformation.}
Using the two estimated parameters, the g-log transformed expression set can be calculated as follows.

<<>>=
trsample.eS <- transeS(sample.eS, tranpar$lambda, tranpar$alpha)
exprs(sample.eS)[1:3,1:8]
exprs(trsample.eS)[1:3,1:8]
@

\item {\tt Tranest options: multiple alpha, lowessnorm, model}

Rather than using a single alpha for all samples, we can estimate a separate alpha for each sample.
This allows for differences in chips, in sample concentration, or exposure conditions.

<<>>=
tranparmult <- tranest(sample.eS, mult=TRUE)
tranparmult
@

For vector alphas, \texttt{transeS} uses exactly the same syntax:

<<>>=
trsample.eS <- transeS (sample.eS, tranparmult$lambda, tranparmult$alpha)
exprs(trsample.eS)[1:3,1:8]
@

It's also possible to estimate the parameters using the more accurate lowess normalization
(as opposed to uniform normalization):

<<>>=
tranparmult <- tranest(sample.eS, ngenes=100, mult=TRUE, lowessnorm=TRUE)
tranparmult
@

One may also specify a model other than the default no-interaction model.
For example, if we think that the interaction of variables in \texttt{vlist} is important,
we can add interaction to the model:

<<>>=
tranpar <- tranest(sample.eS, model='patient + dose + patient:dose')
tranpar
@

The model is always specified in the same way as the right-hand side of an \texttt{lm} model.
In the example above, we set the parameters to minimize the mean squared error
for a regression of transformed gene expression against patient, log dose, and their interaction.

Be very careful when using interactions between factor variables. If you do not have enough replicates,
you can easily overfit the data and have no degrees of freedom left for error.

Naturally, it's possible to use \texttt{mult}, \texttt{lowessnorm}, and \texttt{model} all together.
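
As a minimal sketch of such a combined call, assuming the arguments compose exactly as in the separate
examples above: the chunk is marked \texttt{eval=FALSE} so that it is not run when the vignette is built,
the object name \texttt{tranparall} is hypothetical, and \texttt{ngenes=100} is an arbitrary illustrative value.

<<eval=FALSE>>=
# Illustrative only: per-sample alphas (mult), lowess normalization,
# and an interaction model, all in a single call to tranest.
# 'tranparall' is a hypothetical name chosen so that the earlier
# tranpar and tranparmult objects are left untouched.
tranparall <- tranest(sample.eS, ngenes=100, mult=TRUE, lowessnorm=TRUE,
                      model='patient + dose + patient:dose')
tranparall
@

The caution about interactions applies here as well: with few replicates, the interaction model
may leave too few degrees of freedom for error.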
\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Finding differentially expressed genes}

\begin{enumerate}
\item {\tt Transformation and Normalization.}
Before finding differentially expressed genes, the array data needs to be transformed and normalized.

<<>>=
trsample.eS <- transeS (sample.eS, tranparmult$lambda, tranparmult$alpha)
ntrsample.eS <- lnormeS (trsample.eS)
@

\item {\tt Finding differentially expressed genes}

The \texttt{LMGene} routine computes significant probes/genes by calculating gene-by-gene p-values
for each factor in the model and adjusting for the specified false discovery rate (FDR).
A typical call would be

<<>>=
sigprobes <- LMGene(ntrsample.eS)
@

There is an optional argument, \texttt{level}, which is the FDR (default 5 percent).
A call using this optional parameter would look like

<<>>=
sigprobes <- LMGene(ntrsample.eS,level=.01)
@

The result is a list whose components have the names of the effects in the model.
The values are the significant genes for the test of that effect or else the message ``No significant genes''.

As with \texttt{tranest}, it's possible to specify a more complex model to \texttt{LMGene}:

<<>>=
sigprobes <- LMGene(ntrsample.eS, model='patient+dose+patient:dose')
sigprobes
@

\end{enumerate}

\bibliographystyle{plain}
\begin{thebibliography}{99}

\bibitem{1} Benjamini, Y. and Hochberg, Y. (1995) ``Controlling the false discovery rate:
a practical and powerful approach to multiple testing,''
\emph{Journal of the Royal Statistical Society, Series B}, {\bf57}, 289--300.

\bibitem{2} Durbin, B.P., Hardin, J.S., Hawkins, D.M., and Rocke, D.M. (2002)
``A variance-stabilizing transformation for gene-expression microarray data,''
\emph{Bioinformatics}, {\bf18}, S105--S110.

\bibitem{3} Durbin, B. and Rocke, D. M. (2003a)
``Estimation of transformation parameters for microarray data,''
\emph{Bioinformatics}, {\bf19}, 1360--1367.

\bibitem{4} Durbin, B. and Rocke, D. M.
(2003b) ``Variance-stabilizing transformations for two-color microarrays,''
\emph{Bioinformatics}, {\bf20}, 660--667.

\bibitem{5} Geller, S.C., Gregg, J.P., Hagerman, P.J., and Rocke, D.M. (2003)
``Transformation and normalization of oligonucleotide microarray data,''
\emph{Bioinformatics}, {\bf19}, 1817--1823.

\bibitem{6} Huber W., Von Heydebreck A., S\"{u}ltmann H., Poustka A. and Vingron M. (2002)
``Variance stabilization applied to microarray data calibration and to the quantification of differential expression,''
\emph{Bioinformatics}, {\bf18}, S96--S104.

\bibitem{7} Rocke, David M. (2004)
``Design and Analysis of Experiments with High Throughput Biological Assay Data,''
\emph{Seminars in Cell and Developmental Biology}, {\bf15}, 708--713.

\bibitem{8} Rocke, D., and Durbin, B. (2001)
``A model for measurement error for gene expression arrays,''
\emph{Journal of Computational Biology}, {\bf8}, 557--569.

\bibitem{9} Rocke, D. and Durbin, B. (2003)
``Approximate variance-stabilizing transformations for gene-expression microarray data,''
\emph{Bioinformatics}, {\bf19}, 966--972.

\end{thebibliography}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\end{document}