% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % \VignetteIndexEntry{maSigPro User's Guide} % \VignetteDepends{maSigPro} % \VignetteKeywords{Expression Analysis, Time course} % \VignettePackage{maSigPro} \documentclass[a4paper,11pt]{article} \usepackage{Sweave} \usepackage{url} \usepackage{amsmath,pstricks,fullpage} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \usepackage{a4wide} \usepackage[T1]{fontenc} \parindent 0in \bibliographystyle{abbrvnat} \begin{document} \title{\textbf{maSigPro User's Guide}} \author{Ana Conesa$^1$ and María J. Nueda$^2$} \date{7 October 2009} \maketitle \begin{center} 1. Centro de Investigación Príncipe Felipe, Valencia, Spain. \\ {\tt aconesa@cipf.es}\\ 2. Departamento de Estadística e I.O. Universidad de Alicante, Spain. \\ {\tt mj.nueda@ua.es}\\ \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} {\bf maSigPro} is a R package for the analysis of single and multiseries time course microarray experiments. {\bf maSigPro} follows a two steps regression strategy to find genes with significant temporal expression changes and significant differences between experimental groups. The method defines a general regression model for the data where the experimental groups are identified by dummy variables. The procedure first adjusts this global model by the least-squared technique to identify differentially expressed genes and selects significant genes applying false discovery rate control procedures. Secondly, stepwise regression is applied as a variable selection strategy to study differences between experimental groups and to find statistically significant different profiles. The coefficients obtained in this second regression model will be useful to cluster together significant genes with similar expression patterns and to visualize the results. This document is a example-based guide for the use of \textbf{maSigPro}. We recommend to open a R sesion and go through this tutorial running the code given at the different sections. The guide does not provides a detailed description of the functions of the package or demonstrates the statistical basis of the methodology. The later is described in the work by \citep{Conesa2006}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Getting started} The {\bf maSigPro} package can be obtained from the Bioconductor repository or downloaded from \url{http://www.ua.es/personal/mj.nueda} and \url{http://bioinfo.cipf.es/downloads}. Load \textbf{maSigPro} by typing at the R prompt: <<>>= library(maSigPro) # load maSigPro library @ The on-line help of {\bf maSigPro} can be started by typing at the R promt: \begin{Schunk} \begin{Sinput} >help(package="maSigPro") #for package help >?p.vector #for function help \end{Sinput} \end{Schunk} The analysis approach implemented in \textbf{maSigPro} is executed in 5 major steps which are run by the package core functions {\bf make.design.matrix()}, {\bf p.vector()}, {\bf T.fit()}, {\bf get.siggenes()} and {\bf see.genes()}. Additionally, the package provides the wrapping function {\bf maSigPro()} which executes the entire analysis in one go. In the following section we will explain the usage of each of these funcions using as example a data set from a multiple series time course experiment. At the end of this document we will also explain how to apply {\bf maSigPro} to other experimental designs. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Multiple Series Time Course Experiment} For this section we will use a public data set from a plant abiotic stress study performed at the TIGR Institute by \citep{Rensink2005}. In this study, potato plants were subjected to three different types of abiotic stresses and gene expression was monitored at three time points after the start of the treatments. RNA was also collected from non-stressed plants at the same time points and all samples where hybridased against a common control on a 11K cDNA potato chip. There are three biological replicates for each experimental condition. For speed in this example we will use a random 1000 genes data subset of this study. This data set is part of the data supplied by the \textbf{maSigPro} package. The original data can be found at \url{http://www.tigr.org/tdb/potato/index.shtml}.\\ Before proceeding with this tutorial we will define some of the terms that will be used along this document. We denote experimental groups as the experimental factor for which temporal profiles are defined, like "\emph{Treatment A}", "\emph{Tissue1}", etc; conditions are each experimental group vs. time combination like "\emph{Treatment A at Time 0}" conditions can have or not replicates. Variables are the regression variables defined by the \textbf{maSigPro} approach for the experiment regression model. \textbf{maSigPro} defines dummy variables to model differences between experimental groups. Dummy variables, Time and their interactions are the variables of the regression model. Load the data in your workspace: <<>>= data(data.abiotic) data(edesign.abiotic) @ The {\tt edesign.abiotic} object describes the experimental design of this experiment in {\bf maSigPro} format. <<>>= edesign.abiotic @ Note that arrays are given in rows and experiment descriptors are provided in columns. The first column shows the value that variable Time takes in each array. Replicates column is an index column that indicates the replicated arrays: all arrays belonging to the same experimental condition must be given the same number. The remaining columns are binary columns that give the assignment of arrays to experimental groups. There are as many binary columns as experimental groups and arrays take the value 1 or 0 whether they belong or not to that experimental group.\\ The \texttt{data.abiotic} object is a matrix with normalized gene expression data. Genes must be in rows and arrays in columns. {\bf maSigPro} uses row and colunm names of the data and edesign objects throughout the package. Array names are the labels of the rows of the edesign object and columns of the data object that must be in the same order. GeneIDs are the labels of the rows in the data object. And experiment descriptors are put in the column names of the edesign object. \begin{Schunk} \begin{Sinput} > colnames(data.abiotic) > rownames(edesign.abiotic) > colnames(edesign.abiotic) > rownames(data.abiotic) \end{Sinput} \end{Schunk} %------------------------------------------------------------------------------ \subsection{Defining the regression model} Create a regression matrix for the full regression model: <<>>= design <- make.design.matrix(edesign.abiotic, degree = 2) @ This example has three time points, so we can consider up to a quadratic regression model (\texttt{degree = 2}). Larger number of time points would potentially allow a higher polynomial degree. {\tt design} is a list. Its element {\tt dis} is the actual regression design matrix. {\tt groups.vector} contains the assignment of regression variables to experimental groups. <<>>= design$groups.vector @ %------------------------------------------------------------------------------ \subsection{Finding significant genes} The next step is to compute a regression fit for each gene. This is done by the function {\bf p.vector()}. This function also computes the \emph{p}-value associated to the \emph{F}-Statistic of the model, which is used to select significant genes. By default {\bf maSigPro} corrects this \emph{p}-value for multiple comparisons by applying the linear step-up (B-H) false discovery rate (FDR) procedure \citep{Benjamini1995}. This procedure can be modified by choosing another option of the \textbf{p.adjust} function that is controlled by the function parameter \texttt{MT.adjust} of \textbf{p.vector}. The level of FDR control is given by the function parameter \texttt{Q}. <>= fit <- p.vector(data.abiotic, design, Q = 0.05, MT.adjust = "BH", min.obs = 20) @ {\bf p.vector()} returns a list of values: \begin{Schunk} \begin{Sinput} > fit$i # returns the number of significant genes > fit$alfa # gives p-value at the Q false discovery control level > fit$SELEC # is a matrix with the significant genes and their expression values \end{Sinput} \end{Schunk} %------------------------------------------------------------------------------ \subsection{Finding significant differences} Once significant genes have been found, \textbf{maSigPro} applies a variable selection procedure to find significant variables for each gene. This will ultimatelly be used to find which are the profile differences between experimental groups. This step is done by the \textbf{T.fit()} function. <>= tstep <- T.fit(fit, step.method = "backward", alfa = 0.05) @ \textbf{T.fit()} executes stepwise regression. The step.method can be "backward" or "forward" indicating whether the step procedure starts from the model with all or none variables. Use method "two.ways.backward" and "two.ways.forward" to allow variables to both get in and out. At each regression step the p-value of each variable is computed and variables get in/out the model when this \emph{p}-value is lower or higher than the given cut-off value alfa. \texttt{tstep} is also a list. Its element \texttt{sol} is a matrix of statistical results obtained by the stepwise regression. For each selected gene the following values are given: \begin{description} \item {\tt p-value} of the regression ANOVA \item {\tt R-squared} of the model {\item} {\tt p-value} of the regression coefficients of the selected variables \end{description} %------------------------------------------------------------------------------ \subsection{Obtaining lists of significant genes} The following step is to generate lists of significant genes according to the way we want to see results. This is done by the function {\bf get.siggenes()}. This function has two major arguments, {\tt rsq} and {\tt vars}. \begin{description} \item {\tt rsq} is a cutt-off value for the R-squared of the regression model. {\item} {\tt vars} is used to indicate how to group variables to show results. There are 3 possible values: \end{description} {"{\tt groups}":} This will generate a list of significant genes for each experimental group. The list corresponding to the reference group will contain genes whose expression profile is significantly different from a 0 profile. The lists corresponding to the remaining experimental groups will contain genes whose profiles are different from the reference group. \\ {"{\tt all}":} One unique list of significant genes a any model variable will be produced. \\ {"{\tt each}":} There will be as many lists as variables in the regression model. This can be used to analyze specific differences, for example genes that have linear or saturation kinetics. <>= sigs <- get.siggenes(tstep, rsq = 0.6, vars = "groups") sigs$summary @ The element {\tt summary} is a data frame containing the significant genes for the selected {\tt vars} \\ You can further explore your results by: <<>>= names(sigs$sig.genes) names(sigs$sig.genes$ColdvsControl) @ %------------------------------------------------------------------------------ \subsection{Graphical display} {\bf maSigPro} has a number of functions for the visual exploration of the results.\\ {\bf suma2Venn()} displays the {\tt summary} result as a Venn diagram. For example, the following command will make a Venn diagram of the significant genes for the three stress experimental groups (Figure \ref{fig:venn}). <>= suma2Venn(sigs$summary[, c(2:4)]) @ \begin{figure} \begin{center} \includegraphics[width=4in,height=4in,angle=0]{venn1} \end{center} \caption{Venn Diagram for Cold, Heat and Salt stress significant genes} \protect\label{fig:venn} \end{figure} {\bf PlotGroups()} creates a plot of gene expression profiles by experimental group. For example, STMDE66 is a gene which shows significant profile differences between the control and the cold and salt strees experimental groups, but not significant differences between control and heat experimental groups (Figure \ref{fig:plotSTMDE66}). <>= STMDE66 <- data.abiotic[rownames(data.abiotic)=="STMDE66", ] @ <>= PlotGroups (STMDE66, edesign = edesign.abiotic) @ We can also add to the plot the regression curve computed for this gene. <>= PlotGroups (STMDE66, edesign = edesign.abiotic, show.fit = T, dis = design$dis, groups.vector = design$groups.vector) @ \begin{figure} \begin{center} \begin{tabular}{cc} \includegraphics[width=3in,height=3in,angle=0]{plotGroups1}, & \includegraphics[width=3in,height=3in,angle=0]{plotGroups2} \\ (a) & (b) \end{tabular} \end{center} \caption{PlotGroups of gene STMDE66 without and with display or regression curves } \protect\label{fig:plotSTMDE66} \end{figure} Use {\bf see.genes()} to visualize the result of a group of genes, for example, to visualize the significant genes of the $ControlvsSalt$ comparison. {\bf see.genes()} performs a cluster analysis to group genes by similar profiles. The resulting clusters are then plotted in two fashions: as experiment-wide expression profiles and as by-groups profiles. The first plot (Figure \ref{fig:plotprofiles}) will help to evaluate the consistency of the clusters while the second plot shows clearly the differences between groups (Figure \ref{fig:PLOTGROUPS}). \begin{Schunk} \begin{Sinput} see.genes(sigs$sig.genes$ColdvsControl, main = "ColdvsControl", show.fit = T, dis =design$dis, cluster.method="kmeans" ,cluster.data = 1, k = 9) \end{Sinput} \end{Schunk} \begin{figure} \begin{center} \includegraphics[width=6in,height=6in,angle=90]{PLOTPROFILES} \end{center} \caption{Cluster Analysis ColdvsControl significant genes } \protect\label{fig:plotprofiles} \end{figure} \begin{figure} \begin{center} \includegraphics[width=6in,height=6in,angle=0]{PLOTGROUPS} \end{center} \caption{Expression Profiles ColdvsControl significant genes } \protect\label{fig:PLOTGROUPS} \end{figure} %------------------------------------------------------------------------------ \subsection{The {\bf maSigPro} wrapper function} The five analyse steps decribed in the previous sections can be executed jointly by the function {\bf maSigPro()}. The minimal sintax is: \begin{Schunk} \begin{Sinput} > ts.analysis <- maSigPro (data.abiotic, edesign.abiotic) \end{Sinput} \end{Schunk} Note that the default option directs the graphical output to a pdf. Set graphics off by {\tt pdf = FALSE}. Other arguments are those of the different called functions. For example: \begin{Schunk} \begin{Sinput} > ts.analysis <- maSigPro (data.abiotic, edesign.abiotic, step.method = "two.ways.backward", vars = "all", cluster.method = "hclust") \end{Sinput} \end{Schunk} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Other designs} \subsection{Single Series Time Course} The use of {\bf maSigPro} in Single Series Time Course experiment is straightforward. Make a {\tt edesign} object with just one group column containing all 1s and proceed as described above. Note that when using the {\bf get.siggenes()} funcion the options {"all"} and {"groups"} of the argument {\tt vars} will return the same result. You can use option {"each"} to analyze the type of responses present in the significant genes: significant genes at the "intercept" term will have a significant expression value at the starting time; genes associated to the variable "Time" will have a significant linear component, which can be induction or repression depending on the sign of their coefficient; genes associated to the variable "Time2" will show a change in the linear response that might be indicating transitory or saturation reponses, etc... \\ Here follows an example of a Single Series analysis. \begin{Schunk} \begin{Sinput} ## make a single series edesign > Time <- rep(c(1,5,10,24), each = 3) > Replicates <- rep(c(1:4), each = 3) > Group <- rep(1,12) > ss.edesign <- cbind(Time,Replicates,Group) > rownames(ss.edesign) <- paste("Array", c(1:12), sep = "") ## Create data set > ss.GENE <- function(n, r, var11 = 0.01, var12 = 0.02, var13 = 0.02, var14 = 0.02, a1 = 0, a2 = 0, a3 = 0, a4 = 0) { tc.dat <- NULL for (i in 1:n) { gene <- c(rnorm(r, a1, var11), rnorm(r, a1, var12), rnorm(r, a3, var13), rnorm(r, a4, var14)) tc.dat <- rbind(tc.dat, gene) } tc.dat } > flat <-ss.GENE(n = 85, r = 3) # flat > induc <- ss.GENE(n = 5, r = 3, a1 = 0, a2 = 0.2, a3 = 0.6, a4 = 1) # induction > sat <- ss.GENE(n = 5, r = 3, a1 = 0, a2 = 1, a3 = 1.1, a4 = 1.2) # saturation > ord <- ss.GENE(n = 5, r = 3, a1 = -0.8, a2 = -1, a3 = -1.3, a4 =-0.9) # intercept > ss.DATA <- rbind(flat, induc,sat,ord) > rownames(ss.DATA) <- paste("feature", c(1:100), sep = "") > colnames(ss.DATA) <- paste("Array", c(1:12), sep = "") # run maSigPro > ss.example <- maSigPro(ss.DATA, ss.edesign, vars="each") \end{Sinput} \end{Schunk} %-------------------------------------------------------------------------------------------------------------------------- \subsection{Common Starting Time} The following example illustrates how to build the {\tt edesign} matrix when a common 0 time is applicable to the different experimental groups. \begin{Schunk} \begin{Sinput} > data(edesignCT) \end{Sinput} \end{Schunk} %----------------------------------------------------------------------------------------------------------------------------- \subsection{Using Measured Parameters as Independent Variable} The regression approach followed by {\bf maSigPro} allows that gene expression profile differences can not only be evaluated as a function of Time but another variables, such as a physiological parameter measured at the experimental condition, can be used aswell as the independent variable.\\ The following example illustrates to to build the {\tt edesign} matrix in such case. In this experiment gene expression was analyzed as a response to a temperature shift in a growing {\it E.coli } culture. The culture OD was measured for each sample. \begin{Schunk} \begin{Sinput} > data(edesign.OD) \end{Sinput} \end{Schunk} %----------------------------------------------------------------------------------------------------------------------------- \subsection{Long Time Series} The methodology described in this tutorial works fine with short time series experiments (up to 4 or 5 time points) where biologically meaningful responses can be modelled with polynomes up to a degree of 3. When a large number of time points is evaluated, an alternative to polynomials models with high degree on the time is the piecewise regression or splines regression. In this approach the time points can be split in two or more sets and in each set a polynomial with low degree can be fitted (Figure \ref{fig:splines}). To consider these models new dummy variables must be included in the model. Now the dummies will differenciate between groups defined by a cut time (in the example, d1=1 if t>t*). \begin{figure} \begin{center} \includegraphics[width=3in,height=2in,angle=0]{SPLINES} \end{center} \caption{Dummy definition by splines regression} \protect\label{fig:splines} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \bibliography{maSigPro} \end{document}