% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.

% \VignetteIndexEntry{maSigPro User's Guide}
% \VignetteDepends{maSigPro}
% \VignetteKeywords{Expression Analysis, Time course}
% \VignettePackage{maSigPro}

\documentclass[a4paper,11pt]{article}
\usepackage{Sweave}
\usepackage{url}
\usepackage{amsmath,pstricks,fullpage}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\usepackage{a4wide}
\usepackage[T1]{fontenc}
\parindent 0in

\bibliographystyle{abbrvnat}

\begin{document}
\title{\textbf{maSigPro User's Guide}}
\author{Ana Conesa$^1$ and María J. Nueda$^2$}
\date{7 October 2009}
\maketitle

\begin{center}
1. Centro de Investigación Príncipe Felipe, Valencia, Spain. \\
{\tt aconesa@cipf.es}\\
2. Departamento de Estadística e I.O. Universidad de Alicante, Spain. \\
{\tt mj.nueda@ua.es}\\
\end{center}

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
{\bf maSigPro} is a R package for the analysis of single and multiseries time course microarray
experiments. {\bf maSigPro} follows a two steps regression strategy to find genes with significant
temporal expression changes and significant differences between experimental groups.
The method defines a general regression model for the data where the experimental groups
are identified by dummy variables. The procedure first adjusts this global model by the
least-squared technique to identify differentially expressed genes and selects significant genes
applying false discovery rate control procedures. Secondly, stepwise regression is applied as
a variable selection strategy to study differences between experimental groups and to find
statistically significant different profiles. The coefficients obtained in this second regression
model will be useful to cluster together significant genes with similar expression patterns and
to visualize the results.
This document is a example-based guide for the use of \textbf{maSigPro}. We recommend to open a
R sesion and go through this tutorial running the code given at the different sections. The
guide does not provides a detailed description of the functions of the package or demonstrates
the statistical basis of the methodology. The later is described in the work by \citep{Conesa2006}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Getting started}
The {\bf maSigPro} package can be obtained from the Bioconductor repository or downloaded from
\url{http://www.ua.es/personal/mj.nueda} and \url{http://bioinfo.cipf.es/downloads}. Load \textbf{maSigPro} by typing at
the R prompt:
<<>>=
library(maSigPro) # load maSigPro library
@

The on-line help of {\bf maSigPro} can be started by typing at the R promt:

\begin{Schunk}
\begin{Sinput}
>help(package="maSigPro") #for package help
>?p.vector #for function help
\end{Sinput}
\end{Schunk}

The analysis approach implemented in \textbf{maSigPro} is executed in 5 major steps which are run
by the package core functions {\bf make.design.matrix()}, {\bf p.vector()}, {\bf T.fit()}, {\bf get.siggenes()}
and {\bf see.genes()}. Additionally, the package provides the wrapping function {\bf maSigPro()}
which executes the entire analysis in one go.
In the following section we will explain the usage of each of these funcions using as example a
data set from a multiple series time course experiment. At the end of this document we will
also explain how to apply {\bf maSigPro} to other experimental designs.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Multiple Series Time Course Experiment}
For this section we will use a public data set from a plant abiotic stress study performed at
the TIGR Institute by \citep{Rensink2005}.
In this study, potato plants were subjected to three different types of abiotic stresses and gene
expression was monitored at three time points after the start of the treatments. RNA was also
collected from non-stressed plants at the same time points and all samples where hybridased
against a common control on a 11K cDNA potato chip. There are three biological replicates
for each experimental condition. For speed in this example we will use a random 1000 genes
data subset of this study. This data set is part of the data supplied by the \textbf{maSigPro} package.
The original data can be found at \url{http://www.tigr.org/tdb/potato/index.shtml}.\\

Before proceeding with this tutorial we will define some of the terms that will be used along
this document. We denote experimental groups as the experimental factor for which temporal
profiles are defined, like "\emph{Treatment A}", "\emph{Tissue1}", etc; conditions are each experimental
group vs. time combination like "\emph{Treatment A at Time 0}" conditions can have or not
replicates. Variables are the regression variables defined by the \textbf{maSigPro} approach for the
experiment regression model. \textbf{maSigPro} defines dummy variables to model differences between
experimental groups. Dummy variables, Time and their interactions are the variables
of the regression model. Load the data in your workspace:
<<>>=
data(data.abiotic)
data(edesign.abiotic)
@
The {\tt edesign.abiotic} object describes the experimental design of this experiment in {\bf maSigPro} format.
<<>>=
edesign.abiotic
@
Note that arrays are given in rows and experiment descriptors are provided in columns. The
first column shows the value that variable Time takes in each array. Replicates column is an
index column that indicates the replicated arrays: all arrays belonging to the same experimental
condition must be given the same number. The remaining columns are binary columns
that give the assignment of arrays to experimental groups. There are as many binary columns
as experimental groups and arrays take the value 1 or 0 whether they belong or not to that
experimental group.\\
The \texttt{data.abiotic} object is a matrix with normalized gene expression data. Genes must be in rows
and arrays in columns.
{\bf maSigPro} uses row and colunm names of the data and edesign objects throughout the package.
Array names are the labels of the rows of the edesign object and columns of the data object
that must be in the same order. GeneIDs are the labels of the rows in the data
object. And experiment descriptors are put in the column names of the edesign object.

\begin{Schunk}
\begin{Sinput}
> colnames(data.abiotic)
> rownames(edesign.abiotic)
> colnames(edesign.abiotic)
> rownames(data.abiotic)
\end{Sinput}
\end{Schunk}

%------------------------------------------------------------------------------
\subsection{Defining the regression model}
Create a regression matrix for the full regression model:
<<>>=
design <- make.design.matrix(edesign.abiotic, degree = 2)
@
This example has three time points, so we can consider up to a quadratic regression model
(\texttt{degree = 2}). Larger number of time points would potentially allow a higher polynomial
degree.
{\tt design} is a list. Its element {\tt dis} is the actual regression design matrix. {\tt groups.vector} contains the
assignment of regression variables to experimental groups.
<<>>=
design$groups.vector
@
%------------------------------------------------------------------------------
\subsection{Finding significant genes}
The next step is to compute a regression fit for each gene. This is done by the function
{\bf p.vector()}. This function also computes the \emph{p}-value associated to the \emph{F}-Statistic of the
model, which is used to select significant genes. By default {\bf maSigPro} corrects this \emph{p}-value for multiple
comparisons by applying the linear step-up (B-H) false discovery rate (FDR) procedure \citep{Benjamini1995}. 
This procedure can be modified by choosing another option of the \textbf{p.adjust} function that is controlled by
the function parameter \texttt{MT.adjust} of \textbf{p.vector}.
The level of FDR control is given by the function parameter \texttt{Q}.
<<results=hide>>=
fit <- p.vector(data.abiotic, design, Q = 0.05, MT.adjust = "BH", min.obs = 20)

@ 

{\bf p.vector()} returns a list of values:

\begin{Schunk}
\begin{Sinput}
> fit$i # returns the number of significant genes
> fit$alfa # gives p-value at the Q false discovery control level
> fit$SELEC # is a matrix with the significant genes and their expression values
\end{Sinput}
\end{Schunk}

%------------------------------------------------------------------------------
\subsection{Finding significant differences}
Once significant genes have been found, \textbf{maSigPro} applies a variable selection procedure to
find significant variables for each gene. This will ultimatelly be used to find which are the
profile differences between experimental groups. This step is done by the \textbf{T.fit()} function.
<<results=hide>>=
tstep <- T.fit(fit, step.method = "backward", alfa = 0.05)
@
\textbf{T.fit()} executes stepwise regression. The step.method can be "backward" or "forward"
indicating whether the step procedure starts from the model with all or none variables. Use
method "two.ways.backward" and "two.ways.forward" to allow variables to both get in and
out. At each regression step the p-value of each variable is computed and variables get in/out
the model when this \emph{p}-value is lower or higher than the given cut-off value alfa.
\texttt{tstep} is also a list. Its element \texttt{sol} is a matrix of statistical results obtained by the stepwise
regression. For each selected gene the following values are given:

\begin{description}
\item
{\tt p-value} of the regression ANOVA
\item
{\tt R-squared} of the model
{\item}
{\tt p-value} of the regression coefficients of the selected variables
\end{description}

%------------------------------------------------------------------------------
\subsection{Obtaining lists of significant genes}
The following step is to generate lists of significant genes according to the way we want to see results. 
This is done by the function {\bf get.siggenes()}. 
This function has two major arguments, {\tt rsq} and {\tt vars}.

\begin{description}
\item
{\tt rsq} is a cutt-off value for the R-squared of the regression model.
{\item}
{\tt vars} is used to indicate how to group variables to show results. There are 3 possible values:
\end{description}


{"{\tt groups}":} This will generate a list of significant genes for each experimental group. The list corresponding to the reference group will contain genes whose expression profile is significantly different from a 0 profile. The lists corresponding to the remaining experimental groups will contain genes whose profiles are different from the reference group.
\\
{"{\tt all}":} One unique list of significant genes a any model variable will be produced.
\\
{"{\tt each}":} There will be as many lists as variables in the regression model. This can be used to analyze specific differences, for example genes that have linear or saturation kinetics.


<<results=hide>>=
sigs <- get.siggenes(tstep, rsq = 0.6, vars = "groups")
sigs$summary
@
The element {\tt summary} is a data frame containing the significant genes for the selected {\tt vars}
\\
You can further explore your results by:
<<>>=
names(sigs$sig.genes)
names(sigs$sig.genes$ColdvsControl)
@

%------------------------------------------------------------------------------
\subsection{Graphical display}

{\bf maSigPro}  has a number of functions for the visual exploration of the results.\\
{\bf suma2Venn()} displays the {\tt summary} result as a Venn diagram. For example, the following 
command will make a Venn diagram of the significant genes for the three stress experimental groups 
(Figure \ref{fig:venn}).

<<venn1,fig=TRUE,prefix=FALSE,echo=TRUE,include=FALSE>>=
suma2Venn(sigs$summary[, c(2:4)])

@
\begin{figure}
\begin{center}
\includegraphics[width=4in,height=4in,angle=0]{venn1}
\end{center}
\caption{Venn Diagram for Cold, Heat and Salt stress significant genes}
\protect\label{fig:venn}
\end{figure}

{\bf PlotGroups()} creates a plot of gene expression profiles by experimental group. For example, STMDE66 is a gene which shows 
significant profile differences between the control and the cold and salt strees experimental groups, but not significant differences 
between control and heat experimental groups (Figure \ref{fig:plotSTMDE66}).

<<echo=TRUE,print=FALSE,eval=TRUE, results=hide>>=
STMDE66 <- data.abiotic[rownames(data.abiotic)=="STMDE66", ]
@

<<plotGroups1,fig=TRUE,prefix=FALSE,echo=TRUE,include=FALSE>>=
PlotGroups (STMDE66, edesign = edesign.abiotic)
@

We can also add to the plot the regression curve computed for this gene.
<<plotGroups2,fig=TRUE,prefix=FALSE,echo=TRUE,include=FALSE>>=
PlotGroups (STMDE66, edesign = edesign.abiotic, show.fit = T, dis = design$dis, groups.vector = design$groups.vector)
@

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\includegraphics[width=3in,height=3in,angle=0]{plotGroups1}, &
\includegraphics[width=3in,height=3in,angle=0]{plotGroups2} \\
(a) & (b)
\end{tabular}
\end{center}
\caption{PlotGroups of gene STMDE66 without and with display or regression curves }
\protect\label{fig:plotSTMDE66}
\end{figure}

Use {\bf see.genes()} to visualize the result of a group of genes, for example, to visualize the significant genes of the 
$ControlvsSalt$ comparison. {\bf see.genes()} performs a cluster analysis 
to group genes by similar profiles. The resulting clusters are then plotted in two fashions: as experiment-wide 
expression profiles and as by-groups profiles. The first plot (Figure \ref{fig:plotprofiles}) will help to evaluate the consistency of the 
clusters while the second plot shows clearly the differences between groups (Figure \ref{fig:PLOTGROUPS}).

\begin{Schunk}
\begin{Sinput}
see.genes(sigs$sig.genes$ColdvsControl, main = "ColdvsControl", show.fit = T,
          dis =design$dis, cluster.method="kmeans" ,cluster.data = 1, k = 9)
\end{Sinput}
\end{Schunk}

\begin{figure}
\begin{center}
\includegraphics[width=6in,height=6in,angle=90]{PLOTPROFILES}
\end{center}
\caption{Cluster Analysis ColdvsControl significant genes }
\protect\label{fig:plotprofiles}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=6in,height=6in,angle=0]{PLOTGROUPS}
\end{center}
\caption{Expression Profiles ColdvsControl significant genes }
\protect\label{fig:PLOTGROUPS}
\end{figure}
%------------------------------------------------------------------------------
\subsection{The {\bf maSigPro} wrapper function}
The five analyse steps decribed in the previous sections can be executed jointly by the function {\bf maSigPro()}. The minimal sintax is:
\begin{Schunk}
\begin{Sinput}
> ts.analysis <- maSigPro (data.abiotic, edesign.abiotic)
\end{Sinput}
\end{Schunk}

Note that the default option directs the graphical output to a pdf. Set graphics off by {\tt pdf = FALSE}.
Other arguments are those of the different called functions. For example:

\begin{Schunk}
\begin{Sinput}
> ts.analysis <- maSigPro (data.abiotic, edesign.abiotic,
                           step.method = "two.ways.backward",
                           vars = "all", cluster.method = "hclust")
\end{Sinput}
\end{Schunk}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Other designs}
\subsection{Single Series Time Course}
The use of {\bf maSigPro} in Single Series Time Course experiment is straightforward. Make a {\tt edesign} object with just one group 
column containing all 1s and proceed as described above. Note that when using the {\bf get.siggenes()} funcion the options {"all"} and 
{"groups"} of the argument {\tt vars} will return the same result. You can use option {"each"} to analyze the type of responses present 
in the significant genes: significant genes at the "intercept" term will have a significant expression value at the starting time; 
genes associated to the variable "Time" will have a significant linear component, which can be induction or repression depending on the
 sign of their coefficient; genes associated to the variable "Time2" will show a change in the linear response that might be 
 indicating transitory or saturation reponses, etc...
\\
Here follows an example of a Single Series analysis.

\begin{Schunk}
\begin{Sinput}
## make a single series edesign
> Time <- rep(c(1,5,10,24), each = 3)
> Replicates <- rep(c(1:4), each = 3)
> Group <- rep(1,12)
> ss.edesign <- cbind(Time,Replicates,Group)
> rownames(ss.edesign) <- paste("Array", c(1:12), sep = "")
## Create data set
> ss.GENE <- function(n, r, var11 = 0.01, var12 = 0.02, var13 = 0.02,
           var14 = 0.02, a1 = 0, a2 = 0, a3 = 0, a4 = 0) {	
  tc.dat <- NULL
  for (i in 1:n) {
    gene <- c(rnorm(r, a1, var11), rnorm(r, a1, var12),
              rnorm(r, a3, var13), rnorm(r, a4, var14))
    tc.dat <- rbind(tc.dat, gene)
  }
  tc.dat }
> flat <-ss.GENE(n = 85, r = 3) # flat
> induc <- ss.GENE(n = 5, r = 3, a1 = 0, a2 = 0.2, a3 = 0.6, a4 = 1) # induction
> sat <- ss.GENE(n = 5, r = 3, a1 = 0, a2 = 1, a3 = 1.1, a4 = 1.2) # saturation
> ord <- ss.GENE(n = 5, r = 3, a1 = -0.8, a2 = -1, a3 = -1.3, a4 =-0.9)  # intercept
> ss.DATA <- rbind(flat, induc,sat,ord)
> rownames(ss.DATA) <- paste("feature", c(1:100), sep = "")
> colnames(ss.DATA) <- paste("Array", c(1:12), sep = "")
# run maSigPro
> ss.example <- maSigPro(ss.DATA, ss.edesign, vars="each")
\end{Sinput}
\end{Schunk}

%--------------------------------------------------------------------------------------------------------------------------
\subsection{Common Starting Time}
The following example illustrates how to build the {\tt edesign} matrix when a common 0 time is applicable to the different experimental
 groups.
\begin{Schunk}
\begin{Sinput}
> data(edesignCT)
\end{Sinput}
\end{Schunk}

%-----------------------------------------------------------------------------------------------------------------------------
\subsection{Using Measured Parameters as Independent Variable}
The regression approach followed by {\bf maSigPro} allows that gene expression profile differences can not only be evaluated as a 
function of Time but another variables, such as a physiological parameter measured at the experimental condition, can be used aswell 
as the independent variable.\\
The following example illustrates to to build the {\tt edesign} matrix in such case. In this experiment gene expression was analyzed 
as a response to a temperature shift in a growing {\it E.coli } culture. The culture OD was measured for each sample.

\begin{Schunk}
\begin{Sinput}
> data(edesign.OD)
\end{Sinput}
\end{Schunk}
%-----------------------------------------------------------------------------------------------------------------------------
\subsection{Long Time Series}

The methodology described in this tutorial works fine with short time series experiments (up to 4 or 5 time points) where biologically 
meaningful responses can be modelled with polynomes up to a degree of 3. When a large number of time points is evaluated, an alternative 
to polynomials models with high degree on the time is the piecewise regression or splines regression. In this approach the time points 
can be split in two or more sets and in each set a polynomial with low degree can be fitted (Figure \ref{fig:splines}). To consider these models 
new dummy variables must be included in the model. Now the dummies will differenciate between groups defined by a cut time (in the 
example, d1=1 if t>t*).

\begin{figure}
\begin{center}
\includegraphics[width=3in,height=2in,angle=0]{SPLINES}
\end{center}
\caption{Dummy definition by splines regression}
\protect\label{fig:splines}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliography{maSigPro}
\end{document}