% \VignetteIndexEntry{RankProd Tutorial}
% \VignetteKeywords{Microarray identify genes}
% \VignettePackage{RankProd}

\documentclass[11pt]{article}

% Approximately 1 inch borders all around
\setlength\topmargin{-0.2in} \setlength\evensidemargin{-.0in}
\setlength\oddsidemargin{-.0in} \setlength\textwidth{6.6in}
\setlength\textheight{8.4in}


%\usepackage{fancyheadings}


\usepackage{graphics}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{latexsym}
\usepackage{makeidx}
\usepackage{amsfonts}
\usepackage[dvips]{epsfig}
\usepackage{url}

\let\BLS=\baselinestretch
\makeatletter
\newcommand{\singlespacing}{\let\CS=\@currsize\renewcommand{\baselinestretch}{1}
\small\CS}
\newcommand{\doublespacing}{\let\CS=\@currsize\renewcommand{
\baselinestretch}{1.5}\small\CS}
\newcommand{\normalspacing}{\let\CS=\@currsize\renewcommand{\baselinestretch}
{\BLS}\small\CS}
\makeatother

%\pagestyle{myheadings}

\title{ Bioconductor RankProd Package Vignette }

\author{Fangxin Hong, fhong@salk.edu
\\ Francesco Del Carratore, francescodc87@gmail.com
\\ Ben Wittner, wittner.ben@mgh.harvard.edu
\\ Rainer Breitling, rainer.breitling@manchester.ac.uk
\\ Andris Janckevics, andris.jankevics@gmail.com}


\begin{document}
\SweaveOpts{concordance=TRUE}
%\markright{RankProd Vignette}
\maketitle

%\pagestyle{plain}
\pagenumbering{roman}

\renewcommand{\baselinestretch}{1.2}
\tableofcontents

\pagenumbering{arabic}
\setcounter{page}{1}

\section{Introduction}
\noindent The RankProd package contains all the functions needed to apply the
Rank Product (RP) and the Rank Sum (RS) methods
(Breitling et al., 2004, \textit{FEBS Letters} \textbf{573}:83)
to omics datasets.

\noindent Both methods are a non-parametric statistical tests,
derived from biological reasoning, able to detect variables (e.g. genes or
metabolites) that are consistently upregulated (or downregulated) in a number
of replicated experiments.
In contrast to alternative approaches, both RP and RS are based on relatively
weak assumptions:
\begin{enumerate}
\item{only a minority of all the features measured are upregulated
(or downregulated);}
\item{the measurements are independent between replicate experiments;}
\item{most of the changes are independent with each other;}
\item{measurement variance is about equal for all measurements.}
\end{enumerate}
While the first three are biologically reasonable assumptions, the latter
cannot be considered always true, especially when dealing with metabolomics
datasets.
For this reason data preprocessing through the use of a variance stabilization
method is often required.


\paragraph{Brief description of the RP method.}
Suppose we have a differential expression data for a total of $N$ genes in $K$
replicated experiments. Let $r_{i,j}$ be the position of the $i^{th}$ gene in
the $j^{th}$ replicate experiment in a list ordered according to fold changes
(in a decreasing order if we are interested in upreguleted genes,
or in a increasing order vice versa). Under the null hypothesis
(no differential expressed genes are present in the dataset), the rank of
a gene in the list generated, considering a single replicate, comes from an
uniform distribution (i.e. $P(r_{i} = x) = 1/N$, where $x \in \{1,..,N\}$).
Considering all the $K$ replicates, one should notice that is extremely unlikely
to find the same gene at the top of each list just by chance. In fact the exact
probability of a gene being ranked 1 in each replicate is exactly $1/N^{K}$.
The RP statistic for the $i^{th}$ is defined as the geometric mean of all the
ranks of the gene obtained in each replicate:
\begin{equation} \label{RP}
RP_{i} = \Bigg{(}\prod_{j=1}^{K} {r_{i,j}}\Bigg{)}^{1/K}
\end{equation}
While the RS statistic is defined as the arithmetic mean of all the ranks:
\begin{equation} \label{RS}
RS_{i} = \frac{1}{K}\sum_{j=1}^{K} {r_{i,j}}
\end{equation}
Genes with the smallest $RP$ (or $RS$) values are the most likely to be
upregulated or downregulated (according to the order we chose when ranking the
fold changes).
The RP is equivalent to calculating the geometric mean rank;
replacing the \textit{product} by the \textit{sum}
leads to a statistics (RS). This statistic is slightly less sensitive
to outliers and puts a higher premium on consistency between the ranks in
various lists. This can be useful in some applications as detailed below.


\noindent The package is able to analyse different kinds of data, such as:
Affymetrix Genechip data, spotted cDNA array data (after normalization) and
metabolomics data (after variance stabilization).
Both methods are also able to combine datasets derived from different origins
into one analysis, increasing the power of the identification.
Since the methods use the ranks of the variables in each replicated experiment
(instead of the actual values), it can be flexibly applied to many different
situations, such as identifying genes which are down-regulated under one
condition while being up-regulated under another condition.\\

\noindent This guide gives a tutorial-style introduction to the main features
of RankProd and to the usage of its functions. The presentation
focuses on the analysis of Affymetrix array data, cDNA array data
and metabolomics data obtained from mass spectrometry. \\


\noindent First, it is necessary to load the package.
<<results=hide>>=
library(RankProd)
@

\noindent In the following, we use the \textit{Arabidopsis} dataset that is
contained in this package to illustrate how the RP and RS methods
can be applied.
<<>>=
data(arab)
@

\verb"data(arab)" consists of a 500 $\times$ 10 matrix \verb"arab" containing
the expression levels of 500 genes in 10 samples, a vector \verb"arab.cl"
containing the class labels of the 10 samples, a vector \verb"arab.origin"
containing the origin labels of the 10 samples (data were produced at two
different laboratories), and a vector \verb"arab.gnames"
containing the names(AffyID) of the 500 genes.
The dataset is normalized by \verb"RMA", thus it is in \textit{log2} scale.


\section{Required arguments}
\noindent In order to run a RP analysis, users need to call either
the function \verb"RankProducts" or \verb"RP.advance" (is it also possible to
use the two functions \verb"RP" and \verb"RPadvance" which have been kept in the
package for backward compatibility). \verb"RankProducts" is a simpler version, 
which is specialized in handling data sets from a single origin, while
\verb"RP.advance" is able to analyse data with single or multiple origins,
and also perform some advanced analysis. There are two required arguments for
the function \verb"RanksProducts": \verb"data" and \verb"cl", which are
identical to those required by the function \verb"SAM" contained in the package
\verb"siggenes". The first required argument, \verb"data", is the matrix
(or data frame) containing the gene expression data that should be analysed.
Each of its rows corresponds to a gene, and each column corresponds to a sample,
which would be obtained, for example, by
\begin{verbatim}
> Dilution <- ReadAffy()
> data<-exprs(rma(Dilution))
\end{verbatim}
The second required argument, \verb"cl", is a vector of length
\verb"ncol(data)" containing the class labels of the samples. In a RP
analysis for a datasets containing samples from different origins,
there is one more required argument in the function \verb"RP.advance":
\verb"origin", which is a vector of length \verb"ncol(data)" containing the
origin labels of the samples.\\

\noindent \textbf{One class data.} In the one class case, \verb"cl" is expected
to be a vector of length \verb"n" containing only 1's, where \verb"n" denotes
the number of samples. A label value other than 1 would also be accepted.
In the latter case, this value is automatically set to 1. So for n=5,
the vector \verb"cl" is given by
<<>>=
n <- 5
cl <- rep(1,5)
cl
@
Note: for one class data, we usually refer it as the expression ratio of two
channels. In the outputs from the package, we call the channel used as the
numerator as class 1 and the channel used as denominator as class 2.

\vspace{0.3in}
\noindent \textbf{Two class data.} In this case, the function expects a vector
\verb"cl" consisting only of 0's and 1's, where all the samples with class
label `0' belong to the first group (e.g. the control group) and the samples
with class label `1' belong to the second group (e.g. the treatment group).
For example, the first \verb"n1=5" columns belong to the first group, and the
next \verb"n2=4" columns belong to the second group, the \verb"cl" is given by
<<>>=
n1 <- 5
n2 <- 4
cl <- rep(c(0,1),c(n1,n2))
cl
@
Identically to the behaviour of the \verb"SAM" analysis, the function also
accepts others values. In that case, the smaller value is set
to 0 to be the first class and the larger value to 1 as the second class.\\

\vspace{0.3in}
\noindent \textbf{Single origin:}  If the data were generated under identical
or very similar conditions except the factor of interest (e.g.
control and treatment), it is considered to be data with a single origin.
This is the most common case of array analysis.
In this case, the function \verb"RP.advance" (and RPadvance) expects a vector
\verb"origin" of length \verb"n" with only 1's.
For example, for 9 samples generated at one time in one laboratories,
the first 5 columns in the data are class 1, and the next 4 are class 2,
the \verb"cl" and \verb"origin" are given by
<<>>=
n1 <- 5
n2 <- 4
cl <- rep(c(0,1),c(n1,n2))
cl
origin <- rep(1, n1+n2)
origin
@

If 9 samples are from one class, the \verb"cl" and \verb"origin" vectors are
given by:
<<>>=
n <- 9
cl <- rep(1,n)
cl
origin <- rep(1, n)
origin
@

\vspace{0.3in}
\noindent \textbf{Multiple origins:} Sometimes happens that different
laboratories conduct a very similar experiment to study the effect of the same
treatment (e.g. application of a certain drug). Datasets generated by
different laboratories are considered as data with different origin, as it is
known that the resulting data are not directly comparable. The RP can
combine these datasets together to perform an overall analysis. In this case,
the vector \verb"origin" should consist numbers 1 to $L$, where $L$ is the
number of different origins. For example, if 3 laboratories performed the
same study using respectively 6, 4 and 8 samples,
the \verb"origin" vector is given by
<<>>=
origin <- c(rep(1, 6), rep(2,4), rep(3,8))
origin
@

The function also accepts other values in the origin labels. In that case,
samples with the same origin
label will be treated as having the same origin.\\

\noindent Example: For the dataset \verb"arab" which is included in the
package, 6 samples are from laboratory 1,
and another 4 are from laboratory 2. Both laboratories compare wild type
\textit{Arabidopsis} plants with and without treatment (i.e. brassinosteroid).
<<>>=
colnames(arab)
arab.cl
arab.origin
@

\section{Identification of differentially expressed genes -- Affymetrix array}
\noindent In this section, we show how the RP method can be applied
to the dataset \verb"arab".
One should notice that RP identifies differentially expressed genes in
two separate lists, up- and down-regulated genes separately. 
For each variable, a pfp (percentage of false prediction) is computed
and used to select the differentially expressed variables.
Alternatively, the p-values estimated by the function can be used with the
same purpose after a multiple test correction is performed
(e.g. Benjamini-Hochberg).

\subsection{Data with single origin}
Here, we perform the analysis for the samples from the same origin.
A subset data matrix is extracted by selecting columns whose origin label is 1.
<<>>=
arab.sub <- arab[,which(arab.origin==1)]
arab.cl.sub <- arab.cl[which(arab.origin==1)]
arab.origin.sub <- arab.origin[which(arab.origin==1)]
@
The RP analysis for single-origin data can be performed by either
\verb"RankProducts" or \verb"RP.advance"(and also by the two functions kept for
backward compatibility \verb"RP" and \verb"RPadvance"). Initially, we use the
function \verb"RankProducts" to look for differentially expressed genes between
class 2 (class label=1)and class 1 (class label=0). 
<<>>=
RP.out <- RankProducts(arab.sub,arab.cl.sub, logged=TRUE,
na.rm=FALSE,plot=FALSE,  rand=123)
@
Data in \verb"arab" are already log-transformed, otherwise one
should set \verb"logged=FALSE". The argument \verb"plot=FALSE" will prevent the
graphical display of the estimated pfp \textit{vs.} number of identified genes.
The argument \verb"rand" sets the random seed number to \verb"123" allowing the
function to produce reproducible results.
Since some of the function parameters have a default value, we can use this
function by simply typing:
\begin{verbatim}
> RP.out <- RankProducts(arab.sub,arab.cl.sub,gene.names=arab.gnames,rand=123)
\end{verbatim}
The same results could also be obtained by
\begin{verbatim}
> RP.out <- RP.advance(arab.sub, arab.cl.sub, arab.origin.sub,
+     logged = TRUE, na.rm = FALSE, gene.names = arab.gnames, plot = FALSE,
+     rand = 123)
\end{verbatim}
or
\begin{verbatim}
> RP.out=RP.advance(arab.sub,arab.cl.sub,arab.origin.sub,gene.names=arab.gnames,rand=123)
\end{verbatim}
or
\begin{verbatim}
> RP.out <- RP(arab.sub,arab.cl.sub,gene.names=arab.gnames,rand=123)
\end{verbatim}
or
\begin{verbatim}
> RP.out=RPadvance(arab.sub,arab.cl.sub,arab.origin.sub,gene.names=arab.gnames,
+                  rand=123)
\end{verbatim}

\noindent The function \verb"plotRP" can be used to plot a graphical display of
the estimated pfp \textit{vs.} number of identified genes using the output from
\verb"RankProducts" or \verb"RP.advance" (also for \verb"RP"
and \verb"RPadvance"). If \verb"cutoff" (the maximum accepted pfp) is specified,
identified genes are marked in red (see figure 1). Note that the estimated pfps
are not necessarily smaller than 1, but they will converge to 1.
Two plots will be generated on current graphic display, for identification
of up- and down-regulated genes under class 2, respectively.\\

\begin{center}
<<fig=TRUE,echo=TRUE>>=
plotRP(RP.out, cutoff=0.05)
@
\end{center}

\noindent The function \verb"topGene" generates a table of the
identified genes based on user-specified selection criteria. One of the required
argument is the output object from \verb"RankProducts" or \verb"RP.advance"
(also for \verb"RP" and \verb"RPadvance"). The user also needs to specify
either the \verb"cutoff" (the pfp or p value threshold) or
\verb"num.gene" (the number of top genes identified),
otherwise a error message will be printed and the function will stop.
If \verb"cutoff" is specified, the function also requests user to select either 
\verb"pfp" (percentage of false prediction) or \verb"pval" (p value) which
is used to select genes. \verb"pfp" is the default setting.

\begin{verbatim}
> topGene(RP.out,gene.names=arab.gnames)
Error in topGene(RP.out, gene.names = arab.gnames) : 
No selection criteria is input, please input either cutoff or num.gene
\end{verbatim}
<<>>=
topGene(RP.out,cutoff=0.05,method="pfp",
logged=TRUE,logbase=2,gene.names=arab.gnames)
@

\noindent Here the user can choose variables shown by controlling
$pfp<0.05$. If \verb"gene.names" is provided the output will also show the names
of the selected genes. Since data set \verb"arab" is in log based 2 scale,
we specified \verb"logged=TRUE" and \verb"logbase=2", which are the default
values. 

The output consists of two tables, listing selected up- (Table1: class 1 $<$ class 2)
and down- (Table2: class 1 $>$ class 2) regulated genes.
In the tables, there are 5 columns, the first one \verb"gene.index" contains
the gene indexes; the second \verb"RP/Rsum" contains the computed Rank
Product (or RS) statistics;
the third \verb"FC:(class1/class2)" contains the computed fold change of the
average expression levels under two conditions, which would be converted to the
original scale using input \verb"logbase" (default value is 2) if
\verb"logged=TRUE" is specified;
the fourth \verb"pfp" contains the estimated \verb"pfp" value for each gene
in the list; the last \verb"P.value" contained the estimated
P-values for each gene. If the user wants to use a less stringent criterion,
a cutoff on the p-value ($pvalue<0.05$) can be specified as:
\begin{verbatim}
> topGene(RP.out,cutoff=0.05,method="pval",logged=TRUE,logbase=2,
+         gene.names=arab.gnames)
\end{verbatim}

If the user is interested in the top 50 genes, he/she can type
\begin{verbatim}
> topGene(RP.out,num.gene=50,gene.names=arab.gnames)
\end{verbatim}

\subsection{Data with multiple origins} \label{secaffy:multi}
\noindent In this section, we will illustrate how the RP method
can be applied to datasets containing samples from multiple origins using
the built-in data set \verb"arab". As mentioned before, \verb"arab" consists of
array data measured by two different laboratories.
Both laboratories measured gene expression under the same two conditions.\\
\noindent Given the lack of experimental standards for microarray experiments, 
direct comparison is not feasible. Instead of using actual
expression data, our approach combines the gene
rank from different origins together
(for details refer to Breitling et al. (2004)). 
<<>>=
##identify differentially expressed  genes
RP.adv.out <-  RP.advance(arab,arab.cl,arab.origin,
logged=TRUE,gene.names=arab.gnames,rand=123)
#The last command can also be written using the old syntax
#RP.adv.out <-  RPadvance(arab,arab.cl,arab.origin,
#logged=TRUE,gene.names=arab.gnames,rand=123)
@

\begin{center}
<<fig=TRUE,echo=TRUE>>=
plotRP(RP.adv.out, cutoff=0.05)
@
\end{center}

By combining data from different origins, the power of the statistical test
increases leading to an higher number of selected genes, as shown in the
following table.

\begin{center}
<<>>=
topGene(RP.adv.out,cutoff=0.05,method="pfp",logged=TRUE,logbase=2,
        gene.names=arab.gnames)
@
\end{center}

\section{Identification of differentially expressed genes -- cDNA array}
\noindent When dealing with cDNA array data, the usage of the RP
method has to change since gene expressions of two conditions
are measured from a single spot. Furthermore, the RP implementation
will also change according to the experimental design.
The two most commonly encountered experimental design are:
\begin{itemize}
\item{\textbf{common reference design}, where two RNA samples are compared via
a common reference;}
\item{\textbf{direct two-color design}, where two RNA samples are directly
compared without a common reference.}
\end{itemize}

\subsection{Common Reference Design}
This type of analysis is very similar to the analysis of
Affymetrix Genechips. As an example, we will
have a look at the data \verb"lymphoma" copied from the package \verb"vsn". 
<<>>=
data(lymphoma)
@
\begin{verbatim}
> pData(lymphoma)
      name    sample
1  lc7b047 reference
2  lc7b047    CLL-13
3  lc7b048 reference
4  lc7b048    CLL-13
5  lc7b069 reference
6  lc7b069    CLL-52
7  lc7b070 reference
8  lc7b070    CLL-39
9  lc7b019 reference
10 lc7b019 DLCL-0032
11 lc7b056 reference
12 lc7b056 DLCL-0024
13 lc7b057 reference
14 lc7b057 DLCL-0029
15 lc7b058 reference
16 lc7b058 DLCL-0023
\end{verbatim}

As shown in the table, the 16 columns of the \verb"lymphoma" object contain the red
and green intensities of the 8 slides.
Thus, the Ch1 intensities are in columns 1,3,\ldots,15, while the Ch2
intensities are in columns 2,4,\ldots,16. We can call \verb"vsn" to normalize
all of them at once.

\begin{verbatim}
> library(vsn)
> lym.vsn <- vsn(lymphoma)
> lym.exp <- exprs(lym.vsn)
\end{verbatim}

Next, we can obtain the log-ratios for each slide by subtracting the common
reference intensities from the 8 samples. After a class label vector is
created, the \verb"RankProducts" function can be called to
perform a two-classes analysis.
<<>>=
refrs <- (1:8)*2-1
samps <- (1:8)*2
M <- lym.exp[,samps]-lym.exp[,refrs]
colnames(M)
cl <- c(rep(0,4),rep(1,4))
cl  #"CLL" is class 1, and "DLCL" is class 2
RP.out <- RankProducts(M,cl, logged=TRUE, rand=123)
#The last command can also be written using the old syntax
# RP.out <- RP(M,cl,logged=TRUE,rand=123)
@

<<results=hide>>=
topGene(RP.out,cutoff=0.05,logged=TRUE,logbase=exp(1))
@
Note that \verb"vsn" normalized data is in log base \textit{e}.


\subsection{Direct two-color design}
\noindent In this case, the gene expression ratio of the two dyes (classes) is
measured for each spot.
Here we consider and experiment where two wild type (class1) and two mutant mice
(class2) are compared.
The targets might be:
\begin{center}
\begin{tabular}{lccc} \hline
File name & Cy3 &Cy5 & Ratio=wt/mu \\\hline
File 1 & wt & mu  & Cy3/Cy5 \\
File 2  & mu & wt  & Cy5/Cy3 \\
File 3  & wt &  mu & Cy3/Cy5 \\
File 4  & mu &  wt & Cy5/Cy3 \\\hline
\end{tabular}
\end{center}
The first required argument for the the \verb"RankProducts", \verb"data",
is the matrix (or data frame) containing the gene expression ratios.
The rows correspond to the genes, while each column corresponds to the
ratio of one chip. Since the input \verb"data" is already log-ratios,
the second required argument, \verb"cl", has to be a the vector of
length \verb"ncol(data)" containing only 1's.
Finally, a one-class RP analysis can be performed in order to identify
up- or down-regulated genes.
\begin{verbatim}
> cl=rep(1,4)
> RankProducts(data,cl, logged=TRUE, rand=123)
># or using the old syntax
># RP(data,cl, logged=TRUE, rand=123)
\end{verbatim}

It should be noticed that for the direct two-color design,
the RP will not distinguish the details of different designs
as done by \verb"limma" (for example see the special designs discussed in
section 9 of the \verb"limma" vignette including simple comparison and
dye swaps). Furthermore, the differences among biological or technical
replicates are not an issue in the RP analysis.


\section{Identification of differentially expressed metabolites -
LC/MS based metabolomics experiment}

As mentioned before, our methods (especially the RS) can be successfully used to
analyse metabolomics datasets. Testing biomarker selection methods on real data
is problematic. In fact, we usually do not know the "True" biomarkers a priori.
In order to cope with this problem, a publicly available UPLC-MS spike-in
metabolomics dataset has been used (Franceschi P., et al. (2012)). This dataset
has been obtained from twenty apples, ten of which have been spiked with known
compounds that naturally occur in apples. The raw have been pre-processed
allowing us to work with a data matrix containing the basepeaks intensities of
the identified metabolites. The dataset used in this example is contained this
package, but it can also be found in the \verb"BioMark" package
(Wehrens R., et al. (2011)). The dataset can be loaded as follows:
<<>>=
data(Apples)
@
The list of the features associated to the spiked-in biomarkers is contained in
the \verb"Biom" vector:

<<>>=
Biom
@
While the \verb"apples.cl" vector, contains the class labels of the samples.
As mentioned before, it is necessary to apply variance stabilization and
normalization to the data. This can be easily done with the \verb"vsn"
funciton.

\begin{verbatim}
> library(vsn)
> apples.data.exp<-vsn(apples.data)
> apples.data.vsn<-exprs(apples.data.exp)
\end{verbatim}

In order to put an higher premium on consistency between replicates, the RS
analysis is preferred here. 

<<>>=
RS.apples<-RankProducts(apples.data.vsn, apples.cl,
                        gene.names = rownames(apples.data.vsn), 
                        calculateProduct = FALSE, rand=123)
@

As shown in previous examples, the \verb"topGene" function can be used in order
to show the variables presenting a \verb"pfp" values smaller or equal than
$0.05$.
It is also possible to store the indexes of the selected variables in a vector
called \verb"selected".
<<>>=
topGene(RS.apples,cutoff = 0.05, method = "pfp",
        gene.names = rownames(apples.data.vsn))
selected <- which(RS.apples$pfp[,1]<= 0.05)
@

Comparing the variables selected by out method with the list of the real
biomarkers, it is easy to see that the method is able to identify 4 true
biomarkers out of 5, while finding only one False positive.
<<>>=
selected %in% Biom
@


\section{Advanced usage of the package}\label{sec::adv}
Since the RP method uses ranks instead of actual expression to
identify genes, the method can be generally used in many other cases beside
the simple two-class comparison.
Evaluating the RP is equivalent to calculating the geometric mean
of Rank. Replacing the \textit{product} with the
\textit{sum} (i.e. replacing the geometric mean with the average) leads to the
RS. Which is a statistic that is slightly less sensitive to outliers and
puts a higher premium on consistency between the ranks in various lists.
The RS analysis can be performed both by the \verb"RankProducts" and
\verb"RP.advance" functions (the latter can also cope with the multiple origins
case). In fact, setting \verb"calculateProduct=FALSE" the functions will perform
the RS instead of the RP.
In order to guarantee the backward compatibility with the previous version of
the package, the function \verb"RSadvance" has been kept allowing to perform
the RS analysis with the old syntax.

\subsection{Identify genes with consistent down- or up-regulation upon
drug-treatment}
\noindent The following example has been inspired by a question posted in the
BioC mailing-list.
Suppose that a comparative study (control against treated) has been performed
3 different times with a different dosage of the same drug.
The aim of such study is to investigate genes that are consistently up- or
down- regulated by the drug when compared to controls.
\noindent Regardless the difference in the drug dosage,
one will expect that the genes up-regulated (down-regulated) by the drug
will consistently show an high (low) rank in all studies.
Treating the 3 studies as 3 different origins (as shown in
section ~\ref{secaffy:multi}), the RS method can be successfully
performed.
The identified genes will be good candidates for consistent down- or
up-regulation under various conditions.


\subsection{Simultaneously identify genes up-regulated under one condition
and down-regulated under another condition}
\noindent  

Usually, in a microarray study that considers the responses
in two different conditions, two lists of genes are identified independently:
\begin{itemize}
\item{up-regulated genes under condition 1;}
\item{down-regulated genes under condition 2.}
\end{itemize}
Genes appearing in both lists are considered as the candidates.
The rank-based method can be used to identify the desired list of genes in a
single analysis. This is another advantage of the rank-based methods.\\
\noindent In fact, one can rank genes in ascending order under the first
condition and in descending order under the second one.
The two lists, can be considered together as in a 2-origin study in order to
identify the candidate genes.
Using the data \verb"arab", we now show a practical example.
Suppose that we want to verify the consistency of the datasets generated in two
different laboratories. Specifically, we want to look for genes that have been detected
as up-regulated in class 2 at laboratory 1, but down-regulated in class 2 at
laboratory 2.
This can be achieved switching class labels for laboratory 2.
Thus, for laboratory 2 the hypothetical class1 represents the real class2.
<<>>=
arab.cl2 <- arab.cl
arab.cl2[arab.cl==0 &arab.origin==2] <- 1
arab.cl2[arab.cl==1 &arab.origin==2] <- 0
arab.cl2
@
\noindent If the measurements in the two laboratories are consistent, the genes 
will have very different ranks in the two origins.
The RS analysis is preferred here, in order to emphasise consistency
for the candidate genes. In the following example, we used only the first 500
genes to perform a fast analysis.
<<>>=
Rsum.adv.out <- RP.advance(arab,arab.cl2,arab.origin,calculateProduct=FALSE,
logged=TRUE,gene.names=arab.gnames,rand=123)
# also the old syntax can be used
#Rsum.adv.out <- RSadvance(arab,arab.cl2,arab.origin,
#logged=TRUE,gene.names=arab.gnames,rand=123)
topGene(Rsum.adv.out,cutoff=0.05,gene.names=arab.gnames)
@

\noindent No gene was found to be differentially expressed (FDR=0.05),
indicating a relative good consistency of the experiments conducted by the two
laboratories.
Looking at the top 10 genes in the lists, it easy to realise that they are
indeed very similar.
<<>>=
topGene(Rsum.adv.out,num.gene=10,gene.names=arab.gnames)
@
\begin{center}
<<fig=TRUE,echo=TRUE>>=
plotRP(Rsum.adv.out,cutoff=0.05)
@
\end{center}


\noindent The abnormal patterns shown in figure 3 (compared with figure 1)
reveal a meaningless identification. However, due to its
stability, the RP statistics is still able to identify some genes.
<<>>=
RP.adv.out <- RP.advance(arab,arab.cl2,arab.origin,calculateProduct=TRUE,
logged=TRUE,gene.names=arab.gnames,rand=123)
#  also the old syntax can be used
#RP.adv.out <- RPadvance(arab,arab.cl2,arab.origin,
#logged=TRUE,gene.names=arab.gnames,rand=123)
@

<<>>=
topGene(RP.adv.out,cutoff=0.05,gene.names=arab.gnames)
@
\noindent Nevertheless, the log fold-changes show that these findings are
not significant.
This is also confirmed by the comparison of the ranks under 13
pairings for one gene (first 9 in laboratory 1, next 4 in from laboratory 2).
<<>>=
RP.adv.out$Orirank[[1]][344,]
@

\section{Changes introduced in the last version}
In this section changes introduced in the new version of the package are briefly
summarized.

\subsection{Application to unpaired datasets}
Let T and C stand for two experimental conditions (e.g. treatment versus
control),
while $n_{T}$ and $n_{C}$ are the number of replicates in the two conditions.
In the old package the RP (RS) analysis for the unpaired case
was performed according the ad hoc procedure decribed here:
\begin{enumerate}
\item all the possible $K = n_{T} \times n_{C}$ pair-wise comparisons are
considered and $K$ lists of ratios FC are evaluated;
\item the ratios are ranked within each comparison ($r_{gi}$ is the rank of the
$g$th gene in the $i$th comparison);
\item the RP for each gene is determined as
$RP_{g}=(\prod_{i}{r_{gi}})^{1/K}$;
\item alternatively, the RS is determined as
$RP_{g}=({\sum_{i}{r_{gi}}})/K$.
\end{enumerate}

Apparently, such approach leads to an increase of the False Discovery Rate. 
\noindent In the new package, a new and more principled method has been
developed. This method is described below:

\begin{enumerate}
\item the number of pairs ($npairs$) is defined as the number of the samples
in the smallest class;
\item if not defined by the user, the number of Random Pairings that will
be generated ($n_{rp}$) is set to $npairs \times npairs$
(if this number is not odd $n_{rp}=npairs \times npairs + 1$);
\item Sampling from the original dataset, $n_{rp}$ new datasets of dimension
$(ngenes \times npairs)$ are generated;
\item the RP (RS) is evaluated $n_{rp}$ times considering each
Random Pairing as a paired experiment; 
\item per each gene, the final $RP_{g}$ (or $RS_{g}$) is estimated as the median
of the $n_{rp}$ values evaluated in the step before.
\end{enumerate}

\subsection{Evaluation of the p-values for the RP}
Instead of the permutation approach used in the old version of the package,
the pvalues for the RP are now evaluated through the fast algorithm
described in \textit{Heskes et al. 2014}, which allows a very accurate
approximation of the p-values in a computationally fast manner. This approach
significantly speeds up the RP analysis. When considering a typical paired
dataset ($N = 1000$ and $K = 10$), the computation time is now reduced by a
factor of $\sim 500$, when compared with the analysis performed with the
previous approach (using $10,000$ permutations).

\subsection{Evaluation of the p-values for the RS}
Also in this case, the permutation approach was abbandoned. We have developed
a novel method able to compute the exact p-values for the RS in a fast manner.
This method is straightforward and based on a very simple analogy.
It is easy to understand that, under the null hypothesis,
the probability distribution of the RS, in an experiment with $N$ features and
$K$ replicates, is exactly the same as the probability distribution of the sum
of the outcomes obtained by rolling $K$ dice with $N$ faces
(\url{http://mathworld.wolfram.com/Dice.html}).
The numerical error generated by our fast algorithm increases with the the size
of the dataset.
For this reason we developed a more
accurate implementation of the same algoritm, which is able to cope with
extremely large datasets. Unfortunately, this leads to an increase of the
computational time. When the size of the dataset is such that the use of the
accurate implementation is needed and the time needed to
evaluate the exact p-values becomes unacceptable, the new package computes the
exact p-values for the smallest RS values for \verb"tail.time" minutes.
The rest of the p-values are approximated with the following gaussian:
\begin{equation} \label{RP}
\mathcal{N} (\mu =  \dfrac{K(N+1)}{2} ,\sigma^{2} = \dfrac{K(N^{2}-1)}{12})
\end{equation}
It should be noticed that with such large datasets, this approximation is
extremely accurate. Nevertheless, in this case the fuction shows the
highest p-values exactly computed, so the user can play with the tail.time
parameter if not satisfied. In most of the cases, this approach significantly
speeds up the RS analysis. When considering a typical paired
dataset ($N = 1000$ and $K = 10$), the computation time is now reduced by a
factor of $\sim 1200$, when compared with the analysis performed with the
previous approach (using $10,000$ permutations).


\section*{Reference}

\begin{list}{}{\itemindent=-1.0cm}

\item Breitling, R., Armengaud, P., Amtmann, A., and Herzyk, P.(2004)
Rank Products:
A simple, yet powerful, new method to detect differentially regulated genes in
replicated microarray experiments. {\it FEBS Letter}, 57383-92

\item Nemhauser JL, Mockler TC, Chory J. (2004)  Interdependency of
brassinosteroid and auxin signaling in Arabidopsis. {\it PLoS Biol.}  21460

\item  http://arabidopsis.org/info/expression/ATGenExpress.jsp

\item Heskes, T.,Eisinga, R., and Breitling, R.(2014) 
A fast algorithm for
determining bounds and accurate approximate p-values of the rank product
statistic for replicate experiments. {\it BMC bioinformatics}, 15.1: 367.

\item Franceschi, P., Masuero, D., Vrhovsek, U., Mattivi, F. and Wehrens R.
(2012) A benchmark spike-in data set for biomarker identification in
metabolomics. {\it Journal of chemometrics}, 26(1-2):16-24.

\item Wehrens, R., Franceschi, P., Vrhovsek, U. and Mattivi, F.
Stability-based biomarker selection. {\it Analytica chimica acta}, 705(1):15-23.

\end{list}


\end{document}