%\VignetteIndexEntry{GeneSelector Manual}
%\VignetteKeywords{Expression Analysis}
%\VignetteDepends{GeneSelector}
%\VignettePackage{GeneSelector}



\input{preamble.tex}
\title{\texttt{GeneSelector} package vignette}
\author{Martin Slawski$^{1,2}$ \footnote{\url{ms@cs.uni-sb.de}} $\qquad$
        Anne-Laure Boulesteix$^{1,2,3}$ \footnote{\url{boulesteix@ibe.med.uni-muenchen.de}}}

\begin{document}
\date{\normalsize{$^1$ Sylvia Lawry Centre, Munich, Germany \\
                  $^2$  Institute for Medical Informatics, Biometry and
                  Epidemiology, Ludwig-Maximilians-University Munich,
                  Germany \\
                  $^3$ Department of Statistics, Ludwig-Maximilians-University Munich, Germany}}
\maketitle


%\vspace{11pt}

\begin{abstract}
This is the vignette of the Bioconductor add-on package
\texttt{GeneSelector}, which provides methods to quantitatively assess
the variability among multiple gene rankings obtained from
altered datasets or from several ranking procedures. The resulting multiplicity
problem is addressed by functionality for rank aggregation.
\end{abstract}


<<preliminaries,echo=FALSE,results=hide>>=
library(GeneSelector)
@

\section{Introduction}

An important aspect of microarray data analysis is
the detection of genes that are differentially expressed, e.g. in different
experimental conditions or in individuals with different phenotypes.
The results of microarray studies are usually the starting point
for further, more expensive and time-consuming experiments, which involve only a small number
of candidate genes \citep{aer2006}. The set of candidate genes is typically determined by computing
a two-sided test statistic for each gene $j=1,\ldots,p$ and ordering the genes decreasingly according to the absolute value
of the statistic. This yields an ordered list $\bm{l} = (l_m, \; \; m=1,\ldots,p)$ and ranks $\bm{r} = (r_j, \; \; j=1,\ldots,p)$ defined by
$r_j = m \; \Leftrightarrow \; l_m = j, \; \; j,m=1,\ldots,p$. The genes at the top of the list are
displayed in almost all microarray-related biomedical publications, often considered an unequivocal and definitive result.
Critical voices have pointed out that this procedure might yield false research findings \citep{Ioannidis:noisy}, since it
ignores the variability of the obtained ordered lists \citep{Ein2006,Qiu}. The \texttt{GeneSelector} package quantifies this
variability by mimicking changed data situations via resampling and related strategies, and then comparing the results to those
obtained with the reference dataset. For this purpose, \texttt{GeneSelector} assembles several stability measures for rank data.
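To make the list/rank duality concrete: in R, $\bm{l}$ and $\bm{r}$ correspond
to the output of \texttt{order} and \texttt{rank}, respectively. A minimal
sketch for a hypothetical vector of test statistics (not run):

<<eval=FALSE>>=
## Not run: list/rank duality for a hypothetical statistic vector
stat <- c(0.2, -2.1, 1.5, -0.3)
l <- order(abs(stat), decreasing = TRUE)  # ordered list: gene indices
r <- rank(-abs(stat))                     # ranks: r[j] = m  <=>  l[m] = j
l
r
r[l]                                      # recovers 1, 2, ..., p
@
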
A second source of variability, which, to our knowledge, has not been addressed in the literature, is the multiplicity of test statistics (= ranking criteria) proposed for gene expression data to cope with the 'small $n$, large $p$' situation, where $n$ denotes the number of replicates. \texttt{GeneSelector} implements
a collection of fourteen such ranking criteria, displayed in Table \ref{tab:statistics}, and hence enables the user to explore
this source of variability. Using several ranking criteria instead of only one
may additionally be seen as a sensitivity analysis, since most criteria rely on
idealized, hard-to-check assumptions. Each ranking criterion produces its own
result, and the user may be confronted with the dilemma of finding exactly
one result that unifies all results as well as possible, hopefully
giving rise to an improved and more stable ranking and, in turn, to a set of candidate
genes with as few false positives as possible. In this spirit, our package offers a
\texttt{GeneSelector} function as well as several methods for rank aggregation.

\scriptsize
\begin{table}[h!] %\label{methodsoverview}
\begin{tabular}{|l|l|l|l|}
\hline
\textbf{Method} & \textbf{Function name} & \textbf{Package} & \textbf{Reference} \\
\hline
Foldchange & \texttt{RankingFC} &    &  \\
$t$-statistic &  \texttt{RankingTstat}&  &  \\
Welch's $t$ statistic & \texttt{RankingWelchT} &  & \\
Bayesian $t$-statistic (1) & \texttt{RankingBaldiLong} &   &
\citet{BaldiLong} \\
Bayesian $t$-statistic (2) & \texttt{RankingFoxDimmic} &     &
\citet{FoxDimmic} \\
Shrinkage $t$-statistic & \texttt{RankingShrinkageT}      &   & \citet{st} \\
Soft-thresholded $t$-statistic & \texttt{RankingSoftthresholdT} &  & \citet{softt} \\
Parametric empirical Bayes & \texttt{RankingLimma} & \texttt{limma} &
\citet{Smyth} \\
$B$-statistic & \texttt{RankingBstat}      &  \texttt{sma} & \citet{Loennstedt} \\
Nonparametric empirical Bayes  & \texttt{RankingEbam} & \texttt{siggenes} & \citet{Efron1}\\
SAM & \texttt{RankingSam}      &  \texttt{samr} &  \citet{Tusher} \\
Wilcoxon statistic & \texttt{RankingWilcoxon} & & \\
Wilcoxon statistic, empirical Bayes & \texttt{RankingWilcEbam} & \texttt{siggenes} & \citet{Efron2} \\
Permutation test & \texttt{RankingPermutation} & \texttt{multtest}             &  \\
\hline
\end{tabular}
\caption{Overview of the ranking procedures in \texttt{GeneSelector}. If
  no package is given, the respective procedure is implemented within
  \texttt{GeneSelector} and \emph{not} imported from a foreign
  package.}\label{tab:statistics}

\end{table}

\normalsize

\section{Illustration}

\subsection{Description of the data set}

We demonstrate the functionalities of \texttt{GeneSelector} in the classical
setting of two independent samples, each of size 10. We simulate a gene expression matrix \texttt{x} containing $2,000$
genes in the following manner.
\begin{itemize}
\item Gene expression intensities are drawn from a multivariate normal distribution
      with zero mean vector and a covariance matrix that has itself
      been drawn randomly from an inverse Wishart
      distribution.
\item The first 40 genes are differentially expressed. The differences in the means between the two classes are drawn independently
      from a normal distribution with variance 0.9 (a simulation sketch follows below).
\end{itemize}
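
For readers who wish to reproduce a comparable setup, the following sketch
outlines the simulation scheme. It is not run here: \texttt{mvrnorm} from the
\texttt{MASS} package is assumed, the inverse Wishart draw for the covariance
matrix is replaced by a placeholder, and the shipped \texttt{toydata} object
was not generated by exactly this code.

<<eval=FALSE>>=
## Not run: sketch of a comparable simulation setup. The identity
## matrix is only a placeholder for the inverse Wishart draw.
library(MASS)
p <- 2000; n <- 20
Sigma <- diag(p)
x0 <- t(mvrnorm(n, mu = rep(0, p), Sigma = Sigma))  # genes x samples
y0 <- rep(1:2, each = n/2)
delta <- rnorm(40, sd = sqrt(0.9))  # mean differences with variance 0.9
x0[1:40, y0 == 2] <- x0[1:40, y0 == 2] + delta
@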


We access the data using the lines:
<<>>=
data(toydata)
y <- as.numeric(toydata[1,])
x <- as.matrix(toydata[-1,])
dim(x)
table(y)
@

Knowing that the first 40 genes are differentially expressed, we produce
boxplots of the gene expression intensities of the first four genes:

<<results=hide, fig=TRUE>>=
par(mfrow=c(2,2))
for(i in 1:4) boxplot(x[i,]~y, main=paste("Gene", i))
@



\subsection{Rankings}

We now perform a ranking using the ordinary $t$-statistic.

<<results=hide>>=
ordT <- RankingTstat(x, y, type="unpaired")
@

The resulting object is an instance of the class \texttt{GeneRanking}.\\
To get basic information, we use the commands:

<<>>=

getSlots("GeneRanking")
str(ordT)

show(ordT)
toplist(ordT)

@

The last command yields the top-ranking genes according to the respective procedure.

\subsection{Altered data sets}

In order to inspect the stability of the obtained ranking with respect to changes
in the data, we use resampling techniques implemented in
\texttt{GeneSelector}. The following command produces jackknifed datasets,
i.e. datasets resulting from successively removing exactly one sample from the
complete sample:

<<>>=
loo <- GenerateFoldMatrix(y = y, k=1)
show(loo)
@

We plug this into the method \texttt{RepeatRanking}, which recomputes
the ranking 20 times, once for each removed observation:

<<>>=

loor_ordT <- RepeatRanking(ordT, loo)

@

The object \texttt{loo} may additionally be used in the following manner:

<<>>=

ex1r_ordT <- RepeatRanking(ordT, loo, scheme = "labelexchange")

@

The argument \texttt{scheme = "labelexchange"} specifies that, instead of being
left out, each observation is in turn assigned the opposite class label.
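
To see concretely what this scheme does, it can be written out by hand; a small
sketch for the first observation (not run, assuming the class labels are coded
as 1 and 2, cf. \texttt{table(y)} above):

<<eval=FALSE>>=
## Not run: "labelexchange" for fold 1, written out by hand;
## the first observation keeps its data but switches its class.
y.flip <- y
y.flip[1] <- ifelse(y[1] == 1, 2, 1)
table(y, y.flip)
@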
We may also use the bootstrap, e.g.

<<>>=

boot <- GenerateBootMatrix(y = y, maxties=3, minclassize=5, repl=30)
show(boot)
boot_ordT <- RepeatRanking(ordT, boot)

@
$\ldots$ or add a small amount of noise to the observed expression intensities,
where \texttt{genewise = TRUE} requests gene-specific noise variances and
\texttt{factor} scales their magnitude:

<<>>=
noise_ordT <- RepeatRanking(ordT, varlist=list(genewise=TRUE, factor=1/10))
@

To get a toplist that tabulates how top list positions are distributed over
all repeated rankings, we use:
<<>>=
toplist(loor_ordT, show=FALSE)
@

As an exploratory tool to examine the difference in rankings between original
and perturbed data sets, a \texttt{plot} command is available.


\begin{figure}[h!]
\centering
<<fig=TRUE>>=
par(mfrow=c(2,2))
plot(loor_ordT, col="blue", pch=".", cex=2.5, main = "jackknife")
plot(ex1r_ordT, col="blue", pch=".", cex=2.5, main = "label exchange")
plot(boot_ordT, col="blue", pch=".", cex=2.5, main = "bootstrap")
plot(noise_ordT, frac=1/10, col="blue", pch=".", cex=2.5, main = "noise")
@
\caption{Scatterplots of rankings from altered datasets vs. rankings
from the original dataset.}\label{fig:scatterplot}
\end{figure}
From Figure \ref{fig:scatterplot}, it is obvious that
variability increases with list position. Moreover, the figure shows that
variability depends on the method used to generate altered datasets. In
this example, the bootstrapped rankings are scattered more widely around the
diagonal than the jackknifed rankings.

\subsection{Stability measures}

As an alternative to visual methods, one can compute the stability measures
tabulated in Table \ref{tab:stab}. Let $\bm{\sigma}, \bm{\sigma}'$ be either two rankings $\bm{r}, \bm{r}'$ or
two lists $\bm{l}, \bm{l}'$. A function $s$ is called \emph{pairwise stability measure} if 
\begin{itemize}
\item[(i)] $s(\bm{\sigma}, \bm{\sigma}') = s(\bm{\sigma}', \bm{\sigma})$,
\item[(ii)] $s(\bm{\sigma}, \bm{\sigma}') \leq s(\bm{\sigma}, \bm{\sigma}) =
  s(\bm{\sigma}', \bm{\sigma}') = 1$.
\end{itemize}
In the current version of \texttt{GeneSelector}, there are two groups of pairwise
stability measures: the first group is set-based, counting/summing up overlaps
of lists, while the second one computes distances. Pairwise stability measures
are particularly appropriate in the presence of a reference list/ranking. In
the example given in the previous subsection, the natural choice for the
reference is the ranking obtained with the original dataset. If one wants to
compute a stability indicator for several lists without a reference, e.g. when
comparing the output of different ranking criteria, we introduce the following
notion. Let $\bm{\sigma}_b, \; \; b=1,\ldots,B$ be a sequence of rankings or lists. A function $s$ is
called \emph{multi-input stability measure} if 
\begin{itemize}
\item[(i)] $s(\bm{\sigma}_1, \ldots, \bm{\sigma}_{B}) =
  s(\bm{\sigma}_{\pi_1}, \ldots, \bm{\sigma}_{\pi_B})$ for any permutation
  $\bm{\pi} \; \text{of} \; \{1,\ldots,B \}$,
\item[(ii)] $s(\bm{\sigma}_1, \ldots, \bm{\sigma}_B) \leq s(\bm{\sigma}_1,
  \ldots, \bm{\sigma}_1) = \ldots = s(\bm{\sigma}_B,
  \ldots, \bm{\sigma}_B) = 1$.
 \end{itemize}
As shown in Table \ref{tab:stab}, an additional component of the stability
measures is a weighting scheme that penalizes variability at the top of the list
more severely than at the bottom, since only the top is of practical
relevance.

\begin{table}[h!] %\label{methodsoverview}
\begin{tabular}{|l|l|l|}
\hline
\textbf{Name} & \textbf{Definition} & \textbf{Reference} \\
\hline
& & \\
Intersection count $\dagger$ & $s_{\cap}(\bm{l}, \bm{l}') = s_{\cap}(\bm{l}_{[k]}, \bm{l}'_{[k]}) = \frac{\sum_{1 \leq m,m' \leq k} I(l_m =
  l_{m'}')}{k}, \; \; k=1,\ldots,p.$ & \citet{Qiu} \\
& &  \\
Overlap score $\dagger$ & $s_{O\cap}(\bm{l}, \bm{l}') =  \frac{\sum_{k=1}^p w_k
   s_{\cap} (\bm{l}_{[k]},
  \bm{l}'_{[k]})}{\sum_{k=1}^p w_k}$ & 
\citet{Yan2006} \\
& & \\
$\ell^1$ $\ddagger$ & $s_{\ell^1}(\bm{r}, \bm{r}') = 1 \; - \; \frac{\sum_{j=1}^p w_j |r_j -
  r_j'|}{\sum_{j=1}^p w_{(j)} |j -  (p-j+1)|}$ & \citet{Jur2008} \\
& & \\
$\ell^2$ $\ddagger$ & $s_{\ell^2}(\bm{r}, \bm{r}') = 1 \; - \; \frac{\sum_{j=1}^p w_j (r_j -
  r_j')^2}{\sum_{j=1}^p w_{(j)} |j -  (p-j+1)|}$ & \citet{Jur2008} \\
& & \\
Spearman's $\rho$ $\ddagger$ & $s_{\rho}(\bm{r}, \bm{r}') = \frac{\sum_{j=1}^p w_j (r_j - (p+1)/2)(r_j' -
  (p+1)/2)}{\left(\sum_{j=1}^p w_j (r_j - (p+1)/2)^2 \right)^{1/2} \left( \sum_{j=1}^p w_j (r_j' - (p+1)/2)^2 \right)^{1/2}}$ & \citet{Jur2008} \\
& & \\
Kendall's $\tau$ $\ddagger$ & $s_{\tau}(\bm{r}, \bm{r}') =  1 \; - \; \frac{\sum_{1 \leq j < m \leq p} w_j \;
w_m \; I( [(r_j - r_m)(r'_j - r'_m)] < 0)}{\sum_{1 \leq j < m \leq p} w_j \;
w_m}$ & \citet{Dec2006} \\
& & \\
Union count $\Delta$ & $s_{\cup}(\bm{l}_{1 \; [k]},\ldots,\bm{l}_{B \; [k]}) =   1  - \frac{|U_{[k]}| -
  k}{\min\{ Bk, p\} - k}$ & \citet{Jur2008} \\
& & \\
Union score $\Delta$ & $s_{O\cup}(\bm{l}_1,\ldots,\bm{l}_{B}) = \frac{\sum_{k=1}^p w_k
  s_{\cup}(\bm{l}_{1 \; [k]},\ldots,\bm{l}_{B \; [k]})}{\sum_{k=1}^p w_k}$ &   \\  
\hline
\end{tabular}
\caption{Overview of the stability measures in
  \texttt{GeneSelector}. Notation: $\bm{l}_{[k]} = (l_m, \; 1 \leq m \leq
  k)$ denotes the top-$k$ list of $\bm{l}$; the $w_j$'s are (fixed) weights;
  the subscript in brackets indicates ordering, i.e. $w_{(1)} \leq \ldots
  \leq w_{(p)}$; $|U_{[k]}|$
  denotes the size of the union of all top-$k$ lists to be compared. Legend:
  $\dagger$ - implemented in \texttt{GetStabilityOverlap}; $\ddagger$ - implemented in
  \texttt{GetStabilityDistance}; $\Delta$ - implemented in \texttt{GetStabilityUnion}.}\label{tab:stab}

\end{table}
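
To fix ideas, the set-based measures of Table \ref{tab:stab} can be computed by
hand. The following sketch (not run) evaluates the intersection count, the
overlap score and the union count for two short hypothetical lists, using the
linear weights $w_k = 1/k$ also employed below:

<<eval=FALSE>>=
## Not run: set-based stability measures by hand,
## for two hypothetical ordered lists over p = 6 genes.
l1 <- c(3, 1, 5, 2, 6, 4)
l2 <- c(1, 3, 2, 5, 4, 6)
intersection.count <- function(l, lprime, k)
    length(intersect(l[1:k], lprime[1:k])) / k
sk <- sapply(1:6, function(k) intersection.count(l1, l2, k))
sk                    # normalized intersection counts, k = 1,...,6
w <- 1/(1:6)          # linear decay
sum(w * sk) / sum(w)  # overlap score
## union count for B = 2 lists, truncated at k = 3
k <- 3; B <- 2; p <- 6
U <- union(l1[1:k], l2[1:k])
1 - (length(U) - k) / (min(B * k, p) - k)
@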

As an illustration, we apply \texttt{GetStabilityOverlap} to the rankings
obtained after swapping class labels, which seems to perturb the
original ranking considerably, as indicated by Figure \ref{fig:scatterplot}. Concerning the
sequence of weights, we choose $w_m = 1/m, \; \; m=1,\ldots,p$, which is
realized by using the option \texttt{decay = "linear"}.

<<>>=

stab_ex1r_ordT <- GetStabilityOverlap(ex1r_ordT, scheme = "original",
decay="linear")
show(stab_ex1r_ordT)

@

\texttt{GetStabilityOverlap} computes both normalized intersection counts and
overlap scores when truncating at list position $k, \; \;
k=1,\ldots,p$. To evaluate these scores at position $k = 10$, we use the lines:

<<>>=

summary(stab_ex1r_ordT, measure = "intersection", display = "all", position = 10)
summary(stab_ex1r_ordT, measure = "overlapscore", display = "all", position = 10)

@

The output shows that the overlap between the reference and the alternative top-ten lists
ranges from 60 to 90 percent. Overlap score and intersection count disagree
visibly, because the overlap score is computed with
weights. Although Figure \ref{fig:scatterplot} suggests some discrepancy between the reference
and the alternative lists, the output shows that the fraction of accordance is
much larger than expected in the no-information case, i.e. for mutually
unrelated lists: for two unrelated top-ten lists drawn from $p = 2\,000$ genes,
the expected normalized intersection count is only $k/p = 0.5\%$. To see how the two scores vary with
increasing list position (on average), we invoke the predefined
\texttt{plot} routine:

\begin{figure}[h!]
\centering
<<fig=TRUE>>=
plot(stab_ex1r_ordT, frac = 1, mode = "mean")
@
\caption{Visualization of intersection count and overlap score.}
\label{fig:overlap}
\end{figure}

Next, let us investigate which sample is most influential in the sense that
its removal perturbs the reference ranking most. For this purpose, we apply
\texttt{GetStabilityDistance} with the option \texttt{measure = "spearman"} to
the jackknifed rankings.

<<>>=

stab_loo_ordT <- GetStabilityDistance(loor_ordT, scheme = "original",
                                      measure = "spearman", decay = "linear")
show(stab_loo_ordT)
summary(stab_loo_ordT, display = "all")

@

From the output we conclude that the fifth sample is by far the most
influential one.  

\subsection{Aggregating multiple ranking criteria}

In addition to the ordinary $t$-statistic, we compute five additional rankings
(cf. Table \ref{tab:statistics}):

<<results=hide>>=
BaldiLongT <- RankingBaldiLong(x, y, type="unpaired")
FoxDimmicT <- RankingFoxDimmic(x, y, type="unpaired")
sam <- RankingSam(x, y, type="unpaired")
wilcox <- RankingWilcoxon(x, y, type="unpaired")
wilcoxeb <- RankingWilcEbam(x, y, type="unpaired")
@

Again, we first assess variability visually. The method
\texttt{HeatmapRankings} produces a heatmap from all obtained rankings,
clustering genes and criteria simultaneously. We restrict our attention to
the first forty, differentially expressed genes (\texttt{ind = 1:40}).

<<fig = TRUE>>=
Merged <- MergeMethods(list(ordT, BaldiLongT, FoxDimmicT, sam, wilcox, wilcoxeb))
HeatmapRankings(Merged, ind=1:40)

@ 

To cope with the multiplicity problem, we exploit the functionality for
rank aggregation in the \texttt{GeneSelector} package. A simple approach
is to take the average of the observed ranks, which is, among other schemes,
implemented in the method \texttt{AggregateSimple}. As a more sophisticated approach,
we use the Markov chain model proposed in \citet{Dec2006} and implemented in
the method \texttt{AggregateMC}. Lastly, we use
the \texttt{GeneSelector} function, which aims at finding genes falling
consistently, i.e. for all ranking criteria, below a pre-specified threshold,
here chosen as 50.
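
For intuition, simple mean-rank aggregation can be written out by hand; a toy
sketch (not run) on a hypothetical rank matrix whose rows correspond to ranking
criteria and whose columns correspond to genes:

<<eval=FALSE>>=
## Not run: mean-rank aggregation by hand; AggregateSimple with
## measure = "mean" follows the same principle.
R <- rbind(c(1, 2, 3, 4),
           c(2, 1, 4, 3),
           c(1, 3, 2, 4))
meanranks <- colMeans(R)
order(meanranks)  # aggregated ordering of the four genes
@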

<<>>=

AggMean <- AggregateSimple(Merged, measure = "mean")
AggMC <- AggregateMC(Merged, type = "MCT", maxrank = 100)
GeneSel <- GeneSelector(list(ordT, BaldiLongT, FoxDimmicT, sam, wilcox,
wilcoxeb), threshold="user", maxrank=50)
show(GeneSel)
sel <- sum(slot(GeneSel, "selected"))

cbind(mean = toplist(AggMean, top = sel, show = FALSE),
      MC = toplist(AggMC, top = sel, show = FALSE),
      GeneSelector = toplist(GeneSel, top = sel, show = FALSE)[,1])

@
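
Since the first 40 genes are the truly differentially expressed ones, the
number of false positives among the selected genes can be checked directly; a
small sketch (not run, assuming, as above, that the first column returned by
\texttt{toplist} holds the gene indices):

<<eval=FALSE>>=
## Not run: false positives = selected genes with index > 40,
## since only genes 1 to 40 are truly differentially expressed.
sum(toplist(GeneSel, top = sel, show = FALSE)[,1] > 40)
@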

Here, we have first determined the number of genes passing the \texttt{GeneSelector}
filter. In total, 29 genes manage to fall below rank 50 in all six
rankings. Although \texttt{GeneSelector} attempts to minimize the number
of false positives, one still ends up with 14 false positives among the
29 selected genes. In this regard, the Markov chain approach is superior,
because it selects only 11 false positives. Simple averaging seems to
perform slightly worse, putting the false positive gene \texttt{820} at
position 13; in contrast, the first false positive of the
\texttt{GeneSelector} occurs at position 16. A nice feature we want to present
at the end is the \texttt{plot} routine for the class
\texttt{GeneSelector}, which provides a detailed gene-specific overview:


<<fig=TRUE>>=

plot(GeneSel, which = 1)
    
@

Interestingly, for the first, differentially expressed gene, simple approaches
such as the ordinary $t$-statistic and the Wilcoxon statistic perform well, while the
more sophisticated statistics, which depend on hyperparameters, fail to detect
differential expression.

\bibliographystyle{plainnat}
\bibliography{literatur}

\end{document}