%\VignetteIndexEntry{GeneSelector Manual}
%\VignetteKeywords{Expression Analysis}
%\VignetteDepends{GeneSelector}
%\VignettePackage{GeneSelector}
\input{preamble.tex}

\title{\texttt{GeneSelector} package vignette}
\author{Martin Slawski$^{1,2}$ \footnote{\url{ms@cs.uni-sb.de}} $\qquad$
Anne-Laure Boulesteix$^{1,2,3}$ \footnote{\url{boulesteix@ibe.med.uni-muenchen.de}}}

\begin{document}

\date{\normalsize{$^1$ Sylvia Lawry Centre, Munich, Germany \\
$^2$ Institute for Medical Informatics, Biometry and Epidemiology, Ludwig-Maximilians-University Munich, Germany \\
$^3$ Department of Statistics, Ludwig-Maximilians-University Munich, Germany}}

\maketitle
%\vspace{11pt}

\begin{abstract}
This is the vignette of the Bioconductor add-on package \texttt{GeneSelector}, which contains methods to quantitatively assess the variability among multiple gene rankings obtained from altered datasets or from several ranking procedures. The resulting multiplicity problem is addressed by functionality for rank aggregation.
\end{abstract}

<<>>=
library(GeneSelector)
@

\section{Introduction}

An important aspect of microarray data analysis is the detection of genes that are differentially expressed, e.g. in different experimental conditions or in individuals with different phenotypes. The results of microarray studies are usually the starting point for further, more expensive and time-consuming experiments, which involve only a small number of candidate genes \citep{aer2006}. The set of candidate genes is typically determined by computing a two-sided test statistic for each gene $j=1,\ldots,p$ and ordering the genes decreasingly according to the absolute value of the statistic. This yields an ordered list $\bm{l} = (l_m, \; \; m=1,\ldots,p)$ and ranks $\bm{r} = (r_j, \; \; j=1,\ldots,p)$ defined by $r_j = m \; \Leftrightarrow \; l_m = j, \; \; j,m=1,\ldots,p$.
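The correspondence between the ordered list $\bm{l}$ and the rank vector $\bm{r}$ can be sketched in a few lines of base R; the statistic vector \texttt{tstat} below is a made-up example for illustration only:

```r
# Hypothetical two-sided statistics for p = 5 genes
tstat <- c(-0.3, 2.1, -1.7, 0.2, 0.9)

# Ordered list l: gene indices sorted by decreasing |statistic|
l <- order(abs(tstat), decreasing = TRUE)

# Ranks r, defined by r_j = m  <=>  l_m = j
r <- integer(length(tstat))
r[l] <- seq_along(l)

l  # 2 3 5 1 4
r  # 4 1 2 5 3
```

Gene 2 has the largest absolute statistic, so it heads the list ($l_1 = 2$) and receives rank $r_2 = 1$.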
The genes at the top of the list are displayed in almost all microarray-related biomedical publications, often presented as an unequivocal and definitive result. Critical voices have pointed out that this procedure might yield false research findings \citep{Ioannidis:noisy}, since it ignores the variability of the obtained ordered lists \citep{Ein2006, Qiu}. The \texttt{GeneSelector} package tries to quantify this variability by mimicking changed data situations via resampling and related strategies and then comparing the results to those obtained with the original dataset. For this purpose, \texttt{GeneSelector} assembles several stability measures for rank data.\\
A second source of variability, which, to our knowledge, has not been addressed in the literature, is the multiplicity of test statistics (= ranking criteria) proposed for gene expression data with the aim of coping with the `small $n$, large $p$' situation, with $n$ denoting the number of replicates. \texttt{GeneSelector} implements a collection of fourteen such ranking criteria, displayed in Table \ref{tab:statistics}, and hence enables the user to explore this source of variability. Using several ranking criteria instead of only one may additionally be seen as a sensitivity analysis, since most criteria rely on idealized, hard-to-check assumptions. Each ranking criterion produces its own result, so the user may be confronted with the dilemma of finding exactly one result that unifies all results as well as possible, hopefully giving rise to an improved and more stable ranking and, in turn, to a set of candidate genes with as few false positives as possible. In this spirit, our package offers a \texttt{GeneSelector} function as well as several methods for rank aggregation.
\scriptsize
\begin{table}[h!]
\begin{tabular}{|l|l|l|l|}
\hline
\textbf{Method} & \textbf{Function name} & \textbf{Package} & \textbf{Reference} \\
\hline
Foldchange & \texttt{RankingFC} & & \\
$t$-statistic & \texttt{RankingTstat} & & \\
Welch's $t$-statistic & \texttt{RankingWelchT} & & \\
Bayesian $t$-statistic (1) & \texttt{RankingBaldiLong} & & \citet{BaldiLong} \\
Bayesian $t$-statistic (2) & \texttt{RankingFoxDimmic} & & \citet{FoxDimmic} \\
Shrinkage $t$-statistic & \texttt{RankingShrinkageT} & & \citet{st} \\
Soft-thresholded $t$-statistic & \texttt{RankingSoftthresholdT} & & \citet{softt} \\
Parametric empirical Bayes & \texttt{RankingLimma} & \texttt{limma} & \citet{Smyth} \\
$B$-statistic & \texttt{RankingBstat} & \texttt{sma} & \citet{Loennstedt} \\
Nonparametric empirical Bayes & \texttt{RankingEbam} & \texttt{siggenes} & \citet{Efron1} \\
SAM & \texttt{RankingSam} & \texttt{samr} & \citet{Tusher} \\
Wilcoxon statistic & \texttt{RankingWilcoxon} & & \\
Wilcoxon statistic, empirical Bayes & \texttt{RankingWilcEbam} & \texttt{siggenes} & \citet{Efron2} \\
Permutation test & \texttt{RankingPermutation} & \texttt{multtest} & \\
\hline
\end{tabular}
\caption{Overview of the ranking procedures in \texttt{GeneSelector}. If no package is given, the respective procedure is \emph{not} imported from a foreign package.}
\label{tab:statistics}
\end{table}
\normalsize

\section{Illustration}

\subsection{Description of the data set}

We demonstrate the functionalities of \texttt{GeneSelector} in the classical setting of two independent samples, each of size 10. We simulate a gene expression matrix \texttt{x} containing $2,000$ genes in the following manner.
\begin{itemize}
\item Gene expression intensities are drawn from a multivariate normal distribution with zero mean vector and a covariance matrix which itself has been drawn randomly from an inverse Wishart distribution.
\item The first 40 genes are differentially expressed.
The differences in the means between the two classes are simulated independently according to a normal distribution with variance 0.9.
\end{itemize}

<<>>=
library(GeneSelector)
@

We access the data using the lines:
<<>>=
data(toydata)
y <- as.numeric(toydata[1,])
x <- as.matrix(toydata[-1,])
dim(x)
table(y)
@

Knowing that the first genes are differentially expressed, we make boxplots of the gene expression intensities of the first four genes:
<<fig=TRUE>>=
par(mfrow=c(2,2))
for(i in 1:4) boxplot(x[i,]~y, main=paste("Gene", i))
@

\subsection{Rankings}

We now perform a ranking using the ordinary $t$-statistic.
<<>>=
ordT <- RankingTstat(x, y, type="unpaired")
@
The resulting objects are all instances of the class \texttt{GeneRanking}.\\
To get basic information, we use the commands:
<<>>=
getSlots("GeneRanking")
str(ordT)
show(ordT)
toplist(ordT)
@
The last command yields the top-ranking genes according to the respective procedure.

\subsection{Altered data sets}

In order to inspect the stability of the obtained ranking with respect to changes in the data, we use the resampling techniques implemented in \texttt{GeneSelector}. The following command produces jackknife datasets, i.e. datasets resulting from successively removing exactly one sample from the complete sample:
<<>>=
loo <- GenerateFoldMatrix(y = y, k=1)
show(loo)
@
We plug this into the method \texttt{RepeatRanking}, which determines the ranking 20 times, i.e. anew for each removed observation:
<<>>=
loor_ordT <- RepeatRanking(ordT, loo)
@
The object \texttt{loo} may additionally be used in the following manner:
<<>>=
ex1r_ordT <- RepeatRanking(ordT, loo, scheme = "labelexchange")
@
The argument \texttt{scheme = "labelexchange"} specifies that, instead of leaving one observation out, it is assigned the opposite class label.\\
We may also use the bootstrap, e.g.
<<>>=
boot <- GenerateBootMatrix(y = y, maxties=3, minclassize=5, repl=30)
show(boot)
boot_ordT <- RepeatRanking(ordT, boot)
@
$\ldots$ or add a small amount of noise to the observed expression intensities:
<<>>=
noise_ordT <- RepeatRanking(ordT, varlist=list(genewise=TRUE, factor=1/10))
@
To get a toplist that tabulates how the top list positions are distributed over all repeated rankings, we use:
<<>>=
toplist(loor_ordT, show=FALSE)
@
As an exploratory tool to examine the differences between the rankings from the original and the perturbed datasets, a \texttt{plot} command is available.
\begin{figure}[h!]
\centering
<<fig=TRUE>>=
par(mfrow=c(2,2))
plot(loor_ordT, col="blue", pch=".", cex=2.5, main = "jackknife")
plot(ex1r_ordT, col="blue", pch=".", cex=2.5, main = "label exchange")
plot(boot_ordT, col="blue", pch=".", cex=2.5, main = "bootstrap")
plot(noise_ordT, frac=1/10, col="blue", pch=".", cex=2.5, main = "noise")
@
\caption{Scatterplots of rankings from altered datasets vs. rankings from the original dataset.}\label{fig:scatterplot}
\end{figure}
Figure \ref{fig:scatterplot} shows that variability increases with list position. Moreover, the figure shows that variability depends on the method used to generate the altered datasets. In this example, the bootstrapped rankings are more scattered around the angle bisector than the jackknife rankings.

\subsection{Stability measures}

As an alternative to visual methods, one can compute the stability measures tabulated in Table \ref{tab:stab}. Let $\bm{\sigma}, \bm{\sigma}'$ be either two rankings $\bm{r}, \bm{r}'$ or two lists $\bm{l}, \bm{l}'$. A function $s$ is called a \emph{pairwise stability measure} if
\begin{itemize}
\item[(i)] $s(\bm{\sigma}, \bm{\sigma}') = s(\bm{\sigma}', \bm{\sigma})$,
\item[(ii)] $s(\bm{\sigma}, \bm{\sigma}') \leq s(\bm{\sigma}, \bm{\sigma}) = s(\bm{\sigma}', \bm{\sigma}') = 1$.
\end{itemize}
In the current version of \texttt{GeneSelector}, there are two groups of pairwise stability measures: the first group is set-based, counting/summing up overlaps of lists, while the second one computes distances. Pairwise stability measures are particularly appropriate in the presence of a reference list/ranking. In the example given in the previous subsection, the natural choice for the reference is the ranking obtained with the original dataset. If one wants to compute a stability indicator for several lists without a reference, e.g. when comparing the output of different ranking criteria, we introduce the following notion. Let $\bm{\sigma}_b, \; \; b=1,\ldots,B$ be a sequence of rankings or lists. A function $s$ is called a \emph{multi-input stability measure} if
\begin{itemize}
\item[(i)] $s(\bm{\sigma}_1, \ldots, \bm{\sigma}_{B}) = s(\bm{\sigma}_{\pi_1}, \ldots, \bm{\sigma}_{\pi_B})$ for any permutation $\bm{\pi}$ of $\{1,\ldots,B \}$,
\item[(ii)] $s(\bm{\sigma}_1, \ldots, \bm{\sigma}_B) \leq s(\bm{\sigma}_1, \ldots, \bm{\sigma}_1) = \ldots = s(\bm{\sigma}_B, \ldots, \bm{\sigma}_B) = 1$.
\end{itemize}
As shown in Table \ref{tab:stab}, an additional component of the stability measures is a weighting scheme which penalizes variability at the top of the list more severely than at the bottom, since only the top is of practical relevance.
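The set-based idea behind the pairwise measures can be illustrated with a minimal base-R sketch of a normalized intersection count for two top-$k$ lists. The lists below are hypothetical, and the function is only a toy version, not the implementation provided by the package:

```r
# Toy version of a normalized intersection count for top-k lists
# (hypothetical example; not the GeneSelector implementation)
intersection_count <- function(l1, l2, k) {
  length(intersect(l1[1:k], l2[1:k])) / k
}

l1 <- c(3, 1, 4, 2, 5)  # reference list
l2 <- c(3, 4, 1, 5, 2)  # alternative list

intersection_count(l1, l2, k = 2)  # top-2 sets {3,1} vs {3,4}: 0.5
# Properties (i) and (ii): symmetry, and maximal value 1 for identical input
intersection_count(l2, l1, k = 2)  # 0.5
intersection_count(l1, l1, k = 2)  # 1
```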
\begin{table}[h!]
\begin{tabular}{|l|l|l|}
\hline
\textbf{Name} & \textbf{Definition} & \textbf{Reference} \\
\hline
 & & \\
Intersection count $\dagger$ & $s_{\cap}(\bm{l}, \bm{l}') = s_{\cap}(\bm{l}_{[k]}, \bm{l}'_{[k]}) = \frac{\sum_{1 \leq m,m' \leq k} I(l_m = l_{m'}')}{k}, \; \; k=1,\ldots,p$ & \citet{Qiu} \\
 & & \\
Overlap score $\dagger$ & $s_{O\cap}(\bm{l}, \bm{l}') = \frac{\sum_{k=1}^p w_k s_{\cap} (\bm{l}_{[k]}, \bm{l}'_{[k]})}{\sum_{k=1}^p w_k}$ & \citet{Yan2006} \\
 & & \\
$\ell^1$ $\ddagger$ & $s_{\ell^1}(\bm{r}, \bm{r}') = 1 \; - \; \frac{\sum_{j=1}^p w_j |r_j - r_j'|}{\sum_{j=1}^p w_{(j)} |j - (p-j+1)|}$ & \citet{Jur2008} \\
 & & \\
$\ell^2$ $\ddagger$ & $s_{\ell^2}(\bm{r}, \bm{r}') = 1 \; - \; \frac{\sum_{j=1}^p w_j (r_j - r_j')^2}{\sum_{j=1}^p w_{(j)} (j - (p-j+1))^2}$ & \citet{Jur2008} \\
 & & \\
Spearman's $\rho$ $\ddagger$ & $s_{\rho}(\bm{r}, \bm{r}') = \frac{\sum_{j=1}^p w_j (r_j - (p+1)/2)(r_j' - (p+1)/2)}{\left(\sum_{j=1}^p r_j^2 \right)^{1/2} \left( \sum_{j=1}^p r_j'^2 \right)^{1/2}}$ & \citet{Jur2008} \\
 & & \\
Kendall's $\tau$ $\ddagger$ & $s_{\tau}(\bm{r}, \bm{r}') = 1 \; - \; \frac{\sum_{1 \leq j < m \leq p} w_j \; w_m \; I( [(r_j - r_m)(r'_j - r'_m)] < 0)}{\sum_{1 \leq j < m \leq p} w_j \; w_m}$ & \citet{Dec2006} \\
 & & \\
Union count $\Delta$ & $s_{\cup}(\bm{l}_{1 \; [k]},\ldots,\bm{l}_{B \; [k]}) = 1 - \frac{|U_{[k]}| - k}{\min\{ Bk, p\} - k}$ & \citet{Jur2008} \\
 & & \\
Union score $\Delta$ & $s_{O\cup}(\bm{l}_1,\ldots,\bm{l}_{B}) = \frac{\sum_{k=1}^p w_k s_{\cup}(\bm{l}_{1 \; [k]},\ldots,\bm{l}_{B \; [k]})}{\sum_{k=1}^p w_k}$ & \\
\hline
\end{tabular}
\caption{Overview of the stability measures in \texttt{GeneSelector}. Notation: $\bm{l}_{[k]} = (l_m, \; 1 \leq m \leq k)$ denotes the top-$k$ list of $\bm{l}$; the $w_j$ are (fixed) weights, where the subscripts in brackets indicate ordering, i.e. $w_{(1)} \leq \ldots \leq w_{(p)}$; $|U_{[k]}|$ denotes the size of the union of all top-$k$ lists to be compared.
Legend: $\dagger$ implemented in \texttt{GetStabilityOverlap}; $\ddagger$ implemented in \texttt{GetStabilityDistance}; $\Delta$ implemented in \texttt{GetStabilityUnion}.}\label{tab:stab}
\end{table}
As an illustration, we apply \texttt{GetStabilityOverlap} to the rankings obtained after swapping class labels, which seems to perturb the original ranking considerably, as indicated by Figure \ref{fig:scatterplot}. Concerning the sequence of weights, we choose $w_m = 1/m, \; \; m=1,\ldots,p$, which is realized by using the option \texttt{decay = "linear"}.
<<>>=
stab_ex1r_ordT <- GetStabilityOverlap(ex1r_ordT, scheme = "original", decay="linear")
show(stab_ex1r_ordT)
@
\texttt{GetStabilityOverlap} computes both normalized intersection counts and overlap scores when truncating at list position $k, \; \; k=1,\ldots,p$. To evaluate these scores at position $k = 10$, we use the lines:
<<>>=
summary(stab_ex1r_ordT, measure = "intersection", display = "all", position = 10)
summary(stab_ex1r_ordT, measure = "overlapscore", display = "all", position = 10)
@
The output shows that the overlap between the reference and the alternative top-ten lists ranges from 60 to 90 percent. Overlap score and intersection count disagree visibly, which is due to the fact that the overlap score is computed with weights. Though Figure \ref{fig:scatterplot} suggests some discrepancy between the reference and the alternative lists, the output shows that the fraction of accordance is much larger than the expectation in the no-information case, i.e. in the case of mutually unrelated lists. To have a look at how the two scores vary (on average) with increasing list position, we invoke the predefined \texttt{plot(...)} routine:
\begin{figure}[h!]
\centering
<<fig=TRUE>>=
plot(stab_ex1r_ordT, frac = 1, mode = "mean")
@
\caption{Visualization of intersection count and overlap score.}
\label{fig:overlap}
\end{figure}
Next, let us investigate which sample is most influential in the sense that its removal perturbs the reference ranking most. For this purpose, we apply \texttt{GetStabilityDistance} with the option \texttt{measure = "spearman"} to the jackknife rankings.
<<>>=
stab_loo_ordT <- GetStabilityDistance(loor_ordT, scheme = "original", measure = "spearman", decay="linear")
show(stab_loo_ordT)
summary(stab_loo_ordT, display = "all")
@
From the output, we conclude that the fifth sample is by far the most influential one.

\subsection{Aggregating multiple ranking criteria}

In addition to the ordinary $t$-statistic, we compute five additional rankings (cf. Table \ref{tab:statistics}):
<<>>=
BaldiLongT <- RankingBaldiLong(x, y, type="unpaired")
FoxDimmicT <- RankingFoxDimmic(x, y, type="unpaired")
sam <- RankingSam(x, y, type="unpaired")
wilcox <- RankingWilcoxon(x, y, type="unpaired")
wilcoxeb <- RankingWilcEbam(x, y, type="unpaired")
@
Again, we first assess variability visually. The method \texttt{HeatmapRankings} produces a heatmap from all obtained rankings, clustering genes and criteria simultaneously. We restrict our attention to the forty differentially expressed genes (\texttt{ind = 1:40}).
<<fig=TRUE>>=
Merged <- MergeMethods(list(ordT, BaldiLongT, FoxDimmicT, sam, wilcox, wilcoxeb))
HeatmapRankings(Merged, ind=1:40)
@
To cope with the multiplicity problem, we exploit the functionalities for rank aggregation in the \texttt{GeneSelector} package. A simple approach is to take the average of all observed ranks, which is, among other variants, implemented in the method \texttt{AggregateSimple}. As a more sophisticated approach, we use the Markov chain model proposed in \citet{Dec2006} and implemented in the method \texttt{AggregateMC}.
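The averaging idea can be sketched in base R. The rank matrix \texttt{R} below is made up for illustration (genes in rows, ranking criteria in columns, with hypothetical column names), and the sketch only mirrors the mean-rank variant, not the full \texttt{AggregateSimple} functionality:

```r
# Sketch of rank aggregation by averaging (cf. AggregateSimple with
# measure = "mean"); R is a hypothetical genes x criteria rank matrix
R <- cbind(crit1 = c(1, 2, 3, 4),
           crit2 = c(2, 1, 4, 3),
           crit3 = c(1, 3, 2, 4))
meanranks <- rowMeans(R)                         # average rank per gene
aggregated <- rank(meanranks, ties.method = "first")
aggregated  # aggregated rank per gene: 1 2 3 4
```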
Lastly, we use the \texttt{GeneSelector} function, which aims at finding genes that fall consistently, i.e. for all ranking criteria, below a pre-specified threshold, here chosen as 50.
<<>>=
AggMean <- AggregateSimple(Merged, measure = "mean")
AggMC <- AggregateMC(Merged, type = "MCT", maxrank = 100)
GeneSel <- GeneSelector(list(ordT, BaldiLongT, FoxDimmicT, sam, wilcox, wilcoxeb),
                        threshold="user", maxrank=50)
show(GeneSel)
sel <- sum(slot(GeneSel, "selected"))
cbind(mean = toplist(AggMean, top = sel, show = FALSE),
      MC = toplist(AggMC, top = sel, show = FALSE),
      GeneSelector = toplist(GeneSel, top = sel, show = FALSE)[,1])
@
Here, we have first determined the number of genes passing the \texttt{GeneSelector} filter. In total, 29 genes manage to fall below rank 50 in all six rankings. Although \texttt{GeneSelector} attempts to minimize the number of false positives, one still ends up with 14 false positive genes among the 29 selected ones. In this regard, the Markov chain approach is superior, because it selects only 11 false positives. Simple averaging seems to perform slightly worse, putting the false positive gene \texttt{820} at position 13. In contrast, the first false positive gene of \texttt{GeneSelector} occurs at position 16.
A nice feature we want to present at the end is the \texttt{plot} routine for the class \texttt{GeneSelector}. It allows one to obtain a detailed gene-specific overview:
<<fig=TRUE>>=
plot(GeneSel, which = 1)
@
Interestingly, for the first, differentially expressed gene, simple approaches such as the ordinary $t$- and the Wilcoxon statistic perform well, while the more sophisticated statistics, which depend on hyperparameters, fail to detect differential expression.

\bibliographystyle{plainnat}
\bibliography{literatur}

\end{document}