%&pdflatex %\VignetteIndexEntry{Using staRank} \documentclass[a4paper]{article} \usepackage[left=1in,right=1in,top=1in,bottom=1in]{geometry} \pagestyle{empty} \parindent0ex \parskip1ex \usepackage[OT1]{fontenc} \usepackage{Sweave} \usepackage{amsmath, amssymb} \begin{document} \SweaveOpts{prefix.string=plot, eps = FALSE, pdf = TRUE} \SweaveOpts{width=6, height=6} \title{Stability ranking with the \texttt{staRank} package} \author{Juliane Siebourg} \maketitle Stability ranking can be used to obtain variable rankings that are highly reproducible. Such rankings are needed for robust variable selection on univariate data. In most biological experiments several replicate measurement for a variable of interest, e.g. a gene, are available. An obvious way to combine and prioritize these values is to take their average or median. Other procedures like a t-test or rank sum test deliver a statistic that also accounts for the variance in the data. Stability ranking provides a way of ranking measured elements based on one of the methods above and combining it with a bootstrapping routine to estimate the stability with which an element occurs within the top k set of the ranking. The final ranking orders the elements accoring to this stability. The theory behind the procedure is described in \cite{Siebourg2012}. Stability selection in general is described in \cite{Meinshausen2010}. This file contains two example stability analysis that show how the \texttt{staRank} package should be used. In a first toy example, stability selection using all available base ranking methods is performed on simulated data. In a second example, an RNAi dataset on Salmonella infection \cite{Misselwitz2011} is analyzed. \section*{Simulation example} First we create an artificial dataset of $p$ genes (rows) with $n$ replicate measurement values (columns). For each gene the effects are drawn from a normal distribution and their replicate variance is drawn from a gamma distribution. <>= set.seed(42) @ <<>>= # sample parameters for data p<-20 # genes n<-4 # replicates trueEffects<-rnorm(p,0,5) # gene effect s<-rgamma(p,shape=2.5,rate=0.5) # sample variance # draw n replicate values of a gene simData <- matrix(0,nr=p,nc=n, dimnames=list(paste("Gene",c(1:p)),paste("Replicate",c(1:n)))) for(i in 1:p){ simData[i,]<-replicate(n,rnorm(1,mean=trueEffects[i],sd=s[i])) } @ Now stability ranking can be performed on the dataset, to find top scoring genes that are highly reproducible. Stability ranking can be applied using different base ranking methods. Implemented in the package are the mean, median and the test statistics of the rank sum test (Mann-Whitney test), t-test and RSA \cite{Koenig2007}. <<>>= # load the stability ranking package library(staRank) # implemented ranking methods method<-c('mean','median','mwtest','ttest','RSA') @ The ranking is performed calling the main function \texttt{stabilityRanking}. Here this is done using the default settings but for different base methods. <<>>= stabilityList<-list() stabilityList[['mean']]<-stabilityRanking(simData,method='mean') stabilityList[['median']]<-stabilityRanking(simData,method='median') stabilityList[['mwtest']]<-stabilityRanking(simData,method='mwtest') stabilityList[['ttest']]<-stabilityRanking(simData,method='ttest') stabilityList[['RSA']]<-stabilityRanking(simData,method='RSA') @ The stability ranking proceedure can also be used with any external ranking method. This means \texttt{stabilityRanking} can be called on rankings directly. All that is needed is a named vector of the original ranking and a matrix containing sample rankings. In the following this is demonstrated by performing a mean ranking on the data and on bootstrap samples of it. <<>>= # 1. create a ranking extRanking <- sort(apply(simData,1,mean)) # 2. create 100 sample rankings sampleRankings<-matrix(NA,nr=nrow(simData),nc=100) for(i in 1:100){ sampleRankings[,i]<- apply(simData,1,function(x){mean(sample(x,length(x),replace=TRUE))}) } rownames(sampleRankings)<-rownames(simData) sampleRankings<-sampleRankings[names(extRanking),] # 3. run stabilityRanking stabilityList[['externalMean']]<- stabilityRanking(extRanking,sampleRankings,method='externalMean') @ In this case the parameter \texttt{method} is only needed for identification. We can now examine the results of the procedure. The function returns an object of type \texttt{RankSummary}. This contains three different rankings of the data. The base, stability and averaged ranking. The averaged ranking is also an aggregated ranking that averages the rank of a variable over all samples. <<>>= #results for the mean ranking baseRank(stabilityList$mean) stabRank(stabilityList$mean) avrgRank(stabilityList$mean) @ This can be summarized in a matrix with \texttt{getRankmatrix(stabilityList\$mean)}. For comparison of the three rankings we also compute their correlation using Spearman's rank correlation coefficient. <<>>= rankCor(stabilityList[['mean']]) @ Another value that is returned is a vector containing the sizes of the stable sets per cutoff. This gives information about how stable the base ranking itself is. <<>>= stableSetSize(stabilityList[['mean']]) @ We can for example compare this property for all the base ranking methods. \SweaveOpts{width=5, height=5} <>= # plot the stable genes per cutoff k* for each method mCol<-rainbow(5) name_out<-c('Mean','Median','Rank sum','t-test','RSA') plot(1,type='n',xlim=c(0,p),ylim=c(0,p),xlab='k*',ylab='stable genes') abline(0,1,col='lightgray') for(i in 5:1){ lines(stableSetSize(stabilityList[[i]]),lwd=2,col=mCol[i]) } legend('topleft',name_out,col=mCol,lwd=2) @ \setkeys{Gin}{width=.4\textwidth} \begin{figure}[!h] \centering <>= <> @ \caption{Stable set sizes for each cutoff using different ranking methods on the toy example.} \label{toyStabilityPlot} \end{figure} The gray diagonal in this plot indicates perfect stability. \section*{Stability ranking of an RNAi dataset} The package contains an RNAi screen on human cells under \textit{Salmonella} infection. Briefly, in HeLa cells all genes were knocked-down individually by RNA interference. Subsequently they were infected with \textit{Salmonella}. The cells were imaged with a microscope and from the images several features were extracted (for details see original paper \cite{Misselwitz2011}). After loading the data we select all genes where the infection on average decreased more than one standard deviation. Then we run stability ranking on this selection, for mean, rank sum test and RSA as base ranking methods. <<>>= #load the data set data(salmonellaInfection) data<-as.matrix(salmonellaInfection) down<-rownames(data[which(apply(data,1,mean)<=-1),]) data<-data[down,] @ <>= salmonellaStability<-list() salmonellaStability[['mean']]<-stabilityRanking(data,method='mean') salmonellaStability[['mwtest']]<-stabilityRanking(data,method='mwtest') salmonellaStability[['RSA']]<-stabilityRanking(data,method='RSA') <>= save(salmonellaStability,file='data/salmonellaStability.RData') @ <>= data(salmonellaStability) @ The correlation shows that the stability ranking leads to a more diverged result. <<>>= rankCor(salmonellaStability$mwtest) @ Plotting the data points and their mean values according to their ranking shows the differences of the methods. For this we first gather all rankings in a matrix. <<>>= # gather all rankings in a matrix rankings<-lapply(salmonellaStability,getRankmatrix) rankings<-Reduce(cbind,lapply(rankings,function(x){x[down,]})) @ Then, for each gene the three data points and and their mean are plotted at their position in the ranking (x-axis). \SweaveOpts{width=9, height=6} <>= # plot data according to rankings par(mfrow=c(1,5),mar=c(4,4,2,2)) range<-range(data) for(m in c(2,3,1,4,8)){ plot(1,type='n',xlim=c(0,length(down)), ylim=range,main='', ylab='infection score', xlab=paste(colnames(rankings)[m],'rank')) orderedData<-data[order(rankings[,m]),] for(j in 1:3){ points(orderedData[,j],pch=20,cex=0.5) } points(apply(orderedData,1,mean),col='red',pch=20,cex=0.5) } @ \setkeys{Gin}{width=\textwidth} \begin{figure}[!h] \centering <>= <> @ \caption{The infection data plotted according to a ranking (x-axis). Black dots are infection values, plotted columnwise per gene, red dots are their mean.} \label{dataplot} \end{figure} One can observe that the overall variance in this dataset is quite high, resulting from the fact, that different siRNAs were used for each gene. This also becomes obvisous when looking at the stable set sizes of the original rankings. \SweaveOpts{width=5, height=5} <>= # plot the stable genes per cutoff k* for each method mCol<-rainbow(3) name_out<-c('Mean','Rank sum','RSA') plot(1,type='n',xlim=c(0,length(down)),ylim=c(0,length(down)), xlab='k*',ylab='stable genes') abline(0,1,col='lightgray') for(i in 1:3){ lines(stableSetSize(salmonellaStability[[i]]),lwd=2,col=mCol[i]) } legend('topleft',name_out,col=mCol,lwd=2) @ \setkeys{Gin}{width=0.5\textwidth} \begin{figure}[!h] \centering <>= <> @ \caption{Stable sets of rankings on the salmonella infection data.} \label{infectionStabilityPlot} \end{figure} Within the top 10\% of the ranking (35 genes) only 5 are stable. Calling \texttt{summary} on a \texttt{RankSummary} object we can quickly access the top 1, 10 and 50\% cutoffs and stable set sizes. <<>>= summary(salmonellaStability$mean) @ \begin{thebibliography}{50} \bibitem{Siebourg2012} Siebourg J., Merdes G., Misselwitz B., Hardt W.-D.and Beerenwinkel N. \textsl{Stability of gene rankings from RNAi screens}. Bioinformatics, 2012, doi: 10.1093/bioinformatics/bts192 \bibitem{Meinshausen2010} Meinshausen N. and B\"uhlman P. \textsl{Stability selection}. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2010 \bibitem{Misselwitz2011} Misselwitz B. et al., \textsl{RNAi screen of Salmonella invasion shows role of COPI in membrane targeting of cholesterol and Cdc42}. Molecular Systems Biology, 2011 \bibitem{Koenig2007} Koenig R. et al., \textsl{A probability-based approach for the analysis of large-scale RNAi screens}. Nature Methods, 2007 \end{thebibliography} \end{document}