% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{dualKS Overview}
% \VignetteDepends{dualKS}
% \VignetteKeywords{Expression Analysis, Postprocessing}
% \VignettePackage{dualKS}
\documentclass[11pt]{article}

\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\usepackage{times}
\usepackage{comment}
\usepackage{amsmath}

\parindent 0.5in

\setcounter{totalnumber}{5}
\setcounter{topnumber}{5}
\setcounter{bottomnumber}{5}
\renewcommand{\topfraction}{0.9}
\renewcommand{\bottomfraction}{0.9}
\renewcommand{\textfraction}{0.1}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{\textit{#1}}

\bibliographystyle{plainnat}

\begin{document}

\title{\bf Using dualKS}
\author{Eric J. Kort and Yarong Yang}
\maketitle

\section{Overview}

The Kolmogorov-Smirnov (KS) rank-sum statistic measures how biased the ranks
of a subset of items are within the ranks of the entire set. In other words,
do the items tend to occur early or late in the ordered list, or are they
randomly dispersed throughout it? This is an intuitive way to think about
gene set enrichment. Indeed, it is the basis for the Gene Set Enrichment
Analysis described by \cite{Subramanian2005}. This package takes this
approach to the problem of multi-class classification, wherein samples of
interest are assigned to one of two or more classes based on their gene
expression. For example, to which of the several sub-types of renal cell
carcinoma does a new kidney tumor sample belong? This methodology is
described in detail in our (forthcoming) paper, \cite{Kort2008b}.

This package is called ``dualKS'' because it applies the KS algorithm in two
ways. First, we use the KS approach to perform discriminant analysis by
applying it gene-wise to a training set of known classification, in order to
identify those genes whose expression is most biased (upward, downward, or
both) in each class of interest. For example, say we take gene 1 and sort the
samples by their expression of this gene. We then ask to what extent each
class is biased in that sorted list. If all samples of, say, class 1 occur
first, then this gene receives a high score for that class and will be
included in the final signature of ``upregulated genes'' for that class. And
so on, for every gene in the dataset. This process is frequently termed
``discriminant analysis'' because it identifies the most discriminating
genes.

The second way in which the KS algorithm is applied is for classification.
Based on the signatures identified in the first step (or via some other
mechanism of the user's choosing), we apply the algorithm sample-wise to ask
which of these signatures has the strongest bias in new samples of interest,
in order to assign those samples to one of the classes. In other words, when
we sort all the gene expression values for a given sample, how early (or
late) in that list are the genes of a given signature found? This second step
is essentially the same as the algorithm described by Subramanian et al.,
with equal weights given to each gene.

While we have developed this package with the task of \emph{classification}
in mind, it can just as readily be used to identify which biologically
relevant gene signatures are \emph{enriched} in samples of interest (indeed,
classification is simply enrichment analysis with a class-specific gene
signature).
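To make the rank-bias idea concrete, the toy sketch below computes a classic
(unweighted) KS-style enrichment score for a hypothetical gene set within a
hypothetical ranked gene list. The gene names, set size, and step sizes are
purely illustrative; the actual scoring is performed internally by
\Rfunction{dksTrain}, \Rfunction{dksClassify}, and \Rfunction{KS}, and may
differ from this sketch in its scaling.

<<eval=FALSE>>=
## Toy illustration of the rank-bias idea (not dualKS's internal scoring):
## walk down a ranked list, stepping up at signature genes and down at all
## other genes; a large maximum deviation of the running sum means the
## signature is concentrated near the top of the list.
set.seed(1)
ranked.genes <- sample(paste("g", 1:1000, sep=""))  # hypothetical ranked list
signature    <- paste("g", 1:25, sep="")            # hypothetical gene set
hit  <- ranked.genes %in% signature
step <- ifelse(hit, 1/sum(hit), -1/sum(!hit))
max(cumsum(step))                                   # KS-style enrichment score
@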
The KS approach has several attractive features. First, the idea of
identifying which genes exhibit a class-dependent bias in expression and,
conversely, which signatures are most biased in a given sample, is an
intuitive way of thinking about discriminant analysis and classification that
most biologists can readily grasp. Second, the resulting gene signatures may
be quite parsimonious--an attractive feature when one is planning downstream
validation or implementation of a non-array based assay. Third, the algorithm
can deal with many classes simultaneously. Finally, the algorithm does not
require iteration, as random selection algorithms do.

\section{An Example}

The package includes an illustrative dataset that is a subset of GEO data set
GDS2621. We have taken a small subset of the genes in the original dataset so
that the analysis runs quickly, but you will obtain the same results if you
use the entire dataset from the GEO website, and the analysis will take only
a few minutes longer. This data set contains one-color Affymetrix gene
expression data:

<<echo=FALSE>>=
options(width=70, scipen=999)
@

<<>>=
library("dualKS")
data("dks")
pData(eset)
@

As you can see, there are five samples each of normal synovial fluid,
synovial fluid from patients with osteoarthritis, and synovial fluid from
patients with rheumatoid arthritis. We will now build a classifier that can
distinguish these diagnoses.

The first step is to rank each gene based on how biased its expression is in
each of the classes. You have the option of scoring genes based on how
upregulated they are, how downregulated they are, or both. In one-color data
such as this, genes with low expression exhibit a great deal of noise.
Therefore, we will focus only on those genes that are upregulated in each
class, as specified by the \texttt{type="up"} parameter.

<<>>=
tr <- dksTrain(eset, class=1, type="up")
@

The \texttt{class=1} argument instructs the software to look in column 1 of
the phenoData object contained within \texttt{eset} to determine which class
each sample belongs to. Alternatively, you can provide a factor specifying
the classes directly.

Now we will extract a classifier of 100 genes per class from the training
data. We can subsequently choose classifiers of different sizes (and,
perhaps, select among these classifiers using ROC analysis) by specifying a
different value of \texttt{n}, without re-running the (more time-intensive)
\texttt{dksTrain} function.

<<>>=
cl <- dksSelectGenes(tr, n=100)
@

We can then apply this classifier to a test data set. In this case, however,
we will just run it against the training set to check for internal
consistency (if we cannot classify the training set from which the classifier
is derived, we are in real trouble!).

<<>>=
pr <- dksClassify(eset, cl)
summary(pr, actual=pData(eset)[,1])
show(pr)
@

Note the custom \texttt{summary} and \texttt{show} functions. By specifying
the ``actual'' classes for the samples when calling \texttt{summary}, the
percent concordance is calculated and displayed along with the summary.

As you can see, we did not do too well with classification (our concordance
rate is only 60\%), even though we are only trying to classify our training
set, which ought to be easy since the classifier was derived from these very
samples. What is the problem? To gain some insight, let's look at plots of
the data. The prediction object returned by \texttt{dksClassify} has its own
\texttt{plot} method. The resulting plot allows for easy visualization of how
the samples of a given class compare to the other samples in terms of KS
scores for each signature.
To visualize this, a separate panel is created for each signature. The
samples are sorted in decreasing order according to their score on that
signature. Then, for each sample, its score for \emph{each} signature is
plotted to allow easy comparison between the signature scores for each
sample. The signatures are color coded to distinguish one signature score
from another. Finally, a color coded bar is plotted below each sample
indicating its predicted and (if provided) actual class, for ready
identification of outliers and misclassified samples. There is a lot of
information compactly summarized in these plots, so let's consider the
example:

<<eval=FALSE>>=
plot(pr, actual=pData(eset)[,1])
@

\begin{figure}[ht]
\centering
<<fig=TRUE, echo=FALSE>>=
plot(pr, actual=pData(eset)[,1])
@
\caption{Plot of samples sorted by their score for the indicated class.}
\label{fig:ksp}
\end{figure}

To begin, examine the first panel of figure \ref{fig:ksp}, corresponding to
the signature for ``normal'' samples. This panel shows the samples sorted
according to their score for the ``normal'' signature. Not surprisingly, the
normal samples (red bars) are clustered to the left, with the highest scores.
As you move to the right, not only does the red line decrease (because the
samples are sorted by this score), but the lines for the other signatures
(blue line = rheumatoid signature score, green line = osteoarthritis
signature score) begin to increase. For each sample (indicated by the bars
along the bottom), the predicted class is the signature for which that
sample's score is the maximum. So where the green line is on top, the
prediction is ``osteoarthritis'', and where the red line is on top, the
prediction is ``normal''. The other two panels work in the same way, but now
the samples have been sorted by decreasing rheumatoid signature score or
osteoarthritis signature score, respectively.

This way of plotting the data allows easy visualization of the degree to
which each class is upwardly biased in the list of samples sorted by the
corresponding signature. It also allows the natural ``knee point'' or
threshold for any given signature to be identified.

This plot demonstrates why our concordance rate is so poor. While the normal
gene expression signature is \emph{most} upwardly biased in the normal
samples, these genes also have high expression in the other samples (this is
not really too surprising--abnormal cells have a lot of the same work to do
as normal cells, biologically speaking). So in many cases, the most upwardly
biased genes in, say, osteoarthritis are still not as upwardly biased as the
highly expressed ``normal'' genes.

There are three methods to correct this problem. The most simplistic is brute
force. Rather than classifying samples based on the maximum raw KS score (as
in figure \ref{fig:ksp}), we will classify based on \emph{relative} KS score.
We accomplish this by rescaling the scores for each signature so that they
fall between 0 and 1 (by subtracting each signature's minimum score across
all samples and then dividing by the resulting maximum). This is achieved by
supplying the \texttt{rescale=TRUE} option to \texttt{dksClassify}:

<<>>=
pr <- dksClassify(eset, cl, rescale=TRUE)
summary(pr, actual=pData(eset)[,1])
@

And we can plot the results as before:

<<eval=FALSE>>=
plot(pr, actual=pData(eset)[,1])
@

\begin{figure}[ht]
\centering
<<fig=TRUE, echo=FALSE>>=
plot(pr, actual=pData(eset)[,1])
@
\caption{Classification with rescaling.}
\label{fig:rescale}
\end{figure}

The resulting plot is shown in figure \ref{fig:rescale}.
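Conceptually, the rescaling enabled by \texttt{rescale=TRUE} amounts to a
per-signature min--max normalization of the score matrix. The sketch below,
written against a hypothetical matrix \texttt{scores} with one column per
signature and one row per sample, is purely illustrative and is not the
package's internal code:

<<eval=FALSE>>=
## Illustrative per-signature rescaling (not the package internals):
## each column (signature) is shifted and scaled to fall between 0 and 1.
rescaled <- apply(scores, 2, function(x) (x - min(x)) / (max(x) - min(x)))
@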
The classification is now in line with our expectations. A drawback of this
approach is that the rescaling requires that the dataset being classified
contain multiple samples from \emph{each} class. Therefore, a single new
sample cannot be classified using rescaling. Conversion to ratio space avoids
this problem, as discussed in section 4.

Finally, you may want to examine a plot of the running sum of the KS scores
for an individual sample. You can call the function \texttt{KS} directly to
access the running sum for each signature and then plot it. Here is an
example of how that might be accomplished for the first sample in our data
set, using our previously defined classifier object (\texttt{cl}):

<<eval=FALSE>>=
sc <- KS(exprs(eset)[,1], cl@genes.up)
plot(sc$runningSums[,1], type='l', ylab="KS sum", ylim=c(-1200,1200),
     col="red")
par(new=TRUE)
plot(sc$runningSums[,2], type='l', ylab="KS sum", ylim=c(-1200,1200),
     col="green")
par(new=TRUE)
plot(sc$runningSums[,3], type='l', ylab="KS sum", ylim=c(-1200,1200),
     col="blue")
legend("topright", col=c("red", "green", "blue"), lwd=2,
       legend=colnames(sc$runningSums))
@

As shown in figure \ref{fig:runsum}, it looks like the first sample belongs
to the ``normal'' class. Note that the maximum value achieved by each line
corresponds to the (unrescaled) KS score of sample 1 on the corresponding
signature.

\begin{figure}[ht]
\centering
<<fig=TRUE, echo=FALSE>>=
sc <- KS(exprs(eset)[,1], cl@genes.up)
plot(sc$runningSums[,1], type='l', ylab="KS sum", ylim=c(-1200,1200),
     col="red")
par(new=TRUE)
plot(sc$runningSums[,2], type='l', ylab="KS sum", ylim=c(-1200,1200),
     col="green")
par(new=TRUE)
plot(sc$runningSums[,3], type='l', ylab="KS sum", ylim=c(-1200,1200),
     col="blue")
legend("topright", col=c("red", "green", "blue"), lwd=2,
       legend=colnames(sc$runningSums))
@
\caption{Plot of sample 1's running sum of the KS statistic for each
signature.}
\label{fig:runsum}
\end{figure}

The other two methods for addressing the problem in our original
classification require a bit more effort (mostly on the part of the CPU), but
are conceptually more satisfying because they use biologically relevant
information encoded in the data. These two methods--weighting by expression
level and converting the data to ratios versus a relevant reference--are
covered in the next two sections.

\section{Discriminant analysis with weights}

By default, signatures are defined by this package based on the ranks of the
members of each class when the samples are sorted on each gene. Genes for
which a given class ranks highest when the samples are sorted by that gene's
expression will be included in the classifier, with no regard to the absolute
expression level of those genes. This is the classic KS statistic. Highly
discriminating genes identified in this way may or may not be highly
expressed. For example, a gene with very low expression but also very low
variance may be slightly over-expressed in one subgroup. The result is that
signatures identified in this way have arbitrary ``baseline'' values, as we
saw in figure \ref{fig:ksp}. Our first solution was to force the range of
values for each signature across samples to be between 0 and 1. An
alternative to this brute force approach is to weight the genes based on some
biologically interesting statistic. By adding the \texttt{weights=TRUE}
option to \texttt{dksTrain}, the genes will be weighted according to the
$\log_{10}$ of their mean relative rank in each class. More specifically, the
weight for gene $i$ and class $j$ is
\begin{equation*}
w_{ij} = -\log_{10}\frac{\bar{R}_{ij}}{n}
\end{equation*}
where $\bar{R}_{ij}$ is the \emph{average rank} of gene $i$ across all
samples of class $j$, and $n$ is the total number of genes.
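To make this formula concrete, the sketch below computes such a weight matrix
by hand, ranking genes within each sample so that the most highly expressed
gene receives rank 1 (the direction implied by the discussion that follows).
It is purely illustrative; in practice the package computes the weights for
you (via \texttt{weights=TRUE} or \texttt{dksWeights}), and its internal code
may differ in detail.

<<eval=FALSE>>=
## Illustrative hand computation of the weight matrix described above
## (use weights=TRUE or dksWeights() in practice):
data  <- exprs(eset)
cls   <- pData(eset)[,1]
ranks <- apply(data, 2, function(x) rank(-x))   # rank 1 = highest expression
wt    <- sapply(split(colnames(data), cls), function(s)
                -log10(rowMeans(ranks[, s, drop=FALSE]) / nrow(data)))
@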
Genes with the highest absolute expression in a given class therefore receive
the largest weights and are more likely to be included in the signature for
that class. For example:

<<eval=FALSE>>=
tr <- dksTrain(exprs(eset), class=pData(eset)[,1], type="up", weights=TRUE)
cl <- dksSelectGenes(tr, n=100)
pr <- dksClassify(exprs(eset), cl)
plot(pr, actual=pData(eset)[,1])
@

The results are plotted in figure \ref{fig:weighted}. As you can see, the
range of KS scores across all samples and classes is now much more
consistent, and the resulting classification is much better.

\begin{figure}[ht]
\centering
<<fig=TRUE, echo=FALSE>>=
tr <- dksTrain(exprs(eset), class=pData(eset)[,1], type="up", weights=TRUE)
cl <- dksSelectGenes(tr, n=100)
pr <- dksClassify(exprs(eset), cl)
plot(pr, actual=pData(eset)[,1])
@
\caption{Weighted KS analysis.}
\label{fig:weighted}
\end{figure}

Alternatively, you may provide your own weight matrix, based on the metric of
your choosing, as the argument to \texttt{weights}. This matrix must have one
column for each possible value of \texttt{class} and one row for each gene in
\texttt{eset}. NAs are handled gracefully by discarding any gene for which
any column of the corresponding row of \texttt{weights} is NA. It is
important to note that for \texttt{type='down'}, or for the down component of
\texttt{type='both'}, the weight matrix will be inverted as
\texttt{1 - weights}. Therefore, if using one of these methods, the weight
matrix should have values in the range 0--1.

Calculating the weight matrix is somewhat time consuming. If many calls to
\texttt{dksTrain} are required in the course of some sort of optimization
procedure, the user may want to calculate the weight matrix just once. This
can be done as follows:

<<eval=FALSE>>=
wt <- dksWeights(eset, class=1)
@

The resulting weight matrix can then be supplied directly to
\texttt{dksTrain} to avoid recalculating it on each call:

<<eval=FALSE>>=
tr <- dksTrain(eset, class=1, weights=wt)
@

This is a shortcut to speed up validation. However, if some type of
``leave-some-out'' validation is being performed, the user should carefully
consider the implications of calculating the weight matrix on all samples and
then training and testing on subsets. Nevertheless, this may be useful for
initial validation runs.

The final analytic alternative we will consider is converting the data to
ratios based on some rational reference data.

\section{Discriminant analysis with ratios}

As mentioned above, one-color microarray data does not lend itself well to
including highly \emph{downregulated} genes in the classifier, because genes
with very low measured expression levels have very poor signal-to-noise
ratios. (This may be mitigated somewhat by enforcing some reasonable
threshold below which the data are discarded.) Furthermore, as we saw above,
basing classifiers on the raw expression level can lead to problems when
genes highly expressed in one class are also highly expressed in other
classes (albeit slightly less so). Bidirectional discriminant analysis
(i.e., including both up- and downregulated genes) can be performed on ratio
data, where the ``downregulated'' genes are not necessarily of minuscule
absolute intensity, but are simply much smaller relative to the reference.
Such ratio data may either arise in the course of two-color microarray
experiments, or may be generated by dividing one-color data from the samples
of interest by the mean expression of some reference samples. By converting
to ratios, it is not only easier to include both up- and downregulated genes,
but we are now examining not raw expression level but \emph{change} in
expression level relative to the reference. This is often what is of greatest
biological interest. (Note that very small values should still be excluded
from the data prior to calculating ratios, to avoid artifactually enormous
ratios.)

To illustrate, we will transform our data into ratio data by dividing the
osteoarthritis and rheumatoid arthritis samples by the mean expression of
each gene in the normal samples. First we calculate the mean values for the
normal samples:

<<eval=FALSE>>=
ix.n <- which(pData(eset)[,1] == "normal")
data <- exprs(eset)
data.m <- apply(data[,ix.n], 1, mean, na.rm=TRUE)
@

Now we drop the normals from our data set and calculate the ratios
(expression relative to average normal expression) using the \texttt{sweep}
function.

<<eval=FALSE>>=
data <- data[,-ix.n]
data.r <- sweep(data, 1, data.m, "/")
@

Finally, we convert to $\log_2$ space and perform the discriminant analysis
and classification of our test cases.

<<eval=FALSE>>=
data.r <- log(data.r, 2)
tr <- dksTrain(data.r, class=pData(eset)[-ix.n,1], type="both")
cl <- dksSelectGenes(tr, n=100)
pr <- dksClassify(data.r, cl)
plot(pr, actual=pData(eset)[-ix.n,1])
@

\begin{figure}[ht]
\centering
<<fig=TRUE, echo=FALSE>>=
ix.n <- which(pData(eset)[,1] == "normal")
data <- exprs(eset)
data.m <- apply(data[,ix.n], 1, mean, na.rm=TRUE)
data <- data[,-ix.n]
data.r <- sweep(data, 1, data.m, "/")
data.r <- log(data.r, 2)
tr <- dksTrain(data.r, class=pData(eset)[-ix.n,1], type="both")
cl <- dksSelectGenes(tr, n=100)
pr <- dksClassify(data.r, cl)
plot(pr, actual=pData(eset)[-ix.n,1])
@
\caption{Bidirectional analysis of ratio data.}
\label{fig:kspr}
\end{figure}

The results are shown in figure \ref{fig:kspr}. And here is the summary
information for this classifier:

<<>>=
summary(pr, actual=pData(eset)[-ix.n,1])
show(pr)
@

\section{Significance testing}

Once you have identified which signature has the maximum score in a given
sample, you will likely want to determine whether that score is significantly
elevated. The distribution of KS scores generated by this package tends to
follow a gamma distribution, but the parameters of that distribution vary
depending on the size of the signature and the total number of genes.
Therefore, we take a bootstrapping approach to generate an estimated
distribution and then identify the gamma distribution that best fits the
estimate. To perform the bootstrap, the sample classes are randomly permuted
and signatures are then generated for these bootstrap classes. We take this
approach because it preserves the relationships between genes (as opposed to
generating gene signatures by randomly selecting genes). Each time this is
done, we obtain $k \times c$ bootstrap samples, where $k$ is the number of
samples and $c$ is the number of classes. The entire process is repeated
until the requested number of samples has been generated. The function
returns a function that is \texttt{1-pgamma(x, ...)}, where \texttt{...}
represents the optimized gamma distribution parameters identified by the
bootstrap and fitting procedure. You can then simply call this function with
one or more KS scores to calculate the p-values.
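Conceptually, the returned object behaves like a closure over the fitted
gamma parameters. The sketch below conveys the idea using
\texttt{MASS::fitdistr} on a hypothetical vector of bootstrap scores
(\texttt{boot.scores}); it is illustrative only and is not the fitting code
used inside \texttt{dksPerm}.

<<eval=FALSE>>=
## Illustrative version of the idea behind the returned p-value function:
## fit a gamma distribution to bootstrapped KS scores and return the
## upper-tail probability (boot.scores is a hypothetical numeric vector).
library(MASS)
fit <- fitdistr(boot.scores, "gamma")
pvalue.f <- function(x) 1 - pgamma(x, shape=fit$estimate["shape"],
                                   rate=fit$estimate["rate"])
@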
To generate this function, you must provide an \texttt{ExpressionSet} (for
use in bootstrapping), the class specification for the samples in that set
(which will then be permuted), and the number of genes in each signature
(\texttt{n}).

<<echo=FALSE>>=
pvalue.f <- pv.f
@

<<>>=
pvalue.f <- dksPerm(eset, 1, type="both", samples=500)
@

That is not nearly enough samples to obtain a reasonable estimate of the
gamma distribution (1,000--10,000 is more like it), but it will suffice for
this demonstration. (Generating several thousand samples takes a few
minutes.) Now let's calculate the estimated p-values for our predicted
classes from before:

<<>>=
pvalue.f(pr@predictedScore)
@

It appears that most of the scores leading to the predicted classes are
unlikely to be the result of random variation. (In reality, small bootstrap
runs tend to understate the significance, that is, to overestimate the
p-values, in this context; a run with a much larger value of \texttt{samples}
will produce smaller p-values.) When many classes are examined, appropriate
controls for multiple comparisons should be considered.

Note that for the resulting p-value function to meaningfully describe the
probability of obtaining the observed scores, the parameters provided to
\texttt{dksPerm} must match those used when classifying with
\texttt{dksClassify}. Different settings will produce different distributions
of scores, so you must generate a probability function for each set of
parameters and each training dataset (unless those datasets can be assumed to
come from the same underlying distribution).

\section{Classification using your own gene signatures}

If you have predefined signatures (established empirically or by some other
methodology), you can still calculate their enrichment in test samples using
\texttt{dksClassify}. That function requires an object of type
\texttt{DKSClassifier}, and the package provides a utility function to create
this object from your list of gene ids. Rather than create a separate list of
genes for each class, you simply provide a single list of gene ids and a
factor indicating which class each gene belongs to. Note, however, that you
must provide separate lists for upregulated and downregulated
genes---although you may provide only one or the other if you wish.

As an example, we will create some arbitrary signatures and perform
classification with them. In the example below, \texttt{sig.up} would be the
meaningful set of (upregulated) gene ids you have pre-identified, and
\texttt{cls} would be the class to which each gene in your signature belongs.

<<eval=FALSE>>=
cls <- factor(sample(pData(eset)[,1], 300, replace=TRUE))
sig.up <- sample(rownames(exprs(eset)), 300)
classifier <- dksCustomClass(upgenes=sig.up, upclass=cls)
pr.cust <- dksClassify(eset, classifier)
@

If you were to plot this prediction object with
\texttt{actual=pData(eset)[,1]}, you would note that the classification is
very poor---which is reassuring, since this is a random classifier.

\section{Accessing slots for downstream analysis}

Since the classes defined in this package are not complex, we have not
bothered (yet) to write accessors. The reader is referred to the
documentation for further details. Suffice it to say here that the most
useful slots for the user are likely to be those of \texttt{DKSPredicted}.
For example, we can construct a useful table for downstream analysis:

<<>>=
results <- data.frame(pr@predictedClass, pr@scoreMatrix)
results
@

\bibliography{dualks}

\end{document}