%\VignetteIndexEntry{Checking gene expression signatures against random and known signatures with SigCheck}
%\VignettePackage{SigCheck}
%\VignetteEngine{utils::Sweave}

\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\newcommand{\reff}[1]{Figure \ref{fig:#1}}

\begin{document}
\SweaveOpts{concordance=TRUE}

\newcommand{\exitem}[3]{\item \texttt{\textbackslash#1\{#2\}} #3 \csname#1\endcsname{#2}.}

\title{Checking gene expression signatures against random and known signatures with \Biocpkg{SigCheck}}
\author{Justin Norden and Rory Stark}
\date{Edited: October 3, 2014; Compiled: \today}
\maketitle

\tableofcontents

\section{Introduction}

A common task in the analysis of genomic data is the derivation of gene expression signatures that distinguish between phenotypes (disease outcomes, molecular subtypes, etc.). However, in their paper ``Most random gene expression signatures are significantly associated with breast cancer outcome'' \cite{Venet2011}, Venet, Dumont, and Detours point out that while a gene signature may distinguish between two phenotype classes, its ultimate uniqueness and utility may be limited. They show that while a specialized feature selection process may appear to determine a unique set of predictor genes, the resulting signature may perform no better than one made up of random genes, or of genes selected at random from all differentially expressed genes. This suggests that the genes in the derived signature may not be particularly informative as to the underlying biological mechanisms. They further show that the gene sets comprising published signatures for a wide variety of phenotypic classes may perform just as well at predicting arbitrary phenotypes; famously, they show that a gene signature distinguishing postprandial laughter predicts the outcome of breast cancers as well as a widely-cited prognostic signature \cite{vantveer}.

The \Biocpkg{SigCheck} package was developed to make it easy to check a gene signature against random and known signatures, and thereby assess how unique its predictive ability really is. It additionally provides the ability to check a signature's performance against permuted data, as a reality check that it is detecting a genuine signal in the original data. This vignette walks through the process of performing these checks.

\section{Example: NKI Breast Cancer Data and the van't Veer Signature}

In order to use \Biocpkg{SigCheck}, you must provide a) some data (an expression matrix), b) some metadata (feature names and class identifiers), and c) a gene signature (a subset of the features). For this vignette, we will use the NKI Breast Cancer dataset \Biocexptpkg{breastCancerNKI} and the associated van't Veer signature, which predicts the likelihood that a patient will develop a distant metastasis \cite{vantveer}. The dataset can be loaded as follows:

<<>>=
library(breastCancerNKI)
data(nki)
nki
@

As can be seen, the \Rcode{nki} data are encapsulated in an \Robject{ExpressionSet} object. At its core is an expression matrix consisting of 24,481 features (microarray probes mapped to genes) and 337 samples (derived from tumor tissue taken from breast cancer patients). Also included is phenotypic (clinical) metadata regarding the patients, such as patient age, tumor grade, expression status of the ER, PR, and HER2 biomarkers, presence of a BRCA mutation, etc.:

<<>>=
varLabels(nki)
@
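For orientation, the expression values and clinical annotations can also be examined directly using the standard \Biocpkg{Biobase} accessors (a quick look around the object; this is not required for the analysis below, so the chunk is not evaluated):

<<eval=FALSE>>=
dim(exprs(nki))        # the expression matrix: features x samples
exprs(nki)[1:3, 1:3]   # a corner of the expression values
pData(nki)[1:3, 1:5]   # the first few clinical annotation fields
@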
The signatures derived by \cite{vantveer} and \cite{vandevijver} are used to predict the Distant Metastasis-Free Survival (DMFS) time. This is presented as a continuous measure indicating the time until the occurrence of a distant metastasis:

<<>>=
nki$t.dmfs
@

DMFS is also presented as a binary class, with a cutoff used to distinguish between patients that had a recurrence event and those that did not:

<<>>=
nki$e.dmfs
@

While the derived signatures were originally used as part of a survival analysis based on the continuous measure, \Biocpkg{SigCheck} uses a pure classification model and hence relies on the binary measure instead. A number of patient samples do not have DMFS data available, so the first step is to exclude these patients from the analysis, which brings the number of samples down to 319:

<<>>=
dim(nki)
nki <- nki[, !is.na(nki$e.dmfs)]
dim(nki)
@

The next step is to provide a gene signature to check. A core function of \Biocpkg{SigCheck} is to compare the performance of a gene signature with that of known gene signatures on the same dataset. To accomplish this, it includes several sets of known signatures. One of the included signatures is the van't Veer signature, which we will use for this example. The known gene signatures included with \Biocpkg{SigCheck} can be loaded as follows:

<<>>=
library(SigCheck)
data(knownSignatures)
names(knownSignatures)
@

There are five sets of gene signatures, including a set of cancer signatures. The van't Veer signature is one of the signatures in the known cancer gene signature set:

<<>>=
names(knownSignatures$cancer)
vantveer <- knownSignatures$cancer$VANTVEER
vantveer
@

The signature is provided in the form of symbolic gene names. These need to be matched to the feature names in the \Robject{ExpressionSet}. While the default annotation is a probe identifier, the \Rcode{nki} dataset provides a number of alternative annotations:

<<>>=
fvarLabels(nki)
@

We will use \Rcode{"HUGO.gene.symbol"} to match the gene names in the van't Veer signature.

\subsection{Running with a validation set}

Optimally, when deriving and testing a gene signature, there are distinct sets of training and validation samples, with the validation samples kept separate from the process used to derive the signature. For this example, we will assume that the final 45 samples comprise the validation set.

With this assumption, all the required elements for an analysis have been identified. \Biocpkg{SigCheck} provides a high-level function, \Rfunction{sigCheck}, that runs all the checks and plots the results. This function accepts the expression data and metadata in the form of an \Robject{ExpressionSet}; an indication of which metadata field defines the two classes and which provides the annotation for the features; a gene signature to check; and an indication of which samples should be treated as the validation set. There are additional optional parameters that default to sensible values. For example, \Rfunction{sigCheck} accepts the number of iterations to run for the random and permuted checks; this is set to a low number (10) by default to ensure a quick run, but in general should be set higher to obtain more reliable p-values.
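Before invoking \Rfunction{sigCheck}, it can be worth confirming how many of the signature's gene symbols are present in the chosen annotation field. This is a minimal sanity check, not part of the \Rfunction{sigCheck} call itself, and the chunk is not evaluated here:

<<eval=FALSE>>=
# how many van't Veer gene symbols match the HUGO gene symbol annotation?
sum(vantveer %in% fData(nki)[["HUGO.gene.symbol"]])
@

The full set of checks can then be run as follows: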
<<eval=FALSE>>=
nkiResults <- sigCheck(nki, classes="e.dmfs", annotation="HUGO.gene.symbol",
                       signature=vantveer, validationSamples=275:319,
                       knownSignatures="cancer", nIterations=1000)
@

Rather than running the 1,000 iterations in the vignette, a results list is included for a run of \Rfunction{sigCheck} with \Rcode{nIterations=1000}:

<<>>=
data(nkiResults)
@

\Rfunction{sigCheck} generates four plots, each showing how the baseline and mode performance compare to the distribution of results for a specific check. The results are best interpreted by looking more closely at what happens when \Rfunction{sigCheck} is invoked.

\subsection{\Rfunction{sigCheckClassifier}}

The first step is to assess the baseline performance of the identified signature. This is accomplished by a call to \Rfunction{sigCheckClassifier}, which uses the features identified in the signature to train a classifier on the training samples (all the samples not designated as validation samples), based on the phenotypic categories. In this case, 274 samples are used to train the classifier, utilizing the default machine learning method (a support vector machine). Next, the classifier is used to predict the phenotype of the validation samples (45 samples in this case). The results can be seen as follows:

<<>>=
nkiResults$checkClassifier
@

The baseline performance of the signature is the proportion of validation samples that are correctly classified; in this case, that value is \Sexpr{as.numeric(format(nkiResults$checkClassifier$sigPerformance,digits=3))} (\Sexpr{nkiResults$checkClassifier$confusion[1,1]+nkiResults$checkClassifier$confusion[2,2]} out of 45). The confusion matrix breaks this down in more detail, showing the numbers of true and false positives as well as true and false negatives.

\Rfunction{sigCheckClassifier} also computes the performance of a theoretical ``mode'' classifier. This is the performance that would result if the classifier always made the same prediction, no matter what the data, with that prediction being the phenotype category that appears most often in the training set. In the current example, more samples in the training set come from patients who did not experience a metastatic recurrence:

<<>>=
sum(nki$e.dmfs[1:274]==0)
sum(nki$e.dmfs[1:274]==1)
@

so the mode classifier would always predict 0. In the validation set, an even higher proportion of the samples come from patients who did not experience a relapse:

<<>>=
sum(nki$e.dmfs[275:319]==0)
sum(nki$e.dmfs[275:319]==1)
@

yielding a mode performance of \Sexpr{format(nkiResults$checkClassifier$modePerformance,digits=3)}.

In each of the results plots, the signature classifier performance is indicated by a solid red vertical line, while the mode classifier performance is indicated by a dotted red vertical line. A p-value is also provided, based on the null hypothesis that the signature classifier is no better at predicting phenotype than the classifiers it is being checked against.

\subsection{\Rfunction{sigCheckRandom}}

<<sigCheckNKIRandom,fig=TRUE,include=FALSE>>=
sigCheckPlot(nkiResults[[2]])
@

\incfig{SigCheck-sigCheckNKIRandom}{.66\textwidth}{Results of \Rfunction{sigCheck} for the NKI Breast Cancer dataset, checking the van't Veer signature against 1,000 random gene signatures.}{(Generated using \Rcode{data(nkiResults)}.)}

The first check compares the performance of the signature against signatures randomly sampled from the dataset. This gives insight into whether the feature selection process used to derive the signature produces better results than simply picking sets of features at random. The number of features in each random signature is equal to the number of features in the identified signature.

Figure~\ref{SigCheck-sigCheckNKIRandom} shows the performance of the van't Veer signature on the NKI breast cancer dataset compared to 1,000 randomly selected signatures. The mean performance of the random signatures is equal to that of the mode classifier. While the identified signature performs better than this, some \Sexpr{sum(nkiResults$checkRandom$performance >= nkiResults$checkClassifier$sigPerformance)} of the random signatures perform as well \emph{or better} than the identified signature. The resulting p-value indicates that the identified signature does not perform significantly better than random signatures.
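The empirical p-value essentially reflects the proportion of checked signatures that perform at least as well as the signature being tested. As a small illustration (not evaluated here, and assuming the pre-computed \Rcode{nkiResults} object has been loaded), this proportion can be recomputed directly from the stored performance values:

<<eval=FALSE>>=
# fraction of random signatures performing as well as or better than the signature
mean(nkiResults$checkRandom$performance >=
     nkiResults$checkClassifier$sigPerformance)
@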
\warning{This should not be taken to suggest that selecting a random signature with better performance on the validation set is a good idea, as it will almost certainly represent \emph{over-fitting} to the validation samples. Indeed, performance on validation samples should \emph{never} be used in selecting features for a signature.}

\subsection{\Rfunction{sigCheckKnown}}

<<sigCheckNKIKnown,fig=TRUE,include=FALSE>>=
sigCheckPlot(nkiResults[[3]])
@

\incfig{SigCheck-sigCheckNKIKnown}{.66\textwidth}{Results of \Rfunction{sigCheck} for the NKI Breast Cancer dataset, checking the van't Veer signature against \Sexpr{length(knownSignatures$cancer)} known cancer gene signatures.}{(Generated using \Rcode{data(nkiResults)}.)}

The next check compares the performance of the signature against known gene signatures that have been used to predict other phenotypes. Figure~\ref{SigCheck-sigCheckNKIKnown} shows how the performance of the van't Veer signature compares to that of other known cancer-related signatures at classifying the validation samples. While the overall performance of the known signatures is skewed higher than that of the random signatures, the mean performance of the known signatures is still equal to the mode performance, and the signature being checked is not among the very best performers. Indeed, some \Sexpr{format(sum(nkiResults$checkKnown$performance >= nkiResults$checkClassifier$sigPerformance)/length(nkiResults$checkKnown$performance)*100,digits=2)}\% of the known signatures perform as well \emph{or better} than the identified signature. The resulting p-value indicates that the specified signature does not perform significantly better than the set of known cancer signatures as a whole.

The known signatures that outperform the signature being checked can be listed as follows:

<<>>=
knownHigh <- nkiResults$checkKnown$performanceKnown >
             nkiResults$checkClassifier$sigPerformance
nkiResults$checkKnown$performanceKnown[knownHigh]
@

\subsection{\Rfunction{sigCheckPermuted}}

<<sigCheckNKIPermuteFeature,fig=TRUE,include=FALSE>>=
sigCheckPlot(nkiResults[[4]])
@

\incfig{SigCheck-sigCheckNKIPermuteFeature}{.66\textwidth}{Results of \Rfunction{sigCheck} for the NKI Breast Cancer dataset, checking the van't Veer signature using permuted expression data.}{(Generated using \Rcode{data(nkiResults)}.)}

<<sigCheckNKIPermuteCategory,fig=TRUE,include=FALSE>>=
sigCheckPlot(nkiResults[[5]])
@

\incfig{SigCheck-sigCheckNKIPermuteCategory}{.66\textwidth}{Results of \Rfunction{sigCheck} for the NKI Breast Cancer dataset, checking the van't Veer signature using permuted category assignments.}{(Generated using \Rcode{data(nkiResults)}.)}

The final two plots show the results of running \Rfunction{sigCheckPermuted} with two different parameter settings. These serve as a reality check, showing how the gene signature performs on randomly permuted data. In the first check, the expression values for each feature (specific rows in the expression matrix) are permuted, so the expression values are maintained but assigned randomly to samples. Each feature is permuted independently.
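Conceptually, this corresponds to shuffling each row of the expression matrix independently. The following sketch illustrates the idea (it is not the internal implementation used by \Rfunction{sigCheckPermuted}, and the chunk is not evaluated):

<<eval=FALSE>>=
# shuffle the values within each feature (row), keeping rows independent
permutedExprs <- t(apply(exprs(nki), 1, sample))
dimnames(permutedExprs) <- dimnames(exprs(nki))
@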
Figure~\ref{SigCheck-sigCheckNKIPermuteFeature} shows the results. For most of the datasets permuted in this manner, the performance of the signature is equal to that of the mode classifier, as the best strategy in the face of such noisy data is to always predict the most frequent phenotype. In a small number of cases the signature performs as well or even better on permuted data than on the real data, but this is expected by chance. Only \Sexpr{sum(nkiResults$checkPermutedFeatures$performance >= nkiResults$checkClassifier$sigPerformance)} of the permuted datasets enable performance as good as or better than that obtained using the original data. The empirical p-value indicates that the specified gene signature performs significantly better on the real data than on the permuted data.

For the second check, the phenotype category assignments are permuted across the samples. Figure~\ref{SigCheck-sigCheckNKIPermuteCategory} shows the results. As with the datasets permuted by feature, the performance of the signature mostly equals that of the mode classifier. Likewise, the signature performs as well or even better on permuted data than on the real data in some cases, as expected by chance. Only a few (\Sexpr{sum(nkiResults$checkPermutedCategories$performance >= nkiResults$checkClassifier$sigPerformance)}) enable performance as good as or better than that obtained using the original data, with a resulting p-value indicating that the specified signature performs significantly better using the correct outcomes rather than randomly assigned ones.

\subsection{Running without a separate validation set using leave-one-out cross-validation on the training set}

In the example so far, we have assumed the presence of a validation set separate from the training samples used to derive the gene signature. While this represents the optimal methodology, a separate validation set may not yet be available when a signature is being developed. In this case, \Biocpkg{SigCheck} will automatically employ a computational strategy known as leave-one-out cross-validation (LOO-XV). Instead of using all the training samples to construct a single classifier, a separate classifier is trained for each sample, using all the other samples. In the current example, if no validation samples are specified, 319 classifiers will be trained, each being used to predict the phenotype category of a single sample using the remaining 318 samples. When checking against other (random or known) signatures, or against permuted datasets, each signature or dataset likewise requires 319 classifiers to be trained. This can be computationally intensive (see the note below on the use of parallel computation in this case).

For example, the performance of the signature over all of the samples can be computed using LOO-XV:

<<>>=
results <- sigCheckClassifier(nki, classes="e.dmfs", signature=vantveer,
                              annotation="HUGO.gene.symbol")
results
@

Note that the mode performance is worse over the entire training set, as the numbers of patients in each category are more balanced.
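The mode classifier simply predicts the most frequent class, so its expected accuracy over the full set of samples is just the frequency of that class. As a quick illustration (one line of base R, not part of \Biocpkg{SigCheck} itself), this should correspond closely to the mode performance reported above:

<<eval=FALSE>>=
# proportion of samples in the most frequent DMFS class
max(table(nki$e.dmfs)) / length(nki$e.dmfs)
@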
\section{Technical notes}

\subsection{Use of \Biocpkg{BiocParallel} and \Rpackage{parallel}}

This note describes how to control the parallel processing in \Biocpkg{SigCheck}. There are two different aspects of \Biocpkg{SigCheck} that are able to exploit parallel processing.

The primary one is when multiple signatures or datasets are being evaluated independently. This includes the \Rcode{nIterations} random signatures in \Rfunction{sigCheckRandom}, the database of known signatures in \Rfunction{sigCheckKnown}, and the \Rcode{nIterations} permuted datasets in \Rfunction{sigCheckPermuted}. In this case, the \Biocpkg{BiocParallel} package is used to carry out these comparisons in parallel. By default, \Biocpkg{BiocParallel} uses \Rpackage{parallel} to run in multicore mode, but it can also be configured to schedule the compute tasks across multiple computers. In the default multicore mode, it will use all of the cores on your system, which can result in a heavy load (especially if there is inadequate memory). You can manually set the number of cores to use as follows:

<<>>=
CoresToUse <- 6
library(BiocParallel)
mcp <- MulticoreParam(workers=CoresToUse)
register(mcp, default=TRUE)
@

which limits the number of cores in use at any one time to six. If you want to use only one core (serial mode), you can set \Rcode{CoresToUse <- 1}, or \Rcode{register(SerialParam())}.

The other aspect of processing that can use multiple processor cores is leave-one-out cross-validation (LOO-XV). In this case, the underlying \Biocpkg{MLInterfaces} package takes care of the parallelization using the \Rpackage{parallel} package. You can set the number of cores that will be used for this as follows:

<<>>=
options(mc.cores=CoresToUse)
@

Note that in the LOO-XV case, as every random or known signature, or permuted dataset, requires parallel evaluation of cross-validated classifiers, the parallelization at the level of \Rcode{nIterations} is disabled automatically.

\section{Acknowledgements}

We would like to acknowledge everyone in the Bioinformatics Core at Cancer Research UK's Cambridge Institute at the University of Cambridge, as well as members of the Ponder group (particularly Kerstin Meyer), for their support and contributions.

\section{Session Info}

<<results=tex>>=
toLatex(sessionInfo())
@

\bibliography{SigCheck}

\end{document}