%\VignetteIndexEntry{geneClassifiers introduction} \documentclass{article} \usepackage[utf8]{inputenc} \usepackage[letterpaper, portrait, margin=0.8in]{geometry} \usepackage{listings} \usepackage[colorlinks=true, urlcolor=blue, linkcolor=blue, citecolor=blue]{hyperref} \usepackage{color} \definecolor{dkgreen}{rgb}{0,0.6,0} \definecolor{gray}{rgb}{0.5,0.5,0.5} \definecolor{mauve}{rgb}{0.58,0,0.82} \usepackage{verbatim} \usepackage{graphicx} \usepackage{caption} \usepackage{amsmath} \lstset{frame=tb, language=R, aboveskip=3mm, belowskip=3mm, showstringspaces=false, columns=flexible, basicstyle={\small\ttfamily}, numbers=none, numberstyle=\tiny\color{gray}, keywordstyle=\color{blue}, commentstyle=\color{dkgreen}, stringstyle=\color{mauve}, breaklines=true, breakatwhitespace=true, tabsize=3 } \usepackage{hyperref} \begin{document} \title{Package: geneClassifiers (Version 1.0.0)} \author{R.Kuiper} \date{October 27, 2016} \maketitle <>= library(geneClassifiers) @ Combining gene expression profiling data with survival data has led to the development of robust outcome predictors (gene classifiers).This package provides a method for running gene classifiers generating patient specific predictive outcomes. This package is intended to support and enable research. The workflow is illustrated in Figure \ref{Fig:workflow}. The raw gene expression data obtained by microarray experiments is normalized using existing techniques (independent of this package). The choice of normalization method is dictated by the classifier. Some classifiers were developed using MAS5.0. In that case, the data to be classified should be normalized using MAS5.0. Normalization is followed by preprocessing (this package) and generating scores/classifications (this package). This package is suitable only for datasets of at least 20 patients. \begin{figure}[h] \centering \includegraphics[trim=0cm 0cm 0cm 0cm, clip=true, scale=0.5]{Figures/WorkFlow1.pdf} \caption{The workflow of the geneClassifiers package. The raw gene expression data is normalized. This normalized data is used as input in the geneClassifiers package. The processes with their relevant functions are shown. \label{Fig:workflow}} \end{figure} \section{Classifiers} \label{sec:Classifiers} The currently implemented list of classifiers can be obtained with the command: <<>>= showClassifierList() @ To find more information on a specific classifier (e.g. EMC92),the classifier parameters can be obtained by <<>>= EMC92Classifier<-getClassifier("EMC92") EMC92Classifier HM19Classifier<-getClassifier("HM19") HM19Classifier @ This is an object of class 'ClassifierParameters' which stores classifier related information, such as probe-sets used and their weights, means, standard deviations and covariance structure as observed in the classifiers' training data, and the description of the procedure on how to preprocess new data prior to application of the classifier. Further information can be obtained from this object e.g. obtaining the weights used in a classifier: <<>>= getWeights(EMC92Classifier)[1:10] @ or the decision boundaries used to decide which class a sample score belongs to <<>>= getDecisionBoundaries(HM19Classifier) @ or the 'eventChain' which gives information on preprocessing: <<>>= getEventChain(EMC92Classifier) @ \section{Data to be classified} \label{sec:data} The input data for the 'geneClassifiers' package is a Bioconductor ExpressionSet which has been prenormalized using existing methods such as MAS5.0 or GCRMA. For more information on these methods see the Bioconductor 'affy' package. The 'geneClassifiers' package contains an example dataset of MAS5.0 normalized (target value = 500) gene expression data of 25 multiple myeloma patients from the HOVON65/GMMG-HD4 trial (Pieter Sonneveld et al., J Clin Oncol, 2012) <<>>= library(Biobase) data(exampleMAS5) class(exampleMAS5) #an object of class ExpressionSet dim(exampleMAS5) preproc(experimentData(exampleMAS5)) @ To import this data set into the 'geneClassifiers' package, the setNormalization function is used: <<>>= fixedData <- setNormalizationMethod( exampleMAS5, method="MAS5.0", targetValue = 500 ) fixedData @ Nb the targetValue = 500 is only required in the example (see below). To get reliable results in the classification, the function depends on unmanipulated output from the normalization methods, i.e. read in CEL files into affy functions, obtain ExpressionSet, and use this set (without modification) for obtaining classifier scores. The function can detect deviations such as subsets of data sets or log transformed data, but detection of deviations is not guaranteed. When providing an ExpressionSet with all probe-sets still included, the 'targetValue=500' argument is not necessary because the function is able to extract the value from the data. In the example the number of probe-sets was reduced due to space considerations, the MAS5.0 target value cannot be obtained from the data so that the argument has to be provided. See '?setNormalizationMethod' for more details. \section{Performing classifications} To perform the classification using a classifier described in section \ref{sec:Classifiers} on the data described in section \ref{sec:data}, the 'runClassifier' function is called using both arguments: <<>>= resultsEMC92 <- runClassifier( "EMC92" , fixedData ) resultsUAMS70 <- runClassifier( "UAMS70", fixedData ) resultsEMC92 resultsUAMS70 @ The scores and classifications can be extracted using the 'getScores' and 'getClassifications' function <<>>= data.frame( "score_EMC92" = getScores( resultsEMC92 ), "class_EMC92" = getClassifications( resultsEMC92 ), "score_UAMS70" = getScores( resultsUAMS70 ), "class_UAMS70" = getClassifications( resultsUAMS70 ) ) @ \section{Caution: non standard situations} The geneClassifiers package performs a batch correction by applying a linear transformation of the probe-set means and standard deviations to the values observed in the classifiers' training set. In order to accurately do this the data must contain a sufficient number of samples (n>=20) to estimate the means and standard deviations. If less samples are available, the 'runClassifier' function will give a warning and suggest to consider setting 'do.batchcorrection = FALSE'. Please note this will most likely result in invalid classifications (or certainly different classifications). Besides the requirements of a matching normalization method between data and classifier and sufficient samples, the assumption is that the probe-sets needed for classification are present in the data. If this is not true, simply ignoring the missing probe-set may heavily bias the results. Therefore, when detecting missing probe-sets, the 'run-classifier' function will give an error message and suggest to consider using the argument 'allow.reweighted = TRUE'. This will reweight the weightings for the probe-sets which are present, based on the covariance structure of the classifiers' trainings data. See the vignette 'MissingCovariates' for more information. Please note this is not how the classifiers are intended and consequentially will result in different classifications. <>= resultsEMC92.reWeighted <- runClassifier( "EMC92" , fixedData[1:70,] , allow.reweighted=TRUE ) resultsEMC92.reWeighted plot( x = getScores(resultsEMC92), y = getScores(resultsEMC92.reWeighted), xlab = "complete", ylab = "reweighted", main = "EMC92 scores", pch = 21, bg ='black' ) lines(c(-10,10),c(-10,10),col=2,lty=2) abline( v = getDecisionBoundaries( getClassifier( resultsEMC92 )), h = getDecisionBoundaries( getClassifier( resultsEMC92.reWeighted)), col='red' ) @ <<>>= sessionInfo() @ \end{document}