%\VignetteIndexEntry{LPEadj test for microarray data with small number of replicates} %\VignetteKeywords{Local pooled error, replicates} %\VignetteDepends{LPEadj} %\VignettePackage{LPEadj} \documentclass[12pt]{article} \usepackage{amsmath} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{url} \usepackage{natbib} \usepackage{isorot} \usepackage{epsfig} \usepackage{fullpage} % standard 1 inch margins \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\code}[1]{\texttt{#1}} \newcommand{\email}[1]{\texttt{#1}} %%% Hyperlinks for ``PDF Latex'' : \ifx\pdfoutput\undefined%%--- usual ``latex'' : %% Stuff w/out hyperref \else%%---------------------- `` pdflatex '' : -- still gives funny errors \RequirePackage{hyperref} %% The following is R's share/texmf/hyperref.cfg : %% Stuff __with__ hyperref : \hypersetup{% %default: hyperindex,% colorlinks,% %default: pagebackref,% linktocpage,% %%plainpages=false,% linkcolor=Green,% citecolor=Blue,% urlcolor=Red,% pdfstartview=Fit,% pdfview={XYZ null null null}% } \RequirePackage{color} \definecolor{Blue}{rgb}{0,0,0.8} \definecolor{Green}{rgb}{0.1,0.75,0.1} \definecolor{Red}{rgb}{0.7,0,0} %% ESS JCGS v2 : %%\hypersetup{backref,colorlinks=true,pagebackref=true,hyperindex=true} %%\hypersetup{backref,colorlinks=false,pagebackref=true,hyperindex=true} \fi \usepackage{Sweave} \begin{document} \title{A correction for the LPE statistical test} \author{ Carl Murie \email{}, \\ Robert Nadon \email{} } \maketitle \tableofcontents \section{Introduction} LPEadj is a correction for the LPE statistical test \cite{Jain:2003} which consists of two additions to the LPE method. The LPE package documentation is still correct with the exception of the two additions listed below and should be consulted for more information on the LPE method. \\ \\ The correction is in two parts. See \cite{Mur:2008} for more information on the correction. \begin{enumerate} \item The LPEadj method discontinues the LPE practice of setting all variances below the maximum variance in the ordered distribution of variances to the maximum variance. In certain cases this practice can set many variances to the maximum and lower the performance of this algorithm. If the assumption that there are only a few low variances to be adjusted is correct then it may be safe to use this procedure. This option is controlled by the doMax parameter (default is FALSE). \item The LPEadj method replaces the $\pi/2$ adjustment of the variance with an empirically estimated adjustment based on the sample size of the data. The empirical adjustment values have been estimated for replicate sizes up to 10. This option is controlled by the doAdj parameter (default is TRUE). \end{enumerate} The top level function LPEadj executes the first and second step by default. This is the recommended manner in which LPEadj should be run. One can use steps 1 and 2 independently if desired. \\ \section{LPE variance adjustment} The LPE method pools variance estimates of genes with similar intensities in order to gain an improved error estimate and increased degrees of freedom. A calibration curve of variance versus mean intensity is generated for each group and the gene specific median intensity is used to obtain the gene's variance estimate from the calibration curve. It has been shown that the sampling variability of the median is proportionally higher (by $\pi/2$) than that of the mean \cite{Moo:1974}. Accordingly, a multiplicative adjustment of $\pi/2$ is applied to the variance estimate obtained from the calibration curve for the purpose of statistical testing. \\ \\ The LPE z-statistic is as follows: \begin{equation} z = \frac{Med_1 - Med_2}{\sigma_{pool}} \end{equation} \noindent where \begin{equation} \sigma^2_{pool} = \frac{\pi}{2}(\frac{\sigma^2_1(Med_1)}{n_1} + \frac{\sigma^2_2(Med_2)}{n_2}) \end{equation} $\sigma^2_i(Med_i)$ are the variances derived from the calibration curve using the median of the gene intensites for a particular group. $n$ is the number of replicates in the groups (assuming equal sample size). The associated probability of the z-statistic under the null hypothesis is calculated by reference to the standard normal distribution. \par \begin{figure}[tbp] \includegraphics[]{fig//MeanMedianVar3.jpg} \caption{The ratio of the variance of sampling medians over sampling means across a range of sample sizes. The Sampling variance of mean line is the variance of taking the mean of a random sample from a standard normal distribution. This line corresponds to the Central Limit Theorem, which states that the sample variance of a distribution of means is $\sigma^2/N$. The Sampling variance of median line is the variance of taking the median from the same random samples. The Sampling variance of median/Sampling variance of mean line is the variance of the median divided by the variance of the mean at each sample size. The sampling was repeated 1000 times for each sample size (ranging from 3 to 1000).} \label{fig:Ratio} \end{figure} The Mood \cite{Moo:1974} proof shows that with normal data the ratio of the squared standard error of the median relative to that of the mean is asymptotically $\pi/2$. Figure \ref{fig:Ratio} shows that the ratio converges to $\pi/2$ when the sample size is large, around 100, but is less than $\pi/2$ when the sample sizes are small, from three to ten. The ratio of variances at small sample sizes also oscillates lower to higher depending on whether the sample size is even or odd. This fluctutation is due to the difference in obtaining the median with even and odd sample sizes. The middle value of the ordered distribution is used as the median with odd sample sizes while the mean of the two middle values of the ordered distribution is used with even sample sizes. There is higher variability when taking the middle value of a distribution (with odd number of samples) than taking the average of the two middle values (with even number of samples). \section{Modification of LPE Method} The use of an empirically estimated variance ratio adjustment, $c_i$, based on sample size can correct the bias caused by the $\pi/2$ adjustment. The $\pi/2$ term in Equation 1 is replaced by the empirically generated ratio of the variance of sampling a median over the variance of sampling a mean. Equation 1 then becomes: \begin{equation} \sigma^2_{pool} = c_1\frac{\sigma^2_1(Med_1)}{n} + c_2\frac{\sigma^2_2(Med_2)}{n} \end{equation} The parameters, $c_1$ and $c_2$, are the ratio of variances of sampling the median and mean based on the number of replicates for each group. $c_1$ and $c_2$ are the adjust1 and adjust2 variables in the calculateLpeAdj function. \begin{figure}[tbp] \centering \includegraphics[]{fig//FPRPvalHist10Reps4.jpg} \caption{(a) False positive rates for the LPE and adjusted LPE methods using simulated data with no differentially expressed genes evaluated at p $\le$ .05 threshold. The LPE showed variable and low false positive rates. In contrast, the adjusted LPE showed appropriate false positive rate for all sample sizes. (b) The adjusted LPE, but not the LPE, shows the theoretically expected uniform p-value distribution. Each data set had 10000 genes with each gene's replicate intensity drawn from a $N(\mu, 0.1)$ distribution. $\mu$ was drawn from a $N(7,1)$ distribution. } \label{fig:FPR} \end{figure} Figure \ref{fig:FPR} shows that the LPE test has a lower than expected false positive rate (FPR) which fluctuates between even and odd sample sizes (average FPR with odd and even samples sizes is 0.030 and 0.022 respectively) in a similar manner as the ratio of variances in Figure \ref{fig:Ratio}. The LPE method also shows a non-uniform p-value distribution with fewer than expected small p-values. The $\pi/2$ adjustment increases the variance by an overly large proportion and causes the LPE test statistics to be smaller than they should be and skews the p-value distribution leftward. In contrast, the adjusted LPE test produced theoretically expected values. \par \begin{figure}[tbp] \centering \includegraphics[]{fig//HGU95GreyScale1.jpg} \caption{P-value histograms and boxplots of FPR, TPR, and pAUC from the LPE and adjusted LPE methods applied to the HGU95 latin square data set. The data were normalized using six different normalization methods (labeled by row). \vspace{-5mm}} \label{fig:Power} \end{figure} Figure \ref{fig:Power} summarizes the results of the LPE and adjusted LPE methods applied to the HGU95 Affymetrix spike-in data set (www.affymetrx.com). The HGU95 data is based on a 14 x 14 Latin Square design of ``spiked-in'' transcripts (14 concentrations per microarray chip x 14 groups) with three replicates for each group. The concentrations for the ``spiked-in'' transcripts were doubled for each consecutive group (0 and 0.25 to 1024 pM inclusive). To assess the performance of the statistical tests we used the FPR, the true positive rate (TPR, which is the proportion of transcipts correctly identified as being differentially expressed), and the partial area under the curve (pAUC, which measures the area under a Receiver Operator Characteristic curve (ROC) below a false positive cutoff of 0.05). The pAUC has a value between 0 (worst performance) and 1 (perfect performance). \par \section{Example} The easiest way to apply the LPEadj statistical test is to use the LPEadj function. Setting doMax to false (the default value) stops the LPE method from setting the variances of low intensity genes to the maximum variance. Setting doAdj to true (the default value) makes LPE use a variance adjustment based on the number of replicates of each group rather than $\pi/2$. The code will print the values used for the variance adjustment. The variance adjustment values for replicate sizes up to 10 are precalculated within the lpeAdj function. For example if there are two groups with three replicates each the following line will be printed. \\ \\ <>= # Loading the library and null dataset (two groups with three # replicates each). library(LPEadj) dat <- matrix(rnorm(6000), ncol=6) # Applying LPE lpe.result <- lpeAdj(dat, labels=c(0,0,0,1,1,1), doMax=FALSE, doAdj=TRUE) @ If you want more control over low level variables you can call calculateLpeAdj directly. Note that in this case the results of this call are the same as the previous call to lpeAdj because the variance adjustment values are identical (the variance adjustment values defined in ADJ.VALUES is the same as the variance adjustment values defined in lpeAdj). <>= # Loading the library and null dataset (two groups with three # replicates each) library(LPEadj) dat <- matrix(rnorm(6000), ncol=6) ADJ.VALUES <- c(1, 1, 1.34585905516761 ,1.19363228146169 ,1.436849413109 ,1.289652132873 ,1.47658053092781 ,1.34382984852146 ,1.49972130857404, 1.3835405678718) # calculate base line error distributions var1 <- adjBaseOlig.error(dat[,1:3], setMax1=FALSE, q=.05) var2 <- adjBaseOlig.error(dat[,4:6], setMax1=FALSE, q=.05) # The correct variance adjustments can be fetched using the replicate # number for each group as in index for the ADJ.VALUES vector. # eg: ADJ.VALUES[n] if there are n replicates in a group results <- calculateLpeAdj(dat[,1:3],dat[,4:6],var1,var2, probe.set.name=c(1:1000), adjust1=ADJ.VALUES[3], adjust2=ADJ.VALUES[3]) @ \begin{thebibliography}{10} \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi \expandafter\ifx\csname url\endcsname\relax \def\url#1{{\tt #1}}\fi \bibitem{Jain:2003} Jain et.\ al. \newblock {Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays}, \newblock {\em Bioinformatics}, 2003, Vol 19, No. 15, pp: 1945-1951. \bibitem{Moo:1974} Mood et.\ al. \newblock {Introduction to the theory of statistics} \newblock{Mcgraw-Hill, New York, 3rd Ed, 1974} \bibitem{Mur:2008} Murie et.\ al. \newblock {A correction for estimating error when using the Local Pooled Error statistical test}, \newblock {\em Bioinformatics}, in press. \end{thebibliography} \end{document}