%\VignetteIndexEntry{Simulation Study for VSN Add-On Normalization and Subsample Size}
%\VignetteKeywords{Preprocessing, VSN, Affymetrix}
%\VignetteDepends{affy, vsn, affyPara}
%\VignettePackage{affyPara}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\documentclass[12pt]{article}

\usepackage[%
baseurl={http://www.bioconductor.org},%
pdftitle={Simulation Study for VSN Add-On Normalization and Subsample Size},%
pdfauthor={Markus Schmidberger},%
pdfsubject={affyPara, vsn},%
pdfkeywords={Bioconductor},%
plainpages,pdftex]{hyperref}

\usepackage{subfigure}

\SweaveOpts{eps=FALSE}

\author{Markus Schmidberger
	\footnote{Division of Biometrics and Bioinformatics, IBE, University of Munich, 81377 Munich, Germany}
	\footnote{Email: \texttt{schmidb@ibe.med.uni-muenchen.de}}
	\and Wolgang Hartmann \and Ulrich Mansmann$^*$
}

\title{Simulation Study for \Rpackage{vsn} Add-On Normalization and Subsample Size}

\begin{document}

\maketitle \tableofcontents \newpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
Variance Stabilization Normalization (VSN) is a model-based method to preprocess microarray 
intensity data \cite{Huber2002, Huber2007}. The method uses a robust variant of the 
maximum-likelihood estimator for the stochastic model of microarray data described in the 
references. The model incorporates data calibration (normalization), a model for the dependence 
of the variance on the mean intensity, and a variance stabilizing data transformation. 
The method is available in the Bioconductor package \Rpackage{vsn} and is implemented for 
normalising microarray intensities, both between colors within array, and between arrays.

This simulation study analyzes the influence of different numbers of subsamples and of 
add-on normalization to the normalized intensities. Biased preprocessing results will be 
demonstrated with two public available data sets.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Data}
For this simulation study two public available data sets from the ArrayExpress database 
(\url{http://www.ebi.ac.uk/microarray-as/ae}) were used.
\begin{description}
	\item[E-GEOD-11121:] Transcription profiling of human breast cancer cohort reveals 
	the humoral immune system has a key prognostic impact in node-negative breast cancer.\\ 
	Data: 200 HG-U133A microarray chips.\\
	Citation: The humoral immune system has a key prognostic impact in node-negative 
	breast cancer. Marcus Schmidt, Daniel B\"ohm, Christian von T\"orne, Eric Steiner, 
	Alexander Puhl, Henryk Pilch, Hans-Anton Lehr, Jan G Hengstler, Heinz K\"olbl, 
	Mathias Gehrmann. Cancer Res 68(13):5405-13 (2008)
	\item[E-GEOD-12093:] Transcription profiling of human breast cancer samples that were 
	treated with tamoxifen identifier a 76-gene signature defines high-risk patients 
	that benefit from adjuvant tamoxifen therapy.\\
	Data: 136 HG-U133A microarray chips.\\
	Citation: The 76-gene signature defines high-risk patients that benefit from 
	adjuvant tamoxifen therapy. Sieuwerts Zhang, Casey McGreevy, Paradiso Cufer, 
	Span Harbeck, Crowe Hicks, Budd Tubbs, Sweep Lyons, Schittulli Schmitt, Talantov 
	Golouh, Foekens Wang. Breast Cancer Res Treat -- (2008)
\end{description}

To get the raw data (CEL files) of the experiments the \Rpackage{ArrayExpress} package 
can be used or the files can be downloaded directly from the ArrayExpress web interface.
<<eval=FALSE>>=
library(ArrayExpress)
getAE(c('E-GEOD-11121','E-GEOD-12093'), extract=FALSE)
@

For quality control the \Rpackage{arrayQualityMetrics} package was used. This package uses 
six different methods for quality control and creates a qualtiy report with a summary table 
into a HTML output file .
<<eval=FALSE>>=
library(arrayQualityMetrics)
celDir <- 'PATH'
celf <- list.celfiles(celDir, full.names=T)
ab <- read.affybatch(celf)
arrayQualityMetrics(ab, outdir='QA')
@
Arrays with more than two potential problems (three stars or more in the summary table from 
the \Rpackage{arrayQualityMetrics} report) were excluded from the study. In the dataset 'E-GEOD-11121' 
nine arrays and in 'E-GEOD-12093' eight arrays were removed (see Table~\ref{removed_arrays}).
\begin{table}[htp]
	\begin{small}
	\centering
	\begin{tabular}{ll}
	E-GEOD-11121-raw-cel-1679201568.cel & E-GEOD-12093-raw-cel-1681274912.cel\\
	E-GEOD-11121-raw-cel-1679200783.cel & E-GEOD-12093-raw-cel-1681275553.cel\\
	E-GEOD-11121-raw-cel-1679200049.cel & E-GEOD-12093-raw-cel-1681276139.cel\\
	E-GEOD-11121-raw-cel-1679199817.cel & E-GEOD-12093-raw-cel-1681276420.cel\\
	E-GEOD-11121-raw-cel-1679199644.cel & E-GEOD-12093-raw-cel-1681276775.cel\\
	E-GEOD-11121-raw-cel-1679198997.cel & E-GEOD-12093-raw-cel-1681276985.cel\\
	E-GEOD-11121-raw-cel-1679198126.cel & E-GEOD-12093-raw-cel-1681277333.cel\\
	E-GEOD-11121-raw-cel-1679197959.cel & E-GEOD-12093-raw-cel-1681277423.cel\\
	E-GEOD-11121-raw-cel-1679197894.cel & \\
	\end{tabular}
	\caption{List of removed arrays due to quality problems identified with the 
	\Rpackage{arrayQualityMetrics} package.}
	\label{removed_arrays}
	\end{small}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Number of Subsamples}
As default for \Robject{AffyBatch} objects the \Rfunction{vsn2()} function uses 30.000 
subsamples (rows=probes) for the calculation of the model parameters using the maximum-likelihood 
estimator. However on current chip types there are more than 500.000 probes and therefore in default less 
than 5\% will be used for the estimation of the model. Long computation times and the main memory 
requirements are the main reasons for the sample size reduction.
<<echo=T, eval=F>>=
library(vsn)
fit <- vsn2(ab, subsample=100000)
x <- predict(fit, newdata=ab)
@

Figure~\ref{subsamples} plots the values of the offset and slope parameters for 25 and 50 arrays and 
different numbers of subsamples. 
\begin{figure}
\centering
\subfigure[E-GEOD-11121]{
		\includegraphics[height=6.5cm]{fig/E-GEOD-11121/vsnSubsamples_25.pdf}
	}
	\subfigure[E-GEOD-12093]{
		\includegraphics[height=6.5cm]{fig/E-GEOD-12093/vsnSubsamples_50.pdf}
	}
	\caption{Values of the vsn coefficients for different numbers of subsamples.}
	\label{subsamples}	
\end{figure}
The dashed red line is the default subsample with 30.000 rows and the dotted 
pink line indicates the parameters using all probes. Depending on the number of 
subsamples the parameters are changing very strong. The default subsample is a good 
approximation for several other subsamples, but there are strong outliers too. This 
indicates that in a bad case the model parameters will be estimated on an unsuitable 
subsample.

Figure~\ref{subsamplesBoxes} shows the boxplots of the intensities for every array. Blue boxes 
are calculated with 30.000 subsamples (default) and red boxes with 420.000 subsamples. 
\begin{figure}
	\centering
	\subfigure[E-GEOD-11121]{
		\includegraphics[height=6.5cm]{fig/E-GEOD-11121/vsnSubsamplesBoxplot_25.pdf}
	}
	\subfigure[E-GEOD-12093]{
		\includegraphics[height=6.5cm]{fig/E-GEOD-12093/vsnSubsamplesBoxplot_50.pdf}
	}
	\caption{Boxplots of intensities after vsn normalization with two different numbers of subsamples.}
	\label{subsamplesBoxes}	
\end{figure}
The boxplots reflect a good variance stabilization for all arrays. But due to the differences 
in the estimated parameters (see Figure~\ref{subsamples}) there are strong differences in the 
location of the means and in the outliers.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Add-On Normalization}
Add-On normalization allows to normalize a new microarray with respect to a normalization 
performed for a set off arrays. The \Rpackage{vsn} package offers this technique which is needed to 
apply gene expression profiles for classification or prognosis to a new patient. 

In the simulation study 20 arrays were normalized with \Rfunction{vsn2()}. After this two times 
20 arrays were added using the vsn add-on normalization. Additionally the same 60 arrays were 
normalized with vsn together (using the same subsample) and than compared to the results of the 
add-on normalization. 

<<echo=T, eval=F>>=
set.seed(1234)
fit_ref <- vsn2(ab[,1:20])
fit_add1 <- vsn2(ab[,21:40], reference=fit_ref)
fit_add2 <- vsn2(ab[,41:60], reference=fit_ref)
set.seed(1234)
fit_comp <- vsn2(ab)
@

\begin{figure}
	\centering
	\subfigure[E-GEOD-11121]{
		\includegraphics[height=6.5cm]{fig/E-GEOD-11121/vsnAddon_20-40-60.pdf}
	}
	\subfigure[E-GEOD-12093]{
		\includegraphics[height=6.5cm]{fig/E-GEOD-12093/vsnAddon_20-40-60.pdf}
	}
	\caption{Values of the vsn coefficients compared for complete vsn normalization and add-on normalization 
	with 20 reference arrays and two times 20 arrays add-on.}
	\label{addon}	
\end{figure}
Figure~\ref{addon} shows the influence of the add-on normalization to the model parameters. 
Blue lines are the parameters for the complete vsn normalization and red lines for the reference 
and add-on normalization. For both data sets there are strong differences between the 
parameters, but they have the same structure.

Figure~\ref{addonBoxes} plots the boxplots of the intensity values. 
\begin{figure}
	\centering
	\subfigure[E-GEOD-11121]{
		\includegraphics[height=6.5cm]{fig/E-GEOD-11121/vsnAddonBoxplots_20-40-60.pdf}
	}
	\subfigure[E-GEOD-12093]{
		\includegraphics[height=6.5cm]{fig/E-GEOD-12093/vsnAddonBoxplots_20-40-60.pdf}
	}
	\caption{Boxplots of intensities for complete vsn normalization 
		and vsn Add-On normalization.}
	\label{addonBoxes}	
\end{figure}
The boxplots reflect a good variance stabilization for all arrays. But due to the differences 
in the estimated parameters (see Figure~\ref{addon}) there are strong differences in the 
location of the means and in the outliers.

There is no influence from the number of add-on batches, because for every add-on array the 
same reference data (parameters) will be used. The more add-on data used the bigger the 
difference between add-on (+ reference) and complete vsn noramlization gets.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusion}
Same results can be found using more arrays and other data sets. The results are not shown due to 
overflow in the graphics using more than 150 arrays.

Using more than 30.000 subsamples increases the computation time but in theorey increases the accuracy 
of the statistical model too. A misuse of the add-on technique to normalize many patients to a base set 
of a few patients results in a biased preprocessing result. The number of add-on arrays should be not 
bigger than the number of reference arrays.  

Therefore the power of parallel computing or the \Rpackage{affyPara} \cite{Schmidberger2008} 
is required to use vsn normalization with more than 30.000 subsamples and to 
normalize all microarray data together.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{plain}
\bibliography{references}

\end{document}