%\VignetteIndexEntry{Parallelized affy functions for preprocessing}
%\VignetteKeywords{Preprocessing, Affymetrix}
%\VignetteDepends{affy, snow, affydata}
%\VignettePackage{affyPara}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\documentclass[12pt]{article}

\usepackage[%
baseurl={http://www.bioconductor.org},%
pdftitle={Parallelized affy functions for preprocessing},%
pdfauthor={Markus Schmidberger},%
pdfsubject={affyPara},%
pdfkeywords={Bioconductor},%
plainpages,pdftex]{hyperref} %nice urls
\usepackage{url}
\usepackage[square,sort,comma,numbers]{natbib}

%\SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5}

\author{Markus Schmidberger
\footnote{Division of Biometrics and Bioinformatics, IBE, University of Munich, 81377 Munich, Germany}
\footnote{Package maintainer, Email: \texttt{schmidb@ibe.med.uni-muenchen.de}}
\and Ulrich Mansmann$^*$ }

\title{Description of the \Rpackage{affyPara} package: Parallelized preprocessing methods for Affymetrix Oligonucleotide Arrays}

\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle
\tableofcontents
\newpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Abstract}

The \Rpackage{affyPara} package is part of the Bioconductor\footnote{\url{http://www.bioconductor.org/}} \cite{Gentleman2004} project. It extends the \Rpackage{affy} package, which provides an extensible, interactive environment for the analysis and exploration of Affymetrix oligonucleotide array probe level data. For more details see the \Rpackage{affy} vignettes or \citet{Irizarry2002}.

The \Rpackage{affyPara} package contains parallelized preprocessing methods for high-density oligonucleotide microarray data. The data can be partitioned by arrays, which makes parallelization of the algorithms intuitively possible. Partitioning the data and distributing it to several nodes solves the main memory problems -- caused by the \Robject{AffyBatch} object -- and accelerates the methods \cite{Schmidberger2008}.

<<echo=FALSE, results=hide>>=
library(affyPara)
@

This document was created using R version \Sexpr{paste(R.version$major,".",R.version$minor,sep="")} and versions \Sexpr{package.version("affy")}, \Sexpr{package.version("snow")} and \Sexpr{package.version("vsn")} of the packages \Rpackage{affy}, \Rpackage{snow} and \Rpackage{vsn}, respectively.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Changes to previous Versions}

For major changes see the NEWS file in the source code of the package or use the function \Rfunction{readNEWS()}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

The functions in the \Rpackage{affyPara} package have the same functionality and a very similar user interface as the functions in the \Rpackage{affy} package. For a detailed description of the functions and methods see the \Rpackage{affy} vignettes and help files.

The \Rpackage{affyPara} package contains parallelized preprocessing methods for high-density oligonucleotide microarray data. The package is designed for large numbers of microarrays and solves the main memory problems that the \Robject{AffyBatch} object causes on a single workstation or processor. The data can be partitioned by arrays, which makes parallelization of the algorithms intuitively possible.

It is difficult to give a concrete limit for what counts as a large number of arrays, because this strongly depends on the computer system (architecture, main memory, operating system). In general, a computer cluster and the \Rpackage{affyPara} package should be used when working with more than 150 microarrays. Partitioning the data and distributing it to several nodes solves the main memory problems (on a single workstation) and accelerates the methods. In view of machine accuracy, the parallelized preprocessing methods produce the same results as the serial methods.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Requirements}

The \Rpackage{affyPara} package requires the \Rpackage{affy}, \Rpackage{snow} and \Rpackage{vsn} packages. Several preprocessing subfunctions from the \Rpackage{affy} and \Rpackage{vsn} packages are used. The \Rpackage{snow} package \cite{Rossini2003} is used as the interface to a communication mechanism for parallel computing. The \Rpackage{snow} package implements four low-level interfaces: one using PVM via the \Rpackage{rpvm} package by Li and Rossini, one using MPI via the \Rpackage{Rmpi} package \cite{Rmpi} by Hao Yu, one using NetWorkSpaces via the \Rpackage{NWS} package by Revolution Computing, and one using raw sockets, which may be useful if PVM and MPI are not available. For a comparison and review of existing parallel computing techniques for the R language see \cite{Schmidberger2009}.

For more details concerning the \Rpackage{snow} package, see its help files or the webpage \url{http://www.cs.uiowa.edu/~luke/R/cluster/cluster.html}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Loading the package}

First of all, the package and the packages it depends on have to be loaded.
<<results=hide>>=
library(affyPara)
@

For demonstration we use the small \Robject{AffyBatch} object \Robject{Dilution} from the \Rpackage{affydata} package:

<<>>=
library(affydata)
data(Dilution)
Dilution
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Starting and stopping a cluster}

After loading the libraries, the computer cluster has to be initialized. Starting a workstation cluster is the only step in using a computer cluster that depends explicitly on the underlying communication mechanism. A cluster is started by calling the \Rfunction{makeCluster()} function, but the details of the call depend on the type of cluster. PVM, MPI or NWS clusters may also need some preliminary preparations to start the systems. For some examples see the webpage \url{http://www.cs.uiowa.edu/~luke/R/cluster/cluster.html}.

To start a cluster you should use

<<eval=FALSE>>=
cl <- makeCluster(4, type='SOCK')
@

with the first parameter ('4') giving the number of slaves to spawn and the parameter \Robject{type} giving the communication mechanism to use. In this example we use raw socket connections. As in the \Rpackage{snow} package, a cluster object \Robject{cl} is created by default. To stop a cluster you should use

<<eval=FALSE>>=
stopCluster(cl)
@

<<echo=FALSE, results=hide>>=
cl <- makeCluster(2, type='SOCK')
@

The \Rpackage{affyPara} package masks (overwrites) the functions \Rfunction{makeCluster()} and \Rfunction{stopCluster()} from the \Rpackage{snow} package. The new functions save the cluster object (\Robject{cl}) in the namespace environment. Therefore you do not have to pass the \Robject{cl} object in the function calls of the \Rpackage{affyPara} functions. It is nevertheless still possible to do so, using the parameter \Robject{cluster=cl}.

Socket clusters should stop automatically when the process that created them terminates. However, it is still a good idea to call \Rfunction{stopCluster()}. For more details see the \Rpackage{snow} package, the R package for your communication mechanism (\Rpackage{Rmpi}, \Rpackage{rpvm}, \Rpackage{NWS}), and the implementation of your communication mechanism.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Input data: CEL files or AffyBatch}

Before running any kind of preprocessing, the probe level data (CEL files) have to be handled. As suggested in the \Rpackage{affy} package, an object of class \Robject{AffyBatch} can be created:
\begin{itemize}
\item Create a directory.
\item Move all the relevant CEL files to that directory.
\item Make sure your working directory contains the CEL files (\Rfunction{getwd()}, \Rfunction{setwd()}).
\item Then read in the data:
<<eval=FALSE>>=
AffyBatch <- ReadAffy()
@
\end{itemize}

This \Robject{AffyBatch} object can be used for preprocessing the data (with functions from the \Rpackage{affyPara} and \Rpackage{affy} packages). Depending on the size of the dataset and the memory available on the computer system, you might experience errors like 'Cannot allocate vector ...'.

The idea of the \Rpackage{affyPara} package is that the complete probe level data are never needed in one place (on one computer) at the same time. Therefore it is much more efficient and memory friendly to distribute the CEL files to the local discs of the slave computers or to a shared memory system (e.g. a Samba device).
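As an alternative to building a complete \Robject{AffyBatch} object with \Rfunction{ReadAffy()}, the \Rpackage{affyPara} functions also accept a character vector of CEL file names as input data. The following is only a minimal sketch of how such a vector can be created with \Rfunction{list.celfiles()} from the \Rpackage{affy} package; the directory path is a placeholder:

<<eval=FALSE>>=
# hypothetical directory containing the CEL files
celPath <- "/path/to/CELfiles"
# character vector with the full CEL file names
files <- list.celfiles(celPath, full.names=TRUE)
@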
Only small \Robject{AffyBatch} objects are then built at the slaves, the preprocessing is done at the slaves, and the results (an \Robject{AffyBatch} or \Robject{ExpressionSet} object) are rebuilt at the master node. This process is implemented in the functions of the \Rpackage{affyPara} package.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Function Description}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Background Correction}

Background correction (BGC) methods are used to adjust intensities observed by means of image analysis to give an accurate measurement of specific hybridization. BGC is essential, since part of the measured probe intensities is due to non-specific hybridization and noise in the optical detection system. In the \Rpackage{affyPara} package the same BGC methods as in the \Rpackage{affy} package are available. To list the background correction methods built into the package, the function \Rfunction{bgcorrect.methods()} can be used:

<<>>=
bgcorrect.methods()
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Use Background Correction Para}

The function \Rfunction{bgCorrectPara()} needs an input data object (\Robject{Dilution}) and the background correction method (\Robject{method="rma"}) as input parameters.

<<results=hide>>=
affyBatchBGC <- bgCorrectPara(Dilution,
	method="rma", verbose=TRUE)
@

If you do not want to use an \Robject{AffyBatch} object as input data, you can instead pass a character vector with the locations of the CEL files to the function \Rfunction{bgCorrectPara()}:

<<eval=FALSE>>=
files <- list.celfiles(full.names=TRUE)
affyBatchBGC <- bgCorrectPara(files,
	method="rma", cluster=cl)
@

For this method all CEL files have to be available at every node. This can be achieved, for example, by using a shared memory system. If you want to distribute the CEL files to the slaves, see Chapter~\ref{chap:distri}. Additionally, this example demonstrates how to use an explicit cluster object (\Robject{cluster=cl}).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Normalization}

Normalization methods make measurements from different arrays comparable. Multi-chip methods have proved to perform very well. We parallelized the following methods, available from the function \Rfunction{normalize()} in the \Rpackage{affy} package and from the \Rpackage{vsn} package:
\begin{description}
\item[\Rfunction{constant}] $\rightarrow$ \Rfunction{normalizeAffyBatchConstantPara}
\item[\Rfunction{invariantset}] $\rightarrow$ \Rfunction{normalizeAffyBatchInvariantsetPara}
\item[\Rfunction{loess}] $\rightarrow$ \Rfunction{normalizeAffyBatchLoessPara}
\item[\Rfunction{quantiles}] $\rightarrow$ \Rfunction{normalizeAffyBatchQuantilesPara}
\item[\Rfunction{vsn2}] $\rightarrow$ \Rfunction{vsnPara}
\end{description}

The parallelized normalization functions need an input data object and the corresponding normalization parameters as input parameters.
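For example, the parallelized loess normalization can be applied to the \Robject{Dilution} data in the same way (a minimal sketch; see the help page of \Rfunction{normalizeAffyBatchLoessPara()} for the available parameters):

<<eval=FALSE>>=
# loess normalization, computed in parallel on the cluster
affyBatchLOESS <- normalizeAffyBatchLoessPara(Dilution, verbose=TRUE)
@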
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Use Quantile Normalization Para}

The function \Rfunction{normalizeAffyBatchQuantilesPara()} needs an input data object (\Robject{Dilution}) and the quantile normalization parameters (\Robject{type = "pmonly"}) as input parameters.

<<results=hide>>=
affyBatchNORM <- normalizeAffyBatchQuantilesPara(
	Dilution, type = "pmonly", verbose=TRUE)
@

If you do not want to use an \Robject{AffyBatch} object as input data, you can instead pass a character vector with the locations of the CEL files to the function \Rfunction{normalizeAffyBatchQuantilesPara()}:

<<eval=FALSE>>=
files <- list.celfiles(full.names=TRUE)
affyBatchNORM <- normalizeAffyBatchQuantilesPara(
	files, type = "pmonly")
@

For this method all CEL files have to be available from a shared memory system. If you want to distribute the CEL files to the slaves, see Chapter~\ref{chap:distri}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Summarization}

Summarization is the final step in preprocessing raw data. It combines the multiple probe intensities of each probeset to produce expression values. These values are stored in an object of class \Robject{ExpressionSet}. Compared to the \Robject{AffyBatch} class, the \Robject{ExpressionSet} requires much less main memory, because it no longer contains the individual probe intensities. Therefore the complete preprocessing functions in the \Rpackage{affyPara} package are very efficient, because no complete \Robject{AffyBatch} object has to be built; see Chapter~\ref{chap:preproPara}.

The parallelized summarization functions need an input data object and the corresponding summarization parameters as input parameters. To see the summarization methods and PM correction methods that are built into the package, the functions \Rfunction{express.summary.stat.methods()} and \Rfunction{pmcorrect.methods()} can be used:

<<>>=
express.summary.stat.methods()
pmcorrect.methods()
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Use Summarization Para}

The function \Rfunction{computeExprSetPara()} needs an input data object (\Robject{Dilution}) and the summarization parameters (\Robject{pmcorrect.method = "pmonly"}, \Robject{summary.method = "avgdiff"}) as input parameters.

<<results=hide>>=
esset <- computeExprSetPara(
	Dilution,
	pmcorrect.method = "pmonly",
	summary.method = "avgdiff")
@

If you do not want to use an \Robject{AffyBatch} object as input data, you can instead pass a character vector with the locations of the CEL files to the function \Rfunction{computeExprSetPara()}:

<<eval=FALSE>>=
files <- list.celfiles(full.names=TRUE)
esset <- computeExprSetPara(
	files,
	pmcorrect.method = "pmonly",
	summary.method = "avgdiff")
@

For this method all CEL files have to be available from a shared memory system. If you want to distribute the CEL files to the slaves, see Chapter~\ref{chap:distri}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Complete Preprocessing}\label{chap:preproPara}

Combining the background correction, normalization and summarization steps into one single preprocessing method yields a particularly efficient method.
For parallelization, the combination has the big advantage of reducing the exchange of data between master and slaves. Moreover, at no point does a complete \Robject{AffyBatch} object need to be built, and the time-consuming rebuilding of the \Robject{AffyBatch} objects is no longer necessary.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Use Preprocessing Para}

It is important to note that not every preprocessing method can be combined with every other. For more details see the vignettes in the \Rpackage{affy} package.

The function \Rfunction{preproPara()} needs an input data object (\Robject{Dilution}) and the parameters for BGC, normalization and summarization as input parameters.

<<results=hide>>=
esset <- preproPara(
	Dilution,
	bgcorrect = TRUE, bgcorrect.method = "rma",
	normalize = TRUE, normalize.method = "quantil",
	pmcorrect.method = "pmonly",
	summary.method = "avgdiff")
@

The function works very similarly to the \Rfunction{expresso()} function from the \Rpackage{affy} package. Using an \Robject{AffyBatch} object as input data object is not very reasonable for this function, because a complete \Robject{AffyBatch} object has to be created first, which is very memory intensive. It is much better to use a vector of CEL files as input data object; then at no point does a complete \Robject{AffyBatch} object need to be built:

<<eval=FALSE>>=
files <- list.celfiles(full.names=TRUE)
esset <- preproPara(
	files,
	bgcorrect = TRUE, bgcorrect.method = "rma",
	normalize = TRUE, normalize.method = "quantil",
	pmcorrect.method = "pmonly",
	summary.method = "avgdiff")
@

For this method all CEL files have to be available from a shared memory system. If you want to distribute the CEL files to the slaves, see Chapter~\ref{chap:distri}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Use RMA Para}

RMA \cite{Irizarry2003a} is a widely used complete preprocessing method. The function \Rfunction{rma()} converts an \Robject{AffyBatch} object into an \Robject{ExpressionSet} object using the robust multi-array average (RMA) expression measure. There is also a function \Rfunction{justRMA()} in the \Rpackage{affy} package, which reads CEL files and computes an expression measure without using an \Robject{AffyBatch} object. The parallelized version of \Rfunction{rma()} is called \Rfunction{rmaPara()} and is a 'simple' wrapper function for the function \Rfunction{preproPara()}.

<<results=hide>>=
esset <- rmaPara(Dilution)
@

Using an \Robject{AffyBatch} object as input data object is not very reasonable for this function, because a complete \Robject{AffyBatch} object has to be created first, which is very memory intensive. It is much better to use a vector of CEL files as input data object; then at no point does a complete \Robject{AffyBatch} object need to be built:

<<eval=FALSE>>=
files <- list.celfiles(full.names=TRUE)
esset <- rmaPara(files)
@

For this method all CEL files have to be available from a shared memory system. If you want to distribute the CEL files to the slaves, see Chapter~\ref{chap:distri}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Use vsnRMA Para}

Another well-known preprocessing method is \Rfunction{vsnrma()}. It uses variance stabilization normalization (vsn) for normalization and RMA for summarization. The parallelized version is called \Rfunction{vsnrmaPara()}.
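A minimal usage sketch, analogous to \Rfunction{rmaPara()} (see the help page of \Rfunction{vsnrmaPara()} for further parameters):

<<eval=FALSE>>=
# complete preprocessing with vsn normalization and RMA summarization
esset <- vsnrmaPara(Dilution)
@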
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Quality Control and Assessment}

Quality control and assessment becomes very difficult for a large number of microarrays. It takes a lot of computation time and requires a lot of main memory. In addition, most methods for quality control are graphical tools, and with more than 200 arrays no further information can be read out of the figures. Therefore some optimized functions for quality control were implemented in parallel:

<<eval=FALSE>>=
boxplotPara(Dilution)
MAplotPara(Dilution)
@

These functions create output tables with the quality assessment for all arrays and create optimized graphics (\Robject{plot=TRUE}) for large numbers of arrays.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Distributing Data}\label{chap:distri}

At a workstation cluster the CEL files can be made available through a shared memory system, often a Samba device. However, this can become the bottleneck for the communication traffic. For distributed memory systems, the function \Rfunction{distributeFiles()} was designed for (hierarchically) distributing files from the master to a special directory (e.g. '/tmp/') at all slaves. R or the faster network protocols SCP or RCP can be used for distributing the files.

<<eval=FALSE>>=
path <- "tmp/CELfiles"  # path at local computer system (master)
files <- list.files(path, full.names=TRUE)
distList <- distributeFiles(files, protocol="RCP")
eset <- rmaPara(distList$CELfiles)
@

With the parameter \Rfunarg{hierarchicallyDist}, hierarchical distribution can be used. If \Robject{hierarchicallyDist = TRUE}, the data are hierarchically distributed to all slaves. If \Robject{hierarchicallyDist = FALSE}, only a part of the data is available at every slave. This function and the corresponding input data object (\Robject{distList\$CELfiles}) can be used with every parallelized preprocessing method in the \Rpackage{affyPara} package.

There is also a function to remove distributed files:

<<eval=FALSE>>=
removeDistributedFiles("/usr1/tmp/CELfiles")
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Results and Discussion}

This article proposes the new package called \Rpackage{affyPara} for parallelized preprocessing of high-density oligonucleotide microarrays. In view of machine accuracy, the parallelized preprocessing methods produce the same results as the serial methods. The partition of data and distribution to several nodes solves the main memory problems and accelerates the methods.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Test for Accuracy}\label{chap:compare}

In view of machine accuracy, the parallelized functions produce the same results as the serial methods.
To compare results from different functions you can use the functions \Rfunction{identical()} or \Rfunction{all.equal()} from the \Rpackage{base} package.

<<>>=
affybatch1 <- bg.correct(Dilution, method="rma")
affybatch2 <- bgCorrectPara(Dilution, method="rma")

identical(exprs(affybatch1), exprs(affybatch2))
all.equal(exprs(affybatch1), exprs(affybatch2))
@

Attention: if you directly compare the \Robject{AffyBatch} or \Robject{ExpressionSet} objects, you will get warnings or seemingly different results. This is caused by different values of the 'Title' and 'notes' slots in the \Robject{experimentData}. Using the function \Rfunction{exprs()} to extract the expression data, equality -- in view of machine accuracy -- can be verified.

Attention for loess normalization: loess normalization draws a random subsample. To obtain the same results, the random number generator has to be reset for every run:

<<eval=FALSE>>=
set.seed(1234)
affybatch1 <- normalize.AffyBatch.loess(Dilution)

set.seed(1234)
affybatch2 <- normalizeAffyBatchLoessPara(Dilution, verbose=TRUE)

identical(exprs(affybatch1), exprs(affybatch2))
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Speedup}

In order to illustrate how much faster the parallel algorithms are than the corresponding sequential algorithms, Figure~\ref{fig:speedup} shows the speedup of the parallelized preprocessing methods for 50, 100 and 200 CEL files. An average speedup of up to a factor of 10 can be achieved for 200 or more arrays.

\begin{figure}[ht]
\centering
\includegraphics[width=12.5cm]{fig/affyPara_speedup.pdf}
\caption{Speedup of the parallelized preprocessing methods for 200, 100 and 50 microarrays.}
\label{fig:speedup}
\end{figure}

The computation time of the parallel algorithms is compared to that of the original serial code. It is well known that parts of the original code are not very efficiently implemented. Therefore an increased (super-linear) speedup can be achieved for small numbers of processors. The outliers are mostly caused by unbalanced data distribution: for example, 200 microarrays cannot be distributed equally to 23 nodes, so some computers have to process one additional array. Furthermore, foreign network traffic in the workstation cluster at the IBE is a reason for outliers.

Beyond a certain number of processors (depending on the number of arrays and the method), the curves of all parallelized functions flatten out. This means that using more processors yields no further speedup. For 200 microarrays, for example, about 10 nodes are enough.

The cluster at the Department for Medical Information, Biometrics and Epidemiology (IBE, University of Munich) consists of 32 personal computers, each with 8 GB main memory and two dual-core Intel Xeon DP 5150 processors. Using this cluster, about 16,000 (32 nodes $\cdot$ approximately 500 CEL files) microarrays of the type HGU-133A can be preprocessed using the function \Rfunction{preproPara()}. By expanding the cluster, the number of microarrays can be increased to any given number.
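To get a rough impression of the speedup on your own system, the run times of the serial and the parallelized function can be compared with \Rfunction{system.time()}. This is only a sketch; for a small data set such as \Robject{Dilution} no speedup should be expected, since the communication overhead dominates:

<<eval=FALSE>>=
# serial RMA from the affy package
system.time(eset1 <- rma(Dilution))
# parallelized RMA from the affyPara package
system.time(eset2 <- rmaPara(Dilution))
@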
The \Rpackage{affyPara} package has been tested in different hardware environments:
\begin{description}
\item[Computer cluster:] IBE, Linux-Cluster (LRZ, Munich, Germany), HLRB2 (LRZ, Munich, Germany), Hoppy (FHCRC, Seattle, WA, USA)
\item[Multicore:] IBE, HLRB2 (LRZ, Munich, Germany), lamprey (FHCRC, Seattle, WA, USA)
\end{description}

Many thanks to these institutes for providing access to their computing resources.

<<echo=FALSE, results=hide>>=
stopCluster()
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\nocite{Team2007}
\bibliographystyle{plain}
\bibliography{references}

\end{document}