%\VignetteIndexEntry{Main vignette (complete version): End-to-end analysis of cell-based screens}
%\VignetteKeywords{Cell based assays}
%\VignettePackage{cellHTS2}

\documentclass[11pt]{article}
\usepackage{amsmath}
\usepackage{color}
\definecolor{darkblue}{rgb}{0.0,0.0,0.75}
\usepackage[%
baseurl={http://www.bioconductor.org},%
pdftitle={End-to-end analysis of cell-based screens},%
pdfauthor={Wolfgang Huber},%
pdfsubject={cellHTS2},%
pdfkeywords={Bioconductor},%
pagebackref,bookmarks,colorlinks,linkcolor=darkblue,citecolor=darkblue,%
pagecolor=darkblue,raiselinks,plainpages,pdftex]{hyperref}

\SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5} 

\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{\textit{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}

\newcommand{\myincfig}[3]{%
  \begin{figure}[tp]
    \begin{center}
      \includegraphics[width=#2]{#1}
      \caption{\label{#1}#3}
    \end{center}
  \end{figure}
}

\begin{document}

%------------------------------------------------------------
\title{End-to-end analysis of cell-based screens: from raw intensity readings
  to the annotated hit list}
%------------------------------------------------------------
\author{Michael Boutros, L\'igia Br\'as, Florian Hahne and Wolfgang Huber}
\maketitle
\tableofcontents

<<Ropts, echo=FALSE, results=hide>>=
options(width=70)
@ 

\section{Introduction}\label{sec:Intro}
The package \Rpackage{cellHTS2} is a revised and improved version of 
the \Rpackage{cellHTS} package~\footnote{To convert a S3 class 
\Rclass{cellHTS} object that was made using the \Rpackage{cellHTS} package 
into an S4 class \Rclass{cellHTS} object suitable for \Rpackage{cellHTS2}, 
please see the function \Rfunction{convertOldCellHTS}.}.

This report describes the structure of the \Rclass{cellHTS} class,
while explaining all the steps necessary to run a complete analysis of
a cell-based high-throughput screen (HTS), from raw intensity readings
to an annotated hit list.

This text has been produced as a reproducible
document~\cite{Gentleman2004RepRes}. It contains the actual computer
instructions for the methods it describes, and these in turn produce
all results, including the figures and tables that are shown here. The
computer instructions are given in the language R, thus, in order to
reproduce the computations shown here, you will need an installation
of R (version 2.3 or greater) together with a recent version of the
package \Rpackage{cellHTS2} and of some other add-on packages. Within
R, the following commands can be used to install \Rpackage{cellHTS2}
along with all dependent packages.

To reproduce the computations shown here, you do not need to type them
or copy-paste them from the PDF file; rather, you can take the file
\textit{cellhts2Complete.Rnw} in the \textit{scripts} directory of the
package, open it in a text editor, run it using the R command
\Rfunction{Sweave}, and modify it to your needs.

First, we load the package.
%
<<setup1, results=hide>>=
library("cellHTS2")
@ 
%
<<setup2, echo=FALSE, results=hide>>=
## working path:
workPath <- getwd()

## check if bib file exists
if (!("cellhts.bib" %in% dir()) )
  system(sprintf("cp %s/cellhts.bib .", system.file("doc", package="cellHTS2")))

## for debugging:
options(error=recover)
## for software development, when we do not want to install 
## the package after each minor change:
##   for(f in dir("~/huber/projects/Rpacks/cellHTS2/R", full.names=TRUE, pattern=".R$"))source(f)
@ 
%
%------------------------------------------------------------
\section{Reading the intensity data}
\label{sec:read}
%------------------------------------------------------------
We consider a cell-based screen that was conducted in microtiter plate
format, where a library of double-stranded RNAs was used to target the
corresponding genes in cultured \textit{Drosophila} $Kc_{167}$
cells~\cite{Boutros2004}. Each of the wells in the plates contains
either a gene-specific probe, a control, or it can be empty.  The
experiments were done in duplicate, and the viability of the cells
after treatment was recorded by a plate reader measuring luciferase
activity.  The promoter has been chosen such that the luciferase
activity is indicative of ATP levels. Although this set of example
data corresponds to a single-channel screening assay, the
\Rpackage{cellHTS2} package can also deal with cases where there are
readings from more channels, corresponding to different reporters.

Usually, the measurements from each replicate and each channel come in
individual result files. The set of available result files and the
information about them (which plate, which replicate, which channel)
is contained in a spreadsheet, which we call the \emph{plate list
  file}. This file should contain at least the following columns:
\emph{Filename}, \emph{Plate} and \emph{Replicate}.  The last two
columns should be integer numbers, with values ranging from 1 to the
maximum number of plates or replicates, respectively. An optional
\emph{Batch} column can be used to provide batch information about the
experiment, i.e., changes in reagents, or days for a multi-day
experiment. See Section~ref{sec:multPlateConfs} for more details. The
first few lines of an example plate list file are shown in
Table~\ref{tab:platelist}, while Table~\ref{tab:signalintensity} shows
the first few lines from one of the plate result files listed in the
plate list file.

\input{cellhts2-platelist}
\input{cellhts2-signalintensity}

The first step of the analysis is to read the plate list file, to read
all the intensity files, and to assemble the data into a single R
object that is suitable for subsequent analyses.  The main component
of that object are arrays with the intensity readings of all plates,
channels, and replicates. We demonstrate the R instructions for this
step. First we define the path where the input files can be found.
%
<<dataPath>>=
experimentName <- "KcViab"
dataPath <- system.file(experimentName, package="cellHTS2") 
@ 
%
In this example, the input files are in the
\Robject{\Sexpr{experimentName}} directory of the \Rpackage{cellHTS2}
package. To read your own data, modify \Robject{dataPath} to point to
the directory where they reside. We show the names of 12 files from
our example directory:
%
<<dirDataPath>>=
dataPath
rev(dir(dataPath))[1:12]
@ 
%
and read the data into the object \Robject{x}
%
<<readPlateList, results=hide>>=
x <- readPlateList("Platelist.txt", 
                   name=experimentName, 
                   path=dataPath)
@ 
%
<<showX>>=
x
@ 
%
The plate format used in the screen (96-well or 384-well plate design)
is automatically determined from the raw intensity files when calling
the \Rfunction{readPlateList} function.

%% Create the example table:
%% it would have been nice to use the "xtable" package but it insists on 
%%   adding row numbers, which we don't like.
<<plateFileTable, results=hide, echo=FALSE>>=
cellHTS2:::tableOutput(file.path(dataPath, "Platelist.txt"), "plate list")
cellHTS2:::tableOutput(file.path(dataPath, names(intensityFiles(x))[1]), 
                       "signal intensity", header=FALSE)
@ 

%------------------------------------------------------------
\subsection{Importing intensity data files with other formats}
\label{sec:read2}
%------------------------------------------------------------

The function \Rfunction{readPlateList} has the argument
\Robject{importFun} that can be used to provide a different import
function to read plate result files with a format different from that
shown in Table~\ref{tab:signalintensity}.  For example, to import
plate data files from an EnVision plate reader, set
\Robject{importFun=getEnVisionRawData} or
\Robject{importFun=getEnvisionCrosstalkCorrectedData} when calling
\Rfunction{readPlateList}. Please see the help page of function
\Rfunction{getEnVisionRawData} for an example.  Another import
function (``importData.R'') is given together with the example data
set for a enhancer-supressor screen in the directory called
\emph{TwoWayAssay} of this package.\\

While for the above data sets, the measurements from each replicate
and channel come in separate result files, this is not the case when
measurement files are provided in the HTanalyst format. In this case,
each output file contains meta-experimental data together with
intensity readings (in a matrix-like layout) of a set of plates made
for the same replicate or screen. Thus, there is no need to have a
plate list file, and instead of using \Rfunction{readPlateList} we
should call the function \Rfunction{readHTAnalystData}. Please see the
help page for this function (\Robject{? readHTAnalystData}), where we
illustrate how it can be applied to import HTAnalyst data files.


%------------------------------------------------------------
\section{The \Rclass{cellHTS} class and reports}
%------------------------------------------------------------
The basic data structure of the package is the class \Rclass{cellHTS},
which is a container for cell-based high-throughput RNA interference
assays (data and experimental meta-data) performed in multi-plate
format.  This class extends the class \Rclass{NChannelSet} of the
\Rpackage{Biobase} package~\cite{BioC2004}.

The data can be thought of as being organised in a two- or 
three-dimensional array as follows:
\begin{enumerate}
  
\item The first dimension corresponds to reagents (e.g. siRNAs,
  chemical compounds) that were used in the assays. For example, if
  the screen used 100 plates of 384 wells (24 columns, 16 rows), then
  the first dimension has size 38,400, and the \Rclass{cellHTS} object
  keeps track of plate ID, row, and column associated with each
  element.  For historic reasons, and because we are using
  infrastructure that was developed for microarray experiments, the
  following terms are used synonymously for the elements of the first
  dimension: \emph{reagents}, \emph{features}, \emph{probes},
  \emph{genes}.
  
\item The second dimension corresponds to assays, including replicates
  and different experimental conditions (cell type, treatment, genetic
  background).  A potentially confusing terminology is that the data
  structure that annotates the second dimension is called
  \Robject{phenoData} (see below). This is because we are using
  infrastructure, namely the \Rclass{NChannelSet} class from the
  \Rpackage{Biobase} package, that uses this unfortunate term for this
  purpose.  Sometimes, the elements of this second dimension are also
  called \emph{samples}.
  
\item The (optional) third dimension corresponds to different
  channels (e.g. different luminescence reporters)
  
\end{enumerate}

\noindent The main structures contained in a \Rclass{cellHTS} object
are:

\begin{description}

\item[assayData] an object of class \Rclass{AssayData}, usually an
  environment containing a set of matrices of identical size.  Each
  matrix represents a single channel.  In each matrix, the rows
  correspond to features or reporters (e.\,g.\ siRNAs, dsRNAs) and the
  colums to samples (different conditions and/or replicates).

\item[phenoData] a dataframe (more precisely, an object of class
  \Rclass{AnnotatedDataFrame}) containing information about the
  screens, such as the replicate number and the type of biological
  assay.  It must have the following columns in its data component:
  
  \begin{description}
  
  \item[replicate] a vector of integers giving the replicate number
  
  \item[assay] a character vector giving the name of the biological
    assay or condition
  
\end{description}

Both of these columns have the same length as the number of samples in
the \Robject{assayData} slot.  The choice of name \emph{phenoData} for
this structure is unfortunate, it has nothing to do with phenotypes.

\item[featureData] a dataframe (more precisely, an object of class
  \Rclass{AnnotatedDataFrame}) that contains information about the
  reagents used in the experiment. There are three mandatory columns,
  in addition there can be an arbitrary number of additional columns,
  for example a target gene identifier. The mandatory columns are:
  
  \begin{description}
  
  \item[plate] integers specifying the plate number (e.\,g.\ $1, 2,
    \ldots$)
  
  \item[well] alphanumeric character strings giving the well ID within
    the plate (e.g. A01, B01, ..., P24).
    
  \item[controlStatus] a factor specifying the annotation for each well
    with possible levels: \emph{empty}, \emph{other}, \emph{neg},
    \emph{sample}, \emph{pos}. Other levels besides \emph{pos} and
    \emph{neg} may be employed for the positive and negative controls.
    
  \end{description}

\item[plateData] a list of dataframes, where each list item contains
  plate-specific information. The structure of the individual
  dataframes is fixed: rows are supposed to be plates and columns are
  supposed to be samples (i.e., the number of columns has to be the
  same as the number of columns in the \Robject{assayData}
  matrices. By default, the only available list item is
  \Robject{Batch} containing the information about experimental
  batches in the experiment. Unless explicitely specified in the plate
  list file, these will all be set to 1.

\item[experimentData] an object of class \Rclass{MIAME} containing
  descriptions of the experiment.

\end{description}

\noindent The \Rclass{cellHTS} class also includes additional slots
that are used to store the input files used to assemble the
\Rclass{cellHTS} object.  These are:

\begin{description}

\item[plateList] a dataframe containing names and metadata about input
  measurement data files, plus a column named \emph{status}, of type
  \Rclass{character}, whose elements are the string "OK" if the data
  import appeared to have gone well, and the respective error or
  warning message otherwise.

\item[intensityFiles] a list whose components are copies of the
  imported input data files.  Its length corresponds to the number of
  rows of \Robject{plateList}.

\item[plateConf] a data frame containing what was read from the
  configuration file for the experiment (except the first two header
  rows).

\item[screenLog] a data frame containing what was read from the screen
  log file for the experiment (in case there was one).

\item[screenDesc] a character containing what was read from the
  description file of the experiment.
\end{description}


Other slots are \Robject{rowcol.effects}, \Robject{overall.effects}
and \Robject{annotation}.  For a detailed description of this class,
please type \Robject{class ? cellHTS}.

In Section~\ref{sec:read}, we created the object \Robject{x}, which is
an instance of the \Rclass{cellHTS} class.  The measurements intensities
are stored in the slot \Robject{assayData} of \Robject{x}.  The slot
called \Robject{state} helps to keep track of the preprocessing state of our
\Rclass{cellHTS} object.  This slot can be accessed through the
\Rfunction{state}, as shown below:
%
<<see object state>>=
state(x)
@ 
%
It contains a logical vector of length \Sexpr{length(state(x))}
representing the processing status of the object. It should have the
names "configured", "normalized", "scored" and "annotated".  We can
thus see that \Robject{x} has not been configured, annotated,
normalized or scored yet.

These 4 main stages are explained along this vignette and can be
briefly described as follows:

\begin{description}
\item[Configuration] involves annotating the experiment with
  information about the screen (e.\,g.\ title, when and how it was
  performed, which organism, which library, type of assay, etc.),
  annotating the measured values with information on controls and
  flagging invalid measurements.  This step is covered in
  Section~\ref{sec:screenconf} and is prerequisite for preprocessing
  the experimental screening data (i.\,e.\ for normalization and
  scoring).

\item[Annotation] involves annotating the features (reagents) with
  information about, for example, their target genes.  This step,
  detailed in Section~\ref{sec:annotation}, is not essential for
  preprocessing, but this information can be used for the HTML quality
  reports generated by \Rpackage{cellHTS2} and in further analyses
  (see the complete report, as explained in Section~\ref{sec:Intro}).

\item[Normalization] involves removing systematic variations, making
  measurements comparable across plates and within plates, enhancing
  the biological signal and eventually transforming the data to a
  scale suitable for subsequent analyses.

\item[Replicates scoring and summarization] involves standardizing the
  values for each replicate and then summarizing replicate values in
  order to obtain a single-value per probe (for example, a robust
  $z$-score).

\end{description}

The steps of data normalization, replicate scoring and summarization
constitute the preprocessing work-flow of the screening data.  They
alter the contents of \Robject{assayData} slot and even the number of
channels of the \Rclass{cellHTS} object.  The complete analysis
project is contained in a set of \Rclass{cellHTS} objects that reflect
different preprocessing stages and that can be shared with others and
stored for subsequent computational analyses.

The \Rpackage{cellHTS2} package offers export functions for generating
human-readable reports, which consist of linked HTML pages with tables
and plots. The final scored hit list is written as a tab-delimited
format suitable for reading by spreadsheet programs.

Returning to our example data set, we create a report using the
function \Rfunction{writeReport}. Since until now we only have
unnormalized experimental data, the \Rclass{cellHTS} object should be
given to the function as the \Robject{raw} argument:
 
%
<<writeReport1Show, eval=FALSE>>=
out <- writeReport(raw=x)
@ 
<<writeReport1Do, echo=FALSE, results=hide>>=
out <- writeReport(raw=x, force=TRUE, outdir=tempdir())
@ 
%
This will write the report into the directory given by the argument
\Robject{outdir}. The option \Robject{force=TRUE} tells the function
to delete and overwrite a possibly already existing directory of the
same name.  This option needs to be used with caution and is only
accepted if \Robject{outdir} is explicitely specified.  If the
argument \Robject{outdir} is not specified, the default is a
subdirectory of the current working directory with name given by
\Robject{name(x)}

In order to keep a reference to the R commands used to create the HTML
output, one can provide the path to an ASCII script file using the
\Robject{mainScriptFile} argument. This comes back to the notion of
reproducible research mentioned before, and we strongly suggest to
make use of this feature. In fact, the function will issue a warning
if no script file is supplied. This said, we will not make use of the
\Robject{mainScriptFile} arguments, since we don't really have
scripting files at hand; all commands are contained in this document.

It can take a while to run this function, since it creates a large
number of graphics files. After this function has finished, the index
page of the report will be in the file indicated by the variable
\Robject{out},
%
<<printout>>=
out
@ 
%
and you can view it by directing a web browser to that file.
%
<<browseReport1>>=
if (interactive()) browseURL(out)
@ 

%------------------------------------------------------------
\section{Screen configuration: annotating the plate results}
\label{sec:screenconf}
%------------------------------------------------------------
\input{cellhts2-plateconfiguration} 
\input{cellhts2-screenlog} 

The next step of the analysis involves reading and processing three
input files specific of the screening experiment:
\begin{itemize}
\item \emph{Screen description file} contains a general description of
  the screen, its goal, the conditions under which it was performed,
  references, and any other information that is pertinent to the
  biological interpretation of the experiments.  In
  \Rpackage{cellHTS2} package we provide a function that creates a
  template description file whose entries can be edited and completed
  by the user. Type \Robject{? templateDescriptionFile}.  This file
  contains the entries compliant with the \Rclass{MIAME} class and
  also additional fields specific for the \Rclass{cellHTS} class.

\item \emph{Plate configuration file} is used to annotate the measured
  data with information on controls.  The content of this file for the
  example data set analysed here is shown in
  Table~\ref{tab:plateconfiguration} and the expected format for this
  file is explained in Section~\ref{sec:plateconf}.

\item \emph{Screen log file} (optional) is used to flag individual
  measurements as invalid.  The first 5 lines of this file are shown
  in Table~\ref{tab:screenlog}, and the layout for the
  \emph{screen log file} is detailed in Section~\ref{sec:screenlog}.\\

\end{itemize}


To apply the information contained in these three files in our
\Rclass{cellHTS} object, we call:
<<annotatePlateRes>>=
x <- configure(x,
               descripFile="Description.txt", 
               confFile="Plateconf.txt", 
               logFile="Screenlog.txt", 
               path=dataPath)
@ 
%
Note that the function \Rfunction{configure}\footnote{More precisely,
  \Rfunction{configure} is a method for the S4 class
  \Rclass{cellHTS}.}  takes \Robject{x}, the result from
Section~\ref{sec:read}, as an argument, and we then overwrite
\Robject{x} with the result of this function. If no screen log file is
available for the experiment, the argument \Robject{logFile} of the
function \Rfunction{configure} should be omitted.
%%
%% Create the example table for plateConf and screenLog
<<plateConfscreenLogTable, results=hide, echo=FALSE>>=
cellHTS2:::tableOutputWithHeaderRows(file.path(dataPath, "Plateconf.txt"), 
                                     "plate configuration", selRows=NULL)
cellHTS2:::tableOutput(file.path(dataPath, "Screenlog.txt"), 
                       "screen log", selRows=1:3)
@ 


%%--------------------------------------------------
\subsection{Format of the plate configuration file}
\label{sec:plateconf}
%%--------------------------------------------------
The software expects this to be a rectangular table in a tabulator
delimited text file, with mandatory columns \emph{Plate}, \emph{Well},
\emph{Content}, plus two additional header lines that give the total
number of wells and plates (see Table~\ref{tab:plateconfiguration} for
an example).  The content of this file (except the two header lines)
are stored in slot \Robject{plateConf} of \Robject{x}.\newline

As the name suggests, the \emph{Content} column provides the content
of each well in the plate (here referred to as the \textit{well
  annotation}). Mainly, this annotation falls into four categories:
empty wells, wells targeting genes of interest, control wells, and
wells containing other things that do not fit in the previous
categories. The first two types of wells should be indicated in the
\textit{Content} column of the plate configuration file by
\textit{empty} and \textit{sample}, respectively, while the last type
of wells should be indicated by \textit{other}. The designation for
the control wells in the \textit{Content} column is more flexible. By
default, the software expects them to be indicated by \textit{pos}
(for positive controls), or \textit{neg} (for negative
controls). However, other names are allowed, given that they are
specified by the user whenever necessary (for example, when calling
the \Rfunction{writeReport} function). This versatility for the
control wells' annotation is justified by the fact that, sometimes,
multiple positive and/or negative controls can be employed in a given
screen, making it useful to give different names to the distinct
controls in the \textit{Content} column. Moreover, this versatility
might be required in multi-channel screens for which we frequently
have reporter-specific controls.

The \emph{Well} column contains the name of each well of the plate in
alphanumeric format (in this case, \Robject{A01} to \Robject{P24}),
while column \emph{Plate} gives the plate number (1, 2, ...).  These
two columns are also allowed to contain regular expressions.  In the
plate configuration file, each well and plate should be covered by a
rule, and in case of multiple definitions only the last one is
considered.  For example, in the file shown in
Table~\ref{tab:plateconfiguration}, the rule specified by the first
line after the column header indicates that all of the wells in each
of the 57 assay plate contain ``sample''.  However, a following rule
indicate that the content of wells \Robject{A01}, \Robject{A02} and
\Robject{B01} and \Robject{B02} differ from ``sample'', containing
other material (in this case, ``other'' and controls).

Note that the well annotations mentioned above are used by the
software in the normalization, quality control, and gene selection
calculations. Data from wells that are annotated as \textit{empty} are
ignored, i.\,e.\ they are set to \Robject{NA}.

Here we look at the frequency of
each well annotation in the example data:
%
<<>>=
table(wellAnno(x))
@ 
%

We can also use the function \Rfunction{configurationAsScreenPlot} to
see the plate configuration of all of the plates as a screen plot:
%
<<configurationplot, echo=FALSE, results=hide>>=
png("cellhts2Complete-configurationplot.png", width=324, height=324)
configurationAsScreenPlot(x)
dev.off()
@
<<configurationplotShow, eval=FALSE>>=
configurationAsScreenPlot(x)
@ 

The result is shown in Figure~\ref{fig:configurationplot}. This can be
useful to verify the correctness of the plate configuration when a
complex set of rules with regular expressions are used.
\begin{figure}[tp]
\begin{center}
\includegraphics[width=324pt,height=324pt]{cellhts2Complete-configurationplot.png}
\caption{\label{fig:configurationplot}Screen plot of the plate 
configuration.}
\end{center}
\end{figure}

A special case of well annotation is when different types of positive
controls are used for the screening, that is \emph{enhancer} and
\emph{suppressor} compounds.  The vignette \textit{Analysis of screens
  with enhancer and suppressor controls} accompanying this package
explains how such controls can be handled during the screen analysis
using \Rpackage{cellHTS2} package.


%--------------------------------------------------
\subsubsection{Multiple plate configurations}
\label{sec:multPlateConfs}
%%--------------------------------------------------
Although it is good practice to use the same plate configuration for
the whole experiment, sometimes this does not work out, and there are
different parts of the experiment with different plate configurations.
The use of regular expressions in columns \emph{Plate} and \emph{Well}
of the plate configuration file allow therefore to specify different
configurations within and between assay plates.  The two header rows
of this file ascertain that all of the plates and wells are covered in
the plate configuration file.

%% Note about 'batches'
Note that in contrast to the \Rpackage{cellHTS} package, in
\Rpackage{cellHTS2} package the concept of \emph{batch} is separated
from the concept of having multiple plate configurations. So, for
example, different replicate of the same plate can be set as to belong
to different batches (even though they are required to have the same
plate configuration!) - since readouts were performed on different
days, or due to the use of different lots of reagents, etc.

Batch information (expressed in terms of integer values giving the
batch number: 1, 2, ...) can go into a particular slot called
\Robject{plateData}. This is expected to be a dataframe) of integer
values giving the batch number (1, 2, ...) for each plate and
sample. Its first dimension corresponds to the number of individual
plates, and its second dimension correspond to the number of columns
of each matrix stored in \Robject{assayData} slot (the samples).
Batch information can be filled in by the user in case s/he wants to
take into account this information in the analysis (for example, see
\Rfunction{normalizePlates} function, which allows to adjust the data
variance on a per-batch basis). It can either be provided as an
optional column in the plate list file, or using the \Rfunction{batch}
accessor method.


%%--------------------------------------------------
\subsection{Format of the screen log file}
\label{sec:screenlog}
%%--------------------------------------------------
The screen log file is a tabulator delimited file with mandatory
columns \emph{Plate}, \emph{Well}, \emph{Flag}. 
If there are multiple samples (replicates or conditions), a column 
called \emph{Sample} should also be present; a column named \emph{Channel} 
must also be provided when there are multiple channels.
In addition, it can 
contain arbitrary optional columns. 
Each row corresponds to one flagged measurement, 
identified by the plate number (and possible sample number and channel number) and 
the well identifier (alphanumeric identifier). 
The type of flag is specified in the column \emph{Flag}. Most commonly,
this will have the value ``NA'', indicating that the measurement
should be discarded and regarded as missing.\\

For those users that have been using \Rpackage{cellHTS} package and need to migrate 
their projects to \Rpackage{cellHTS2} package, 
we explain how this can be smoothly performed in Appendix~\ref{sec:migratingProjects}.



%------------------------------------------------------------
\section{Normalization, scoring and summarization of replicates}
\label{sec:norm}
%------------------------------------------------------------ 

The data normalization, replicates scoring and summarization functions
available in \Rpackage{cellHTS2} package work on the data stored in
the \Robject{assayData} slot of the \Rclass{cellHTS} object.
They create a copy of their input \Rclass{cellHTS} object,
with the data replaced by the transformed values. For a
list of the available functions, type \Robject{? cellHTS2}.

The preprocessing work-flow of a typical RNAi screen using
\Rpackage{cellHTS2} package is the following:

\begin{description}
\item[(a)] Per-plate normalization to remove plate and/or edge effects.
This can be done using the function \Rfunction{normalizePlates}.
\item[(b)] Scoring of measurements (for example, compute,
for each replicate, $z$-scores).
This can be done using the function \Rfunction{scoreReplicates}.
\item[(c)] Summarization of replicates (for example, take the median value).
This can be done using the \Rfunction{summarizeReplicates}.
\end{description}

Note that this work-flow is suitable for single-channel data. For
dual-channel data, further steps are required, as explained in the
vignette \textit{Analysis of multi-channel cell-based screens},
accompanying this package.

The function \Rfunction{normalizePlates} can be called to perform per-plate 
data transformation, normalization and variance adjustment. The normalization 
is performed in a plate-by-plate fashion and has three components: 
\begin{description}
\item[Data transformation] logarithmic transformation 
(optional, this can be advisable if the data are on multiplicative scale)
\item[Per-plate normalization] location adjustment for possible 
plate effects and/or possible spatial effects, using a choice of methods
that you need to adapt to your data
\item[Variance adjustment] plate-specific scale (variance) adjustment
(optional) 
\end{description}

For more details about this function and available normalization
methods, please type \Robject{? normalizePlates}.  To provide the
means to perform the above steps, \Rfunction{normalizePlates} has the
arguments \Robject{scale}, \Robject{log}, \Robject{method} and
\Robject{varianceAdjust}.

The argument \Robject{scale} allows to define the scale of the data
(``additive'' or ``multiplicative''), which will then control the
subsequent data transformation and normalization steps.  Namely, when
data are in multiplicative scale, the function allows to perform
$\log_2$ data transformation. For that we need to set the function's
argument \Robject{log} to \Robject{TRUE}. Log transformation will then
change the scale of the data to be ``additive''.

Per-plate median normalization is one of the methods available in
\Rfunction{normalizePlates} and can be chosen by setting the argument
\Robject{method="median"}.  In this case, plate effects are corrected
by dividing (if the current scale of the data is multiplicative) each
measurement by the median value across wells annotated as sample, for
each plate and replicate. If data are in additive scale, the
per-plate median values are subtracted from each plate measurement
instead.  All of the available normalization methods are described in
Appendix~\ref{sec:normalizationMethods}.

The variance of normalized intensities can be adjusted according to
argument \Robject{varianceAdjust}, as follows:

\begin{itemize}

\item \Robject{varianceAdjust="byPlate"}: per plate normalized intensities are 
divided by the per-plate median absolute deviations (MAD) in "sample" wells. 
This is done separately for each replicate and channel;

\item \Robject{varianceAdjust="byBatch"}: using the content of slot
\Robject{batch}, plates are split according to assay batches and the
individual normalized intensities in each group of plates (batch) are
divided by the per-"batch of plates" MAD values (calculated based on
"sample" wells).  This is done separately for each replicate and
channel;

\item \Robject{varianceAdjust="byExperiment"}: each normalized
measurement is divided by the overall MAD of normalized values in
wells containing "sample".  This is done separately for each replicate
and channel.

\end{itemize}

If \Robject{varianceAdjust="none"}, no variance adjustment is
performed (default).

As explained above, the parameter \Robject{method} of
\Rfunction{normalizePlates} allows to choose between different types
of per-plate normalization methods.  Returning to our example data
set, we choose to apply \emph{plate median scaling}:

\begin{eqnarray} 
x'_{ki} &=& \frac{x_{ki}}{M_i}\quad\quad\forall k,i \label{eq:normalizePlateMedian}\\
M_{i}&=&\mathop{\operatorname{median}}_{m\in\,\mbox{\scriptsize samples}} x_{mi} \label{eq:plateMedian}
\end{eqnarray}

where $x_{ki}$ is the raw intensity for the $k$-th well in the $i$-th
replicate file, and $x'_{ki}$ is the corresponding normalized intensity. 
The median is calculated across the wells annotated as \textit{sample} in 
the $i$-th result file. 
This is achieved by calling:
%
<<normalizePlateMedian>>=
xn <- normalizePlates(x, 
                      scale="multiplicative", 
                      log=FALSE, 
                      method="median", 
                      varianceAdjust="none")
@ 
after which we obtain a \Rclass{cellHTS} object with the normalized 
intensities stored in the slot
\Robject{assayData}. 
In the previous call to \Rfunction{normalizePlates} function, 
we have chosen not to adjust the data variance (default behaviour of 
\Rfunction{normalizePlates}).
For example, we can use function \Rfunction{compare2cellHTS} provided with the package to check whether these 
two \Rclass{cellHTS} objects, \Robject{x} and \Robject{xn} belong to the same experiment:

<<compare cellHTs objects>>=
compare2cellHTS(x, xn)
@ 

After normalizing the data, we standardize the values for each 
replicate experiment using Equation~\eqref{eq:defz}. In this equation, 
$\hat{\mu}$ and $\hat{\sigma}$ are estimators of location and
scale of the distribution of $x'_{ki}$ taken across all plates and 
wells of a given replicate experiment. This function uses robust estimators,
namely, median and median absolute deviation (MAD). 
Moreover, it only considers the wells 
containing ``sample'' for estimating $\hat{\mu}$ and $\hat{\sigma}$.
%As the values $x'_{ki}$ were obtained using plate median
%normalization~\eqref{eq:normalizePlateMedian}, it holds that
%$\hat{\mu}=1$.  
The symbol $\pm$ indicates that we allow for either
plus or minus sign in Equation~\eqref{eq:defz}; the minus sign can be
useful in the application to an inhibitor assay, where an effect
results in a decrease of the signal and we may want to see this
represented by a large score.
This is done by calling the \Rfunction{scoreReplicates} function, 
where arguments \Robject{sign} and \Robject{method} define 
the sign and the scoring method to apply (robust $z$-scores, in this case), 
respectively:
%
<<score replicates>>=
xsc <- scoreReplicates(xn, sign="-", method="zscore") 
@ 
%
After data standardization, we summarize the replicates, 
calculating a single score for each gene. 
This is performed using the \Rfunction{summarizeReplicates} function, 
where we use its argument \Robject{summary} to select the summary to apply.
One option is \emph{rms}, which corresponds to take the 
root mean square of the replicates values, and is shown in 
Equation~\eqref{eq:scoresSummary}. The chosen summary is 
taken over all the $n_{\mbox{\scriptsize rep}_k}$ replicates of probe $k$. 


\begin{eqnarray}
z_{ki} &=& \pm \frac{x'_{ki}-\hat{\mu}}{\hat{\sigma}} \label{eq:defz} \\
z_{k}  &=& \sqrt{\frac{1}{n_{\mbox{\scriptsize rep}_k}} \sum_{r=1}^{n_{\mbox{\scriptsize rep}_k}} z_{kr}^2} \label{eq:scoresSummary}.
\end{eqnarray}

Depending on the intended stringency of the analysis, other plausible
choices of summary function between replicates are the minimum, the
maximum, the mean or the median. In the first case, the analysis would
be particularly conservative: all replicate values have to be high in
order for $z_{k}$ to be high.  For the cases where both sides of the
distribution of $z$-score values are of interest, alternative summary
options for the replicates are to select the value closest to zero
(conservative approach) by setting \Robject{summary="closestToZero"}
or the value furthest from zero ( \Robject{summary="furthestFromZero"}
).  In order to compare our results with those obtained in the paper
of Boutros \textit{et al.}~\cite{Boutros2004}, we choose to consider
the mean as a summary:
%
<<summarize replicates>>=
xsc <- summarizeReplicates(xsc, summary="mean") 
@ 
%
after which we obtain a \Rclass{cellHTS} object with the 
resulting single $z$-score value per probe
stored in \Robject{assayData} slot.

Boxplots of the $z$-scores for the different types of probes are shown
in Figure~\ref{cellhts2Complete-boxplotzscore}.
%
<<boxplotzscore, fig=TRUE, include=FALSE, width=4.5, height=5.5>>=
scores <- Data(xsc)
ylim <- quantile(scores, c(0.001, 0.999), na.rm=TRUE)
boxplot(scores ~ wellAnno(x), col="lightblue", outline=FALSE, 
        ylim=ylim)
@ 
%
\myincfig{cellhts2Complete-boxplotzscore}{0.5\textwidth}{Boxplots of $z$-scores 
for the different types of probes.}
%


In the \Rpackage{cellHTS2} package, we provide a further
transformation of the $z$-score values to obtain the so-called
\emph{calls}. This involves applying a sigmoidal transformation to the
$z$-score values, with parameters $z_0$ and $\lambda$ ($>0$), given
by:

\begin{equation}
y_k = \frac{1}{1+e^{-\lambda \,\left(z_k-z_0\right)}} \label{eq:sigtransf}
\end{equation}

This transformation maps the $z$-score values to the interval
$\left[0,\,1\right]$ and is intended to expand the scale of $z$-scores
with intermediate values and shrink the ones showing extreme values,
therefore making the difference between intermediate phenotypes
larger.  The parameter $z_0$ defines the centre of the sigmoidal
transformation, while $\lambda$ controls the smoothness of the
transformation.

This transformation can be done by calling the function
\Rfunction{scores2calls}, as shown in Figure~\ref{cellhts2Complete-callvalues.png}.
%
<<callvalues, echo=FALSE, results=hide, width=180, height=180>>=
y <- scores2calls(xsc, z0=1.5, lambda=2)
png("cellhts2Complete-callvalues.png")
plot(Data(xsc), Data(y), col="blue", pch=".",
     xlab="z-scores", ylab="calls", 
     main=expression(1/(1+e^{-lambda *(z-z[0])})))
dev.off()
@
<<callvaluesShow, eval=FALSE>>=
y <- scores2calls(xsc, z0=1.5, lambda=2)
plot(Data(xsc), Data(y), col="blue", pch=".",
     xlab="z-scores", ylab="calls", 
     main=expression(1/(1+e^{-lambda *(z-z[0])})))
@
\myincfig{cellhts2Complete-callvalues.png}{180px}{A
  sigmoidal transformation that can be used for obtaining call
  values.}
%
However, for the purpose of the present analysis, we will consider the
$z$-score values instead of the call values.

%------------------------------------------------------------
\section{Probe annotation}
\label{sec:annotation}
%------------------------------------------------------------
\input{cellhts2-geneID} 

Up to now, the assayed genes have been identified solely by the
identifiers of the plate and the well that contains the probe for
them. The \emph{annotation file} contains additional annotation, such
as the probe sequence, references to the probe sequence in public
databases, the gene name, gene ontology annotation, and so forth.
Mandatory columns of the annotation file are \textit{Plate},
\textit{Well}, and \textit{GeneID}, and it has one row for each
well. The content of the \textit{GeneID} column will be species- or
project-specific. The first 5 lines of the example file are shown in
Table~\ref{tab:geneID}, where we have associated each probe with
CG-identifiers for the genes of \textit{Drosophila melanogaster}.
%
<<geneIDs>>=
xsc <- annotate(xsc, geneIDFile="GeneIDs_Dm_HFA_1.1.txt",
                path=dataPath)
@ 
%% Create the example table:
<<geneIDsTable, results=hide, echo=FALSE>>=
cellHTS2:::tableOutput(file.path(dataPath, "GeneIDs_Dm_HFA_1.1.txt"), 
                       "gene ID", selRows = 3:6)
@ 
%
An optional column named \textit{GeneSymbol} can be included in the 
\emph{annotation file}, and its content will be displayed by the tooltips 
added to the plate plots and screen-wide 
plot in the HTML quality report (see Section~\ref{sec:report}).


%---------------------------------------------------------------
\subsection{Adding additional annotation from public databases}
\label{sec:additAnno}
%---------------------------------------------------------------
For the analysis of the RNAi screening results, we usually want to
consider gene annotation information such as Gene Ontology,
chromosomal location, gene function summaries, homology.  The package
\Rpackage{biomaRt} can be used to obtain such annotation from public
databases~\cite{biomaRt2005}. However, there are also numerous
alternative methods to annotate a list of gene identifiers with public
annotation -- pick your favourite one.

This section demonstrates how to do it with the package
\Rpackage{biomaRt}. It is optional, you can move on to 
Section \ref{sec:report} if you do not have the \Rpackage{biomaRt}
package or do not want to use it. If you do skip this section, then
for the purpose of this vignette, please load a cached version of the
gene annotation:

<<bdgpbiomart1, eval=TRUE>>=
data("bdgpbiomart")
fData(xsc) <- bdgpbiomart
fvarMetadata(xsc)[names(bdgpbiomart), "labelDescription"] <- 
  sapply(names(bdgpbiomart), 
    function(i) sub("_", " ", i) 
)
@ 


%--------------------------- 
\subsubsection{Installation}
\label{sec:install}
%--------------------------- 
The installation of the \Rpackage{biomaRt} package can be a little bit
tricky, since it relies on the two packages \Rpackage{RCurl} and
\Rpackage{XML}, which in turn rely on the presence of the system
libraries \textit{libcurl} and \textit{libxml2} on your computer. If
you are installing the precompiled R packages (for example, this is
what most people do on Windows), then you need to make sure that the
system libraries on your computer are version-compatible with those on the
computer where the R packages were compiled, and that they are
found. If you are installing the R packages from source, then you need
to make sure that the library header files are available and that the
headers as well as the actual library is found by the compiler and
linker. Please refer to the \textit{Writing R Extensions} manual and
to the FAQ lists on www.r-project.org.

%------------------------------------------------------------------
\subsubsection{Using biomaRt to annotate the target genes online}
%------------------------------------------------------------------
In the remainder of this section, we
will demonstrate how to obtain the dataframe \Robject{bdgpbiomart} by
querying the online webservice \textit{BioMart} and through it the
Ensembl genome annotation database~\cite{Ensembl2006}.
%
<<get path for Rnw files, results=hide, eval=FALSE>>=
rnwPath <- system.file("doc/Rnw", package="cellHTS2")
setwd(rnwPath)
system(sprintf("cp biomart.tex %s", workPath))
setwd(workPath)
@ 
%
<<runBiomart, echo=FALSE, eval=FALSE>>=
cat("Weaving the biomart vignette part...")
setwd(rnwPath)
Sweave("biomart.Rnw")
setwd(workPath)
stop()
@ 
%
\input{Rnw/biomart.tex}
%\input{biomart.tex}
%
%------------------------------------------------------------
\section{Report}
\label{sec:report}
%------------------------------------------------------------
We have now completed the analysis tasks: the dataset has been read, 
configured, normalized, scored, and annotated:
%
<<printxagain>>=
xsc
@
%
We can now save the scored data set to a file. 
%
<<savex>>=
save(xsc, file=paste(experimentName, ".rda", sep=""))
@ 
% 
The data set can be loaded again for subsequent analysis, or passed on
to others. To produce a comprehensive report, we can call the function
\Rfunction{writeReport} again, this time specifying the three
\Rclass{cellHTS} objects as separate function arguments: ``raw'',
``normalized'' and ``scored''. We also alter some of the default
output settings using the \Rfunction{setSettings} functions (more details 
are given in section \ref{settings})

<<writeReport2, results=hide, eval=FALSE>>=
setSettings(list(plateList=list(reproducibility=list(include=TRUE, map=TRUE), 
                                intensities=list(include=TRUE, map=TRUE)),
                 screenSummary=list(scores=list(range=c(-4, 8), map=TRUE))))
out <- writeReport(raw=x, normalized=xn, scored=xsc, 
                   force=TRUE)
@ 
%
and use a web browser to view the resulting report.
%
<<browseReport2, eval=FALSE>>=
if (interactive()) browseURL(out)
@ 
% 
The report contains a quality report for each plate, and also for the
whole screening assays.  The per-plate HTML reports display the
scatterplot between duplicated plate measurements, the histogram of
the normalized signal intensities for each replicate, and plate plots
representing, in a false color scale, the normalized values of each
replicate, and the standard deviation between replicate measurements
at each plate position.  It also reportes different measures of
agreement between replicate measurements, such as the repeatability
standard deviation between replicate plates and the Spearman
correlation coefficient between duplicates. The per-plate reports also
show the dynamic range, calculated as the ratio between the geometric
means of the positive and negative controls. These measures can also
be obtained independently from \Rfunction{writeReport} function, by
using the functions \Rfunction{getMeasureRepAgreement} and
\Rfunction{getDynamicRange} provided in \Rpackage{cellHTS2} package.
If different positive controls were specified at the configuration
step and when calling \Rfunction{writeReport}, the dynamic range is
calculated separately for the distinct positive controls, since
different positive controls might have different potencies.

The experiment-wide HTML report presents, for each replicate, the
boxplots with raw and normalized intensities for the different plates,
and two plots for the controls: one showing the signal from positive
and negative controls at each plate, and another plot displaying the
distribution of the signal from positive and negative controls,
obtained from kernel density estimates.  The latter plot further gives
the $Z'$-factor determined for each experiment (replicate) using the
negative controls and each different type of positive
controls~\cite{Oldenburg1999}, as a measure to quantify the distance
between their distributions.  This measure can also be obtained by
calling the function \Rfunction{getZfactor}.

The experiment-wide report also shows a screen-wide plot with the
$z$-scores in every well position of each plate. If the argument
\Robject{map} of \Rfunction{writeReport} function is set to
\Robject{TRUE}, this plot and the plate plots of the per-plate reports
contain tooltips (information pop-up boxes) dispaying the annotation
information at each position within the plates.  If the scored
\Rclass{cellHTS} object provided for \Rfunction{writeReport} is not
annotated with gene identifiers, the annotation information shown by
the tooltips is simply the well identifiers.  For an annotated
\Rclass{cellHTS} object, if an optional column called
\textit{GeneSymbol} was included in the \emph{annotation file} (see
Section~\ref{sec:annotation}), and therefore is present in
\Robject{featureData} slot of the annotated object, then its content
is used for the tooltips.  Otherwise, the content of column ``GeneID''
of the \Robject{featureData} slot (which can be accessed via
\Robject{geneAnno}) is considered.

The screen-wide image plot can also be produced separately using the
function \Rfunction{imageScreen} given in the \Rpackage{cellHTS2}
package. This might be useful if we want to select the best display
for our data, namely, the aspect ratio for the plot and/or the range
of $z$-score values to be mapped into the color scale. These can be
passed to the function's arguments \Robject{ar} and \Robject{zrange},
respectively. For example,

<<imageScreen, eval=FALSE, results=hide>>=
imageScreen(xsc, ar=1, zrange=c(-3,4))
@ 

It should be noted that the per-plate and per-experiment quality
reports are constructed based on the normalized \Rclass{cellHTS}
object, in case it is provided to \Rfunction{writeReport}
function. Otherwise, it uses the data of the raw \Rclass{cellHTS}
object.  The quality report produced by \Rfunction{writeReport}
function has also a link to a file called \emph{topTable.txt} that
contains the list of scored probes ordered by decreasing $z$-score
values, when the final scored \Rclass{cellHTS} object is
provided. This file has one row for each well and plate, and for the
present example data set, it has the following columns:
\begin{itemize}
\item \verb=plate= plate identifier for each feature;
\item \verb=position= gives the position of the well in the plate
  (runs from 1 to the total number of wells in the plate);
\item \verb=well= gives the alphanumeric well identifier for each
  feature;
\item \verb=score= corresponds to the summarized score calculated for
  each probe (data stored in the scored and summarized object
  \Robject{xsc});
\item \verb=wellAnno= corresponds to the well annotation (as given by
  the plate configuration file);
\item \verb=finalWellAnno= gives the final well annotation for the
  scored values.  It combines the information given in the plate
  configuration file with the values in \Robject{assayData} slot of
  the scored \Rclass{cellHTS} object, in order to have into account
  the wells that have been flagged either by the screen log file, or
  manually by the user during the analysis.  These flagged wells
  appear with the annotation \emph{flagged}.

\item \verb=raw_r1_ch1= and \verb=raw_r2_ch1= contain the raw
  intensities for replicate 1 and replicate 2, respectively (data
  stored in the unnormalized \Rclass{cellHTS} object \Robject{x}).
  'ch' refers to channel;

\item \verb=median_ch1= corresponds to the median of raw measurements
  across replicates;

\item \verb=diff_ch1= gives the difference between replicated raw
  measurements (only given if the number of replicates is equal to
  two);

\item \verb=average_ch1= corresponds to the average between replicate
  raw intensities (only given if the number of replicates is higher
  than two);

\item \verb=raw/PlateMedian_r1_ch1= and \verb=raw/PlateMedian_r2_ch1=
  give the ratio between each raw measurement and the median intensity
  in each plate for replicate 1 and replicate 2, respectively.  The
  plate median is determined for the raw intensities, using
  exclusively the wells annotated as ``sample''.

\item \verb=normalized_r1_ch1= and \verb=normalized_r2_ch1= give the
  normalized intensities for replicate 1 and replicate 2,
  respectively. This corresponds to the data stored in the normalized
  \Rclass{cellHTS} object \Robject{xn}.

\end{itemize} 

Additionally, if any of the \Rclass{cellHTS} objects provided in the
argument \Robject{cellHTSlist} to \Rfunction{writeReport} has been
annotated (as in the present case), it also contains the data given in
the content of \Robject{featureData} slot of the annotated object.
The above file with the list of scored probes can also be obtained
without the need to run \Rfunction{writeReport} by using the function
\Rfunction{getTopTable} provided in the package.

%------------------------------------------------------------
\subsection{Controlling settings}\label{settings}
%------------------------------------------------------------
The \Rfunction{writeReport} function is highly customizable in terms
of the resulting HTML output. For most of the graphics that get
generated the color scheme, the size, the font size and many other
features can be controlled individually. This control is available
either through session-wide settings using the \Rfunction{setSettings}
function, or for each call of the the \Rfunction{writeReport} function
through the optional \Robject{settings} argument. Most of the plots
can even be completely supressed by switching the respective
\Robject{include} setting to \Robject{FALSE}. Please see
\Rfunction{?settings} for more details.


%------------------------------------------------------------
\subsection{Exporting data to a tab-delimited file}
%------------------------------------------------------------

The \Rpackage{cellHTS2} package contains a function called
\Rfunction{writeTab} to save the data stored in \Robject{assayData}
slot of a \Rclass{cellHTS} object to a tab-delimited file.  The rows
of the file are sorted by plate and well, and there is one row for
each plate and well.  When the \Rclass{cellHTS} object is annotated,
the probe information (i.e. the probe identifiers stored in column
``GeneID'' of the \Robject{featureData} slot) is also added.
%
<<exportData, results=hide, eval=FALSE>>=
writeTab(xsc, file="Scores.txt")
@ 
%
Since you might be interestered in saving other values to a tab
delimited file, below we demonstrate how you can create a matrix with
the ratio between each raw measurement and the plate median, together
with the gene and well annotation, and export it to a tab-delimited
file using the function \Rfunction{write.tabdel}~\footnote{This
  function is a wrapper of the function \Rfunction{write.table},
  whereby you just need to specify the name of the data object and the
  file} also provided in the \Rpackage{cellHTS2} package.
%
<<exportOtherData, eval=FALSE>>=
# determine the ratio between each well and the plate median
y <- array(as.numeric(NA), dim=dim(Data(x)))
nrWell <- prod(pdim(x))
nrPlate <- max(plate(x))
for(p in 1:nrPlate) 
{
    j <- (1:nrWell)+nrWell*(p-1)
    samples <- wellAnno(x)[j]=="sample"
    y[j, , ] <- apply(Data(x)[j, , , drop=FALSE], 2:3, 
                      function(w) w/median(w[samples], 
                                           na.rm=TRUE)) 
}

y <- signif(y, 3)
out <- y[,,1]
out <- cbind(fData(xsc), out)
names(out) = c(names(fData(xsc)), 
sprintf("Well/Median_r%d_ch%d", rep(1:dim(y)[2], dim(y)[3]), 
rep(1:dim(y)[3], each=dim(y)[2])))
write.tabdel(out, file="WellMedianRatio.txt")
@ 
 
%
At this point we are finished with the basic analysis of the
screen. As one example for how one could continue to further mine the
screen results for biologically relevant patterns, we demonstrate an
application of category analysis.



%------------------------------------------------------------
\section{Category analysis}
%------------------------------------------------------------
We would like to see whether there are Gene Ontology
categories~\cite{GO} overrepresented among the probes with a high
score. For this we use the category analysis from Robert Gentleman's
\Rpackage{Category} package~\cite{GentlemanCategories}. Similar
analyses could be done for other categorizations, for example
chromosome location, pathway membership, or categorical phenotypes
from other studies.

%
<<category, results=hide>>=
library("Category")
@ 
%
Now we can create the category matrix. Conceptually, this a matrix
with one column for each probe and one row for each category.  The
matrix element \Robject{[i,j]} is \Robject{1} if probe \Robject{j}
belongs to the \Robject{j}-th category, and \Robject{0} if not.
%In practice, this matrix would be rather large (and perhaps 
%exhaust the memory of your computer), hence we use a type of sparse 
%matrix representation, implemented by the \Rclass{graph} object 
%\Robject{categs}. This bipartite graph's nodes correspond to the rows 
%and columns of the matrix and there is an edge if the matrix element 
%is non-zero.

% <<find obsolete GO ids, echo=FALSE, eval=FALSE>>=
% scores <- as.vector(Data(xsc))
% names(scores) <- geneAnno(xsc)
% sel <- !is.na(scores) & (!is.na(fData(xsc)$go))
% goids = strsplit(fData(xsc)$go[sel], ", ")
% genes = rep(geneAnno(xsc)[sel], listLen(goids))
% ux = unique(unlist(goids, use.names=FALSE))
% require("GO")
% s1 = ux %in% ls(GOMFANCESTOR)
% s2 = ux %in% ls(GOBPANCESTOR)
% s3 = ux %in% ls(GOCCANCESTOR)
% s4 = is.na(ux)
% obsolete = sort(ux[which(!(s1|s2|s3|s4))])
% @ 

<<obsolete GO ids>>=
obsolete <- c("GO:0005489", "GO:0015997", "GO:0045034", "GO:0005660",
              "GO:0006118", "GO:0006512", "GO:0045045", "GO:0006125", 
              "GO:0043072", "GO:0006100", "GO:0048740")
@ 

Some distractions are the GO terms 
\Robject{\Sexpr{paste(obsolete, collapse=", ")}}, 
which are annotated to some of the genes, but are obsolete.
%
<<cat1>>=
scores <- as.vector(Data(xsc))
names(scores) <- geneAnno(xsc)
sel <- !is.na(scores) & 
       (!is.na(fData(xsc)$go_biological_process_id))
goids <- strsplit(fData(xsc)$go_biological_process_id[sel], 
                  ", ")
goids <- lapply(goids, function(x) x[!(x %in% obsolete)])
genes <- rep(geneAnno(xsc)[sel], listLen(goids))
categs <- cateGOry(genes, unlist(goids, use.names=FALSE))
@ 
% 
We will select only those categories that contain at least 3 and no
more than 1000 genes.
<<cat2>>=
nrMem <- rowSums(categs) # number of genes per category
remGO <- which(nrMem < 3 | nrMem > 1000)
categs <- categs[-remGO,,drop=FALSE]
##  see if there are genes that don't belong to any category 
## after applying the filter
nrMem <- rowSums(t(categs))
rem <- which(nrMem==0)
if(length(rem)!=0) categs <- categs[,-rem, drop=FALSE]
@ 
%
As the statistic for the category analysis we use the
$z$-score. First, we need to select the subset of genes that actually
have GO annotation:
%
<<cat3>>=
stats <- scores[ sel & (names(scores) %in% colnames(categs)) ]
@

There are some replicated probes in \Robject{stats}. We will handle
this by taking the maximum value between replicate probes
(non-conservative approach):
<<handle replicates>>=
## handle duplicated genes in stats:
isDup <- duplicated(names(stats))
table(isDup)
dupNames <- names(stats)[isDup]
sp <- stats[names(stats) %in% dupNames]
sp <- split(sp, names(sp))
table(sapply(sp, length)) 
aux <- stats[!isDup]
aux[names(sp)] <- sapply(sp, max)
stats <- aux
rm(aux)
@ 
%
Before calling the category summary functions, we need to order our
statistic vector according to the names of the columns of the category
matrix.
%
<<arrange probes>>=
m <- match(colnames(categs), names(stats))
stats <- stats[m]
stopifnot(colnames(categs)==names(stats))
@
% 
Finally, we are ready to call the category summary functions:
%
<<cat6>>=
acMean <- applyByCategory(stats, categs)
acTtest <- applyByCategory(stats, categs, 
                           FUN=function(v) 
                           t.test(v, stats)$p.value)
acNum <- applyByCategory(stats, categs, FUN=length)
isEnriched <- (acTtest<=1e-3) & (acMean>0.5)
@ 
%
A volcano plot of the $-\log_{10}$ of the $p$-value \Robject{acTtest}
versus the per category mean $z$-score \Robject{acMean} is shown in
Figure~\ref{cellhts2Complete-volcano.png}. For a given category, the
$p$-value is calculated from the $t$-test against the null hypothesis
that there is no difference between the mean $z$-score of all probes
and the mean $z$-score of the probes in that category.  To select the
enriched categories (\Robject{isEnriched}), we considered a
significance level of $0.1\%$ for the $t$-test, and a per category
mean $z$-score greater than $0.5$. This led to the
\Sexpr{sum(isEnriched)} categories marked in red in
Figure~\ref{cellhts2Complete-volcano.png} are listed in
Table~\ref{tab:enrichedGoCateg}.

 
%
\input{cellhts2-enrichedGoCateg} 
%
\myincfig{cellhts2Complete-volcano.png}{180px}{Volcano plot of the
  $t$-test $p$-values and the mean $z$-values of the category analysis
  for Gene Ontology categories. The top categories are shown in red.}
%
<<volcano, echo=FALSE, results=hide>>=
png("cellhts2Complete-volcano.png", width=180, height=180)
par(mai=c(0.9,0.9,0.1,0.1))
px <- cbind(acMean, -log10(acTtest))
plot(px, main='', xlab=expression(z[mean]), 
     ylab=expression(-log[10]~p), pch=".", col="black")
points(px[isEnriched, ], pch=16, col="red", cex=0.7)
dev.off()
stopifnot(identical(names(acMean), names(acTtest)),  
          identical(names(acMean), names(acNum))) 
@ 
%
<<enrichedGoCateg, echo=FALSE, results=hide>>=
enrichedGOCateg <- names(which(isEnriched))
require("GO.db")
res <- data.frame(
   "$n$" = acNum[isEnriched], 
    "$z_{\\mbox{\\scriptsize mean}}$" = signif(acMean[isEnriched],2), 
    "$p$" = signif(acTtest[isEnriched],2),
    "GOID" = I(enrichedGOCateg),
    "Ontology" = I(sapply(enrichedGOCateg, function(x) Ontology(get(x, GOTERM)))),
    "description" = I(sapply(enrichedGOCateg, function(x) Term(get(x, GOTERM)))),
    check.names=FALSE)

mt <- match(res$Ontology, c("CC", "BP", "MF"))
stopifnot(!any(is.na(mt)))
res <- res[order(mt, res$"$p$"), ]

cellHTS2:::dataframeOutput(res, header=TRUE, 
  caption=sprintf("Top %d Gene Ontology categories with respect to $z$-score.", nrow(res)),
  label="enrichedGoCateg", gotable=TRUE)
@ 
%

We have recently added an experimental feature to the
\Rpackage{cellHTS2} package in order to integrate such category
analyses into the HTML report framework. It leverages functionality
from the \Rpackage{GSEABase} package, which was specifically designed
to deal with gene sets. The feature is exemplyfied using the KEGG
pathway information available though the \Rpackage{KEGG.db} package.

First we have to create an object of class \Rclass{GeneSetCollection}
which represents the mapping of genes to KEGG pathways. The geneIDs in
the gene set collection are supposed to map to the geneIDs of the
assay scores. From the \Rpackage{KEGG.db} package we get a list of
pathway mapping to FlyBase identifiers.
<<geneset>>=
library(KEGG.db)
library(GSEABase)
kegg <- as.list(KEGGPATHID2EXTID)
kegg <- kegg[grep("dme", names(kegg))]
gsc <- 
    GeneSetCollection(mapply(function(keggId, geneId)
                             GeneSet(unique(geneId), 
                                     geneIdType=EntrezIdentifier(),
                                     collectionType=KEGGCollection(keggId),
                                     setName=keggId), 
                             names(kegg), kegg))
@ 
%
In a subsequent step we want to filter out all the assay scores
without proper annotation (e.g., control wells) and also restrict our
collection of gene sets to those containing at least 5 genes. The gene
annotations of our assay are the old CG identifiers, but we have
retrieved a mapping to proper FlyBase identifiers in the previous
\Rpackage{biomaRt} step.
<<filterGeneset>>=
scores <- as.vector(Data(xsc))
names(scores) <- fData(xsc)$flybase_gene_id
sel <- !is.na(scores) & !is.na(names(scores))
scores <- scores[sel]
                                
## Only consider set with more than 5 genes
gsc <- gsc[sapply(gsc, length) > 5]
gsNames <- mget(gsub("dme", "", names(gsc)), KEGGPATHID2NAME)
ann <- data.frame(Name=I(unlist(gsNames)))
rownames(ann) <- names(gsc)
@ 
%
Finally, we need to bundle up the assay scores and the gene set
collection and define one or several summary statistics to show in the
HTML report. Each of these statistics is supposed to be created by a
separate function, which takes one mandatory and one optional
argument. The mandatory argument is simply a vector of assay scores
for the respective gene set. The optional second argument is a numeric
vector of all available scores. This setup allows to compute simple
statistics like means or medians, but also more complicated test
statistics, for instance a $t$-test. The container for this
information is an object of class \Rclass{gseaModule} and there is a
convenient constructor which does most of the job.

In the following example, we define four test statistics (mean,
$p$-value of a $t$-test, its $t$-statistic and the number of genes in
the gene set) and pass them to the constructor as a list. We also add
a data.frame with further gene set annotation (in this case more human
readable names for the pathways) and the vector of assays scores. Note
that if no explicit vector is given, the scores of the
\Rclass{cellHTS} object are used. In our case their identifiers don't
match the gene sets FlyBase identifiers, and we also wanted to do a
couple of extra filtering steps.
<<gseamodule>>=
gmod <- gseaModule(gsc, list("Mean"=function(x,...) mean(x),
                             "P-value"=function(x,y) t.test(x, y)$p.value,
                             "T-statistic"=function(x,y) t.test(x, y)$statistic,
                             "NrGenes"=function(x,...) length(x)),
                   annotation=ann, scores=scores)
@ 
%
We can now re-write the \Rclass{cellHTS} report using the familiar
Rfunction{writeReport} function with the addition \Robject{gseaModule}
argument.
<<rewriteReport, eval=FALSE>>=
## Now load the cellHTS objects and rewrite the report
out <- writeReport(raw=x, normalized=xn, scored=xsc, 
                   force=TRUE,
                   outdir=tempdir(), gseaModule=gmod)
if (interactive()) browseURL(out)
@ 


%------------------------------------------------------------
\section{Comparison with the results previously reported}
%------------------------------------------------------------

In this section we compare the current results obtained using
\Rpackage{cellHTS} package, with the ones previously reported in
Boutros \textit{et al.}~\cite{Boutros2004}. The file
``Analysis2003.txt'' in the same directory as the input data files,
i.\,e.\, in \Robject{\Sexpr{experimentName}} directory of the
\Rpackage{cellHTS} package.  First, We will load this file:
%
<<load file with previous analysis>>=
data2003 <- read.table(file.path(dataPath, "Analysis2003.txt"), 
                       header=TRUE, as.is=TRUE, sep="\t")
@ 
% 
The file contains the columns 
\Robject{\Sexpr{paste(names(data2003),collapse=", ")}}.  
The scored values in the \Robject{Scores}
column will be compared with the ones obtained in our analysis. For
that, I will start by adding to \Robject{data2003}, a column with the
corresponding $z$-score values calculated using the \Rpackage{cellHTS}
package.
% 
<<add the current scored values>>=
i <- data2003$Position + 384*(data2003$Plate-1)
data2003$ourScore <- as.vector(Data(xsc))[i]
@ 
%
%
\myincfig{cellhts2Complete-scoresComparison.png}{324px}{Scored values
  obtained in the paper of Boutros \textit{et al.} against the scored
  values calculated herein. Each panel corresponds to one 384-well
  plate.  Axis labels are not pretty - they overlap with neighboring
  panels due to space constraints.}
%
<<scoresComparison,echo=FALSE,results=hide>>=
png("cellhts2Complete-scoresComparison.png", width=324, height=324)
par(mfrow=c(7,9), mai=c(0,0,0,0))
for(i in 1:max(data2003$Plate)) 
{
   sel <- (data2003$Plate==i)
   plot(data2003$ourScore[sel], data2003$Score[sel], pch=19, cex=0.6)
}
dev.off()
@ 
%
Figure~\ref{cellhts2Complete-scoresComparison.png} shows the
scatterplot between Boutros \textit{et al.}'s scores and our scores in
each of the 384-well plates.  The results between the two analyses are
very similar, except for two minor details: use of robust estimators
of location and spread (median and MAD instead of mean and standard
deviation), and estimation of MAD over the whole experiment instead of
plate-by-plate.  In fact,
Figure~\ref{cellhts2Complete-scoresComparison.png} evidenciates how
the scored values exactly agree up to an offset (mean versus median)
and scale (standard deviation versus MAD).
%

%------------------------------------------------------------
\section{Appendix: How to convert \Rpackage{cellHTS} to \Rpackage{cellHTS2} configuration files}
\label{sec:migratingProjects}
%------------------------------------------------------------

We advise the users of \Rpackage{cellHTS} package to start using the
improved package \Rpackage{cellHTS2} described herein, since the
latter provides better functionality for working with multi-channel
screens and multiple screens.

To facilitate this transition and help users to migrate their
\Rpackage{cellHTS}-specific projects to \Rpackage{cellHTS2}, we
provide in this package a function that converts the old S3
\Robject{cellHTS} object associated with \Rpackage{cellHTS} package
into one or several S4 \Robject{cellHTS} defined with the
\Rpackage{cellHTS2} package. This function is called
\Rfunction{convertOldCellHTS}.

However, you might want to migrate an existing project from start,
i.\,e.\,, redo all the steps starting by reading the intensity files
and configuring the screening data. In this case, you need to update
the screen log file (if available), the screen description file and
the screen configuration file~\footnote{The expected format of the
  other input files, namely the raw intensity data files
  (Section~\ref{sec:read})
  and the annotation file (Section~\ref{sec:annotation}) remains unchanged between the two packages.}.\\

% Screen description file
Regarding the screen description file, as mentioned in
Section~\ref{sec:screenconf}, we provide a function that creates a
template screen description file that can be edited and modified by
the user. Below we examplify how such file can be created:
%
<<example for description file>>=
out <- templateDescriptionFile("template-Description.txt",
                               force=TRUE)
out
readLines(out)
@
%
\input{cellhts2-cellHTSpackage-specificplateconfiguration} 
\input{cellhts2-cellHTSpackage-specificscreenlog} 

% Create the example table for the older formats of plate configuration and screen log files
<<old plateConfscreenLogTable, results=hide, echo=FALSE>>=
cellHTS2:::tableOutput(file.path(dataPath, "old-Plateconf.txt"), 
                       "cellHTS package-specific plate configuration", selRows=1:28)
cellHTS2:::tableOutput(file.path(dataPath, "old-Screenlog.txt"), 
                       "cellHTS package-specific screen log", selRows=1:3)
@ 
%

% Screen log file
The format of the screen log file compatible with \Rpackage{cellHTS2}
package is shown in Table~\ref{tab:screenlog}.  Compared to the
previous format required for \Rpackage{cellHTS} package
(Table~\ref{tab:cellHTSpackage-specificscreenlog}), we note that
column \textit{Filename} was
replaced by two columns named \textit{Plate} and \textit{Sample}. \\


% Screen configuration file
In \Rpackage{cellHTS} package, the concept of \textit{batch} is
intrinsically related with the plate configuration, since a change in
plate configuration along the experiment had to be handled by setting
each distinct plate configuration as corresponding to a different
batch.  Therefore, in \Rpackage{cellHTS} package, the plate
configuration file had three mandatory columns named \textit{Batch},
\textit{Well}, \textit{Content}, where the \textit{Batch} column
allowed for different plate configurations. The first 28 lines of such
file for the example RNAi screen considered in this report is shown in
Table~\ref{tab:cellHTSpackage-specificplateconfiguration}.  Thus, in
the old format required for the plate configuration file, we had to
have a number of rows equal to the product between the total number of
batches and the total number of wells per plate.

In contrast to \Rpackage{cellHTS} package, in \Rpackage{cellHTS2}
package the concept of \emph{batch} and multiple plate configurations
were made independent (see Section~\ref{sec:multPlateConfs}).  For
\Rpackage{cellHTS2} package, Table~\ref{tab:plateconfiguration} shows
the required file format, which was discussed in more detail in
Section~\ref{sec:plateconf}.

Due to the separation between batch and multiple plate configuration,
the column \textit{Batch} was removed, and replaced by the column
\emph{Plate}. The other two mandatory columns \emph{Well} and
\emph{Content} were kept.  Additionally, we now require that this file
contains two extra header lines giving the total number of wells and
plates (Table~\ref{tab:plateconfiguration}).  There is also an
improvement in the file format related with the fact that \emph{Plate}
and \emph{Well} columns now allow the use of regular expressions (see
Section~\ref{sec:plateconf} for more specific information), which
allows to cover the plate configuration used in the screen with just a
few lines. Besides, it allows to specify different configurations
within and between assay plates.



%------------------------------------------------------------
\section{Appendix: Normalization methods implemented in
  \Rpackage{cellHTS2} package}
\label{sec:normalizationMethods}
%------------------------------------------------------------


There are two main normalization methods available with
\Rpackage{cellHTS2} package: methods based on the use of reference
controls, and distribution-based methods.  These methods can be
applied using the function \Rfunction{normalizePlates}.


%------------------------------------------------------------
\subsection{Controls-based normalization}
%------------------------------------------------------------

% POC
% NPI
% negatives

%------------------------------------------------------------
\subsubsection{Percent of control}
%------------------------------------------------------------

Percent of control (POC) is a preprocessing method that tries to
correct for plate-to-plate variability by normalizing each $k$th
compound raw measurements in the $i$th result file, $x_{ki}$, relative
to the average of within-plate controls.  In an antagonist (or
inhibition) type assay, it is defined as:

\begin{equation}
x_{ki}^{\text{POC}} = \frac{x_{ki}}{\mu_i^{\text{pos}}} \times 100
\end{equation}
\noindent
where $\mu_i^{\text{pos}}$ is the average of the measurements on the
positive controls in the $i$th result file (i.\,e.\,, for a given
plate and replicate).

In \Rpackage{cellHTS2} package, this method can be applied by setting
the argument \Robject{method="POC"} when calling
\Rfunction{normalizePlates} function.

We also provide in the package a normalization method for
\Rfunction{normalizePlates} (\Robject{method="negatives"}) that
consists of scaling the plate measurements by the per-plate median of
the intensities on the negative controls~\footnote{If the scale of the
  data is defined as being additive (i.\,e.\,, argument
  \Robject{scale="additive"}, or arguments
  \Robject{scale="multiplicate"} and \Robject{log=TRUE}), measurements
  are subtracted by the median of per-plate negative controls
  instead.}.


%------------------------------------------------------------
\subsubsection{Normalized percent inhibition}
%------------------------------------------------------------

If \Rfunction{normalizePlates} is called with \Robject{method="NPI"},
the method known as normalized percent inhibition (NPI) is applied in
a per-plate basis to correct for plate effects.  For an antagonist
assay, this method divides the difference between each measurement in
a given result file $i$ ($x_{ki}$) and the average of the positive
controls on that plate ($\mu_i^{\text{pos}}$) by the difference
between the averages of the measurements on the positive
($\mu_i^{\text{pos}}$) and the negative controls
($\mu_i^{\text{neg}}$):

\begin{equation}
x_{ki}^{\text{NPI}} = \frac{\mu_i^{\text{pos}}-x_{ki}}{\mu_i^{\text{pos}}-\mu_i^{\text{neg}}}
\end{equation}

%------------------------------------------------------------
\subsection{Non-controls-based normalization}
% or distribution-based normalization
%------------------------------------------------------------

% Z score
% plate-median normalization 
% B score
% spatial normalization ("loess" or "locfit")

There are several normalization method implemented in
\Rpackage{cellHTS2} package that make use of the overall distribution
of values, instead of relying exclusively on controls.  These are
described in the following sections.

%------------------------------------------------------------
\subsubsection{Z score method}
%------------------------------------------------------------

Z score is a simple and widely known normalizing method that is
performed in a per-plate basis as follows:

\begin{equation}
x_{ki}^{\text{Z}} = \frac{x_{ki} - \mu_i}{\sigma_i},
\end{equation}
\noindent
where $\mu_i$ and $\sigma_i$ are the mean and standard deviation,
respectively, of all measurements within the $i$th result file
(replicated plate).  In the Z score method, measurements are re-scaled
relative to within-plate variation by subtracting the average of the
plate values and dividing this difference by the standard deviation
estimated from all measurements of the plate.

In \Rpackage{cellHTS2}, we consider a robust version of this method,
where the mean and standard deviation are replaced by the median and
the MAD, respectively, calculated at the sample wells. This robust Z
score method is performed by calling \Rfunction{normalizePlates} as
follows:
%
<<Z score method, eval=FALSE>>=
xZ <- normalizePlates(x, scale="additive", log=FALSE,
                      method="median",
                      varianceAdjust="byPlate") 
@ 

%
%------------------------------------------------------------
\subsubsection{Plate median normalization}
%------------------------------------------------------------

Plate median normalization involves calculating the relative signal of
each well compared to the median of the sample wells in the plate, as
shown in Equation (\ref{eq:normalizePlateMedian}) and Equation
(\ref{eq:plateMedian}).  The median is calculated among the $m$ wells
containing \textit{sample} (i.\,e.\,, for wells that contain genes of
interest) in result file $i$.  Plate median normalization can be
chosen by setting \Robject{method="median"} in
\Rfunction{normalizePlates}. When applied to data on additive scale,
the plate median normalization involves subtracting the plate
measuments by the per-plate median instead.

We also have two variants of the plate median scaling which consist of
using as the per-plate scaling factor $M_i$ the per-plate average
intensity on sample wells (\Robject{method="mean"}) or the midpoint of
the shorth of the per-plate distribution of values on sample wells
(\Robject{method="shorth"}).\\

The next methods are intended to explicitly correct for \emph{spatial
  effects} within plates, i.\,e.\,, the presence of intensity
gradients within the plates.  Such signal gradients can be caused by
differences in temperature, incubation time or concentration, etc., in
different wells across a plate.  Typically, these gradients produce
repeatitive patterns, which make it possible to distinguish them from
real actives that should be more or less randomly dispersed for
quasi-randomized collections of compounds.


%------------------------------------------------------------
\subsubsection{B score method}
%------------------------------------------------------------

In the B score method, row and column biases within each plate are
explicitly corrected for by fitting a two-way median polish to the raw
data in a per-plate fashion~\cite{Brideau2003}:

\begin{eqnarray}
e_{rci} = x_{rci} - \hat{x}_{rci} = x_{rci} - \left( \hat{\mu}_i + \hat{R}_{ri} + \hat{C}_{ci}\right) \\ \label{eq:BscoreResiduals}
x^{\text{B}}_{rci} = \frac{e_{rci}}{MAD_i}	\label{eq:BscoreVarAdj}
\end{eqnarray} 
\noindent
Here, $x_{rci}$ is the measurement value in the $r$th row and $c$th
column of the plate corresponding to the $i$th result file,
$\hat{x}_{rci}$ is the corresponding fitted value defined as the sum
between the estimated average of the replicate plate ($\hat{\mu}_i$),
the estimated systematic offset for row $r$ ($\hat{R}_{ri}$) and the
systematic offset for column $c$ ($\hat{C}_{ci}$) in that replicated
plate.  In a second step, each of the obtained residual values
$e_{rci}$'s of the $i$th result file are divided by their median
absolute deviation ($MAD_i$) giving the final B score value --
Equation~(\ref{eq:BscoreVarAdj}).


We implemented a method similar to the B score method described in
Malo \textit{et al.}~\cite{Malo2006} and Brideau \textit{et
  al.}~\cite{Brideau2003} using the Tukey's median polish
procedure~\cite{Tukey1977} (function \textit{medpolish} of the package
\textit{stats}) which fits an additive model to the data according to
Equation~(\ref{eq:BscoreResiduals}).  The Tukey's median polish
algorithm works by alternately removing the row and column medians and
continues until the proportional reduction in the sum of absolute
residuals is less than $\epsilon$ or until the maximum number of
iterations has been reached.  The B score method can be applied by
defining argument \Robject{method="Bscore"} in
\Rfunction{normalizePlates}. Alternatively, the method can be applied
by calling a separate function
called \Rfunction{Bscore} provided in the \Rpackage{cellHTS2} package.\\


In \textit{cellHTS2} package, we provide two additional spatial
normalization methods that fit a polynomial surface to the intensities
within each assay plate using local regression and that can be
performed via \Rfunction{normalizePlates} or
\Rfunction{spatialNormalization} functions, although we advise to
apply these methods using the former function. The fit can be
performed either using the \textit{loess} procedure or the
\textit{locfit.robust} function of package \textit{locfit}.  In
\Rfunction{normalizePlates}, if \Robject{method="locfit"}, spatial
effects are removed by fitting a bivariate local regression to each
plate and replicate, while if \Robject{method="loess"}, a loess curve
is fitted instead.





%------------------------------------------------------------
\section{Appendix: Data transformation}
%------------------------------------------------------------
\myincfig{cellhts2Complete-transfplots.png}{324px}{Comparison between
  untransformed (left) and logarithmically (base 2) transformed
  (right), normalized data.  Upper: histogram of intensity values of
  replicate 1.  Middle: scatterplots of standard deviation versus mean
  of the two replicates.  Bottom: Normal quantile-quantile plots.}

An obvious question is whether to do the statistical analyses on the
original intensity scale or on a transformed scale such as the
logarithmic one.  Many statistical analysis methods, as well as
visualizations work better if (to sufficient approximation)
\begin{itemize}
\item replicate values are normally distributed,
\item the data are evenly distributed along their dynamic range, 
\item the variance is homogeneous along the dynamic range~\cite{Huber2002ismb}.
\end{itemize}

Figure~\ref{cellhts2Complete-transfplots.png} compares these
properties for untransformed and log-transformed normalized data,
showing that the difference is small.  Intuitively, this can be
explained by the fact that for small $x$,
\[
\log(1+x)\approx x
\]
and that indeed the range of the untransformed data is mostly not far
from 1.  Hence, for the data examined here, the choice between
original scale and logarithmic scale is one of taste, rather
than necessity.
%
<<transfplots, echo=FALSE, results=hide, eval=TRUE>>=
library(vsn)
myPlots=function(z, main, plotCol) 
{
  z <- as.data.frame(z)
  colnames(z) <- paste0("Sample", seq_len(ncol(z)))
  gh <- ggplot2::ggplot(z, ggplot2::aes(x=Sample1))+
    ggplot2::geom_histogram(fill="darkblue")+
    ggplot2::ggtitle(main)
  gm <- meanSdPlot(as.matrix(z), plot=FALSE)$gg+
    ggplot2::ylim(c(0, quantile(abs(z[,2]-z[,1]), 0.95, na.rm=TRUE)))+
    ggplot2::theme(legend.key.size=unit(0.02, "npc"), legend.position="top")
  gq <- ggplot2::qplot(sample=z$Sample1)
  print(gh, vp=viewport(layout.pos.row=1, layout.pos.col=plotCol))
  print(gm, vp=viewport(layout.pos.row=2, layout.pos.col=plotCol))
  print(gq, vp=viewport(layout.pos.row=3, layout.pos.col=plotCol))
}
png("cellhts2Complete-transfplots.png", width=400, height=600)
grid.newpage()
pushViewport(viewport(layout=grid.layout(3, 2)))
dv <- Data(xn)[,,1]
myPlots(dv, main="untransformed", plotCol=1)
xlog <- normalizePlates(x, scale="multiplicative", log=TRUE, 
                        method="median", varianceAdjust="byExperiment")
dvlog <- Data(xlog)[,,1]
myPlots(dvlog, main="log2", plotCol=2)
dev.off()
@
%

%---------------------------------------------------------
\section{Session info}
%---------------------------------------------------------
\begin{table*}[h]%[tbp]
\begin{minipage}{\textwidth}
<<sessionInfo, results=tex, print=TRUE>>=
toLatex(sessionInfo())
@ 
\end{minipage}
\caption{\label{tab:sessioninfo}%
The output of \Rfunction{sessionInfo} on the build system 
after running this vignette.}
\end{table*}


%------------------------------------------------------------
%Bibliography
%------------------------------------------------------------
\bibliography{cellhts}
\bibliographystyle{plain}

\end{document}