%\VignetteIndexEntry{An introduction to OperaMate}
%\VignetteKeywords{Preprocessing, CellBasedAssays}
%\VignettePackage{OperaMate} 

\documentclass[10pt, oneside]{article}

<<style-Sweave, eval = TRUE, echo = FALSE, results = tex>>=
if (requireNamespace("BiocStyle", quietly = TRUE)) {
    BiocStyle::latex()
}
@
\usepackage{hyperref}

\newcommand{\thetitle}{OperaMate: data importing, processing and
  analysis for Opera High Content Screening System}
\title{\textsf{\textbf{\thetitle}}} \author{Chenglin
  Liu\\[1em]Shanghai Jiaotong University,\\ Shanghai,
  China\\ \texttt{cliu@sjtu.edu.cn} \and Yixue Li\\[1em]Shanghai
  Jiaotong Univeristy,\\ Shanghai, China\\ \texttt{yxli@sibs.ac.cn}}

\begin{document}

\maketitle

\tableofcontents

\section{Introduction}
The \Rpackage{OperaMate} is intended to analyze protein intensity data derived 
from the images from PerkinElmer's Opera High Content Screening System 
(\url{http://www.perkinelmer.com/pages/020/cellularimaging/products/opera.xhtml})
and intepreted by Columbus Image Data Storage and Analysis System
(\url{http://www.perkinelmer.com/pages/020/cellularimaging/products/columbus.xhtml}).
PerkinElmer's Opera High Content Screening System is the 
state-of-the-art confocal microplate imaging solution for high throughput
screening. The system works with complex disease models and offers various 
solutions like protein expression, RNAi screening. 
Its compatible tool Columbus image analysis system
exports intensity data by analyzing the Opera images, 
but lacks further processing and analysis to
uncover the biological meaning underlying the image data. Hence, we develop an R \cite{Rpackage} package 
\Robject{OperaMate} especially to process the intensity data exported by Columbus system,
which can fulfills the procedure of data importing, processing and analysis in
an easy and efficient way.

\section{Getting Started}
\Rpackage{OperaMate} integrates the entire process of data importing,
processing and analysis into one function, which is easy to operate
for the users who are new to the R language. However, it is also very
convinient to customize each step according to the pipeline built by
this function, checking the immediate results of each step, and
re-performing a specific step with different parameters to achieve the
desire of the users.

\subsection{Configration}\label{section:config}
In order to process and analyze data by one function, the configuration file
should be supplied. The templete can be obtained together with the package named param.txt.
Users are required to fill the blanks after colons. Leave them as blank if NULL is expected.
An example can be obtained by :
<<demo_config, echo = FALSE, eval = TRUE, include = FALSE>>=
file <- file.path(system.file("Test", package = "OperaMate"), "demoData", "demoParam.txt")
## file.show(file)
@ 
Required column names of \Robject{genemap}:
\begin{itemize}
\item Barcode:  character, the barcode of the plates.
\item Well: character, the well ID.
\item GeneSymbol: character, the annotated gene names of the well.
\end{itemize}
An example can be obtained by :
<<demo_genemap, echo = FALSE, eval = TRUE, include = FALSE>>=
file <- file.path(system.file("Test", package = "OperaMate"), "demoData", "genemap.csv")
genemap <- read.csv(file)
head(genemap, n = 5)
@ 

\subsection{Running OperaMate}
<<run_pipeline, echo = TRUE, eval = TRUE, include = FALSE>>=
library(OperaMate) 
configFile <- file.path(system.file("Test", package = "OperaMate"), "demoData", "demoParam.txt")
operaReport <- operaMate(configFile, gDevice = "png")
names(operaReport)
@
See the mannual for more details about this function.

\subsection{Description of output}
All the reports and figures of the demo data can be find at
<<out_dir, echo = TRUE, eval = FALSE, include = FALSE>>=
tempdir()
@ 

\section{Class declaration}
\subsection{The expData class}\label{section:expData}
Each \Robject{expData} object stores data of one imaging analysis
report of the Opera system generatd by Columbus$^{TM}$ Image Data
Storage and Analysis System. The report includes different types of
data of one plate, distinguished by the paramters. The format of the
report is set in the analysis system. \Rpackage{OperaMate} supports
two most popular formats by now: the matrix and the table. The class
requires the following information:

\begin{enumerate}
\item \Robject{name}, name of the plate
\item \Robject{path}, path of the importing file
\item \Robject{rep.id}, replicate ID
\item \Robject{exp.id}, barcode of the experiment
\item \Robject{format}, report format
  \begin{enumerate}
  \item \Robject{Matrix}: matrix format in the Columbus analysis
    system
  \item \Robject{Tab}: table format in the Columbus analysis system
  \end{enumerate}
\end{enumerate}

An example of creating a new expData object is as follows:
<<create_one_plate, eval = TRUE>>= 
onePlate <- expData(name = "DSIMGA02-s1",
                     path = file.path( 
                       system.file("Test",package = "OperaMate"),
                       "Matrix", "130504-s1-02.txt" ),
                     rep.id = "s1", exp.id = "DSIMGA02", 
                     format = "Matrix")
@
%def

However, it is highly recommended to import all files using the
function \Robject{loadAll}. Details will be referred in Section
\ref{section:import}.

\subsection{The cellData class}
An object of \Robject{expData} class stores all data corresponding to
one parameter of the reports. It organizes the data as a matrix with
rows are well IDs and columns the plate IDs. This class requires to 
provide the name of the object, which must be one parameter of the report.
Additional information includes:
\begin{enumerate}
\item \Robject{posctrwell}, the well IDs of the positive controls, e.g. B05
\item \Robject{negctrwell}, the well IDs of the negative controls, e.g. B05
\item \Robject{expwell}, the well IDs of the samples, e.g. C12
\item \Robject{norm.method}, one method from "MP", "PMed", "Z", "Ctr", "None"
\item \Robject{QC.threshold}, the thresholds for quality control
\end{enumerate}

The \Robject{expData} objects provide different places for different
levels of the data: the raw data in the \Robject{origin.data} slot;
the normalized data in the \Robject{norm.data} slot and data after
quality control in the \Robject{qc.data} slot. In addition, the
general quality of each plate is stored in the \Robject{plate.quality}
slot.

An example of creating a new cellData object is as follows:
<<create_one_cell, eval=TRUE>>= 
oneCell <- cellData(name = "Average Intensity of Nuclei",
                    positive.ctr = c("H02", "J02", "L02"),
                    negative.ctr = c("C23", "E23", "G23"))
oneCell
@
%def
\section{Data importing and processing}
\subsection{Raw data importing and re-organization}\label{section:import}
\Rpackage{OperaMate} imports all reports of the Columbus analysis
system to \Robject{expData} objects using the function
\Robject{loadAll}, and then re-organizes them to several
\Robject{cellData} objects corresponding to different parameters of
the reports. The function \Robject{loadAll} requires all reports are
of the same data format and are placed in the same location. The structure of
the file name can be specified by an example, e.g. 
egFilename = list(eg.filename = "0205-s2-01.txt",
rep.id = "s2", exp.id = "01", sep = "-", barcode = "DSIMGA01").
If the file formats can meet these requirements, assign
\Robject{cellformat} as ''Matrix'' or ''Tab'', and pass the location
of the files to \Robject{datapath}. Otherwise, you need to specify the information of
each file by a data.frame variable. An example can be referred by \Rcode{data(platemap)}.

Then, you can import the data and construct \Robject{cellData} objects as follows:
<<import_data, eval = TRUE>>= 
datapath <- file.path(system.file("Test", package = "OperaMate"), "Matrix")
lstPlates <- loadAll(cellformat = "Matrix", datapath = datapath,
                     egFilename <- list(eg.filename = "Tab.130504-s1-01.txt",
                                        rep.id = "s1", exp.id = "01", sep = "-",
                                        barcode = "DSIMGA01")
                     )
oneCell <- cellData(name = "Average Intensity of Nuclei")
oneCell <- cellLoad(oneCell, lstPlates, neglect.well = c("*02", "*23"),
                    positive.ctr = c("H02", "J02", "L02"),
                    negative.ctr = c("C23", "E23", "G23"))
str(oneCell["origin.data"])
@

If the reports are of other formats, you can redefine the function
\Robject{parseTemplete}. See the manual for more information.

\subsection{Data normalization}
Data normalization is to reducing the systematical technical variance
among the raw data. Technical sources of variation are unavoidable
during experiments, and different amounts of variance occur in
different rounds of experiments, which are referred as ''batch
effect''. As to the plate-based assays, different techincal variations
are added to different plates, and different locations of well are
suffered with different amounts of noise, especially the edge
wells. The latter phenomenon is called ''edge effects''.

Good normalization methods help to attenuate the batch effects. The
\Rpackage{OperaMate} package provides several normalization methods
including ''MP'', ''PMed'', ''Z"'' ''Ctr''. Ctr method
divides data by the mean of the plate controls. This approach is often
favored by biologists. However, as to the large sample screening, it
is usually not as accurate as methods which take all samples into
consideration. All of the other methods consider all samples, and use
the majority of samples as a negative reference. The ''PMed'' method
divides data by the median of their corresponding plates. ''MP''
employs the Tukey's median polish procedure, and divides data by the
median of their corresponding plates and wells recursively. "Z" is the
robust z-score method which firstly substracts data by the median of
their plates, and then divides them by the median absolute deviation
of the plates. The default normalization method is ''MP''. However,
if no normalization methods are expected, you can assign the
\Robject{norm.method} as ''None''.

An example is as follows: 
<<norm_cell, eval = TRUE>>= 
oneCell <- cellNorm(oneCell, norm.method = "MP") 
str(oneCell["norm.data"])
@ 
%def

Contrasting colors are very useful to visualize the batch effects. The
\Rpackage{OperaMate} package provides two ways, hierarchical clustering (heatmap),
the boxplot of each plate data. Heatmap method performs hieratical
clustering to data matrix, and a large region of distinguishing color
indicates the corresponding data with different technical
variations. Boxplots visualize the distribution of the data in each plate.
Examples are shown as shown in Figure \ref{fig:normplot}.

<<batch_methods, eval = TRUE>>=
cellViz(oneCell, data.type = c("raw", "norm"), plateID = 1:6, outpath = tempdir())
@

\begin{figure}[tp]
  \begin{center}
    \includegraphics{normplot.png}
    \caption{\label{fig:normplot}% Batch effect visualization. 
      (a-1) Boxplot of raw data.; (a-2) boxplot of norm data; (b-1) heatmap of raw data;
      (b-2) heatmap of norm data.}
  \end{center}
\end{figure}

In addition, you can check a specific plate.

<<visualize_onePlate, eval = TRUE>>=
cellViz(oneCell, data.type = c("raw", "norm"), plateID = 1, outpath = tempdir())
@

\subsection{Quality control }\label{section:qc}
Before using the data for biological analysis, you should verify that
your data passes quality control checks. \Robject{OperaMate} provides
several ways to do quality control.  ''PlateCorrelation'' checks if the
duplicated plates have similar distributions. Pearson correlations are
performed between every pair of duplicated plates, and the low correlation
means bad replicates (default: 0.8). ''Zfactor'' checks the z factors of 
the wells of each plate. The data from the bad plates/wells are replaced by
the mean of their duplicates if replace.badPlateData is TRUE. 
The qualities of the plates are stored
in the \Robject{plate.quality} slot.  Next, it performs the quality
control well by well. It detects abnormal standardard deviations (sds) of replicated 
wells considering the mean values. Boxplot shows the abnormal sds as the outliers 
(black points outside the box), as shown in Figure \ref{fig:qcplot}. 

The cell numbers are loaded by the following function:
<<load_cellnum, eval = TRUE>>=
cell.cellNum <- cellData(name = "Cells.Analyzed")
cell.cellNum <- cellLoad(cell.cellNum, lstPlates, neglect.well = c("*02", "*23"),
                    positive.ctr = c("H02", "J02", "L02"),
                    negative.ctr = c("C23", "E23", "G23"))
oneCell <- cellNumLoad(oneCell, cell.cellNum)
@ 

Quality control is performed by the following function:
<<qc_cell, eval = TRUE>>=
oneCell <- cellQC(oneCell, qcType = c("plateCorrelation", "wellSd", "cellNumber"),
                  qc.threshold = c(correlation = 0.7), outpath = tempdir())
str(oneCell["qc.data"])
head(oneCell["plate.quality"])
@

\begin{figure}[tp]
  \begin{center}
    \includegraphics{qcplot.png}
    \caption{\label{fig:qcplot}
      (a) Pearson correlation of replicated plates; (b) standard deviations of
      replicated wells binned by mean values.}
  \end{center}
\end{figure}

\section{Hit identification and biological analysis}\label{section:sigDA}
The ''hits'' are the samples which are meaningfully different from the
negative controls. The compulsory method is the t-test between the
samples and the controls. The hits are the samples those significantly
differ from the negative controls based on t-test ( default: p-value
less than 0.05). This method is combined with the following methods to
reduce a high false positives:
\begin{enumerate}
\item \Robject{ksd}, $mean \pm k\:standard\,deviation$. The hits are
  the samples those surpass k (default: 3) standard deviation relative
  to the mean. This approach is often used with z-score normalization.
\item \Robject{kmsd}, $median \pm k\:median\,absolute\,deviation$. The
  hits are the samples those surpass k (default: 3) median absolute
  deviation relative to median. It is more robust than \Robject{ksd}
  as to a nonnormal distribution.
\item \Robject{stable}, stable distribution method. 
  You can check the fitness to the stable distribution by
  the QQ plot.
\end{enumerate}

An example is as follows: 
<<sig_detection, eval = TRUE>>= 
oneCell <- cellSig(oneCell, method = "stable", th = c(0.05, 0.05),
                   outpath = tempdir())
names(oneCell["Sig"])
@
The fitness of stable distribution to the data can be visualized, as shown 
in Figure \ref{fig:sig}(a).

In addition, the signicance of the hits can be exihibited by a volvano
plot,as shown in Figure \ref{fig:sig}(b).  
<<sig_vcplot, eval = TRUE>>=
labels <- c("Axin1") 
names(labels) <- c("DSIMGA04:C07")
cellSigPlot(oneCell, highlight.label = labels, outpath = tempdir())
@

\begin{figure}[tp]
  \begin{center}
    \includegraphics{sigplot.png}
    \caption{\label{fig:sig}
      (a) QQ plot between data and stable distribution; (b) volcano plot between the log2
      intensity and the log10 p-value in the multiple t-tests. The red
      points are hits statistically and quantitatively significantly
      different from the negative controls. (c) The counts of the
      detected hits. (d) The functions of the detected hits.}
  \end{center}
\end{figure}

Moreover, the potential biological meanings can be interpreted with
the help of gProfileR package. You are required to provide the well-gene
specification table. The description of \Robject{genemap}
is referred in Section \ref{section:config}. The p-values barplot can be
visualized, as shown in Figure \ref{fig:sig}(d)
<<sig_analysis, eval = TRUE>>= 
genemap <- read.csv(file.path(system.file("Test", package = "OperaMate"),
                              "demoData", "genemap.csv"), stringsAsFactors = FALSE)
chart <- cellSigAnalysis(oneCell, genemap, organism = "mmusculus")
head(chart, n = 5)
cellSigAnalysisPlot(chart, prefix = oneCell@name, outpath = tempdir())
@

\section{Summarization}
At last, all processed data and the significant hits of all data type
are summarized to a single report, which is easy to check using
Microsoft Office Excel or other softwares. The numbers of hits are
visualized by a histogram, as shown in Figure \ref{fig:sig}(c).

<<report, eval = TRUE>>= 
report <- generateReport(list(oneCell), genemap, verbose = FALSE,
                         plot = FALSE)
head(report, n = 5)
@

\section{Acknowledgement}
We thank Li Mao and his advisor Lin Li for providing the original
screening data generated by the Columbus system. The example data in
the package are synthesis data generated based on their providing
data.

\section{Session info}
<<session_info, echo = FALSE>>= 
sessionInfo() 
@

\bibliography{OperaMate-vignette}

\end{document}