% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Basic Functions for Flow Cytometry Data}
%\VignetteDepends{flowCore, flowViz}
%\VignetteKeywords{}
%\VignettePackage{flowCore}
\documentclass[11pt]{article}
\usepackage{times}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage{times}
\usepackage{comment}
\usepackage{graphicx}
\usepackage{subfigure}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\setkeys{Gin}{width=0.5\textwidth}
\setkeys{Gin}{height=0.5\textwidth}

\title{flowCore: data structures package for flow cytometry data}
\author{N. Le Meur F. Hahne B. Ellis P. Haaland}

\begin{document}
%\SweaveOpts{concordance=TRUE}



<<echo=FALSE, include=FALSE>>=
library(knitr)
opts_chunk$set(
concordance=TRUE, warning = FALSE
)
@

\maketitle

\begin{abstract}
\noindent \textbf{Background} The recent application of modern
automation technologies to staining and collecting flow cytometry
(FCM) samples has led to many new challenges in data management and
analysis. We limit our attention here to the associated problems in
the analysis of the massive amounts of FCM data now being
collected. From our viewpoint, see two related but substantially
different problems arising. On the one hand, there is the problem of
adapting existing software to apply standard methods to the increased
volume of data. The second problem, which we intend to address here,
is the absence of any research platform which bioinformaticians,
computer scientists, and statisticians can use to develop novel
methods that address both the volume and multidimensionality of the
mounting tide of data. In our opinion, such a platform should be Open
Source, be focused on visualization, support rapid prototyping, have a
large existing base of users, and have demonstrated suitability for
development of new methods. We believe that the Open Source
statistical software R in conjunction with the Bioconductor Project
fills all of these requirements. Consequently we have developed a
Bioconductor package that we call \Rpackage{flowCore}. The
\Rpackage{flowCore} package is not intended to be a complete analysis
package for FCM data. Rather, we see it as providing a clear object
model and a collection of standard tools that enable R as an
informatics research platform for flow cytometry. One of the important
issues that we have addressed in the \Rpackage{flowCore} package is
that of using a standardized representation that will insure
compatibility with existing technologies for data analysis and will
support collaboration and interoperability of new methods as they are
developed. In order to do this, we have followed the current
standardized descriptions of FCM data analysis as being developed
under NIH Grant xxxx [n]. We believe that researchers will find
flowCore to be a solid foundation for future development of new
methods to attack the many interesting open research questions in FCM
data analysis.\\

\noindent \textbf{Methods} We propose a variety of data
structures. We have implemented the classes and methods in the
Bioconductor package flowCore.  We illustrate their use with X case
studies.\\

\noindent \textbf{Results} We hope that those proposed data structures
will be the base for the development of many tools for the analysis of
high throughput flow cytometry.\\

\noindent \textbf{keywords} Flow cytometry, high throughput,
software, standard
\end{abstract}

<<loadPackage, echo=FALSE,results='hide'>>=
library(flowCore)
@


\section{Introduction}
Traditionally, flow cytometry has been a tube-based technique limited
to small-scale laboratory and clinical studies.  High throughput
methods for flow cytometry have recently been developed for drug
discovery and advanced research methods \citep{Gasparetto2004}. As an
example, the flow cytometry high content screening (FC-HCS) can
process up to a thousand samples daily at a single workstation, and
the results have been equivalent or superior to traditional manual
multi-parameter staining and analysis techniques.

The amount of information generated by high throughput technologies
such as FC-HCS need to be transformed into executive summaries (which
are brief enough) for creative studies by a human researcher
\citep{Brazma2001}.  Standardization is critical when developing new
high throughput technologies and their associated information services
\citep{Brazma2001, Chicurel2002, Boguski2003}. Standardization efforts
have been made in clinical cell analysis by flow cytometry
\citep{Keeney2004}, however data interpretation has not been
standardized for even low throughput FCM.  It is one of the most
difficult and time consuming aspects of the entire analytical process
as well as a primary source of variation in clinical tests, and
investigators have traditionally relied on intuition rather than
standardized statistical inference
\citep{Bagwell2004,Braylan2004,Parks1997,Suni2003}. In the development
of standards in high throughput FCM, little progress has been made in
terms of Open Source software. In this article we propose R data
structures to handle flow cytometry data through the main steps of
preprocessing: compensation, transformation, filtering.

The aim is to merge both \Rpackage{prada} and \Rpackage{rflowcyt} 
\citep{LeMeur2006} into one core package which is compliant with
the data exchange standards that are currently developed in the 
community \citep{Spidlen2006}.

Visualization as well as quality control will then be part of the
utility packages that depend on the data structures defined in the
flowCore package.

\section{Representing Flow Cytometry Data}

\Rpackage{flowCore}'s primary task is the representation and basic
manipulation of flow cytometry (or similar) data. This is accomplished
through a data model very similar to that adopted by other
Bioconductor packages using the \Robject{expressionSet} and
\Robject{AnnotatedDataFrame} structures familiar to most Bioconductor
users.

\subsection{The \Rclass{flowFrame} Class}

The basic unit of manipulation in \Rpackage{flowCore} is the
\Rclass{flowFrame}, which corresponds roughly to a single
``FCS'' file exported from the flow cytometer's acquisition
software. At the moment we support FCS file versions 2.0
through 3.1, and we expect to support FCS4/ACS1 as soon as
the specification has been ratified.

\subsubsection{Data elements}

The primary elements of the \Rclass{flowFrame} are the
\Robject{exprs} and \Robject{parameters} slots, which
contain the event-level information and column metadata
respectively. The event information, stored as a single
matrix, is accessed and manipulated via the
\Rfunction{exprs()} and \Rfunction{exprs<-} methods,
allowing \Rclass{flowFrame}s to be stitched together if
necessary (for example, if the same tube has been collected
in two acquisition files for memory reasons).

The \Robject{parameters} slot is an
\Rclass{AnnotatedDataFrame} that contains information
derived from an FCS file's ``\$Pn'' keywords, which
describe the detector and stain information. The entire list
is available via the \Rfunction{parameter()} method, but
more commonly this information is accessed through the
\Rfunction{names}, \Rfunction{featureNames} and
\Rfunction{colnames} methods. The \Rfunction{names} function
returns a concatenated version of \Rfunction{colnames} and
\Rfunction{featureNames} using a format similar to the one
employed by most flow cytometry analysis software. The
\Rfunction{colnames} method returns the detector names,
often named for the fluorochrome detected, while the
\Rfunction{featureNames} method returns the description
field of the parameters, which will typically be an
identifier for the antibody.

The \Rfunction{keyword} method allows access to the raw FCS
keywords, which are a mix of standard entries such as
``SAMPLE ID,'' vendor specific keywords and user-defined
keywords that add more information about an experiment. In
the case of plate-based experiments, there are also one or
more keywords that identify the specific well on the plate.

Most vendor software also includes some sort of unique
identifier for the file itself. The specialized method
\Rfunction{identifier} attempts to locate an appropriate
globally unique identifier that can be used to uniquely
identify a frame. Failing that, this method will return the
original file name offering some assurance that this frame
is at least unique to a particular session.

\subsection{The \Rclass{flowSet} Class}

Most experiments consist of several \Rclass{flowFrame} objects, which
are organized using a \Rclass{flowSet} object. This class provides
a mechanism for efficiently hosting the \Rclass{flowFrame}s with
minimal copying. This reduces memory requirements and ensures
that experimental metadata stay properly attached to their
corresponding \Rclass{flowFrame}s.

\section{Reading and Manipulating \Rpackage{flowCore} Data Classes}

\subsection{Reading an FCS file into a \Rclass{flowFrame}}

FCS files are read into the R environment via the
\Rfunction{read.FCS} function using the standard connection
interface---allowing for the possibility of accessing FCS
files hosted on a remote resource as well as those that have
been compressed or even retrieved as a blob from a database
interface. FCS files (version 2.0, 3.0, and 3.1) and LMD (List
Mode Data) extensions are currently supported.

There are also several immediate processing options
available in this function, the most important of which is
the \Rfunction{transformation} parameter, which can
``linearize'' (the default), ``linearize-with-PnG-scaling'', 
or ``scale'' our data. To see how this works, 
first we will examine an FCS file
without any transformation at all:

<<ReadFiles, echo=TRUE, results='markup'>>=
file.name <- system.file("extdata","0877408774.B08", 
                         package="flowCore")
x <- read.FCS(file.name, transformation=FALSE)
summary(x)
@

As we can see, in this case the values from each parameter
seem to run from $0$ to $1023$ ($2^{10}-1$).  However,
inspection of the documentation of the ``exponentiation'' 
keyword (\$PnE) reveals that some of the parameters (3 and 4) 
have been stored in a format specifying that the channel values
should be scaled as $y=f_2\times 10^{f_1\cdot x/R}$ where
$x$ is the original channel value, $y$ is the scaled value,
$f_1$ and $f_2$ are given respectively by the first and second
element of the value of the \$PnE key and $R$ is the channel
range given by the value of the \$PnR key. The special \$PnE value
of ``$0,0$'' corresponds to a scale that is already linear. 

<<SearchKeywords, echo=TRUE, results='markup'>>=
keyword(x,c("$P1E", "$P2E", "$P3E", "$P4E"))
@

The default ``linearize'' transformation option will convert these to,
effectively, have a \$PnE value of ``$0,0$'':

<<PrintSummary, echo=TRUE, results='markup'>>=
summary(read.FCS(file.name))
@

The ``linearize-with-PnG-scaling'' option will perform the previous
transformation and it will also apply a ``division by gain''
to parameters stored on linear scale with specified gain.
The gain is specified in the \$PnG keyword. This option has been introduced
as part of Gating-ML 2.0 compliance. 

Finally, the ``scale'' option will both linearize values as
well as ensure that output values are contained in $[0,1]$,
which is the proposed method of data storage for the
ACS1.0/FCS4.0 specification:

<<PrintSummary2, echo=TRUE,results='markup'>>=
summary(read.FCS(file.name,transformation="scale")) 
@

Another parameter of interest is the \Rfunction{alter.names}
parameter, which will convert the parameter names into more
``R friendly'' equivalents, usually by replacing ``-'' with
``.'':

<<ReadFiles2, echo=TRUE,results='markup'>>=
read.FCS(file.name,alter.names=TRUE) 
@

When only a particular subset of parameters is desired, the
\Rfunction{column.pattern} parameter allows for the specification of a
regular expression and only parameters that match the regular
expression will be included in the frame. For example, to include on
the Height parameters:

<<ReadFiles3, echo=TRUE, results='markup'>>=
x <- read.FCS(file.name, column.pattern="-H") 
x 
@

\textit{Note that \Rfunction{column.pattern} is applied after
  \Rfunction{alter.names} if it is used.}

Finally, only a sample of lines can be read in case you need a quick
overview of a large series of files.
<<ReadFiles4, echo=TRUE, results='markup'>>=
lines <- sample(100:500, 50)
y <- read.FCS(file.name, which.lines = lines) 
y 
@

\subsection{Creating a \Rclass{flowSet}}

To facilitate the creation of \Rclass{flowSet} objects from a variety
of sources, we provide a means to coerce \Rclass{list} and
\Rclass{environment} objects to a \Rclass{flowSet} object using the
usual coercion mechanisms. For example, if we have a directory
containing FCS files, we can read in a list of those files and create a
\Rclass{flowSet} out of them:

<<Frames1, echo=TRUE, results='markup'>>=
fcs.dir <- system.file("extdata",
                       "compdata",
                       "data",
                       package="flowCore")
frames <- lapply(dir(fcs.dir, full.names=TRUE), read.FCS)
fs <- as(frames, "flowSet")
fs
sampleNames(fs)
@

Note that the original list is unnamed and that the resulting sample
names are not particularly meaningful. If the list is named, the list
constructed is much more meaningful. One such approach is to employ
the \Rfunction{keyword} method for \Rclass{flowFrame} objects to
extract the ``SAMPLE ID'' keyword from each frame:

<<Frames2, echo=TRUE,results='markup'>>=
names(frames) <- sapply(frames, keyword, "SAMPLE ID")
fs <- as(frames, "flowSet")
fs
sampleNames(fs)
@

\subsubsection{Working with experimental metadata}

Like most Bioconductor organizational classes, the \Rclass{flowSet}
has an associated \Rclass{AnnotatedDataFrame} that provides metadata
not contained within the \Rclass{flowFrame} objects themselves. This
data frame is accessed and modified via the usual
\Rfunction{phenoData} and \Rfunction{phenoData<-} methods. You can
also generally treat the phenotypic data as a normal data frame to add
new descriptive columns. For example, we might want to track the
original filename of the frames from above in the phenotypic data for
easier access:

<<metaData, echo=TRUE, results='markup'>>=
phenoData(fs)$Filename <- fsApply(fs,keyword, "$FIL")
pData(phenoData(fs))
@

Note that we have used the \Rclass{flowSet}-specific iterator,
\Rfunction{fsApply}, which acts much like \Rfunction{sapply} or
\Rfunction{lapply} and will be discussed shortly. 
Additionally, we should also note that the \Rfunction{phenoData} 
data frame \textbf{must} have row names that correspond to the 
original names used to create the \Rclass{flowSet}.

\subsubsection{Reading multiple FCS files with \Rfunction{read.FlowSet}}
The process of reading multiple FCS files in to a \Rclass{flowSet} 
is simplified by using the \Rfunction{read.flowSet} function. In its simplest
incarnation, this function takes a \Rfunction{path}, which defaults to
the current working directory, and an optional \Rfunction{pattern}
argument that allows only a subset of files contained within the
working directory to be selected. For example, to read a
\Rclass{flowSet} of the files read in as \Robject{frames} above:

<<ReadFlowSet, echo=TRUE,results='markup'>>=
fs <- read.flowSet(path = fcs.dir)
@

\Rfunction{read.flowSet} will pass on additional arguments meant for
the underlying \Rfunction{read.FCS} function, such as
\Rfunction{alter.names} and \Rfunction{column.pattern}, but also
supports several other interesting arguments for conducting initial
processing:
\begin{description}
\item[files]{ An alternative to the \Rfunction{pattern} argument, you
    may also supply a vector of filenames to read.}
\item[name.keyword]{ Like the example in the previous section, you may
    specify a particular keyword to use in place of the filename when
    creating the \Rclass{flowSet}.}
\item[phenoData]{ If this is an \Rclass{AnnotatedDataFrame}, then this
    will be used in place of the data frame that is ordinarily
    created. Additionally, the row names of this object will be taken
    to be the filenames of the FCS files in the directory specified by
    \Robject{path}.  }
\end{description}


\subsubsection{Manipulating a \Rclass{flowSet}}

You can extract a \Rclass{flowFrame} from a \Rclass{flowSet} object in
the usual way using the \Rfunction{[[} or \Rfunction{\$} extraction
  operators. On the other hand using the \Rfunction{[} extraction
  operator returns a new \Rclass{flowSet} by \textbf{copying} the
  environment. However, simply assigning the \Rclass{flowFrame} to a
  new variable will \textbf{not} copy the contained frames.

  The primary iterator method for a \Rclass{flowSet} is the
  \Rfunction{fsApply} method, which works more-or-less like
  \Rfunction{sapply} or \Rfunction{lapply} with two extra options. The
  first argument, \Robject{simplify}, which defaults to
  \Robject{TRUE}, instructs \Rfunction{fsApply} to attempt to simplify
  its results much in the same way as \Rfunction{sapply}. The primary
  difference is that if all of the return values of the iterator are
  \Rclass{flowFrame} objects, \Rfunction{fsApply} will create a new
  \Rclass{flowSet} object to hold them. The second argument,
  \Robject{use.exprs}, which defaults to \Robject{FALSE} instructs
  \Rfunction{fsApply} to pass the expression matrix of each frame
  rather than the \Rclass{flowFrame} object itself. This allows
  functions to operate directly on the intensity information without
  first having to extract it.

  As an aid to this sort of operation we also introduce the
  \Rfunction{each\_row} and \Rfunction{each\_col} convenience
  functions that take the place of \Rfunction{apply} in the
  \Rfunction{fsApply} call. For example, if we wanted the median value
  of each parameter of each \Rclass{flowFrame} we might write:

<<fsApply1, echo=TRUE, results='markup'>>=
fsApply(fs, each_col, median)
@
%
which is equivalent to the less readable

<<fsApply2, echo=TRUE,results='markup'>>=
fsApply(fs,function(x) apply(x, 2, median), use.exprs=TRUE)
@
%
In this case, the \Robject{use.exprs} argument is not required in the
first case because \Rfunction{each\_col} and \Rfunction{each\_row} are
methods and have been defined to work on \Rclass{flowFrame} objects by
first extracting the intensity data.


\section{Visualizing Flow Cytometry Data}

Much of the more sophisticated visualization of \Rclass{flowFrame} and
\Rclass{flowSet} objects, including an interface to the
\Rpackage{ggplot2} graphics system is implemented by the
\Rpackage{ggcyto} package, also included as part of
Bioconductor. Here, we will only introduce the \Rfunction{autoplot} function
for the purpose of demonstrating \Rpackage{flowCore} operations. 
See the vignettes of \Rpackage{ggcyto} for more examples of how to visualize flow data.

\subsection{Visualizing a single \Rclass{flowFrame}}

It is frequently helpful to view the density of events in a \Rclass{flowFrame}
across one channel or two.

To create a bivariate density plot, simply provide the appropriate channels:
<<Plot, echo=TRUE, results='hide'>>=
library(ggcyto)
autoplot(x, "FL1-H", "FL2-H")
@

Similarly, to get a univariate densityplot:
<<Plot2, echo=TRUE, results='hide'>>=
autoplot(x, "FL1-H")
@

\subsection{Visualizing a \Rclass{flowSet}}
The syntax is basically the same for \Rclass{flowSet} objects, with the output
now being a grid of plots corresponding to each \Rclass{flowFrame} (which may
or may not be useful depending on context). The grid layout can be adjusted
with additional arguments to \Rfunction{autoplot}

<<Plot3, echo=TRUE, results='hide'>>=
fs <- read.flowSet(path = system.file("extdata", 
                                      package = "flowCore"), 
                   pattern = "\\.")
autoplot(fs, "FL1-H", "FL2-H")
@

\section{Compensation}

Before proceeding with further analysis of the data in a \Rclass{flowFrame}
or \Rclass{flowSet}, it is important to properly compensate the data for
spectral overlap between fluorescence channels. 

\subsection{Extracting and applying a pre-calculated spillover matrix}

If the original FCS file contains a pre-calculated spillover matrix 
as the value of the \$SPILLOVER or \$SPILL keywords, this can be 
accessed from the \Rclass{flowFrame}'s \Robject{description} slot 
using the \Rfunction{spillover} method:
<<Comp, echo=TRUE, results='markup'>>=
fcsfiles <- list.files(pattern = "CytoTrol", 
                       system.file("extdata", 
                                   package = "flowWorkspaceData"),
                       full = TRUE)
fs <- read.flowSet(fcsfiles)
x <- fs[[1]]
comp_list <- spillover(x)
comp_list
comp  <- comp_list[[1]]
@

Given a \Rclass{flowFrame}, \Rfunction{spillover} will check a few valid
spillover matrix keywords. This will result in a list, usually only one
of which will be valid (in this case, the first one, corresponding to
the \$SPILL keyword).

This spillover matrix can then be applied to the channels of
the \Rclass{flowFrame} using the \Rfunction{compensate} function.

<<Comp2, echo=TRUE, results='markup'>>=
x_comp <- compensate(x, comp)
@

The \Rfunction{compensate} function can also take a \Rclass{flowSet}
as its first argument and a list of named spillover matrices to compensate
the corresponding \Rclass{flowFrame}s. Here the \Rcode{simplify = FALSE}
option is necessary so that \Rfunction{fsApply} returns a list of named matrices.

<<Comp3, echo=TRUE, results='markup'>>=
comp <- fsApply(fs, function(x) spillover(x)[[1]], simplify=FALSE)
fs_comp <- compensate(fs, comp)
@

While transformation will be discussed in the next section, it is helpful to
visualize the effect of compensation on transformed parameters. Quick inspection
of the compensation matrix above reveals large overlap between the 450nm and 545nm
channels, so we would expect to see a significant difference in each of those channels
after compensation. This is in fact the case:
<<CompViz, echo=TRUE, results='markup'>>=
library(gridExtra)
transList <- estimateLogicle(x, c("V450-A","V545-A"))
p1 <- autoplot(transform(x, transList), "V450-A", "V545-A") +
      ggtitle("Before")
p2 <- autoplot(transform(x_comp, transList), "V450-A", "V545-A") +
      ggtitle("Before")
grid.arrange(as.ggplot(p1), as.ggplot(p2), ncol = 2)
@

Often the spillover matrix is not provided in the FCS file for each analyzed sample
and instead it is necessary to work with a set of FCS files representing the controls
from which the spillover matrix is to be calculated, as follows. 

\subsection{Computing a spillover matrix from a set of compensation controls}

The spillover matrix can also be calculated using a set of FCS files
that contain data for an unstained sample and singly-stained samples for
each of the measurement channels. If these files are not already
incorporated as \Rclass{flowFrame}s in a \Rclass{flowSet} object, this
must be done first. Here we also are assigning the appropriate channel name
for each control file before coercion of the frames to a \Rclass{flowSet}.
This spillover calculation method is now part of the \Rpackage{flowStats}
package.

<<Comp4, echo=TRUE, results='markup'>>=
library(flowStats)
fcs.dir <- system.file("extdata", "compdata", "data",
                       package="flowCore")
frames <- lapply(dir(fcs.dir, full.names=TRUE), read.FCS)
names(frames) <- c("UNSTAINED", "FL1-H", "FL2-H", "FL4-H", "FL3-H")
frames <- as(frames, "flowSet")
comp <- spillover(frames, unstained="UNSTAINED", patt = "-H",
                  fsc = "FSC-H", ssc = "SSC-H", 
                  stain_match = "ordered")
comp
@

Note that when using the \Rcode{stain\_match = "ordered"} option of 
\Rfunction{spillover}, we are specifying that the method should assume
the singly-stained \Rclass{flowFrame}s in the \Rclass{flowSet} are 
in the order of the channel columns. In this case this behavior is 
incorrect as can be seen by quick inspection of the resulting matrix. 
To override this behavior and use the names we have provided, use the
\Rcode{stain\_match = "regexpr"} option instead. This will use regular 
expression matching of the channel names to the \Rclass{flowFrame} names 
to achieve the proper ordering.

<<Comp5, echo=TRUE, results='markup'>>=
sampleNames(frames)
comp <- spillover(frames, unstained="UNSTAINED", patt = "-H",
                  fsc = "FSC-H", ssc = "SSC-H", 
                  stain_match = "regexpr")
comp
@

As an alternative way to match channels to their corresponding compensation
controls, we provide the \Rfunction{spillover\_match} method. This allows 
for the matching of control files to specific channels via a simple csv file, 
resulting in a \Rclass{flowSet} ready for \Rfunction{spillover}. 
Lastly, it is acceptable to omit 
the \Rclass{flowSet} argument \Rcode{x} to both \Rcode{spillover} and
\Rcode{spillover\_match} if you provide the \Rcode{path} argument to point to 
a directory containing the files to be used in the compensation \Rclass{flowSet}:

<<Comp6, echo=TRUE, results='markup'>>=
comp_match <- system.file("extdata", "compdata", "comp_match",
                          package="flowCore")
# The matchfile has a simple format
writeLines(readLines(comp_match))
control_path <- system.file("extdata", "compdata", "data",
                            package="flowCore")
# Using path rather than pre-constructed flowSet
matched_fs <- spillover_match(path=control_path, 
                              fsc = "FSC-H", ssc = "SSC-H", 
                              matchfile = comp_match)
comp <- spillover(matched_fs, fsc = "FSC-H", ssc = "SSC-H", 
                  prematched = TRUE)
@

% Note for the \Rfunction{spillover\_match} function
% to work correctly, the sample names of each of the \Rclass{flowFrame}s
% should be their corresponding file names. Recall that this is the default
% behavior of \Rfunction{read.flowSet}.

This spillover matrix can then be used to compensate a sample
\Rclass{flowFrame} or an entire \Rclass{flowSet}.

% PICKUP FOLLOWING LEAD OF http://127.0.0.1:23643/library/flowWorkspace/doc/flowWorkspace-Introduction.html

\section{Transformation}

\Rpackage{flowCore} features two methods of transforming parameters
within a \Rclass{flowFrame} or \Rclass{flowSet}: inline and out-of-line. 
The inline method, discussed in the next section, has been developed 
primarily to support filtering features and is strictly more limited 
than the out-of-line transformation method, which uses R's
\Rfunction{transform} function to accomplish the filtering. Like the
normal \Rfunction{transform} function, the \Rclass{flowFrame} is
considered to be a data frame with columns named for parameters of the
FCS file. For example, if we wished to plot our first
\Rclass{flowFrame}'s first two fluorescence parameters on the log
scale we might write:

<<Transfo1, echo=TRUE, results='hide'>>=
fs <- read.flowSet(path=system.file("extdata", "compdata", "data",
                   package="flowCore"), name.keyword="SAMPLE ID")
autoplot(transform(fs[[1]],
                   `FL1-H`=log(`FL1-H`),
                   `FL2-H`=log(`FL2-H`)
                   ),
         "FL1-H","FL2-H")
@

Like the usual \Rfunction{transform} function, we can also create new
parameters based on the old parameters, without destroying the old

<<Transfo2, results='hide'>>=
autoplot(transform(fs[[1]],
                   log.FL1.H=log(`FL1-H`),
                   log.FL2.H=log(`FL2-H`)
                   ),
         "log.FL1.H", "log.FL2.H")
@



\subsection{Standard Transforms}

Though any function can be used as a transform in both the out-of-line
and inline transformation techniques, \Rpackage{flowCore} provides a
number of parameterized transform generators that correspond to the
transforms commonly found in flow cytometry and defined in the
Transformation Markup Language (Transformation-ML, see \url{http://www.ficcs.org/} and \cite{Spidlen2006} for more details). Briefly, the
predefined transforms are:

\begin{description}
\item[truncateTransform]{ $y = \left\{\begin{array}{ll}
a & x < a \\
x & x \geq a \\
\end{array}\right.$ }
\item[scaleTransform]{$ f(x) = \frac{x-a}{b-a}$}
\item[linearTransform]{ $ f(x) = a+bx$ }
\item[quadraticTransform]{ $ f(x) = ax^2 + bx + c$ }
\item[lnTransform]{$ f(x) = \log\left(x\right)\frac{r}{d} $}
\item[logTransform]{$ f(x) = \log_b\left(x\right)\frac{r}{d} $}
\item[biexponentialTransform]{ $ f^{-1}(x) = ae^{bx}-ce^{dx}+f$}
\item[logicleTransform]{ A special form of the biexponential transform
    with parameters selected by the data.}
\item[arcsinhTransform]{ $ f(x) = asinh\left(a+bx\right)+c$}
\end{description}

To use a standard transform, first we create a transform function via
the constructors supplied by \Rpackage{flowCore}:

<<Transfo3, echo=TRUE>>=
aTrans <- truncateTransform("truncate at 1", a=1)
aTrans
@
which we can then use on the parameter of interest in the usual way, in this
case for the full \Rclass{flowSet} using the same syntax
<<Transfo4, echo=TRUE,results='markup'>>=
transform(fs,`FL1-H`=aTrans(`FL1-H`))
@

However this form of transform call is not intended to be used in the programmatic
context because a locally defined transform function (e.g. 'aTrans')
may not always be visible to the non-standard evaluation environment. .e.g
<<Transfo4.1, echo=TRUE,results='markup'>>=
f1 <- function(fs,...){
  transform(fs, ...)[,'FL1-H']
}

f2 <- function(fs){
  aTrans <- truncateTransform("truncate at 1", a=1)
  f1(fs, `FL1-H` = aTrans(`FL1-H`))
}
res <- try(f2(fs), silent = TRUE)
res
@

So this form of usage of 'transform' method is only useful for the interactive
exploratory.
we highly recommend the usage of \Robject{transformList} instead for more
robust and reproducible code.

<<Transfo4.2, echo=TRUE,results='markup'>>=
myTrans <- transformList('FL1-H', aTrans)
transform(fs, myTrans)
@

\section{Gating}

The most common task in the analysis of flow cytometry data is some
form of filtering operation, also known as gating, either to obtain
summary statistics about the number of events that meet certain
criteria or to perform further analysis on a subset of the data.  Most
filtering operations are a composition of one or more common filtering
operations. The definition of gates in \Rpackage{flowCore} follows the
Gating Markup Language Candidate Recommendation \cite{Spidlen2008},
thus any \Rpackage{flowCore} gating strategy can be reproduced by any
other software that also adheres to the standard and \textit{vice
  versa}.

\subsection{Standard gates and filters}
Like transformations, \Rpackage{flowCore} includes a number of
built-in common flow cytometry gates. The simplest of these gates are
the geometric gates, which correspond to those typically found in
interactive flow cytometry software:
\begin{description}
\item[rectangleGate]{Describes a cubic shape in one or more
    dimensions--a rectangle in one dimension is simply an interval
    gate.}
\item[polygonGate]{Describes an arbitrary two dimensional polygonal
    gate.}
\item[polytopeGate]{Describes a region that is the convex hull of the
    given points. This gate can exist in dimensions higher than $2$,
    unlike the \Robject{polygonGate}.}
\item[ellipsoidGate]{Describes an ellipsoidal region in two or more
    dimensions}
\end{description}

These gates are all described in more or less the same manner (see man
pages for more details):

<<rectGate, echo=TRUE, results='markup'>>=
rectGate <- rectangleGate(filterId="Fluorescence Region", 
                          "FL1-H"=c(0, 12), "FL2-H"=c(0, 12))
@

In addition, we introduce the notion of data-driven gates, or filters,
not usually found in flow cytometry software. In these approaches, the
necessary parameters are computed based on the properties of the
underlying data, for instance by modeling data distribution or by
density estimation:

\begin{description}
\item[norm2Filter]{ A robust method for finding a region that most
    resembles a bivariate Normal distribution. Note that this
    method now resides in the \Rpackage{flowStats} package}
\item[kmeansFilter]{ Identifies populations based on a one dimensional
    k-means clustering operation. Allows the specification of
    \textbf{multiple} populations. }
\end{description}

\subsection{Count Statistics}

When we have constructed a filter, we can apply it in two basic
ways. The first is to collect simple summary statistics on the number
and proportion of events considered to be contained within the gate or
filter. This is done using the \Rfunction{filter} method. The first
step is to apply our filter to some data

<<echo=TRUE,results='markup'>>=
result = filter(fs[[1]],rectGate)
result
@

As we can see, we have returned a \Rclass{filterResult} object, which
is in turn a filter allowing for reuse in, for example, subsetting
operations. To obtain count and proportion statistics, we take the
summary of this \Rclass{filterResult}, which returns a list of summary
values:

<<Summary3, echo=TRUE, results='markup'>>=
summary(result)
summary(result)$n
summary(result)$true
summary(result)$p
@

A filter which contains multiple populations, such as the
\Rclass{kmeansFilter}, can return a list of summary lists:

<<SummarFilter, echo=TRUE, results='markup'>>=
summary(filter(fs[[1]], 
               kmeansFilter("FSC-H"=c("Low", "Medium", "High"),
                                     filterId="myKMeans")))
@

A filter may also be applied to an entire \Rclass{flowSet}, in which
case it returns a list of \Rclass{filterResult} objects:

<<echo=TRUE,results='markup'>>=
filter(fs,rectGate)
@

\subsection{Subsetting}

To subset or split a \Rclass{flowFrame} or \Rclass{flowSet}, we use
the \Rfunction{Subset} and \Rfunction{split} methods respectively. The
first, \Rfunction{Subset}, behaves similarly to the standard R
\Rfunction{subset} function, which unfortunately could not be used. For
example, the morphology parameters, Forward Scatter and Side Scatter, contain a
large more-or-less ellipse shaped population:

<<Norm2Filter1, echo=TRUE, results='markup'>>=
autoplot(fs[[1]], "FSC-H", "SSC-H")
@

If we wished to deal
only with that population, we might use \Rfunction{Subset} along with
a \Rclass{norm2Filter} object as follows:

<<Norm2Filter2, echo=TRUE, results='markup'>>=
library(flowStats)
morphGate <- norm2Filter("FSC-H", "SSC-H", filterId="MorphologyGate", 
                         scale=2)
smaller <- Subset(fs, morphGate)
fs[[1]]
smaller[[1]]
@
 
Notice how the \Robject{smaller} \Rclass{flowFrame} objects contain
fewer events. Now imagine we wanted to use a \Rclass{kmeansFilter} as
before to split our first fluorescence parameter into three
populations. To do this we employ the \Rfunction{split} function:

<<Split, echo=TRUE, results='markup'>>=
split(smaller[[1]], kmeansFilter("FSC-H"=c("Low","Medium","High"),
                                 filterId="myKMeans"))
@

or for an entire \Rclass{flowSet}

<<Split2, echo=TRUE, results='markup'>>=
split(smaller, kmeansFilter("FSC-H"=c("Low", "Medium", "High"),
                            filterId="myKMeans"))
@

\subsection{Combining Filters}

Of course, most filtering operations consist of more than one gate. To
combine gates and filters we use the standard R Boolean operators: \&,
| and ! to construct an intersection, union and complement
respectively:

<<CombineFilter, echo=TRUE, results='markup'>>=

rectGate & morphGate
rectGate | morphGate
!morphGate

@

We also introduce the notion of the subset operation, denoted by
either \%subset\% or \%\&\%. This combination of two gates first
performs a subsetting operation on the input \Rclass{flowFrame} using
the right-hand filter and then applies the left-hand filter. For
example,

<<Summary5, echo=TRUE, results='markup'>>=
summary(filter(smaller[[1]],rectGate %&% morphGate))
@

first calculates a subset based on the \Robject{morphGate} filter and
then applies the \Robject{rectGate}.

\subsection{Transformation Filters}

Finally, it is sometimes desirable to construct a filter with respect
to transformed parameters. To allow for this in our filtering
constructs we introduce a special form of the \Rfunction{transform}
method along with another filter combination operator \%on\%, which
can be applied to both filters and \Rclass{flowFrame} or
\Rclass{flowSet} objects. To specify our transform filter we must
first construct a transform list using a simplified version of the
\Rfunction{transform} function:

<<Transfo5, echo=TRUE,results='markup'>>=
tFilter <- transform("FL1-H"=log,"FL2-H"=log)
tFilter
@

Note that this version of the transform filter does not take
parameters on the right-hand side--the functions can only take a
single vector that is specified by the parameter on the left-hand
side. In this case those parameters are ``FL1-H'' and ``FL2-H.'' The
function also does not take a specific \Rclass{flowFrame} or
\Rclass{flowSet} allowing us to use this with any appropriate data. We
can then construct a filter with respect to the transform as follows:

<<TectGate3, echo=TRUE, results='markup'>>=
rect2 <- rectangleGate(filterId="Another Rect", "FL1-H"=c(1,2), 
"FL2-H"=c(2,3)) %on% tFilter
rect2
@


Additionally, we can use this construct directly on a
\Rclass{flowFrame} or \Rclass{flowSet} by moving the transform to the
left-hand side and placing the data on the right-hand side:

<<Plot6,echo=TRUE, results='hide'>>=
autoplot(tFilter %on% smaller[[1]], "FL1-H","FL2-H")
@

which has the same effect as the $\log$ transform used earlier.

\section{Extensions from \Rpackage{flowWorkspace}:
\Rclass{GatingHierarchy} and \Rclass{GatingSet}}
The now-defunct \Rclass{filterSet} class was very limited in its use for complex
analysis work flows. It was result-centric and it was hard to access
intermediate results. \Rpackage{flowWorkspace} and the \Rpackage{openCyto} framework (http://opencyto.org) 
offer much more versatile tools for such tasks through the \Rclass{GatingSet} class. 
The general idea is to let the software handle the organization of intermediate results 
and operations and to provide a unified API to access and summarize these operations. 
 

\subsection{Abstraction of GatingSet}
There are two classes in \Rpackage{flowWorkspace} that are used to
abstract work flows: \Rclass{GatingSet} objects are the basic container
holding all the necessary bits and pieces and they are the main
structure for user interaction. It is the container storing multiple 
\Rclass{GatingHierarchy} objects which are associated with individual samples. 
One can think of \Rclass{GatingSet} as corresponding to \Rclass(flowSet) and 
\Rclass{GatingHierarchy} as corresponding to \Rclass(flowFrame).

It is important to know that \Rclass{GatingSet} objects use an external pointer to store
the gating tree and thus most of its accessors have a reference
semantic instead of the pass-by-value semantic that is usually found
in the R language. The main consequence on the user-level is the fact that direct
assignments to a \Rclass{GatingSet} object are usually not necessary;
i.e., functions that operate on the \Rclass{GatingSet} have the
potential side-effect of modifying the object.
  
\subsection{Creating \Rclass{GatingSet} objects}
Before creating a \Rclass{GatingSet}, we need to have flow data loaded into R as a 
\Rclass{flowSet} or \Rclass{ncdfFlowSet} (the disk-based version of \Rclass{flowSet}, 
to handle large data sets that are too big for memory; see \Rpackage{ncdfFlow} package).
<<loadData>>=
library(flowWorkspace)
fcsfiles <- list.files(pattern = "CytoTrol",
                       system.file("extdata", 
                                   package = "flowWorkspaceData"),
                       full = TRUE)
fs <- read.flowSet(fcsfiles)

@
Then a \Rclass{GatingSet} can be created using the constructor \Rfunction{GatingSet}. 

<<createGatingSet>>=
gs <- GatingSet(fs)
gs
@

Normally, we want to compensate the data first by using a user-supplied compensation matrix:
<<getComp, echo=FALSE>>=
gs_manual <- load_gs(list.files(pattern = "gs_manual",
                                system.file("extdata", package = "flowWorkspaceData"),
                                full = TRUE))
comp <- gh_get_compensations(gs_manual[[1]])
@
<<Compensate>>=
comp
gs <- compensate(gs, comp)
fs_comp <- getData(gs)
@
We can query the available nodes in the \Rclass{GatingSet} using the \Rfunction{gs\_get\_pop\_paths} method:
<<nodes>>=
gs_get_pop_paths(gs)
@ 
It shows the only node ``root'' which corresponds to the raw flow data just added.

\subsection{transform the data}
Transformation can be either done on a \Rclass{flowSet} before using it 
to construct a \Rclass{GatingSet} or a \Rclass{transformerList} can be
directly added to a \Rclass{GatingSet}:

<<addTrans>>=
biexpTrans <- flowJo_biexp_trans(channelRange=4096, maxValue=262144
                          , pos=4.5,neg=0, widthBasis=-10)
chnls <- parameters(comp)
tf <- transformerList(chnls, biexpTrans)

# or use estimateLogicle directly on GatingHierarchy 
# object to generate transformerList automatically:
# tf <- estimateLogicle(gs[[1]], chnls)

gs <- transform(gs, tf)
@
<<plotTrans,echo=TRUE, results='hide'>>=
p1 <- autoplot(fs_comp[[1]], "B710-A") + ggtitle("raw")
p2 <- autoplot(gs_cyto_data(gs)[[1]], "B710-A") + 
          ggtitle("trans") + 
          ggcyto_par_set(limits = "instrument")
grid.arrange(as.ggplot(p1), as.ggplot(p2), ncol = 2)
@ 

Note that we did assign the return value of \Rfunction{transform} back to \Robject{gs}. This is because the 
flow data is stored as an R object and thus transforming the data still follows the pass-by-value semantics.

\subsection{Add the gates}
Some basic flowCore \Rclass{filter}s can be added to a \Rclass{GatingSet}:

<<addGate-nonDebris>>=
rg1 <- rectangleGate("FSC-A"=c(50000, Inf), filterId="NonDebris")
gs_pop_add(gs, rg1, parent = "root")
gs_get_pop_paths(gs)
# gate the data
recompute(gs)
@
As we see, here we don't need to assign \Rclass{GatingSet} back because all the modifications are 
made in place to the \Rclass{external pointer} rather than the R object itself.
And now there is one new population node under the ``root'' node called ``NonDebris''.
The node is named after the \Robject{filterId} of the gate if not explictly supplied.
After the gates are added, the actual gating process is done by explictly calling the \Rfunction{recompute} method.
Note that the numeric value it returns is the internal ID for the new population just added, 
which can normally be ignored since the gating path is the recommended 
way to refer to population nodes later.

To view the gate we just added,
<<plotGate,echo=TRUE, results='hide'>>=
autoplot(gs, "NonDebris")
@

Since They are 1-D gates, we can also display them in density plots
<<plotGate-density,echo=TRUE, results='hide'>>=
ggcyto(gs, aes(x = `FSC-A`)) + geom_density() + geom_gate("NonDebris")
@

To get population statistics for the given populatuion
<<getStats1>>=
gh_pop_get_stats(gs[[1]], "NonDebris")#counts
gh_pop_get_stats(gs[[1]], "NonDebris", type = "percent")#proportion
@

Now we add two more gates:
<<addGate-singlets>>=
# add the second gate
mat <- matrix(c(54272,59392,259071.99382782
                ,255999.994277954,62464,43008,70656
                ,234495.997428894,169983.997344971,34816)
              , nrow = 5)
colnames(mat) <-c("FSC-A", "FSC-H")
mat
pg <- polygonGate(mat)
gs_pop_add(gs, pg, parent = "NonDebris", name = "singlets")

# add the third gate
rg2 <- rectangleGate("V450-A"=c(2000, Inf))
gs_pop_add(gs, rg2, parent = "singlets", name = "CD3")
gs_get_pop_paths(gs)

@

We see two more nodes are added to 'GatingSet' and the population names are 
explicitly specified during the addition this time. 

A \Rclass{quadrantGate}, which results in four sub-populations, is also supported.
<<addQuadGate>>=
qg <- quadGate("B710-A" = 2000, "R780-A" = 3000)
gs_pop_add(gs, qg, parent="CD3", names = c("CD8", "DPT", "CD4", "DNT"))
gs_pop_get_children(gs[[1]], "CD3")
# gate the data from "singlets"
recompute(gs, "singlets")
@ 
Here we see four child nodes are added to the ``CD3'' parent node. Four quadrants are named explicitly through the 
\Robject{names} argument by clock-wise order (start from top-left quadrant).
Recomputing only needs to be done once from the first ungated node, which will automatically compute all of its descendants.

To plot the underlying tree:
<<plotgs, eval=FALSE>>=
plot(gs)
@ 
<<plotwfdo, echo=FALSE, results='hide'>>=
if(suppressWarnings(require(Rgraphviz))){
    plot(gs)
}else{
    plot(1,1, type="n", axes=FALSE, ann=FALSE)
    text(1,1,"Need to install Rgraphviz")
}
@ 

To plot all gates for one sample:
<<plotGateAll, results='hide'>>=
autoplot(gs[[1]])
@

To retreive the underlying flow data for a gated \Rclass{population}:
<<getData>>=
fs_nonDebris <- getData(gs, "NonDebris")
fs_nonDebris 
nrow(fs_nonDebris[[1]])
nrow(fs[[1]])
@

To get all the population statistics:
<<getStats2>>=
gs_pop_get_count_fast(gs)
@

\subsection{Removing nodes from a \Rclass{GatingSet} object}
There are dependencies between \Rclass{nodes} in the hierarchical structure of a \Rclass{GatingSet} object. 
Thus, removing a particular node means also removing all of its associated child \Rclass{nodes}. 

<<Rm>>=
Rm('CD3', gs)
gs_get_pop_paths(gs)
Rm('NonDebris', gs)
gs_get_pop_paths(gs)
@ 

Now for the larger data set, it would be either inaccurate to apply the same hard-coded 
gate to all samples or impractical to manually set the gate coordinates for each individual sample. 
\Rpackage{openCyto}(\citep{Finak2014}) provides some data-driven gating functions to automatically generate these gates. 

For example, \Rfunction{mindensity} can be used for estimating a ``nonDebris'' gate for each sample.
<<openCyto-nonDebris>>=
if(require(openCyto)){
thisData <- getData(gs)
nonDebris_gate <- fsApply(thisData,
                          function(fr)
                            openCyto:::.mindensity(fr, channels = "FSC-A"))
gs_pop_add(gs, nonDebris_gate, parent = "root", name = "nonDebris")
recompute(gs)
}
@

\Rfunction{singletGate} can be used for estimating ``singlets" 
<<openCyto-singletGate>>=
if(require(openCyto)){
thisData <- getData(gs, "nonDebris") #get parent data
singlet_gate <- fsApply(thisData,
                        function(fr)
                          openCyto:::.singletGate(fr, channels =c("FSC-A", "FSC-H")))
gs_pop_add(gs, singlet_gate, parent = "nonDebris", name = "singlets")
recompute(gs)
}
@

and then \Rfunction{mindensity} can be used again for the "CD3" gate
<<openCyto-CD3>>=
if(require(openCyto)){
thisData <- getData(gs, "singlets") #get parent data
CD3_gate <- fsApply(thisData,
                    function(fr)
                      openCyto:::.mindensity(fr, channels ="V450-A"))
gs_pop_add(gs, CD3_gate, parent = "singlets", name = "CD3")
recompute(gs)
}
@

and then we can use the more adanced version of \Rfunction{quadrantGate}: 
\Rfunction{quadGate.seq} for gating "CD4" and "CD8" sequentially:
<<openCyto-Tsub>>=
if(require(openCyto)){
thisData <- getData(gs, "CD3") #get parent data
Tsub_gate <- fsApply(thisData,
                     function(fr)
                      openCyto::quadGate.seq(fr,
                            channels = c("B710-A", "R780-A"),
                            gFunc = 'mindensity')
                    )
gs_pop_add(gs, Tsub_gate, parent = "CD3", names = c("CD8", "DPT", "CD4", "DNT"))
recompute(gs)
}
@

and then plot the gates
<<plotALL-openCyto, results='hide'>>=
autoplot(gs[[1]])
@

Note that in order to get parent gated data by \Rfunction{getData}, we have to call \Rfunction{recompute} after adding each gate. 
The result is very similar to the manual gates but the gating process is more data-driven and more consistent across samples.

To further automate the process, a gating pipeline can be established through \Rpackage{OpenCyto}(\citep{Finak2014}) that defines the hierarchical gating template in a text-based csv file. (See more details from http://openCyto.org)

\clearpage
\bibliographystyle{plainnat} 
\bibliography{../inst/doc/cytoref}
\end{document}