% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{Basic Functions for Flow Cytometry Data} %\VignetteDepends{flowCore, flowViz} %\VignetteKeywords{} %\VignettePackage{flowCore} \documentclass[11pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{times} \usepackage{comment} \usepackage{graphicx} \usepackage{subfigure} \textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \title{flowCore: data structures package for flow cytometry data} \author{N. Le Meur F. Hahne B. Ellis P. Haaland} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \begin{abstract} \noindent \textbf{Background} The recent application of modern automation technologies to staining and collecting flow cytometry (FCM) samples has led to many new challenges in data management and analysis. We limit our attention here to the associated problems in the analysis of the massive amounts of FCM data now being collected. From our viewpoint, see two related but substantially different problems arising. On the one hand, there is the problem of adapting existing software to apply standard methods to the increased volume of data. The second problem, which we intend to address here, is the absence of any research platform which bioinformaticians, computer scientists, and statisticians can use to develop novel methods that address both the volume and multidimensionality of the mounting tide of data. In our opinion, such a platform should be Open Source, be focused on visualization, support rapid prototyping, have a large existing base of users, and have demonstrated suitability for development of new methods. We believe that the Open Source statistical software R in conjunction with the Bioconductor Project fills all of these requirements. Consequently we have developed a Bioconductor package that we call \Rpackage{flowCore}. The \Rpackage{flowCore} package is not intended to be a complete analysis package for FCM data, rather, we see it as providing a clear object model and a collection of standard tools that enable R as an informatics research platform for flow cytometry. One of the important issues that we have addressed in the \Rpackage{flowCore} package is that of using a standardized representation that will insure compatibility with existing technologies for data analysis and will support collaboration and interoperability of new methods as they are developed. In order to do this, we have followed the current standardized descriptions of FCM data analysis as being developed under NIH Grant xxxx [n]. We believe that researchers will find flowCore to be a solid foundation for future development of new methods to attack the many interesting open research questions in FCM data analysis.\\ \noindent \textbf{Methods} We propose a variety different data structures. We have implemented the classes and methods in the Bioconductor package flowCore. We illustrate their use with X case studies.\\ \noindent \textbf{Results} We hope that those proposed data structures will be the base for the development of many tools for the analysis of high throughput flow cytometry.\\ \noindent \textbf{keywords} Flow cytometry, high throughput, software, standard \end{abstract} <>= library(flowCore) @ \section{Introduction} Traditionally, flow cytometry has been a tube-based technique limited to small-scale laboratory and clinical studies. High throughput methods for flow cytometry have recently been developed for drug discovery and advanced research methods \citep{Gasparetto2004}. As an example, the flow cytometry high content screening (FC-HCS) can process up to a thousand samples daily at a single workstation, and the results have been equivalent or superior to traditional manual multi-parameter staining and analysis techniques. The amount of information generated by high throughput technologies such as FC-HCS need to be transformed into executive summaries (which are brief enough) for creative studies by a human researcher \citep{Brazma2001}. Standardization is critical when developing new high throughput technologies and their associated information services \citep{Brazma2001, Chicurel2002, Boguski2003}. Standardization efforts have been made in clinical cell analysis by flow cytometry \citep{Keeney2004}, however data interpretation has not been standardized for even low throughput FCM. It is one of the most difficult and time consuming aspects of the entire analytical process as well as a primary source of variation in clinical tests, and investigators have traditionally relied on intuition rather than standardized statistical inference \citep{Bagwell2004,Braylan2004,Parks1997,Suni2003}. In the development of standards in high throughput FCM, few progress has been done in term of Open Source software. In this article we propose R data structures to handle flow cytometry data through the main steps of preprocessing: compensation, transformation, filtering. The aim is to merge both prada and rflowcyt \citep{LeMeur2006} into one core package which is compliant with the data exchange standards that are currently developed in the community \citep{Spidlen2006}. Visualization as well as quality control will than be part of the utility packages that depend on the data structures defined in the flowCore package. \section{Representing Flow Cytometry Data} \Rpackage{flowCore}'s primary task is the representation and basic manipulation of flow cytometry (or similar) data. This is accomplished through a data model very similar to that adopted by other Bioconductor packages using the \Robject{expressionSet} and \Robject{AnnotatedDataFrame} structures familiar to most Bioconductor users. \subsection{The \Rclass{flowFrame} Class} The basic unit of manipulation in \Rpackage{flowCore} is the \Rclass{flowFrame}, which corresponds roughly with a single ``FCS'' file exported from the flow cytometer's acquisition software. At the moment we support FCS file versions 2.0 through 3.0, and we expect to support FCS4/ACS1 as soon as the specification has been ratified. \subsubsection{Data elements} The primary elements of the \Robject{flowFrame} are the \Robject{exprs} and \Robject{parameters} slots, which contain the event-level information and column metadata respectively. The event information, stored as a single matrix, is accessed and manipulated via the \Rfunction{exprs()} and \Rfunction{exprs<-} methods, allowing \Robject{flowFrame}s to be stitched together if necessary (for example, if the same tube has been collected in two acquisition files for memory reasons). The \Robject{parameters} slot is an \Rclass{AnnotatedDataFrame} that contains information derived from an FCS file's ``\$P'' keywords, which describe the detector and stain information. The entire list is available via the \Rfunction{parameter()} method, but more commonly this information is accessed through the \Rfunction{names}, \Rfunction{featureNames} and \Rfunction{colnames} methods. The \Rfunction{names} function returns a concatenated version of \Rfunction{names} and \Rfunction{featureNames} using a format similar to the one employed by most flow cytometry analysis software. The \Rfunction{colnames} method returns the detector names, often named for the fluorochrome detected, while the \Rfunction{featureNames} methods returns the description field of the parameters, which will typically be an identifier for the antibody. The \Rfunction{keyword} method allows access to the raw FCS keywords, which are a mix of standard entries such as ``SAMPLE ID,'' vendor specific keywords and user-defined keywords that add more information about an experiment. In the case of plate-based experiments, there are also one or more keywords that identify the specific well on the plate. Most vendor software also include some sort of unique identifier for the file itself. The specialized methods \Rfunction{identifier} attempts to locate an appropriate globally unique identifier that can be used to uniquely identify a frame. Failing that, this method will return the original file name offering some assurance that this frame is at least unique to a particular session. \subsubsection{Reading a \Robject{flowFrame}} FCS files are read into the R environment via the \Rfunction{read.FCS} function using the standard connection interface---allowing for the possibility of accessing FCS files hosted on a remote resource as well as those that have been compressed or even retrieved as a blob from a database interface. FCS files (version 2.0 and 3.0) and LMD (List Mode Data) extensions are currently supported. There are also several immediate processing options available in this function, the most important of which is the \Rfunction{transformation} parameter, which can either ``linearize'' (the default) or ``linearize-with-PnG-scaling'' or ``scale'' our data. To see how this works, first we will examine an FCS file without any transformation at all: <>= file.name <- system.file("extdata","0877408774.B08", package="flowCore") x <- read.FCS(file.name, transformation=FALSE) summary(x) @ As we can see, in this case the values from each parameter seem to run from $0$ to $1023$ ($2^{10}-1$). However, inspection of the ``exponentiation'' keyword (\$PE) reveals that some of the parameters (3 and 4) have been stored in the format of the form $a\times{10^{x/R}}$ where $a$ is given by the first element of the string. <>= keyword(x,c("$P1E", "$P2E", "$P3E", "$P4E")) @ The default ``linearize'' transformation option will convert these to, effectively, have a ``\$PE'' of ``0,0'': <>= summary(read.FCS(file.name)) @ The ``linearize-with-PnG-scaling'' option will perform the previous transformation and it will also apply a ``division by gain'' to pamameters stored on linear scale with specified gain. The gain is specified in the \$PnG keywords. This option has been introduced as part of Gating-ML 2.0 compliance. Finally, the ``scale'' option will both linearize values as well as ensure that output values are contained in $[0,1]$, which is the proposed method of data storage for the ACS1.0/FCS4.0 specification: <>= summary(read.FCS(file.name,transformation="scale")) @ Another parameter of interest is the \Rfunction{alter.names} parameter, which will convert the parameter names into more ``R friendly'' equivalents, usually by replacing ``-'' with ``.'': <>= read.FCS(file.name,alter.names=TRUE) @ When only a particular subset of parameters is desired the \Rfunction{column.pattern} parameter allows for the specification of a regular expression and only parameters that match the regular expression will be included in the frame. For example, to include on the Height parameters: <>= x <- read.FCS(file.name, column.pattern="-H") x @ \textit{Note that \Rfunction{column.pattern} is applied after \Rfunction{alter.names} if it is used.} Finally, only a sample of lines can be read in case you need a quick overview of a large series of files. <>= lines <- sample(100:500, 50) y <- read.FCS(file.name, which.lines = lines) y @ \subsubsection{Visualizing a \Rclass{flowFrame}} Much of the more sophisticated visualization of \Rclass{flowFrame} and \Rclass{flowSet} objects, including an interface to the \Rpackage{lattice} graphics system is implemented by the \Rpackage{flowViz} package, also included as part of Bioconductor. Here, we will only introduce the standard \Rfunction{plot} function. The basic plot provides a simple pairs plot of all parameters: <>= library(flowViz) plot(x) @ To control the parameters being plotted we can supply a \Robject{y} value in the form of a character vector. If we choose exactly two parameters this will create a bivariate density plot. <>= plot(x,c("FL1-H", "FL2-H")) @ However, if we only supply a single parameter we instead get a univariate histogram, which also accepts the usual histogram arguments: <>= plot(x, "FL1-H", breaks=256) @ \subsection{The \Rclass{flowSet} Class} Most experiments consist of several \Rclass{flowFrame} objects, which are organized using a \Rclass{flowSet} object. This class provides a mechanism for efficiently hosting the \Rclass{flowFrame} objects with minimal copying, reducing memory requirements, as well as ensuring that experimental metadata stays properly to the appropriate \Rclass{flowFrame} objects. \subsubsection{Creating a \Rclass{flowSet}} To facilitate the creation of \Rclass{flowSet} objects from a variety of sources, we provide a means to coerce \Rclass{list} and \Rclass{environment} objects to a \Rclass{flowSet} object using the usual coercion mechanisms. For example, if we have a directory containing FCS files we can read in a list of those files and create a \Rclass{flowSet} out of them: <>= frames <- lapply(dir(system.file("extdata", "compdata", "data", package="flowCore"), full.names=TRUE), read.FCS) as(frames, "flowSet") @ Note that the original list is unnamed and that the resulting sample names are not particularly meaningful. If the list is named, the list constructed is much more meaningful. One such approach is to employ the \Rfunction{keyword} method for \Rclass{flowFrame} objects to extract the ``SAMPLE ID'' keyword from each frame: <>= names(frames) <- sapply(frames, keyword, "SAMPLE ID") fs <- as(frames, "flowSet") fs @ \subsubsection{Working with experimental metadata} Like most Bioconductor organizational classes, the \Rclass{flowSet} has an associated \Rclass{AnnotatedDataFrame} that provides metadata not contained within the \Rclass{flowFrame} objects themselves. This data frame is accessed and modified via the usual \Rfunction{phenoData} and \Rfunction{phenoData<-} methods. You can also generally treat the phenotypic data as a normal data frame to add new descriptive columns. For example, we might want to track the original filename of the frames from above in the phenotypic data for easier access: <>= phenoData(fs)$Filename <- fsApply(fs,keyword, "$FIL") pData(phenoData(fs)) @ Note that we have used the \Rclass{flowSet}-specific iterator, \Rfunction{fsApply}, which acts much like \Rfunction{sapply} or \Rfunction{lapply}. Additionally, we should also note that the \Rfunction{phenoData} data frame \textbf{must} have row names that correspond to the original names used to create the \Rclass{flowSet}. \subsubsection{Bringing it all together: \Rfunction{read.flowSet}} Much of the functionality described above has been packaged into the \Rfunction{read.flowSet} convenience function. In it's simplest incarnation, this function takes a \Rfunction{path}, that defaults to the current working directory, and an optional \Rfunction{pattern} argument that allows only a subset of files contained within the working directory to be selected. For example, to read a \Rclass{flowSet} of the files read in by \Robject{frame} above: <>= read.flowSet(path = system.file("extdata", "compdata", "data", package="flowCore")) @ \Rfunction{read.flowSet} will pass on additional arguments meant for the underlying \Rfunction{read.FCS} function, such as \Rfunction{alter.names} and \Rfunction{column.pattern}, but also supports several other interesting arguments for conducting initial processing: \begin{description} \item[files]{ An alternative to the \Rfunction{pattern} argument, you may also supply a vector of filenames to read.} \item[name.keyword]{ Like the example in the previous section, you may specify a particular keyword to use in place of the filename when creating the \Rclass{flowSet}.} \item[phenoData]{ If this is an \Rclass{AnnotatedDataFrame}, then this will be used in place of the data frame that is ordinarily created. Additionally, the row names of this object will be taken to be the filenames of the FCS files in the directory specified by \Robject{path}. This argument may also be a named list made up of a combination of \Robject{character} and \Robject{function} objects that specify a keyword to extract from the FCS file or a function to apply to each frame that will return a result. } \end{description} To recreate the \Rclass{flowSet} that we created by hand from the last section we can use \Rfunction{read.flowSet}s advanced functionality: <>= fs <- read.flowSet(path=system.file("extdata", "compdata", "data", package="flowCore"), name.keyword="SAMPLE ID", phenoData=list(name="SAMPLE ID", Filename="$FIL")) fs pData(phenoData(fs)) @ \subsubsection{Manipulating a \Rclass{flowSet}} You can extract a \Rclass{flowFrame} from a \Rclass{flowSet} object in the usual way using the \Rfunction{[[} or \Rfunction{\$} extraction operators. On the other hand using the \Rfunction{[} extraction operator returns a new \Rclass{flowSet} by \textbf{copying} the environment. However, simply assigning the \Rclass{flowFrame} to a new variable will \textbf{not} copy the contained frames. The primary iterator method for a \Rclass{flowSet} is the \Rfunction{fsApply} method, which works more-or-less like \Rfunction{sapply} or \Rfunction{lapply} with two extra options. The first argument, \Robject{simplify}, which defaults to \Robject{TRUE}, instructs \Rfunction{fsApply} to attempt to simplify it's results much in the same way as \Rfunction{sapply}. The primary difference is that if all of the return values of the iterator are \Rclass{flowFrame} objects, \Rfunction{fsApply} will create a new \Rclass{flowSet} object to hold them. The second argument, \Robject{use.exprs}, which defaults to \Robject{FALSE} instructs \Rfunction{fsApply} to pass the expression matrix of each frame rather than the \Rclass{flowFrame} object itself. This allows functions to operate directly on the intensity information without first having to extract it. As an aid to this sort of operation we also introduce the \Rfunction{each\_row} and \Rfunction{each\_col} convenience functions that take the place of \Rfunction{apply} in the \Rfunction{fsApply} call. For example, if we wanted the median value of each parameter of each \Rclass{flowFrame} we might write: <>= fsApply(fs, each_col, median) @ % which is equivalent to the less readable <>= fsApply(fs,function(x) apply(x, 2, median), use.exprs=TRUE) @ % In this case, the \Robject{use.exprs} argument is not required in the first case because \Rfunction{each\_col} and \Rfunction{each\_row} are methods and have been defined to work on \Rclass{flowFrame} objects by first extracting the intensity data. \section{Transformation} \Rpackage{flowCore} features two methods of transforming parameters within a \Rclass{flowFrame}: inline and out-of-line. The inline method, discussed in the next section has been developed primarily to support filtering features and is strictly more limited than the out-of-line transformation method, which uses R's \Rfunction{transform} function to accomplish the filtering. Like the normal \Rfunction{transform} function, the \Rclass{flowFrame}is considered to be a data frame with columns named for parameters of the FCS file. For example, if we wished to plot our first \Rclass{flowFrame}'s first two fluorescence parameters on the log scale we might write: <>= plot(transform(fs[[1]], `FL1-H`=log(`FL1-H`), `FL2-H`=log(`FL2-H`)), c("FL1-H","FL2-H")) @ Like the usual \Rfunction{transform} function, we can also create new parameters based on the old parameters, without destroying the old <>= plot(transform(fs[[1]], log.FL1.H=log(`FL1-H`), log.FL2.H=log(`FL2-H`)), c("log.FL1.H", "log.FL2.H")) @ \subsection{Standard Transforms} Though any function can be used as a transform in both the out-of-line and inline transformation techniques, \Rpackage{flowCore} provides a number of parameterized transform generators that correspond to the transforms commonly found in flow cytometry and defined in the Transformation Markup Language (Transformation-ML, see \url{http://www.ficcs.org/} and \cite{Spidlen2006} for more details). Briefly, the predefined transforms are: \begin{description} \item[truncateTransform]{ $y = \left\{\begin{array}{ll} a & x < a \\ x & x \geq a \\ \end{array}\right.$ } \item[scaleTransform]{$ f(x) = \frac{x-a}{b-a}$} \item[linearTransform]{ $ f(x) = a+bx$ } \item[quadraticTransform]{ $ f(x) = ax^2 + bx + c$ } \item[lnTransform]{$ f(x) = \log\left(x\right)\frac{r}{d} $} \item[logTransform]{$ f(x) = \log_b\left(x\right)\frac{r}{d} $} \item[biexponentialTransform]{ $ f^{-1}(x) = ae^{bx}-ce^{dx}+f$} \item[logicleTransform]{ A special form of the biexponential transform with parameters selected by the data.} \item[arcsinhTransform]{ $ f(x) = asinh\left(a+bx\right)+c$} \end{description} To use a standard transform, first we create a transform function via the constructors supplied by \Rpackage{flowCore}: <>= aTrans <- truncateTransform("truncate at 1", a=1) aTrans @ which we can then use on the parameter of interest in the usual way <>= transform(fs,`FL1-H`=aTrans(`FL1-H`)) @ However this form of transform call is not intended to used in the programmatic context because locally defined transform function (e.g. 'aTrans') may not be always visible to the non-standard evaluation environment. .e.g <>= f1 <- function(fs,...){ transform(fs, ...)[,'FL1-H'] } f2 <- function(fs){ aTrans <- truncateTransform("truncate at 1", a=1) f1(fs, `FL1-H` = aTrans(`FL1-H`)) } res <- try(f2(fs), silent = TRUE) res @ So this form of usage of 'transform' method is only useful for the interactive exploratary. we highly recommend the usage of \Robject{transformList} instead for the more robust and reproducible code. <>= myTrans <- transformList('FL1-H', aTrans) transform(fs, myTrans) @ \section{Gating} The most common task in the analysis of flow cytometry data is some form of filtering operation, also known as gating, either to obtain summary statistics about the number of events that meet a certain criteria or to perform further analysis on a subset of the data. Most filtering operations are a composition of one or more common filtering operations. The definition of gates in \Rpackage{flowCore} follows the Gating Markup Language Candidate Recommendation \cite{Spidlen2008}, thus any \Rpackage{flowCore} gating strategy can be reproduced by any other software that also adheres to the standard and \textit{vice versa}. \subsection{Standard gates and filters} Like transformations, \Rpackage{flowCore} includes a number of built-in common flow cytometry gates. The simplest of these gates are the geometric gates, which correspond to those typically found in interactive flow cytometry software: \begin{description} \item[rectangleGate]{Describes a cubic shape in one or more dimensions--a rectangle in one dimension is simply an interval gate.} \item[polygonGate]{Describes an arbitrary two dimensional polygonal gate.} \item[polytopeGate]{Describes a region that is the convex hull of the given points. This gate can exist in dimensions higher than $2$, unlike the \Robject{polygonGate}.} \item[ellipsoidGate]{Describes an ellipsoidal region in two or more dimensions} \end{description} These gates are all described in more or less the same manner (see man pages for more details): <>= rectGate <- rectangleGate(filterId="Fluorescence Region", "FL1-H"=c(0, 12), "FL2-H"=c(0, 12)) @ In addition, we introduce the notion of data-driven gates, or filters, not usually found in flow cytometry software. In these approaches, the necessary parameters are computed based on the properties of the underlying data, for instance by modelling data distribution or by density estimation : \begin{description} \item[norm2Filter]{ A robust method for finding a region that most resembles a bivariate Normal distribution. } \item[kmeansFilter]{ Identifies populations based on a one dimensional k-means clustering operation. Allows the specification of \textbf{multiple} populations. } \end{description} \subsection{Count Statistics} When we have constructed a filter, we can apply it in two basic ways. The first is to collect simple summary statistics on the number and proportion of events considered to be contained within the gate or filter. This is done using the \Rfunction{filter} method. The first step is to apply our filter to some data <>= result = filter(fs[[1]],rectGate) result @ As we can see, we have returned a \Rclass{filterResult} object, which is in turn a filter allowing for reuse in, for example, subsetting operations. To obtain count and proportion statistics, we take the summary of this \Rclass{filterResult}, which returns a list of summary values: <>= summary(result) summary(result)$n summary(result)$true summary(result)$p @ A filter which contains multiple populations, such as the \Rclass{kmeansFilter}, can return a list of summary lists: <>= summary(filter(fs[[1]], kmeansFilter("FSC-H"=c("Low", "Medium", "High"), filterId="myKMeans"))) @ A filter may also be applied to an entire \Rclass{flowSet}, in which case it returns a list of \Rclass{filterResult} objects: <>= filter(fs,rectGate) @ \subsection{Subsetting} To subset or split a \Rclass{flowFrame} or \Rclass{flowSet}, we use the \Rfunction{Subset} and \Rfunction{split} methods respectively. The first, \Rfunction{Subset}, behaves similarly to the standard R \Rfunction{subset} function, which unfortunately could not used. For example, recall from our initial plots of this data that the morphology parameters, Forward Scatter and Side Scatter contain a large more-or-less ellipse shaped population. If we wished to deal only with that population, we might use \Rfunction{Subset} along with a \Rclass{norm2Filter} object as follows: <>= morphGate <- norm2Filter("FSC-H", "SSC-H", filterId="MorphologyGate", scale=2) smaller <- Subset(fs, morphGate) fs[[1]] smaller[[1]] @ Notice how the \Robject{smaller} \Rclass{flowFrame} objects contain fewer events. Now imagine we wanted to use a \Rclass{kmeansFilter} as before to split our first fluorescence parameter into three populations. To do this we employ the \Rfunction{split} function: <>= split(smaller[[1]], kmeansFilter("FSC-H"=c("Low","Medium","High"), filterId="myKMeans")) @ or for an entire \Rclass{flowSet} <>= split(smaller, kmeansFilter("FSC-H"=c("Low", "Medium", "High"), filterId="myKMeans")) @ \subsection{Combining Filters} Of course, most filtering operations consist of more than one gate. To combine gates and filters we use the standard R Boolean operators: \&, | and ! to construct an intersection, union and complement respectively: <>= rectGate & morphGate rectGate | morphGate !morphGate @ we also introduce the notion of the subset operation, denoted by either \%subset\% or \%\&\%. This combination of two gates first performs a subsetting operation on the input \Rclass{flowFrame} using the right-hand filter and then applies the left-hand filter. For example, <>= summary(filter(smaller[[1]],rectGate %&% morphGate)) @ first calculates a subset based on the \Robject{morphGate} filter and then applies the \Robject{rectGate}. \subsection{Transformation Filters} Finally, it is sometimes desirable to construct a filter with respect to transformed parameters. To allow for this in our filtering constructs we introduce a special form of the \Rfunction{transform} method along with another filter combination operator \%on\%, which can be applied to both filters and \Rclass{flowFrame} or \Rclass{flowSet} objects. To specify our transform filter we must first construct a transform list using a simplified version of the \Rfunction{transform} function: <>= tFilter <- transform("FL1-H"=log,"FL2-H"=log) tFilter @ Note that this version of the transform filter does not take parameters on the right-hand side--the functions can only take a single vector that is specified by the parameter on the left-hand side. In this case those parameters are ``FL1-H'' and ``FL2-H.'' The function also does not take a specific \Rclass{flowFrame} or \Rclass{flowSet} allowing us to use this with any appropriate data. We can then construct a filter with respect to the transform as follows: <>= rect2 <- rectangleGate(filterId="Another Rect", "FL1-H"=c(1,2), "FL2-H"=c(2,3)) %on% tFilter rect2 @ Additionally, we can use this construct directly on a \Rclass{flowFrame} or \Rclass{flowSet} by moving the transform to the left-hand side and placing the data on the right-hand side: <>= plot(tFilter %on% smaller[[1]],c("FL1-H","FL2-H")) @ which has the same effect as the $\log$ transform used earlier. \section{GatingSet} \Rclass{filterSets} are very limited in their use for complex analysis work flows. They are result-centric and it is hard to access intermediate results. \Rpackage{flowWorkspace} and \Rpackage{openCyto} framework (http://opencyto.org) offers much more versatile tools for such tasks though the \Rclass{GatingSet} class (the old \Rclass{workFlow} is now deprecated). The general idea is to let the software handle the organization of intermediate results, and operations and to provide a unified API to access and summarize these operations. \subsection{Abstraction of GatingSet} There are two classes in \Rpackage{flowWorkspace} that are used to abstract work flows: \Rclass{GatingSet} objects are the basic container holding all the necessary bits and pieces and they are the main structure for user interaction. It is the container storing multiple \Rclass{GatingHierarchy} objects which are associate with individual samples. One can think of \Rclass{GatingSet} corresponds to \Rclass(flowSet) and \Rclass{GatingHierarchy} corresponds to \Rclass(flowFrame). It is important to know that \Rclass{GatingSet} use 'exteranl pointer' to store the 'gating tree' and thus most of its accessors have a reference semantic instead of the pass-by-value semantic that is usually found in the R language. The main consequence on the user-level is the fact that direct assignments to a \Rclass{GatingSet} object are usually not necessary; i.e., functions that operate on the \Rclass{GatingSet} have the potential side-effect of modifying the object. \subsection{Creating \Rclass{GatingSet} objects} Before creating a 'GatingSet', we need to have flow data loaded into R as a \Rclass{flowSet} or \Rclass{ncdfFlowSet}(the disk-based 'flowSet', to handle large data set that are too big for memory, see \Rpackage(ncdfFlow) package). <>= library(flowWorkspace) fcsfiles <- list.files(pattern = "CytoTrol" , system.file("extdata", package = "flowWorkspaceData") , full = TRUE) fs <- read.flowSet(fcsfiles) @ Normally, we want to compensate the data firstly by using a user supplied compensation matrix: <>= gs_manual <- load_gs(list.files(pattern = "gs_manual" , system.file("extdata", package = "flowWorkspaceData") , full = TRUE)) comp <- getCompensationMatrices(gs_manual[[1]]) @ <>= comp fs_comp <- compensate(fs, comp) @ Here is the effect of compensation: <>= transList <- estimateLogicle(fs[[1]], c("V545-A","V450-A")) grid.arrange( xyplot(`V545-A`~`V450-A`, transform(fs[[1]], transList), smooth = FALSE, xbin = 32, main = "Before") , xyplot(`V545-A`~`V450-A`, transform(fs_comp[[1]], transList) , smooth = FALSE, xbin = 32, main = "After") , ncol = 2) @ Then \Rclass{GatingSet} can be created using the constructor \Rfunction{GatingSet}. <>= gs <- GatingSet(fs_comp) gs @ % somehow adding trans to gs is not stable on Mac OS X Snow Leopard (10.6.8) + llvm-g++-4.2 % even though it does work on Mac OS X Mavericks (10.9.5) + clang++ and all linux and windows % to pass bioc build we have to skip this and do the transformation outside of gs <>= biexpTrans <- flowJoTrans(channelRange=4096, maxValue=262144 , pos=4.5,neg=0, widthBasis=-10) tf <- transformList(parameters(comp), biexpTrans) fs_trans <- transform(fs_comp, tf) gs <- GatingSet(fs_trans) gs @ We can query the available nodes in the \Rclass{GatingSet} using the \Rfunction{getNodes} method: <>= getNodes(gs) @ It shows the only node 'root' which corresponds to the raw 'flow data' just added. \subsection{transform the data} Transformation can be either done on 'flowSet' before constructing 'GatingSet' or a \Rclass{transformList} can be directly added to a \Rclass{GatingSet}: <>= biexpTrans <- flowJoTrans(channelRange=4096, maxValue=262144 , pos=4.5,neg=0, widthBasis=-10) tf <- transformList(parameters(comp), biexpTrans) gs <- transform(gs, tf) @ <>= grid.arrange( densityplot(~`B710-A`, fs_comp[[1]], main = "raw") ,densityplot(~`B710-A`, flowData(gs)[[1]], main = "trans") , ncol = 2 ) @ Note that we did assign the return value of \Rfunction{transform} back to \Robject{gs}. This is because 'flow data' is stored as R object and thus transforming the data still follows the pass-by-value semantics. Also 'GatingSet' only supports 'flowJoTrans' at the time of writing (which works in most of scenarios though), in order to use other type of transformation, 'flowSet/ncdfFlowSet` can be transformed first and then construct the `GatingSet' from the transformed data: <>= tf <- transformList(parameters(comp), myFun)#myFun is any customized flowCore compatible transformation function fs_trans <- transform(fs_comp, tf) gs <- GatingSet(fs_trans) @ Since it is done outside of 'GatingSet', the transformation functions won't be stored in 'GatingSet' as the first method does. \subsection{Add the gates} Some basic flowCore \Rclass{filter} can be added to a \Rclass{GatingSet}: <>= rg1 <- rectangleGate("FSC-A"=c(50000, Inf), filterId="NonDebris") add(gs, rg1, parent = "root") getNodes(gs) # gate the data recompute(gs) @ As we see, here we don't need to assign \Rclass{GatingSet} back because all the modifications are made in place to the \Rclass{external pointer} rather than the R object itself. And now there is one new population node under the `root' node called 'NonDebris'. The node is named after the 'filterId' of the gate if not explictly supplied. After the gates are added, the actual gating process is done by explictly calling \Rfunction{recompute} method. Note that the numeric value it returns is the internal ID for the new population just added, which can be normally ignored since the gating path instead of numeric id is recommended way to refer to population nodes later. To view the gate we just added, <>= plotGate(gs, "NonDebris") @ Since They are '1d` gates, we can also dispaly it in `densityplot` <>= plotGate(gs, "NonDebris", type = "densityplot") @ To get population statistics for the given populatuion <>= getTotal(gs[[1]], "NonDebris")#counts getProp(gs[[1]], "NonDebris")#proportion @ Now we add two more gates: <>= # add the second gate mat <- matrix(c(54272,59392,259071.99382782 ,255999.994277954,62464,43008,70656 ,234495.997428894,169983.997344971,34816) , nrow = 5) colnames(mat) <-c("FSC-A", "FSC-H") mat pg <- polygonGate(mat) add(gs, pg, parent = "NonDebris", name = "singlets") # add the third gate rg2 <- rectangleGate("V450-A"=c(2000, Inf)) add(gs, rg2, parent = "singlets", name = "CD3") getNodes(gs) @ We see two more nodes are added to 'GatingSet' and the population names are explicitly specified during the adding this time. \Rclass{quadrantGate} that results in four sub-populationsis also supported. <>= qg <- quadGate("B710-A" = 2000, "R780-A" = 3000) add(gs, qg, parent="CD3", names = c("CD8", "DPT", "CD4", "DNT")) getChildren(gs[[1]], "CD3") # gate the data from "singlets" recompute(gs, "singlets") @ Here we see four children nodes are added to 'CD3' parent node. Four quadrants are named explicitly through 'names' argument by clock-wise order (start from top-left quadrant). 'recomputing' only needs to be done once from the first ungated node, which will automatically compute all its descendants. To plot the underlying tree <>= plot(gs) @ <>= if(suppressWarnings(require(Rgraphviz))){ plot(gs) }else{ plot(1,1, type="n", axes=FALSE, ann=FALSE) text(1,1,"Need to install Rgraphviz") } @ To plot all gates for one sample <>= plotGate(gs[[1]]) @ To retreive the underlying flow data for a gated \Rclass{population} <>= fs_nonDebris <- getData(gs, "NonDebris") fs_nonDebris nrow(fs_nonDebris[[1]]) nrow(fs[[1]]) @ To get all the population statistics <>= getPopStats(gs) @ \subsection{Removing nodes from \Rclass{GatingSet} object} There are dependencies between Rclass{nodes} in the hierarchical structure of the \Rclass{GatingSet} object. Thus, removing a particular node means also removing all of its associated child \Rclass{nodes}. <>= Rm('CD3', gs) getNodes(gs) Rm('NonDebris', gs) getNodes(gs) @ Now for the larger data set, it would be either inaccurate to apply the same hard-coded gate to all samples or impractical to manually set the gate coordinates for each individual sample. \Rpackage{openCyto}(\citep{Finak2014}) provides some data-driven gating functions to automatically generate these gates. For example, \Rfunction{mindensity} can be used for estimating 'nonDebris` gate for each sample. <>= library(openCyto) thisData <- getData(gs) nonDebris_gate <- fsApply(thisData, function(fr) openCyto:::.mindensity(fr, channels = "FSC-A")) add(gs, nonDebris_gate, parent = "root", name = "nonDebris") recompute(gs) @ \Rfunction{singeltGate} can be used for estimating 'singlets` <>= thisData <- getData(gs, "nonDebris") #get parent data singlet_gate <- fsApply(thisData, function(fr) openCyto:::.singletGate(fr, channels =c("FSC-A", "FSC-H"))) add(gs, singlet_gate, parent = "nonDebris", name = "singlets") recompute(gs) @ and then use \Rfunction{mindensity} again for "CD3" gate <>= thisData <- getData(gs, "singlets") #get parent data CD3_gate <- fsApply(thisData, function(fr) openCyto:::.mindensity(fr, channels ="V450-A")) add(gs, CD3_gate, parent = "singlets", name = "CD3") recompute(gs) @ and then use more adanced version of 'quadGate': \Rfunction{quadGate.seq} for gating "CD4" and "CD8" sequentially: <>= thisData <- getData(gs, "CD3") #get parent data Tsub_gate <- fsApply(thisData, function(fr) openCyto::quadGate.seq(fr , channels = c("B710-A", "R780-A") , gFunc = 'mindensity' ) ) add(gs, Tsub_gate, parent = "CD3", names = c("CD8", "DPT", "CD4", "DNT")) recompute(gs) @ and then plot the gates <>= plotGate(gs[[1]]) @ Note that in order to get parent gated data by 'getData' we have to 'recompute' after adding each gate. And the result is very similar to the manual gates but the gating process is more data-driven and more consistent across samples. The further automated the process, a gating pipeline can be established through \Rpackage{OpenCyto}(\citep{Finak2014}) that defines the hierarchical gating template in a text-based csv file. (See more details from http://openCyto.org) \clearpage \bibliographystyle{plainnat} \bibliography{cytoref} \end{document}