%\VignetteIndexEntry{flowClust package} %\VignetteKeywords{Preprocessing,statistics} \documentclass{article} \usepackage{cite, hyperref} \title{Robust Model-based Clustering of Flow Cytometry Data\\ The flowClust package} \author{Raphael Gottardo, Kenneth Lo} \begin{document} \setkeys{Gin}{width=1.0\textwidth, height=1.1\textwidth} \maketitle \begin{center} {\tt raph@stat.ubc.ca, c.lo@stat.ubc.ca} \end{center} \textnormal{\normalfont} \tableofcontents \newpage \section{Licensing} Under the Artistic License, you are free to use and redistribute this software. However, we ask you to cite the following paper if you use this software for publication. \begin{itemize} \item[] Lo, K., Brinkman, R.~R., and Gottardo, R. (2008) Automated Gating of Flow Cytometry Data via Robust Model-based Clustering. Cytometry A 73:321--332. \end{itemize} % If you use the GvHD dataset included in this package, you should cite the following paper: % \begin{itemize} % \item[] Brinkman, R.~R., Gasparetto, M., Lee, S.~J.~J., et~al. (2007) High-Content Flow Cytometry and Temporal Data Analysis for Defining a Cellular Signature of Graft-Versus-Host Disease. Biol. Blood Marrow Transplant. 13(6):691--700. % \end{itemize} \section{Overview} We apply a robust model-based clustering approach proposed by Lo et~al. (2008) to identify cell populations in flow cytometry data. The proposed approach is based on multivariate $t$ mixture models with the Box-Cox transformation. This approach generalizes Gaussian mixture models by modeling outliers using $t$ distributions and allowing for clusters taking non-ellipsoidal shapes upon proper data transformation. Parameter estimation is carried out using an Expectation-Maximization (EM) algorithm which simultaneously handles outlier identification and transformation selection. Please refer to Lo et~al. (2008) for more details. Whereas this research primarily arose to automate the gating process in flow cytometry, the proposed clustering approach is a general methodology which -- and also this package -- can be applied to cluster analyses in many other fields. This \texttt{flowClust} package consists of a core function to implement the aforementioned clustering methodology. Its source code is built in C for optimal utilization of system resources. Several plotting functions or methods are available to users for visualizing various features of the clustering results including the cluster assignment, outliers, and the size and shape of the clusters formed. The fitted mixture model may be projected onto any one/two dimensions and displayed by means of a contour or image plot. Currently, the \texttt{flowClust} package provides two options for estimating the number of mixture components when it is unknown, namely, the Bayesian Information Criterion (BIC) and the Integrated Completed Likelihood (ICL). \texttt{flowClust} is built in a way such that it is highly integrated with \texttt{flowCore}, the core package for flow cytometry that provides data structures and basic functionalities to process cytometry data. Please read Section \ref{flowCore} below for details about the actual implementations. \section{Installation} To build the \texttt{flowClust} package from source, make sure that the following is present in your system: \begin{itemize} \item GNU Scientific Library (GSL) \item Basic Linear Algebra Subprograms (BLAS) \item a C compiler \end{itemize} GSL can be downloaded at \url{http://www.gnu.org/software/gsl/}. In addition, the package uses BLAS to perform basic vector and matrix operations. Please go to \url{http://www.netlib.org/blas/faq.html#5} for a list of optimized BLAS libraries for a variety of computer architectures. For instance, Mac users may use the built-in vecLib framework, while users of Intel machines may use the Math Kernel Library (MKL). A C compiler is needed to build the package as the core of the \texttt{flowClust} function is coded in C. For the package to be installed properly you might have to type the following command before installation:\\[6pt] \texttt{export LD\_LIBRARY\_PATH='/path/to/GSL/:/path/to/BLAS/':\$LD\_LIBRARY\_PATH}\\[6pt] which will tell {\bf R} where your GSL and BLAS libraries (see below for more details about BLAS libraries) are. Note that this might have already been configured on your system, so you might not have to do so. In case you need to do it, you might consider copying and pasting the line in your \texttt{.bashrc} so that you do not have to do it every time. Now you are ready to install the package: \\[6pt] \texttt{R CMD INSTALL flowClust\_x.y.z.tar.gz}\\[6pt] The package will look for a BLAS library on your system, and by default it will choose gslcblas, which is not optimized for your system. To use an optimized BLAS library, you can use the \texttt{-{}-with-blas} argument which will be passed to the \texttt{configure.ac} file. For example, on a Mac with vecLib pre-installed the package may be installed via: \\[6pt] \texttt{R CMD INSTALL flowClust\_x.y.z.tar.gz -{}-configure-args="-{}-with-blas='-framework vecLib'"}\\[6pt] On a 64-bit Intel machine which has MKL as the optimized BLAS library, the command may look like: \\[6pt] \texttt{R CMD INSTALL flowClust\_x.y.z.tar.gz -{}-configure-args="-{}-with-blas='-L/usr/local/mkl/lib/em64t/ -lmkl -lguide -lpthread'"}\\[6pt] where \texttt{/usr/local/mkl/lib/em64t/} is the path to MKL. If you prefer to install a prebuilt binary, you need GSL for successful installation. \section{Example: clustering of the Rituximab dataset} \subsection{The core function} \label{core} To demonstrate the functionality we use a flow cytometry dataset from a drug-screening project to identify agents that would enhance the anti-lymphoma activity of Rituximab, a therapeutic monoclonal antibody. The dataset is an object of class \texttt{flowFrame}; it consists of eight variables, among them only the two scattering variables (\texttt{FSC.H}, \texttt{SSC.H}) and two channels for the fluorochrome measurements (\texttt{FL1.H}, \texttt{FL3.H}) are of interest in this experiment. Note that, apart from a typical matrix object, the \texttt{flowClust} package may directly take a \texttt{flowFrame} object, a standard {\bf R} implementation of an ``FCS'' file which may be returned from the \texttt{read.FCS} function in the \texttt{flowCore} package, as data input. The following code performs an analysis with one cluster using the two scattering variables: <>= library(flowClust) @ <>= data(rituximab) summary(rituximab) res1 <- flowClust(rituximab, varNames=c("FSC.H", "SSC.H"), K=1, B=100) @ \texttt{B} is the maximum number of EM iterations; for demonstration purpose here we set a small value for \texttt{B}. The main purpose of performing a cluster analysis with one cluster here is to identify outliers, which will be removed from the subsequent analysis. Next, we would like to proceed with an analysis using the two channels for the fluorochrome measurements on the cells selected at the first stage. The following code performs the analysis with the number of clusters being fixed from one to six respectively: <>= rituximab2 <- rituximab[rituximab %in% res1,] res2 <- flowClust(rituximab2, varNames=c("FL1.H", "FL3.H"), K=1:6, B=100) @ We select the best model based on the Bayesian Information Criterion (BIC). The BIC value of a fitted model can be retrieved through the \texttt{criterion} method. By inspection, the BIC values stay relatively flat beyond three clusters. We therefore choose the model with three clusters and print a summary of the corresponding clustering result: <>= criterion(res2, "BIC") summary(res2[[3]]) @ The summary states that the rule used to identify outliers is ``\texttt{90\% quantile}", which means that a point outside the 90\% quantile region of the cluster to which it is assigned will be called an outlier. To specify a different rule, we make use of the \texttt{ruleOutliers} replacement method. The example below applies the more conservative ``\texttt{95\% quantile}" rule to identify outliers: <>= ruleOutliers(res2[[3]]) <- list(level=0.95) summary(res2[[3]]) @ We can also combine the rule set by the \texttt{z.cutoff} argument to identify outliers. Suppose we would like to assign an observation to a cluster only if the associated posterior probability is greater than 0.5. We can add this rule with the following command: <>= ruleOutliers(res2[[3]]) <- list(z.cutoff=0.6) summary(res2[[3]]) @ This time more points are called outliers. Note that such a change of the rule will not incur a change of the model-fitting process. The information about which points are called outliers is conveyed through the ``\texttt{flagOutliers}" slot, a logical vector in which the positions of \texttt{TRUE} correspond to points being called outliers. By default, when 10 or more points accumulate on the upper/lower boundary of any dimension, the \texttt{flowClust} function will filter those points. To change the threshold count from default (i.e., 10), users may specify \texttt{max.count} and \texttt{min.count} when running \texttt{flowClust}. To suppress filtering at the upper and/or the lower boundaries, set \texttt{max.count} and/or \texttt{min.count} as -1. We can also use the \texttt{max} and \texttt{min} arguments to control filtering of points, but from a different perspective. For instance, if we are only interested in cells which have a \texttt{FL1.H} measurement within (0, 400) and \texttt{FL3.H} within (0, 800), we may use the following code to perform the cluster analysis: <>= flowClust(rituximab2, varNames=c("FL1.H", "FL3.H"), K=2, B=100, min=c(0,0), max=c(400,800)) @ \subsection{Visualization of clustering results} Information like the cluster assignment, cluster shape and outliers may be visualized by calling the \texttt{plot} method to make a scatterplot: \newpage <>= plot(res2[[3]], data=rituximab2, level=0.8, z.cutoff=0) @ \noindent The \texttt{level} and/or \texttt{z.cutoff} arguments are needed when we want to apply a rule different from that stored in the \texttt{ruleOutliers} slot of the \texttt{flowClust} object to identify outliers. To look for densely populated regions, a contour/image plot can be made: \newpage <>= res2.den <- density(res2[[3]], data=rituximab2) plot(res2.den) @ \newpage <>= plot(res2.den, type="image") @ When we want to examine how the fitted model and/or the data is distributed along one chosen dimension, we can use the \texttt{hist} method: \newpage <>= hist(res2[[3]], data=rituximab2, subset="FL1.H") @ \noindent The \texttt{subset} argument may also take a numeric value: <>= hist(res2[[3]], data=rituximab2, subset=1) @ Since \texttt{FL1.H} is the first element of \texttt{res2[[3]]@varNames}, this line produces exactly the same histogram as the one generated by the line taking \texttt{subset="FL1.H"}. Likewise, the \texttt{subset} argument of both \texttt{plot} methods accepts either a numeric or a character vector to specify which two variables are to be shown on the plot. \subsection{Integration with flowCore} \label{flowCore} As mentioned in the Overview, effort has been made to make this package highly integrate with the \texttt{flowCore} package. Users will find that most methods defined in \texttt{flowCore} also work well in the context of \texttt{flowClust}. The very first step of integration is to replace the \texttt{flowClust} function with the \texttt{tmixFilter}-\texttt{filter} method pair. The aim is to wrap the clustering methodology in a filtering operation like those found in \texttt{flowCore}. The \texttt{tmixFilter} function creates a \texttt{filter} object to store all settings required for filtering (i.e., clustering). The object created is then passed to the actual filtering operation implemented by the \texttt{filter} method. The use of the \texttt{tmixFilter}-\texttt{filter} method pair separates the task of specifying the settings from the actual filtering operation, which facilitates a usual scenario in which filtering with the same settings is performed upon a set of data files. As an example, the filtering operation that resembles the second-stage clustering using \texttt{FL1.H} and \texttt{FL3.H} with 3 clusters (see Section \ref{core}) is implemented by the following code: <>= s2filter <- tmixFilter("s2filter", c("FL1.H", "FL3.H"), K=3, B=100) res2f <- filter(rituximab2, s2filter) @ The object \texttt{res2f} is of class \texttt{tmixFilterResult}, which extends the \texttt{multipleFilterResult} class created by \texttt{flowCore}. Users may apply various subsetting operations defined for the \texttt{multipleFilterResult} class in a similar manner on a \texttt{tmixFilterResult} object: <>= Subset(rituximab2, res2f) split(rituximab2, res2f, population=list(sc1=1:2, sc2=3)) @ The \texttt{Subset} method above returns a \texttt{flowFrame} object consisting of observations within the data-driven filter constructed. The \texttt{split} method splits the data into 2 populations upon the removal of outliers: the first population is formed by observations assigned to clusters 1 and 2 constructed by the filter, and the other population consists of observations assigned to cluster 3. The two populations are returned as two \texttt{flowFrame} objects separately, which are stored inside a list and labelled with \texttt{sc1} and \texttt{sc2} respectively. The \texttt{\%in\%} operator is also defined for a \texttt{tmixFilterResult} object, which will return a logical vector in which a \texttt{TRUE} value means that the corresponding observation is accepted by the filter. In fact, the implementation of the \texttt{Subset} method needs to call \texttt{\%in\%}. The object returned by \texttt{tmixFilter} is of class \texttt{tmixFilter}, which extends the \texttt{filter} class in \texttt{flowCore}. Various operators, namely, \texttt{\&}, \texttt{|}, \texttt{!} and \texttt{\%subset\%}, which have been constructed for \texttt{filter} objects in \texttt{flowCore}, also produce expected outcomes when applied to a \texttt{tmixFilter} object. For example, to perform clustering on a subset evaluated by a rectangle gate, we may apply the following code: <>= rectGate <- rectangleGate(filterId="rectRegion", "FL1.H"=c(0, 400), "FL3.H"=c(0, 800)) MBCfilter <- tmixFilter("MBCfilter", c("FL1.H", "FL3.H"), K=2, B=100) filter(rituximab2, MBCfilter %subset% rectGate) @ \end{document}