%\VignetteIndexEntry{Importing flowJo Workspaces into R} %\VignetteKeywords{flow cytometry, single cell assay, import} %\VignettePackage{flowWorkspace} \documentclass[11pt]{article} \usepackage{Sweave} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \title{flowWorkspace: A Package for Importing flowJo Workspaces into R} \author{Greg Finak } \begin{document} \maketitle \section{Purpose} The purpose of this package is to provide functionality to import relatively simple \textit{flowJo} workspaces into R. By this we mean, accessing the samples, groups, transformations, compensation matrices, gates, and population statistics in the \textit{flowJo} workspace, and replicating these using (primarily) \textit{flowCore} functionality. \section{Why Another flowJo Workspace Import Package?} There was a need to import \textit{flowJo} workspaces into R for comparative gating. The \textit{flowFlowJo} package did not meet our needs. Many groups have legacy data with associated flowJo XML workspace files in version 2.0 format that they would like to access using BioConductor's tools. Hopefully this package will fill that need. \section{Support} This package supports importing of \textbf{Version 2.0 XML workspaces only}. We cannot import \textbf{.jo} files directly. You will have to save them in XML workspace format, and ensure that that format is \textit{workspace version 2.0}. The package has been tested and works with files generated using flowJo version 9.1 on Mac OS X. XML generated by older versions of \textit{flowJo} on windows should work as well. We do not yet support \textit{flowJo}'s \textbf{Chimera} XML schema, though that support will be provided in the future. The package supports import of only a subset of the features present in a flowJo workspace. The package allows importing of sample and group names, gating hierarchy, compensation matrices, data transformation functions, a subset of gates, and population counts. BooleanGates are now supported by flowWorkspace. \section{Data Structures} The following section walks through opening and importing a flowJo workspace. \subsection{Loading the library} Simply call: <>= library(flowWorkspace) @ The library depends on numerous other packages, including \textit{graph, XML, Rgraphviz, flowCore, flowViz, RBGL}. \subsection{Opening a Workspace} We represent flowJo workspaces using \texttt{flowJoWorkspace} objects. We only need to know the path to, and filename of the flowJo workspace. <>= d<-system.file("extdata",package="flowWorkspaceData"); wsfile<-list.files(d,pattern="A2004Analysis.xml",full=T) @ In order to open this workspace we call: <>= ws<-openWorkspace(wsfile) summary(ws) @ We see that this a version 2.0 workspace file. It's location and filename are printed. Additionally, you are notified that the workspace file is open. This refers to the fact that the XML document is internally represented using 'C' data structures from the \Rpackage{XML} package. After importing the file, the workspace must be explicitly closed using \Rfunction{closeWorkspace()} in order to free up that memory. \subsection{Parsing the Workspace} With the workspace file open, we have not yet imported the XML document. The next step parses the XML workspace and creates R data structures to represent some of the information therein. Specifically, by calling \Rfunction{parseWorkspace()} the user will be presented with a list of \textit{groups} in the workspace file and need to choose one group to import. Why only one? Because of the way flowJo handles data transformation and compensation. Each group of samples is associated with a compensation matrix and specific data transformation. These are applied to all samples in the group. When a particular group of samples is imported, the package generates a \Rclass{GatingHierarchy} for each sample, describing the set of gates applied to the data (note: polygons, rectangles, quadrants, and ovals and boolean gates are supported). The set of GatingHierarchies for the group of samples is stored in a \Rclass{GatingSet} object. Calling \Rfunction{parseWorkspace()} is quite verbose, informing the user as each gate is created. The parsing can also be done non--interactively by specifying which group to import directly in the function call (either an index or a group name). An additional optional argument \texttt{execute=T/F} specifies whether you want to load, compensate, transform the data and compute statistics immediately after parsing the XML tree. <>= G<-parseWorkspace(ws,name=1,path=ws@path,isNcdf=FALSE,cleanup=FALSE,keep.indices=TRUE); #import the first group #Lots of output here suppressed for the vignette. @ When isNcdf flag is set TRUE,the data is stored in ncdf format on disk. <>= G @ We have generated a \Rclass{GatingSet} with $\Sexpr{length(G)}$ samples, each of which has 19 associated gates. Subsets of gating hierarchies can be accessed using the standard R subset syntax. At this point we have parsed the workspace file and generate the gating hierarchy associated with each sample imported from the file. The data have been loaded, compensated, and transformed in the workspace, and gating has been executed. The resulting \Rclass{GatingSet} contains a replicated analysis of the original flowJo workspace. We can plot the gating hierarchy for a given sample: <>= gh<-G[[1]] plot(gh) @ We can list the nodes (populations) in the gating hierarchy: <>= nodelist<-getNodes(gh, path = 1) nodelist @ Note that the number preceding the period in the node names is just an identifier to help uniquely label populations in the gating hierarchy. It does not represent any information about population statistics. When the node name itself is already unique,such as "Live","B Cell" ,the prefix is ommitted by default. We can force the prefix to be added to each node by setting \Robject{prefix} to "all". <>= getNodes(gh, prefix="all", path = 1) @ Alternativly, the the full path of the node can be displayed by setting \Robject{path} to "full". <>= getNodes(gh, path = "full") @ We can get a specific gate definition: <>= node<-nodelist[3] g<-getGate(gh,node) g @ We can get the population proportion (relative to its parent) for a single population: <>= getProp(gh,node) @ Or we can retrieve the population statistics for all populations in the sample: <>= getPopStats(gh) @ We can plot the coefficients of variation between the counts derived using flowJo and flowCore for each population: <>= print(plotPopCV(gh)) @ We can plot individual gates: note the scale of the transformed axes. Second argument can be either a numeric index <>= plotGate(gh, 10) @ or unique node name, <>= plotGate(gh, "pDC") @ or a full/parital gating path, <>= plotGate(gh, "APC/pDC") @ hexbin can be enabled to speed up plotting. <>= plotGate(gh, "pDC", xbin =32) @ \Robject{overlay} can be used to plot another cell population on top of the existing gates specified by \Robject{y}. <>= plotGate(gh,y = "pDC/IL-12+", xbin =32, overlay = "mDC/IL-12+") @ \Robject{overlay} can be either a \Robject{numeric} or \Robject{character} scalar indicating the index or gating path of a gate/population within the \Robject{GatingHierarchy} or a \Robject{logical} vector that indicates the cell event indices representing a sub-cell population. If we have metadata associated with the experiment, it can be attached to the \Robject{GatingSet}. <>= d<-data.frame(sample=factor(c("sample 1", "sample 2")),treatment=factor(c("sample","control")) ) pd<-pData(G) pd<-cbind(pd,d) pData(G)<-pd pData(G); @ We can retrieve the subset of data associated with a node: <>= getData(gh,node) @ When applied to the \Robject{GatingSet},a \Robject{flowSet} or \Robject{ncdfFlowSet} is returned. <>= getData(G,node); @ Or we can retrieve the indices specifying if an event is included inside or outside a gate using: <>= table(getIndices(gh,node)) @ The indices returned are relative to the parent population (member of parent AND member of current gate), so they reflect the true hierarchical gating structure. If we wish to do compensation or transformation manually, we can retrieve all the compensation matrices from the workspace: <>= C<-getCompensationMatrices(gh); C @ Or we can retrieve transformations: <>= T<-getTransformations(gh) names(T) T[[1]] @ <>= A<-names(T) B<-names(T[[1]]) @ \Rfunction{getTransformations} returns a list of functions to be applied to different dimensions of the data. Above, the transformation is applied to this sample, the appropriate dimension is transformed using a channel--specific function from the list. The list of samples in a workspace can be accessed by: <>= getSamples(ws); @ And the groups can be accessed by: <>= getSampleGroups(ws) @ The \texttt{compID} column tells you which compensation matrix to apply to a group of files, and similarly, based on the name of the compensation matrix, which transformations to apply. %\subsection{Converting to flowCore Objects} %You may want to convert the imported workspace into \Robject{flowCore} objects, such as workflows. We provide this functionality via the \Rfunction{flowWorkspace2flowCore} function. % %\Rfunction{flowWorkspace2flowCore} extracts the compensation matrices,transformation functions and all the gates from GatingHierarchies generated by flowWorkspace package and converts them to the respective views and actionItems of workFlows defined by flowCore package. % It takes a gatingHierarchy, flowJoWorkspace or GatingSet as the input, and returns one or multiple workflows as the result, depending on whether the gating hierarchies for each sample (including gate coordinates) are identical. % % %<>= %wfs<-flowWorkspace2flowCore(G,path=ws@path); %wfs % %@ % %\Rfunction{plotWf} plots the workflow tree % %<>= %plotWf(wfs[[1]]) %@ % %Finally, when we are finished with the workspace, we close it: % %<>= %closeWorkspace(ws); %ws %@ \subsection{add/remove gates} \Robject{GatingSet} is equivalent to the \Robject{workFlow} in flowCore package ,which provides methods to build a gating tree from raw FCS files and add or remove flowCore gates(or populations) to or from it. Firstly,we start from a flowSet that contains three ungated flow samples: <>= data(GvHD) #select raw flow data fs<-GvHD[1:3] @ Then we can transform the raw data: <>= tf <- transformList(colnames(fs[[1]])[3:6], asinh, transformationId="asinh") fs_trans<-transform(fs,tf) @ and create a gatingset from the transformed flowSet: <>= gs <- GatingSet(fs_trans) gs gh1<-gs[[1]] getNodes(gh1) @ It now only contains the root node. We can add one rectangleGate: <>= rg <- rectangleGate("FSC-H"=c(200,400), "SSC-H"=c(250, 400), filterId="rectangle") nodeID<-add(gs, rg) nodeID getNodes(gh1) @ Note that the gate is added to root node by default if parent is not specified. Then we add a quadGate to the new population generated by the rectangeGate which is named after filterId of the gate because the name is not specified when \Robject{add} method is called. <>= qg <- quadGate("FL1-H"=2, "FL2-H"=4) nodeIDs<-add(gs,qg,parent="rectangle") nodeIDs getNodes(gh1) @ Here quadGate produces four population nodes/populations whose names are named after dimensions of gate if not specified. Boolean Gate can also be defined and added to GatingSet: <>= bg<-booleanFilter(`CD15 FITC-CD45 PE+|CD15 FITC+CD45 PE-`) bg nodeID2<-add(gs,bg,parent="rectangle") nodeID2 getNodes(gh1) @ Now the gating tree is finished but the actual gating is done by \Robject{recompute} method: <>= recompute(gs) @ After gating is finished,gating results can be visualized by \Robject{plotGate} method: <>= plotGate(gh1,"rectangle") #plot one Gate @ The second argument of \Robject{plotGate} is used to specify node index/name. Multiple gates can be plotted on the same pannel: <>= plotGate(gh1,nodeIDs) @ We may also want to plot all the gates without specifying the gate index: <>= plotGate(gh1) @ Boolean gate is skipped by default and can be enabled by: <>= plotGate(gh1,bool=TRUE) @ Note that smoothing may be applied automatically if there are not enough events after gating Sometime it is more useful to compare gates across samples using lattice plot by applying \Robject{plotGate} to a \Robject{GatingSet}: <>= plotGate(gs,getNodes(gs)[nodeID]) @ The gating hierarchy is plotted by: <>= plot(gh1,bool=TRUE) @ If we want to remove one node, simply: <>= Rm('rectangle', gs) getNodes(gh1) @ As we see,removing one node causes all its descendants to be removed as well. \subsection{archive and clone} Oftentime, we need to save a GatingSet including the gated flow data,gates and populations to disk and reload it later on. It can be achieved by: <>= save_gs(gs,file = "~/gs") G1 <- load_gs(file = "~/gs") @ We also provide the \Robject{clone} method to make a full copy of an existing \Robject{GatingSet}: <>= gs_cloned <- clone(gs) @ Note that the \Robject{GatingSet} contains environment slots and external pointer that point to the internal C data structure. So make sure to use these methods in order to save or make a copy of existing object. The regular R assignment (<-) or \Robject{save} routine doesn't work as expected for the \Robject{GatingSet} object. \subsection{Exporting to FlowJo OSX 9.2} The \Rfunction{exportAsFlowJoXML} function can be used to export a \Robject{flowCore::workFlow} as an XML workspace for FlowJo 9.2 OSX. If flowWorkspace has been used to import an existing FlowJo workspace, \Rfunction{flowWorkspace2flowCore} can be used to obtain a \Robject{workFlow} for exporting. Currently this function can export one workFlow at a time. %\subsection{Parallel Support} %Parsing and gating can be time--consuming. This latest version (>1.0.0) of flowWorkspace supports parallelization via multicore, snowfall, and Rmpi. If multicore is loaded, or a snowfall cluster is initialized, flowWorkspace will use snowfall or multicore (in that order of preference) to parse the workspace. Parallel gating of the workspace can be performed by loading Rmpi and running parseWorkspace(). This corresponds to the execute() step of the parseWorkspace function. Rmpi is needed to handle concurrent reads/writes to the ncdfFlowSet file. Parallel gating / parsing will work with netCDF-backed data or if data is stored in RAM. \subsection{Deprecated Functionality} The following behaviour is no longer supported and has been replace by more extensive netCDF support via the ncdfFlow package. If you have particularly large data files (millions of events), then you won't want to keep the data around, nor the indices for gate membership. Instead, pass the options \texttt{cleanup=TRUE, keep.indices=FALSE} to the \Rfunction{execute()} function, and the data will be scrubbed after computing population statistics. With future improvements making use of the netCDF framework, and bitvector representations of population memberships; this will improve memory usage in high--throughput unsupervised analysis settings. \section{Troubleshooting} If this package is throwing errors when parsing your workspace, and you are certain your workspace is version 2.0, contact the package author. If you can send your workspace by email, we can test, debug, and fix the package so that it works for you. Our goal is to provide a tool that works, and that people find useful. \section{Future Improvements} We are working on support for flowJo XML workspaces exported from the Windows version of flowJo. Efforts are underway to integrate GatingSet and GatingHierarchy objects more closely with the rest of the flow infrastructure. \end{document}