<!--
%\VignetteEngine{knitr}
%\VignetteIndexEntry{An Introduction to the openCyto package}
-->

```{r requirements, echo=FALSE}
if (!require(flowWorkspaceData)) {
  stop("Cannot build the vignettes without 'flowWorkspaceData'")
}
```


An Introduction to the **openCyto** package
=======================================


```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(out.extra='style="display:block; margin: auto"', fig.align="center", message = FALSE, warning = FALSE)
```


1. Introduction
------------


The **openCyto** package is designed to apply automated gating methods sequentially, mimicking the manual gating scheme.


### 1.1. Manual gating 

Traditionally, scientists have to draw the gates for each individual sample on each 2D projection (2 channels) within `flowJo`,
or draw the template gates on one sample, replicate them to the other samples, and then manually inspect the gates on each sample
and correct them where necessary. Either way is time consuming and subjective, and thus not suitable for the large data sets
generated by high-throughput flow cytometers or for cross-lab data analysis.

Here is one `xml` workspace (manual gating scheme) exported from `flowJo`.
```{r load-flowWorkspace, echo=F}
library(flowWorkspace)
```
```{r load-xml, eval=TRUE}
flowDataPath <- system.file("extdata", package = "flowWorkspaceData")
wsfile <- list.files(flowDataPath, pattern = "manual.xml", full.names = TRUE)
wsfile
```
Using the `flowWorkspace` package, we can load it into R,
```{r openWorkspace, eval=F}
ws <- openWorkspace(wsfile)
```
apply (`parseWorkspace`) the `manual gates` defined in the `xml` to the raw `FCS` files,
```{r parseWorkspace, eval=F}
gs <- parseWorkspace(ws, name= "T-cell", subset =1, isNcdf = TRUE)
```
```{r load_gs_manual, echo = FALSE}
gs <- load_gs(file.path(flowDataPath,"gs_manual"))
```

and then visualize the `Gating Hierarchy`
```{r plot-manual-GatingHierarchy}
gh <- gs[[1]]
plot(gh)
```
and the `gates`:
```{r plot-manual-gates, fig.width = 9}
plotGate(gh)
```   
This is a gating scheme for a `T cell` panel, which tries to identify `T cell` sub-populations.
We can achieve the same results by using the automated gating pipeline provided by this package.

### 1.2. Automated Gating
`flowCore`, `flowStats`, `flowClust` and other packages provide many different gating methods to
detect cell populations and draw the gates automatically.

The `flowWorkspace` package provides the `GatingSet`, an efficient data structure to store, query and visualize hierarchically gated data.

By taking advantage of these tools, the `openCyto` package builds an automated gating pipeline from a `gating template`, which is essentially the same kind of hierarchical gating scheme
used by biologists and scientists.
  

2. Create gating templates
-----------------------------


### 2.1. Template format
First of all, we need to describe the gating hierarchy in a spreadsheet (a plain text format).
This spreadsheet must have the following columns:
* `alias`: a name used to label the cell population. The path composed of the alias and its preceding nodes (e.g. /root/A/B/alias) has to be uniquely identifiable.
* `pop`: a population pattern of the form `A+/-` or `A+/-B+/-`, which tells the algorithm which side (positive or negative) of a 1d gate or which quadrant of a 2d gate is to be kept.
  When it is in the form of `A+/-B+/-`, `A` and `B` should be the full name (or a substring, as long as it is uniquely matched) of either a channel or a marker of the flow data.
* `parent`: the parent population alias. Its path has to be uniquely identifiable.
* `dims`: comma-separated characters specifying the dimensions (1d or 2d) used for gating. Each can be either a channel name or a stained marker name.
* `gating_method`: the name of the gating function (e.g. `flowClust`). It is invoked by a wrapper function that has the identical function name prefixed with a dot (e.g. `.flowClust`).
* `gating_args`: the named arguments passed to the gating function.
* `collapseDataForGating`: when TRUE, data is collapsed (within groups if `groupBy` is specified) before gating and the gate is replicated across the collapsed samples.
  When set to FALSE (or left blank), the `groupBy` argument is only used by `preprocessing` and ignored by gating.
* `groupBy`: if given, samples are split into groups by the unique combinations of study variables (i.e. column names of pData, e.g. "PTID:VISITNO").
  When it is numeric (N), samples are grouped every N samples.
* `preprocessing_method`: the name of the preprocessing function (e.g. `prior_flowClust`). It is invoked by a wrapper function that has the identical function name prefixed with a dot (e.g. `.prior_flowClust`).
  The preprocessing results are then passed to the gating wrapper function through the `pps_res` argument.
* `preprocessing_args`: the named arguments passed to the preprocessing function.
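
As a rough illustration of this format, the sketch below builds a single template row in base R and writes it out as CSV. The column names come from the list above; the row values and the temporary file are illustrative assumptions, not a copy of the template shipped with the package.

```r
# Column names as listed above.
template_cols <- c("alias", "pop", "parent", "dims", "gating_method",
                   "gating_args", "collapseDataForGating", "groupBy",
                   "preprocessing_method", "preprocessing_args")

# One illustrative row: a 1d gate on FSC-A directly under the root node.
# Columns that are not needed for this gate are left blank.
row1 <- data.frame(
  alias = "nonDebris", pop = "nonDebris+", parent = "root", dims = "FSC-A",
  gating_method = "mindensity", gating_args = "",
  collapseDataForGating = "", groupBy = "",
  preprocessing_method = "", preprocessing_args = "",
  stringsAsFactors = FALSE
)

# Check that no required column is missing or misspelled.
stopifnot(identical(names(row1), template_cols))

# Write the row to a temporary CSV file (hypothetical location).
csv_path <- tempfile(fileext = ".csv")
write.csv(row1, csv_path, row.names = FALSE)
readLines(csv_path)
```

Further rows would be appended to the same data frame before writing, one per population.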

### 2.2. Example template
Here is an example of a gating template.
```{r gatingTemplate, eval = T}
library(openCyto)
library(data.table)
gtFile <- system.file("extdata/gating_template/tcell.csv", package = "openCyto")
dtTemplate <- fread(gtFile, autostart = 1L)
dtTemplate
```
Each row usually corresponds to one cell population and the gating method that is used to get that population.
We will explain, row by row, how to create this gating template based on the manual gating scheme.

#### 2.2.1. "nonDebris"
```{r gatingTemplate-nonDebris, eval = T}
dtTemplate[1,]
```  
* The population name is `"nonDebris"` (specified in the `alias` field).
* The `parent` node is `root` (which is always the first node of the `gating hierarchy` by default).
* We use `mindensity` (one of the `gating` functions provided by the `openCyto` package) as `gating_method` to gate on the dimension (`dims`) `FSC-A`.
* As a result, it will generate a 1d gate on `FSC-A`. `"nonDebris"` (equivalent to `"nonDebris+"`) in the `pop` field indicates that the
`positive` side of the 1d gate is kept as the population of interest.
* There is no `grouping` or `preprocessing` involved in this gate, so the other columns are left `blank`.

#### 2.2.2. "singlets" 
```{r gatingTemplate-singlets, eval = T}
dtTemplate[2,]
```
* The population name is `"singlets"` (`alias` field).
* The `parent` node is `nonDebris`.
* `gating_method` is `singletGate` (a function from the `flowStats` package).
* As a result, a `polygonGate` will be generated on `FSC-A` and `FSC-H` (specified by `dims`) for each sample.
* Again, `"singlets"` in the `pop` field stands for `"singlets+"`. But here it is a 2d gate, which means we want to keep the area
inside the polygon.

#### 2.2.3. "lymphocyte" 
```{r gatingTemplate-lympth, eval = T}
dtTemplate[3,]
```
* Similarly, `alias` specifies the name of the population.
* `parent` points to `singlets`.
* Since we are going to use `flowClust` as `gating_method` to do the 2-dimensional gating,
	`dims` is a comma-separated string: the `x` axis (`FSC-A`) goes first, the `y` axis (`SSC-A`) second.
	This order doesn't affect the gating process but determines how the gates are displayed.
* All the parameters that the `flowClust` algorithm accepts can be put in `gating_args` as if they were typed in the `R console`.
	See `help(flowClust)` for more details on these arguments.
* The `flowClust` algorithm accepts the extra argument `priors`, which is calculated during the `preprocessing` stage (before the actual `gating`).
	Thus, we set `preprocessing_method` to `prior_flowClust`.
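
To give some intuition for what "as if they were typed in the R console" means, here is a minimal sketch that turns such an argument string into a named list. The helper `parse_gating_args` is hypothetical and the argument values are made up for illustration; this is not necessarily how `openCyto` parses the field internally.

```r
# Hypothetical helper: evaluate a comma-separated "name = value" string
# as if it were the argument list of a function call.
parse_gating_args <- function(arg_string) {
  if (!nzchar(arg_string)) {
    return(list())  # a blank cell means no extra arguments
  }
  eval(parse(text = paste0("list(", arg_string, ")")))
}

# Illustrative values only.
args <- parse_gating_args("K = 2, target = c(1e5, 5e4)")
str(args)
```

Because the string is evaluated as R code, vectors such as `c(1e5, 5e4)` can be written in the cell directly.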

#### 2.2.4. "cd3+" (Tcell) 

```{r gatingTemplate-cd3, eval = T}
dtTemplate[4,]
```
It is similar to the `nonDebris` gate except that we specify `collapseDataForGating` as `TRUE`,
which tells the pipeline to `collapse` all samples into one and apply `mindensity` to the collapsed data on the `CD3` dimension.
Once the gate is generated, it is replicated across all samples. This is only useful when an individual sample does not have
enough events to deduce the gate. Here we do this only as a proof of concept.
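
The idea of collapsing can be sketched in a few lines of base R. The helper `gate_collapsed` and the toy data are hypothetical, and `median` is only a stand-in for a real gating function such as `mindensity`; the point is that one cut point is derived from the pooled events and then shared by every sample.

```r
# Toy 1d measurements for three samples (made-up numbers).
samples <- list(s1 = c(1, 2, 9), s2 = c(1.5, 8, 10), s3 = c(0.5, 8.5))

# Hypothetical helper: pool all events, derive one cut point,
# and replicate that cut point for every sample.
gate_collapsed <- function(samples, gate_fun) {
  pooled <- unlist(samples, use.names = FALSE)
  cut_point <- gate_fun(pooled)
  setNames(rep(list(cut_point), length(samples)), names(samples))
}

# `median` is a placeholder, not an actual gating method.
gates <- gate_collapsed(samples, median)
gates
```

Without collapsing, `gate_fun` would instead be applied to each element of `samples` separately, giving one cut point per sample.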

#### 2.2.5. CD4 and CD8
The fifth row specifies `pop` as `cd4+/-cd8+/-`, which will be expanded into 6 rows.
```{r gatingTemplate-cd4cd8, eval = T}
dtTemplate[5,]
```
```{r gatingTemplate-expand, echo = F, results = F}
expanded <- openCyto:::.preprocess_csv(dtTemplate)
rownames(expanded) <- NULL
```

The first two rows are two 1d gates that will be generated by `gating_method` on each
dimension (`cd4` and `cd8`) independently:
```{r gatingTemplate-expand1, echo = F}
expanded[5:6,]
```

The other 4 rows are 4 `rectangleGate`s that correspond to the 4 `quadrants` in the 2d projection (`cd4 vs cd8`).
```{r gatingTemplate-expand2, echo = F}
expanded[7:10,]
```
As we see here, `"refGate"` in `gating_method` indicates that they are constructed based on the
`gate coordinates` of the previous two 1d gates.
Those 1d gates are thus considered `"reference gates"`, which are referred to by a colon-separated `alias` string in `gating_args`: `"cd4+:cd8+"`.
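
The naming logic of this expansion can be mimicked with a short base R sketch. The function `expand_quadrants` is hypothetical and only reproduces the labels described above (two 1d reference gates, four quadrant populations, and the colon-separated reference string); it is not the package's internal implementation, and the order of the quadrants it returns is arbitrary.

```r
# Hypothetical helper: derive the labels implied by a pattern like "cd4+/-cd8+/-".
expand_quadrants <- function(pop) {
  # Split on the literal "+/-" to recover the two marker names.
  markers <- strsplit(pop, "+/-", fixed = TRUE)[[1]]
  stopifnot(length(markers) == 2)

  # The two 1d reference gates, one per marker.
  ref_gates <- paste0(markers, "+")

  # All four sign combinations give the quadrant populations.
  signs <- expand.grid(x = c("+", "-"), y = c("+", "-"),
                       stringsAsFactors = FALSE)
  quadrants <- paste0(markers[1], signs$x, markers[2], signs$y)

  list(ref_gates = ref_gates,
       quadrants = quadrants,
       ref_arg = paste(ref_gates, collapse = ":"))
}

res <- expand_quadrants("cd4+/-cd8+/-")
res
```

Together the two reference gates and the four quadrants account for the 6 rows mentioned above.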

Alternatively, we can expand it into these 6 rows explicitly in the spreadsheet.
But the compact representation is recommended unless the user wants finer control over how the gating is done.
For instance, sometimes we need to use different `gating_method`s to generate the 1d gates on `cd4` and `cd8`,
or the `cd8` gating needs to depend on the `cd4` gating, i.e. the `parent` of `cd8+` is `cd4+` (or `cd4-`) instead of `cd3`.
Sometimes we want a customized `alias` other than the quadrant-like name (`x+y+`) that gets generated automatically
(e.g. 5th row of the gating template).

3. Load gating template
----------------------------- 
After the gating template is defined in the spreadsheet, it can be loaded into R:
```{r load-gt, eval = T}
gt_tcell <- gatingTemplate(gtFile, autostart = 1L)
gt_tcell
```
Besides looking at the spreadsheet, we can examine the gating scheme by visualizing it:
```{r plot-gt, eval = T}
plot(gt_tcell)
```
As we can see, the gating scheme has been expanded as we described above.
All the **colored** arrows originate from the `parent` population and the **grey** arrows originate from the `reference` population (or gate).

4. Run the gating pipeline
-----------------------------
Once we are satisfied with the gating template, we can apply it to the actual flow data.

### 4.1. Load the raw data
First of all, we load the raw FCS files into R with `ncdfFlow::read.ncdfFlowSet` (it uses less memory than `flowCore::read.flowSet`).
```{r load-fcs}
fcsFiles <- list.files(flowDataPath, pattern = "CytoTrol", full.names = TRUE)
ncfs  <- read.ncdfFlowSet(fcsFiles)
ncfs
```
### 4.2. Compensation
Next, we compensate the data. If we have compensation controls (i.e. single stained samples), we can calculate the
compensation matrix with the `flowCore::spillover` function.
Here we simply use the compensation matrix defined in the `flowJo` workspace.
```{r compensate}
compMat <- getCompensationMatrices(gh)
ncfs_comp <- compensate(ncfs, compMat)
```

Here is one example showing the compensation outcome:
```{r compensate_plot, echo = F, fig.width = 5, fig.height = 5}
sub_chnl <- c("V545-A","V450-A")
fr <- ncfs[[1]][,sub_chnl]
fr_comp <- ncfs_comp[[1]][,sub_chnl]
fs <- as(list(fr = fr, fr_comp = fr_comp), "flowSet")
#transform data to better visualize the compensation effect
fs <- transform(fs, estimateLogicle(fr,sub_chnl))
xyplot(`V545-A`~`V450-A`, fs, smooth = FALSE, xbin =64)
```
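
For intuition about what compensation does numerically, here is a small base R sketch. It assumes the usual linear spillover model, in which the observed signal is the true signal multiplied by a spillover matrix, so that compensation amounts to multiplying by the inverse of that matrix. The two-channel matrix and the signal values are made up for illustration and are unrelated to the `CytoTrol` data.

```r
# Made-up spillover matrix: rows are fluorochromes, columns are detectors.
# 20% of dye A leaks into detector B, 10% of dye B leaks into detector A.
spill <- matrix(c(1,   0.2,
                  0.1, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("A", "B"), c("A", "B")))

# Made-up true signals for three events (columns: dye A, dye B).
true_signal <- matrix(c(100,  0,
                          0, 50,
                         80, 40),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(NULL, c("A", "B")))

# What the detectors record under the linear spillover model.
observed <- true_signal %*% spill

# Compensation: undo the mixing with the inverse of the spillover matrix.
recovered <- observed %*% solve(spill)

all.equal(unname(recovered), unname(true_signal))
```

In this toy model the recovered values match the true signals up to floating point error; real data also contain noise, which compensation does not remove.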

### 4.3. Transformation
All the stained channels need to be transformed properly before gating.
Here we use `flowCore::estimateLogicle` to do the `logicle` transformation.
```{r transformation, eval = T}
chnls <- parameters(compMat)
transFuncts <- estimateLogicle(ncfs[[1]], channels = chnls)
ncfs_trans <- transform(ncfs_comp, transFuncts)
```
Here is one example showing the transformation outcome:
```{r transformation_plot, echo = F, fig.width = 5, fig.height = 5}
fr <- ncfs_comp[[1]][,sub_chnl[1]]
fr_trans <- ncfs_trans[[1]][,sub_chnl[1]]
fs <- as(list(fr = fr, fr_trans = fr_trans), "flowSet")
densityplot(~`V545-A`, fs, stack = FALSE,scales = list(x=list(relation = "free")))
```
### 4.4. Create 'GatingSet'
Once the data is preprocessed, it can be loaded into a `GatingSet` object.
```{r GatingSet}
gs <- GatingSet(ncfs_trans)
getNodes(gs[[1]])
```
As `getNodes` shows, there is only one population node `root` at this point.

### 4.5. Gating
Now we can apply the gating template to the data:
```{r gating, eval = TRUE}
gating(gt_tcell, gs)
```
Optionally, we can run the pipeline in `parallel` to speed up gating, e.g.:
```{r gating_par, eval = FALSE}
gating(gt_tcell, gs, mc.cores=2, parallel_type = "multicore")
```

### 4.6. Hide nodes
After gating, there are some extra populations generated automatically by the pipeline (e.g. `refGate`).
```{r plot_afterGating}
plot(gs[[1]])
```
We can hide these populations if we are not interested in them: 
```{r hideGate, results = "hide"}
nodesToHide <- c("cd8+", "cd4+"
				, "cd4-cd8-", "cd4+cd8+"
				, "cd4+cd8-/HLA+", "cd4+cd8-/CD38+"
				, "cd4-cd8+/HLA+", "cd4-cd8+/CD38+"
				, "CD45_neg/CCR7_gate", "cd4+cd8-/CD45_neg"
				, "cd4-cd8+/CCR7+", "cd4-cd8+/CD45RA+"
				)
lapply(nodesToHide, function(thisNode) setNode(gs, thisNode, FALSE))
```
### 4.7. Rename nodes
We can also rename the populations:
```{r rename, results = "hide"}
setNode(gs, "cd4+cd8-", "cd4")
setNode(gs, "cd4-cd8+", "cd8")
```

### 4.8. Visualization
```{r plot_afterHiding}
plot(gs[[1]])
```

```{r plotGate_autoGate, fig.width = 9}
plotGate(gs[[1]])
```

5. Conclusion
------------
The `openCyto` package allows users to specify their gating schemes and gate the data
in a data-driven fashion. It frees scientists from labor-intensive manual gating routines
and increases the speed as well as the reproducibility and objectivity of the data analysis work.