---
title: "flowPloidy: Overview"
author: "Tyler Smith"
date: "`r Sys.Date()`"
output: 
    rmarkdown::html_vignette:
        fig_width: 6
        fig_height: 4
    pdf_document:
        fig_width: 6
        fig_height: 4
        fig_caption: false
        pandoc_args: ["--variable=fontfamily:tgschola"]
vignette: >
  %\VignetteIndexEntry{flowPloidy: Overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: flowPloidy.bibtex
---

# Installation

`flowPloidy` is part of the [BioConductor](bioconductor.org) respository.
To install it, you must first have the Bioconductor system installed. To
do, from within R, call:

```{r, eval=FALSE}
source("https://bioconductor.org/biocLite.R")
biocLite()
```

Then install `flowPloidy`:

```{r, eval=FALSE}
biocLite("flowPloidy")
```

The examples in this vignette, and the `flowPloidy` help files, use data in
the `flowPloidyData` package, which is also on BioConductor. (You don't
need `flowPloidyData` to use `flowPloidy` on your own data, only for the
examples).

```{r, eval=FALSE}
biocLite("flowPloidyData")
```

## Development Version 

You may also install the developmental version of `flowPloidy` directly
from its [Github repository](https://github.com/plantarum/flowPloidy). To
do so, you will need the `devtools` package:

```{r, eval = FALSE}
install.packages("devtools", dependencies = TRUE)
library("devtools")
install_github("plantarum/flowPloidy")
```

Note that the developmental version of `flowPloidy` is also available
directly from BioConductor, *if* you are using the development branch.
Instructions on how to do so are available on the [BioConductor 
    website](http://bioconductor.org/developers/how-to/useDevel/).

# Preliminaries

## Sample Data Files

Once our packages are installed, we only need to load them into memory to
use them: 

```{r}
library(flowPloidy)
library(flowPloidyData)
```

The `flowPloidyData` package provides a single variable, `flowPloidyFiles`.
This is a vector of file names that points to the sample FCS files in your
R installation that we will use in this vignette. The actual path to the
files will differ slightly on different computers, and on different
operating systems; R takes care of those details for us.

## Data Channel

You will need to know which channel to use for your analysis. This will
most likely remain the same for all analyses from the same flow cytometer.
We can use the `viewFlowChannels` function to see the options:

```{r, output = "hold"}
viewFlowChannels(flowPloidyFiles[1])
```

Hopefully you see something there that corresponds to the channel you're
interested in. In our case, it's "`FL3.INT.LIN`". 

# Simple Analysis

With all the necessary packages installed, and the name of the channel of
interest identified, we are ready to proceed with a `flowPloidy` analysis.

```{r}
fh1 <- FlowHist(file = flowPloidyFiles[1], channel = "FL3.INT.LIN",
                analyze = FALSE)
```

(Note that we used the `analyze = FALSE` option, to delay analysis until we
have inspected the data, for the purposes of this tutorial. This is not
necessary, and if you omit the `analyze` argument analysis will proceed
automatically.)

Now that we have our data loaded, `plot` will display it for us:

```{r simplePlot, fig.cap = "FCS Histogram"}
plot(fh1)
```

During the loading process, `flowPloidy` attempts to locate cell
populations and identify initial values for model fitting. You can see this
by setting the `init = TRUE` argument in the `plot` function:

```{r initialValues, fig.cap = "Histogram with Initial Values"}
plot(fh1, init = TRUE)
```

A number of features are displayed on the histogram:

Sample A G1 estimate

: a thick vertical blue line with a label at the top indicates the
estimated position of the first sample ('A') G1 peak. A blue circle
indicates the actual position on the histogram used as the initial value.

Sample A G2 estimate

: a thin vertical blue line without a label indicates the estimated
position of the G2 peak corresponding to the A G1 peak.

Sample B G1 and G2 estimates

: as for Peak A, but orange

Initial model estimate

: the thin, dashed grey line indicates the initial parameters used for the
model-fitting.

The initial estimate is rough, but in this case it's close enough for the
model-fitting routines to find a solution, so we can immediately proceed to
the complete analysis.

```{r}
fh1 <- fhAnalyze(fh1)
```

Note that I have assigned the output of `fhAnalyze()` to the original
`flowHist` object `fh1`. The output of the analysis is stored in a new slot
in the `fh1` object -- all the original data remains unaltered in that
object.

Now that `fh1` contains a fitted model, the results will be displayed when
we call `plot` again:

```{r simpleAnalysis, fig.cap = "Histogram with Fitted Model"}
plot(fh1)
```

The plot has been updated to include the fitted model components in
different colours:

- green: debris model (by default the Single-Cut model, but see below)
- blue: sample A G1 and G2 peaks
- orange: sample B G1 and G2 peaks
- magenta: S-phase of sample A
- turqoise: S-phase of sample B
- purple: aggregates
- red: the complete model

The aggregate and S-phase components may be quite small, and therefore
difficult to see. 

In the upper right corner, we see a brief summary of the numerical results.
For Sample A and Sample B, peak statistics are reported as `position / cell
counts / coefficient of variation`. We can see that the coarse initial
estimates were indeed close enough to give us a reasonable model fitting.
The numerical results are now available by printing the `FlowHist` object:

```{r, output = "hold"}
fh1
```

Here we see the ratio between the peaks with the standard error of the
ratio, and a table of peak data: nuclei counts (i.e., the number of nuclei
modelled in the G1 peaks), peak size, the coefficient of variation for the
peak, and the Reduced Chi-Square value (RCS) recommended by @bagwell_1993
as an objective assessment of model fit (but see @rabinovitch_1994 for a
contrasting view).

# Manually Selecting Starting Values

Sometimes, particularly with noisy histograms, `flowPloidy` will not
produce useful starting values:

```{r noisyData, fig.cap = "Histogram with Bad Initial Values"}
fh2 <- FlowHist(file = flowPloidyFiles["734.LMD"], channel = "FL3.INT.LIN")
plot(fh2, init = TRUE)
```

Here we can see that high spike around bin 50 was ignored, resulting in
both the A and B peaks being placed together near bin 136. We can manually
set the appropriate starting values with `pickInit`: 

```{r, eval = FALSE}
fh2 <- pickInit(fh2)
```

`flowHist` will prompt you to click on the peaks of the first and second
peak manually, adding a circle to the plot at each point.

```{r fakingInteraction, echo = FALSE, fig.cap = "Histogram with Manually Selected Peaks"}
## This stuff is automatically done by the interactive function pickInit,
## which we can't actually use in the automated vignette.
## Using internal functions here, not intended for use by users

fh2 <- flowPloidy:::resetFlowHist(fh2)
flowPloidy:::fhPeaks(fh2) <-
  structure(c(50.2623811329645, 139.061074258603, 258.242164238835,
              178.771283080744), .Dim = c(2L, 2L),
                          .Dimnames = list(NULL, c("mean", "height")))

fh2 <- flowPloidy:::addComponents(fh2)
fh2 <- flowPloidy:::makeModel(fh2)
fh2 <- flowPloidy:::getInit(fh2)

plot(fh2)
points(flowPloidy:::fhInit(fh2)$Ma, 255, col = 2, cex = 3)
points(flowPloidy:::fhInit(fh2)$Mb, 175, col = 3, cex = 3)
```

The new peak values are used to recalculate the initial values:

```{r initPick, fig.cap = "Histogram with Corrected Initial Values"}
plot(fh2, init = TRUE)
```

Now we are ready for the analysis:

```{r correctedAnalysis, fig.cap = "Histogram with Fitted Model"}
fh2 <- fhAnalyze(fh2)
plot(fh2)
fh2
```

# Model Options

## Debris Models
By default, the histogram analysis uses the Single-Cut Debris model,
[described in @bagwell_mayo_ea_1991]. This works well for most of our data.
However, @bagwell_1993 recommends the Multi-Cut model for fresh material,
and in some cases it provides better RCS values (although peak parameters
rarely change significantly).

There are a few ways to switch to the Multi-Cut model. You can update a
`FlowHist` object with the function `updateFlowHist`:

```{r multi-cut model}
fh2MC <- updateFlowHist(fh2, debris = "MC")
plot(fh2MC)
```

In this case, the Multi-Cut model produced a RCS value nearly twice that of
the Single-Cut model. Note, however, that the peak parameters hardly
changed, and the A/B ratio is identical.

You may also choose which debris model to use as an argument to `FlowHist`
(and the related `batchFlowHist` discussed below):

```{r FlowHist multi-cut, eval = FALSE}
fh2MC <- FlowHist(file = flowPloidyFiles["734.LMD"],
                  channel = "FL3.INT.LIN", debris = "MC")
```

A third option for changing the debris model is to use the interactive
plotting function `browseFlowHist`, described below.

## Linearity
By default, linearity (the ratio between the G1 and G2 peaks of the same
sample) is fit as a parameter. Ideally, linearity should be 2, but due to
issues with the running of the flow cytometer, it frequently varies between
ca 1.95 and 2.05. If you want to force linearity to be fixed at 2, you can
use the `linearity` argument to `updateFlowHist` or `FlowHist`, as we did
for `debris` above. In this case, possible values for `linearity` are
`fixed` (fixed at 2), or `variable` (fit as a parameter). When `linearity`
is `fixed`, it is reported as `NA` in plot and summary functions. You may
also set/change `linearity` with the `browseFlowHist` GUI described below.

# Exporting Results

To summarize the results in a data table, use the `tabulateFlowHist`
command: 

```{r}
tabulateFlowHist(fh1)
```

To combine multiple objects in a single summary, combine them as a list:

```{r}
tabulateFlowHist(list(fh1, fh2))
```

As a convenience, if you pass a file name to the `file` argument of
`exportFlowHist`, the table will be saved to that file:

```{r, eval = FALSE}
tabulateFlowHist(list(fh1, fh2), file = "flow-results.csv")
```

# Processing Directories with `batchFlowHist`

## Generating a file list in R

If you have a large number of files to process, `flowHist` can read them
all in together. The `batchFlowHist` function provides this feature. To use
it, you need a list of file names to process. The R function `list.files()`
is helpful here. If your FCS files are in the directory "`datadir/`", you
can collect all their names using:

```{r, eval = FALSE}
my.files <- list.files("datadir/", full.names = TRUE)
```

If you have a mix of FCS files and other files together, you can supply a
pattern to match only your data:

```{r, eval = FALSE}
my.files <- list.files("datadir/", pattern = ".LMD", full.names = TRUE)
```

If you want to combine directories, use `c()`:

```{r, eval = FALSE}
my.files <- list.files("datadir1/", full.names = TRUE)
my.files <- c(my.files, list.files("datadir2/", full.names = TRUE))
```

For this example, I'll continue to use the sample data included with
`flowPloidyData`. Here, we'll use all of them (i.e., `flowPloidyFiles`),
not just one (e.g. `flowPloidyFiles[1]`).

## Using `batchFlowHist`

The `batchFlowHist()` function has two required arguments:

files

: a vector of file names (character), such as we generated in the previous
section.

channel

: a string (character) indicating the data channel to read, which we
discovered using the function `viewFlowChannels()` above.

The following optional arguments may also be used:

bins

: the number of bins to group the data into (default: 256)

verbose

: boolean, if `TRUE` (the default) report the filenames as they are
processed. Useful for debugging, or tracking progress on long-running jobs

window

: numeric, the size of the moving window used to identify local peaks
(default: 20)

smooth

: numeric, the size of the moving window used to reduce noise in the
histogram during peak detection (default: 20)

debris

: character, either "SC" or "MC" to use the Single-Cut or Multi-Cut debris
model, as described above. (default: "SC")

linearity

: character, either "fixed" or "variable" to set the linearity parameter to
2, or allow it to be fit as a model parameter, as described above.
(default: "variable")

If you have clean, narrow peaks, you may want to lower the values of
`window` and `smooth`, say to 16 and 8 respectively. Don't worry about
finding perfect values, we provide a GUI interface (below) to allow for
quick and painless correction of peak detection.

```{r histBatch}
batch1 <- batchFlowHist(flowPloidyFiles, channel = "FL3.INT.LIN")
```

(You will see some **Error** messages in the output. That's ok, we'll fix
the problems shortly!)

You can export the results with `tabulateFlowHist()`:

```{r}
tabulateFlowHist(batch1)
```

You can scroll through the resulting plots with the following code:

```{r, eval = FALSE}
parOld <- par(ask = TRUE)
lapply(batch1, FUN = plot)
## press enter to scroll through your files!
par(parOld)
```

However, that's a bit tedious if you need to tweak the parameters for any
of the individual `FlowHist` datasets in the list. A more covenient
approach is provided by the `browseFlowHist` function, which also allows us
to address the histograms for which our analysis failed. This is described
in the next section.

# `browseFlowHist`

`browseFlowHist` uses the [Shiny](shiny.rstudio.com) framework to provide
an interface for viewing histograms. The main purpose of this interface is
to allow users a simple, quick way to confirm that the fitted models are
sensible, and to correct the most common problem: misidentified peaks that
lead to bad starting parameters.

To use `browseFlowHist`, pass it the list of `FlowHist` objects we created
with `batchFlowHist` above:

```{r browseFlowHist, eval = FALSE}
batch1 <- browseFlowHist(batch1)
```

**IMPORTANT** remember to assign `browseFlowHist(batch1)` to a variable. If
you don't, all of the corrections you make will be discarded when you close
the app. If you want to keep the original values, assign it to a new
variable, i.e., `batch1_corrected <- browseFlowHist(batch1)`.

At this point, R should respond with:

```
Listening on http://127.0.0.1:3459
```

(the number after the `:` will probably be different!)
and your web browser will open to display something like this:

![](./flowShiny01.png)

(You may need to maximize your browser window for the layout to look
'correct')

In the upper-left corner you will find the navigation. Click on `Next` and
`Prev` to move forwards and backwards over your list of files. Note the
counter - you can't move past the first or last files, of course.

For the first histogram, file `188-15.LMD`, the model fit is quite good, so
we don't need to worry about improving the initial parameters. Click on the
`Linearity` or `Debris Model` buttons to see how much difference these
options make (not much in this case). Notice that as you change the
options, the model fit updates automatically. **The last selections you
make before moving on to the next plot are the ones that will be saved!**

The second plot, for `222.LMD`, doesn't show a fitted model:

![](./flowShiny02.png)

This is due to bad initial peak estimates, despite the fact that they look
ok. To fix this, move the `Samp. A` peak estimate by clicking around near
the top of the highest peak. You might need to try a few different spots
before you find one that works (I don't know why this sample is so
tricky!).

As you click on the plot, the analysis updates using the new initial peak
estimate. If you want to move the peak for sample B, select that in the
checkbox above the plot.

Once a good initial peak is selected, the analysis succeeds with a good
fit:

![](./flowShiny03.png)

The analysis for `240+S.LMD` succeeded, and even produced a not-terrible
RCS value:

![](./flowShiny04.png)

However, a quick inspection of the plot shows that it's not a very good fit
at all. So we need to move the `A` and/or `B` peaks until we get something
sensible:

![](./flowShiny05.png)

That's better. However, in this case, the Multi-Cut debris model fits even
better:

![](./flowShiny06.png)

Browse through the rest of the histograms to see which ones succeeded, and
which ones need correction. You should be able to get acceptable fits (RCS
< 4) for all of them, despite the fact that some of these files are poor
quality data.

When you're done, click on `Return to R` to get back to the command line.
(You can close the browser window **after** you click this, **not
before**). Now all you're corrected analyses will be stored in the variable
you passed the results of `browseFlowHist()` to. You can check this with:

```{r flowShiny-output, eval = FALSE}
plot(batch1[["240+S.LMD"]])
```

![](./flowShiny07.png)

We have the corrected model fits, so we can export the data for further
analysis:

```{r flowShiny-export, eval = FALSE}
tabulateFlowHist(batch1)
```

Remember, if you don't pass a `file` argument to `tabulateFlowHist` it only
prints the results to he screen, they aren't saved in a file (which file
could it be?). To save the results of your analysis to disk, use:

```{r, eval = FALSE}
tabulateFlowHist(batch1, file = "my-analysis-output.csv")
```

# References