---
title: "Basic usage of the infinityFlow package"
author: "Etienne Becht"
date: "June 2020"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Basic usage of the infinityFlow package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
# Introduction
Thank you for your interest in the *Infinity Flow* approach. This vignette describes how to apply the package to your massively parallel cytometry (e.g. *LEGENDScreen* or *Lyoplates kits*) experiment. Massively parallel cytometry experiments are cytometry experiments where a sample is aliquoted in _n_ subsamples, each stained with a fixed panel of "Backbone" antibodies. Each aliquot is in addition stained with a unique "Infinity" exploratory antibody. The goal of the *infinityFlow* package is to use information from the ubiquitous Backbone staining to predict the expression of sparsely-measured Infinity antibodies across the entire dataset. To learn more about this type of experiments and details about the Infinity Flow approach, please consult [Becht et al, 2020](https://www.biorxiv.org/content/10.1101/2020.06.17.152926v1). In this vignette we achieve this by using the XGBoost machine-learning framework implemented in the [xgboost R package](https://CRAN.R-project.org/package=xgboost). This vignette aims at explaining how to apply a basic *infinityFlow* analysis. Advanced usages, including different machine learning models and custom hyperparameters values, are covered in a [dedicated vignette](training_non_default_regression_models.html).
    
This vignette will cover:

1. Package installation
2. Setting up your input data
3. Specifying the Backbone and Infinity antibodies
4. Running the Infinity Flow computational pipeline
5. Description of the output

# Package installation
You can install the package from Bioconductor using
```{r, installation, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)){
    install.packages("BiocManager")
}
BiocManager::install("infinityFlow")
```

# Setting up your input data

Now that the package is installed, we load the package and attach its example data. We also load flowCore that will be used to manipulate FCS files in R.

```{r, load_package, eval = TRUE}
library(infinityFlow)
data(steady_state_lung)
```

The example data is a subset of a massively parallel cytometry experiment of the mouse lung at steady state. The example data contains 10 FCS files. To mimick real world-conditions, we will write this set of FCS files to disk. In this vignette we will use a temporary directory, but you can use the directory of your choice.

```{r, load_example, eval=TRUE}
dir <- file.path(tempdir(), "infinity_flow_example")
print(dir)
input_dir <- file.path(dir, "fcs")
write.flowSet(steady_state_lung, outdir = input_dir) ## Omit this if you already have FCS files
list.files(input_dir)
```

The second input we have to manually produce is the annotation of the experiment. In the context of a massively parallel cytometry experiment, we need to know what is the protein target of each Infinity (usually PE-conjugated or APC-conjugated) antibody, and what is its isotype. For the example dataset, the annotation is provided in the package and looks like this:

```{r, load_annotation}
data(steady_state_lung_annotation)
print(steady_state_lung_annotation)
```

The `steady_state_lung_annotation` data.frame contains one line per FCS file, with `rownames(steady_state_lung_annotation) == sampleNames(steady_state_lung)`. The first column specifies the proteins targeted by the Infinity antibody in each FCS file, and the second column specifies its isotype (species and constant region of the antibody). If you load an annotation file from disk, use this command:

```
steady_state_lung_annotation = read.csv("path/to/targets/and/isotypes/annotation/file", row.names = 1, stringsAsFactors = FALSE)
```

That is all we need in terms of inputs! To recap you only need

1. a folder with FCS files from your massively parallel cytometry experiment.
2. a table specifying the antibody targets and antibody isotypes for the Infinity antibodies from your massively parallel cytometry experiment.

# Specifying the Backbone and Infinity antibodies

Now that we have our input data, we need to specify which antibodies are part of the Backbone, and which one is the Infinity antibody. We provide an interactive function to specify this directly in R. This function can be run once, its output saved for future use by downstream functions in the *infinityFlow* package. The `select_backbone_and_exploratory_markers` function will parse an FCS file in the input directory, and for each acquisition channel, ask the user whether it should be used as a predictor (Backbone), exploratory target (Infinity antibodies), or omitted (e.g. Time or Event ID columns...).

```{r, eval = FALSE}
backbone_specification <- select_backbone_and_exploratory_markers(list.files(input_dir, pattern = ".fcs", full.names = TRUE))
```
Below is an example of the interactive execution of the `select_backbone_and_exploratory_markers` function for the example data. The resulting data.frame is printed too.

```
For each data channel, enter either: backbone, exploratory or discard (can be abbreviated)
FSC-A (FSC-A):discard
FSC-H (FSC-H):backbone
FSC-W (FSC-W):b
SSC-A (SSC-A):d
SSC-H (SSC-H):b
SSC-W (SSC-W):b
CD69-CD301b (FJComp-APC-A):b
Zombie (FJComp-APC-eFlour780-A):b
MHCII (FJComp-Alexa Fluor 700-A):b
CD4 (FJComp-BUV395-A):b
CD44 (FJComp-BUV737-A):b
CD8 (FJComp-BV421-A):b
CD11c (FJComp-BV510-A):b
CD11b (FJComp-BV605-A):b
F480 (FJComp-BV650-A):b
Ly6C (FJComp-BV711-A):b
Lineage (FJComp-BV786-A):b
CD45a488 (FJComp-GFP-A):b
Legend (FJComp-PE(yg)-A):exploratory
CD24 (FJComp-PE-Cy7(yg)-A):b
CD103 (FJComp-PerCP-Cy5-5-A):b
Time (Time):d
                         name        desc        type
$P1                     FSC-A        <NA>     discard
$P2                     FSC-H        <NA>    backbone
$P3                     FSC-W        <NA>    backbone
$P4                     SSC-A        <NA>     discard
$P5                     SSC-H        <NA>    backbone
$P6                     SSC-W        <NA>    backbone
$P7              FJComp-APC-A CD69-CD301b    backbone
$P8    FJComp-APC-eFlour780-A      Zombie    backbone
$P9  FJComp-Alexa Fluor 700-A       MHCII    backbone
$P10          FJComp-BUV395-A         CD4    backbone
$P11          FJComp-BUV737-A        CD44    backbone
$P12           FJComp-BV421-A         CD8    backbone
$P13           FJComp-BV510-A       CD11c    backbone
$P14           FJComp-BV605-A       CD11b    backbone
$P15           FJComp-BV650-A        F480    backbone
$P16           FJComp-BV711-A        Ly6C    backbone
$P17           FJComp-BV786-A     Lineage    backbone
$P18             FJComp-GFP-A    CD45a488    backbone
$P19          FJComp-PE(yg)-A      Legend exploratory
$P20      FJComp-PE-Cy7(yg)-A        CD24    backbone
$P21     FJComp-PerCP-Cy5-5-A       CD103    backbone
$P22                     Time        <NA>     discard
Is selection correct? (yes/no): yes
```
We cannot run this function interactively from this vignette, so we load the result from the package instead:

```{r, backbone specification input}
data(steady_state_lung_backbone_specification)
print(head(steady_state_lung_backbone_specification))
```

You need to save this backbone specification file as a CSV file for future use.

```{r, backbone specification output}
write.csv(steady_state_lung_backbone_specification, file = file.path(dir, "backbone_selection_file.csv"), row.names = FALSE)
```

# Running the Infinity Flow computational pipeline

Now that we have our input data, FCS files annotation and specification of the Backbone and Infinity antibodies, we have everything we need to run the pipeline.

All the pipeline is packaged into a single function, `infinity_flow()`.

Here is a description of the basic arguments it requires:

1. path_to_fcs: path to the folder with input FCS files
1. path_to_output: path to a folder where the output will be saved.
1. backbone_selection_file: the CSV file specifying backbone and Infinity antibodies, created in the *Specifying the Backbone and Infinity antibodies* section above.
1. annotation: A named vector of Infinity antibody targets per FCS file. We will create it from the annotation table we created in the *Setting up your input data* section
1. isotype: Same as annotation, but specifying Infinity antibody isotypes rather than antibody targets.

We have everything we need in our input folder to fill these arguments:
```{r, inspect input directory}
list.files(dir)
```

First, input FCS files:
```{r, input FCS files path argument}
path_to_fcs <- file.path(dir, "fcs")
head(list.files(path_to_fcs, pattern = ".fcs"))
```

Output directory. It will be created if it doesn't already exist
```{r, output path argument}
path_to_output <- file.path(dir, "output")
````

Backbone selection file:
```{r, backbone selection file path argument}
list.files(dir)
backbone_selection_file <- file.path(dir, "backbone_selection_file.csv")
head(read.csv(backbone_selection_file))
```

Annotation of Infinity antibody targets and isotypes:
```{r, targets and isotypes arguments}
targets <- steady_state_lung_annotation$Infinity_target
names(targets) <- rownames(steady_state_lung_annotation)
isotypes <- steady_state_lung_annotation$Infinity_isotype
names(isotypes) <- rownames(steady_state_lung_annotation)
head(targets)
head(isotypes)
```

Other arguments are optional, but it is notably worth considering the number of input cells and the number of output cells. This will notably be important if you are using a computer with limited RAM. For the example data it does not matter as we only have access to 2,000 cells per well, but if you run the pipeline on your own data I suggest you start by low values, and ramp up (to 20,000 or 50,000 input cells, and e.g. 10,000 output cells per well) once everything is setup. Another optional argument is `cores` which controls multicore computing, which can speed up execution at the cost of memory usage. In this vignette we use cores = 1, but you probably want to increase this to 4 or 8 or more if your computer can accomodate it.

```{r, input and output events downsampling argument}
input_events_downsampling <- 1000
prediction_events_downsampling <- 500
cores = 1L
```

There is also an argument to store temporary files, which can be useful to further analyze the data in R. If missing, this argument will default to a temporary directory.

```{r, input temporary directory path}
path_to_intermediary_results <- file.path(dir, "tmp")
```

At last, now let us execute the pipeline:

```{r, pipeline execution, eval = TRUE}
imputed_data <- infinity_flow(
	path_to_fcs = path_to_fcs,
	path_to_output = path_to_output,
	path_to_intermediary_results = path_to_intermediary_results,
	backbone_selection_file = backbone_selection_file,
	annotation = targets,
	isotype = isotypes,
	input_events_downsampling = input_events_downsampling,
	prediction_events_downsampling = prediction_events_downsampling,
	verbose = TRUE,
	cores = cores
)
```

The above command populated our output directory with new sets of files which we describe in the next section.

# Description of the output

The output mainly consists of

1. Sets of FCS files with imputed data, which can be used for further downstream analysis.
2. PDFs of a UMAP embedding of the Backbone data, color-coded by imputed expression of the Infinity antibody

Each of the above comes in two flavours, either **raw** or **background-corrected**.

At the end of the pipeline, input FCS files are augmented with imputed data. Feel free to explore these files in whatever flow cytometry software you are comfortable with! They should look pretty much like regular FCS files, although they are computationnally derived. You can find the output FCS files in the `path_to_output` directory, specifically:
```{r}
head(list.files(path_to_fcs)) ## Input files
fcs_raw <- file.path(path_to_output, "FCS", "split")
head(list.files(fcs_raw)) ## Raw output FCS files
fcs_bgc <- file.path(path_to_output, "FCS_background_corrected", "split") ## Background-corrected output FCS files
head(list.files(fcs_bgc)) ## Background-corrected output FCS files
```

Finally, the pipeline produces two PDF files, with a UMAP embedding of the backbone data color-coded by the imputed data. This is a very informative output and a good way to start analyzing your data. These files are present in the `path_to_output` directory. The example dataset is very small but feel free to look at the result for illustration purposes. This PDF is available at 

```{r}
file.path(path_to_output, "umap_plot_annotated.pdf") ## Raw plot
file.path(path_to_output, "umap_plot_annotated_backgroundcorrected.pdf") ## Background-corrected plot
```

```{r, eval = FALSE, echo = FALSE}
knitr::include_graphics(file.path(path_to_output, "umap_plot_annotated.pdf"))
```

# Conclusion

Thank you for following this vignette, I hope you made it through the end without too much headache and that it was informative. General questions about proper usage of the package are best asked on the [Bioconductor support site](https://support.bioconductor.org/) to maximize visibility for future users. If you encounter bugs, feel free to raise an issue on infinityFlow's [github](https://github.com/ebecht/infinityFlow/issues).

# Information about the R session when this vignette was built
```{r}
sessionInfo()
```

```{r, debugging, eval = FALSE, echo = FALSE}
files = list.files(dir, recursive = TRUE)
sapply(
	files,
	function(x){
		file.copy(from = file.path(dir, x), to = "~/Desktop/test/", recursive = TRUE)
	}
)
```