---
title: "Downsampling"
author: "Timothy Keyes"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
description: > 
  Read this vignette to learn how to downsample a high-dimensional cytometry
  dataset to a smaller number of cells using {tidytof}.
vignette: >
  %\VignetteIndexEntry{05. Downsampling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.height = 4,
    fig.width = 4
)

options(
  rmarkdown.html_vignette.check_title = FALSE
)
```

```{r setup, message = FALSE}
library(tidytof)
library(dplyr)
library(ggplot2)

count <- dplyr::count
```

Often, high-dimensional cytometry experiments collect hundreds of thousands or even millions of cells in total, and it can be useful to downsample to a smaller, more computationally tractable number of cells, either for a final analysis or while developing code.

To do this, `{tidytof}` implements the `tof_downsample()` verb, which supports 3 downsampling methods: downsampling to a fixed integer number of cells (the "constant" `method`), downsampling to a fixed proportion of the total number of input cells (the "prop" `method`), or downsampling to a fixed cellular density in phenotypic space (the "density" `method`).

## Downsampling with `tof_downsample()`

Using `{tidytof}`'s built-in dataset `phenograph_data`, we can see that the original size of the dataset is 1000 cells per cluster, or 3000 cells in total:

```{r}
data(phenograph_data)

phenograph_data |>
    dplyr::count(phenograph_cluster)
```

To randomly sample 200 cells per cluster, we can call `tof_downsample()` with the "constant" `method`:

```{r}
phenograph_data |>
    # downsample
    tof_downsample(
        group_cols = phenograph_cluster,
        method = "constant",
        num_cells = 200
    ) |>
    # count the number of downsampled cells in each cluster
    count(phenograph_cluster)
```

Alternatively, if we wanted to sample 50% of the cells in each cluster, we could use the "prop" `method`:

```{r}
phenograph_data |>
    # downsample
    tof_downsample(
        group_cols = phenograph_cluster,
        method = "prop",
        prop_cells = 0.5
    ) |>
    # count the number of downsampled cells in each cluster
    count(phenograph_cluster)
```
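Because every cluster in `phenograph_data` has the same number of cells, the difference between the two methods is easier to see on groups of unequal size. The chunk below is a minimal sketch of the same two sampling schemes written with `dplyr::slice_sample()`. The `toy_cells` tibble and its column names are hypothetical, and the code is only meant to illustrate the idea; it is not a description of how `tof_downsample()` is implemented internally.

```{r}
library(dplyr)

set.seed(2023)

# hypothetical toy data: 3 clusters of unequal size
toy_cells <-
    tibble(
        cluster = rep(c("a", "b", "c"), times = c(1000, 500, 100)),
        marker = rnorm(1600)
    )

# analogous to method = "constant": the same number of cells from every group
toy_constant <-
    toy_cells |>
    group_by(cluster) |>
    slice_sample(n = 50) |>
    ungroup()

# analogous to method = "prop": the same fraction of cells from every group
toy_prop <-
    toy_cells |>
    group_by(cluster) |>
    slice_sample(prop = 0.5) |>
    ungroup()

dplyr::count(toy_constant, cluster)
dplyr::count(toy_prop, cluster)
```

Sampling a constant number of cells equalizes group sizes, whereas sampling a proportion preserves the relative abundance of each group.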

Finally, we might want a different approach to downsampling that reduces the number of cells not to a fixed count or proportion, but to a fixed *density* in phenotypic space. For example, the following hexbin plot shows that some regions of phenotypic space in `phenograph_data` contain more cells than others along the `cd34`/`cd38` axes:


```{r, warning = FALSE, message = FALSE}
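# helper for the fill legend: rescale `x` so that the upper bound of `from`
# (by default, the maximum of `x`) maps to the upper bound of `to`
rescale_max <-
    function(x, to = c(0, 1), from = range(x, na.rm = TRUE)) {
        x / from[2] * to[2]
    }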

phenograph_data |>
    # preprocess all numeric columns in the dataset
    tof_preprocess(undo_noise = FALSE) |>
    # plot
    ggplot(aes(x = cd34, y = cd38)) +
    geom_hex() +
    coord_fixed(ratio = 0.4) +
    scale_x_continuous(limits = c(NA, 1.5)) +
    scale_y_continuous(limits = c(NA, 4)) +
    scale_fill_viridis_c(
        labels = function(x) round(rescale_max(x), 2)
    ) +
    labs(
        fill = "relative density"
    )
```

To reduce the number of cells until the local density around each remaining cell is relatively constant, we can use the "density" `method` of `tof_downsample()`:

```{r, warning = FALSE, message = FALSE}
phenograph_data |>
    tof_preprocess(undo_noise = FALSE) |>
    tof_downsample(method = "density", density_cols = c(cd34, cd38)) |>
    # plot
    ggplot(aes(x = cd34, y = cd38)) +
    geom_hex() +
    coord_fixed(ratio = 0.4) +
    scale_x_continuous(limits = c(NA, 1.5)) +
    scale_y_continuous(limits = c(NA, 4)) +
    scale_fill_viridis_c(
        labels = function(x) round(rescale_max(x), 2)
    ) +
    labs(
        fill = "relative density"
    )
```

Thus, we can see that the density after downsampling is more uniform (though not exactly uniform) across the range of `cd34`/`cd38` values in `phenograph_data`.
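To build some intuition for density-dependent downsampling, the chunk below is a minimal base R sketch of the general idea: estimate a local density for each cell, then keep each cell with a probability that is inversely proportional to that density. The simulated data, the fixed-radius density estimate, and the choice of target density are all assumptions made for illustration, and this is not necessarily how `tof_downsample()` estimates density or selects cells.

```{r}
set.seed(2023)

# hypothetical toy data: a dense and a sparse population in 2 dimensions
toy_markers <-
    rbind(
        matrix(rnorm(2 * 900, mean = 0, sd = 0.5), ncol = 2),
        matrix(rnorm(2 * 100, mean = 4, sd = 0.5), ncol = 2)
    )
population <- rep(c("dense", "sparse"), times = c(900, 100))

# crude local density estimate: the number of cells within a fixed radius
# (each cell counts itself, so every density is at least 1)
pairwise_distances <- as.matrix(dist(toy_markers))
local_density <- rowSums(pairwise_distances < 0.25)

# keep each cell with probability inversely proportional to its local density
target_density <- quantile(local_density, probs = 0.1)
keep_probability <- pmin(1, target_density / local_density)
is_kept <- runif(nrow(toy_markers)) < keep_probability

# cells from the dense population are removed at a higher rate
table(population)
table(population[is_kept])
```

Cells in crowded regions are discarded more often than cells in sparse regions, which is why rare populations tend to be better preserved than under purely random sampling.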

## Additional documentation

For more details, check out the documentation for the 3 underlying members of the `tof_downsample_*` function family (which are wrapped by `tof_downsample()`):

-   `tof_downsample_constant`
-   `tof_downsample_prop`
-   `tof_downsample_density`
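
These functions can also be called directly. The chunk below is a short sketch that assumes `tof_downsample_constant()` accepts the same `group_cols` and `num_cells` arguments that were passed to `tof_downsample()` earlier in this vignette; consult its help page to confirm the exact interface.

```{r}
library(tidytof)
library(dplyr)

data(phenograph_data)

# assumption: group_cols and num_cells behave as they do in tof_downsample()
constant_counts <-
    phenograph_data |>
    tof_downsample_constant(
        group_cols = phenograph_cluster,
        num_cells = 200
    ) |>
    dplyr::count(phenograph_cluster)

constant_counts
```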


## Session info

```{r}
sessionInfo()
```