---
title: "cellmig: Quantifying Cell Migration Velocity with 
Hierarchical Bayesian Models"
author: "Simo Kitanovski (simo.kitanovski@uni-due.de)"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{User Manual: cellmig}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include = FALSE, warning = FALSE}
knitr::opts_chunk$set(comment = FALSE, 
                      warning = FALSE, 
                      message = FALSE)
```

```{r}
library(cellmig)
library(ggplot2)
library(ggforce)
ggplot2::theme_set(new = theme_bw(base_size = 10))
```

# Introduction

High-throughput time-lapse microscopy allows us to track thousands of
cells across many wells and plates. However, these experiments generate
complex data with multiple sources of variability:

1.  **Biological variability:** Cells on different experimental plates
    (biological replicates) behave differently even under the same
    conditions.

2.  **Technical variability:** Differences between wells on the same
    plate.

3.  **Batch effects:** Systematic differences between plates (e.g.,
    imaging days, reagent batches).

Standard statistical tests often ignore this hierarchical structure
(cells $\rightarrow$ wells $\rightarrow$ plates), which can lead to
false positives or missed discoveries.

The Bioconductor package `r Biocpkg("cellmig")` addresses this by using
**Bayesian hierarchical models**. It separates biological signals from
technical noise and provides uncertainty-aware estimates (credible
intervals) for treatment effects. This vignette guides you through
installing `cellmig`, formatting your data, fitting a model, and
interpreting the results.

# Installation

To install this package, start R (version "4.5") and enter:

```{r, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("cellmig")
```

# Data Structure

## Required Columns

| Column     | Description                            | Example                |
|:-----------------|:--------------------------------|:--------------------|
| `well`     | Unique well ID                         | "w1", "w2", ...        |
| `plate`    | Unique plate ID (Biological Replicate) | "p1", "p2", ...        |
| `compound` | Name of the compound/drug              | "DMSO", "DrugA", ...   |
| `dose`     | Concentration or level                 | "0", "10", "high", ... |
| `v`        | Observed migration velocity (numeric)  | 0.54 (µm/min)          |
| `offset`   | Batch correction flag (0 or 1)         | 0 or 1                 |

### Important: The `offset` Column

To correct for plate-to-plate batch effects, you must identify a
**control treatment** that appears on every plate (e.g., DMSO).

-   Set `offset = 1` for cells in this control group.

-   Set `offset = 0` for all other treatments.

-   *Note:* The model uses this group to normalize velocities across
    plates

## Example Data

We will use the simulated dataset included in the package. It contains:

-   **3 Plates** (Biological replicates)

-   **378 Wells** (Technical replicates nested in plates)

-   **6 Compounds** at **7 Doses** (42 Treatment Groups)

```{r load-data}
data("d", package = "cellmig")
str(d)
head(d)
```

# Exploratory Data Analysis

Before modeling, it is good practice to visualize the raw data to check
for obvious batch effects or outliers.

## Raw Cell Velocities

Each dot represents a single cell. Facets show different compounds.
Colors indicate plates. If plate colors cluster separately within
facets, you likely have batch effects.

```{r, fig.width=7, fig.height=6}
ggplot(data = d)+
  facet_wrap(facets = ~paste0("compound=", compound), 
             scales = "free_y", ncol = 2)+
  geom_sina(aes(x = as.factor(dose), col = plate, y = v, group = well), 
            size = 0.5)+
  theme_bw()+
  theme(legend.position = "top",
        strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+
  ylab(label = "migration velocity")+
  xlab(label = '')+
  scale_color_grey()+
  guides(color = guide_legend(override.aes = list(size = 3)))+
  guides(shape = guide_legend(override.aes = list(size = 3)))+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "l")
```

## Mean Velocity per Well

Aggregating to the well level often makes batch effects clearer. Here,
we see the mean velocity per well.

```{r, fig.width=7, fig.height=6}
dm <- aggregate(v~well+plate+compound+dose, data = d, FUN = mean)
ggplot(data = dm)+
  facet_wrap(facets = ~paste0("compound=", compound), 
             scales = "free_y", ncol = 2)+
  geom_sina(aes(x = as.factor(dose), col = plate, y = v, group = well), 
            size = 1.5, alpha = 0.7)+
  theme_bw()+
  theme(legend.position = "top",
        strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+
  ylab(label = "migration velocity")+
  xlab(label = '')+
  scale_color_grey()+
  guides(color = guide_legend(override.aes = list(size = 3)))+
  guides(shape = guide_legend(override.aes = list(size = 3)))+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "l")
```

# Model Fitting

Now we fit the hierarchical Bayesian model. The `cellmig()` function
handles the complex Stan modeling behind the scenes.

## Control Parameters

You can adjust the MCMC (Markov Chain Monte Carlo) settings via the
`control` list.

-   `mcmc_warmup`: Steps to tune the sampler.

-   `mcmc_steps`: Actual sampling steps used for inference.

-   `mcmc_chains`: Number of independent chains (recommend $\ge$ 2 for
    convergence checks).

*Note: For this vignette, we use fewer steps for speed. For real
analysis, increase `mcmc_steps` to 2000+.*

```{r fit-model, fig.width=7, fig.height=3.5}
o <- cellmig(x = d,
             control = list(mcmc_warmup = 300,  # Warmup iterations
                            mcmc_steps = 1000,  # Sampling iterations
                            mcmc_chains = 2,    # Number of chains
                            mcmc_cores = 2))    # Parallel cores
```

# Interpreting Results

The model output `o` contains posterior distributions for all
parameters. We focus on the **overall treatment effects** ($\delta_t$),
which represent the change in velocity relative to the control (offset).

## Overall Treatment Effects ($\delta_t$)

The `delta_t` data frame contains the mean, median, and 95% Highest
Density Intervals (HDI) for each treatment.

```{r view-delta}
knitr::kable(o$posteriors$delta_t, digits = 2)
```

### Visualizing Effects

-   **Dot:** Posterior mean effect.
-   **Error Bar:** 95% HDI (Uncertainty range).
-   **Y-axis:** Log-fold change ($\delta$).
    -   If the error bar **crosses 0**, there is no clear evidence of an
        effect.
    -   **Positive values:** Increased migration.
    -   **Negative values:** Decreased migration.

```{r plot-delta, fig.width=6, fig.height=4}
ggplot(data = o$posteriors$delta_t) +
  geom_line(aes(x = dose, y = mean, col = compound, group = compound)) +
  geom_point(aes(x = dose, y = mean, col = compound)) +
  geom_errorbar(aes(x = dose, y = mean, ymin = X2.5., ymax = X97.5., 
                    col = compound), width = 0.1) +
  ylab(label = expression("Log-Fold Change ("*delta*")")) +
  xlab("Dose") +
  theme(legend.position = "top")
```

## From Log-Fold-Change to Fold-Change

Log-scale values can be hard to interpret biologically. We often prefer
**Fold-Change**, where 1 = no effect, \>1 = increase, \<1 = decrease. We
calculate this by exponentiating the values ($\exp(\delta)$).

```{r plot-fold-change, fig.width=6, fig.height=4}
ggplot(data = o$posteriors$delta_t) +
  geom_line(aes(x = dose, y = exp(mean), col = compound, group = compound)) +
  geom_point(aes(x = dose, y = exp(mean), col = compound)) +
  geom_errorbar(aes(x = dose, y = exp(mean), ymin = exp(X2.5.), 
                    ymax = exp(X97.5.), col = compound), width = 0.1) +
  ylab(label = expression("Fold Change ("*delta*"')")) +
  xlab("Dose") +
  theme(legend.position = "top")
```

# Pairwise Comparisons

Often, you want to compare specific treatments against each other (e.g.,
Drug A vs. Drug B), not just against the control.

## Comparison Matrix

The `get_pairs()` function computes the difference between all treatment
groups.

-   $\rho$: Log-fold change difference between Treatment X and
    Treatment Y.

-   $\pi$: Probability that the effect is truly different (0 to 1).
    Closer to 1 means strong evidence.

```{r}
# Get pairwise comparisons (log-scale)
u <- get_pairs(x = o, exponentiate = FALSE)
```

### Visualize $\rho$

```{r}
# vislualize matrix of rhos
u$plot_rho
```

### Visualize $\pi$

```{r}
# visualize matrix of pis
u$plot_pi
```


### Visualize $\rho$ vs. $\pi$ with a volcano plot

```{r}
# visualize volcano plot
u$plot_volcano
```

*Tip: Set `exponentiate = TRUE` to get fold-change ratios instead of
log-differences.*

## Violin Plots for Specific Comparisons

To inspect the distribution of differences between a specific target
group and all others, use `get_violins()`.

```{r get-groups}
# View available group labels
groups <- get_groups(x = o)
head(groups)
```

```{r plot-violins, fig.width=7, fig.height=3}
# Compare all groups against Compound 2, Dose 1
u_violin <- get_violins(x = o, 
                        from_groups = groups$group,
                        to_group = "C2|D1",
                        exponentiate = FALSE)
u_violin$plot
```

-   **Violins:** Show the posterior distribution of the difference.
-   **Label:** Probability ($\pi$) that the difference is non-zero.

# Model Diagnostics

It is crucial to verify that the model fits your data well. `cellmig`
provides Posterior Predictive Checks (PPC).

## Cell-Level Check

Do the simulated velocities (pink) match the observed velocities
(black)? If they overlap well, the model captures the data distribution.

```{r ppc-cell, fig.width=6, fig.height=6}
g <- get_ppc_violins(x = o, wrap = TRUE, ncol = 3)
g + scale_y_log10()
```

## Well-Level Check

Does the model predict the mean velocity per well accurately? Points
should lie close to the diagonal line.

```{r ppc-well, fig.width=5, fig.height=5}
g <- get_ppc_means(x = o)
g
```

# Variance Components

You can inspect how much variability comes from different sources
(plates, wells, treatments). This helps in planning future experiments.

```{r plot-variance, fig.height=3, fig.width=7}
# Plate-specific baseline effects
g_alpha_p <- ggplot(data = o$posteriors$alpha_p) +
  geom_errorbarh(aes(y = plate, x = mean, xmin = X2.5., xmax = X97.5.),
                 height = 0.2) +
  geom_point(aes(y = plate, x = mean)) +
  xlab("Plate Effect (log-scale)")

# Variance parameters (Biological vs Technical)
g_sigma <- ggplot() +
  geom_errorbarh(data = o$posteriors$sigma_bio,
                 aes(y = "Biological (Plate)",
                     x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) +
  geom_errorbarh(data = o$posteriors$sigma_tech,
                 aes(y = "Technical (Well)",
                     x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) +
  geom_errorbarh(data = o$posteriors$sigma_delta,
                 aes(y = "Treatment Variation",
                     x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) +
  geom_point(data = o$posteriors$sigma_bio,
             aes(y = "Biological (Plate)", x = mean)) +
  geom_point(data = o$posteriors$sigma_tech,
             aes(y = "Technical (Well)", x = mean)) +
  geom_point(data = o$posteriors$sigma_delta,
             aes(y = "Treatment Variation", x = mean)) +
  xlab("Standard Deviation")

g_alpha_p | g_sigma
```

# Advanced: Dose-Response Profiles

For screens with multiple compounds and overlapping doses, `cellmig` can
cluster compounds based on their dose-response similarity, accounting
for uncertainty.

-   **Left:** Dendrogram showing similarity between compounds.
-   **Middle/Right:** Dose-response curves for overall ($\delta$) and
    plate-specific ($\gamma$) effects.

```{r dose-response, fig.width=8, fig.height=5}
get_dose_response_profile(x = o, exponentiate = TRUE) +
  patchwork::plot_layout(widths = c(.7, 1, 2))
```

# Session Info

```{r}
sessionInfo()
```