--- title: "cellmig: Quantifying Cell Migration Velocity with Hierarchical Bayesian Models" author: "Simo Kitanovski (simo.kitanovski@uni-due.de)" output: BiocStyle::html_document vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{User Manual: cellmig} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r setup, include = FALSE, warning = FALSE} knitr::opts_chunk$set(comment = FALSE, warning = FALSE, message = FALSE) ``` ```{r} library(cellmig) library(ggplot2) library(ggforce) ggplot2::theme_set(new = theme_bw(base_size = 10)) ``` # Introduction High-throughput time-lapse microscopy allows us to track thousands of cells across many wells and plates. However, these experiments generate complex data with multiple sources of variability: 1. **Biological variability:** Cells on different experimental plates (biological replicates) behave differently even under the same conditions. 2. **Technical variability:** Differences between wells on the same plate. 3. **Batch effects:** Systematic differences between plates (e.g., imaging days, reagent batches). Standard statistical tests often ignore this hierarchical structure (cells $\rightarrow$ wells $\rightarrow$ plates), which can lead to false positives or missed discoveries. The Bioconductor package `r Biocpkg("cellmig")` addresses this by using **Bayesian hierarchical models**. It separates biological signals from technical noise and provides uncertainty-aware estimates (credible intervals) for treatment effects. This vignette guides you through installing `cellmig`, formatting your data, fitting a model, and interpreting the results. # Installation To install this package, start R (version "4.5") and enter: ```{r, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("cellmig") ``` # Data Structure ## Required Columns | Column | Description | Example | |:-----------------|:--------------------------------|:--------------------| | `well` | Unique well ID | "w1", "w2", ... | | `plate` | Unique plate ID (Biological Replicate) | "p1", "p2", ... | | `compound` | Name of the compound/drug | "DMSO", "DrugA", ... | | `dose` | Concentration or level | "0", "10", "high", ... | | `v` | Observed migration velocity (numeric) | 0.54 (µm/min) | | `offset` | Batch correction flag (0 or 1) | 0 or 1 | ### Important: The `offset` Column To correct for plate-to-plate batch effects, you must identify a **control treatment** that appears on every plate (e.g., DMSO). - Set `offset = 1` for cells in this control group. - Set `offset = 0` for all other treatments. - *Note:* The model uses this group to normalize velocities across plates ## Example Data We will use the simulated dataset included in the package. It contains: - **3 Plates** (Biological replicates) - **378 Wells** (Technical replicates nested in plates) - **6 Compounds** at **7 Doses** (42 Treatment Groups) ```{r load-data} data("d", package = "cellmig") str(d) head(d) ``` # Exploratory Data Analysis Before modeling, it is good practice to visualize the raw data to check for obvious batch effects or outliers. ## Raw Cell Velocities Each dot represents a single cell. Facets show different compounds. Colors indicate plates. If plate colors cluster separately within facets, you likely have batch effects. ```{r, fig.width=7, fig.height=6} ggplot(data = d)+ facet_wrap(facets = ~paste0("compound=", compound), scales = "free_y", ncol = 2)+ geom_sina(aes(x = as.factor(dose), col = plate, y = v, group = well), size = 0.5)+ theme_bw()+ theme(legend.position = "top", strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+ ylab(label = "migration velocity")+ xlab(label = '')+ scale_color_grey()+ guides(color = guide_legend(override.aes = list(size = 3)))+ guides(shape = guide_legend(override.aes = list(size = 3)))+ scale_y_log10()+ annotation_logticks(base = 10, sides = "l") ``` ## Mean Velocity per Well Aggregating to the well level often makes batch effects clearer. Here, we see the mean velocity per well. ```{r, fig.width=7, fig.height=6} dm <- aggregate(v~well+plate+compound+dose, data = d, FUN = mean) ggplot(data = dm)+ facet_wrap(facets = ~paste0("compound=", compound), scales = "free_y", ncol = 2)+ geom_sina(aes(x = as.factor(dose), col = plate, y = v, group = well), size = 1.5, alpha = 0.7)+ theme_bw()+ theme(legend.position = "top", strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+ ylab(label = "migration velocity")+ xlab(label = '')+ scale_color_grey()+ guides(color = guide_legend(override.aes = list(size = 3)))+ guides(shape = guide_legend(override.aes = list(size = 3)))+ scale_y_log10()+ annotation_logticks(base = 10, sides = "l") ``` # Model Fitting Now we fit the hierarchical Bayesian model. The `cellmig()` function handles the complex Stan modeling behind the scenes. ## Control Parameters You can adjust the MCMC (Markov Chain Monte Carlo) settings via the `control` list. - `mcmc_warmup`: Steps to tune the sampler. - `mcmc_steps`: Actual sampling steps used for inference. - `mcmc_chains`: Number of independent chains (recommend $\ge$ 2 for convergence checks). *Note: For this vignette, we use fewer steps for speed. For real analysis, increase `mcmc_steps` to 2000+.* ```{r fit-model, fig.width=7, fig.height=3.5} o <- cellmig(x = d, control = list(mcmc_warmup = 300, # Warmup iterations mcmc_steps = 1000, # Sampling iterations mcmc_chains = 2, # Number of chains mcmc_cores = 2)) # Parallel cores ``` # Interpreting Results The model output `o` contains posterior distributions for all parameters. We focus on the **overall treatment effects** ($\delta_t$), which represent the change in velocity relative to the control (offset). ## Overall Treatment Effects ($\delta_t$) The `delta_t` data frame contains the mean, median, and 95% Highest Density Intervals (HDI) for each treatment. ```{r view-delta} knitr::kable(o$posteriors$delta_t, digits = 2) ``` ### Visualizing Effects - **Dot:** Posterior mean effect. - **Error Bar:** 95% HDI (Uncertainty range). - **Y-axis:** Log-fold change ($\delta$). - If the error bar **crosses 0**, there is no clear evidence of an effect. - **Positive values:** Increased migration. - **Negative values:** Decreased migration. ```{r plot-delta, fig.width=6, fig.height=4} ggplot(data = o$posteriors$delta_t) + geom_line(aes(x = dose, y = mean, col = compound, group = compound)) + geom_point(aes(x = dose, y = mean, col = compound)) + geom_errorbar(aes(x = dose, y = mean, ymin = X2.5., ymax = X97.5., col = compound), width = 0.1) + ylab(label = expression("Log-Fold Change ("*delta*")")) + xlab("Dose") + theme(legend.position = "top") ``` ## From Log-Fold-Change to Fold-Change Log-scale values can be hard to interpret biologically. We often prefer **Fold-Change**, where 1 = no effect, \>1 = increase, \<1 = decrease. We calculate this by exponentiating the values ($\exp(\delta)$). ```{r plot-fold-change, fig.width=6, fig.height=4} ggplot(data = o$posteriors$delta_t) + geom_line(aes(x = dose, y = exp(mean), col = compound, group = compound)) + geom_point(aes(x = dose, y = exp(mean), col = compound)) + geom_errorbar(aes(x = dose, y = exp(mean), ymin = exp(X2.5.), ymax = exp(X97.5.), col = compound), width = 0.1) + ylab(label = expression("Fold Change ("*delta*"')")) + xlab("Dose") + theme(legend.position = "top") ``` # Pairwise Comparisons Often, you want to compare specific treatments against each other (e.g., Drug A vs. Drug B), not just against the control. ## Comparison Matrix The `get_pairs()` function computes the difference between all treatment groups. - $\rho$: Log-fold change difference between Treatment X and Treatment Y. - $\pi$: Probability that the effect is truly different (0 to 1). Closer to 1 means strong evidence. ```{r} # Get pairwise comparisons (log-scale) u <- get_pairs(x = o, exponentiate = FALSE) ``` ### Visualize $\rho$ ```{r} # vislualize matrix of rhos u$plot_rho ``` ### Visualize $\pi$ ```{r} # visualize matrix of pis u$plot_pi ``` ### Visualize $\rho$ vs. $\pi$ with a volcano plot ```{r} # visualize volcano plot u$plot_volcano ``` *Tip: Set `exponentiate = TRUE` to get fold-change ratios instead of log-differences.* ## Violin Plots for Specific Comparisons To inspect the distribution of differences between a specific target group and all others, use `get_violins()`. ```{r get-groups} # View available group labels groups <- get_groups(x = o) head(groups) ``` ```{r plot-violins, fig.width=7, fig.height=3} # Compare all groups against Compound 2, Dose 1 u_violin <- get_violins(x = o, from_groups = groups$group, to_group = "C2|D1", exponentiate = FALSE) u_violin$plot ``` - **Violins:** Show the posterior distribution of the difference. - **Label:** Probability ($\pi$) that the difference is non-zero. # Model Diagnostics It is crucial to verify that the model fits your data well. `cellmig` provides Posterior Predictive Checks (PPC). ## Cell-Level Check Do the simulated velocities (pink) match the observed velocities (black)? If they overlap well, the model captures the data distribution. ```{r ppc-cell, fig.width=6, fig.height=6} g <- get_ppc_violins(x = o, wrap = TRUE, ncol = 3) g + scale_y_log10() ``` ## Well-Level Check Does the model predict the mean velocity per well accurately? Points should lie close to the diagonal line. ```{r ppc-well, fig.width=5, fig.height=5} g <- get_ppc_means(x = o) g ``` # Variance Components You can inspect how much variability comes from different sources (plates, wells, treatments). This helps in planning future experiments. ```{r plot-variance, fig.height=3, fig.width=7} # Plate-specific baseline effects g_alpha_p <- ggplot(data = o$posteriors$alpha_p) + geom_errorbarh(aes(y = plate, x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) + geom_point(aes(y = plate, x = mean)) + xlab("Plate Effect (log-scale)") # Variance parameters (Biological vs Technical) g_sigma <- ggplot() + geom_errorbarh(data = o$posteriors$sigma_bio, aes(y = "Biological (Plate)", x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) + geom_errorbarh(data = o$posteriors$sigma_tech, aes(y = "Technical (Well)", x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) + geom_errorbarh(data = o$posteriors$sigma_delta, aes(y = "Treatment Variation", x = mean, xmin = X2.5., xmax = X97.5.), height = 0.2) + geom_point(data = o$posteriors$sigma_bio, aes(y = "Biological (Plate)", x = mean)) + geom_point(data = o$posteriors$sigma_tech, aes(y = "Technical (Well)", x = mean)) + geom_point(data = o$posteriors$sigma_delta, aes(y = "Treatment Variation", x = mean)) + xlab("Standard Deviation") g_alpha_p | g_sigma ``` # Advanced: Dose-Response Profiles For screens with multiple compounds and overlapping doses, `cellmig` can cluster compounds based on their dose-response similarity, accounting for uncertainty. - **Left:** Dendrogram showing similarity between compounds. - **Middle/Right:** Dose-response curves for overall ($\delta$) and plate-specific ($\gamma$) effects. ```{r dose-response, fig.width=8, fig.height=5} get_dose_response_profile(x = o, exponentiate = TRUE) + patchwork::plot_layout(widths = c(.7, 1, 2)) ``` # Session Info ```{r} sessionInfo() ```