--- title: "Simulating Data with cellmig" author: "Simo Kitanovski (simo.kitanovski@uni-due.de)" output: BiocStyle::html_document vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{User Manual: Data Simulation} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r setup, include = FALSE, warning = FALSE} knitr::opts_chunk$set(comment = FALSE, message = FALSE, fig.align = "center") ``` ```{r load-packages, message=FALSE} library(cellmig) library(ggplot2) library(ggforce) library(patchwork) library(rstan) # Set a clean theme for all plots ggplot2::theme_set(new = theme_bw(base_size = 10)) ``` # Introduction Why simulate data? Simulation is a powerful tool for: 1. **Validating the model:** Ensuring `cellmig` can recover known truth f rom synthetic data. 2. **Experimental Design:** Performing power analysis to determine how many replicates or cells you need. 3. **Understanding Variability:** Exploring how biological and technical noise affect your results. This vignette demonstrates two simulation modes in `cellmig`: - **Partially Generative:** You define the true effect sizes (e.g., drug effects). Useful for validation and power analysis. - **Fully Generative:** All parameters are drawn from prior distributions. Useful for checking if your priors produce realistic data. # Partially Generative Simulation In this mode, you specify the "ground truth" (e.g., exactly how much a drug slows down cells). The model then adds realistic noise (biological and technical variability) to generate synthetic cell velocities. ## Experimental Design We will simulate an experiment with the following structure: | Parameter | Value | Description | |:-----------------------|:-----------------------|:-----------------------| | **Biological Replicates** | 8 Plates | Independent experimental runs | | **Technical Replicates** | 5 Wells | Wells per treatment per plate | | **Cells per Well** | 50 | Number of tracked cells | | **Treatment Groups** | 7 | Different compounds/doses | | **True Effects (**$\delta$) | -0.4 to 0.4 | Log-fold changes relative to control | | **Bio Variability (**$\sigma_{bio}$) | 0.1 | Plate-to-plate variation | | **Tech Variability (**$\sigma_{tech}$) | 0.05 | Well-to-well variation | ## Model Parameters ### Plate Effects ($\alpha_p$) Plates often have batch effects (e.g., one day cells move faster than another). We simulate this by drawing plate-specific baselines from a normal distribution. - **Mean:** 0.0 (log-scale velocity) - **SD:** 0.1 (magnitude of batch effect) ### Control Treatment (Offset) To correct for batch effects, one treatment group must serve as a reference (control) across all plates. We designate **Treatment 4** as the control (true effect = 0). ### Cell Velocity Distribution ($\kappa_w$) Cell velocities are modeled using a Gamma distribution. The shape parameter $\kappa$ controls the skewness: - **Higher** $\kappa$: Less variable, more symmetric data. - **Lower** $\kappa$: More skewed data. - **Typical value:** $\log(\kappa) \approx 1.5$ ($\kappa \approx 5$). ## Generating the Data We use the `gen_partial()` function. The `control` list specifies all design parameters. ```{r sim-partial} # Set seed for reproducibility set.seed(1253) # Generate synthetic data y_p <- gen_partial(control = list( N_biorep = 8, # Number of plates N_techrep = 5, # Wells per treatment per plate N_cell = 50, # Cells per well delta = c(-0.4, -0.2, -0.1, 0, 0.1, 0.2, 0.4), # True treatment effects sigma_bio = 0.1, # Biological variability sigma_tech = 0.05, # Technical variability offset = 4, # Index of the control treatment prior_alpha_p_M = 0, # Mean plate baseline prior_alpha_p_SD = 0.1, # SD of plate baseline prior_kappa_mu_M = 1.5, # Mean log(shape parameter) prior_kappa_mu_SD = 0.1, # SD of log(shape parameter) prior_kappa_sigma_M = 0, # Fixed sigma for shape prior_kappa_sigma_SD = 0.1 # SD of sigma for shape )) ``` ## Inspecting the Simulated Data The output `y_p` contains the data frame `y` and the true parameters `par`. ```{r view-sim-data} str(y_p$y) ``` The data frame includes: - `v`: Simulated cell velocity. - `well`, `plate`: Hierarchical identifiers. - `group`: Treatment group ID. - `compound`, `dose`: Metadata (set to generic values in simulation). ## Visualizing Simulated Data Let's plot the raw cell velocities. Colors represent plates. If batch effects are present, you might see plate colors clustering at different heights. ```{r plot-sim-raw, fig.width=7, fig.height=3} ggplot(data = y_p$y) + geom_sina(aes(x = paste0("t=", group), col = paste0("p=", plate), y = v, group = well), size = 0.5) + xlab("Treatment Group") + ylab("Migration Speed (µm/min)") + scale_color_grey(name = "Plate") + scale_y_log10() + theme(legend.position = "top") ``` We can also look at **mean well velocities** to see batch effects more clearly: ```{r plot-sim-well, fig.width=7, fig.height=3} well_means <- aggregate(v ~ well + group + plate, data = y_p$y, FUN = mean) ggplot(data = well_means) + geom_sina(aes(x = paste0("t=", group), col = paste0("p=", plate), y = v, group = well), size = 1) + xlab("Treatment Group") + ylab("Mean Well Speed") + scale_color_grey(name = "Plate") + scale_y_log10() + theme(legend.position = "top") ``` ## Analyzing Simulated Data To verify the model works, we analyze the synthetic data just like real experimental data. **Important:** We must format the data correctly: 1. Convert IDs to characters. 2. Create an `offset` column (1 for control, 0 for others). ```{r format-sim-data} sim_data <- y_p$y # Format columns sim_data$well <- as.character(sim_data$well) sim_data$compound <- as.character(sim_data$compound) sim_data$plate <- as.character(sim_data$plate) # Define offset (Control = Group 4) sim_data$offset <- 0 sim_data$offset[sim_data$group == 4] <- 1 ``` Now we fit the model: ```{r fit-sim-data, fig.width=7, fig.height=3.5} osd <- cellmig(x = sim_data, control = list(mcmc_warmup = 250, mcmc_steps = 800, mcmc_chains = 2, mcmc_cores = 2, mcmc_algorithm = "NUTS", adapt_delta = 0.8, max_treedepth = 10)) ``` ## Validation: Truth vs. Inference Does `cellmig` recover the true parameters we set? We compare the **posterior estimates** (dots + error bars) against the **true values** (red dots). ### Plate Effects ($\alpha_p$) The model should recover the plate-specific baselines. ```{r check-alpha, fig.width=5, fig.height=3} ggplot(data = osd$posteriors$alpha_p) + geom_point(aes(y = plate_id, x = exp(mean))) + geom_errorbarh(aes(y = plate_id, x = exp(mean), xmin = exp(X2.5.), xmax = exp(X97.5.)), height = 0.1) + # Overlay true values in red geom_point(data = data.frame(v = exp(y_p$par$alpha_p), plate = osd$posteriors$alpha_p$plate_id), aes(y = plate, x = v), col = "red", size = 2) + xlab(label = expression("Plate Baseline ("*alpha[p]*"')")) + ylab("Plate ID") + scale_y_continuous(breaks = osd$posteriors$alpha_p$plate_id) ``` ### Treatment Effects ($\delta_t$) The model should recover the true log-fold changes we defined in `delta`. ```{r check-delta, fig.width=5, fig.height=3} # Note: Group 4 is the control (offset), so it is not estimated ggplot(data = osd$posteriors$delta_t) + geom_point(aes(y = group_id, x = mean)) + geom_errorbarh(aes(y = group_id, x = mean, xmin = X2.5., xmax = X97.5.), height = 0.1) + # Overlay true values (excluding control) geom_point(data = data.frame(delta = c(-0.4, -0.2, -0.1, 0.1, 0.2, 0.4), group_id = 1:6), aes(y = group_id, x = delta), col = "red", size = 2) + xlab(label = expression("Treatment Effect ("*delta[t]*")")) + ylab("Treatment Group") + scale_y_continuous(breaks = 1:8) ``` ### Variability Parameters Check if the estimated biological and technical noise matches the input. ```{r check-sigma, fig.width=5, fig.height=3} plot(osd$fit, par = c("sigma_bio", "sigma_tech", "sigma_delta")) ``` # Fully Generative Simulation In this mode, **no parameters are fixed**. Everything (including treatment effects) is drawn from the prior distributions. This generates data from the **prior predictive distribution**. **Use Case:** Check if your priors produce realistic data before collecting any real samples. If the simulated velocities are biologically impossible (e.g., 1000 µm/min), your priors may need adjustment. ## Model Specification We define priors for all parameters instead of fixed values. ```{r sim-full-control} control_full <- list( N_biorep = 4, N_techrep = 3, N_cell = 30, N_group = 8, # Priors for plate baselines prior_alpha_p_M = -0.5, prior_alpha_p_SD = 1.0, # Priors for Gamma shape prior_kappa_mu_M = 1.5, prior_kappa_mu_SD = 1.0, prior_kappa_sigma_M = 0, prior_kappa_sigma_SD = 1.0, # Priors for variability prior_sigma_bio_M = 0.0, prior_sigma_bio_SD = 1.0, prior_sigma_tech_M = 0.0, prior_sigma_tech_SD = 1.0, prior_sigma_delta_M = 0.0, prior_sigma_delta_SD = 1.0 ) ``` ## Generate Data ```{r sim-full} y_f <- gen_full(control = control_full) str(y_f) ``` ## Comparing Simulation Modes Do the two simulation modes produce similar ranges of cell velocities? ```{r compare-sim-modes, fig.width=5, fig.height=5} # Compare a subset of velocities w <- data.frame(v_f = y_f$y$v[1:2000], v_p = y_p$y$v[1:2000]) ggplot(data = w) + geom_point(aes(x = v_f, y = v_p), size = 0.5, alpha = 0.5) + geom_density_2d(aes(x = v_f, y = v_p), col = "orange") + scale_x_log10(name = "Fully Generative Speed", limits = c(0.01, 10^4)) + scale_y_log10(name = "Partially Generative Speed", limits = c(0.01, 10^4)) + annotation_logticks(base = 10, sides = "lb") + theme_bw() ``` # Power Analysis One of the most practical uses of simulation is **power analysis**. *Question:* How many biological replicates (plates) do I need to detect a specific effect size? We can answer this by: 1. Simulating many datasets with varying numbers of replicates. 2. Fitting the model to each. 3. Checking how often the model correctly identifies the true effect (True Positive Rate). **Note:** The code below is computationally intensive (it fits many models). It is set to `eval = FALSE` for this vignette. You can run it locally to perform your own power analysis. **Tip:** Use multiple cores (`mcmc_cores`) to speed this up! ## Simulation Loop ```{r power-analysis, eval = FALSE, message=FALSE, warning=FALSE} # --- Configuration --- N_bioreps <- c(3, 6, 9) # Replicate scenarios to test N_sim <- 10 # Number of simulations per scenario true_deltas <- c(-0.3, -0.15, 0, 0.2, 0.4) # Effects to test offset <- 3 # Control group index # Store results deltas <- vector(mode = "list", length = length(N_bioreps) * N_sim) i <- 1 for(N_biorep in N_bioreps) { for(b in 1:N_sim) { # 1. Simulate data y_p <- gen_partial(control = list( N_biorep = N_biorep, N_techrep = 3, N_cell = 40, delta = true_deltas, sigma_bio = 0.1, sigma_tech = 0.05, offset = offset, prior_alpha_p_M = -0.5, prior_alpha_p_SD = 0.1, prior_kappa_mu_M = 1.5, prior_kappa_mu_SD = 0.1, prior_kappa_sigma_M = 0, prior_kappa_sigma_SD = 0.1 )) # 2. Format data sim_data <- y_p$y sim_data$well <- as.character(sim_data$well) sim_data$compound <- as.character(sim_data$compound) sim_data$plate <- as.character(sim_data$plate) sim_data$offset <- 0 sim_data$offset[sim_data$group == offset] <- 1 # 3. Fit model o <- cellmig(x = sim_data, control = list(mcmc_warmup = 300, mcmc_steps = 1000, mcmc_chains = 1, mcmc_cores = 1, # Increase for speed mcmc_algorithm = "NUTS", adapt_delta = 0.8, max_treedepth = 10)) # 4. Evaluate Performance delta <- o$posteriors$delta_t delta$b <- b delta$N_biorep <- N_biorep delta$true_deltas <- true_deltas[-offset] # True Positive: HDI excludes 0 AND includes true value delta$TP <- (delta$X2.5. <= delta$true_deltas & delta$X97.5. >= delta$true_deltas) & !(delta$X2.5. <= 0 & delta$X97.5. >= 0) deltas[[i]] <- delta i <- i + 1 } } # Combine results deltas <- do.call(rbind, deltas) ``` ## Interpreting Results When you run this analysis, you will typically observe: - **Large Effects** (e.g., $|\delta| > 0.3$): Detected reliably even with few replicates (High True Positive rate). - **Small Effects** (e.g., $|\delta| < 0.2$): Require many biological replicates to distinguish from noise. ```{r plot-power, eval = FALSE, fig.height=4, fig.width = 7} ggplot(data = aggregate(TP ~ N_biorep + true_deltas, data = deltas, FUN = sum)) + geom_point(aes(x = N_biorep, y = TP, col = abs(true_deltas), group = as.factor(true_deltas)), size = 2, alpha = 0.5) + geom_line(aes(x = N_biorep, y = TP, col = abs(true_deltas), group = as.factor(true_deltas)), alpha = 0.5) + ylab("Number of True Positives (TPs)") + xlab("Number of Biological Replicates") + scale_color_distiller(name = expression("|"*delta[t]*"|"), palette = "Spectral") + theme_bw() ``` # Session Info ```{r} sessionInfo() ```