--- title: "User Manual: IgGeneUsage" author: "SK" date: "Sep 12, 2023" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{User Manual: IgGeneUsage} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE, warning = FALSE} knitr::opts_chunk$set(comment = FALSE, warning = FALSE, message = FALSE) ``` ```{r} require(IgGeneUsage) require(rstan) require(knitr) require(ggplot2) require(ggforce) require(gridExtra) require(ggrepel) require(reshape2) ``` # Introduction Decoding the properties of immune repertoires is key in understanding the response of adaptive immunity to challenges such as viral infection. One important property is biases in immunoglobulin (Ig) gene usage between biological conditions (e.g. healthy vs tumor). Yet, most analyses for differential gene usage (DGU) are performed qualitatively, or with inadequate statistical methods. Here we introduce `r Biocpkg("IgGeneUsage")`, a computational tool for DGU analysis. # Input The main input of `r Biocpkg("IgGeneUsage")` is a data.frame that has the following 4 columns: 1. sample_id: name of the repertoire (e.g. Patient-1) 2. condition: name of the condition to which each repertoire belongs (healthy, tumor_A, tumor_B, ...) 3. gene_name: gene name (e.g. IGHV1-10 or family TRVB1) 4. gene_usage_count: numeric (count) of usage related in sample x gene x condition specified in columns 1-3 The sum of all gene usage counts (column 4) for a given repertoire is equal to the repertoire size (number of cells in the repertoire). # Model `r Biocpkg("IgGeneUsage")` transforms the provided input in the following way. Given $R$ repertoires, each having $G$ genes, `r Biocpkg("IgGeneUsage")` generates a gene usage matrix $Y^{R \times G}$. Row sums in $Y$ define the total usage in each repertoire ($N$). For the analysis of DGU between biological conditions, we designed the following Bayesian model ($M$) for zero-inflated beta-binomial regression. This model can fit over-dispersed gene usage data. The immune repertoire data is also not exhaustive, i.e. some Ig genes that are systematically rearranged at low probability might not be sampled, and certain Ig genes are not encoded (or dysfunctional) in some individuals. The zero-inflated component of our model accounts for this. \begin{align} p(Y_{ijk} \mid M) = \begin{cases} \kappa_{j} + (1 - \kappa) \operatorname{BB}\left(0 \mid N_{ijk}, \theta_{ijk}, \phi \right), & \text{if $Y_{ijk}$ = 0} \\ (1 - \kappa_{j}) \operatorname{BB}\left(Y_{ijk} \mid N_{ijk}, \theta_{ijk}, \phi \right), & \text{if $Y_{ijk}$ > 0} \end{cases}\\ \theta_{ij}=\operatorname{logit^{-1}}\left(\alpha_{j}+\beta_{ijk}\right)\\ \beta_{ijk}\sim\operatorname{Normal}\left(\gamma_{jk},\sigma_{k}\right)\\ \gamma_{jk}\sim\operatorname{Normal}\left(0,\tau_{k}\right)\\ \alpha_{j}\sim\operatorname{Normal}\left(\mu_{\alpha},\sigma_{\alpha}\right)\\ \mu_{\alpha} \sim \operatorname{Normal}\left(0, 5\right)\\ \sigma_{k}, \tau_{k}, \sigma_{\alpha} \sim \operatorname{Cauchy^{+}}\left(0, 1\right) \\ \phi \sim \operatorname{Exponential}\left(\tau\right) \\ \tau \sim \operatorname{Gamma}\left(3, 0.1\right) \\ \kappa_{j} \sim \operatorname{Beta}\left({0.1, 1.0\right) \end{align} Model $M$ legend: * $i$ and $j$: index of different repertoires and genes, respectively * $\kappa$: zero-inflation probability * $\theta$: probability of gene usage * $\phi$: dispersion * $\alpha$: intercept/baseline gene usage * $\beta$: slope/within-repertoire DGU coefficient * $\gamma$, $\gamma_{\sigma}$: slope/gene-specific DGU coefficient; standard deviation * $\hat{\gamma}$, $\hat{\gamma}_{\sigma}$: mean and standard deviation of the population of gene-specific DGU coefficients * $\hat{\alpha}$, $\hat{\alpha}_{\sigma}$: mean and standard deviation of the population of gene-specific baseline usages * $\operatorname{BB}$: beta-binomial probability mass function (pmf) * $\operatorname{Normal}$: normal probability density function (pdf) * $\operatorname{Cauchy^{+}}$: half-Cauchy pdf * $\operatorname{Exponential}$: exponential pdf * $\operatorname{Gamma}$: gamma pdf * $\operatorname{Beta}$: beta pdf * $\operatorname{logit^{-1}}$: inverse logistic function In the output of `r Biocpkg("IgGeneUsage")`, we report the mean effect size (es or $\gamma$) and its 95% highest density interval (HDI). Genes with $\gamma \neq 0$ (e.g. if 95% HDI of $\gamma$ excludes 0) are most likely to experience differential usage. Additionally, we report the probability of differential gene usage ($\pi$): \begin{align} \pi = 2 \cdot \max\left(\int_{\gamma = -\infty}^{0} p(\gamma)\mathrm{d}\gamma, \int_{\gamma = 0}^{\infty} p(\gamma)\mathrm{d}\gamma\right) - 1 \end{align} with $\pi = 1$ for genes with strong differential usage, and $\pi = 0$ for genes with negligible differential gene usage. Both metrics are computed based on the posterior distribution of $\gamma$, and are thus related. # Case Study A `r Biocpkg("IgGeneUsage")` has a couple of built-in Ig gene usage datasets. Some were obtained from studies and others were simulated. Lets look into the simulated dataset `d_zibb_2`. This dataset was generated by a zero-inflated beta-binomial (ZIBB) model, and `r Biocpkg("IgGeneUsage")` was designed to fit ZIBB-distributed data. ```{r} data("d_zibb_2", package = "IgGeneUsage") knitr::kable(head(d_zibb_2)) ``` We can also visualize `d_zibb_2` with `r CRANpkg("ggplot")`: ```{r, fig.width=6, fig.height=3.25} ggplot(data = d_zibb_2)+ geom_point(aes(x = gene_name, y = gene_usage_count, col = condition), position = position_dodge(width = .7), shape = 21)+ theme_bw(base_size = 11)+ ylab(label = "Gene usage [count]")+ xlab(label = '')+ theme(legend.position = "top")+ theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) ``` ## DGU analysis As main input `r Biocpkg("IgGeneUsage")` uses a data.frame formatted as `d_zibb_2`. Other input parameters allow you to configure specific settings of the `r CRANpkg("rstan")` sampler. In this example we analyze `d_zibb_2` with 2 MCMC chains, 1500 iterations each including 500 warm-ups using a single CPU core (Hint: for parallel chain execution set parameter `mcmc_cores` = 2). We report for each model parameter its mean and 95% highest density interval (HDIs). **Important remark:** you should run DGU analyses using default `r Biocpkg("IgGeneUsage")` parameters. If warnings or errors are reported with regard to the MCMC sampling, please consult the Stan manual[^2] and adjust the inputs accordingly. If the warnings persist, please submit an issue with a reproducible script at the Bioconductor support site or on Github[^3]. ```{r} M <- DGU(ud = d_zibb_2, # input data mcmc_warmup = 500, # how many MCMC warm-ups per chain (default: 500) mcmc_steps = 1500, # how many MCMC steps per chain (default: 1,500) mcmc_chains = 3, # how many MCMC chain to run (default: 4) mcmc_cores = 1, # how many PC cores to use? (e.g. parallel chains) hdi_lvl = 0.95, # highest density interval level (de fault: 0.95) adapt_delta = 0.8, # MCMC target acceptance rate (default: 0.95) max_treedepth = 10) # tree depth evaluated at each step (default: 12) ``` ## Output format The following objects are provided as part of the output of DGU: * `dgu_summary` (main results of `r Biocpkg("IgGeneUsage")`): quantitative DGU summary * `gu_summary`: quantitative summary of the gene usage (GU) of each gene in the different conditions * `glm`: rstan ('stanfit') object of the fitted model $\rightarrow$ used for model checks (see section 'Model checking') * `ppc`: posterior predictive checks data (see section 'Model checking') ```{r} summary(M) ``` ## Model checking * **Check your model fit**. For this, you can use the object glm. * Minimal checklist of successful MCMC sampling[^2]: * no divergences * no excessive warnings from rstan * Rhat < 1.05 * high Neff * Minimal checklist for valid model: * posterior predictive checks (PPCs): is model consistent with reality, i.e. is there overlap between simulated and observed data? * leave-one-out analysis [^2]: https://mc-stan.org/misc/warnings.html [^3]: https://github.com/snaketron/IgGeneUsage/issues ### MCMC sampling * divergences, tree-depth, energy * none found ```{r} rstan::check_hmc_diagnostics(M$glm) ``` * rhat < 1.03 and n_eff > 0 ```{r, fig.height = 3, fig.width = 6} gridExtra::grid.arrange(rstan::stan_rhat(object = M$glm), rstan::stan_ess(object = M$glm), nrow = 1) ``` ## PPC: posterior predictive checks ### PPCs: repertoire-specific The model used by `r Biocpkg("IgGeneUsage")` is generative, i.e. with the model we can generate usage of each Ig gene in a given repertoire (y-axis). Error bars show 95% HDI of mean posterior prediction. The predictions can be compared with the observed data (x-axis). For points near the diagonal $\rightarrow$ accurate prediction. ```{r, fig.height = 3.25, fig.width = 7} ggplot(data = M$ppc$ppc_rep)+ facet_wrap(facets = ~sample_id, ncol = 5)+ geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+ geom_errorbar(aes(x = observed_count, y = ppc_mean_count, ymin = ppc_L_count, ymax = ppc_H_count), col = "darkgray")+ geom_point(aes(x = observed_count, y = ppc_mean_count), size = 1)+ theme_bw(base_size = 11)+ theme(legend.position = "top")+ xlab(label = "Observed usage [counts]")+ ylab(label = "PPC usage [counts]") ``` ### PPCs: overall Prediction of generalized gene usage within a biological condition is also possible. We show the predictions (y-axis) of the model, and compare them against the observed mean usage (x-axis). If the points are near the diagonal $\rightarrow$ accurate prediction. Errors are 95% HDIs of the mean. ```{r, fig.height = 3, fig.width = 5} ggplot(data = M$ppc$ppc_condition)+ geom_errorbar(aes(x = gene_name, ymin = ppc_L_prop*100, ymax = ppc_H_prop*100, col = condition), position = position_dodge(width = 0.65), width = 0.1)+ geom_point(aes(x = gene_name, y = ppc_mean_prop*100,col = condition), position = position_dodge(width = 0.65))+ theme_bw(base_size = 11)+ theme(legend.position = "top")+ xlab(label = "Observed usage [%]")+ ylab(label = "PPC usage [%]") ``` ## Results Each row of `glm_summary` summarizes the degree of DGU observed for specific Igs. Two metrics are reported: * `es` (also referred to as $\gamma$): effect size on DGU, where `contrast` gives the direction of the effect (e.g. tumor - healthy or healthy - tumor) * `pmax` (also referred to as $\pi$): probability of DGU (parameter $\pi$ from model $M$) For `es` we also have the mean, median standard error (se), standard deviation (sd), L (low bound of 95% HDI), H (high bound of 95% HDI) ```{r} kable(x = head(M$dgu_summary), row.names = FALSE, digits = 3) ``` ### DGU: differential gene usage We know that the values of `\gamma` and `\pi` are related to each other. Lets visualize them for all genes (shown as a point). Names are shown for genes associated with $\pi \geq 0.9$. Dashed horizontal line represents null-effect ($\gamma = 0$). Notice that the gene with $\pi \approx 1$ also has an effect size whose 95% HDI (error bar) does not overlap the null-effect. The genes with high degree of differential usage are easy to detect with this figure. ```{r, fig.height = 4, fig.width = 5} # format data stats <- M$dgu_summary stats <- stats[order(abs(stats$es_mean), decreasing = FALSE), ] stats$gene_fac <- factor(x = stats$gene_name, levels = unique(stats$gene_name)) ggplot(data = stats)+ geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+ geom_errorbar(aes(x = pmax, y = es_mean, ymin = es_L, ymax = es_H), col = "darkgray")+ geom_point(aes(x = pmax, y = es_mean, col = contrast))+ geom_text_repel(data = stats[stats$pmax >= 0.9, ], aes(x = pmax, y = es_mean, label = gene_fac), min.segment.length = 0, size = 2.75)+ theme_bw(base_size = 11)+ theme(legend.position = "top")+ xlab(label = expression(pi))+ xlim(c(0, 1))+ ylab(expression(gamma)) ``` ### Promising hits Lets visualize the observed data of the genes with high probability of differential gene usage ($\pi \geq 0.9$). Here we show the gene usage in %. ```{r, fig.height = 3, fig.width = 5} promising_genes <- stats$gene_name[stats$pmax >= 0.9] ppc_gene <- M$ppc$ppc_condition ppc_gene <- ppc_gene[ppc_gene$gene_name %in% promising_genes, ] ppc_rep <- M$ppc$ppc_rep ppc_rep <- ppc_rep[ppc_rep$gene_name %in% promising_genes, ] ggplot()+ geom_point(data = ppc_rep, aes(x = gene_name, y = observed_prop*100, col = condition), size = 1, fill = "black", position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0, dodge.width = 0.35))+ geom_errorbar(data = ppc_gene, aes(x = gene_name, ymin = ppc_L_prop*100, ymax = ppc_H_prop*100, group = condition), position = position_dodge(width = 0.35), width = 0.15)+ theme_bw(base_size = 11)+ theme(legend.position = "top")+ ylab(label = "PPC usage [%]")+ xlab(label = '') ``` ### Promising hits [count] Lets also visualize the predicted gene usage counts in each repertoire. ```{r, fig.height = 3, fig.width = 5} ggplot()+ geom_point(data = ppc_rep, aes(x = gene_name, y = observed_count, col = condition), size = 1, fill = "black", position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0, dodge.width = 0.5))+ theme_bw(base_size = 11)+ theme(legend.position = "top")+ ylab(label = "PPC usage [count]")+ xlab(label = '') ``` ## GU: gene usage summary `r Biocpkg("IgGeneUsage")` also reports the inferred gene usage (GU) probability of individual genes in each condition. For a given gene we report its mean GU (`prob_mean`) and the 95% (for instance) HDI (`prob_L` and `prob_H`). ```{r, fig.width=5, fig.height=4} ggplot(data = M$gu_summary)+ geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+ geom_errorbar(aes(x = gene_name, y = prob_mean, ymin = prob_L, ymax = prob_H, col = condition), width = 0.1, position = position_dodge(width = 0.4))+ geom_point(aes(x = gene_name, y = prob_mean, col = condition), size = 1, position = position_dodge(width = 0.4))+ theme_bw(base_size = 11)+ theme(legend.position = "top")+ ylab(label = "GU [probability]") ``` # Leave-one-out (LOO) analysis To assert the robustness of the probability of DGU ($\pi$) and the effect size ($\gamma$), `r Biocpkg("IgGeneUsage")` has a built-in procedure for fully Bayesian leave-one-out (LOO) analysis. During each step of LOO, we discard the data of one of the R repertoires, and use the remaining data to analyze for DGU. In each step we record $\pi$ and $\gamma$ for all genes, including the mean and 95% HDI of $\gamma$. We assert quantitatively the robustness of $\pi$ and $\gamma$ by evaluating their variability for a specific gene. This analysis can be computationally demanding. ```{r} L <- LOO(ud = d_zibb_2, # input data mcmc_warmup = 500, # how many MCMC warm-ups per chain (default: 500) mcmc_steps = 1000, # how many MCMC steps per chain (default: 1,500) mcmc_chains = 1, # how many MCMC chain to run (default: 4) mcmc_cores = 1, # how many PC cores to use? (e.g. parallel chains) hdi_lvl = 0.95, # highest density interval level (de fault: 0.95) adapt_delta = 0.8, # MCMC target acceptance rate (default: 0.95) max_treedepth = 10) # tree depth evaluated at each step (default: 12) ``` Next, we collected the results (GU and DGU) from each LOO iteration: ```{r} L_gu <- do.call(rbind, lapply(X = L, FUN = function(x){return(x$gu_summary)})) L_dgu <- do.call(rbind, lapply(X = L, FUN = function(x){return(x$dgu_summary)})) ``` ... and plot them: ## LOO-DGU: variability of effect size $\gamma$ ```{r, fig.width=6.5, fig.height=4} ggplot(data = L_dgu)+ geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+ geom_errorbar(aes(x = gene_name, y = es_mean, ymin = es_L, ymax = es_H, col = contrast, group = loo_id), width = 0.1, position = position_dodge(width = 0.5))+ geom_point(aes(x = gene_name, y = es_mean, col = contrast, group = loo_id), size = 1, position = position_dodge(width = 0.5))+ theme_bw(base_size = 11)+ theme(legend.position = "top")+ ylab(expression(gamma)) ``` ## LOO-DGU: variability of $\pi$ ```{r, fig.width=6.5, fig.height=4} ggplot(data = L_dgu)+ geom_point(aes(x = gene_name, y = pmax, col = contrast, group = loo_id), size = 1, position = position_dodge(width = 0.5))+ theme_bw(base_size = 11)+ theme(legend.position = "top")+ ylab(expression(pi)) ``` ## LOO-GU: variability of the gene usage ```{r, fig.width=6.5, fig.height=4} ggplot(data = L_gu)+ geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+ geom_errorbar(aes(x = gene_name, y = prob_mean, ymin = prob_L, ymax = prob_H, col = condition, group = interaction(loo_id, condition)), width = 0.1, position = position_dodge(width = 0.5))+ geom_point(aes(x = gene_name, y = prob_mean, col = condition, group = interaction(loo_id, condition)), size = 1, position = position_dodge(width = 0.5))+ theme_bw(base_size = 11)+ theme(legend.position = "top")+ ylab("GU [probability]") ``` # Session ```{r} sessionInfo() ```