--- title: "Estimating Enrichment in PhIP-Seq Experiments with BEER" author: - name: Athena Chen - name: Kai Kammers - name: Rob Scharpf - name: Ingo Ruczinski date: "`r format(Sys.time(), '%B %d, %Y')`" vignette: > %\VignetteIndexEntry{beer} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: number_sections: yes toc: true --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE ) ``` # Introduction Phage immuno-precipitation sequencing (PhIP-seq) is a high-throughput approach for characterizing antibody responses to a variety of target antigens. A typical component of PhIP-seq analyses involves identifying which peptides elicit enriched antibody responses. `beer` provides two approaches for identifying peptide enrichments. The first approach is based on [`edgeR`](https://bioconductor.org/packages/release/bioc/html/edgeR.html)'s standard pipeline for identifying differential expression from read count data[^edgeRBioc][^edgeRDE][^edgeRF]. Though [`edgeR`](https://bioconductor.org/packages/release/bioc/html/edgeR.html) is remarkably effective at quickly identifying enriched antibody responses, it is less likely to pick up enriched peptides at the lower fold-change range. [^edgeRBioc]: Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140 [^edgeRDE]: McCarthy DJ, Chen Y and Smyth GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research 40, 4288-4297 [^edgeRF]: Chen Y, Lun ATL, Smyth GK (2016). From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research 5, 1438 The second approach, Bayesian Estimation in R (BEER) was developed specifically for the PhIP-seq setting and implements a Bayesian model to identify peptide enrichments as described in Chen et. al^[Chen A, Kammers K, Larman HB, Scharpf R, Ruczinski I. Detecting antibody reactivities in phage immunoprecipitation sequencing data (2022). _bioRxiv_. https://www.biorxiv.org/content/10.1101/2022.01.19.476926v1]. Though BEER is more likely to identify enriched peptides at the lower fold-change range, it tends to take much longer to run. Along with `beer`, we will use the following packages in this vignette: ```{r load_packages, include = TRUE, results = "hide", message = FALSE, warning = FALSE} library(ggplot2) library(dplyr) ``` # Installation ## `rjags` For Bayesian MCMC modeling, `beer` relies on `r CRANpkg("rjags")` to interface [Just Another Gibbs Sampler (JAGS)](https://mcmc-jags.sourceforge.io/). JAGS can be downloaded from [this link](https://sourceforge.net/projects/mcmc-jags/files/). [Homebrew](https://brew.sh/) users can install JAGS using, ``` brew install jags ``` For M1 Mac users using Rosetta emulation of intel, Homebrew installation of JAGS will likely work. However, we recommend installing JAGS from source for all other M1 Mac users. Once JAGS has been installed, `r CRANpkg("rjags")` can be installed in `R` via `install.packages("rjags")`. ## `beer` Once `rjags` and `PhIPData` have been installed, the stable release version of `beer` in Bioconductor can be installed using `BiocManager`: ```{r install_beer, eval = FALSE} BiocManager::install("beer") ``` To load the package: ```{r load_beer, include = TRUE, results = "hide", message = FALSE, warning = FALSE} library(beer) ``` # Simulated data To demonstrate `beer`, we simulate a small toy data set of 10 samples, each with 50 peptides. Four of the ten samples were beads-only samples, and each of the six remaining samples had 5 enriched peptides. Note that since there are rather few beads-only samples, each with very few peptides, the inference is likely to be very poor. Parameters for non-enriched peptides were derived from beads-only samples run with samples from HIV elite controllers. For the code to generate the data set, see `file.path(package = "beer", "script/sim_data.R")`. ```{r load_data} data_path <- system.file("extdata/sim_data.rds", package = "beer") sim_data <- readRDS(data_path) sim_data ``` # edgeR Differentially enriched peptides between a particular serum sample and all beads-only samples indicate enriched antibody responses to those peptides. Thus, to identify enriched peptides, we can run the standard `r Biocpkg("edgeR")` pipeline for differential expression. The `runEdgeR()` function estimates peptide-specific dispersion parameters then tests identifies differentially expressed peptides using either the exact test proposed by Robinson and Smyth^[Robinson MD and Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data (2008). _Biostatistics_, 9, 321-332. https://doi.org/10.1093/biostatistics/kxm030] (default, also specified with `de.method = "exactTest"`), or the GLM quasi-likelihood F-test^[Lun, ATL, Chen, Y, and Smyth, GK. It’s DE-licious: a recipe for differential expression analyses of RNA-seq experiments using quasi-likelihood methods in edgeR (2016). _Methods in Molecular Biology_, 1418, 391–416.] (specified with `de.method = "glmQLFTest"`). Since peptides are enriched only if average proportion of reads pulled in the serum sample is higher than the average proportion of reads pulled in a beads-only samples, two-sided p-values are converted to one-sided p-values. edgeR log10 p-values and log2 estimated fold-changes are returned in the assays specified by **`assay.names`**. ```{r edgeR} edgeR_out <- runEdgeR(sim_data, assay.names = c(logfc = "edgeR_logfc", prob = "edgeR_logpval")) ``` Using BH correction to adjust for multiple testing, enriched peptides are given by the matrix, ```{r edgeR_hits} assay(edgeR_out, "edgeR_hits") <- apply( assay(edgeR_out, "edgeR_logpval"), 2, function(sample){ pval <- 10^(-sample) p.adjust(pval, method = "BH") < 0.05 }) colSums(assay(edgeR_out, "edgeR_hits")) ``` # BEER (Bayesian Estimation Enrichment in R) BEER uses a Bayesian hierarchical model to derive posterior probabilities of enrichment and estimated fold-changes. Briefly, each sample is run individually in comparison to all beads-only samples as follows: 1. **Define prior parameters.** Though most prior parameters are supplemented by the user (or use the defaults), prior parameters for non-enriched peptides are first approximated using all beads-only samples. 2. **Identify super enriched peptides.** Based on the prior parameters, super enriched peptides are first excluded as these peptides should always have posterior probabilities of enrichment of 1. 3. **Re-estimate beads-only prior parameters.** Prior parameters are then reestimated from the beads-only samples for the remaining peptides. 4. **Initialize and run the MCMCs.** To reduce convergence time, MLE estimates are used to initialize the MCMC sampler, and samples are drawn from the posterior distributions of the unknown parameters. 5. **Summarize and store results.** Posterior samples are summarized using the means of the posterior distribution and are stored in the PhIPData object. BEER can be easily run with `brew()`. Like with `runEdgeR()`, results are stored in the locations specified by `assay.names`. We can add the existing results to our edgeR output as follows. ```{r beer, results = FALSE, message=FALSE, warning = FALSE, error = FALSE} ## Named vector specifying where we want to store the summarized MCMC output ## NULL indicates that the output should not be stored. assay_locations <- c( phi = "beer_fc_marg", phi_Z = "beer_fc_cond", Z = "beer_prob", c = "sampleInfo", pi = "sampleInfo" ) beer_out <- brew(edgeR_out, assay.names = assay_locations) ``` Thus, supposing peptides with posterior probability above 0.5 are enriched and noting that super enriched peptides were not run (and thus are missing entries in the posterior probability matrix), the matrix of enriched peptides is given by, ```{r beer_hits} ## Define matrix of peptides that were run in BEER was_run <- matrix(rep(beer_out$group != "beads", each = nrow(beer_out)), nrow = nrow(beer_out)) ## Identify super-enriched peptides ## These peptides were in samples that were run, but have missing posterior ## probabilities are_se <- was_run & is.na(assay(beer_out, "beer_prob")) ## Enriched peptides are peptides with: ## - posterior probability > 0.5, OR ## - super-enriched peptides assay(beer_out, "beer_hits") <- assay(beer_out, "beer_prob") > 0.5 | are_se colSums(assay(beer_out, "beer_hits")) ``` Each of the steps above are controlled by arguments within `brew()` and are described in a bit more detail in the following sections. ## Prior parameters The model relies on the following prior parameters: * `a_pi`, `b_pi`: shape parameters for a beta distribution that describes the proportion of peptides expected to be enriched in a given sample. The defaults are `a_pi = 2`, `b_pi = 300`. * `a_phi`, `b_phi`: shape parameters for a gamma distribution that describes fold-change for enriched peptides. This distribution is shifted by `fc` (see below). By default, `a_phi = 1.25`, `b_phi = 0.1`. * `a_c`, `b_c`: shape parameters for the attenuation constant. The defaults are `a_c = 80`, `b_c = 20`. * `fc`: minimum fold-change for an enriched peptide, defaults to 1. * `a_0j`, `b0j`: peptide-specific shape parameters. For peptide $j$, `a_0j`, `b_0j` are the shape parameters for the beta distribution that describes the proportion of reads pulled presuming the peptide is not enriched. The default prior distributions for the proportion of enriched peptides in a serum sample, the fold-change for enriched peptides, and the attenuation constants are shown below. ```{r prior_plot, echo = FALSE, fig.width = 6, fig.height=2.5} data.frame(x = c(seq(0, 1, by = 0.01), seq(0, 0.1, by = 0.001), 1 + seq(0, 49, by = 0.5)), dens = c(dbeta(seq(0, 1, by = 0.01), 80, 20), dbeta(seq(0, 0.1, by = 0.001), 2, 300), dgamma(seq(0, 49, by = 0.5), 1.25, 0.1)), param = rep(c("attenuation constant", "proportion of enriched peptides", "fold-change for enriched peptides"), times = c(101, 101, 99))) |> ggplot(aes(x = x, y = dens)) + facet_wrap(param ~., nrow = 1, scales = "free", labeller = label_wrap_gen()) + labs(x = "value", y = "density") + geom_line() + theme_bw() + theme(aspect.ratio = 1) ``` Prior parameters are specified in the **`prior.params`** argument of `brew()`. Rather than supplying `a_0j` and `b_0j`, the user can alternatively specify a method of deriving `a_0j` and `b_0j` from the beads-only samples. Currently, `beer` supports MOM, MLE, and edgeR estimates for `a_0j`, `b_0j`. Additional parameters for these methods can be supplied using the **`beads.args`** parameter of `brew()`. Alternatively the user can supply custom `a_0j`, `b_0j` to `prior.params`. Note that when `a_0j` and `b_0j` are supplied, the prior parameters are not re-calculated after tossing out clearly enriched peptides. For convenience, beta parameters can be estimated using MLE or MOM given a vector of proportions using `getAB()`. For any beta distribution, the MCMC sampler may get stuck for shape parameters less than one, so if possible it is safer for beta shape parameters to be greater than one. ## Removing super enriched peptides Super enriched peptides are peptides with MLE or edgeR estimated fold-changes in comparison to beads-only samples above a given threshold, which is defaulted to 15. These peptides should always have posterior probability 1 of being enriched. The more peptides that are identified as super enriched, the faster BEER runs. However, the threshold should be conservatively set such that all peptides are guaranteed to be enriched. In `brew()`, the argument **`se.params`** controls how super enriched peptides are identified. ## Saving MCMC samples JAGS parameters, such as the number of chains, number of iteration, thinning parameters, and more are defined in **`jags.params`** and directly passed to `rjags::jags.model()` and `rjags::coda.samples()`. By default, if **`sample.dir = NULL`** in `brew()`, the samples from each MCMC run are saved in the `R` temporary directory given by `tempdir()`. Samples can be saved by specifying a non-null sample directory. The posterior samples for each serum sample are saved in an RDS files named after the serum sample ID. Is is important to note that MCMC parameters are indexed by row and column. Since super enriched peptides are tossed out _before_ running the MCMC, the row index of the parameter does not necessarily correspond to the row of the PhIPData object. For example parameter `Z[5, 1]` could in reality correspond to peptide 8 in the PhIPData object if there were three peptides before peptide 8 that were labeled as super-enriched. # Beads-only round robin To approximate the false positive rate, we often run each of the beads-only samples against all other beads-only samples. This beads-only round robin also provides a sense of how similar the beads-only samples are to each other. The beads-only round robin can be included in `brew()` and `runEdgeR()` by specifying `beadsRR = TRUE`. ```{r edgeR_beadsRR} ## edgeR with beadsRR edgeR_beadsRR <- runEdgeR(sim_data, beadsRR = TRUE, assay.names = c(logfc = "edgeR_logfc", prob = "edgeR_logpval")) ## Calculate hits assay(edgeR_beadsRR, "edgeR_hits") <- apply( assay(edgeR_beadsRR, "edgeR_logpval"), 2, function(sample){ pval <- 10^(-sample) p.adjust(pval, method = "BH") < 0.05 }) ## Note samples 1-4 have 0 instead of NA now colSums(assay(edgeR_beadsRR, "edgeR_hits")) ``` ```{r beer_beadsRR, message=FALSE, warning = FALSE, error = FALSE} ## BEER with beadsRR added to edgeR output beer_beadsRR <- brew(edgeR_beadsRR, beadsRR = TRUE, assay.names = assay_locations) ## Check BEER hits like before was_run <- matrix(rep(beer_beadsRR$group != "beads", each = nrow(beer_beadsRR)), nrow = nrow(beer_beadsRR)) are_se <- was_run & is.na(assay(beer_beadsRR, "beer_prob")) beer_hits <- assay(beer_beadsRR, "beer_prob") > 0.5 | are_se ## Note again that samples 1-4 are not NA colSums(beer_hits) ``` Alternatively, one can run `beadsRR()` separately, ```{r edgeR_beadsRR_only} ## edgeR with beadsRR edgeR_beadsRR <- beadsRR(sim_data, method = "edgeR", assay.names = c(logfc = "edgeR_logfc", prob = "edgeR_logpval")) ## Calculate hits assay(edgeR_beadsRR, "edgeR_hits") <- apply( assay(edgeR_beadsRR, "edgeR_logpval"), 2, function(sample){ pval <- 10^(-sample) p.adjust(pval, method = "BH") < 0.05 }) ## Note samples 5-10 are NA now colSums(assay(edgeR_beadsRR, "edgeR_hits")) ``` ```{r beer_beadsRR_only, message=FALSE, warning = FALSE, error = FALSE} ## BEER with beadsRR added to edgeR output beer_beadsRR <- beadsRR(edgeR_beadsRR, method = "beer", assay.names = assay_locations) ## Check BEER hits like before was_run <- matrix(rep(beer_beadsRR$group == "beads", each = nrow(beer_beadsRR)), nrow = nrow(beer_beadsRR)) are_se <- was_run & is.na(assay(beer_beadsRR, "beer_prob")) beer_hits <- assay(beer_beadsRR, "beer_prob") > 0.5 | are_se ## Note again that samples 5-10 are now NA colSums(beer_hits) ``` # Parallelization By default, `beer` runs using the first registered back-end for parallelization as returned by `BiocParallel::bpparam()`. Other parallel evaluation environments are supported via the [`BiocParallel`](https://bioconductor.org/packages/release/bioc/html/BiocParallel.html) package (see the [BiocParallel vignette](https://bioconductor.org/packages/release/bioc/vignettes/BiocParallel/inst/doc/Introduction_To_BiocParallel.pdf) for a list of possible parallelization environments). In both `brew()` and `runEdgeR`, the parallel environments are passed via the `BPPARAM` argument which takes a `BiocParallelParam` object. ```{r, warning=FALSE} ## Run edgeR using different parallel environments runEdgeR(sim_data, BPPARAM = BiocParallel::SerialParam()) runEdgeR(sim_data, BPPARAM = BiocParallel::SnowParam()) ## Run beer in parallel brew(sim_data, BPPARAM = BiocParallel::SerialParam()) brew(sim_data, BPPARAM = BiocParallel::SnowParam()) ``` # Plot Helpers To facilitate visualization of PhIP-Seq analyses, `beer` includes plot helpers that calculate values of interest and stores these values as a new assays in the `PhIPData` object. Currently, these functions include: * `getExpected()` returns expected reads/proportions pulled by a peptide based on the average proportion of reads pulled across all beads-only samples, * `getBF()`: returns Bayes factors for the probability of enriched antibody responses for each peptide. ## `getExpected()` To calculate the expected read counts and proportion of reads, we can specify `type = c("rc", "prop")` in `getExpected()`. This resulting `PhIPData` object can be converted to a `DataFrame`/`tibble` which can be plotted using any plotting method of choice. For example, we can visualize the simulated data set by plotting the observed read counts to the expected read counts as follows, We can visualize the simulated data by plotting the observed versus the expected number of reads. ```{r, echo = FALSE, warning = FALSE, fig.align='left', fig.height=5, fig.width=6, fig.cap="Observed versus expected read counts for simulated data. Each point represents a peptide in a given sample. Point in red indicate truly enriched peptides."} getExpected(sim_data) %>% as("DataFrame") %>% as_tibble() %>% group_by(peptide) %>% mutate(sample = factor(sample, 1:10), sample_name = paste0(group, " ", sample)) %>% arrange(true_Z) %>% ggplot(aes(x = expected_rc, y = counts, color = factor(true_Z))) + geom_abline(aes(intercept = 0, slope = 1), size = 1) + geom_point() + facet_wrap(sample_name ~., ncol = 4, scales = "free") + labs(x = "expected read counts", y = "observed read counts", color = "true enrichment status") + scale_x_log10() + scale_y_log10() + scale_color_manual(values = c("grey", "red"), breaks = c(0, 1), labels = c("not enriched", "enriched")) + theme_bw() + theme(legend.position = c(0.75, 0.15), axis.text.x = element_text(angle = 45, hjust = 1), aspect.ratio = 1) ``` ## `getBF()` Once `brew()` has been run on a `PhIPData` object, we can plot Bayes factors for peptide enrichment using the `getBF()` plot helper. `getBF()` returns a `PhIPData` object with Bayes factors stored as an additional assay. For example, on the simulated data set, a plot of Bayes factors by peptide can be generated as follows, ```{r, echo = FALSE, warning = FALSE, fig.align='left', fig.height=5, fig.width=6, fig.cap="Bayes factors for simulated data."} brew(sim_data, beadsRR = TRUE) %>% getBF() %>% as("DataFrame") %>% as_tibble() %>% mutate( peptide = factor(peptide, 1:10), sample = factor(sample, 1:10), sample_name = paste0(group, " ", sample), pred = ifelse(prob >= 0.5, TRUE, FALSE)) %>% ggplot(aes(x = peptide, y = bayes_factors, color = pred)) + geom_point() + facet_wrap(sample_name ~., ncol = 4) + labs(x = "peptide", y = "Bayes factor", color = "prediction") + scale_y_log10() + scale_color_manual( breaks = c(TRUE, FALSE), values = c("red", "black"), labels = c("enriched", "not enriched") ) + theme_bw() + theme(legend.position = c(0.75, 0.15), aspect.ratio = 1) ``` # `sessionInfo()` ```{r} sessionInfo() ```