--- title: "Differential Expression Analysis with Long Read RNA-Seq Data" author: - name: "Ziyang Liu" affiliation: "University of Arizona" email: "jacobleo773@gmail.com" - name: "Hongxu Ding" affiliation: "University of Arizona" email: "hongxuding@arizona.edu" package: LRDE output: BiocStyle::html_document: toc: true toc_float: true number_sections: true vignette: > %\VignetteIndexEntry{Introduction to LRDE} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} library(BiocStyle) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction `LRDE` implements a **Hurdle Negative Binomial (Hurdle-NB) generalized linear model (GLM)** tailored for long-read RNA-seq data. Compared to short-read technologies, long-read sequencing often involves **smaller sample sizes** and **substantial expression dropout** (i.e., excess zeros), posing challenges for standard differential expression methods. This package is designed to address these issues by combining a flexible two-part model with stabilized parameter estimation. # Statistical Methodology The core contribution of `LRDE` is the integration of a hurdle model with **empirical Bayes (EB) shrinkage**, enabling robust and stable inference across genes. ## 1. Hurdle Model For each gene gene $g$ and sample $i$, the observed count $y_{ig}$ is modeled using a two-component framework: - **Zero component:** A Bernoulli distribution models the probability of observing a structural zero (dropout). - **Count component:** Conditional on being non-zero, counts follow a **zero-truncated Negative Binomial distribution**, capturing overdispersed expression levels. This formulation explicitly separates biological absence from technical dropout, improving interpretability and model fit. ## 2. Information Borrowing via Empirical Bayes Long-read RNA-seq data are often noisy, particularly for lowly expressed genes. To mitigate this, `LRDE` borrows strength across genes using an empirical Bayes strategy: - **Binning:** Genes are grouped into bins according to similar mean expression levels. - **Prior estimation:** Within each bin, empirical priors are estimated for: - the dispersion parameter ($\phi$) - the zero-component probability (treated as fixed during tagwise estimation and differential testing) - **Shrinkage:** Gene-specific (tag-wise) estimates are shrunk toward their bin-specific priors, resulting in more stable dispersion estimates and improved statistical power. ## 3. Differential Expression Testing `LRDE` provides two complementary approaches for differential expression analysis: - **Likelihood Ratio Test (`hurdle_LRT()`)** Recommended for general hypothesis testing and comparison of nested models. - **Wald Test (`hurdle_Wald_Test()`)** A computationally efficient alternative for testing individual coefficients. # Data Structure The main object returned by `LRDE` is a list containing the following components: - `counts`: normalized count matrix - `group`: sample group labels - `size.factors`: normalization factors - `tagwise.disp`: gene-wise dispersion estimates - `zero_prob_matrix`: estimated zero probabilities for each group - `lrt_stats`: likelihood ratio test statistics - `wald_stats`: Wald test statistics - `p.values`: p-values for differential expression These components can be accessed directly using standard list indexing. `LRDE` also supports input from `r Biocpkg("SummarizedExperiment")` objects, enabling integration with standard Bioconductor workflows. # Example Workflow This section demonstrates a minimal end-to-end workflow for differential expression analysis using `LRDE`. ```{r usage example} # Load the package suppressPackageStartupMessages(library(LRDE)) set.seed(123) # 1. Simulate a count matrix (Negative Binomial) mat <- matrix(rnbinom(300, size = 5, mu = 5), nrow = 50) # Define sample groups grp <- factor(c("A", "A", "A", "B", "B", "B")) # 2. Prepare the data object y <- prepareDGE(mat, grp) # 3. Estimate size factors for normalization y <- sizeFactorsEst(y) # 4. Estimate tag-wise dispersions y <- tagwiseEst(y) # 5a. Differential expression testing: Wald test y <- hurdle_Wald_Test(y) # 5b. Differential expression testing: Likelihood Ratio Test y <- hurdle_LRT(y) # 6. Inspect results head(y$lrt_stats) # LRT statistics per gene head(y$wald_stats) # Wald statistics per gene head(y$p.values) # p-values head(y$tagwise.disp) # estimated dispersions str(y, max.level = 1) # Data structure visualization ``` # Interpreting Results `LRDE` returns gene-level statistics for differential expression analysis, including test statistics and p-values. Genes with small p-values provide evidence against the null hypothesis of no differential expression between groups. ## Multiple Testing Correction Since thousands of genes are typically tested simultaneously, it is important to control for multiple comparisons. A common approach is to control the **false discovery rate (FDR)** using the Benjamini–Hochberg method: ```{r} padj <- p.adjust(y$p.values, method = "BH") head(padj) ``` Genes with adjusted p-values below a chosen threshold (e.g., FDR < 0.05) are typically considered significantly differentially expressed. # Supported Input Format In addition to standard `matrix` and `data.frame`, `prepareDGE` supports `r Biocpkg("SummarizedExperiment")` objects, allowing `LRDE` to integrate seamlessly into existing Bioconductor workflows. ```{r SE_input} # 1. Create a SummarizedExperiment object suppressPackageStartupMessages(library(SummarizedExperiment)) set.seed(123) # Simulate data counts_mat <- matrix(rnbinom(300, size = 5, mu = 5), nrow = 50) colnames(counts_mat) <- paste0("Sample", 1:6) group_info <- factor(c("A", "A", "A", "B", "B", "B")) # Construct SE with colData se <- SummarizedExperiment( assays = list(counts = counts_mat), colData = data.frame(condition = group_info) ) # 2. Start the LRDE pipeline using the SE object # Extract the condition directly from colData y_se <- prepareDGE(se, group = colData(se)$condition) # 3. Standard downstream workflow y_se <- sizeFactorsEst(y_se) y_se <- tagwiseEst(y_se) y_se <- hurdle_LRT(y_se) # View results head(y_se$lrt_stats) ``` # Session Information ```{r sessionInfo, echo=FALSE} sessionInfo() ```