--- title: "Effect size estimation with apeglm" date: "`r format(Sys.Date(), '%m/%d/%Y')`" author: "Anqi Zhu, Joseph G. Ibrahim, and Michael I. Love" output: rmarkdown::html_document: highlight: pygments toc: true toc_float: true abstract: | *apeglm* provides empirical Bayes shrinkage estimators for effect sizes for a variety of GLM models; *apeglm* stands for "Approximate Posterior Estimation for GLM". apeglm package version: `r packageVersion("apeglm")` bibliography: library.bib vignette: | %\VignetteIndexEntry{Effect size estimation with apeglm} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, echo=FALSE, results="hide"} knitr::opts_chunk$set(tidy=FALSE, cache=TRUE, dev="png", message=FALSE, error=FALSE, warning=TRUE) ``` # Typical RNA-seq call from DESeq2 **Note:** the typical RNA-seq workflow for users would be to call *apeglm* estimation from within the `lfcShrink` function from the *DESeq2* package. The unevaluated code chunk shows how to obtain *apeglm* shrinkage estimates after running `DESeq`. See the DESeq2 vignette for more details. The `lfcShrink` wrapper function takes care of many details below, and unifies the interface for multiple shrinkage estimators. The coefficient to shrink can be specified either by name or by number (following the order in `resultsNames(dds)`). Be aware that *DESeq2*'s `lfcShrink` interface provides LFCs on the log2 scale, while *apeglm* provides coefficients on the natural log scale. ```{r,eval=FALSE} res <- lfcShrink(dds, coef=2, type="apeglm") ``` # Example RNA-seq analysis Here we show example code which mimics what will happen inside the `lfcShrink` function when using the *apeglm* method [@Zhu2018]. Load a prepared `SummarizedExperiment`: ```{r} library(airway) data(airway) head(assay(airway)) ``` For demonstration, we will use 5000 genes of the `airway` dataset, the first from those genes with at least 10 counts across all samples. ```{r} keep <- head(which(rowSums(assay(airway)) > 0), 5000) airway <- airway[keep,] ``` First run a *DESeq2* differential expression analysis [@Love2014] (size factors, and dispersion estimates could similarly be estimated using *edgeR*): ```{r} library(DESeq2) dds <- DESeqDataSet(airway, ~cell + dex) dds$dex <- relevel(dds$dex, "untrt") dds <- DESeq(dds) res <- results(dds) ``` Defining data and parameter objects necessary for `apeglm`. We must multiply the coefficients from *DESeq2* by a factor, because *apeglm* provides natural log coefficients. Again, this would be handled inside of `lfcShrink` in *DESeq2* for a typical RNA-seq analysis. ```{r} x <- model.matrix(design(dds), colData(dds)) param <- dispersions(dds) mle <- log(2) * cbind(res$log2FoldChange, res$lfcSE) offset <- matrix(log(sizeFactors(dds)), ncol=ncol(dds), nrow=nrow(dds),byrow=TRUE) ``` # Running apeglm Here `apeglm` on 5000 genes takes less than a minute on a laptop. It scales with number of genes, the number of samples and the number of variables in the design formula, where here we have 5 coefficients (one for the four cell cultures and one for the difference due to dexamethasone treatment). We provide `apeglm` with the *SummarizedExperiment* although the function can also run on a *matrix* of counts or other observed data. We specify a `coef` as well as a `threshold` which we discuss below. Note that we multiple the `threshold` by `log(2)` to convert from log2 scale to natural log scale. ```{r} library(apeglm) system.time({ fit <- apeglm(Y=airway, x=x, log.lik=logLikNB, param=param, coef=ncol(x), threshold=log(2) * 1, mle=mle, offset=offset) }) str(fit$prior.control) ``` There are faster implementations of apeglm specifically for negative binomial likelihoods. The version `nbinomR` is ~5 times faster than the default `general`. ```{r} system.time({ fitR <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x), threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomR") }) ``` The version `nbinomCR` is ~10 times faster than the default `general`. ```{r} system.time({ fitCR <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x), threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomCR") }) ``` The version `nbinomC` returns only the MAP coefficients and can be ~50-100 times faster than the default `general`. The MAP coefficients are the same as returned by `nbinomCR` above, we just skip the calculation of posterior SD. A variant of `nbinomC` is `nbinomC*` which includes random starts. ```{r} system.time({ fitC <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x), threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomC") }) ``` Among other output, we have the estimated coefficients attached to the ranges of the *SummarizedExperiment* used as input: ```{r} class(fit$ranges) mcols(fit$ranges, use.names=TRUE) ``` We can compare the coefficients from *apeglm* with those from *DESeq2*. *apeglm* provides coefficients on the natural log scale, so we must convert to log2 scale by multiplying by `log2(exp(1))`. Note that *DESeq2*'s `lfcShrink` function converts *apeglm* coefficients to the log2 scale internally. ```{r} system.time({ res.shr <- lfcShrink(dds, coef=5) }) DESeq2.lfc <- res.shr$log2FoldChange apeglm.lfc <- log2(exp(1)) * fit$map[,5] ``` Here we plot *apeglm* estimators against *DESeq2*: ```{r apevsdeseq} plot(DESeq2.lfc, apeglm.lfc) abline(0,1) ``` Here we plot *MLE*, *DESeq2* and *apeglm* estimators against the mean of normalized counts: ```{r maplots, fig.width=9, fig.height=3.5} par(mfrow=c(1,3)) lims <- c(-8,8) hline <- function() abline(h=c(-4:4 * 2),col=rgb(0,0,0,.2)) xlab <- "mean of normalized counts" plot(res$baseMean, res$log2FoldChange, log="x", ylim=lims, main="MLE", xlab=xlab) hline() plot(res$baseMean, DESeq2.lfc, log="x", ylim=lims, main="DESeq2", xlab=xlab) hline() plot(res$baseMean, apeglm.lfc, log="x", ylim=lims, main="apeglm", xlab=xlab) hline() ``` # Specific coefficients Note that p-values and FSR define different events, and are not on the same scale. An FSR of 0.5 means that the estimated sign is as bad as random guess. ```{r pvalcomp, fig.width=8, fig.height=4} par(mfrow=c(1,2),mar=c(5,5,1,1)) plot(res$pvalue, fit$fsr, col="blue", xlab="DESeq2 pvalue", ylab="apeglm local FSR", xlim=c(0,1), ylim=c(0,.5)) abline(0,1) plot(-log10(res$pvalue), -log10(fit$fsr), xlab="-log10 DESeq2 pvalue", ylab="-log10 apeglm local FSR", col="blue") abline(0,1) ``` The s-value was proposed by @Stephens2016, as a statistic giving the aggregate false sign rate for tests with equal or lower s-value than the one considered. We recommend using a lower threshold on s-values than typically used for adjusted p-values, for example one might be interested in sets with 0.01 or 0.005 aggregate FSR. ```{r pvalcomp2} plot(res$padj, fit$svalue, col="blue", xlab="DESeq2 padj", ylab="apeglm svalue", xlim=c(0,.2), ylim=c(0,.02)) ``` More scrutiny can be applied by using an LFC threshold greater than zero, and asking for the probability of a "false-sign-or-small" (FSOS) event: that the effect size is not further from zero in distance than the threshold amount. We can run the `svalue` function on these per-gene probabilities to produce s-values that bound the FSOS rate for sets of genes. By specifying `threshold=log(2) * 1` above, `apeglm` will then output a vector `thresh` in the results list that gives the per-gene probabilities of false-sign-or-small events. ```{r maplotthresh} s.val <- svalue(fit$thresh) cols <- ifelse(s.val < .01, "red", "black") plot(res$baseMean, log2(exp(1)) * fit$map[,5], log="x", col=cols, xlab=xlab, ylab="LFC") abline(h=c(-1,0,1), col=rgb(1,0,0,.5), lwd=2) ``` # Modeling ratios of counts We also show an short example using an alternative likelihood to the negative binomial. Suppose we have allele-specific counts for n=20 vs 20 samples across 5000 genes. We can define a binomial model and test for the allelic balance across groups of samples. Here we will *simulate* allele counts from our existing dataset for demonstration. We spike in 10 genes with strong allelic imbalance (instead of an allelic ratio close to 0.5, these will have a ratio of 0.75). ```{r} library(emdbook) n <- 20 f <- factor(rep(1:2,each=n)) mu <- ifelse(res$baseMean > 50, res$baseMean, 50) set.seed(1) cts <- matrix(rnbinom(nrow(dds)*2*n, mu=mu, size=1/dispersions(dds)), ncol=2*n) theta <- runif(nrow(cts),1,1000) prob <- rnorm(nrow(cts),.5,.05) # close to 0.5 ase.cts <- matrix(rbetabinom(prod(dim(cts)), prob=prob, size=cts, theta=rep(theta,ncol(cts))), nrow=nrow(cts)) idx <- 1:10 idx2 <- which(f == 2) theta[idx] <- 1000 prob[idx] <- 0.75 # the spiked in genes have an allelic ratio of 0.75 ase.cts[idx,idx2] <- matrix(rbetabinom(length(idx)*length(idx2), prob=prob[idx], size=cts[idx,idx2], theta=theta[idx]), nrow=length(idx)) ``` We define a beta-binomial likelihood function which uses the total counts as a parameter, and the logit function as a link: ```{r} betabinom.log.lik <- function(y, x, beta, param, offset) { xbeta <- x %*% beta p.hat <- (1+exp(-xbeta))^-1 dbetabinom(y, prob=p.hat, size=param[-1], theta=param[1], log=TRUE) } ``` We first need to estimate MLE coefficients and standard errors. ```{r} theta.hat.0 <- 100 # rough estimate of dispersion param <- cbind(theta.hat.0, cts) x <- model.matrix(~f) system.time({ fit.mle <- apeglm(Y=ase.cts, x=x, log.lik=betabinom.log.lik, param=param, no.shrink=TRUE, log.link=FALSE) }) ``` ```{r mlemaplot} coef <- 2 plot(res$baseMean, fit.mle$map[,coef], log="x", xlab=xlab, ylab="log odds") points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3) ``` ```{r} theta.hat <- bbEstDisp(success=ase.cts, size=cts, x=x, beta=fit.mle$map, minDisp=1, maxDisp=500) ``` ```{r} mle <- cbind(fit.mle$map[,coef], fit.mle$sd[,coef]) param <- cbind(theta.hat, cts) system.time({ fit2 <- apeglm(Y=ase.cts, x=x, log.lik=betabinom.log.lik, param=param, coef=coef, mle=mle, threshold=0.5, log.link=FALSE) }) ``` In the `apeglm` plot, we color in red the genes with a low aggregate probability of false-sign-or-small (FSOS) events (s-value < .01), where we've again defined "small" on the log odds scale using the `threshold` argument above. ```{r asemaplot, fig.width=8, fig.height=4} par(mfrow=c(1,2)) ylim <- c(-1,1.5) s.val <- svalue(fit2$thresh) # small-or-false-sign value plot(res$baseMean, fit.mle$map[,coef], main="MLE", log="x", xlab=xlab, ylab="log odds", ylim=ylim) points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3) abline(h=0,col=rgb(1,0,0,.5)) cols <- ifelse(s.val < .01, "red", "black") plot(res$baseMean, fit2$map[,coef], main="apeglm", log="x", xlab=xlab, ylab="log odds", col=cols, ylim=ylim) points(res$baseMean[idx], fit2$map[idx,coef], col="dodgerblue", cex=3) abline(h=0,col=rgb(1,0,0,.5)) ``` ```{r} logit <- function(x) log(x/(1-x)) logit(.75) table(abs(logit(prob[s.val < .01])) > .5) ``` # Acknowledgments We have benefited in the development of `apeglm` from feedback or contributions from the following individuals: Wolfgang Huber, Cecile Le Sueur, Charlotte Soneson # Session Info ```{r} sessionInfo() ``` # References