--- title: "Confident fold change" author: "Paul Harrison" date: "`r Sys.Date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Confident fold change} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} params: cache: !r FALSE --- This document shows typical Topconfects usage with limma, edgeR, or DESeq2. The first step is to load a dataset. Here, we're looking at RNA-seq data that investigates the response of *Arabodopsis thaliana* to a bacterial pathogen. Besides the experimental and control conditions, there is also a batch effect. This dataset is also examined in section 4.2 of the `edgeR` user manual, and I've followed the initial filtering steps in the `edgeR` manual. ```{r setup, include=FALSE} knitr::opts_chunk$set(fig.width=7, fig.height=6) ``` ```{r load, message=FALSE, warning=FALSE} library(topconfects) library(NBPSeq) library(edgeR) library(limma) library(dplyr) library(ggplot2) data(arab) # Retrieve symbol for each gene info <- AnnotationDbi::select( org.At.tair.db::org.At.tair.db, rownames(arab), "SYMBOL") %>% group_by(TAIR) %>% summarize( SYMBOL=paste(unique(na.omit(SYMBOL)),collapse="/")) arab_info <- info[match(rownames(arab),info$TAIR),] %>% select(-TAIR) %>% as.data.frame rownames(arab_info) <- rownames(arab) # Extract experimental design from sample names Treat <- factor(substring(colnames(arab),1,4)) %>% relevel(ref="mock") Time <- factor(substring(colnames(arab),5,5)) y <- DGEList(arab, genes=as.data.frame(arab_info)) # Keep genes with at least 3 samples having an RPM of more than 2 keep <- rowSums(cpm(y)>2) >= 3 y <- y[keep,,keep.lib.sizes=FALSE] y <- calcNormFactors(y) ``` # limma analysis ## Standard limma analysis steps We use the voom-limma method. ```{r limma} design <- model.matrix(~Time+Treat) design[,] fit <- voom(y, design) %>% lmFit(design) ``` ## Apply topconfects Find largest fold changes that we can be confident of at FDR 0.05. ```{r limma_confects} confects <- limma_confects(fit, coef="Treathrcc", fdr=0.05) confects ``` Note: If not using `voom` or a similar precision weighting method, it may be important to use `limma_confects(..., trend=TRUE)` to account for a mean-variance trend, similar to how you would call `limma::treat(..., trend=TRUE)`. You can check the need for this with `limma::plotSA(fit)`. ## Looking at the result Here the usual `logFC` values estimated by `limma` are shown as dots, with lines to the `confect` value. ```{r fig.height=7} confects_plot(confects) ``` `confects_plot_me2` produces a Mean-Difference Plot (as might be produced by `limma::plotMD`), and colors points based on thresholds for the `confect` values. Effect sizes confidently "> 0" correspond to traditional differential expression testing, and effect sizes confidently greater than larger values correspond to the TREAT test at that threshold. I have specified the `breaks` here explicitly, to aid comparison with other methods in similar plots below. ```{r} confects_plot_me2(confects, breaks=c(0,1,2)) ``` You can see that while large estimated effect sizes are produced at low expression levels, this is mostly due to noise in the estimates. They are not *confidently* large expression levels. There is also a `confects_plot_me`, which I no longer recommend. This overlays the confects (red/blue) on a Mean-Difference Plot (grey). This proved to be confusing, as significant gene are represented by two different points. ```{r} confects_plot_me(confects) + labs(title="This plot is not recommended") ``` Let's compare this to the ranking we obtain from `topTable`. ```{r, fig.height=7} fit_eb <- eBayes(fit) top <- topTable(fit_eb, coef="Treathrcc", n=Inf) rank_rank_plot(confects$table$name, rownames(top), "limma_confects", "topTable") ``` You can see that the top 19 genes from topTable are all within the top 40 for topconfects ranking, but topconfects has also highly ranked some other genes. These have a large effect size, and sufficient if not overwhelming evidence of this. An MD-plot highlighting the positions of the top 40 genes in both rankings also illustrates the differences between these two ways of ranking genes. ```{r} plotMD(fit, legend="bottomleft", status=paste0( ifelse(rownames(fit) %in% rownames(top)[1:40], "topTable ",""), ifelse(rownames(fit) %in% confects$table$name[1:40], "confects ",""))) ``` # edgeR analysis An analysis in edgeR produces similar results. Note that only quasi-likelihood testing from edgeR is supported. ## Standard edgeR analysis ```{r edger} y <- estimateDisp(y, design, robust=TRUE) efit <- glmQLFit(y, design, robust=TRUE) ``` ## Apply topconfects A step of 0.05 is used here merely so that the vignette will build quickly. `edger_confects` calls `edgeR::glmTreat` repeatedly, which is necessarily slow. In practice a smaller value such as 0.01 should be used. ```{r edger_confects} econfects <- edger_confects(efit, coef="Treathrcc", fdr=0.05, step=0.05) econfects ``` ## Looking at the result ```{r fig.height=7} confects_plot(econfects) confects_plot_me2(econfects, breaks=c(0,1,2)) ``` ```{r} etop <- glmQLFTest(efit, coef="Treathrcc") %>% topTags(n=Inf) plotMD(efit, legend="bottomleft", status=paste0( ifelse(rownames(efit) %in% econfects$table$name[1:40], "confects ", ""), ifelse(rownames(efit) %in% rownames(etop)[1:40], "topTags ",""))) ``` # DESeq2 analysis DESeq2 does its own filtering of lowly expressed genes, so we start from the original count matrix. The initial steps are as for a normal DESeq2 analysis. ```{r message=F, warning=F} library(DESeq2) dds <- DESeqDataSetFromMatrix( countData = arab, colData = data.frame(Time, Treat), rowData = arab_info, design = ~Time+Treat) dds <- DESeq(dds) ``` ## Apply topconfects The contrast or coefficient to test is specified as in the `DESeq2::results` function. The step of 0.05 is merely so that this vignette will build quickly, in practice a smaller value such as 0.01 should be used. `deseq2_confects` calls `results` repeatedly, and in fairness `results` has not been optimized for this. ```{r} dconfects <- deseq2_confects(dds, name="Treat_hrcc_vs_mock", step=0.05) ``` DESeq2 offers shrunken estimates of LFC. This is another sensible way of ranking genes. Let's compare them to the confect values. ```{r} shrunk <- lfcShrink(dds, coef="Treat_hrcc_vs_mock", type="ashr") dconfects$table$shrunk <- shrunk$log2FoldChange[dconfects$table$index] dconfects ``` DESeq2 filters some genes, these are placed last in the table. If your intention is to obtain a ranking of all genes, you should disable this with `deseq2_confects(..., cooksCutoff=Inf, independentFiltering=FALSE)`. ```{r} table(dconfects$table$filtered) tail(dconfects$table) ``` ## Looking at the result Shrunk LFC estimates are shown in red. ```{r fig.height=7} confects_plot(dconfects) + geom_point(aes(x=shrunk, size=baseMean, color="lfcShrink"), alpha=0.75) ``` `lfcShrink` aims for a best estimate of the LFC, whereas confect is a conservative estimate. `lfcShrink` can produce non-zero values for genes which can't be said to significantly differ from zero -- it doesn't do double duty as an indication of significance -- whereas the confect value will be `NA` in this case. The plot below compares these two quantities. Only un-filtered genes are shown (see above). ```{r} filter(dconfects$table, !filtered) %>% ggplot(aes( x=ifelse(is.na(confect),0,confect), y=shrunk, color=!is.na(confect))) + geom_point() + geom_abline() + coord_fixed() + theme_bw() + labs(color="Significantly\nnon-zero at\nFDR 0.05", x="confect", y="lfcShrink using ashr") ``` # Comparing results The `rank_rank_plot` function can be used to pick differences between the different methods. It seems edgeR and DESeq2 tend to promote some genes, relative to limma. edgeR and DESeq2 are more similar to each other, which is reassuring given that they are based on similar statistical models. ```{r fig.height=7} rank_rank_plot(confects$table$name, econfects$table$name, "limma confects", "edgeR confects") rank_rank_plot(confects$table$name, dconfects$table$name, "limma confects", "DESeq2 confects") rank_rank_plot(econfects$table$name, dconfects$table$name, "edgeR confects", "DESeq2 confects") ``` --- ```{r} sessionInfo() ```