--- title: "BioQC Algorithm: Speeding up the Wilcoxon-Mann-Whitney Test" author: "Gregor Sturm and Jitao David Zhang" package: BioQC date: "`r Sys.Date()`" vignette: > %\VignetteIndexEntry{BioQC Alogrithm: Speeding up the Wilcoxon-Mann-Whitney Test} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} output: rmarkdown::html_vignette: self_contained: no md_document: variant: markdown_phpextra preserve_yaml: TRUE --- Supplementary Information for "Detect issue heterogenity in gene expression data with [*BioQC*](https://github.com/Accio/BioQC)" ([Jitao David Zhang](mailto:jitao_david.zhang@roche.com), Klas Hatje, Gregor Sturm, Clemens Broger, Martin Ebeling, Martine Burtin, Fabiola Terzi, Silvia Ines Pomposiello and [Laura Badi](laura.badi@roche.com)) ```{r setup, include=FALSE} options(fig_caption=TRUE) library(knitr) opts_chunk$set(out.extra='style="display:block; margin: auto"', fig.align="center") ``` In this vignette, we explain the underlaying algorithmic details of our implementation of the Wilcoxon-Mann-Whitney test. The source code used to produce this document can be found in the github repository [BioQC](https://github.com/Accio/BioQC/vignettes). *BioQC* is a R/Bioconductor package to detect tissue heterogeneity from high-throughput gene expression profiling data. It implements an efficient Wilcoxon-Mann-Whitney test, and offers tissue-specific gene signatures that are ready to use 'out of the box'. ```{r lib, warning=FALSE, message=FALSE, results="hide", include=FALSE} library(testthat) library(BioQC) library(hgu133plus2.db) ## to simulate an microarray expression dataset library(lattice) library(latticeExtra) library(gridExtra) library(gplots) library(rbenchmark) pdf.options(family="ArialMT", useDingbats=FALSE) set.seed(1887) ## list human genes humanGenes <- unique(na.omit(unlist(as.list(hgu133plus2SYMBOL)))) ## read tissue-specific gene signatures gmtFile <- system.file("extdata/exp.tissuemark.affy.roche.symbols.gmt", package="BioQC") gmt <- readGmt(gmtFile) ``` Algorithmic improvements ------------------------ The Wilcoxon-Mann-Whitney (WMW) test is a non-parametric statistical test to test if median values of two population are equal or not. Unlike the t-test, it does not require the assumption of normal distributions, which makes it more robust against noise. We improved the computational efficiency of the Wilcoxon-Mann-Whitney test in comparison to the native R implementation based on three modifications: 1. The approximative WMW-statistic (Zar, J. H. (1999). Biostatistical analysis. Pearson Education India. *pp.* 174-177) is used. The differences to the exact version are negligible for high-throughput gene expression data. 2. The core algorithm is implemented in C instead of R as programming language. 3. BioQC avoids futile expensive sorting operations. While (1) and (2) are straightforward, we elaborate (3) in the following. Let $W_{a,b}$ be the approximative WMW test of two gene vectors $a,b$, where $a$ is the gene set of interest, typically containing less than a few hundreds of genes, and $b$ is the set of all genes outside the gene set (*background genes*) typically containing $>10000$ genes. In the context of BioQC, the gene sets are referred to as *tissue signatures*. Given an $m \times n$ input matrix of gene expression data with $m$ genes and $n$ samples $s_1, \dots, s_n$, and $k$ gene sets $d_1, \dots, d_k$, the WMW-test needs to be applied for each sample $s_i, i \in 1..n$ and each gene set $d_j, j \in 1..k$. The runtime of the WMW-test is essentially determined by the sorting operation on the two input vectors. Using native R `wilcox.test`, the vectors $a$ and $b$ are sorted individually for each gene set. However, in the context of gene set analysis, this is futile, as the (large) background set changes insignificantly in relation to the (small) gene set, when testing different gene sets on the same sample. Therefore, we approximate the WMW-test by extending $b$ to all genes in the sample, keeping the background unchanged when testing multiple gene sets. Like this, $b$ has to be sorted only once per sample. The individual gene sets still need to be sorted, which is not a major issue, as they are small in comparison to the set of background genes.
Time benchmark -------------- To demonstrate BioQC's superior performance, we apply both BioQC and the native R `wilcox.test` to random expression matrices and measure the runtime. We setup random expression matrices of `r length(humanGenes)` human protein-coding genes of 1, 5, 10, 50, or 100 samples. Genes are $i.i.d$ distributed following $\mathcal{N}(0,1)$. The native R and the *BioQC* implementations of the Wilcoxon-Mann-Whitney test are applied to the matrices respectively. ```{r time_benchmark, echo=FALSE} randomMatrix <- function(rows=humanGenes, ncol=5) { nrow <- length(rows) mat <- matrix(rnorm(nrow*ncol), nrow=nrow, byrow=TRUE) rownames(mat) <- rows return(mat) } noSamples <- c(1, 5, 10, 20, 50, 100) noBenchRep <- 100 tmRandomMats <- lapply(noSamples, function(x) randomMatrix(ncol=x)) tissueInds <- sapply(gmt, function(x) match(x$genes, humanGenes)) wmwTestR <- function(matrix, indices, alternative) { res <- apply(matrix, 2, function(x) { sapply(indices, function(index) { sub <- rep(FALSE, length(x)) sub[index] <- TRUE wt <- wilcox.test(x[sub], x[!sub], alternative=alternative, exact=FALSE) return(wt$p.value) }) }) return(res) } ## WARNING: very slow (~1-2 hours) benchmarkFile <- "simulation-benchmark.RData" if(!file.exists(benchmarkFile)) { bioqcRes <- lapply(tmRandomMats, function(mat) { bench <- benchmark(wmwTestRes<- wmwTest(mat, tissueInds, valType="p.greater", simplify=TRUE), replications=noBenchRep) elapTime <- c("elapsed"=bench$elapsed, "user"=bench$user.self, "sys"=bench$sys.self)/noBenchRep res <- list(elapTime=elapTime, wmwTestRes=wmwTestRes) return(res) }) rRes <- lapply(tmRandomMats, function(mat) { elapTime <- system.time(wmwTestRes <- wmwTestR(mat, tissueInds, alternative="greater")) res <- list(elapTime=elapTime, wmwTestRes=wmwTestRes) return(res) }) save(bioqcRes, rRes, file=benchmarkFile) } else { load(benchmarkFile) } getWmwTestRes <- function(x) x$wmwTestRes rNumRes <- lapply(rRes, getWmwTestRes) bioqcNumRes <- lapply(bioqcRes, getWmwTestRes) ``` The numeric results of both implementations, `bioqcNumRes` (from *BioQC*) and `rNumRes` (from *R*), are equivalent, as shown by the next command. ```{r time_benchmark_identical} expect_equal(bioqcNumRes, rNumRes) ``` The *BioQC* implementation is more than 500 times faster: while it takes about one second for BioQC to calculate enrichment scores of all `r length(gmt)` signatures in 100 samples, the native R implementation takes about 20 minutes: ```{r trellis_prepare, echo=FALSE} op <- list(layout.widths = list(left.padding = 0, key.ylab.padding = 0.5, ylab.axis.padding = 0.5, axis.right = 0.5, right.padding = 0), layout.heights = list(top.padding = 0, bottom.padding = 0, axis.top = 0, main.key.padding = 0.5, key.axis.padding = 0.5), axis.text=list(cex=1.2), par.xlab.text=list(cex=1.4), par.sub.text=list(cex=1.4), add.text=list(cex=1.4), par.ylab.text=list(cex=1.4)) ``` ```{r time_benchmark_vis, echo=FALSE, fig.width=8, fig.height=4.5, dev='svg', dev.args=list(pointsize=2.5), fig.cap="**Figure 2**: Time benchmark results of BioQC and R implementation of Wilcoxon-Mann-Whitney test. Left panel: elapsed time in seconds (logarithmic Y-axis). Right panel: ratio of elapsed time by two implementations. All results achieved by a single thread on in a RedHat Linux server."} getTimeRes <- function(x) x$elapTime["elapsed"] bioqcTimeRes <- sapply(bioqcRes, getTimeRes) rTimeRes <- sapply(rRes, getTimeRes) timeRes <- data.frame(NoSample=noSamples, Time=c(bioqcTimeRes, rTimeRes), Method=rep(c("BioQC", "NativeR"), each=length(noSamples))) timeXY <- xyplot(Time ~ NoSample, group=Method, data=timeRes, type="b", auto.key=list(columns=2L), xlab="Number of samples", ylab="Time [s]", par.settings=list(superpose.symbol=list(cex=1.25, pch=16, col=c("black", "red")), superpose.line=list(col=c("black", "red"))), scales=list(tck=c(1,0), alternating=1L, x=list(at=noSamples), y=list(log=2, at=10^c(-2, -1, 0,1,2,3, log10(2000)), labels=c(0.01, 0.1, 1, 10,100,1000, 2000)))) timeFactor <- with(timeRes, tapply(1:nrow(timeRes), list(NoSample), function(x) { bioqcTime <- subset(timeRes[x,], Method=="BioQC")$Time rTime <- subset(timeRes[x,], Method=="NativeR")$Time rTime/bioqcTime })) timeFactor.yCeiling <- max(ceiling(timeFactor/500))*500 timeFactorBar <- barchart(timeFactor ~ noSamples, horizontal=FALSE, xlab="Number of samples", ylab="Ratio of elapsed time [R/BioQC]", ylim=c(-20, timeFactor.yCeiling+50), col=colorRampPalette(c("lightblue", "navyblue"))(length(noSamples)), scales=list(tck=c(1, 0), alternating=1L, y=list(at=seq(0,timeFactor.yCeiling, by=500)), x=list(at=seq(along=timeFactor), labels=noSamples))) grid.arrange(timeXY, timeFactorBar, ncol=2) ``` Conclusion ---------- We have shown that *BioQC* achieves identical results as the native implementation in two orders of magnitude less time. This renders *BioQC* a highly efficient tool for quality control of large-scale high-throughput gene expression data. R Session Info ---------------- ```{r session_info} sessionInfo() ```