---
title: "Computing q-values conditioned on covariates using `swfdr`"
date: "`r BiocStyle::doc_date()`"
bibliography: swfdrQ.bib
output: BiocStyle::pdf_document
vignette: >
  %\VignetteIndexEntry{Computing covariate-adjusted q-values}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

Multiple hypothesis testing is a common concern in modern data science workflows: after performing many statistical tests on a dataset, in some cases thousands, many of the resultant p-values are expected to be small purely by chance. It is therefore imperative to interpret the results in a manner that limits spurious associations. At the same time, a parallel requirement is to identify significant hits despite having performed many tests.

In addition to the classic multiple hypothesis testing problem described above, analyses of complex datasets are also complicated by covariates. Given p-values calculated using a particular statistical test, background knowledge may suggest that the test is more appropriate for some data instances than for others, or that some results are more or less expected a priori. A tempting analysis strategy in such situations might be to partition the dataset and to study each part separately. However, direct stratification is not trouble-free. It relies on splitting data into discrete groups, which introduces challenges of its own, including the choice of appropriate thresholds. It also leads to further statistical testing, which magnifies the burden of multiple hypothesis testing. Thus, an alternative, more systematic approach is needed to utilize background knowledge in statistical analyses.

There are, in principle, many possible ways to use covariate variables when tackling the multiple hypothesis testing problem. This vignette does not attempt an unbiased exposition of the available methods; some of them have been summarized and benchmarked by @korthauer2018practical. Instead, this vignette provides a practical guide to using the `swfdr` package, which implements algorithms for modeling the effect of covariates on a quantity called $\pi_0$, interpretable as the probability that a null hypothesis is true. The theoretical aspects and modeling details are provided by @JagerEtAl2013 and @BocaEtAl2015; the approach extends methods by @storey2002direct and @storey2003statistical. In practice, these methods take a set of p-values as input and output q-values, quantities that can be interpreted in the context of a multiple hypothesis analysis.

The vignette begins with notes on how to install the `swfdr` software. Next, it provides a review of q-values using the classic `qvalue` package. This section also introduces some conceptual datasets and motivates the use of covariates to augment standard q-value analysis. Subsequent sections explain how to condition q-values on covariates using `swfdr`. The main points are summarized in the discussion, and some technical details are covered in an appendix.

# Installation

The `swfdr` package can be installed via the Bioconductor package manager, @biocmanager2018package. If you don't already have the package manager, you'll first have to install it.

```{r, eval=FALSE}
install.packages("BiocManager")
```

This might show a prompt to select a package repository; any convenient choice will do. To perform the package installation, load the Bioconductor manager and request to install `swfdr`.
```{r, eval=FALSE}
library("BiocManager")
BiocManager::install("swfdr")
```

After the above command completes, we can load the package.

```{r}
library(swfdr)
```

For this vignette, we will also use the Bioconductor package `qvalue`, @qvalue2018package. It can also be installed via Bioconductor.

```{r, eval=FALSE}
BiocManager::install("qvalue")
library(qvalue)
```

```{r, echo=FALSE, eval=TRUE}
library(qvalue)
```

# q-values without covariates

Suppose we have measurements from two groups. To perform a statistical analysis, we would formulate a null hypothesis, compute a p-value against that hypothesis, and then use some criterion to call the result significant or not. The practical details may vary from application to application. For example, a common use case is to assume that measurements come from distributions with equal variances, formulate the hypothesis that the distributions have equal means, and compute p-values using a t-test. An interpretation strategy might consist of calling a result a 'hit' or 'significant' if the p-value is below a threshold, often 0.05. In this section, we will look at how to interpret results from a large set of such statistical tests and control the rate of false discoveries using q-values.

While the technical definitions for p-values and q-values are given in textbooks and original references, it is useful to highlight some of their properties here. Consider a statistical test and an associated numerical result $v$. The interpretation with respect to p-values and q-values would be as follows (@storey2003statistical).

- *p-value:* Assuming that the original data is consistent with the null hypothesis, the probability that the single result with $p=v$ is a false positive is $v$.

- *q-value:* Among the tests assigned q-values below $v$ during an analysis, the expected proportion of false positives is $v$.

There are some important conceptual and practical differences between these definitions.

- The interpretation of the p-value is focused on a single test instance, so any adjustments motivated by multiple hypothesis testing must be performed *post-hoc*. In contrast, the q-value is defined relative to an ensemble of tests and thus intrinsically refers to a multiple-hypothesis context.

- The p-value is defined from the perspective of a hypothetical ensemble of data that is consistent with the null hypothesis. The q-value is instead formulated with respect to the ensemble of performed measurements. (Technically, the properties of the q-value are valid in the limit of a large ensemble of measurements, but it is possible to replace that hypothetical ensemble by an actual set of measurements and thus work with q-value *estimates*.)
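To make the q-value interpretation concrete before moving to larger datasets, consider a handful of hypothetical p-values. Base R's `p.adjust` with the Benjamini-Hochberg method produces values that can be read as conservative q-value estimates, because the procedure implicitly assumes that all null hypotheses are true; the methods in this vignette refine that assumption by estimating the fraction of true nulls from the data. (The numbers below are invented for illustration only.)

```{r}
# hypothetical p-values from six independent tests
p_toy <- c(0.001, 0.008, 0.039, 0.041, 0.27, 0.60)
# Benjamini-Hochberg adjustment; equivalent to q-values under the
# conservative assumption that all hypotheses are null (pi0 = 1)
p.adjust(p_toy, method="BH")
```

Four of the raw p-values fall below 0.05, but only two of the adjusted values do; the adjustment guards against calling hits that are likely to arise by chance among multiple tests.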
## Conceptual datasets

To demonstrate calculations of q-values, let's work with some practical datasets. To cover various scenarios, let's define three synthetic datasets and some helper functions. Each dataset will consist of a moderate number of examples, i.e. instances requiring a statistical test.

```{r}
N <- 1000
```

In the first dataset, `dataset0`, let's consider two groups with three measurements each and formulate a null hypothesis that there is no difference in means between the two groups. Moreover, let's construct the data so that all measurements come from the same generative process. In other words, despite fluctuations in individual measurements, all instances are concordant with the null hypothesis.

```{r D0_raw, cache=TRUE}
set.seed(12345)
dataset0 <- matrix(rnorm(6*N), ncol=6)
head(dataset0, 3)
```

From this data, we can compute p-values for each set of measurements (rows) using a t-test. Because a similar calculation will be required again later, it is best to define a helper function.

```{r D0, cache=TRUE}
# compute t-tests on each row of a matrix
dataset_p <- function(D, group1=c(1,2,3), group2=c(4,5,6)) {
  t.test.groups <- function(x) {
    t.test(x[group1], x[group2])$p.value
  }
  round(apply(D, 1, t.test.groups), 5)  # rounding for legibility
}
D0 <- data.frame(p=dataset_p(dataset0), truth=0)
summary(D0$p)
```

Besides the helper function, the above block defines a data frame with p-values and an integer code `truth`; we will use this code later. The p-values range from `r min(D0$p)` to `r max(D0$p)`. Thus, although by construction all instances in `dataset0` are consistent with the null hypothesis, some p-values are nominally quite small.

For the second dataset, `dataset1`, let's again consider two groups of three measurements each, but give one of the groups a shifted mean.

```{r D1, cache=TRUE}
dataset1 <- matrix(c(rnorm(3*N, 0, 1), rnorm(3*N, 2, 1)), ncol=6)
head(dataset1, 3)
D1 <- data.frame(p=dataset_p(dataset1), truth=1)
summary(D1$p)
```

The resulting p-values span a similar range as before, but are skewed toward lower values. It is important to note that although all instances are actually inconsistent with the null hypothesis, some p-values are nominally large.

To conclude this section, let's create a final dataset that is a concatenation of the previous two. We will only need the p-values, so we can omit creating the raw data and only define the summary data frame.

```{r}
D2 <- rbind(D0, D1)
dim(D2)
table(D2$truth)
```

This combined dataset has `r 2*N` instances. The first half is concordant with the null hypothesis (`truth` equal to 0) and the second half is not (`truth` equal to 1).

## pi0 {#intro_pi0}

Given p-values from individual statistical tests, we can begin looking at their properties as a set of multiple hypotheses. Let's visualize the distributions of p-values.

```{r Dhist, fig.width=7.5, fig.height=2.2}
par(mfrow=c(1,3), las=1, mar=c(3,4,2,1))
p_hist <- function(p, breaks=20, freq=FALSE, ylim=c(0, 7.2), ...) {
  hist(p, breaks=breaks, freq=freq, ylim=ylim, xlab="", ...)
}
p_hist(D0$p, col="#666666", main="p-values in D0")
p_hist(D1$p, col="#cccccc", main="p-values in D1")
p_hist(D2$p, col="#999999", main="p-values in D2")
```

In the first histogram, based on `D0`, the distribution of p-values is flat. This is the expected pattern for data consistent with the null hypothesis. In the second panel, based on dataset `D1`, the distribution is skewed, indicating deviation from the null hypothesis. The last histogram, based on the composition of `D0` and `D1`, shows an intermediate skew. Thus, the shape of the histogram indicates the extent to which a dataset is consistent with the null hypothesis.

This intuition can be codified into a probability that the null hypothesis is true, denoted $\pi_0$ or `pi0`. It is computed from the histogram by considering the number of test instances with p-values above various thresholds, $p > \lambda$, where $\lambda$ might be 0.05, 0.1, etc. Because p-values from null instances are distributed uniformly, the density of this right-hand tail reflects the proportion of true nulls. See e.g. @storey2002direct and @storey2003statistical for details.
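To make the idea concrete, here is a minimal sketch of such an estimate at a single value of $\lambda$. The helper `pi0_single` is defined only for this illustration; it is not part of `qvalue` or `swfdr`, which combine information across many values of $\lambda$.

```{r}
# sketch of a pi0 estimate at a single lambda: the proportion of
# p-values in the right tail, rescaled by the width of that tail
pi0_single <- function(p, lambda=0.5) {
  mean(p > lambda) / (1 - lambda)
}
c(D0=pi0_single(D0$p), D1=pi0_single(D1$p), D2=pi0_single(D2$p))
```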
The `qvalue` package estimates $\pi_0$ with a more robust procedure. We can apply it to each dataset.

```{r Dqs, cache=TRUE}
library(qvalue)
D0q <- qvalue(D0$p)
D1q <- qvalue(D1$p)
D2q <- qvalue(D2$p)
c(D0=D0q$pi0, D1=D1q$pi0, D2=D2q$pi0)
```

The $\pi_0$ estimate for `D0` is unity, which is consistent with how we generated the dataset. For `D1`, $\pi_0$ is much lower. At `r round(D1q$pi0, 3)`, it is greater than the true value (zero; none of the instances in `D1` were generated in a way consistent with the null hypothesis), but it is nonetheless much closer to the truth. Unsurprisingly, the $\pi_0$ estimate for the combined dataset `D2` lies between the previous two values.

## q-value

While $\pi_0$ is an aggregate summary of an entire dataset, we are often interested in assigning significance calls to individual test instances. The q-value is a measure that controls the rate of false positive calls and internally uses $\pi_0$ to allow for true positive calls. We already computed q-values when we ran the function `qvalue` in the previous section. We can display a summary for the `D0` dataset as follows.

```{r}
summary(D0q)
```

The table displays hit counts at various thresholds. To simplify the discussion, let's look only at the rows labeled 'p-value' and 'q-value' and at the threshold of 0.05. For the following, a helper function will be useful to extract the relevant subset of the results.

```{r}
# count hits with p-values and q-values below a threshold
hit_counts <- function(x, threshold=0.05) {
  counts <- c(sum(x$pvalues < threshold), sum(x$qvalues < threshold))
  names(counts) <- paste0(c("p<", "q<"), threshold)
  counts
}
```
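As a usage sketch, the helper can be applied to the three datasets side by side. In the all-null `D0`, roughly 5% of the p-values should fall below 0.05 purely by chance, while the q-value column should show few, if any, hits.

```{r}
# hit counts at the 0.05 threshold in each of the three datasets
rbind(D0=hit_counts(D0q), D1=hit_counts(D1q), D2=hit_counts(D2q))
```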