--- title: "Tutorial for `swfdr` package" author: "Jeff Leek, Simina Boca" date: "`r BiocStyle::doc_date()`" bibliography: myRefs.bib output: BiocStyle::pdf_document vignette: > %\VignetteIndexEntry{Tutorial for swfdr package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Package overview This package allows users to estimate the science-wise false discovery rate from @JagerEtAl2013, using an EM approach due to the presence of rounding and censoring. It also allows users to estimate the proportion of true null hypotheses in the presence of covariates, using a regression framework, as per @BocaEtAl2015. The package is loaded using: ```{r} library(swfdr) ``` ## Estimating the science-wise false discovery rate The science-wise false discovery rate (swfdr) is defined in @JagerEtAl2013 as the rate that published research results are false positives. It is based on using reported p-values reported in biomedical journals and assuming that, for the p-values below 0.05, those corresponding to false positives are distributed as U$(0, 0.05)$, while those corresponding to true positives are distributed as tBeta$(\alpha, \beta; 0.05)$, where $\alpha$ and $\beta$ are unknown and tBeta is a Beta distribution truncated at $0.05$. The estimation of the swfdr is complicated by truncation (e.g. reporting ``p < 0.01``) and rounding (e.g. p-values are often rounded to two significant digits). An EM algorithm is used to estimate $\alpha, \beta$, as well as the swfdr. ### Example: Estimate the swfdr based on p-values from biomedical journals We include a dataset containing 15,653 p-values from articles in 5 biomedical journals (American Journal of Epidemiology, BMJ, Jama, Lancet, New England Journal of Medicine), over 11 years (2000-2010). This is obtained from web-scraping, using the code at \url{https://github.com/jtleek/swfdr/blob/master/getPvalues.R} and is already loaded in the package. ```{r} colnames(journals_pVals) ``` A description of the variables in `journals_pVals`can be found on its help page. In particular, `pvalue` gives the p-value, `pvalueTruncated` is a flag for whether it is truncated, `year` is the year of publication, and `journal` the journal. ```{r} table(journals_pVals$year) table(journals_pVals$journal) ``` ### The `calculateSwfdr` function This function estimates the swfdr. It inputs the following parameters: * `pValues` Numerical vector of p-values * `truncated` Vector of 0s and 1s with indices corresponding to those in pValues; 1 indicates that the p-values is truncated, 0 that it is not truncated * `rounded` Vector of 0s and 1s with indices corresponding to those in pValues; 1 indicates that the p-values is rounded, 0 that it is not rounded * `pi0` Initial prior probability that a hypothesis is null (default is 0.5) * `alpha` Initial value of parameter alpha from Beta(alpha, beta) true positive distribution (default is 1) * `beta` Initial value of parameter beta from Beta(alpha, beta) true positive distribution (default is 50) * `numEmIterations` The number of EM iterations (default is 100) Given that it runs an EM algorithm, it is somewhat computationally intensive. We show an example of applying it to all the p-values from the abstracts for articles published in the American Journal of Epidemiology in 2015. First, we subset the `journals_pVals` and only consider the p-values below $0.05$, as in @JagerEtAl2013: ```{r} journals_pVals1 <- dplyr::filter(journals_pVals, year == 2005, journal == "American Journal of Epidemiology", pvalue < 0.05) dim(journals_pVals1) ``` Next, we define vectors corresponding to the truncation status and the rouding status (defined as rounding to 2 significant digits) and use these vectors, along with the vector of p-values, and the number of EM iterations, as inputs to the `calculateSwfdr` function: ```{r} tt <- data.frame(journals_pVals1)[,2] rr <- rep(0,length(tt)) rr[tt == 0] <- (data.frame(journals_pVals1)[tt==0,1] == round(data.frame(journals_pVals1)[tt==0,1],2)) pVals <- data.frame(journals_pVals1)[,1] resSwfdr <- calculateSwfdr(pValues = pVals, truncated = tt, rounded = rr, numEmIterations=100) names(resSwfdr) ``` The following values are returned: * `pi0` Final value of prior probability - estimated from EM - that a hypothesis is null, i.e. estimated swfdr * `alpha` Final value of parameter alpha - estimated from EM - from Beta(alpha, beta) true positive distribution * `beta` Final value of parameter beta - estimated from EM - from Beta(alpha, beta) true positive distribution * `z` Vector of expected values of the indicator of whether the p-value is null or not - estimated from EM - for the non-rounded p-values (values of NA represent the rounded p-values) * `n0` Expected number of rounded null p-values - estimated from EM - between certain cutpoints (0.005, 0.015, 0.025, 0.035, 0.045, 0.05) * `n` Number of rounded p-values between certain cutpoints (0.005, 0.015, 0.025, 0.035, 0.045, 0.05) ### Results from example dataset For the example dataset we considered, the results are as follows: ```{r} resSwfdr ``` Thus, the estimated swfdr for papers published in American Journal of Epidemiology in 2005 is `r round(resSwfdr$pi0,3)` i.e. `r round(resSwfdr$pi0 * 100,1)`\% of the discoveries - defined as associations with p < 0.05 - are expected to be false discoveries. ## Estimating the proportion of true null hypothesis in the presence of covariates As in @BocaEtAl2015, we denote by $\pi_0(x)$ the proportion of true null hypotheses as a function of a covariate $x$. This is estimated based on a list of p-values $p_1,\ldots,p_m$ corresponding to a set of null hypotheses, $H_{01},\ldots,H_{0m}$, and a design matrix $X$. The design matrix considers relevant meta-data, which could be valuable for improving estimatong of the prior probability that a hypothesis is true or false. ### Example: Adjust for sample size and allele frequency in BMI GWAS meta-analysis We consider an example from the meta-analysis of data from a genome-wide association study (GWAS) for body mass index (BMI) from @LockeEtAl2015. A subset of this data, corresponding to 50,000 single nucleotide polymorphisms (SNPs) is already loaded with the package. ```{r} head(BMI_GIANT_GWAS_sample) dim(BMI_GIANT_GWAS_sample) ``` A description of the variables in `BMI_GIANT_GWAS_sample` can be found on its help page. In particular, `p` gives the p-values for the association between the SNPs and BMI; `N` gives the total sample size considered in the study of a particular SNP; and `Freq_MAF_Hapmap` gives the frequency of the minor (less frequent allele) for a particular SNP in Hapmap. The column `Freq_MAF_Int_Hapmap` provides 3 approximately equal intervals for the Hapmap MAFs: ```{r} table(BMI_GIANT_GWAS_sample$Freq_MAF_Int_Hapmap) ``` ### The `lm_pi0` function This function estimates $\pi_0(x)$. It inputs the following parameters: * `pValues` Numerical vector of p-values * `lambda` Numerical vector of thresholds in $[0,1)$ at which $\pi_0(x)$ is estimated. Default thresholds are $(0.05, 0.10, \ldots, 0.95)$. * `X` Design matrix (one test per row, one variable per column). Do not include the intercept. * `type` Type of regression, "logistic" or "linear." Default is logistic. * `smooth.df` Number of degrees of freedom when estimating $\pi_0(x)$ by smoothing. Default is $3$. * `threshold` If `TRUE` (default), all estimates are thresholded at 0 and 1, if `FALSE`, none of them are. To apply it to the BMI GWAS dataset, we first create the design matrix, using natural cubic splines with 5 degrees of freedom to model `N` and 3 discrete categories for the MAFs: ```{r} X <- model.matrix(~ splines::ns(N,5) + Freq_MAF_Int_Hapmap, data = BMI_GIANT_GWAS_sample)[,-1] head(X) ``` We then run the `lm_pi0` function: ```{r} pi0x <- lm_pi0(BMI_GIANT_GWAS_sample$p, X=X) names(pi0x) ``` The following values are returned: * `pi0` Numerical vector of smoothed estimate of $\pi_0(x)$. The length is the number of rows in $X$. * `pi0.lambda` Numerical matrix of estimated $\pi_0(x)$ for each value of lambda. The number of columns is the number of tests, the number of rows is the length of lambda. * `lambda` Vector of the values of `lambda` used in calculating `pi0.lambda`. * `pi0.smooth` Matrix of fitted values from the smoother fit to the $\pi_0(x)$ estimates at each value of lambda (same number of rows and columns as `pi0.lambda`). ### Results from BMI GWAS meta-analysis example We first add the estimates of $\pi_0(x)$ for $\lambda=0.8$, $\lambda=0.9$, and the final smoothed value to the `BMI_GIANT_GWAS_sample` object: ```{r} BMI_GIANT_GWAS_sample$fitted0.8 <- pi0x$pi0.lambda[,round(pi0x$lambda,2)==0.8] BMI_GIANT_GWAS_sample$fitted0.9 <- pi0x$pi0.lambda[,round(pi0x$lambda,2)==0.9] BMI_GIANT_GWAS_sample$fitted.final.smooth <- pi0x$pi0 ``` We next create a long data frame so that we can use the plotting tools in `ggplot2`: ```{r} library(reshape2) ldf <- reshape2::melt(BMI_GIANT_GWAS_sample, id.vars=colnames(BMI_GIANT_GWAS_sample)[-grep("fitted", colnames(BMI_GIANT_GWAS_sample))], value.name = "pi0",variable.name = "lambda") ldf$lambda <- as.character(ldf$lambda) ldf$lambda[ldf$lambda=="fitted0.8"] <- "lambda=0.8" ldf$lambda[ldf$lambda=="fitted0.9"] <- "lambda=0.9" ldf$lambda[ldf$lambda=="fitted.final.smooth"] <- "final smoothed pi0(x)" head(ldf) ``` The plot of the estimates of $\pi_0(x)$ against the sample size $N$, stratified by the MAF categories can thus be obtained: ```{r, BMI_GWAS_plot} library(ggplot2) ggplot(ldf, aes(x=N, y=pi0))+ geom_line(aes(col=Freq_MAF_Int_Hapmap, linetype=lambda)) + ylab("Estimated proportion of nulls") + guides(color=guide_legend(title="MAF in HapMap CEU population"), linetype=guide_legend(title="Estimate")) ``` ## References