--- title: "Power analysis for DNAm arrays" author: "Sean K. Maden" date: "`r format(Sys.time(), '%d %B, %Y')`" output: html_document: df_print: paged toc: yes toc_depth: '3' BiocStyle::pdf_document: toc: yes toc_depth: 3 BiocStyle::html_document: code_folding: show toc: yes tocfloat: yes package: recountmethylation bibliography: bibliography.bib vignette: > %\VignetteIndexEntry{Power analysis for DNAm arrays} %\VignetteDepends{RCurl} %\usepackage[UTF-8]{inputenc} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r chunk_settings, eval = T, echo = F} library(knitr) library(ggplot2) library(recountmethylation) knitr::opts_chunk$set(eval = FALSE, echo = TRUE, warning = FALSE, message = FALSE) ``` ```{r setup, eval = T, echo = F} # get the system load paths dpath <- system.file(package = "recountmethylation", "extdata", "pwrewas_files") # load example summary statistics dfpwr <- get(load(file.path(dpath, "dfpwr_test_pwrewas.rda"))) lpwr <- get(load(file.path(dpath, "lpwr-results_pwrewas-example.rda"))) ``` This notebook describes how to run power analyses for DNAm arrays on user-defined datasets with the `pwrEWAS` method. The original `pwrEWAS` method is available as a Bioconductor [package](https://www.bioconductor.org/packages/release/bioc/html/pwrEWAS.html). There was need to make the original method extensible to new user-provided datasets, and this vignette describes how to do this with a lightly modified power analysis function, `pwrEWAS_itable()`. # Source the revised function, `pwrEWAS_itable()` Source the revised function from GitHub with `source_url()`. This runs the script `pwrEWAS_revised.R`, producing a series of callable functions in the active R session. ```{r} revised_function_url <- paste0("https://github.com/metamaden/pwrEWAS/", "blob/master/inst/revised_functions/pwrEWAS_revised.R?raw=TRUE") devtools::source_url(revised_function_url) ``` # Generate DNAm summary statistics `pwrEWAS` requires a set of DNAm summary statistics, specifically a table with DNAm means (e.g. a column named "mu") and variances (e.g. a column named "var") to inform power analysis simulations. For this example, get DNAm summary statistics from `minfiData`'s example array data, stored as a `MethylSet`, then use `rowMeans()` and `rowVars()` to compute summaries from Beta-values. ```{r} library(minfiData) data("MsetEx") # load MethylSet ms <- MsetEx bval <- getBeta(ms) # get DNAm fractions # get the summary data frame dfpwr <- data.frame(mu = rowMeans(bval, na.rm = T), var = rowVars(bval, na.rm = T)) ``` ```{r, eval = T, message = T} head(dfpwr) ``` The particular samples used to generate the CpG probe summary statistics above can be important. Samples from a specific tissue and/or demographic may yield more relevant information from power analysis for a given experiment design task. # Run `pwrEWAS_itable()` There are numerous parameters for fine-tuning power analyses. For the demonstration runs below, set the following parameter values. ```{r} ttype <- dfpwr # tissueType mintss <- 10 # minTotSampleSize maxtss <- 1000 # maxTotSampleSize sstep <- 100 # SampleSizeSteps tdeltav <- c(0.05, 0.1, 0.2) # targetDelta dmethod <- "limma" # DMmethod fdr <- 0.05 # FDRcritVal nsim <- 20 # sims j <- 1000 # J ndmp <- 50 # targetDmCpGs detlim <- 0.01 # detectionLimit maxctau <- 100 # maxCnt.tau ncntper <- 0.5 # NcntPer ``` These effectively define the power analysis as a series of tests varying samples from a minimum of 10, to a maximum of 230, at intervals of 20 samples, for a total of 12 total max sample sizes tested with evenly distributed experiment groups. For instance, at 10 total samples each experiment group has 5 samples, etc. Further, simulations use the `limma()` function for differential methylation test, with 50 significant probes expected at an FDR significance of 0.05 from among 5000 total simulated probes. Mean DNAm differences between experiment groups are varied across 3 possible values, either 0.05, 0.1, or 0.2. Finally, each set of test parameters will be simulated 20 times. # Run power simulations with 2 cores Setting the method parameters as above, run `pwrEWAS` on multiple cores by setting the argument `core` to some value >1. But first set the seed to ensure run reproducibility. ```{r} set.seed(0) lpwr.c2 <- pwrEWAS_itable(core = 2, tissueType = ttype, minTotSampleSize = mintss, maxTotSampleSize = maxtss, SampleSizeSteps = sstep, NcntPer = ncntper, targetDelta = tdeltav, J = j, targetDmCpGs = ndmp, detectionLimit = detlim, DMmethod = dmethod, FDRcritVal = fdr, sims = nsim, maxCnt.tau = maxctau) # [2022-02-17 13:44:51] Finding tau...done [2022-02-17 13:45:06] # [1] "The following taus were chosen: 0.013671875, 0.02734375, 0.0546875" # [2022-02-17 13:45:06] Running simulation # [2022-02-17 13:45:06] Running simulation ... done [2022-02-17 13:48:23] ``` The commented status messages show the example run time was about 3:30. # Access the power analysis results Power analysis results are returned in a list of four objects called `"meanPower"` (a matrix), `"powerArray"` (an array), `"deltaArray"` (a list), and `"metric"`, (also a list). The first object, `meanPower`, shows the mean power (cell values) by total sample size (y-axis, rows, from 10 to 230) and delta DNAm difference (x-axis, columns) across simulations. The dimensions and first rows of this table are shown below. ```{r, eval = T} lpwr <- lpwr.c1 # get results from an above example mp <- lpwr[["meanPower"]] # get the mean power table ``` ```{r} dim(mp) # get the dimensions of mean power table ``` ```{r} head(mp) # get first rows of mean power table ``` The second object is `powerArray`, an array of matrices containing 720 data points. This data is used to calculate the `meanPower` summaries, which can be seen by comparing mean of the first 20 `powerArray` values (e.g. the 20 simulations where total samples is 10 and delta is 0.05) to the `meanPower` cell [1,1] corresponding to delta = 0.1, total samples = 10. ```{r, eval = T} pa <- lpwr$powerArray # get power array ``` ```{r, eval = T} length(pa) # get length of power array ``` ```{r, eval = T} mean(pa[1:20]) == mp[1,1] # compare means, power array vs. mean power table ``` The final objects show various observed values for the delta DNAm (in the `deltaArray` object), and the marginal type I error, classical power, FDR, FDC, and true positive probability (in the `metric` object). # Plot smooths with errors using `ggplot2` This section shows how to use the `ggplot2` package to generate publication-ready plot summaries of `pwrEWAS` power analysis results. To plot the full simulation results with smooths and standard errors, reformat the array of matrices in the `powerArray` object. Extract the power values according to the dimensions of our simulations (e.g. 10 sample sizes times 10 simulations times 2 deltas = 200 total simulations). Finally, coerce and harmonize power values across deltas to form a tall data frame for plotting. ```{r, eval = T} # extract power values from the array of matrices parr <- pa m1 <- data.frame(power = parr[1:200]) # first delta power values m2 <- data.frame(power = parr[201:400]) # second delta power values m3 <- data.frame(power = parr[401:600]) # third delta power values # assign total samples to power values m1$total.samples <- m2$total.samples <- m3$total.samples <- rep(seq(10, 910, 100), each = 20) # add delta labels m1$`Delta\nDNAm` <- rep("0.05", 200) m2$`Delta\nDNAm` <- rep("0.1", 200) m3$`Delta\nDNAm` <- rep("0.2", 200) # make the tall data frame for plotting dfp <- rbind(m1, rbind(m2, m3)) ``` Make the final plot using `goem_smooth()`, which uses `method=loess` here by default. You can specify other methods with the `method` argument (see `?geom_smooth` for details). Again, the horizontal line at 80% power is included for reference. ```{r, eval = T} ggplot(dfp, aes(x = total.samples, y = power, color = `Delta\nDNAm`)) + geom_smooth(se = T, method = "loess") + theme_bw() + xlab("Total samples") + ylab("Power") + geom_hline(yintercept = 0.8, color = "black", linetype = "dotted") ``` Including the standard errors lends some confidence to our findings. First, we can tell there is a great deal of separation between each of the delta models throughout all simulations except at the very lowest total sample sizes. Further, at the highest total sample size the power exceeds 80% with high confidence at delta = 0.2 (e.g. lowest standard error well above reference line), with less confidence where delta = 0.1 (e.g. lowest standard error touches reference line), and not at all where delta = 0.05 (e.g. highest standard error is well below reference line). # Next steps and further reading This vignette showed how to used a lightly modified implementation of `pwrEWAS` on a custom user-provided DNAm dataset. It showed how to generate DNAm summary statistics, use these in the `pwrEWAS_itable()` function, identify simulation and summary outcomes in returned results data, and plot simulation results with errors using `ggplot2`. Note the values of arguments like `J`, `nsim`, and `maxtss` can be increased in practice to yield a more robust power model. The values of arguments including `FDRcritVal` and `tdeltav` can further be set according to the empirical results of perliminary analyses or literature review to yield more informative results. More details about the `pwrEWAS` method, including more fuction parameter details, can be found in the function docstrings (e.g. check `?pwrEWAS`), descriptions in the Bioconductor package [documentation](https://www.bioconductor.org/packages/release/bioc/html/pwrEWAS.html) and in the main study, [@graw_pwrewas_2019](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2804-7). # Session Info ```{r, eval = T} utils::sessionInfo() ``` # Works Cited