--- title: "PowerExplorer Manual" author: "Xu Qiao, Laura Elo" date: "`r Sys.Date()`" output: rmarkdown::pdf_document toc: true vignette: > %\VignetteIndexEntry{PowerExplorer Manual} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` \newpage # Abstract This vignette demonstrates R package `PowerExplorer` as a power and sample size estimation tool for RNA-Seq and quantitative proteomics data. `PowerExplorer` contains the following main features: - Estimation of power based on the current data - Prediction of power corresponding to the increased sample sizes - Result visualizations # Introduction Power and sample size estimation is one of the important principles in designing next-generation sequencing experiments to discover differential expressions. `PowerExplorer` is a power estimation and prediction tool currently applicable to RNA-Seq and quantitative proteomics experiments. The calculation procedure starts with estimating the distribution parameters of each gene or protein. With the obtained prior distribution of each feature, a specified amount of simulations are executed to generate data (read counts for RNA-Seq and protein abundance for proteomics) repetitively for each entry based on null and alternative hypotheses. Furthermore, the corresponding statistical tests (t-test or Wald-test) are performed and the test statistics are collected. Eventually the statistics will be summarized to calculate the statistical power. \newpage # Input Data Preparation For both RNA-Seq (gene expression levels) and quantitative proteomics (protein abundance levels) datasets, the data matrix should be arranged as genes/proteins in rows and samples in columns. Here we show a RNA dataset as an example: ```{r showDataExample, message=FALSE, warning=FALSE} library(PowerExplorer) data("exampleProteomicsData") head(exampleProteomicsData$dataMatrix) ``` A grouping vector indicating the sample groups to which all the samples belong should also be created, for example: ```{r} show(exampleProteomicsData$groupVec) ``` The sample groups corresponding to the data: ```{r} colnames(exampleProteomicsData$dataMatrix) ``` Note that the grouping vector length should be equal to the column number of the data matrix. \newpage # Power Estimation Here we use a randomly generated Proteomics dataset `exampleProteomicsData` as an example to estimate the current power of the dataset. The input dataset is named as `dataMatrix` and the grouping vector as `groupVec`. To run the estimation, apart from the input, we still need to specify the following parameters: - `isLogTransformed`: FALSE; the input data is not log-transformed. - `dataType`: "Proteomics"; the datatype can be declared as "Proteomics" or "RNA-Seq". - `minLFC`: 0.5; the threshold of Log2 Fold Change, proteins with lower LFC will be discarded. - `enableROTS`: TRUE; Using Reproducibility-Optimized Test Statistic (ROTS) as the statistical model. - `paraROTS`: the parameters to be passed to ROTS (if enabled). Check ROTS documentation for further details on the parameters. - `alpha`: 0.05; the controlled false positive (Type I Error) rate. - `ST`: 50; the simulation of each gene will be run 50 times (ST>50 is recommended). - `seed`: 345; optional, a seed value for the random number generator to maintain the reproducibility. - `showProcess`: FALSE; no detailed processes will be shown, set to TRUE if debug is needed. - `saveSimulatedData`: FALSE; if TRUE, save the simulated data in ./savedData directory. The results will be summaried in barplot, boxplot and summary table. ```{r est run, echo=TRUE, fig.keep='none', message=FALSE, warning=FALSE} library(PowerExplorer) data("exampleProteomicsData") res <- estimatePower(inputObject = exampleProteomicsData$dataMatrix, groupVec = exampleProteomicsData$groupVec, isLogTransformed = FALSE, dataType = "Proteomics", minLFC = 0.5, enableROTS = TRUE, paraROTS = list(B = 1000, K = NULL), alpha = 0.05, ST = 50, seed = 345, showProcess = FALSE, saveResultData = FALSE ) ``` \newpage ## Visualization The estimated results can be summarized using `plotEstPwr`, the only input needed is the `estimatedPower`, which should be the estimated power object returned from `estimatePower`. ```{r plot estimated power} plotEstPwr(res) ``` The graph contains 3 plots, the `barplot` vertically shows the number of genes/proteins above the minLFC threshold, columns indicates the comparison pairs, each column presents the proportions of three power levels in three colours as indicated in the legend `power.level`; The boxplot shows the overall power distribution of each comparsion; And the summary table summarize the power in a numerical way with the same information shown in the previous two plots. \newpage ## Result Summary With the result `PowerExplorerStorage` object, summarized information can be shown by `show` method. ```{r} res ``` If interested in specific genes/proteins or a ranking list, one can use `listEstPwr` with the following parameters: - `inputObject`: A PowerExplorerStorage returned from PowerExplorer as input - `decreasing`: logical; TRUE, decreasing order; FALSE, increasing order. - `top`: numeric; the number of genes/proteins in the top list - `selected`: default as NA; specify as a list of geneID or protein ID to show power of a list of interested genes/proteins. To show the top 10 genes in an example result object `exampleObject` in decreasing order: ```{r} data(exampleObject) listEstPwr(exampleObject, decreasing = TRUE, top = 10) ``` To show the results of specific genes: ```{r} listEstPwr(exampleObject, selected = c("ENSMUSG00000000303", "ENSMUSG00000087272", "ENSMUSG00000089921")) ``` \newpage # Power Predictions With the same dataset, to run a prediction, a different parameter is needed: - `rangeSimNumRep`: the power of replicate number 5 to 20 will be predicted. Similar to the estimation process, however, the simulations will be excuted with each sample size specified in `rangeSimNumRep`. (Note: the term sample size in this vignette refers to the replicate number of each group/case) It is possible to append the prediction results within the same object by using the same result object as an input. ```{r prediction run2, message=FALSE, warning=FALSE} data("exampleProteomicsData") res <- predictPower(inputObject = res, groupVec = exampleProteomicsData$groupVec, isLogTransformed = FALSE, dataType = "Proteomics", minLFC = 0.5, rangeSimNumRep = c(5, 10, 15, 20), enableROTS = TRUE, paraROTS = list(B = 1000, K = NULL), alpha = 0.05, ST = 50, seed = 345) ``` \newpage ## Visualization The predicted results can be summaried using `plotPredPwr`. The input should be the predicted power object returned from `predictPower`, the summary can be optionally visualized by setting the following parameters: - `inputObject`: A PowerExplorerStorage returned from PowerExplorer as input - `minLFC` and `maxLFC`: to observe power in a specific range of LFC - `LFCscale`: to determine the LFC scale of the observation Lineplot (LFCscale = 0.5): ```{r LinePlot} plotPredPwr(res, LFCscale = 0.5) ``` The output figure contains a lineplot and a summary table. For each comparison, the lineplot shows the power tendency across every Log2 Fold Change segment resulted from a complete LFC list divided by a specified `LFCscale`. Each dot on the lines represents the average power (y-axis) of the genes/proteins at a certain sample size (x-axis) within different LFC ranges. In addition, a summary table below displays the average power of each comparison across the sample sizes. For instance, the line plot here shows the average power at four different sample sizes (5 to 30, with increment of 5) in LFCscale of 0.5. The LFC ranges from 0 to 5, and within each LFC segment, the graph shows the average power of the genes/proteins. Here, the higher LFC shows higher power, the average power of each LFC range increases with the larger sample sizes, as expected. \newpage ## Result Summary With the result `PowerExplorerStorage` object, summarized information can be shown by `show` method. Both estimated and predicted results can be summaried. ```{r} res ``` If interested in specific genes/proteins or a ranking list of predicted powerat each sample size, one can use `listPrePwr` with the following parameters: - `inputObject`: A PowerExplorerStorage returned from PowerExplorer as input - `decreasing`: logical; TRUE, decreasing order; FALSE, increasing order. - `top`: numeric; the number of genes/proteins in the top list - `selected`: default as NA; specify as a list of geneID or protein ID to show power of a list of interested genes/proteins. To show the top 10 genes in an example result object `exampleObject` in decreasing order at each sample size: ```{r} listPredPwr(exampleObject, decreasing = TRUE, top = 10) ``` To show the results of specific genes at each sample size: ```{r} listPredPwr(exampleObject, selected = c("ENSMUSG00000000303", "ENSMUSG00000087272", "ENSMUSG00000089921")) ``` # Parallel computation The calculation may take much longer time when an input dataset contains more than thousands of features, especially for the power prediction process. The computational time can be significantly shortened by using parallelised computation, and the simulations will be distributed to multiple cores. This can be done by loading Bioconductor pacakge BiocParallel and then set the following arguments of `estimatePower` and `predictPower`: `parallel=TRUE` and `BPPARAM=bpparam()`. This will distribute the jobs to all the available cores. One can register the number of cores to be used by setting `BPPARAM=MulticoreParam(4)`, for instance, distributing simulations (jobs) to 4 cores. However, `MulticoreParam()` only supports non-Windows platforms. For Windows platforms, one can use `SnowParam()` instead. For further details, please check the BiocParallel documentation.