---
title: "PowerExplorer Manual"
author: "Xu Qiao, Laura Elo"
date: "`r Sys.Date()`"
output: rmarkdown::pdf_document
toc: true
vignette: >
  %\VignetteIndexEntry{PowerExplorer Manual}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
\newpage

# Abstract
This vignette demonstrates R package `PowerExplorer` as a power and sample size
estimation tool for RNA-Seq and quantitative proteomics data.

`PowerExplorer` contains the following main features:

- Estimation of power based on the current data
- Prediction of power corresponding to the increased sample sizes
- Result visualizations

# Introduction
Power and sample size estimation is one of the important principles in designing
next-generation sequencing experiments to discover differential expressions.
`PowerExplorer` is a power estimation and prediction tool currently applicable
to RNA-Seq and quantitative proteomics experiments.

The calculation procedure starts with estimating the distribution parameters of
each gene or protein. With the obtained prior distribution of each feature, 
a specified amount of simulations are executed to generate data (read counts 
for RNA-Seq and protein abundance for proteomics) repetitively for each entry 
based on null and alternative hypotheses. Furthermore, the corresponding 
statistical tests (t-test or Wald-test) are performed and the test statistics 
are collected. Eventually the statistics will be summarized to calculate 
the statistical power.

\newpage

# Input Data Preparation

For both RNA-Seq (gene expression levels) and quantitative proteomics (protein
abundance levels) datasets, the data matrix should be arranged as 
genes/proteins in rows and samples in columns. Here we show a RNA dataset
as an example:

```{r showDataExample, message=FALSE, warning=FALSE}
library(PowerExplorer)
data("exampleProteomicsData")
head(exampleProteomicsData$dataMatrix)
```

A grouping vector indicating the sample groups to which all the samples belong
should also be created, for example:
```{r}
show(exampleProteomicsData$groupVec)
```
The sample groups corresponding to the data:
```{r}
colnames(exampleProteomicsData$dataMatrix)
```

Note that the grouping vector length should be equal to the column number of the
data matrix.

\newpage

# Power Estimation

Here we use a randomly generated Proteomics dataset `exampleProteomicsData` 
as an example to estimate the current power of the dataset. 
The input dataset is named as `dataMatrix` and the grouping vector 
as `groupVec`.

To run the estimation, apart from the input, we still need to specify the
following parameters:

- `isLogTransformed`: FALSE; the input data is not log-transformed.
- `dataType`: "Proteomics"; the datatype can be declared as "Proteomics" or
"RNA-Seq".
- `minLFC`: 0.5; the threshold of Log2 Fold Change, proteins with lower LFC 
will be discarded.
- `enableROTS`: TRUE; Using Reproducibility-Optimized Test Statistic 
(ROTS) as the statistical model.
- `paraROTS`: the parameters to be passed to ROTS (if enabled).
Check ROTS documentation for further details on the parameters.
- `alpha`: 0.05; the controlled false positive (Type I Error) rate.
- `ST`: 50; the simulation of each gene will be run 50 times (ST>50 is
  recommended).
- `seed`: 345; optional, a seed value for the random number generator to
maintain the reproducibility.
- `showProcess`: FALSE; no detailed processes will be shown, set to TRUE if
debug is needed.
- `saveSimulatedData`: FALSE; if TRUE, save the simulated data in ./savedData
directory.

The results will be summaried in barplot, boxplot and summary table.

```{r est run, echo=TRUE, fig.keep='none', message=FALSE, warning=FALSE}
library(PowerExplorer)
data("exampleProteomicsData")
res <- estimatePower(inputObject = exampleProteomicsData$dataMatrix,
                     groupVec = exampleProteomicsData$groupVec,
                     isLogTransformed = FALSE,
                     dataType = "Proteomics",
                     minLFC = 0.5,
                     enableROTS = TRUE,
                     paraROTS = list(B = 1000, K = NULL),
                     alpha = 0.05,
                     ST = 50,
                     seed = 345, 
                     showProcess = FALSE, 
                     saveResultData = FALSE
                     )
```


\newpage

## Visualization
The estimated results can be summarized using `plotEstPwr`, the only input
needed is the `estimatedPower`, which should be the estimated power object
returned from  `estimatePower`.

```{r plot estimated power}
plotEstPwr(res)
```

The graph contains 3 plots, the `barplot` vertically shows the number of
genes/proteins above the minLFC threshold, columns indicates the comparison
pairs, each column presents the proportions of three power levels in three
colours as indicated in the legend `power.level`; The boxplot shows the overall
power distribution of each comparsion; And the summary table summarize the power
in a numerical way with the same information shown in the previous two plots.

\newpage

## Result Summary

With the result `PowerExplorerStorage` object, summarized information can be shown by
`show` method.
```{r}
res
```

If interested in specific genes/proteins or a ranking list, one can use 
`listEstPwr` with the following parameters:

- `inputObject`: A PowerExplorerStorage returned from PowerExplorer as input
- `decreasing`: logical; TRUE, decreasing order; FALSE, increasing order.
- `top`: numeric; the number of genes/proteins in the top list
- `selected`: default as NA; specify as a list of geneID or protein ID 
to show power of a list of interested genes/proteins.

To show the top 10 genes in an example result object `exampleObject`
in decreasing order:

```{r}
data(exampleObject)
listEstPwr(exampleObject, decreasing = TRUE, top = 10)
```

To show the results of specific genes:

```{r}
listEstPwr(exampleObject, 
           selected = c("ENSMUSG00000000303", 
                         "ENSMUSG00000087272", 
                         "ENSMUSG00000089921"))
```

\newpage

# Power Predictions
With the same dataset, to run a prediction, a different parameter is needed:

- `rangeSimNumRep`: the power of replicate number 5 to 20 will be predicted.

Similar to the estimation process, however, the simulations will be excuted with
each sample size specified in `rangeSimNumRep`. (Note: the term sample size in
this vignette refers to the replicate number of each group/case)

It is possible to append the prediction results within the same object by using
the same result object as an input.
```{r prediction run2, message=FALSE, warning=FALSE}
data("exampleProteomicsData")
res <- predictPower(inputObject = res,
                    groupVec = exampleProteomicsData$groupVec,
                    isLogTransformed = FALSE,
                    dataType = "Proteomics",
                    minLFC = 0.5,
                    rangeSimNumRep = c(5, 10, 15, 20),
                    enableROTS = TRUE,
                    paraROTS = list(B = 1000, K = NULL),
                    alpha = 0.05,
                    ST = 50,
                    seed = 345)
```

\newpage

## Visualization

The predicted results can be summaried using `plotPredPwr`. The input should be
the predicted power object returned from `predictPower`, the summary can be
optionally visualized by setting the following parameters:

- `inputObject`: A PowerExplorerStorage returned from PowerExplorer as input
- `minLFC` and `maxLFC`: to observe power in a specific range of LFC
- `LFCscale`: to determine the LFC scale of the observation

Lineplot (LFCscale = 0.5):
```{r LinePlot}
plotPredPwr(res, LFCscale = 0.5)
```

The output figure contains a lineplot and a summary table. 
For each comparison, the lineplot shows the power tendency across 
every Log2 Fold Change segment resulted from a complete LFC list divided 
by a specified `LFCscale`. Each dot on the lines represents the average
power (y-axis) of the genes/proteins at a certain sample size (x-axis) 
within different  LFC ranges. In addition, a summary table below displays 
the average power of each comparison across the sample sizes.

For instance, the line plot here shows the average power at four different 
sample sizes (5 to 30, with increment of 5) in LFCscale of 0.5. 
The LFC ranges from 0 to 5, and within each LFC segment, the graph shows the 
average power of the genes/proteins. Here, the higher LFC shows higher power, 
the average power of each LFC range increases with the larger sample sizes, 
as expected.

\newpage

## Result Summary

With the result `PowerExplorerStorage` object, summarized information can be shown by
`show` method. Both estimated and predicted results can be summaried.
```{r}
res
```

If interested in specific genes/proteins or a ranking list of predicted 
powerat each sample size, one can use `listPrePwr` with the following 
parameters:

- `inputObject`: A PowerExplorerStorage returned from PowerExplorer as input
- `decreasing`: logical; TRUE, decreasing order; FALSE, increasing order.
- `top`: numeric; the number of genes/proteins in the top list
- `selected`: default as NA; specify as a list of geneID or protein ID 
to show power of a list of interested genes/proteins.

To show the top 10 genes in an example result object `exampleObject`
in decreasing order at each sample size:

```{r}
listPredPwr(exampleObject, decreasing = TRUE, top = 10)
```

To show the results of specific genes at each sample size:

```{r}
listPredPwr(exampleObject, 
            selected = c("ENSMUSG00000000303", 
                         "ENSMUSG00000087272", 
                         "ENSMUSG00000089921"))
```

# Parallel computation

The calculation may take much longer time when an input dataset contains more than 
thousands of features, especially for the power prediction process. 
The computational time can be significantly shortened by using parallelised 
computation, and the simulations will be distributed to multiple cores. 
This can be done by loading Bioconductor pacakge BiocParallel and then set the 
following arguments of `estimatePower` and `predictPower`: 
`parallel=TRUE` and `BPPARAM=bpparam()`.
This will distribute the jobs to all the available cores.
One can register the number of cores to be used by 
setting `BPPARAM=MulticoreParam(4)`,
for instance, distributing simulations (jobs) to 4 cores. 
However, `MulticoreParam()` only supports non-Windows platforms. 
For Windows platforms, one can use `SnowParam()` instead. For further details,
please check the BiocParallel documentation.