This vignette introduces cypress (cell-type-specific power assessment), an R package designed to perform comprehensive cell-type-specific power assessment for differential expression in RNA-sequencing experiments. It accepts real bulk RNA-seq data as input for parameter estimation and simulation. The tool provides flexibility by allowing users to customize sample sizes, number of cell types, and log2 fold change values. It computes statistical power, true discovery rate (TDR), and false discovery cost (FDC) under different scenarios. The package also offers functions to generate basic line plots illustrating stratified power, TDR, and FDC.
cypress 0.99.1
Currently, there is a lack of experimental design and statistical power assessment tools for RNA-sequencing experiments, and clinical researchers face difficulties in applying cell-type-specific differentially expressed (csDE) gene detection methods in practice. One major difficulty is determining the sample sizes required for experimental designs under various scenarios and for grant applications. In a real RNA-sequencing study, effect size, variance, baseline expression level, and biological variation all affect sample size determination.
Here we develop a simulation-based method to obtain the relationship between power and sample size. Users can provide their own bulk-level RNA-sequencing data, from which the package estimates distribution parameters. Alternatively, the package includes three predefined sets of distributional parameters from which users can choose. The RNA-sequencing data simulation framework is based on a Gamma-Poisson compound distribution and is borrowed from a previous benchmark study. Cell-type-specific differentially expressed (csDE) gene calling is implemented by TOAST. We stratify genes into distinct strata based on their expression levels and evaluate the power within each stratum. We particularly focus on the impact of sample size and sequencing depth, giving special consideration to genes with a low signal-to-noise ratio in sequencing data.
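To make the Gamma-Poisson idea concrete, the following is a minimal base R sketch, not the cypress implementation. The gene means, the dispersion value, and the quantile-based strata are all illustrative assumptions; the package estimates its own parameters and may define strata differently.

```r
# Minimal sketch of Gamma-Poisson count simulation (not cypress internals).
set.seed(1)

n_gene   <- 200   # hypothetical number of genes
n_sample <- 20    # hypothetical number of samples

# Hypothetical gene-level parameters: mean expression and a common dispersion.
gene_mean  <- exp(rnorm(n_gene, mean = 4, sd = 1.5))
dispersion <- 0.2

# Step 1: draw a Gamma-distributed rate for every gene and sample.
# With shape = 1 / dispersion and scale = mean * dispersion, the rate has
# mean gene_mean and variance dispersion * gene_mean^2.
rate <- matrix(
    rgamma(n_gene * n_sample,
           shape = 1 / dispersion,
           scale = rep(gene_mean, times = n_sample) * dispersion),
    nrow = n_gene, ncol = n_sample
)

# Step 2: draw Poisson counts given the rates; marginally the counts are
# negative binomial.
counts <- matrix(rpois(length(rate), lambda = rate), nrow = n_gene)

# Assumed stratification: four strata by quartiles of the average count.
avg_count <- rowMeans(counts)
strata <- cut(avg_count,
              breaks = quantile(avg_count, probs = seq(0, 1, by = 0.25)),
              include.lowest = TRUE,
              labels = paste0("Q", 1:4))
table(strata)
```

Power can then be summarized separately within each stratum, which is how low-expression genes with a weak signal-to-noise ratio can be examined on their own.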
cypress is the first statistical tool to prospectively evaluate power in cell-type-specific differentially expressed (csDE) gene detection experiments, letting researchers flexibly tune sample sizes, effect sizes, the percentage of csDE genes, the total number of genetic features, type I error control, and more.
From Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("cypress")
To view the package vignette in HTML format, run the following lines in R:
library(cypress)
vignette("cypress")
cypress offers the function simFromData() for users who provide their own data for parameter estimation and simulation. simFromData() requires the input to be organized as a SummarizedExperiment object containing a count matrix, the study design, and an optional cell type proportion matrix. An example dataset, ASD_prop_se, is included:
library(cypress)
data(ASD_prop_se)
result1 <- simFromData(
    INPUTdata = ASD_prop,          # SummarizedExperiment object.
    CT_index = (seq_len(6) + 2),   # Column indices of the cell type proportion matrix.
    CT_unk = FALSE,                # Set to TRUE if no cell type proportion matrix is provided.
    n_sim = 2,                     # Total number of simulation iterations.
    n_gene = 1000,                 # Total number of genetic features to simulate.
    DE_pct = 0.05,                 # Percentage of DE genes in each cell type.
    ss_group_set = c(10, 20),      # Sample sizes per group to simulate.
    lfc_set = c(1, 1.5))           # Effect sizes to simulate.
Unlike simFromData(), which requires real data as input, quickPower() returns power calculation results without running new simulations. It is designed to provide quick access to pre-calculated results.
library(cypress)
data(quickPowerGSE60424)
result2 <- quickPower(data = "IAD") # Options include 'IAD', 'IBD', or 'ASD'.
Produce basic line plots of the power evaluation results returned by quickPower():
### plot statistical power results
plotPower(simulation_results = result2, # Simulation results generated by quickPower() or simFromData().
          effect.size = 1,              # Numerical value at which the effect size is fixed.
          sample_size = 10)             # Numerical value at which the sample size is fixed.
### plot TDR results
plotTDR(simulation_results = result2,
        effect.size = 1,
        sample_size = 10)
### plot FDC results
plotFDC(simulation_results = result2,
        sample_size = 10)
There are 3 options for users to utilize cypress functions for parameter estimation and simulation:
Option 1: Users wish to use their own real data for estimating parameters. The input needs to be organized as a SummarizedExperiment object. The object should contain a feature-by-sample count matrix stored in the counts slot. The first column of the colData slot, named 'disease', should store the group status (i.e., case/control), where controls are coded as 1 and cases as 2. The second column of the colData slot should store the subject IDs mapping to each sample. The remaining columns of the colData slot should store the cell type proportions. The cell type proportion matrix is optional; if provided, the proportions should sum to 1 for each sample.
data(ASD_prop_se)
ASD_prop
## class: SummarizedExperiment
## dim: 3000 48
## metadata(0):
## assays(1): counts
## rownames(3000): ENSG00000214727 ENSG00000214941 ... ENSG00000150471
## ENSG00000198759
## rowData names(0):
## colnames(48): SRR1576466 SRR1576467 ... SRR1576512 SRR1576513
## colData names(8): disease sample_id ... Celltype5 Celltype6
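As an illustration of the layout described for Option 1, the sketch below assembles a toy input from made-up data. All values and the gene, sample, and cell type names are hypothetical; only the column order (disease, subject ID, then proportions) follows the description above. The SummarizedExperiment object is built only if that package is installed.

```r
# Sketch of the expected input layout, using made-up data.
set.seed(2)
n_gene <- 50; n_sample <- 8; n_ct <- 6

# Feature-by-sample count matrix.
counts <- matrix(rpois(n_gene * n_sample, lambda = 100),
                 nrow = n_gene,
                 dimnames = list(paste0("gene", seq_len(n_gene)),
                                 paste0("sample", seq_len(n_sample))))

# Random cell type proportions, normalized so each sample sums to 1.
prop <- matrix(runif(n_sample * n_ct), nrow = n_sample)
prop <- prop / rowSums(prop)
colnames(prop) <- paste0("Celltype", seq_len(n_ct))

col_data <- data.frame(
    disease   = rep(c(1, 2), each = n_sample / 2),  # 1 = control, 2 = case
    sample_id = colnames(counts),
    prop,
    row.names = colnames(counts)
)

# Columns 3 to 8 hold the proportions, which is what
# CT_index = seq_len(6) + 2 selects in the simFromData() example.
ct_index <- seq_len(n_ct) + 2

# Build the object only if SummarizedExperiment is available.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
    se <- SummarizedExperiment::SummarizedExperiment(
        assays  = list(counts = counts),
        colData = S4Vectors::DataFrame(col_data)
    )
}
```

An object built this way has the same colData shape as ASD_prop (8 columns: disease, sample_id, and six cell type columns), so it could in principle be passed as INPUTdata, though this toy example is far too small for meaningful parameter estimation.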
Option 2: Users wish to use an existing study for parameter estimation. cypress provides 3 real datasets for estimating the parameters:
IAD (GSE60424): Whole transcriptome signatures of 6 immune cell subsets, including neutrophils, monocytes, B cells, CD4 T cells, CD8 T cells, and natural killer cells.
ASD: The bulk RNA-seq data from a large autism spectrum disorder (ASD) study. The study includes 251 samples of frontal/temporal cortex and cerebellum brain regions. Samples are from 48 ASD subjects versus 49 controls.
IBD (GSE57945): The ileal transcriptome in pediatric inflammatory bowel disease (IBD). This dataset includes a cohort of 359 treatment-naïve pediatric patients with Crohn’s disease (CD, n=213), ulcerative colitis (UC, n=60), and healthy controls (n=41).
Option 3: Users have no data to provide but wish to run simulations. They can define their own study design parameters, including but not limited to the number of simulations, number of genes, sample sizes, and log fold changes.
cypress offers 3 functions for simulation and power evaluation: simFromData(), simFromParam(), and quickPower(). To quickly examine power evaluation results for the built-in datasets, use quickPower(). Users with their own bulk RNA-seq count data can use simFromData(); otherwise, they can use simFromParam(), which relies on parameters estimated from 3 existing studies to perform power evaluation under customized simulation settings. The output of each of these 3 functions is an S4 object with a list of power measurements under various experimental settings, such as statistical power, TDR, and FDC.
simFromData() lets users provide their own data or use the example data loaded by data(ASD_prop_se). Users can also specify the number of simulations (n_sim), number of genes (n_gene), percentage of differentially expressed genes for each cell type (DE_pct), sample sizes (ss_group_set), and log fold changes (lfc_set).
data(ASD_prop_se)
result <- simFromData(INPUTdata = ASD_prop, # Users can provide their own SummarizedExperiment object.
                      CT_index = (seq_len(6) + 2), CT_unk = FALSE,
                      n_sim = 2, n_gene = 1000, DE_pct = 0.05,
                      ss_group_set = c(10, 20),
                      lfc_set = c(1, 1.5))
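To get a feel for how many settings a call like the one above covers, the sketch below enumerates the combinations of ss_group_set and lfc_set in base R. It assumes that every sample size is crossed with every effect size, which is an assumption about the simulation design rather than something read from the cypress source.

```r
# Values taken from the simFromData() call above.
ss_group_set <- c(10, 20)   # sample sizes per group
lfc_set      <- c(1, 1.5)   # log fold changes

# Assumption: each sample size is paired with each effect size.
scenarios <- expand.grid(sample_size = ss_group_set, lfc = lfc_set)
scenarios
nrow(scenarios)  # number of distinct scenarios
```

Under this assumption, run time grows with the number of rows in this grid multiplied by n_sim, which is worth keeping in mind before requesting long vectors of sample sizes or effect sizes.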
Unlike simFromData(), which requires users to provide their own count matrix and often takes a while to run, simFromParam() produces results faster by using parameters already estimated from three existing studies. The sim_param argument selects one of the 3 pre-estimated datasets: IAD, ASD, or IBD. Users can also define their own simulation settings.
data(quickParaGSE60424)
result <- simFromParam(sim_param = "IAD", # Options are 'IAD', 'ASD', and 'IBD'.
                       n_sim = 2, DE_pct = 0.05, n_gene = 1000,
                       ss_group_set = c(10, 20), lfc_set = c(1, 1.5),
                       lfc_target = 0.5, fdr_thred = 0.1)
quickPower() runs faster than both simFromData() and simFromParam(). Users only need to choose which study to use for simulation via the data argument.
data(quickPowerGSE60424)
quickPower <- quickPower(data = "IAD") # Options include 'IAD', 'IBD', or 'ASD'.
cypress computes power evaluation metrics for each experimental scenario defined in the function arguments. The results include:
Power: Statistical power.
TDR: The ratio of the number of true positives to the number of positive discoveries.
FDC: The ratio of the number of false positives to the number of true positives.
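The definitions above can be checked on a small worked example. The truth and call vectors below are made up, and power is taken here as the fraction of truly DE genes that are detected, which is a common definition and an assumption about how it maps onto the package's own calculation.

```r
# Toy example: 10 genes, 4 truly DE, 4 called DE.
truth  <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
called <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

tp <- sum(truth & called)    # true positives: 3
fp <- sum(!truth & called)   # false positives: 1

power <- tp / sum(truth)     # detected fraction of true DE genes: 0.75
tdr   <- tp / sum(called)    # true positives among discoveries: 0.75
fdc   <- fp / tp             # false positives per true positive: 1/3

c(power = power, TDR = tdr, FDC = fdc)
```

FDC reads as a cost: here, each true discovery comes with one third of a false discovery on average.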
Once users have obtained an S4 object with a list of power measurements from simFromData(), simFromParam(), or quickPower(), they can use the plotting functions in cypress to generate basic line plots.
cypress provides three plotting functions: plotPower(), plotTDR(), and plotFDC().
plotPower(): Generates 6 plots in a 2x3 panel. From left to right and top to bottom, the plots show:
1: Statistical power by effect size; each line represents a sample size. Power is averaged across cell types.
2: Statistical power by effect size; each line represents a cell type. Power is calculated with the sample size fixed (at 10 if sample_size=10).
3: Statistical power by sample size; each line represents a cell type. Power is calculated with the effect size fixed (at 1 if effect.size=1).
4: Statistical power by stratum; each line represents a cell type. Power is calculated with both the sample size and the effect size fixed (at 10 and 1 if sample_size=10 and effect.size=1).
5: Statistical power by stratum; each line represents a sample size. Power is averaged across cell types, with the effect size fixed (at 1 if effect.size=1).
6: Statistical power by stratum; each line represents an effect size. Power is averaged across cell types, with the sample size fixed (at 10 if sample_size=10).
### plot statistical power results
plotPower(simulation_results = quickPower, effect.size = 1, sample_size = 10)
plotTDR(): Generates 4 plots in a 2x2 panel. From left to right and top to bottom, the plots show:
1: True discovery rate (TDR) by number of top-ranked genes; each line represents a cell type. TDR is calculated with both the sample size and the effect size fixed (at 10 and 1 if sample_size=10 and effect.size=1).
2: TDR by number of top-ranked genes; each line represents an effect size. TDR is averaged across cell types, with the sample size fixed (at 10 if sample_size=10).
3: TDR by number of top-ranked genes; each line represents a sample size. TDR is averaged across cell types, with the effect size fixed (at 1 if effect.size=1).
4: TDR by effect size; each line represents a sample size. TDR is calculated for the top 350 ranked genes.
### plot TDR results
plotTDR(simulation_results = quickPower, effect.size = 1, sample_size = 10)
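The idea of TDR among top-ranked genes can be sketched in base R as follows. The p-values are simulated from arbitrary distributions chosen only so that DE genes tend to rank higher; none of this reflects the cypress internals, and the cutoffs other than 350 are illustrative.

```r
# Sketch: TDR among the top-k ranked genes, using made-up p-values.
set.seed(3)
n_gene <- 1000
is_de  <- rep(c(TRUE, FALSE), times = c(50, 950))  # 5% of genes are DE

# Hypothetical p-values: DE genes tend to have smaller values.
pval <- ifelse(is_de, rbeta(n_gene, 0.3, 5), runif(n_gene))

# Rank genes from most to least significant and track the truth.
ranked_truth <- is_de[order(pval)]

# TDR at each cutoff is the fraction of true DE genes among the top k.
top_k    <- c(10, 50, 100, 350)
tdr_at_k <- vapply(top_k,
                   function(k) mean(ranked_truth[seq_len(k)]),
                   numeric(1))
data.frame(top_k, tdr = tdr_at_k)
```

Because only 50 genes are truly DE in this toy setup, the TDR at the top 350 genes cannot exceed 50/350, which illustrates why TDR generally falls as the ranked list is extended.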
plotFDC(): Generates 2 plots in a 1x2 panel. From left to right, the plots show:
1: False discovery cost (FDC) by effect size; each line represents a cell type. FDC is calculated with the sample size fixed (at 10 if sample_size=10).
2: FDC by effect size; each line represents a sample size. FDC is averaged across cell types.
### plot FDC results
plotFDC(simulation_results = quickPower, sample_size = 10)
## R Under development (unstable) (2024-01-16 r85808)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cypress_0.99.1 BiocStyle_2.31.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 pbapply_1.7-2
## [3] formatR_1.14 CDM_8.2-6
## [5] rlang_1.1.3 magrittr_2.0.3
## [7] matrixStats_1.2.0 e1071_1.7-14
## [9] compiler_4.4.0 gdata_3.0.0
## [11] vctrs_0.6.5 quadprog_1.5-8
## [13] stringr_1.5.1 pkgconfig_2.0.3
## [15] crayon_1.5.2 fastmap_1.1.1
## [17] magick_2.8.2 XVector_0.43.1
## [19] utf8_1.2.4 rmarkdown_2.25
## [21] pracma_2.4.4 nloptr_2.0.3
## [23] preprocessCore_1.65.0 purrr_1.0.2
## [25] xfun_0.41 zlibbioc_1.49.0
## [27] cachem_1.0.8 GenomeInfoDb_1.39.5
## [29] jsonlite_1.8.8 sirt_4.1-15
## [31] EpiDISH_2.19.0 highr_0.10
## [33] DelayedArray_0.29.1 BiocParallel_1.37.0
## [35] gmodels_2.18.1.1 parallel_4.4.0
## [37] R6_2.5.1 bslib_0.6.1
## [39] stringi_1.8.3 RColorBrewer_1.1-3
## [41] limma_3.59.2 GGally_2.2.0
## [43] GenomicRanges_1.55.2 lubridate_1.9.3
## [45] jquerylib_0.1.4 Rcpp_1.0.12
## [47] bookdown_0.37 SummarizedExperiment_1.33.3
## [49] iterators_1.0.14 knitr_1.45
## [51] IRanges_2.37.1 polycor_0.8-1
## [53] Matrix_1.6-5 nnls_1.5
## [55] splines_4.4.0 timechange_0.3.0
## [57] tidyselect_1.2.0 abind_1.4-5
## [59] yaml_2.3.8 doParallel_1.0.17
## [61] codetools_0.2-19 admisc_0.34
## [63] lattice_0.22-5 tibble_3.2.1
## [65] plyr_1.8.9 Biobase_2.63.0
## [67] evaluate_0.23 lambda.r_1.2.4
## [69] futile.logger_1.4.3 ggstats_0.5.1
## [71] proxy_0.4-27 pillar_1.9.0
## [73] BiocManager_1.30.22 MatrixGenerics_1.15.0
## [75] foreach_1.5.2 stats4_4.4.0
## [77] generics_0.1.3 RCurl_1.98-1.14
## [79] S4Vectors_0.41.3 ggplot2_3.4.4
## [81] munsell_0.5.0 TOAST_1.17.0
## [83] scales_1.3.0 gtools_3.9.5
## [85] class_7.3-22 glue_1.7.0
## [87] tools_4.4.0 data.table_1.15.0
## [89] locfit_1.5-9.8 mvtnorm_1.2-4
## [91] grid_4.4.0 tidyr_1.3.1
## [93] matrixcalc_1.0-6 edgeR_4.1.15
## [95] PROPER_1.35.0 TAM_4.1-4
## [97] colorspace_2.1-0 GenomeInfoDbData_1.2.11
## [99] rsvd_1.0.5 cli_3.6.2
## [101] futile.options_1.0.1 fansi_1.0.6
## [103] S4Arrays_1.3.3 dplyr_1.1.4
## [105] corpcor_1.6.10 gtable_0.3.4
## [107] sass_0.4.8 digest_0.6.34
## [109] BiocGenerics_0.49.1 SparseArray_1.3.3
## [111] htmltools_0.5.7 lifecycle_1.0.4
## [113] TCA_1.2.1 locfdr_1.1-8
## [115] statmod_1.5.0 mime_0.12
## [117] MASS_7.3-60.2