ctdR identifies chemicals significantly associated with a set of genes using data from the Comparative Toxicogenomics Database (CTD).
Four enrichment methods are supported through a single
enrichment_CTD() interface; the input shape depends on
the method:
clusterProfiler::enricher.fgsea::fgsea.limma::camera.GSVA::gsva.This package does NOT bundle, redistribute, or embed any data from the Comparative Toxicogenomics Database. CTD data are created and maintained by NC State University and are subject to specific licensing terms. Users must download the data directly from https://ctdbase.org and comply with the CTD Terms of Service.
🚧 ctdR is not yet accepted, or under review for Bioconductor (submission #4232). Until then, please install from GitHub. Fingers crossed 🤞.
# install.packages("devtools")
devtools::install_github("drake69/ctdR")
Once ctdR is accepted into Bioconductor (currently under review), it will be installable directly via:
BiocManager::install("ctdR")
The ctdR workflow has three steps:
Download CTD_chem_gene_ixns.csv.gz from
https://ctdbase.org/reports/CTD_chem_gene_ixns.csv.gz.
Then decompress it:
gunzip CTD_chem_gene_ixns.csv.gz
This produces CTD_chem_gene_ixns.csv (several GB
uncompressed).
In production you would run:
library(ctdR)
import_CTD("~/Downloads/CTD_chem_gene_ixns.csv")
For this vignette we use a small synthetic dataset bundled with the package:
library(ctdR)
sample_file <- system.file(
"extdata", "CTD_chem_gene_ixns_sample.csv",
package = "ctdR"
)
import_CTD(sample_file)
#> Reading CTD chemical-gene interactions from: /tmp/RtmpN1MEA8/Rinst2b06de18100d75/ctdR/extdata/CTD_chem_gene_ixns_sample.csv
#> Filtered to 86 human interactions
#> Mapping genes for 10 chemicals (this may take a while)...
#>
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> CTD data cached successfully in: ~/.cache/ctdR
import_CTD() performs the following:
org.Hs.eg.db.rappdirs::user_cache_dir("ctdR")).This step takes several minutes with the full CTD file. You only need to run it once – or again when you download a newer CTD release.
Your input must be a data frame with at least two columns:
| Column | Description |
|---|---|
entrez_ids |
Character or numeric Entrez gene IDs |
| (2nd column) | Numeric value per gene (e.g. p-value) |
The second column is used for ranking in GSEA and is ignored in ORA.
genes <- data.frame(
entrez_ids = c(
"7124", "3569", "7157", "672", "1956",
"4609", "3845", "207", "5290", "3553"
),
pvalue = c(
0.001, 0.003, 0.005, 0.008, 0.01,
0.02, 0.03, 0.04, 0.05, 0.06
)
)
ORA tests whether the overlap between your gene list and each chemical’s known gene targets is significantly larger than expected by chance.
ora_results <- enrichment_CTD(genes, method = "ORA")
#> 'select()' returned 1:1 mapping between keys and columns
#>
head(ora_results)
#> ChemicalID GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue
#> 1 D000082 10/10 12/17 0.8333333 1.416667 3.0860670 0.003393665
#> 2 D001151 6/10 10/17 0.6000000 1.020000 0.1142857 0.646133278
#> 3 D001564 8/10 10/17 0.8000000 1.360000 2.0571429 0.052241876
#> 4 D002945 9/10 11/17 0.8181818 1.390909 2.5305065 0.017533937
#> padj qvalue geneID
#> 1 0.01357466 0.007144558 PIK3CA/IL6/AKT1/IL1B/TP53/EGFR/BRCA1/KRAS/MYC/TNF
#> 2 0.64613328 0.340070147 IL6/AKT1/TP53/EGFR/KRAS/TNF
#> 3 0.06965583 0.036660965 PIK3CA/AKT1/IL1B/TP53/EGFR/KRAS/MYC/TNF
#> 4 0.03506787 0.018456775 PIK3CA/IL6/AKT1/TP53/EGFR/BRCA1/KRAS/MYC/TNF
#> Count foldEnrichment ChemicalName
#> 1 10 1.416667 Acetaminophen
#> 2 6 1.020000 Arsenic
#> 3 8 1.360000 Benzo(a)pyrene
#> 4 9 1.390909 Cisplatin
ORA output columns:
| Column | Description |
|---|---|
ChemicalID |
CTD chemical identifier |
ChemicalName |
Human-readable chemical name |
GeneRatio |
Proportion of input genes in the set |
BgRatio |
Background ratio |
pvalue |
Raw p-value |
padj |
Adjusted p-value (BH method) |
foldEnrichment |
GeneRatio / BgRatio |
geneID |
Enriched gene symbols |
Count |
Number of overlapping genes |
GSEA uses the full ranked gene list (ranked by the numeric column) to detect chemicals whose targets cluster toward the top or bottom of the ranking.
gsea_results <- enrichment_CTD(genes, method = "GSEA")
#> Warning in prepareStats(stats, scoreType, gseaParam): All values in the stats
#> vector are greater than zero and scoreType is "std", maybe you should switch to
#> scoreType = "pos".
head(gsea_results)
#> ChemicalID pval padj log2err ES NES size
#> 1 D001151 0.34627329 0.4126984 0.08528847 0.6138938 1.197998 6
#> 2 D001564 0.36507937 0.4126984 0.11426650 -0.6329866 -1.140551 8
#> 3 D002104 0.14285714 0.2778731 0.14375899 0.6661635 1.300001 6
#> 4 D002945 0.15437393 0.2778731 0.14551615 1.0000000 1.252054 9
#> 5 D003520 0.41269841 0.4126984 0.07707367 0.5714286 1.097455 3
#> 6 D003907 0.05238095 0.2507205 0.24891111 0.8188439 1.572627 3
#> leadingEdge foldEnrichment ChemicalName Enriched_GENE
#> 1 7124, 35.... 1.531849 Arsenic 1, 2, 3, 5, 7, 8
#> 2 3553, 52.... 1.579491 Benzo(a)pyrene 1, 3, 5, 6, 7, 8, 9, 10
#> 3 7124, 35.... 1.662278 Cadmium 1, 2, 3, 8, 9, 10
#> 4 7124, 35.... 2.495300 Cisplatin 1, 2, 3, 4, 5, 6, 7, 8, 9
#> 5 7157, 67.... 1.425886 Cyclophosphamide 3, 4, 6
#> 6 7124, 3569 2.043261 Dexamethasone 1, 2, 10
GSEA output columns:
| Column | Description |
|---|---|
ChemicalID |
CTD chemical identifier |
ChemicalName |
Human-readable chemical name |
pval |
Raw p-value |
padj |
Adjusted p-value |
ES |
Enrichment score |
NES |
Normalized enrichment score |
size |
Size of the gene set |
leadingEdge |
Leading-edge gene subset |
foldEnrichment |
abs(ES) / mean(ES) |
Enriched_GENE |
Comma-separated enriched genes |
CAMERA and GSVA take a gene x sample expression matrix
(typically log2 counts or microarray intensities). For
this vignette we build a small synthetic matrix using the
Entrez IDs already loaded by import_CTD() above:
# Pull the Entrez IDs covered by the bundled sample
cache_dir <- rappdirs::user_cache_dir("ctdR")
e <- new.env(parent = emptyenv())
load(file.path(cache_dir, "ChemicalName_GeneEntrezIds.rda"),
envir = e)
all_ids <- as.character(unique(unlist(
e$ChemicalName_GeneEntrezIds)))
# Synthetic expression matrix: 17 genes x 6 samples
set.seed(2026)
expr <- matrix(
rnorm(length(all_ids) * 6),
nrow = length(all_ids), ncol = 6,
dimnames = list(all_ids, paste0("S", 1:6))
)
dim(expr)
#> [1] 17 6
CAMERA tests, for each chemical, whether its target genes show a stronger differential signal than the rest of the transcriptome under a user-supplied design and contrast, while correcting for inter-gene correlation within the gene set.
grp <- factor(rep(c("ctrl", "treat"), each = 3))
design <- model.matrix(~ grp)
camera_results <- enrichment_CTD(
expr,
method = "CAMERA",
design = design,
contrast = 2 # last column of design = treat vs ctrl
)
head(camera_results)
#> ChemicalID ChemicalName NGenes Direction pvalue padj
#> 1 D002945 Cisplatin 11 Up 0.02319375 0.2319375
#> 2 D003907 Dexamethasone 7 Down 0.07151516 0.2383839
#> 3 D014635 Valproic Acid 8 Down 0.05895320 0.2383839
#> 4 D000082 Acetaminophen 12 Up 0.20432247 0.5108062
#> 5 D002104 Cadmium 8 Down 0.39170306 0.7189345
#> 6 D003520 Cyclophosphamide 5 Up 0.43136072 0.7189345
CAMERA output columns: ChemicalID, ChemicalName,
NGenes (gene set size after intersection with rownames),
Direction ("Up" / "Down"), pvalue, padj.
GSVA produces a per-sample enrichment score for each chemical, returning a matrix (chemicals in rows, samples in columns) suitable for clustering, association tests against phenotypes, survival analysis, or heatmap visualization.
gsva_scores <- enrichment_CTD(expr, method = "GSVA")
#> ℹ GSVA version 2.6.1
#> ℹ Searching for rows with constant values
#> ℹ Calculating GSVA ranks
#> ℹ kcdf='auto' (default)
#> ℹ GSVA dense (classical) algorithm
#> ℹ Row-wise ECDF estimation with Gaussian kernels
#> ℹ Calculating row ECDFs
#> ℹ Calculating column ranks
#> ℹ GSVA dense (classical) algorithm
#> ℹ Calculating GSVA scores for 10 gene sets
#> ✔ Calculations finished
dim(gsva_scores) # chemicals x samples
#> [1] 10 6
head(gsva_scores, 3)
#> S1 S2 S3 S4 S5 S6
#> D000082 -0.3073171 0.1181818 0.18181818 0.2000000 0.2770833 0.07017544
#> D001564 -0.3370536 0.4025974 0.01567944 0.3458647 0.2408638 -0.24068323
#> D002104 0.1014493 0.4313725 0.36585366 0.2833333 -0.0751634 -0.62903226
Tune the underlying GSVA::gsvaParam() through ...,
for example to restrict to gene sets of a given size:
gsva_strict <- enrichment_CTD(
expr, method = "GSVA",
minSize = 5, maxSize = 500
)
#> ℹ GSVA version 2.6.1
#> ℹ Searching for rows with constant values
#> ℹ Calculating GSVA ranks
#> ℹ kcdf='auto' (default)
#> ℹ GSVA dense (classical) algorithm
#> ℹ Row-wise ECDF estimation with Gaussian kernels
#> ℹ Calculating row ECDFs
#> ℹ Calculating column ranks
#> ℹ GSVA dense (classical) algorithm
#> ℹ Calculating GSVA scores for 10 gene sets
#> ✔ Calculations finished
nrow(gsva_strict)
#> [1] 10
plot_CTD() creates publication-ready plots from enrichment
results. It auto-detects the method: bar/dot plots of fold
enrichment for ORA/GSEA, bar/dot plots of -log10(padj)
coloured by direction for CAMERA, and a sample-level
heatmap of the top-variance chemicals for GSVA.
# ORA bar plot of top enriched chemicals
plot_CTD(ora_results, type = "bar")
# ORA dot plot (size = gene count, color = adjusted p-value)
plot_CTD(ora_results, type = "dot", n = 10)
# CAMERA bar plot: x-axis = -log10(padj), fill = Direction
plot_CTD(camera_results, type = "bar")
# GSVA heatmap: top-variance chemicals across samples
plot_CTD(gsva_scores)
By default, enrichment_CTD() uses the Benjamini-Hochberg
("BH") method for p-value adjustment. You can change this
via the pAdjustMethod parameter:
# Bonferroni correction (more conservative)
ora_bonf <- enrichment_CTD(genes, method = "ORA",
pAdjustMethod = "bonferroni")
#> 'select()' returned 1:1 mapping between keys and columns
# No adjustment
ora_raw <- enrichment_CTD(genes, method = "ORA",
pAdjustMethod = "none")
#> 'select()' returned 1:1 mapping between keys and columns
# Compare the adjusted p-value of the top hit
data.frame(
method = c("BH (default)", "bonferroni", "none"),
top_padj = c(min(ora_results$padj),
min(ora_bonf$padj), min(ora_raw$padj))
)
#> method top_padj
#> 1 BH (default) 0.013574661
#> 2 bonferroni 0.013574661
#> 3 none 0.003393665
Available methods: "BH" (default), "bonferroni",
"fdr" (alias for BH), "none".
| Aspect | ORA | GSEA | CAMERA | GSVA |
|---|---|---|---|---|
| Input | Gene list | Ranked gene list | Expression matrix + design + contrast | Expression matrix |
| Question | Over-represented? | Cluster at extremes? | Stronger signal than rest of transcriptome (correlation-corrected)? | Per-sample chemical score |
| Output | Ranked chemicals (data.frame) | Ranked chemicals (data.frame) | Ranked chemicals (data.frame) | Chemical x sample score matrix |
| Best for | DEG lists | Exploratory, ranked | Multi-sample DE experiments, co-regulated gene sets | Patient stratification, downstream tests, heatmaps |
Use ORA when you have a well-defined gene list above a significance threshold; GSEA to leverage the full ranking without a cutoff; CAMERA when you have a proper multi-sample design and want to control the inflated false-positive rate that ORA/GSEA exhibit on strongly co-regulated gene sets; GSVA when you want a per-sample score to feed into downstream analyses (clustering, survival, correlation with phenotypes).
CTD releases updated data periodically. To update:
CTD_chem_gene_ixns.csv.gz from
https://ctdbase.org/reports/CTD_chem_gene_ixns.csv.gz.import_CTD() – existing
cache files are overwritten.import_CTD("~/Downloads/CTD_chem_gene_ixns.csv")
sessionInfo()
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ctdR_0.99.2 BiocStyle_2.40.0
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.6.0 ggplotify_0.1.3
#> [3] tibble_3.3.1 polyclip_1.10-7
#> [5] enrichit_0.1.4 graph_1.90.0
#> [7] XML_3.99-0.23 lifecycle_1.0.5
#> [9] httr2_1.2.2 processx_3.9.0
#> [11] lattice_0.22-9 vroom_1.7.1
#> [13] MASS_7.3-65 magrittr_2.0.5
#> [15] limma_3.68.2 sass_0.4.10
#> [17] rmarkdown_2.31 jquerylib_0.1.4
#> [19] yaml_2.3.12 otel_0.2.0
#> [21] ggtangle_0.1.2 cowplot_1.2.0
#> [23] DBI_1.3.0 RColorBrewer_1.1-3
#> [25] abind_1.4-8 GenomicRanges_1.64.0
#> [27] purrr_1.2.2 BiocGenerics_0.58.1
#> [29] yulab.utils_0.2.4 tweenr_2.0.3
#> [31] rappdirs_0.3.4 aisdk_1.1.0
#> [33] gdtools_0.5.0 IRanges_2.46.0
#> [35] S4Vectors_0.50.1 enrichplot_1.32.0
#> [37] ggrepel_0.9.8 irlba_2.3.7
#> [39] tidytree_0.4.7 GSVA_2.6.1
#> [41] annotate_1.90.0 DelayedMatrixStats_1.34.0
#> [43] codetools_0.2-20 DelayedArray_0.38.1
#> [45] DOSE_4.6.0 ggforce_0.5.0
#> [47] tidyselect_1.2.1 aplot_0.2.9
#> [49] memuse_4.2-3 farver_2.1.2
#> [51] ScaledMatrix_1.20.0 matrixStats_1.5.0
#> [53] stats4_4.6.0 Seqinfo_1.2.0
#> [55] jsonlite_2.0.0 systemfonts_1.3.2
#> [57] tools_4.6.0 ggnewscale_0.5.2
#> [59] treeio_1.36.1 Rcpp_1.1.1-1.1
#> [61] glue_1.8.1 SparseArray_1.12.2
#> [63] xfun_0.57 qvalue_2.44.0
#> [65] MatrixGenerics_1.24.0 dplyr_1.2.1
#> [67] HDF5Array_1.40.0 withr_3.0.2
#> [69] BiocManager_1.30.27 fastmap_1.2.0
#> [71] rhdf5filters_1.24.0 callr_3.7.6
#> [73] digest_0.6.39 rsvd_1.0.5
#> [75] R6_2.6.1 gridGraphics_0.5-1
#> [77] GO.db_3.23.1 dichromat_2.0-0.1
#> [79] RSQLite_3.52.0 h5mread_1.4.0
#> [81] tidyr_1.3.2 generics_0.1.4
#> [83] fontLiberation_0.1.0 data.table_1.18.4
#> [85] httr_1.4.8 htmlwidgets_1.6.4
#> [87] S4Arrays_1.12.0 scatterpie_0.2.6
#> [89] pkgconfig_2.0.3 gtable_0.3.6
#> [91] blob_1.3.0 S7_0.2.2
#> [93] SingleCellExperiment_1.34.0 XVector_0.52.0
#> [95] clusterProfiler_4.20.0 htmltools_0.5.9
#> [97] fontBitstreamVera_0.1.1 bookdown_0.46
#> [99] fgsea_1.38.0 GSEABase_1.74.0
#> [101] scales_1.4.0 Biobase_2.72.0
#> [103] png_0.1-9 SpatialExperiment_1.22.0
#> [105] ggfun_0.2.0 knitr_1.51
#> [107] tzdb_0.5.0 reshape2_1.4.5
#> [109] rjson_0.2.23 nlme_3.1-169
#> [111] org.Hs.eg.db_3.23.1 cachem_1.1.0
#> [113] rhdf5_2.56.0 stringr_1.6.0
#> [115] parallel_4.6.0 AnnotationDbi_1.74.0
#> [117] pillar_1.11.1 grid_4.6.0
#> [119] vctrs_0.7.3 BiocSingular_1.28.0
#> [121] tidydr_0.0.6 beachmat_2.28.0
#> [123] xtable_1.8-8 cluster_2.1.8.2
#> [125] evaluate_1.0.5 readr_2.2.0
#> [127] tinytex_0.59 magick_2.9.1
#> [129] cli_3.6.6 compiler_4.6.0
#> [131] rlang_1.2.0 crayon_1.5.3
#> [133] labeling_0.4.3 ps_1.9.3
#> [135] plyr_1.8.9 fs_2.1.0
#> [137] ggiraph_0.9.6 stringi_1.8.7
#> [139] BiocParallel_1.46.0 Biostrings_2.80.0
#> [141] lazyeval_0.2.3 GOSemSim_2.38.0
#> [143] fontquiver_0.2.1 Matrix_1.7-5
#> [145] hms_1.1.4 patchwork_1.3.2
#> [147] sparseMatrixStats_1.24.0 bit64_4.8.0
#> [149] ggplot2_4.0.3 Rhdf5lib_2.0.0
#> [151] KEGGREST_1.52.0 statmod_1.5.2
#> [153] SummarizedExperiment_1.42.0 igraph_2.3.1
#> [155] memoise_2.0.1 bslib_0.11.0
#> [157] ggtree_4.2.0 fastmatch_1.1-8
#> [159] bit_4.6.0 ape_5.8-1
#> [161] gson_0.1.0