How to use the NCI CACTUS connector and its methods.
biodbNci 1.6.0
biodbNci is a biodb extension package that implements a connector to the National Cancer Institute (USA) CACTUS API (National Cancer Institute 2022).
Install using Bioconductor:
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install('biodbNci')
The first step in using biodbNci is to create an instance of the biodb class BiodbMain from the main biodb package. This is done by calling the constructor of the class:
mybiodb <- biodb::newInst()
During this step the configuration is set up, the cache system is initialized and extension packages are loaded.
We will see at the end of this vignette that the biodb instance needs to be terminated with a call to the terminate() method.
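Because terminate() must always be called, it can be convenient to tie the call to the scope of a function. The following is a minimal sketch, not part of biodb: withBiodb is a hypothetical helper name, and it relies on base R's on.exit() so that the instance is terminated even if the wrapped code fails.

```r
# Hypothetical helper (not part of biodb): runs `fct` with a fresh biodb
# instance and terminates that instance when the function exits.
withBiodb <- function(fct) {
    mybiodb <- biodb::newInst()
    # on.exit() runs on normal return and on error alike.
    on.exit(mybiodb$terminate(), add=TRUE)
    fct(mybiodb)
}

# Usage (not run here, it queries the remote web service):
# withBiodb(function(b) {
#     conn <- b$getFactory()$createConn('nci.cactus')
#     conn$convCasToInchi('87605-72-9')
# })
```

With this pattern the caller never handles the instance directly, so a forgotten terminate() call is no longer possible inside the wrapped code.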
In biodb the connection to a database is handled by a connector instance that you can get from the factory. biodbNci implements a connector to a remote database. Here is the code to instantiate a connector:
conn <- mybiodb$getFactory()$createConn('nci.cactus')
## Loading required package: biodbNci
For this vignette, we avoid downloading the full NCI CACTUS database and instead use an extract containing a few entries:
dbExtract <- system.file("extdata", 'generated', "cactus_extract.txt.gz",
                         package="biodbNci")
conn$setPropValSlot('urls', 'db.gz.url', dbExtract)
To get the first few entry IDs (accession numbers) from the database, here two of them, run:
ids <- conn$getEntryIds(2)
ids
## [1] "749674" "750690"
To retrieve entries, use:
entries <- conn$getEntry(ids)
entries
## [[1]]
## Biodb NCI CACTUS entry instance 749674.
##
## [[2]]
## Biodb NCI CACTUS entry instance 750690.
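Individual fields can also be read directly from an entry object. The sketch below repeats the setup steps of this vignette so that it is self-contained. It assumes the getFieldValue() method of biodb entries and the field name 'formula' (one of the column names that appear in the data frame output of this vignette), and it only runs when biodbNci is installed.

```r
entryFormula <- NULL
if (requireNamespace("biodbNci", quietly=TRUE)) {
    mybiodb <- biodb::newInst()
    conn <- mybiodb$getFactory()$createConn('nci.cactus')
    # Use the small extract shipped with the package, not the full database.
    dbExtract <- system.file("extdata", 'generated', "cactus_extract.txt.gz",
                             package="biodbNci")
    conn$setPropValSlot('urls', 'db.gz.url', dbExtract)
    entry <- conn$getEntry('749674')
    # Assumed accessor: returns the value stored in a single field.
    entryFormula <- entry$getFieldValue('formula')
    mybiodb$terminate()
}
entryFormula
```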
To convert a list of entries into a data frame, run:
x <- mybiodb$entriesToDataframe(entries)
x
## accession formula molecular.mass
## 1 749674 C16H14N4O 278.3128
## 2 750690 C22H27FN4O2 398.4793
## inchi
## 1 InChI=1S/C16H14N4O/c1-11-15(20-19-12-7-3-2-4-8-12)16(21)18-14-10-6-5-9-13(14)17-11/h2-10,19H,1H3,(H,18,20,21)
## 2 InChI=1S/C22H27FN4O2/c1-5-27(6-2)10-9-24-22(29)20-13(3)19(25-14(20)4)12-17-16-11-15(23)7-8-18(16)26-21(17)28/h7-8,11-12,25H,5-6,9-10H2,1-4H3,(H,24,29)(H,26,28)/b17-12-
## inchikey nci.cactus.id cas.id
## 1 RWIQZKLIGWLCEK-UHFFFAOYSA-N 749674 <NA>
## 2 WINHZLLDWRZWRT-ATVHPVEESA-N 750690 557795-19-4
## name
## 1 <NA>
## 2 Sunitinib (free base);1H-Pyrrole-3-carboxamide, N-[2-(diethylamino)ethyl]-5-[(Z)-(5-fluoro-1,2-dihydro-2-oxo-3H-indol-3-ylidene)methyl]-2,4-dimethyl-
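In the output above, the two names of entry 750690 appear joined by a semicolon in the name column. Here is a minimal base R sketch for splitting such a column back into individual names. It assumes that the semicolon is the separator, as the output suggests, and it uses a hand-built data frame (xDemo, a hypothetical name) with values copied from the output instead of a live query.

```r
# Hand-built stand-in for two columns of the data frame shown above.
xDemo <- data.frame(
    accession=c("749674", "750690"),
    name=c(NA, paste0("Sunitinib (free base);1H-Pyrrole-3-carboxamide, ",
        "N-[2-(diethylamino)ethyl]-5-[(Z)-(5-fluoro-1,2-dihydro-2-oxo-3H-",
        "indol-3-ylidene)methyl]-2,4-dimethyl-")),
    stringsAsFactors=FALSE)

# Split on the literal semicolon; entries without a name yield NA.
nameList <- strsplit(xDemo$name, ";", fixed=TRUE)
names(nameList) <- xDemo$accession
nameList[["750690"]]
```

Splitting with fixed=TRUE leaves the commas inside the systematic name untouched, since only the semicolon is treated as a delimiter.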
Here is an example of calling the Chemical Identifier Resolver to convert a SMILES string into an InChI:
conn$wsChemicalIdentifierResolver(structid='C=O', repr='InChI')
## [1] "InChI=1/CH2O/c1-2/h1H2"
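Note that the string returned here starts with InChI=1/, whereas the inchi field of the entries shown earlier starts with InChI=1S/. The S marks a standard InChI. Here is a small base R sketch for telling the two forms apart, where isStandardInchi is a hypothetical helper name and not part of biodbNci:

```r
# Hypothetical helper: TRUE for strings carrying the standard InChI prefix.
isStandardInchi <- function(inchi) {
    startsWith(inchi, "InChI=1S/")
}

inchiCheck <- isStandardInchi(c("InChI=1/CH2O/c1-2/h1H2",
                                "InChI=1S/CH2O/c1-2/h1H2"))
inchiCheck
```

Such a check can help when comparing resolver results against InChI values stored in database entries, since the two prefixes do not match as plain strings.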
The NCI CACTUS connector currently provides two methods for converting CAS IDs, one to InChI and one to InChIKey:
conn$convCasToInchi('87605-72-9')
## [1] "InChI=1/C25H30O5/c1-15(2)6-5-7-16(3)8-9-30-19-11-17-10-18-13-25(4,29)14-21(27)23(18)24(28)22(17)20(26)12-19/h6,8,10-12,26,28-29H,5,7,9,13-14H2,1-4H3/b16-8+"
conn$convCasToInchikey('87605-72-9')
## [1] "KZPCPZBBGCTGCN-LZYBPNLTNA-N"
Both conversions rely on the Chemical Identifier Resolver web service.
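Since each conversion triggers a request to a remote web service, it may be worth rejecting malformed CAS IDs locally first. The sketch below is an optional pre-check and not part of biodbNci: isValidCas is a hypothetical helper that applies the CAS check-digit rule (the last digit equals the weighted sum of the preceding digits, counted from the right, modulo 10).

```r
# Hypothetical helper: validates the format and check digit of CAS IDs.
isValidCas <- function(cas) {
    vapply(cas, function(id) {
        # Expected shape: 2 to 7 digits, 2 digits, 1 check digit.
        if (is.na(id) || !grepl("^[0-9]{2,7}-[0-9]{2}-[0-9]$", id))
            return(FALSE)
        digits <- as.integer(strsplit(gsub("-", "", id), "")[[1]])
        check <- digits[length(digits)]
        # Weight the remaining digits 1, 2, 3, ... starting from the right.
        body <- rev(digits[-length(digits)])
        sum(body * seq_along(body)) %% 10 == check
    }, logical(1), USE.NAMES=FALSE)
}

casCheck <- isValidCas(c("87605-72-9", "557795-19-4", "87605-72-8"))
casCheck
```

The first two IDs are the ones that appear in this vignette and pass the check; the third has a deliberately altered check digit and is rejected.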
When you are done with your biodb instance, you must terminate it to ensure that resources (file handles, database connections, etc.) are released:
mybiodb$terminate()
## INFO [16:27:59.912] Closing BiodbMain instance...
## INFO [16:27:59.920] Connector "nci.cactus" deleted.
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biodbNci_1.6.0 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 sass_0.4.7 utf8_1.2.4
## [4] generics_0.1.3 stringi_1.7.12 RSQLite_2.3.1
## [7] hms_1.1.3 digest_0.6.33 magrittr_2.0.3
## [10] evaluate_0.22 bookdown_0.36 fastmap_1.1.1
## [13] blob_1.2.4 plyr_1.8.9 jsonlite_1.8.7
## [16] progress_1.2.2 DBI_1.1.3 BiocManager_1.30.22
## [19] httr_1.4.7 fansi_1.0.5 XML_3.99-0.14
## [22] jquerylib_0.1.4 cli_3.6.1 rlang_1.1.1
## [25] chk_0.9.1 crayon_1.5.2 dbplyr_2.3.4
## [28] bit64_4.0.5 withr_2.5.1 cachem_1.0.8
## [31] yaml_2.3.7 tools_4.3.1 memoise_2.0.1
## [34] biodb_1.10.0 dplyr_1.1.3 filelock_1.0.2
## [37] curl_5.1.0 vctrs_0.6.4 R6_2.5.1
## [40] BiocFileCache_2.10.0 lifecycle_1.0.3 stringr_1.5.0
## [43] bit_4.0.5 pkgconfig_2.0.3 pillar_1.9.0
## [46] bslib_0.5.1 glue_1.6.2 Rcpp_1.0.11
## [49] lgr_0.4.4 xfun_0.40 tibble_3.2.1
## [52] tidyselect_1.2.0 knitr_1.44 htmltools_0.5.6.1
## [55] rmarkdown_2.25 compiler_4.3.1 prettyunits_1.2.0
## [58] askpass_1.2.0 openssl_2.1.1
National Cancer Institute. 2022. “CADD Group Chemoinformatics Tools and User Services (Cactus).” https://cactus.nci.nih.gov/.