--- title: Drug-Target Interactions author: "Authors: Kevin Horan, Yuzhu Duan, Austin Leong, Dan Evans, Jamison McCorrison, Nicholas Schork and Thomas Girke" date: "Last update: `r format(Sys.time(), '%d %B, %Y')`" always_allow_html: yes output: BiocStyle::html_document: toc_float: true code_folding: show vignette: | %\VignetteIndexEntry{Drug-Target Interactions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} fontsize: 14pt bibliography: bibtex.bib --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` # Introduction ## Overview The `drugTargetInteractions` package provides utilities for identifying drug-target interactions for sets of small molecule and/or gene/protein identifiers [@Wu2006-jm]. The required drug-target interaction information is obained from a downloaded SQLite instance of the [ChEMBL](https://www.ebi.ac.uk/chembl/) database [@Gaulton2012-dv; @Bento2014-ry]. ChEMBL has been chosen for this purpose, because it provides one of the most comprehensive and best annotatated knowledge resources for drug-target information available in the public domain. ## Install Package As Bioconductor package `drugTargetInteraction` can be installed with the `BiocManager::install()` function. ```{r pkg_install_bioc, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("drugTargetInteractions") ``` Alternatively, the package can be installed from GitHub as follows. ```{r pkg_install_github, eval=FALSE} devtools::install_github("girke-lab/drugTargetInteractions", build_vignettes=TRUE) # Installs from github ``` ## Load Package and Access Help To use `drugTargetInteractions`, the package needs to be loaded in a user's R session. ```{r package load, eval=TRUE, warning=FALSE} library("drugTargetInteractions") # Loads the package ``` The following commands are useful to list the available help files and open the the vignette of the package. ```{r documentation, eval=FALSE, warning=FALSE} library(help="drugTargetInteractions") # Lists package info vignette(topic="drugTargetInteractions", package="drugTargetInteractions") # Opens vignette ``` # Working Environment ## Required Files and Directories The `drugTargetInteractions` package uses a downloaded SQLite instance of the [ChEMBL](https://www.ebi.ac.uk/chembl/) database. The following code in this vignette uses a slimmed down toy version of this database that is small enough to be included in this package for demonstrating the usage of its functions. For real drug-target analysis work, it is important that users download and uncompress a recent version of the ChEMBL SQLite database from [here](https://chembl.gitbook.io/chembl-interface-documentation/downloads), and then replace the path assigned to `chembldb` below with the path to the full version of the ChEMBL database they have downloaded to their system. Since the SQLite database from ChEMBL can be used by this package as is, creating a copy of the ChEMBL SQLite database on Bioconductor's [AnnotationHub](https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html) is not necessary at this point. This way users can always use the latest or historical versions of ChEMBL without the need of maintaining a mirror instance. The following `genConfig` function call creates a list containing the paths to input and output directories used by the sample code introduced in this vignette. For real analyses, users want to customize these paths to match the environment on their system. Usually, this means all paths generated by the `system.file` function need to be changed, as those are specific to work with the test data of a package (_e.g._ toy ChEMBL SQLite instance). ```{r dir_env, eval=TRUE} chembldb <- system.file("extdata", "chembl_sample.db", package="drugTargetInteractions") resultsPath <- system.file("extdata", "results", package="drugTargetInteractions") config <- genConfig(chemblDbPath=chembldb, resultsPath=resultsPath) ``` In addition, a lookup table is downloaded on the fly from UniChem (see [web page](https://www.ebi.ac.uk/unichem/ucquery/listSources) and corresponding [ftp site](ftp://ftp.ebi.ac.uk/pub/databases/chembl/UniChem/data/wholeSourceMapping/). This lookup table is used by `drugTargetInteractions` to translate compound identifiers across different drug databases. Currently, this includes three types of compound identifiers: DrugBank, PubChem, and ChEBI. ```{r data_setup, eval=TRUE} downloadUniChem(config=config) cmpIdMapping(config=config) ``` # Produce Results Quickly Users mainly interested in generating analysis results can skip the technical details in the following sections and continue with the section entitled [Workflow to Run Everything](#7_Workflow_to_Run_Everything). # Retrieve UniProt IDs The following returns for a set of query IDs (_e.g._ ENSEMBL gene IDs) the corresponding UniProt IDs based on a stict ID matching as well as a more relaxed sequence similarity-based approach. The latter sequence similarity associations are obtained with the `getUniprotIDs` or the `getParalogs` functions using UniProt's UNIREF cluster or BioMart's paralog annotations, respectively. ## UniProt's UNIREF Clusters The `UniProt.ws` package is used to to return for a set of query IDs (here Ensembl gene IDs) the corresponding UniProt IDs based on two independent approaches: ID mappings (IDMs) and sequence similarity nearest neighbors (SSNNs) using UNIREF clusters. The latter have been generated by UniProt with the MMSeqs2 and Linclust algorithms [@Steinegger2017-uq; @Steinegger2018-ub]. Additional details on the UNIREF clusters are available [here](https://www.uniprot.org/help/uniref). The organism, query ID type and sequence similarity level can be selected under the `taxId`, `kt` and `seq_cluster` arguments, respectively. The `seq_cluster` argument can be assigned one of: `UNIREF100`, `UNIREF90` or `UNIREF50`. The result is a list with two `data.frames`. The first one is based on IDMs and the second one on SSNNs. ```{r getUniprotIDs1, eval=FALSE, message=FALSE, warning=FALSE} keys <- c("ENSG00000145700", "ENSG00000135441", "ENSG00000120071") res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=keys, seq_cluster="UNIREF90") ``` ```{r load_res_list90, echo = FALSE, results = 'asis'} res_list90 <- readRDS(system.file("extdata", "res_list90.rds", package="drugTargetInteractions")) ``` The following shows the first `data.frame` containing the ID mapping results. ```{r getUniprotIDs2, eval=TRUE, message=FALSE} library(DT) datatable(res_list90[[1]], options = list(scrollX=TRUE, scrollY="600px", autoWidth = TRUE)) ``` The following shows how to return the dimensions of the two `data.frames` and how to obtain the UniProt IDs as character vectors required for the downstream analysis steps. ```{r getUniprotIDs3, eval=TRUE} sapply(res_list90, dim, simplify=FALSE) sapply(names(res_list90), function(x) unique(na.omit(res_list90[[x]]$ID))) ``` ## BioMart's Paralogs The following obtains via `biomaRt` for a set of query genes the corresponding UniProt IDs as well as their paralogs. The query genes can be Gene Names or ENSEMBL Gene IDs from _H. sapiens_. The result is similar to IDMs and SSNNs from the `getUniprotIDs` function, but instead of UNIREF clusters, biomaRt's paralogs are used to obtain SSNNs. ```{r getParalogs1, eval=FALSE} queryBy <- list(molType="gene", idType="external_gene_name", ids=c("ANKRD31", "BLOC1S1", "KANSL1")) queryBy <- list(molType="gene", idType="ensembl_gene_id", ids=c("ENSG00000145700", "ENSG00000135441", "ENSG00000120071")) res_list <- getParalogs(queryBy) ``` ```{r getParalogs1_load, eval=TRUE,echo=FALSE} res_list <- readRDS(system.file("extdata", "paralogs1_res_list.rds", package="drugTargetInteractions")) ``` The following shows the first `data.frame` containing the ID mapping results. ```{r getParalogs2, eval=TRUE, message=FALSE} library(DT) datatable(res_list[[1]], options = list(scrollX = TRUE, scrollY="400px", autoWidth = TRUE)) ``` The following shows how to return the dimensions of the two `data.frames` and how to obtain the UniProt IDs as character vectors required for the downstream analysis steps. ```{r getParalogs3, eval=TRUE} sapply(res_list, dim, simplify=FALSE) sapply(names(res_list), function(x) unique(na.omit(res_list[[x]]$ID_up_sp))) ``` # Query Drug-Target Annotations The `drugTargetAnnot` function returns for a set of compound or gene/protein IDs the corresponding known drug-target annotation data available in ChEMBL. A related function called `getDrugTarget` is described in the subsequent subsection. This method generates very similar results, but uses internally pre-computed annotation summary tables which is less flexible than the usage of pure SQL statements. ## Using `drugTargetAnnot` The `drugTargetAnnot` function queries the ChEMBL database with SQL statements without depending on pre-computed annotation tables. ### Query with Compound IDs ```{r drugTargetQuery1cmp, eval=TRUE} queryBy <- list(molType="cmp", idType="chembl_id", ids=c("CHEMBL17", "CHEMBL19", "CHEMBL1201117", "CHEMBL25", "nomatch", "CHEMBL1742471")) qresult1 <- drugTargetAnnot(queryBy, config=config) ``` ```{r cmp_query_result1, eval=TRUE, message=FALSE} library(DT) datatable(qresult1, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE)) ``` ### Query with Protein IDs ```{r drugTargetQuery1protein, eval=TRUE} queryBy <- list(molType="protein", idType="UniProt_ID", ids=c("P43166", "P00915")) qresult2 <- drugTargetAnnot(queryBy, config=config) ``` ```{r protein_query_result1, eval=TRUE, message=FALSE} library(DT) datatable(qresult2, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE)) ``` ### Query with Gene IDs The following returns drug-target annotations for a set of query Ensembl gene IDs. For this the Ensembl gene IDs are translated into UniProt IDs using both the IDM and SSNN approaches. ```{r getUniprotIDs, eval=FALSE, message=FALSE} keys <- c("ENSG00000120088", "ENSG00000135441", "ENSG00000120071") res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=keys, seq_cluster="UNIREF90") ``` ```{r getUniprotIDs4, eval=TRUE, message=FALSE} id_list <- sapply(names(res_list90), function(x) unique(na.omit(res_list90[[x]]$ID))) ``` Next, drug-target annotations are returned for the Uniprot IDs with the IDM or SSNN methods. The following example uses the UniProt IDs of the SSNN method. Note, to include the upstream Ensembl gene IDs in the final result table, the below Ensembl ID collapse step via a `tapply` is necessary since occasionally UniProt IDs are assigned to several Ensembl gene IDs (_e.g._ recent gene duplications). ```{r drugTargetQuery2protein, eval=TRUE} queryBy <- list(molType="protein", idType="UniProt_ID", ids=id_list[[2]]) qresultSSNN <- drugTargetAnnot( queryBy, config=config) ensidsSSNN <- tapply(res_list90[[2]]$ENSEMBL, res_list90[[2]]$ID, paste, collapse=", ") qresultSSNN <- data.frame(Ensembl_IDs=ensidsSSNN[as.character(qresultSSNN$QueryIDs)], qresultSSNN) ``` ```{r protein_query_result2, eval=TRUE, message=FALSE} library(DT) datatable(qresultSSNN, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE)) ``` ## Using `getDrugTarget` The `getDrugTarget` function generates similar results as `drugTargetAnnot`, but depends on a pre-computed query table (here `drugTargetAnnot.xls`). ### Query with Compound IDs ```{r cmp_query2, eval=TRUE} id_mapping <- c(chembl="chembl_id", pubchem="PubChem_ID", uniprot="UniProt_ID", drugbank="DrugBank_ID") queryBy <- list(molType="cmp", idType="pubchem", ids=c("2244", "65869", "2244")) queryBy <- list(molType="protein", idType="uniprot", ids=c("P43166", "P00915", "P43166")) queryBy <- list(molType="cmp", idType="drugbank", ids=c("DB00945", "DB01202")) #qresult3 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17,39,46:53),config=config) qresult3 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17),config=config) ``` ```{r cmp_query_result2, eval=TRUE, message=FALSE} library(DT) datatable(qresult3, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE)) ``` ### Query with Protein IDs ```{r protein_query2, eval=TRUE} queryBy <- list(molType="protein", idType="chembl", ids=c("CHEMBL25", "nomatch", "CHEMBL1742471")) #qresult4 <- getDrugTarget(queryBy=queryBy, id_mapping, columns=c(1,5,8,16,17,39,46:52),config=config) qresult4 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17),config=config) ``` ```{r protein_query_result3, eval=TRUE, message=FALSE} library(DT) datatable(qresult4, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE)) ``` # Query Bioassay Data The `drugTargetBioactivity` function returns for a set of compound or gene/protein IDs the corresponding bioassay data available in ChEMBL. ## Query with Compound IDs Example query for compounds IDs. ```{r bioassayQuery1cmp, eval=TRUE, warning=FALSE} queryBy <- list(molType="cmp", idType="DrugBank_ID", ids=c("DB00945", "DB00316", "DB01050")) qresultBAcmp <- drugTargetBioactivity(queryBy, config=config) ``` ```{r cmpBA_query_result, eval=TRUE, message=FALSE} library(DT) datatable(qresultBAcmp, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE)) ``` ## Query with Protein IDs Example query for protein IDs. Note, the Ensembl gene to UniProt ID mappings are derived from above and stored in the named character vector `ensidsSSNN`. ```{r bioassayQuery1protein, eval=TRUE, warning=FALSE} queryBy <- list(molType="protein", idType="uniprot", ids=id_list[[1]]) qresultBApep <- drugTargetBioactivity(queryBy, config=config) qresultBApep <- data.frame(Ensembl_IDs=ensidsSSNN[as.character(qresultBApep$UniProt_ID)], qresultBApep) ``` ```{r pepBA_query_result, eval=TRUE, message=FALSE} library(DT) datatable(qresultBApep, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE)) ``` # Workflow to Run Everything This section explains how to run all of the above drug-target interaction analysis steps with a few convenience meta functions. Users mainly interested in generating analysis results quickly can focus on this section only. ## ID mapping The `getSymEnsUp` function returns for a query of gene or protein IDs a mapping table containing: ENSEMBL Gene IDs, Gene Names/Symbols, UniProt IDs and ENSEMBL Protein IDs. Internally, the function uses the `ensembldb` package. Its results are returned in a list where the first slot contains the ID mapping table, while the subsequent slots include the corresponding named character vectors: `ens_gene_id`, `up_ens_id`, and `up_gene_id`. Currently, the following query IDs are supported by `getSymEnsUp`: `GENE_NAME`, `ENSEMBL_GENE_ID` and `UNIPROT_ID`. ### Query with Gene Names ```{r gene_name, eval=TRUE, message=FALSE} gene_name <- c("CA7", "CFTR") idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=gene_name, idtype="GENE_NAME") ens_gene_id <- idMap$ens_gene_id ens_gene_id ``` ### Query with ENSEBML Gene IDs ```{r ens_gene_id, eval=TRUE, message=FALSE} ensembl_gene_id <- c("ENSG00000001626", "ENSG00000168748") idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=ensembl_gene_id, idtype="ENSEMBL_GENE_ID") ens_gene_id <- idMap$ens_gene_id ``` ### Query with UniProt IDs ```{r ens_uniprot_id, eval=TRUE, message=FALSE} uniprot_id <- c("P43166", "P13569") idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=uniprot_id, idtype="UNIPROT_ID") ens_gene_id <- idMap$ens_gene_id ``` ## Retrieve UniProt IDs The perfect match and nearest neighbor UniProt IDs can be obtained from UniProt's UNIREF cluster or BioMart's paralog annotations. ### UNIREF Cluster The corresponding IDM and SSNN UniProt IDs for the above ENSEMBL gene IDs are obtained with the `getUniprotIDs` function. This step is slow since the queries have to be performed with `chunksize=1` in order to reliably track the query ENSEMBL gene ID information in the results. ```{r res_list_ens_gene_id_uniref, eval=FALSE, message=FALSE} res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=names(ens_gene_id), seq_cluster="UNIREF90", chunksize=1) sapply(res_list90, dim) ``` ### BioMart Parlogs Here the corresponding perfect match (IDM) and paralog (SSNN) UniProt IDs for the above ENSEMBL gene IDs are obtained with the `getParalogs` function. The latter is much faster than `getUniprotIDs`, and also covers wider evolutionary distances. Thus, it may be the preferred methods for many use cases. ```{r res_list_ens_gene_id_paralog, eval=FALSE, message=FALSE} queryBy <- list(molType="gene", idType="ensembl_gene_id", ids=names(ens_gene_id)) res_list <- getParalogs(queryBy) ``` ```{r res_list_ens_gene_id_paralog_load , eval=TRUE,echo=FALSE} res_list <- readRDS(system.file("extdata", "paralogs2_res_list.rds", package="drugTargetInteractions")) ``` ```{r res_list_ens_gene_id_paralog_part2, eval=TRUE} sapply(res_list, dim) ``` ## Drug-Target Data Both drug-target annotation and bioassay data are obtained with a meta function called `runDrugTarget_Annot_Bioassay` that internally uses the main processing functions `drugTargetAnnot` and `drugTargetBioactivity`. It organizes the result in a list with the annotation and bioassay data (`data.frames`) in the first and second slot, respectively. Importantly, the results from the IDM and SSNN UniProt IDs are combined in a single table, where duplicated rows have been removed. To track in the result table, which method was used for obtaining UniProt IDs, an `IDM_Mapping_Type` column has been added. Note, the gene IDs in the SSNN rows have the string `Query_` prepended to indicate that they are not necessarily the genes encoding the SSNN UniProt proteins listed in the corresponding rows. Instead they are the genes encoding the query proteins used for searching for SSNNs. ```{r runDrugTarget_Annot_Bioassay, eval=TRUE, message=FALSE} drug_target_list <- runDrugTarget_Annot_Bioassay(res_list=res_list, up_col_id="ID_up_sp", ens_gene_id, config=config) sapply(drug_target_list, dim) ``` View content of annotation result slot: ```{r runDrugTarget_Annot_Bioassay_result1, eval=TRUE, message=FALSE} datatable(drug_target_list$Annotation, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE)) ``` View content of bioassay result slot (restricted to first 500 rows): ```{r runDrugTarget_Annot_Bioassay_result2, eval=TRUE, message=FALSE} datatable(drug_target_list$Bioassay[1:500,], options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE)) ``` ## Drug-Target Frequency The following generates a summary table containing drug-target frequency information. ```{r DrugTargetSummary, eval=TRUE, message=FALSE} df <- drug_target_list$Annotation df[,"GeneName"] <- gsub("Query_", "", as.character(df$GeneName)) stats <- tapply(df$CHEMBL_CMP_ID, as.factor(df$GeneName), function(x) unique(x)) stats <- sapply(names(stats), function(x) stats[[x]][nchar(stats[[x]]) > 0]) stats <- sapply(names(stats), function(x) stats[[x]][!is.na(stats[[x]])]) statsDF <- data.frame(GeneNames=names(stats), Drugs=sapply(stats, paste, collapse=", "), N_Drugs=sapply(stats, length)) ``` Print drug-target frequency table. ```{r DrugTargetSummary2, eval=TRUE, message=FALSE} datatable(statsDF, options = list(scrollX = TRUE, scrollY="150px", autoWidth = TRUE)) ``` ## Write Results to Tabular Files Both the annotation and bioassay data of the `drug_target_list` object can be exported to separate tabular files as follows. ```{r export_annotation_bioassay, eval=FALSE, message=FALSE} write.table(drug_target_list$Annotation, "DrugTargetAnnotation.xls", row.names=FALSE, quote=FALSE, na="", sep="\t") write.table(drug_target_list$Bioassay, "DrugTargetBioassay.xls", row.names=FALSE, quote=FALSE, na="", sep="\t") write.table(statDF, "statDF.xls", row.names=FALSE, quote=FALSE, na="", sep="\t") ``` # Session Info ```{r sessionInfo} sessionInfo() ``` # References