---
title: "Simple comparison between CTDquerier R package and CTDbase Batch Query web tool"
author:
- name: Carles Hernandez-Ferrer
  affiliation: ISGlobal, Centre for Research in Environmental Epidemiology ( CREAL )
- name: Juan R. Gonzalez
  affiliation: ISGlobal, Centre for Research in Environmental Epidemiology ( CREAL )
  email: juanr.gonzalez@isglobal.org
date: "`r doc_date()`"
package: "`r pkg_ver( 'CTDquerier' )`"
csl: biomed-central.csl
bibliography: case_study.bib
vignette: >
    %\VignetteIndexEntry{Simple comparison between CTDquerier R package and CTDbase Batch Query web tool}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
output:
    BiocStyle::html_document:
        toc_float: true
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

## The Comparative Toxicogenomics Database

The Comparative Toxicogenomics Database (*CTDbase*; http://ctdbase.org) is a public resource for toxicogenomic information manually curated from the peer-reviewed scientific literature, providing key information about the interactions of environmental chemicals with gene products and their effect on human disease [@CTDbase2003; @CTDbase2017].

## `CTDquerier` R package

`CTDquerier` is an R package that allows R users to download basic data from *CTDbase* about genes, chemicals and diseases. Once the user's input is validated, the package queries *CTDbase* to download the information for the given input from the other modules.


# Querying The Comparative Toxicogenomics Database

*CTDbase* offers a public web-based interface that includes basic and advanced query options to access data for sequences, references, and toxic agents, and a platform for sequence analysis.

## Keyword Search

To query *CTDbase* with a single term (i.e. a gene, a chemical or a disease), users can access the web portal and use the *keyword search*.

![The Comparative Toxicogenomics Database - Web Portal](img/01_CTDbase.png)

Looking up the associations in *CTDbase* for a set of ten genes of interest requires performing ten separate queries using this interface.

The summary page of the results obtained after searching for the term *XKR4* is shown below:

![The Comparative Toxicogenomics Database - Summary results for *XKR4*](img/02_XKR4_keyword.png)

## Batch Query

The **Batch Query** tool (http://ctdbase.org/tools/batchQuery.go) is provided by *CTDbase* and allows users to download custom data associated with a set of chemicals, diseases and genes, among others.

![The Comparative Toxicogenomics Database - Batch Query](img/03_BatchQuery.png)

Given a set of terms, the tool allows users to download (as `.tsv`, `.xml`, ...) curated or inferred data from *CTDbase* associated with the terms of interest. Table \@ref(tab:BatchQuery-data) indicates the type of data available depending on the input terms, where `C` stands for curated, `I` for inferred, `E` for enriched and `A` for all.

| Data Available/Input Data| Chemicals | Diseases | Genes |
|:-------------------------|:---------:|:--------:|:-----:|
|Chemical–gene interactions| C         |          | C     |
|Chemical associations     |           | A,C,I    | C     |
|Gene associations         | C         | A,C,I    | C     |
|Disease associations      | A,C,I     |          | A,C,I |
|Pathway associations      | I,E       | I        | C     |
|Gene Ontology associations| A,E       |          | A     |

: (\#tab:BatchQuery-data) Type of available data in Batch Query depending on type of input terms.
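
As a minimal sketch, the table above can be transcribed into an R matrix so that the available data types can be looked up programmatically. The object `bq_available` and the helper `bq_codes` are hypothetical names introduced here for illustration only; they are not part of `CTDquerier` or of *CTDbase*.

```{r batch_query_lookup}
# Transcription of the table above: rows are data types, columns are input types.
# Codes: C = curated, I = inferred, E = enriched, A = all; "" = not available.
bq_available <- matrix(
  c( "C",     "",      "C",
     "",      "A,C,I", "C",
     "C",     "A,C,I", "C",
     "A,C,I", "",      "A,C,I",
     "I,E",   "I",     "C",
     "A,E",   "",      "A" ),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c( "Chemical-gene interactions", "Chemical associations",
       "Gene associations", "Disease associations",
       "Pathway associations", "Gene Ontology associations" ),
    c( "Chemicals", "Diseases", "Genes" )
  )
)

# Hypothetical helper (not part of CTDquerier): returns the codes offered
# for a given data type and input type as a character vector.
bq_codes <- function( data_type, input_type ) {
  strsplit( bq_available[ data_type, input_type ], ",", fixed = TRUE )[[ 1 ]]
}

# Codes available when asking for disease associations of a gene
bq_codes( "Disease associations", "Genes" )
```

The `Genes` column is the one relevant for the rest of this document, where a gene is used as input and its chemical and disease associations are requested.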

The tables obtained by querying *CTDbase* using the **Batch Query** tool with the gene *XKR4*, asking for associated chemicals and associated diseases (curated, inferred and all), are included in the `CTDquerier` R package (queries performed 2018/JAN/02).

These four files can be loaded as follows:

```{r loading_xkr4_tables, warning=FALSE}
# The Batch Query results are shipped in the package's "extdata" folder;
# system.file() builds the platform-specific path from its components.

# Chemicals - XKR4
bq_xkr4_c <- system.file( "extdata", "bq_xkr4_chem.tsv", package = "CTDquerier" )
nrow( read.delim( bq_xkr4_c, sep = "\t" ) )
# Diseases curated - XKR4
bq_xkr4_dC <- system.file( "extdata", "bq_xkr4_disease_curated.tsv", package = "CTDquerier" )
nrow( read.delim( bq_xkr4_dC, sep = "\t" ) )
# Diseases inferred - XKR4
bq_xkr4_dI <- system.file( "extdata", "bq_xkr4_disease_inferred.tsv", package = "CTDquerier" )
nrow( read.delim( bq_xkr4_dI, sep = "\t" ) )
# Diseases all - XKR4
bq_xkr4_dA <- system.file( "extdata", "bq_xkr4_disease_all.tsv", package = "CTDquerier" )
nrow( read.delim( bq_xkr4_dA, sep = "\t" ) )
```

These files show that *XKR4* has, according to *CTDbase*, 18 curated associations with chemicals, 1 curated association with diseases, 1339 inferred associations with diseases and 1340 associations with diseases in total (including both curated and inferred). Note that these associations are not unique.

## `CTDquerier`

`CTDquerier` allows users to download the information associated with a single gene or a set of genes by using the function `query_ctd_gene`:

```{r ctdquerier_xkr4}
library( CTDquerier )
xkr4 <- query_ctd_gene( terms = "XKR4", verbose = TRUE )
xkr4
```


The query indicates that 25 gene-chemical interactions were downloaded from *CTDbase*. Taking a closer look at them, we see that they correspond to the 18 chemicals obtained from the **Batch Query** tool.

```{r compare_xkr4_chemicals}
# How many unique chemicals are there in the result object?
xkr4_chem <- get_table( xkr4, index_name = "chemical interactions" )
length( unique( xkr4_chem$Chemical.Name ) )

# How many of the chemicals in the Batch Query file were also downloaded using CTDquerier?
# (the second column of the Batch Query table is compared against the chemical names)
bq_xkr4_c <- read.delim( bq_xkr4_c, sep = "\t" )
sum( as.character( bq_xkr4_c[ , 2] ) %in% unique( xkr4_chem$Chemical.Name ) )
```

Regarding disease associations, the data retrieved for *XKR4* with `CTDquerier` indicates that there are 762 gene-disease associations.

```{r}
dim( get_table( xkr4, index_name = "diseases" ) )
```

These 762 gene-disease associations correspond to the 1340 obtained from **Batch Query** once they are filtered by unique disease:

```{r}
bq_xkr4_dA <- read.delim( bq_xkr4_dA, sep = "\t" )
length( unique( bq_xkr4_dA$DiseaseID ) )

sum( as.character( unique( bq_xkr4_dA$DiseaseID ) ) %in% 
    get_table( xkr4, index_name = "diseases" )$Disease.ID )
```

The difference in the number of associations between the results obtained from **Batch Query** and from `CTDquerier` is due to the way the chemicals are nested in the two tables. In the results from **Batch Query** there is one row for each association:

```{r}
bq_xkr4_dA[1:3, ]
```

In the results from `CTDquerier`, by contrast, there is a single entry per disease instead of one for each disease-chemical pair seen in the previous table from **Batch Query**. For example, in the results from `CTDquerier` there is a single entry for *Abdominal Pain*, which holds the three chemicals as a single string in the column `Inference.Network`:

```{r}
tbl <- get_table( xkr4, index_name = "diseases" )
tbl[ tbl$Disease.ID == "MESH:D015746", "Inference.Network" ]
```
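
The relationship between the two layouts can be sketched with a small toy example. The identifiers and chemical names below are made up for illustration (they are not *CTDbase* records), and the separator used to join the chemicals is an arbitrary choice that may differ from the one used in `Inference.Network`. The sketch only shows how a table with one row per disease-chemical pair collapses into a table with one row per disease.

```{r nesting_sketch}
# Toy data only: hypothetical rows mimicking the long Batch Query layout,
# with one row per disease-chemical pair.
toy_long <- data.frame(
  DiseaseID    = c( "TOY:D1", "TOY:D1", "TOY:D1", "TOY:D2" ),
  ChemicalName = c( "chemical A", "chemical B", "chemical C", "chemical A" ),
  stringsAsFactors = FALSE
)

# Collapse to one row per disease, joining its chemicals into a single string
# (the ", " separator is an assumption made for this sketch).
toy_nested <- aggregate(
  ChemicalName ~ DiseaseID, data = toy_long,
  FUN = function( x ) paste( unique( x ), collapse = ", " )
)
colnames( toy_nested )[ 2 ] <- "Inference.Network"

nrow( toy_long )    # rows in the long layout
nrow( toy_nested )  # rows after nesting, one per disease
toy_nested
```

Counting rows in the long layout therefore yields a larger number than counting rows in the nested layout, even though both describe the same set of diseases.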

# Session Info.

```{r sessionInfo, echo=FALSE}
sessionInfo()
```

# Bibliography