--- title: "Case study on Environmental Chemicals and asthma-related genes" author: - name: Carles Hernandez-Ferrer affiliation: ISGlobal, Centre for Research in Environmental Epidemiology ( CREAL ) - name: Juan R. Gonzalez affiliation: ISGlobal, Centre for Research in Environmental Epidemiology ( CREAL ) email: juanr.gonzalez@isglobal.org date: "`r doc_date()`" package: "`r pkg_ver( 'CTDquerier' )`" csl: biomed-central.csl bibliography: case_study.bib vignette: > %\VignetteIndexEntry{Case study on Environmental Chemicals and asthma-related genes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: toc_float: true --- # `CTDquerier` R package `CTDquerier` is an R package that allows R users to download basic data from *CTDbase* about genes, chemicals and diseases. First, the input provided by user is validated in *CTDbase* vocabulary files. Second, the validated results are used to query *CTDbase* and their data are downloaded into the R session. The package can be installed using `biocLite` function from Biocondcutor by executing: ```{r eval=FALSE} source("https://bioconductor.org/biocLite.R") biocLite("CTDquerier") ``` # Case-Study aim The case-study is based on the results obtained in the article entitled "Case-control admixture mapping in Latino populations enriches for known asthma-associated genes", from *Torgerson et. al.* [@Torgerson2012]. The genes from **Table E1** of that paper have been collected and used as starting point for this document. The aim of this case study is to illustrate how `CTDquerier` can be used to provide further evidences about a given hypothesis or how to provide biological insights from genetic, toxicological or environmental epidemiological studies. In our case, we are using a real example where the authors are interested in providing a set of genes associated with asthma. By using `CTDquerier` we will provide information about, among others, what is the level of curated evidence, from *CTDbase*, of the association between those genes and asthma or which chemicals are relaqted to this list of genes and asthma according to *CTDbase*. # Case-Study description Asthma is a common long term inflammatory disease of the airways of the lungs. It is characterized by variable and recurring symptoms, reversible airflow obstruction, and bronchospasm. Symptoms include episodes of wheezing, coughing, chest tightness, and shortness of breath. Asthma is thought to be caused by a combination of genetic and environmental factors. Environmental factors include exposure to air pollution and allergens. Diagnosis is usually based on the pattern of symptoms, response to therapy over time, and spirometry. Asthma is classified according to the frequency of symptoms, forced expiratory volume in one second, and peak expiratory flow rate. It may also be classified as atopic or non-atopic where atopy refers to a predisposition toward developing a type 1 hypersensitivity reaction. The Genetics of Asthma in Latino Americans (GALA) study [@Price2009] includes subjects with asthma and their biological parents recruited from schools, clinics, and hospitals at 4 sites: San Francisco Bay Area, New York City, Puerto Rico, and Mexico City. Patients were contacted to participate in the study if approved by their primary physician. On the basis of interviews and questionnaire data, subjects were included in the study if they were between the ages of 8 and 40 years with physician-diagnosed mild to moderate-to-severe asthma and had experienced 2 or more symptoms in the previous 2 years at the time of recruitment (including wheezing, coughing, and/or shortness of breath) and if both parents and 4 sets of grandparents self-identified as being of either Puerto Rican or Mexican ethnicity. After quality control [@Torgerson2012], the total number of SNPs included in the study was 729,685, and the total number of subjects was 529 children with asthma (253 Mexican and 276 Puerto Rican subjects) and 347 control subjects (158 Mexican and 189 Puerto Rican subjects). The authors aim to identify novel asthma-related genes in Latino populations by using case-control admixture mapping. By comparing local ancestry between cases and control subjects the authors identified 62 admixture mapping peaks (29 in Puerto Ricans and 33 in Mexicans). They also observed a significant enrichment of previously identified asthma-associated genes within the 62 admixture mapping peaks (P = .0051). The information about all the genes in those candidate regions are provided in **Table E1** of the paper [@Torgerson2012]. # Getting data (GALA genes) The **Table E1** from the GALA Study [@Torgerson2012] was downloaded and converted into a `.csv` file. This table is included in the package and can be loaded into R by: ```{r gala_read_csv} table_e1_csv <- system.file( paste0( "extdata", .Platform$file.sep, "gala_table_e1.csv" ), package="CTDquerier" ) table_e1 <- read.csv( table_e1_csv, stringsAsFactors = FALSE ) ``` The genes of interest can be retrieve from the colum *Genes* of this table. Let us start by removing regions without annotated genes. ```{r gala_remove_na_file} dim( table_e1 ) table_e1 <- table_e1[ table_e1$Genes != "NA ", ] dim( table_e1 ) ``` We have filtered 17 regions that had no genes. Now we will get the gene list by: ```{r gala_create_list} gala_genes <- trimws( unlist( strsplit( table_e1$Genes, "," ) ) ) length( gala_genes ) gala_genes[1:15] ``` NOTE that each region can have more than one gene that is separated by `,`. We end up with a total of 305 candidate genes related to asthma as proposed by *Torgerson et. al.* [@Torgerson2012]. Let us call to this list `GALA genes`. # Querying GALA genes in *CTDbase* Any type of enrichment analysis of a given list of genes should start by querying *CTDabse*. This can be done by using the function `query_ctd_gene`. The function validates the list of genes into *CTDbase gene-vocabulary*. Then the function performs a different query per gene and download the associated information into your computer. In our example this is performed by: ```{r load_ctdquerier, message=FALSE} library( CTDquerier ) ``` ```{r gale_query, eval=FALSE} gala <- query_ctd_gene( terms = gala_genes, verbose = TRUE ) ``` Since this process can be high time-consuming, `CTDquerier` package already contains the result of the query encapsulated in an object that can be loaded into your R session with: ```{r gala_data} data( gala, package = "CTDquerier" ) gala ``` # Questions that can be answered from a gene list Now, let us illustrate the type of questions that can be addressed by using `CTDquerier` when the user wants to provide biological insights from a gene list. In our case, we are interested in providing existing literature evidences (annotated in *CTDbase*) that our GALA genes are enriched in chemical or genetic processes realted to asthma. ## How many GALA genes are in *CTDbase*? Recall that our genes of interest are stored in the vector `gala_genes` (305 genes) and that the result obtained from querying *CTDbase* with those genes is stored in the object `gala` (258 terms). This means that 47 genes from the GALA study were not present in *CTDbase*. This information can be visually inspect by using the method `plot` of a `CTDdata` object. ```{r gala_plot_query, message=FALSE} library( ggplot2 ) plot( gala ) + ggtitle( "Lost & Found Genes from GALA Study" ) ``` The names of the *lost* genes from GALA study can be obtained using the method `get_terms`: ```{r gala_lost} get_terms( gala )[[ "lost" ]] ``` ## How many diseases are associated with GALA genes according to *CTDbase*? The call to `query_ctd_gene` done in previous section has downloaded "all the information available" in *CTDbase* of GALA genes. We can inspect the information available just printing our object of class `CTDdata`: ```{r gala_show_2} gala ``` The line starting by `#Diseases` indicates the number of relationships between a gene (or chemical) and diseases. For the GALA genes, there are a total of 223,462 relations between our genes of interest and diseases. The table with the relations between genes and diseases can be obtained using the method `get_table`. ```{r gala_gda_all} gala_all_diseases <- get_table( gala, index_name = "diseases" ) colnames( gala_all_diseases ) dim( gala_all_diseases ) ``` We observe that the table with the relations between gene and diseases has close to ~220K rows. This table provides with all the relations that link GALA genes with diseases, including the **curated** and the **inferred** ones. We can also check that the number of genes used to create this table are the correct ones: ```{r gala_disease_genes} length( unique( gala_all_diseases$GeneSymbol ) ) sum( get_terms( gala )[[ "found" ]] %in% unique( gala_all_diseases$GeneSymbol ) ) sum( !get_terms( gala )[[ "found" ]] %in% unique( gala_all_diseases$GeneSymbol ) ) ``` We can also check the name of the genes that do not contributed to create the table of associations between genes and diseases: ```{r gala_diseases_no_genes} get_terms( gala )[[ "found" ]][ !get_terms( gala )[[ "found" ]] %in% unique( gala_all_diseases$GeneSymbol ) ] ``` Finally, the answer to our question of interest can be obtained by: ```{r gala_diseases_unique} length( unique( gala_all_diseases$Disease.Name ) ) ``` This is the number of unique diseases associated to the GALA genes using all the relations. If we are interested in getting only the associations between the GALA genes and the diseases, understood as the **curated** relations: ```{r gala_diseases_curated} gala_all_diseases_cu <- gala_all_diseases[ !is.na( gala_all_diseases$Direct.Evidence ), ] gala_all_diseases_cu <- gala_all_diseases_cu[ gala_all_diseases_cu$Direct.Evidence != "", ] dim( gala_all_diseases_cu ) length( unique( gala_all_diseases_cu$Disease.Name ) ) ``` So, in summary: There are 301 diseases linked to GALA genes, supported by 436 curated associations. ## What is the level of evidence of the association between GALA genes and asthma according to *CTDbase*? In order to answer this question we first need to get `"Asthma"` in the column containing disease information in the table of associations between genes and diseases that we have previously created. This can be performed by executing: ```{r gala_diseases_asthma} gala_asthma <- gala_all_diseases[ gala_all_diseases$Disease.Name == "Asthma" , ] dim( gala_asthma ) ``` The subsetting resulted in a table with 209 entries. The firts column to inspect is the `Direct.Evidene` that provides a clue of the relations that are **curated** and the ones that are **inferred**. ```{r gala_diseases_asthma_direct} sum( gala_asthma$Direct.Evidence != "" & !is.na( gala_asthma$Direct.Evidence ) ) ``` So, only two of the 214 relations are curated associatins. The other 212 (92 + 120) are inferred. for the ones that have no direct evidence, *CTDbase* informs about the level of evidence of the relation between a gene and a diseasewith the columns `Inference.Score` and `Reference.Count`. The first contains the *CTDbase* inference-score of gene-disease association in base of a series of criterion defined by *CTDbase*'s developers while the second are just the number of bibliographical references enforcing the relation. Then, the mean of the *Inference Score* can be interpreted as the level of evidence of the association between GALA genes and asthma according to *CTDbase*: ```{r gala_diseases_asthma_evidence} mean( gala_asthma$Inference.Score, na.rm = TRUE ) ``` This index provides evidences about the degree of similarity between CTD chemical–gene–disease networks and a similar scale-free random network. The higher the score, the more likely the inference network has atypical connectivity. In other words, this large score indicates that the network of GALA genes with asthma is real and not build by chance (http://ctdbase.org/help/chemDiseaseDetailHelp.jsp). We can also provide the total number of papers relating GALA genes with asthma: ```{r gala_diseases_asthma_reference} sum( gala_asthma$Reference.Count, na.rm = TRUE ) ``` We can provide the *Inference Score* of each gene of interest with asthma by: ```{r gala_diseases_asthma_evidence_plot} plot( gala, index_name = "disease", subset.disease = "Asthma", filter.score = 20 ) + ggtitle( "Evidence of the association between GALA genes and Asthma" ) ``` Note that here only genes having an *Inference Score* higher that 20 are plotted (argument `filter.score`) ## Which chemicals are associated with GALA genes according to *CTDbase*? The object `gala` has a total of `r nrow(gala@chemicals_interactions)` gene-chemical interaction registers. We can obtain this table using the `get_table` method: ```{r gala_chemicals} gala_chem <- get_table( gala, index_name = "chemical interactions" ) colnames( gala_chem ) length( unique( gala_chem$Chemical.Name ) ) ``` This output is indicating that GALA genes are *associated* with `r length( unique( gala_chem$Chemical.Name ) )` unique chemicals according to *CTDabse*. The distribution of the number of *associations* (`Reference.Count` column) can be obtained by: ```{r, gala_chemicals_table, results="asis"} knitr::kable( t( table( gala_chem$Reference.Count ) ) ) ``` This information can also be obtained in a heat-map plot. The arguments `subset.genes` and `subset.chemicals` are used to filter the elements in X-axis (genes) and Y-axis (chemicals). The argument `filter.score` allows to filter by minimum number of existing papers (variable `Reference.Count`) providing evidences about the association between chemicals and genes. ```{r, gala_chemicals_plot} plot( gala, index_name = "chemical interactions", filter.score = 6 ) ``` # Questions that can be answered from a given disease In that case we are interested in knowing genes or chemicals that have been associated with our disease (or group of diseases) of interest ## Which genes are associated with asthma according to *CTDbase*? The way we can obtain all the information related to a disease from *CTDbase* is similar to the one we got the information of genes. In that case the function to be used is `query_ctd_dise`: ```{r ctd_asthma} asthma <- query_ctd_dise( terms = "Asthma", verbose = TRUE ) ``` As we can see from the log obtained when `verbose` is set to `TRUE`, the information downloaded from *CTDabse* by `CTDquerier` for diseases is composed by disease-gene interactions, disease-chemical interactions and pathway interactions (KEGG). ```{r ctd_asthma_show} asthma ``` To know the exact number of genes related to Asthma according to *CTDbase* we extract the proper table and count the unique genes that appear on it. ```{r ctd_asthma_n_genes} ctd_asthma <- get_table( asthma, index_name = "gene interactions" ) length( unique( ctd_asthma$Gene.Symbol ) ) ``` We can also look for teh curated associations between genes and Asthma: ```{r ctd_asthma_n_genes_curated} sum( !is.na( ctd_asthma$Direct.Evidence ) & ctd_asthma$Direct.Evidence != "" ) ``` According to *CTDbase*, there are `r length( unique( ctd_asthma$Gene.Symbol ) )` genes related to asthma and `r sum( !is.na( ctd_asthma$Direct.Evidence ) & ctd_asthma$Direct.Evidence != "" )` of them are curated associations. However the term `"Asthma"` may appear in different ways. Here we illustrate how to get the real number of genes related with Asthma. ```{r ctd_asthma_table, results="asis"} library( knitr ) tt <- as.data.frame( table( ctd_asthma$Disease.Name ) ) colnames( tt ) <- c( "Disease", "Frequency" ) kable( tt[ order( tt$Frequency, decreasing = TRUE ), ] ) ``` Hence, the *true* answer is `r tt[ order( tt$Frequency, decreasing = TRUE ), 2][1]` gene-diseases relations and not `r, length( unique( ctd_asthma$Gene.Symbol ) )`. ## Which chemicals are associated with asthma according to *CTDbase*? First, let us retrive the chemicals related to asthma from `asthma` object ```{r ctd_asthma_chem} ctd_asthma_chem <- get_table( asthma, index_name = "chemical interactions" ) colnames( ctd_asthma_chem ) length( unique( ctd_asthma_chem$Chemical.Name ) ) ``` Hence, the answer is `r length( unique( ctd_asthma_chem$Chemical.Name ) )`. Now lets see how many of these relations are curated associations: ```{r ctd_asthma_chem_cur} sum( !is.na( ctd_asthma_chem$Direct.Evidence ) & ctd_asthma_chem$Direct.Evidence != "" ) ``` Visualization can be perform trough `Inference.Score` or `Reference.Count` plot (for both curated and inferred relations). Let us assume that we are interested in providing information about the chemicals that are associated with GALA genes having an score larger than 30. The plot can be created by: ```{r ctd_asthma_plot} plot( asthma, index_name = "chemical interactions", subset.disease = "Asthma", filter.score = 30 ) + ggtitle( "Evidence of the association between GALA genes and chemicals" ) ``` # Questions relating the triad: chemicals, genes and disease of interest In some occassions the user may be interested in determining how chemicals, genes and a given disease have been associated in the literature. ## Which chemicals are associated to both GALA genes and asthma according to *CTDbase*? The table `gala_chem` contains the relation between GALA genes and the chemicals associated to these genes according to *CTDbase*. The table `ctd_asthma_chem` contains the chemicals related to Asthma according to *CTDbase*. Therefore, to find the chemicals that are related to both GALA genes and asthma, according to *CTDbase* we get the intersection of both tables. This can be performed by ```{r intersect_gala_asthma_chem_1} intr_chem <- intersect( gala_chem$Chemical.Name, ctd_asthma_chem$Chemical.Name ) length( intr_chem ) ``` So, there are `r length( intr_chem )` chemicals sharing related with GALA genes and asthma at the same time. Next code can be used to quantify the overlap. ```{r intersect_gala_asthma_chem_2} length( intr_chem ) / nrow( gala_chem ) * 100 length( intr_chem ) / nrow( ctd_asthma_chem ) * 100 ``` ```{r intersect_gala_asthma_chem_2_temp, echo=FALSE} p1 <- round(length( intr_chem ) / nrow( gala_chem ) * 100, 2) p2 <- round(length( intr_chem ) / nrow( ctd_asthma_chem ) * 100, 2) ``` That is, the `r p1`% of the total number of associations of GALA genes with chemicals are shared with the observed associations of asthma with chemicals. On the other hand, the `r p2`% of the chemicals associated with asthma are shared with the chemicals associated with GALA genes. On the other and the curated chemical associations on both sides of GALA genes and Asthma can be found as: ```{r intersect_gala_asthma_chem_cur_1} a <- ctd_asthma_chem$Chemical.Name[ !is.na( ctd_asthma_chem$Direct.Evidence ) & ctd_asthma_chem$Direct.Evidence != "" ] intr_chem <- intersect( gala_chem$Chemical.Name, a ) length( intr_chem ) ``` It can be really useful to have a `data.frame` with three columns: chemical names, and the number of references in the literature associated with GALA genes and asthma. This table can be retrieve by: ```{r intersect_gala_asthma_chem_plot_1} gala_chem_r <- gala_chem[ gala_chem$Chemical.Name %in% intr_chem, ] gala_chem_r <- gala_chem_r[ !duplicated( gala_chem_r$Chemical.Name ), ] ctd_asthma_chem_r <- ctd_asthma_chem[ ctd_asthma_chem$Chemical.Name %in% intr_chem, ] ctd_asthma_chem_r <- ctd_asthma_chem_r[ !duplicated( ctd_asthma_chem_r$Chemical.Name ), ] dta <- merge( gala_chem_r[ , c( "Chemical.Name", "Reference.Count" ) ], ctd_asthma_chem_r[ , c( "Chemical.Name", "Reference.Count" ) ], by = "Chemical.Name" ) colnames( dta ) <- c( "Chemical.Name", "Reference.Gala", "Reference.Asthma" ) dta <- dta[ order( dta$Reference.Gala, dta$Reference.Asthma, decreasing = TRUE ), ] dta[1:5, ] ``` Having this `data.frame`, we can use the function `leaf_plot` to compare the number of references in the gene list and in asthma for the top 25 chemicals by: ```{r intersect_gala_asthma_chem_plot_2} leaf_plot( dta[1:25, ], label = "Chemical.Name", valueLeft = "Reference.Gala", valueRight = "Reference.Asthma", titleLeft = "GALA", titleRight = "Asthma" ) ``` # Questions related to enrichment analysis Finally, one may be interested in providing statistical evidences about whether a gene list is enriched in the genes that have been described in the literature to be associated with our disease of interest or in a given chemical that is relevant in to our health problem. This can be answered by using the method `enrich`. This method performs a Fisher test of enrichment given two objects of class `CTDdata`. For this analysis we need to provide to method `enrich` with a gene universe. `CTDquerier` provides a gene universe obtained from HGNC (HUGO Gene Nomenclature Committee) at date 2018/01/10 ([https://www.genenames.org/cgi-bin/download](https://www.genenames.org/cgi-bin/download)). The universe can be loaded with: ```{r load_hugo} hgnc_universe <- read.delim( system.file( "extdata", "HGNC_Genes.tsv", package="CTDquerier" ), sep = "\t", stringsAsFactor = FALSE ) ``` ## Are the GALA genes significantly associated with asthma according to *CTDabse*? The `enrich` method allows to use all relations between genes and asthma or only the curated ones. By setting the argument `use` to `"all", all infered and curated relations are used to test for enrichment: ```{r gala_enrich_asthma_all} enrich( gala, asthma, use = "all", universe = hgnc_universe$Approved.Symbol ) ``` The function `enrich` takes the genes in the `gala` object (e.g. target genes), the genes in the `asthma` object (e.g. gene set/pathway) and performs a Fisher test (see `fisher.test`) using the provided gene universes. We can observe that our GALA genes are enriched in those genes that are described in the literature to be associated with asthma (p < 2.2e-16). Alternatively, using only the curated associations we see no enrichment: ```{r gala_enrich_asthma_curated} enrich( gala, asthma, universe = hgnc_universe$Approved.Symbol, use = "curated" ) ``` ## Are the GALA genes significantly enriched in Air Pollutants chemicals according to *CTDabse*? First of all we need to obtain the information about air pollutants from *CTDbase*. This is done by using the function `query_ctd_chem`. ```{r air_ctd} air <- query_ctd_chem( terms = "Air Pollutants" ) air ``` Once we obtained all the information about air pollutants from *CTDbase* the enrichment analysis can also be performed by using the method `enrich` by changing our gene set of interest. In this case our second set corresponds to the genes related to air pollutants. ```{r gala_enrich_air} enrich( gala, air, universe = hgnc_universe$Approved.Symbol, use = "all" ) ``` We observe that the chemicals associated with GALA genes are not enriched in toxics belonging to Air Pollution group. # Session Info. ```{r sessionInfo, echo=FALSE} sessionInfo() ``` # Bibliography