Exposome Data Integration with Omic Data

Carles Hernandez-Ferer and Juan R. Gonzalez

4 January 2019

Abstract

This is an introductory guide to the integration analysis of exposome and omics data with the R package omicRexposome. The document illustrates two types of analysis: 1) association analysis, performed between the exposome and a single omic data-set; and 2) integration analysis, where multiple data-sets, including exposome data, are analysed at the same time.

Package

omicRexposome 1.4.1

1 Introduction
2 Analysis
- 2.1 Association Studies
  - 2.1.1 Exposome - Transcriptome Data Association
  - 2.1.2 Exposome - Proteome Data Association
- 2.2 Integration Analysis
Session info

1 Introduction

omicRexposome is an R package designed to work jointly with rexposome. The aim of omicRexposome is to perform analyses that join exposome and omic data in order to find the relationship between a single exposure or a set of exposures (external exposome) and the behavior of a gene, a group of CpGs, the level of a protein, etc. It also provides a series of tools to analyse exposome and omic data using standard methods from Bioconductor.

1.1 Installation

omicRexposome is currently in development and is available from neither CRAN nor Bioconductor. Nevertheless, the package can be installed with the devtools R package, taking the source from the Bioinformatic Research Group in Epidemiology’s GitHub repository.

This can be done by opening an R session and typing the following code:

devtools::install_github("isglobal-brge/omicRexposome")

Note that this command does not install the package’s dependencies.

1.2 Pipeline

Two different types of analyses can be done with omicRexposome:

Analysis	`omicRexposome` function
Association Study	`association`
Integration Study	`crossomics`

Both association and integration studies are based on objects of class MultiDataSet. A MultiDataSet object is a container for multiple layers of sample information. Once the exposome data and the omics data are encapsulated in a MultiDataSet, the object can be used for both association and integration studies.

The method association requires a MultiDataSet object having two types of information: the exposome data from an ExposomeSet object and omic information from objects of class ExpressionSet, MethylationSet, SummarizedExperiment or others. ExposomeSet objects encapsulate exposome data and are created with the functions read_exposome and load_exposome from the rexposome R package. The method crossomics expects a MultiDataSet with any number of different data-sets (at least two). Unlike association, crossomics does not require an ExposomeSet.

1.3 Exposome and Omic Data

In order to illustrate the capabilities of omicRexposome and the exposome-omic analysis pipeline, we will use the data from the brgedata package. This package includes several omic data-sets (methylation, transcriptome and proteome) and an exposome data-set.

2 Analysis

omicRexposome and MultiDataSet R packages are loaded using the standard library command:

library(omicRexposome)
library(MultiDataSet)

2.1 Association Studies

The association studies are performed using the method association. This method requires at least four arguments:

Argument object should be filled with a MultiDataSet object.
Argument formula should be filled with an expression containing the covariates used to adjust the model.
Argument expset should be filled with the name that the exposome-set receives in the MultiDataSet object.
Argument omicset should be filled with the name that the omic-set receives in the MultiDataSet object.

The argument formula should follow the pattern ~sex+age. The method association will fill in the formula, placing each of the exposures in the ExposomeSet between ~ and the covariates sex+age.
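
As an illustration of this pattern, the following base R sketch builds one model formula per exposure from a covariate-only formula. The helper build_design_formula is hypothetical and only mimics the behaviour described above; it is not the internal code of association.

```r
# Covariate-only formula, as it would be passed to the argument formula.
covariates <- ~ Sex + Age

# A few exposure names from the example data (any character vector works).
exposures <- c("BPA_p", "NO2_p", "PCB118")

# Hypothetical helper: place one exposure between "~" and the covariates.
build_design_formula <- function(exposure, covariates) {
    rhs <- paste(deparse(covariates[[2]]), collapse = " ")
    stats::as.formula(paste("~", exposure, "+", rhs))
}

# One formula per exposure, named after the exposure it tests.
formulas <- lapply(exposures, build_design_formula, covariates = covariates)
names(formulas) <- exposures
formulas[["BPA_p"]]
```

Each element of formulas can then be handed to stats::model.matrix to obtain a design matrix for that exposure.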

association implements the limma pipeline using lmFit and eBayes on top of the extraction methods from MultiDataSet. The method takes care of the missing data in exposures, outcomes and omics data, and subsets both data-sets, exposome data and omic data, by common samples. The argument method selects the fitting method used in lmFit. By default it takes the value "ls" for least squares, but it can also take "robust" for robust regression.

The following subsections illustrate the usage of association with two types of omics data: transcriptome and proteome.
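
The handling of missing data and common samples can be sketched in base R on simulated data. This is only an approximation under stated assumptions: all object names are hypothetical, the data are random, and stats::lm.fit gives plain per-feature least squares, whereas association relies on lmFit and eBayes from limma.

```r
set.seed(1)

# Simulated omic matrix: 5 features (rows) by 20 samples (columns).
omic <- matrix(
    stats::rnorm(5 * 20), nrow = 5,
    dimnames = list(paste0("gene", 1:5), paste0("x", 1:20))
)

# Simulated exposure and covariates for a partly different set of samples.
expo <- data.frame(
    BPA_p = stats::rnorm(20),
    Sex = rep(c("F", "M"), 10),
    Age = round(stats::runif(20, 4, 6), 1),
    row.names = paste0("x", 5:24)
)
expo["x7", "BPA_p"] <- NA  # one missing exposure value

# Drop samples with missing values, then keep samples present in both sets.
expo_complete <- expo[stats::complete.cases(expo), , drop = FALSE]
common <- intersect(colnames(omic), rownames(expo_complete))
omic_sub <- omic[, common, drop = FALSE]
expo_sub <- expo_complete[common, , drop = FALSE]

# One least-squares fit per feature: samples in rows, features in columns.
design <- stats::model.matrix(~ BPA_p + Sex + Age, data = expo_sub)
fit <- stats::lm.fit(x = design, y = t(omic_sub))
fit$coefficients["BPA_p", ]
```

The last line returns one exposure coefficient per simulated feature.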

2.1.1 Exposome - Transcriptome Data Association

First we get the exposome data from the brgedata package, which we will use throughout this section.

data("brge_expo", package = "brgedata")
class(brge_expo)

## [1] "ExposomeSet"
## attr(,"package")
## [1] "rexposome"

The aim of this analysis is to perform an association test between the gene expression levels and the exposures. The first step is therefore to obtain the transcriptome data from the brgedata package.

data("brge_gexp", package = "brgedata")

The association studies between exposures and transcriptome are done in the same way as those with methylome data. The method used is association, which takes as input a MultiDataSet object containing both exposome and expression data.

mds <- createMultiDataSet()
mds <- add_genexp(mds, brge_gexp)
mds <- add_exp(mds, brge_expo)

gexp <- association(mds, formula=~Sex+Age, 
    expset = "exposures", omicset = "expression")

We can look at the number of hits and the lambda score of each analysis with the methods tableHits and tableLambda.

hit <- tableHits(gexp, th=0.001)
lab <- tableLambda(gexp)
merge(hit, lab, by="exposure")

##    exposure hits    lambda
## 1     BPA_p   19 0.9072377
## 2    BPA_t1   27 0.8807316
## 3    BPA_t3   56 0.9391129
## 4     Ben_p   19 0.8013466
## 5    Ben_t1   12 0.8234104
## 6    Ben_t2    9 0.8393350
## 7    Ben_t3   21 0.8301203
## 8     NO2_p   32 1.0281960
## 9    NO2_t1   16 0.7942881
## 10   NO2_t2   35 1.1482314
## 11   NO2_t3   31 0.8770931
## 12   PCB118   59 0.9308472
## 13   PCB138   38 1.0726221
## 14   PCB153   51 1.1743989
## 15   PCB180   17 0.9790750
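
To make the two columns of such a table more concrete, the following base R sketch computes a hit count and a genomic inflation factor from a vector of P-values. The functions count_hits and inflation_factor are hypothetical helpers using a common definition of lambda (median chi-squared statistic over its expected median); tableHits and tableLambda may differ in their details.

```r
# Simulated P-values standing in for the per-feature results of one exposure.
set.seed(42)
pvalues <- stats::runif(5000)

# Number of features with a P-value below the threshold.
count_hits <- function(p, th = 0.001) {
    sum(p < th, na.rm = TRUE)
}

# Genomic inflation factor: observed median chi-squared statistic (1 df)
# divided by the median expected under the null hypothesis.
inflation_factor <- function(p) {
    chisq <- stats::qchisq(p, df = 1, lower.tail = FALSE)
    stats::median(chisq, na.rm = TRUE) / stats::qchisq(0.5, df = 1)
}

count_hits(pvalues, th = 0.001)
inflation_factor(pvalues)
```

A lambda close to one suggests well-calibrated P-values, while values below one point to deflation and values above one to inflation.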

Since most of the models have a lambda under one, we should consider using Surrogate Variable Analysis. This can be done with the same association method by setting the argument sva to "fast", so that the pipeline of the isva and SmartSVA R packages is applied. If sva is set to "slow", the pipeline applied is the one from the sva R package.

gexp <- association(mds, formula=~Sex+Age, 
    expset = "exposures", omicset = "expression", sva = "fast")

We can re-check the results by creating the same table as before:

hit <- tableHits(gexp, th=0.001)
lab <- tableLambda(gexp)
merge(hit, lab, by="exposure")

##    exposure hits    lambda
## 1     BPA_p   50 0.9874152
## 2    BPA_t1   51 0.9453795
## 3    BPA_t3   61 0.9842216
## 4     Ben_p   76 1.0117733
## 5    Ben_t1   64 1.0115515
## 6    Ben_t2   71 1.0089834
## 7    Ben_t3   59 0.9969123
## 8     NO2_p   78 1.0117151
## 9    NO2_t1   68 1.0056950
## 10   NO2_t2   69 1.0210000
## 11   NO2_t3   49 0.9802407
## 12   PCB118  129 1.0518170
## 13   PCB138   67 1.0094139
## 14   PCB153   58 0.9924482
## 15   PCB180   67 0.9973381

Objects of class ResultSet have a method called plotAssociation that creates QQ plots, another useful way to see whether there is inflation or deflation in the P-values.

gridExtra::grid.arrange(
    plotAssociation(gexp, rid="Ben_p", type="qq") + 
        ggplot2::ggtitle("Transcriptome - Ben_p Association"),
    plotAssociation(gexp, rid="BPA_p", type="qq") + 
        ggplot2::ggtitle("Transcriptome - BPA_p Association"),
    ncol=2
)
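
For readers who want to see what a QQ plot of P-values contains, here is a minimal base R sketch on simulated data. The helper qq_points is hypothetical and is not part of omicRexposome; it pairs the sorted observed P-values with their expected quantiles under the null hypothesis.

```r
# Hypothetical helper: expected vs. observed -log10(P) for a QQ plot.
qq_points <- function(p) {
    p <- sort(p[!is.na(p)])
    data.frame(
        expected = -log10(stats::ppoints(length(p))),
        observed = -log10(p)
    )
}

# Simulated null P-values; real ones would come from an association analysis.
set.seed(7)
qq <- qq_points(stats::runif(2000))

# Points above the diagonal suggest inflation, points below it deflation.
plot(qq$expected, qq$observed,
     xlab = "Expected -log10(P)", ylab = "Observed -log10(P)")
abline(0, 1, col = "red")
```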

Following this line, the same method plotAssociation can be used to create volcano plots.

gridExtra::grid.arrange(
    plotAssociation(gexp, rid="Ben_p", type="volcano", tPV=-log10(1e-04)) + 
        ggplot2::ggtitle("Transcriptome - Ben_p Association"),
    plotAssociation(gexp, rid="BPA_p", type="volcano", tPV=-log10(1e-04)) + 
        ggplot2::ggtitle("Transcriptome - BPA_p Association"),
    ncol=2
)

2.1.2 Exposome - Proteome Data Association

The proteome data-set included in brgedata has 47 proteins for 90 samples.

data("brge_prot", package="brgedata")
brge_prot

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 47 features, 90 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: x0001 x0002 ... x0090 (90 total)
##   varLabels: age sex
##   varMetadata: labelDescription
## featureData
##   featureNames: Adiponectin_ok Alpha1AntitrypsinAAT_ok ...
##     VitaminDBindingProte_ok (47 total)
##   fvarLabels: chr start end
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

The association analysis between exposures and proteome is also done using association.

mds <- createMultiDataSet()
mds <- add_eset(mds, brge_prot, dataset.type = "proteome")
mds <- add_exp(mds, brge_expo)

prot <- association(mds, formula=~Sex+Age,
    expset = "exposures", omicset = "proteome")

The output of tableHits indicates that almost no associations were found between the 47 proteins and the exposures: only NO2_t1 shows a single hit at this threshold.

tableHits(prot, th=0.001)

##        exposure hits
## Ben_p     Ben_p    0
## Ben_t1   Ben_t1    0
## Ben_t2   Ben_t2    0
## Ben_t3   Ben_t3    0
## BPA_p     BPA_p    0
## BPA_t1   BPA_t1    0
## BPA_t3   BPA_t3    0
## NO2_p     NO2_p    0
## NO2_t1   NO2_t1    1
## NO2_t2   NO2_t2    0
## NO2_t3   NO2_t3    0
## PCB118   PCB118    0
## PCB138   PCB138    0
## PCB153   PCB153    0
## PCB180   PCB180    0

This is also seen in the Manhattan plot for proteins that can be obtained from plotAssociation.

gridExtra::grid.arrange(
    plotAssociation(prot, rid="Ben_p", type="protein") + 
        ggplot2::ggtitle("Proteome - Ben_p Association") +
        ggplot2::geom_hline(yintercept = 1, color = "LightPink"),
    plotAssociation(prot, rid="NO2_p", type="protein") + 
        ggplot2::ggtitle("Proteome - NO2_p Association") +
        ggplot2::geom_hline(yintercept = 1, color = "LightPink"),
    ncol=2
)

NOTE: A real Manhattan plot can be drawn with the plot method for ResultSet objects by setting the argument type to "manhattan".

2.2 Integration Analysis

omicRexposome also allows studying the relation between exposures and omic features from a perspective different from that of the association analyses. In omicRexposome, the integration analysis can be done using multi canonical correlation analysis or multiple co-inertia analysis. The first method is implemented in the R package PMA (CRAN) and the second in the omicade4 R package (Bioconductor). The two methods are encapsulated in the crossomics method.

The difference between association and crossomics is that the first method tests association between two complete data-sets, removing the samples that have missing values in any of the involved data-sets, while the second tries to find latent relationships between two or more sets.
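
The idea behind canonical correlation can be illustrated with base R on two small simulated blocks that share one latent signal. This sketch uses stats::cancor, the classical two-block method; it is a swapped-in simplification, not the sparse multi-block variant from PMA that crossomics wraps, and all names and data are made up for illustration.

```r
set.seed(123)
n <- 100

# A latent signal shared by one variable in each block.
latent <- stats::rnorm(n)

# Two blocks of variables measured on the same n samples.
block_a <- cbind(
    a1 = latent + stats::rnorm(n, sd = 0.5),
    a2 = stats::rnorm(n),
    a3 = stats::rnorm(n)
)
block_b <- cbind(
    b1 = latent + stats::rnorm(n, sd = 0.5),
    b2 = stats::rnorm(n)
)

# Canonical correlations between the two blocks, largest first.
cc <- stats::cancor(block_a, block_b)
cc$cor
```

The first canonical correlation is driven by the shared latent signal, while the second mostly reflects noise.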

Hence, we need to explore the missing data in the exposome data-set. This can be done using the methods plotMissings and tableMissings from rexposome R package.

library(rexposome)
plotMissings(brge_expo, set = "exposures")

From the plot we can see that most of the exposures have up to 25% of missing values. Hence the first step in the integration analysis is to remove the missing values, so we perform a fast imputation on the exposures side:

brge_expo <- imputation(brge_expo)
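
As a rough picture of these two steps, the base R sketch below computes the fraction of missing values per exposure and fills the gaps with the median. Both the simulated data and the helper impute_median are hypothetical, and median imputation is only an assumed stand-in: it is not necessarily the strategy that imputation from rexposome applies.

```r
set.seed(99)

# Simulated exposures (samples in rows) with a few missing values.
expo <- data.frame(
    BPA_p = stats::rnorm(20),
    NO2_p = stats::rnorm(20),
    PCB118 = stats::rnorm(20)
)
expo$BPA_p[c(2, 5, 11)] <- NA
expo$NO2_p[7] <- NA

# Fraction of missing values per exposure.
missing_fraction <- colMeans(is.na(expo))
missing_fraction

# Hypothetical helper: replace missing values by the median of the observed ones.
impute_median <- function(x) {
    x[is.na(x)] <- stats::median(x, na.rm = TRUE)
    x
}
expo_imputed <- as.data.frame(lapply(expo, impute_median))
anyNA(expo_imputed)
```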

The crossomics function expects the different data-sets to be integrated in a single MultiDataSet object. The argument method of crossomics can be set to mcia (for multiple co-inertia analysis) or to mcca (for multi canonical correlation analysis).

The following code shows how to perform the integration of the exposome, the transcriptome and the proteome.

mds <- createMultiDataSet()
mds <- add_genexp(mds, brge_gexp)
mds <- add_eset(mds, brge_prot, dataset.type = "proteome")
mds <- add_exp(mds, brge_expo)

cr_mcia <- crossomics(mds, method = "mcia", verbose = TRUE)
cr_mcia

## Object of class 'ResultSet'
##  . created with: crossomics 
##  . sva:   
##     . method: mcia  ( omicade4 )
##  . #results: 1 ( error: 0 )
##  . featureData:  3 
##     . expression: 67528x11
##     . proteome: 47x3
##     . exposures: 15x12

As can be seen, crossomics returns an object of class ResultSet. In the integration process, the different data-sets are subset by common samples. This is done taking advantage of MultiDataSet capabilities.
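
Subsetting by common samples amounts to intersecting the sample identifiers of all data-sets. A minimal base R sketch with made-up identifiers (not the internal MultiDataSet code) could look like this:

```r
# Hypothetical sample identifiers for three data-sets.
sample_ids <- list(
    expression = paste0("x", 1:10),
    proteome   = paste0("x", 4:12),
    exposures  = paste0("x", c(2, 4, 6, 8, 10, 11))
)

# Samples present in every data-set, in the order of the first one.
common <- Reduce(intersect, sample_ids)
common
```

Only the samples in common would be kept in each data-set before the integration is run.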

The same is done when method is set to mcca.

cr_mcca <- crossomics(mds, method = "mcca", permute=c(4, 2))
cr_mcca

We passed an extra argument (permute) in the previous call to crossomics using multi canonical correlation analysis. This argument sets the internal arguments corresponding to permutations and iterations, which are used to tune internal parameters.

When a ResultSet is generated using crossomics, the methods plotHits, plotLambda and plotAssociation can NOT be used. However, plotIntegration will help us understand what was done. This method accepts the colors to be used in the plots:

colors <- c("green", "blue", "red")
names(colors) <- names(mds)

The graphical representation of the results from a multiple co-inertia analysis is a composition of four different plots.

plotIntegration(cr_mcia, colors=colors)

The first plot (first row, first column) is the sample space. It illustrates how the different data-sets are related in terms of intra-sample variability (each data-set has a different color). The second plot (first row, second column) shows the feature space. The features of each set are drawn on the same components so the relation between the data-sets can be seen (the features are colored according to the set to which they belong).

The third plot (second row, first column) shows the inertia of each component. The two first plots are drawn on the first and second component. Finally, the fourth plot shows the behavior of the data-sets.

A radar plot is obtained when plotIntegration is used on a ResultSet created through multi canonical correlation analysis.

plotIntegration(cr_mcca, colors=colors)

This plot shows the features of the three data-sets in the same 2D space. The relation between the features can be understood by proximity: features that cluster together, or that lie in the same quadrant, are related and go in a different direction from the features in the other quadrants.
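
Reading the plot by quadrant can be mimicked in base R by grouping features according to the signs of their coordinates on the two components. The feature names and coordinates below are invented for illustration and do not come from cr_mcca; quadrant_of is a hypothetical helper.

```r
# Hypothetical feature coordinates on two components.
loadings <- data.frame(
    feature = c("gene_a", "gene_b", "prot_a", "expo_a", "expo_b"),
    comp1 = c(0.8, -0.6, 0.7, 0.9, -0.4),
    comp2 = c(0.5, 0.3, 0.6, -0.2, 0.1)
)

# Hypothetical helper: label the quadrant of each point (I to IV).
quadrant_of <- function(x, y) {
    ifelse(x >= 0 & y >= 0, "I",
    ifelse(x < 0 & y >= 0, "II",
    ifelse(x < 0 & y < 0, "III", "IV")))
}

# Features sharing a quadrant are candidates for being related.
loadings$quadrant <- quadrant_of(loadings$comp1, loadings$comp2)
split(loadings$feature, loadings$quadrant)
```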

rm(cr_mcia, cr_mcca)

Session info

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] rexposome_1.4.0     MultiDataSet_1.10.0 omicRexposome_1.4.1
## [4] Biobase_2.42.0      BiocGenerics_0.28.0 BiocStyle_2.10.0   
## 
## loaded via a namespace (and not attached):
##   [1] minqa_1.2.4                 colorspace_1.3-2           
##   [3] pryr_0.1.4                  circlize_0.4.5             
##   [5] qvalue_2.14.0               htmlTable_1.13             
##   [7] qqman_0.1.4                 XVector_0.22.0             
##   [9] GenomicRanges_1.34.0        GlobalOptions_0.1.0        
##  [11] base64enc_0.1-3             clue_0.3-56                
##  [13] rstudioapi_0.8              ggrepel_0.8.0              
##  [15] bit64_0.9-7                 mvtnorm_1.0-8              
##  [17] AnnotationDbi_1.44.0        RSpectra_0.13-1            
##  [19] codetools_0.2-16            splines_3.5.2              
##  [21] leaps_3.0                   impute_1.56.0              
##  [23] knitr_1.21                  ade4_1.7-13                
##  [25] Formula_1.2-3               lsr_0.5                    
##  [27] nloptr_1.2.1                annotate_1.60.0            
##  [29] cluster_2.0.7-1             BiocManager_1.30.4         
##  [31] compiler_3.5.2              backports_1.1.3            
##  [33] assertthat_0.2.0            Matrix_1.2-15              
##  [35] lazyeval_0.2.1              gmm_1.6-2                  
##  [37] SmartSVA_0.1.3              limma_3.38.3               
##  [39] acepack_1.4.1               htmltools_0.3.6            
##  [41] tools_3.5.2                 bindrcpp_0.2.2             
##  [43] gtable_0.2.0                glue_1.3.0                 
##  [45] GenomeInfoDbData_1.2.0      reshape2_1.4.3             
##  [47] dplyr_0.7.8                 FactoMineR_1.41            
##  [49] Rcpp_1.0.0                  gdata_2.18.0               
##  [51] JADE_2.0-1                  nlme_3.1-137               
##  [53] iterators_1.0.10            made4_1.56.0               
##  [55] tmvtnorm_1.4-10             xfun_0.4                   
##  [57] stringr_1.3.1               lme4_1.1-19                
##  [59] gtools_3.8.1                XML_3.98-1.16              
##  [61] isva_1.9                    zoo_1.8-4                  
##  [63] zlibbioc_1.28.0             MASS_7.3-51.1              
##  [65] scales_1.0.0                pcaMethods_1.74.0          
##  [67] sandwich_2.5-0              SummarizedExperiment_1.12.0
##  [69] RColorBrewer_1.1-2          yaml_2.2.0                 
##  [71] memoise_1.1.0               gridExtra_2.3              
##  [73] ggplot2_3.1.0               rpart_4.1-13               
##  [75] fastICA_1.2-1               calibrate_1.7.2            
##  [77] latticeExtra_0.6-28         stringi_1.2.4              
##  [79] RSQLite_2.1.1               genefilter_1.64.0          
##  [81] S4Vectors_0.20.1            foreach_1.4.4              
##  [83] corrplot_0.84               checkmate_1.8.5            
##  [85] caTools_1.17.1.1            BiocParallel_1.16.5        
##  [87] shape_1.4.4                 GenomeInfoDb_1.18.1        
##  [89] rlang_0.3.0.1               pkgconfig_2.0.2            
##  [91] matrixStats_0.54.0          bitops_1.0-6               
##  [93] imputeLCMD_2.0              evaluate_0.12              
##  [95] lattice_0.20-38             PMA_1.0.11                 
##  [97] purrr_0.2.5                 bindr_0.1.1                
##  [99] labeling_0.3                htmlwidgets_1.3            
## [101] omicade4_1.22.0             bit_1.1-14                 
## [103] tidyselect_0.2.5            norm_1.0-9.5               
## [105] plyr_1.8.4                  magrittr_1.5               
## [107] bookdown_0.9                R6_2.3.0                   
## [109] gplots_3.0.1                IRanges_2.16.0             
## [111] Hmisc_4.1-1                 DelayedArray_0.8.0         
## [113] DBI_1.0.0                   pillar_1.3.1               
## [115] foreign_0.8-71              mgcv_1.8-26                
## [117] survival_2.43-3             scatterplot3d_0.3-41       
## [119] RCurl_1.95-4.11             nnet_7.3-12                
## [121] tibble_2.0.0                crayon_1.3.4               
## [123] KernSmooth_2.23-15          rmarkdown_1.11             
## [125] grid_3.5.2                  sva_3.30.1                 
## [127] data.table_1.11.8           blob_1.1.1                 
## [129] digest_0.6.18               flashClust_1.01-2          
## [131] xtable_1.8-3                glmnet_2.0-16              
## [133] stats4_3.5.2                munsell_0.5.0