Dealing with Multiple Imputations

Carles Hernandez-Ferrer and Juan R. Gonzalez

30 October 2018

Abstract

An introductory guide to analysing multiply imputed exposome data with the R package rexposome. The areas covered in this document are: loading the multiple imputations of both exposures and phenotypes from common data.frames, exploring the exposome data, and testing the association between exposures and health outcomes.

Package

rexposome 1.4.0

1 Introduction
2 Exposome-Wide Association Studies (ExWAS)
- 2.1 Extract the exposures over the threshold of effective tests
Session info

1 Introduction

1.1 Dummy Imputation with `mice`

To illustrate how to perform a multiple imputation using mice, we start by loading both the rexposome and mice libraries.

library(rexposome)
library(mice)

Then we load the csv files included in the rexposome package so we can load the exposures and see the amount of missing data (check the vignette Exposome Data Analysis for more information).

The following lines locate where the csv files were installed.

path <- file.path(path.package("rexposome"), "extdata")
description <- file.path(path, "description.csv")
phenotype <- file.path(path, "phenotypes.csv")
exposures <- file.path(path, "exposures.csv")

Once the files are located we load them as data.frames:

dd <- read.csv(description, header=TRUE, stringsAsFactors=FALSE)
ee <- read.csv(exposures, header=TRUE)
pp <- read.csv(phenotype, header=TRUE)

To speed up the imputation process carried out in this vignette, we remove four families of exposures.

dd <- dd[!dd$Family %in% c("Phthalates", "PBDEs", "PFOAs", "Metals"), ]
ee <- ee[ , c("idnum", dd$Exposure)]

We can check the amount of missing data in both exposures and phenotypes data.frames:

data.frame(
    Set=c("Exposures", "Phenotypes"),
    Count=c(sum(is.na(ee)), sum(is.na(pp)))
)

##          Set Count
## 1  Exposures   304
## 2 Phenotypes     5
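The totals above do not show which columns carry the missing values. A minimal sketch, using a small hypothetical data.frame (`toy`, not the vignette data), of how the count per column could be obtained with base R:

```r
# Hypothetical toy data standing in for the exposures table
toy <- data.frame(
    idnum = c("id001", "id002", "id003"),
    DDE   = c(NA, 1.71, 2.59),
    DDT   = c(NA, 0.69, NA)
)

# Number of missing values per column, largest first
na_count <- sort(colSums(is.na(toy)), decreasing = TRUE)
na_count

# Proportion of missing values per column
na_prop <- colMeans(is.na(toy))
na_prop
```

Applied to ee or pp, the same two calls would point to the columns that drive the imputation workload.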

Before running mice, we need to combine the exposures and the phenotypes into a single data.frame.

rownames(ee) <- ee$idnum
rownames(pp) <- pp$idnum

dta <- cbind(ee[ , -1], pp[ , -1])
dta[1:3, c(1:3, 52:56)]

##            DDE       DDT      HCB  birthdate  sex age cbmi blood_pre
## id001       NA        NA       NA 2004-12-29 male 4.2 16.3       120
## id002 1.713577 0.6931915 1.270750 2005-01-05 male 4.2 16.4       121
## id003 2.594590 0.7448906 2.205519 2005-01-05 male 4.2 19.0       120
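Because cbind pairs rows by position, the step above relies on both tables listing the samples in the same order. A minimal sketch, with hypothetical toy tables (`ee_toy`, `pp_toy`), of aligning the rows by name before binding:

```r
# Hypothetical toy tables; idnum plays the same role as in the vignette
ee_toy <- data.frame(idnum = c("id001", "id002", "id003"),
                     DDE = c(NA, 1.7, 2.6))
pp_toy <- data.frame(idnum = c("id003", "id001", "id002"),
                     sex = c("male", "female", "male"))
rownames(ee_toy) <- ee_toy$idnum
rownames(pp_toy) <- pp_toy$idnum

# Both tables must describe the same samples
stopifnot(setequal(rownames(ee_toy), rownames(pp_toy)))

# Reorder the phenotypes to follow the order of the exposures
pp_toy <- pp_toy[rownames(ee_toy), ]

dta_toy <- cbind(ee_toy[ , -1, drop = FALSE], pp_toy[ , -1, drop = FALSE])
dta_toy
```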

Once this is done, the class of each column needs to be set, so mice will be able to differentiate between continuous and categorical exposures.

for(ii in c(1:13, 18:47, 55:56)) {
    dta[, ii] <- as.numeric(dta[ , ii])
}
for(ii in c(14:17, 48:54)) {
    dta[ , ii] <- as.factor(dta[ , ii])
}
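Positional indices such as 14:17 break silently if the column order changes. A sketch of the same conversion driven by column names, on a hypothetical toy data.frame (the column choice here is illustrative only):

```r
# Hypothetical toy data; column names are illustrative only
toy <- data.frame(
    DDE          = c("1.71", "2.59", NA),
    H_pesticides = c(0, 1, 1),
    sex          = c("male", "female", "male"),
    stringsAsFactors = FALSE
)

num_cols <- c("DDE")
fac_cols <- c("H_pesticides", "sex")

# Convert each group of columns in one step
toy[num_cols] <- lapply(toy[num_cols], as.numeric)
toy[fac_cols] <- lapply(toy[fac_cols], as.factor)

# Confirm the class that mice would see for each column
sapply(toy, class)
```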

With this data.frame we perform the imputation by calling the mice function (for more information about this call, check mice’s vignette). We remove the column birthdate since it is not necessary for the imputation and carries many categories.

imp <- mice(dta[ , -52], pred = quickpred(dta[ , -52], mincor = 0.2, 
    minpuc = 0.4), seed = 38788, m = 5, maxit = 10, printFlag = FALSE)

## Warning: Number of logged events: 240

class(imp)

## [1] "mids"

The created object imp, of class mids, contains 5 data-sets (m = 5) with the imputed exposures and the phenotypes. To work with this information we need to extract each one of these sets and create a new data-set that includes all of them. This new data.frame will be passed to rexposome (check the next section to see the requirements).

The mice package includes the function complete, which extracts a single data-set from an object of class mids. We will use this function to extract the sets and join them in a single data.frame.

If we set the argument action of the complete function to 0, it will return the original data:

me <- complete(imp, action = 0)
me[ , ".imp"] <- 0
me[ , ".id"] <- rownames(me)
dim(me)

## [1] 109  57

summary(me[, c("H_pesticides", "Benzene")])

##  H_pesticides    Benzene        
##  0   :68      Min.   :-0.47427  
##  1   :35      1st Qu.:-0.19877  
##  NA's: 6      Median :-0.11975  
##               Mean   :-0.12995  
##               3rd Qu.:-0.06879  
##               Max.   : 0.13086  
##               NA's   :3

If the action number is between 1 and the m value, it will return the selected set.

for(set in 1:5) {
    im <- complete(imp, action = set)
    im[ , ".imp"] <- set
    im[ , ".id"] <- rownames(im)
    me <- rbind(me, im)
}
me <- me[ , c(".imp", ".id", colnames(me)[-(97:98)])]
rownames(me) <- 1:nrow(me)
dim(me)

## [1] 654  59
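As an alternative to the loop, complete can return the stacked data directly through action = "long" with include = TRUE. A minimal sketch on nhanes, a small example data set shipped with mice (the object names `imp_toy` and `long` are hypothetical):

```r
library(mice)

# nhanes is a small example data set shipped with mice
imp_toy <- mice(nhanes, m = 3, maxit = 2, seed = 1, printFlag = FALSE)

# Stack the original data (.imp == 0) and the 3 imputed sets
long <- complete(imp_toy, action = "long", include = TRUE)

# One block of rows per value of .imp
table(long$.imp)

# The identifier columns are added by complete()
all(c(".imp", ".id") %in% colnames(long))
```

The position and type of the .imp and .id columns can differ between mice versions, so selecting them by name is safer than by index.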

1.2 Data Format

The format of the multiple imputation data for rexposome needs to follow some restrictions:

- Both the exposures and the phenotypes are stored in the same data.frame.
- This data.frame must have a column called .imp indicating the imputation number. The imputation tagged as 0 contains the raw exposures (no imputation).
- This data.frame must have a column called .id indicating the name of the samples. This will be converted to character.
- A second data.frame holds the description, relating exposures to families.
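These restrictions can be checked before loading the data. The helper below is a hypothetical sketch, not part of rexposome; it also assumes that every exposure named in the description should appear as a column of the data, which this document does not state explicitly.

```r
# Hypothetical helper, not part of rexposome
check_imputed_format <- function(data, description, exp_col = "Exposure") {
    problems <- character(0)
    if (!".imp" %in% colnames(data)) {
        problems <- c(problems, "missing column .imp")
    }
    if (!".id" %in% colnames(data)) {
        problems <- c(problems, "missing column .id")
    }
    # Assumption: described exposures are expected as columns of data
    missing_exp <- setdiff(description[[exp_col]], colnames(data))
    if (length(missing_exp) > 0) {
        problems <- c(problems, paste("exposures not found in data:",
                                      paste(missing_exp, collapse = ", ")))
    }
    problems
}

# Toy inputs for illustration
toy_data <- data.frame(.imp = c(0, 1), .id = c("a", "a"),
                       DDE = c(NA, 1.2), sex = c("male", "male"))
toy_desc <- data.frame(Family = "Organochlorines",
                       Exposure = c("DDE", "DDT"),
                       stringsAsFactors = FALSE)

check_imputed_format(toy_data, toy_desc)
```

An empty result means that none of the checked problems was found.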

1.3 Creating an `imExposomeSet`

With the exposome data.frame and the description data.frame an object of class imExposomeSet can be created. To this end, the function loadImputed is used:

ex_imp <- loadImputed(data = me, description = dd, 
                       description.famCol = "Family", 
                       description.expCol = "Exposure")

The function loadImputed has several arguments:

args(loadImputed)

## function (data, description, description.famCol = "family", description.expCol = "exposure", 
##     exposures.asFactor = 5, warnings = TRUE) 
## NULL

The argument data takes the data.frame with the exposures and phenotypes. The argument description takes the data.frame with the exposures’ description. description.famCol indicates the column of the description that corresponds to the family. description.expCol indicates the column of the description that corresponds to the exposures. Finally, exposures.asFactor indicates that exposures with fewer than, by default, five different values are considered categorical; the others are considered continuous.
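The exposures.asFactor rule can be restated as a small function. This is a hypothetical sketch of the rule as described above, not rexposome's internal code; the name `guess_type` and the handling of missing values are assumptions.

```r
# Hypothetical re-statement of the rule; not rexposome's internal code
guess_type <- function(x, as_factor = 5) {
    # Count distinct observed values, ignoring missing ones (assumption)
    n_values <- length(unique(x[!is.na(x)]))
    if (n_values < as_factor) "factor" else "numeric"
}

guess_type(c(0, 1, 1, 0, NA))          # two distinct values
guess_type(c(0.2, 1.5, 3.1, 4.8, 7))   # five distinct values
```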

ex_imp

## Object of class 'imExposomeSet'
##  . exposures description:
##     . categorical:  4 
##     . continuous:  43 
##  . #imputations: 6 (raw detected) 
##  . assayData: 47 exposures 109 individuals
##  . phenoData: 109 individuals 12 phenotypes
##  . featureData: 47 exposures 3 explanations

The output of this object indicates that we loaded 47 exposures, 43 of them continuous and 4 categorical.

1.3.1 Accessing Exposome Data

Like ExposomeSet, the class imExposomeSet has several accessors to get the data stored in it. There are four basic methods that return the names of the individuals (sampleNames), the names of the exposures (exposureNames), the names of the families of exposures (familyNames) and the names of the phenotypes (phenotypeNames).

head(sampleNames(ex_imp))

## [1] "1"   "10"  "100" "101" "102" "103"

head(exposureNames(ex_imp))

## [1] "DDE"    "DDT"    "HCB"    "bHCH"   "PCB118" "PCB153"

familyNames(ex_imp)

## [1] "Organochlorines"   "Bisphenol A"       "Water Pollutants" 
## [4] "Cotinine"          "Home Environment"  "Air Pollutants"   
## [7] "Built Environment" "Noise"             "Temperature"

phenotypeNames(ex_imp)

##  [1] "whistling_chest" "flu"             "rhinitis"        "wheezing"       
##  [5] "sex"             "age"             "cbmi"            "blood_pre"      
##  [9] ".imp.1"          ".id.1"
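The entries .imp.1 and .id.1 listed among the phenotypes look like duplicated copies of .imp and .id. A likely cause, stated here as an assumption, is the reordering step in section 1.1: the data.frame has 57 columns at that point, so the negative index -(97:98) drops nothing and both columns are selected twice. A sketch with a hypothetical toy data.frame reproduces the effect and shows a name-based alternative:

```r
# Hypothetical toy data with the two identifier columns at the end
toy <- data.frame(x = 1:2, .imp = 0, .id = c("a", "b"))

# Out-of-range negative indices are ignored, so nothing is dropped
cols <- colnames(toy)[-(97:98)]
dup <- toy[ , c(".imp", ".id", cols)]
colnames(dup)

# Name-based selection moves the identifiers without duplicating them
others <- setdiff(colnames(toy), c(".imp", ".id"))
fixed <- toy[ , c(".imp", ".id", others)]
colnames(fixed)
```

Selecting by name keeps the result independent of the number of columns in the data.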

fData will return the description of the exposures (including internal information to manage them).

head(fData(ex_imp), n = 3)

## DataFrame with 3 rows and 4 columns
##              Family    Exposure                             Name       .type
##         <character> <character>                      <character> <character>
## DDE Organochlorines         DDE Dichlorodiphenyldichloroethylene     numeric
## DDT Organochlorines         DDT  Dichlorodiphenyltrichloroethane     numeric
## HCB Organochlorines         HCB                Hexachlorobenzene     numeric

pData will return the phenotype information.

head(pData(ex_imp), n = 3)

## DataFrame with 3 rows and 12 columns
##        .imp         .id whistling_chest      flu rhinitis wheezing      sex
##   <numeric> <character>        <factor> <factor> <factor> <factor> <factor>
## 1         0           1           never       no       no       no     male
## 2         0          10         1-2 epi      yes       no       no     male
## 3         0         100           never       no       no       no     male
##        age      cbmi blood_pre    .imp.1       .id.1
##   <factor> <numeric> <numeric> <numeric> <character>
## 1      4.2      16.3       120         0           1
## 2      4.2      15.6       116         0          10
## 3      4.1      16.9       117         0         100

1.3.2 Exposures Behaviour

The behavior of the exposures through the imputation process can be studied using the plotFamily method. This method draws the behavior of the exposures in each imputation set in a single chart.

The method requires the argument family and draws a mosaic with one plot per exposure within that family. As with an ExposomeSet, box plots are used when the exposures are continuous.

plotFamily(ex_imp, family = "Organochlorines")

## Warning: Removed 104 rows containing non-finite values (stat_boxplot).

For categorical exposures, the method draws accumulated bar plots:

plotFamily(ex_imp, family = "Home Environment")

The arguments group and na.omit are not available when plotFamily is used with an imExposomeSet.

1.4 Extracting an `ExposomeSet` from an `imExposomeSet`

Once an imExposomeSet is created, an ExposomeSet can be obtained by selecting one of the internal imputed-sets. This is done using the method toES and setting the argument rid with the number of the imputed-set to use:

ex_1 <- toES(ex_imp, rid = 1)
ex_1

## Object of class 'ExposomeSet' (storageMode: environment)
##  . exposures description:
##     . categorical:  4 
##     . continuous:  43 
##  . exposures transformation:
##     . categorical: 0 
##     . transformed: 0 
##     . standardized: 0 
##     . imputed: 0 
##  . assayData: 47 exposures 109 individuals
##     . element names: exp 
##     . exposures: AbsPM25, ..., Temp 
##     . individuals: 1, ..., 98 
##  . phenoData: 109 individuals 10 phenotypes
##     . individuals: 1, ..., 98 
##     . phenotypes: whistling_chest, ..., .imp.1 
##  . featureData: 47 exposures 8 explanations
##     . exposures: AbsPM25, ..., Temp 
##     . descriptions: Family, ..., .imp 
## experimentData: use 'experimentData(object)'
## Annotation:

ex_3 <- toES(ex_imp, rid = 3)
ex_3

## Object of class 'ExposomeSet' (storageMode: environment)
##  . exposures description:
##     . categorical:  4 
##     . continuous:  43 
##  . exposures transformation:
##     . categorical: 0 
##     . transformed: 0 
##     . standardized: 0 
##     . imputed: 0 
##  . assayData: 47 exposures 109 individuals
##     . element names: exp 
##     . exposures: AbsPM25, ..., Temp 
##     . individuals: 1, ..., 98 
##  . phenoData: 109 individuals 10 phenotypes
##     . individuals: 1, ..., 98 
##     . phenotypes: whistling_chest, ..., .imp.1 
##  . featureData: 47 exposures 8 explanations
##     . exposures: AbsPM25, ..., Temp 
##     . descriptions: Family, ..., .imp 
## experimentData: use 'experimentData(object)'
## Annotation:

2 Exposome-Wide Association Studies (ExWAS)

The main interest in working with multiple imputations is to test the association of the different versions of the exposures with a target phenotype. rexposome implements the method exwas for use with an imExposomeSet.

as_iew <- exwas(ex_imp, formula = blood_pre~sex+age, family = "gaussian")
as_iew

## An object of class 'ExWAS'
## 
##        blood_pre ~ sex+age 
## 
## Tested exposures:  47 
## Threshold for effective tests (TEF):  2.44e-03 
##  . Tests < TEF: NA
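This document does not describe how exwas combines the fits obtained on each imputed set. A common approach for multiply imputed data is Rubin's rules; the sketch below applies them in base R to hypothetical numbers for a single exposure. Whether rexposome performs exactly this computation is an assumption.

```r
# Hypothetical per-imputation results for one exposure (m = 5)
est <- c(15.2, 14.8, 15.6, 15.0, 14.9)   # effect estimates
se  <- c(3.5, 3.6, 3.4, 3.7, 3.5)        # their standard errors

m         <- length(est)
pooled    <- mean(est)                   # pooled effect
within    <- mean(se^2)                  # within-imputation variance
between   <- var(est)                    # between-imputation variance
total     <- within + (1 + 1 / m) * between
pooled_se <- sqrt(total)

c(effect = pooled, se = pooled_se)
```

The pooled standard error is never smaller than the within-imputation part alone, which reflects the extra uncertainty introduced by the missing data.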

As usual, the object obtained from exwas method can be plotted using plotExwas:

clr <- rainbow(length(familyNames(ex_imp)))
names(clr) <- familyNames(ex_imp)
plotExwas(as_iew, color = clr)

2.1 Extract the exposures over the threshold of effective tests

The method extract returns a table of P-Values from an ExWAS object, and the tef method returns the threshold of effective tests computed by exwas. Combined, they give a table with the P-Values of the exposures that pass the threshold of effective tests.

First, we get the threshold of effective tests:

(thr <- tef(as_iew))

## [1] 0.002441308

Second, we get the table of P-Values:

tbl <- extract(as_iew)

Third, we filter the table with the threshold:

(sig <- tbl[tbl$pvalue <= thr, ])

## DataFrame with 10 rows and 4 columns
##                       pvalue           effect             X2.5
##                    <numeric>        <numeric>        <numeric>
## NO2     4.93174209856839e-05 15.1019393592132 8.05249677475728
## NOx      5.4053184695313e-05 13.0679142694299  6.9237273749917
## NO      6.93577244148535e-05 10.4612167216436  5.4675489250452
## AbsPM25  0.00010849521186973 19.6283472511496 9.96133403024455
## PM25    0.000137011348647853 36.8364649364314 18.4399075002739
## PM25CU  0.000265183311340955 11.7686204802454 5.60126881235511
## Temp     0.00122705664499012 88.1527135282858 35.5988388431556
## PCB153   0.00125475824499421 4.26293143314533 1.71426970647294
## PCB138   0.00162516166610449 4.64231172123763 1.79535492687626
## PM10Fe   0.00213657460974703 13.4582118778229 4.99484009546175
##                    X97.5
##                <numeric>
## NO2     22.1513819436691
## NOx     19.2121011638681
## NO      15.4548845182419
## AbsPM25 29.2953604720547
## PM25    55.2330223725888
## PM25CU  17.9359721481356
## Temp    140.706588213416
## PCB153  6.81159315981772
## PCB138  7.48926851559899
## PM10Fe  21.9215836601841
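If some P-Values are missing, a logical filter can misbehave: a base data.frame returns rows of NA for them. A minimal sketch on a hypothetical toy table, where the first three P-Values are rounded from the output above and the last two rows are invented for illustration:

```r
# Toy table: NO2, NOx and PM10Fe are rounded from the output above;
# toy_A and toy_B are hypothetical rows added for illustration
tbl_toy <- data.frame(
    pvalue = c(4.93e-05, 5.41e-05, 2.14e-03, 0.20, NA),
    row.names = c("NO2", "NOx", "PM10Fe", "toy_A", "toy_B")
)
thr_toy <- 0.002441308

# which() drops rows whose comparison is NA instead of returning NA rows
keep <- which(tbl_toy$pvalue <= thr_toy)
sig_toy <- tbl_toy[keep, , drop = FALSE]

# Sort the selected exposures by increasing P-Value
sig_toy[order(sig_toy$pvalue), , drop = FALSE]
```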

Session info

## R version 3.5.1 Patched (2018-07-12 r74967)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2      mice_3.3.0          lattice_0.20-35    
## [4] ggplot2_3.1.0       rexposome_1.4.0     Biobase_2.42.0     
## [7] BiocGenerics_0.28.0 BiocStyle_2.10.0   
## 
## loaded via a namespace (and not attached):
##   [1] nlme_3.1-137         bitops_1.0-6         RColorBrewer_1.1-2  
##   [4] rprojroot_1.3-2      tools_3.5.1          backports_1.1.2     
##   [7] R6_2.3.0             tmvtnorm_1.4-10      rpart_4.1-13        
##  [10] KernSmooth_2.23-15   Hmisc_4.1-1          lazyeval_0.2.1      
##  [13] colorspace_1.3-2     jomo_2.6-4           nnet_7.3-12         
##  [16] withr_2.1.2          tidyselect_0.2.5     gridExtra_2.3       
##  [19] compiler_3.5.1       glmnet_2.0-16        lsr_0.5             
##  [22] formatR_1.5          htmlTable_1.12       flashClust_1.01-2   
##  [25] sandwich_2.5-0       labeling_0.3         bookdown_0.7        
##  [28] caTools_1.17.1.1     scales_1.0.0         checkmate_1.8.5     
##  [31] mvtnorm_1.0-8        stringr_1.3.1        digest_0.6.18       
##  [34] foreign_0.8-71       minqa_1.2.4          rmarkdown_1.10      
##  [37] base64enc_0.1-3      pkgconfig_2.0.2      htmltools_0.3.6     
##  [40] lme4_1.1-18-1        FactoMineR_1.41      htmlwidgets_1.3     
##  [43] rlang_0.3.0.1        GlobalOptions_0.1.0  rstudioapi_0.8      
##  [46] pryr_0.1.4           impute_1.56.0        shape_1.4.4         
##  [49] bindr_0.1.1          zoo_1.8-4            gtools_3.8.1        
##  [52] acepack_1.4.1        dplyr_0.7.7          magrittr_1.5        
##  [55] Formula_1.2-3        leaps_3.0            Matrix_1.2-14       
##  [58] Rcpp_0.12.19         munsell_0.5.0        S4Vectors_0.20.0    
##  [61] imputeLCMD_2.0       scatterplot3d_0.3-41 stringi_1.2.4       
##  [64] yaml_2.2.0           MASS_7.3-51          gplots_3.0.1        
##  [67] plyr_1.8.4           grid_3.5.1           gdata_2.18.0        
##  [70] ggrepel_0.8.0        mitml_0.3-6          crayon_1.3.4        
##  [73] splines_3.5.1        circlize_0.4.4       knitr_1.20          
##  [76] pillar_1.3.0         reshape2_1.4.3       codetools_0.2-15    
##  [79] stats4_3.5.1         pan_1.6              glue_1.3.0          
##  [82] evaluate_0.12        latticeExtra_0.6-28  pcaMethods_1.74.0   
##  [85] data.table_1.11.8    BiocManager_1.30.3   nloptr_1.2.1        
##  [88] foreach_1.4.4        tidyr_0.8.2          gtable_0.2.0        
##  [91] purrr_0.2.5          norm_1.0-9.5         assertthat_0.2.0    
##  [94] xfun_0.4             broom_0.5.0          survival_2.43-1     
##  [97] tibble_1.4.2         iterators_1.0.10     gmm_1.6-2           
## [100] cluster_2.0.7-1      corrplot_0.84