---
title: "Dealing with Multiple Imputations"
author: "Carles Hernandez-Ferrer and Juan R. Gonzalez"
date: "`r doc_date()`"
package: "`r pkg_ver('rexposome')`"
abstract: >
      An introductory guide to analysing multiple imputed exposome data with R package `rexposome`. The areas covered in this document are: loading the multiple imputations of both exposures and phenotypes from common `data.frame`s, exploration the exposome data, and testing association between exposures and health outcomes.
vignette: >
  %\VignetteIndexEntry{Dealing with Multiple Imputations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document
---

```{r setup, include=FALSE}
BiocStyle::markdown()
knitr::opts_chunk$set(echo = TRUE, warnings=FALSE, crop = NULL)
```

# Introduction

## Dummy Imputation with `mice`

To illustrate how to perform a multiple imputation using `mice` we start loading both `rexposome` and `mice` libraries.

```{r, message=FALSE}
library(rexposome)
library(mice)
```

The we load the `txt` files includes in `rexposome` package so we can load the exposures and see the amount of missing data (check vignette *Exposome Data Analysis* for more information). 

The following lines locates where the `txt` files were installed.

```{r files_path}
path <- file.path(path.package("rexposome"), "extdata")
description <- file.path(path, "description.csv")
phenotype <- file.path(path, "phenotypes.csv")
exposures <- file.path(path, "exposures.csv")
```

Once the files are located we load them as `data.frames`:

```{r read_csv_files}
dd <- read.csv(description, header=TRUE, stringsAsFactors=FALSE)
ee <- read.csv(exposures, header=TRUE)
pp <- read.csv(phenotype, header=TRUE)
```

In order to speed up the imputation process that will be carried in this vignette, we will remove four families of exposures.

```{r remove_exposures}
dd <- dd[-which(dd$Family %in% c("Phthalates", "PBDEs", "PFOAs", "Metals")), ]
ee <- ee[ , c("idnum", dd$Exposure)]
```

We can check the amount of missing data in both exposures and phenotypes `data.frames`:

```{r check_na}
data.frame(
    Set=c("Exposures", "Phenotypes"),
    Count=c(sum(is.na(ee)), sum(is.na(pp)))
)
```

Before running `mice`, we need to collapse both the exposures and the phenotypes in a single `data.frame`.

```{r set_up_imputation}
rownames(ee) <- ee$idnum
rownames(pp) <- pp$idnum

dta <- cbind(ee[ , -1], pp[ , -1])
dta[1:3, c(1:3, 52:56)]
```

Once this is done, the *class* of each column needs to be set, so `mice` will be able to differentiate between continuous and categorical exposures.

```{r class_imputation}
for(ii in c(1:13, 18:47, 55:56)) {
    dta[, ii] <- as.numeric(dta[ , ii])
}
for(ii in c(14:17, 48:54)) {
    dta[ , ii] <- as.factor(dta[ , ii])
}
```

With this `data.frame` we perform the imputation calling `mice` functions (for more information about this call, check `mice`'s vignette). We remove the columns *birthdate* since it is not necessary for the imputations and carries lots of categories.

```{r mice_imputation, message=FALSE}
imp <- mice(dta[ , -52], pred = quickpred(dta[ , -52], mincor = 0.2, 
    minpuc = 0.4), seed = 38788, m = 5, maxit = 10, printFlag = FALSE)
class(imp)
```

The created object `imp`, that is an object of class `mids` contains 20 data-sets with the imputed exposures and the phenotypes. To work with this information we need to extract each one of these sets and create a new data-set that includes all of them. This new `data.frame` will be passed to `rexposome` (check next section to see the requirements).

`mice` package includes the function `complete` that allows to extract a single data-set from an object of class `mids`. We will use this function to extract the sets and join them in a single `data.frame`.

If we set the argument `action` of the `complete` function to `0`, it will return the original data:

```{r extract_non_imputed}
me <- complete(imp, action = 0)
me[ , ".imp"] <- 0
me[ , ".id"] <- rownames(me)
dim(me)
summary(me[, c("H_pesticides", "Benzene")])
```

If the `action` number is between 1 and the `m` value, it will return the selected set.

```{r extract_imputation}
for(set in 1:5) {
    im <- complete(imp, action = set)
    im[ , ".imp"] <- set
    im[ , ".id"] <- rownames(im)
    me <- rbind(me, im)
}
me <- me[ , c(".imp", ".id", colnames(me)[-(97:98)])]
rownames(me) <- 1:nrow(me)
dim(me)
```

## Data Format

The format of the multiple imputation data for `rexposome` needs to follow some restrictions:

  1. Both the **exposures** and the **phenotypes** are stored in the same `data.frame`.
  2. This `data.frame` must have a column called `.imp` indicating the number of imputation. This imputation tagged as `0` are raw exposures (no imputation).
  3. This `data.frame` must have a column called `.id` indicating the name of samples. This will be converted to character.
  4. A `data.frame` with the *description* with the relation between exposures and families.

## Creating an `imExposomeSet`

With the exposome `data.frame` and the description `data.frame` an object of class `imExposomeSet` can be created. To this end, the function `loadImputed` is used:

```{r create_imexposomeset}
ex_imp <- loadImputed(data = me, description = dd, 
                       description.famCol = "Family", 
                       description.expCol = "Exposure")
```

The function `loadImputed` has several arguments:

```{r args_load}
args(loadImputed)
```

The argument `data` is filled with the `data.frame` of exposures. The argument `decription` with the `data.frame` with the exposures' description. `description.famCol` indicates the column on the description that corresponds to the family. `description.expCol` indicates the column on the description that corresponds to the exposures. Finally, `exposures.asFactor` indicates that the exposures with less that, by default, five different values are considered categorical exposures, otherwise continuous.

```{r imexposomeset_show}
ex_imp
```

The output of this object indicates that we loaded 14 exposures, being 13 continuous and 1 categorical.

### Accessing to Exposome Data

The class `ExposomeSet` has several accessors to get the data stored in it. There are four basic methods that returns the names of the individuals (`sampleNames`), the name of the exposures (`exposureNames`), the name of the families of exposures (`familyNames`) and the name of the phenotypes (`phenotypeNames`).

```{r individuals_names}
head(sampleNames(ex_imp))
```

```{r exposures_names}
head(exposureNames(ex_imp))
```

```{r families_names}
familyNames(ex_imp)
```

```{r phenotype_names}
phenotypeNames(ex_imp)
```

`fData` will return the description of the exposures (including internal information to manage them).

```{r exposures_matrix}
head(fData(ex_imp), n = 3)
```

`pData` will return the phenotypes information.

```{r phenotype}
head(pData(ex_imp), n = 3)
```

### Exposures Behaviour

The behavior of the exposures through the imputation process can be studies using the `plotFamily` method. This method will draw the behavior of the exposures in each imputation set in a single chart.

The method required an argument `family` and it will draw a mosaic with the plots from the exposures within the family. Following the same strategy than using an `ExposomeSet`, when the exposures are continuous box-plots are used.

```{r plot_family_continuous}
plotFamily(ex_imp, family = "Organochlorines")
```

For categorical exposures, the method draws accumulated bar-plot:

```{r plot_family_categorical}
plotFamily(ex_imp, family = "Home Environment")
```

The arguments `group` and `na.omit` are not available when `plotFamily` is used with an `imExposomeSet`.

## Extracting an `ExposomeSet` from an `imExposomeSet`

Once an `imExposomeSet` is created, an `ExposomeSet` can be obtained by selecting one of the internal imputed-sets. This is done using the method `toES` and setting the argument `rid` with the number of the imputed-set to use:

```{r creating_es}
ex_1 <- toES(ex_imp, rid = 1)
ex_1

ex_3 <- toES(ex_imp, rid = 3)
ex_3
```


# Exposome-Wide Association Studies (ExWAS)

The interesting point on working with multiple imputations is to test the association of the different version of the exposures with a target phenotype. `rexposome` implements the method `exwas` to be used with an `imExposomeSet`.

```{r exwas, warning=FALSE, message=FALSE, warning=FALSE}
as_iew <- exwas(ex_imp, formula = blood_pre~sex+age, family = "gaussian")
as_iew
```

As usual, the \texttt{ExWAS} object obtained from `exwas` method can be plotted using `plotExwas`:

```{r plot_exwas, fig.height=7}
clr <- rainbow(length(familyNames(ex_imp)))
names(clr) <- familyNames(ex_imp)
plotExwas(as_iew, color = clr)
```

## Extract the exposures over the *threshold of effective tests*

The method `extract` allows to obtain a table of P-Values from an `ExWAS` object. At the same time, the `tef` method allows to obtain the *threshold of effective tests* computed at `exwas`. We can use them combined in order to create a table with the P-Value of the exposures that are beyond the *threshold of efective tests*.

  1. First we get the *threshold of effective tests* 

```{r tef}
(thr <- tef(as_iew))
```

  2. Second we get the table of P-Values
  
```{r pvalue}
tbl <- extract(as_iew)
```

  3. Third we filter the table with the threshold

```{r sig}
(sig <- tbl[tbl$pvalue <= thr, ])
```

# Session info {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```