---
title: "Utilizing Mechanism-Aware Imputation (MAI)"
author: 
    - name: Jonathan Dekermanjian
      email: Jonathan.Dekermanjian@CUAnschutz.edu
      affiliation: University of Colorado Anschutz Medical Campus
    - name: Elin Shaddox
      email: Elin.Shaddox@CUAnschutz.edu
      affiliation: University of Colorado Anschutz Medical Campus
    - name: Debmalya Nandy
      email: Debmalya.Nandy@CUAnschutz.edu
      affiliation: University of Colorado Anschutz Medical Campus
    - name: Debashis Ghosh
      email: Debashis.Ghosh@CUAnschutz.edu
      affiliation: University of Colorado Anschutz Medical Campus
    - name: Katerina Kechris
      email: Katerina.Kechris@CUAnschutz.edu
      affiliation: University of Colorado Anschutz Medical Campus
package: MAI
output: 
  BiocStyle::html_document:
    highlight: "tango"
    code_folding: show
    toc: true
    toc_float: 
      collapsed: false
date: "7/21/2021"
vignette: >
  %\VignetteIndexEntry{Utilizing Mechanism-Aware Imputation (MAI)}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

A two-step approach to imputing missing data in metabolomics.
Step 1 uses a random forest classifier to classify missing values as
either Missing Completely at Random/Missing At Random (MCAR/MAR) or Missing
Not At Random (MNAR). MCAR/MAR are combined because it is often difficult to
distinguish these two missing types in metabolomics data. Step 2 imputes the
missing values based on the classified missing mechanisms, using the
appropriate imputation algorithms. Imputation algorithms tested and
available for MCAR/MAR include Bayesian Principal Component Analysis (BPCA),
Multiple Imputation No-Skip K-Nearest Neighbors (Multi_nsKNN), and
Random Forest. Imputation algorithms tested and available for MNAR include
nsKNN and a single imputation approach for imputation of metabolites where
left-censoring is present.

# Installation

The following code chunk depicts how to install MAI from Bioconductor

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MAI")
```

# Using MAI when your data is a data.frame or matrix

```{r}
# Load the MAI package
library(MAI)
# Load the example data with missing values
data("untargeted_LCMS_data")
# Set a seed for reproducibility
## Estimating pattern of missingness involves imposing MCAR/MAR into the data
## these are done at random and as such may slightly change the results of the
## estimated parameters.
set.seed(137690)
# Impute the data using BPCA for predicted MCAR value imputation and
# use Single imputation for predicted MNAR value imputation

Results = MAI(data_miss = untargeted_LCMS_data, # The data with missing values
          MCAR_algorithm = "BPCA", # The MCAR algorithm to use
          MNAR_algorithm = "Single", # The MNAR algorithm to use
          assay_ix = 1, # If SE, designates the assay to impute
          forest_list_args = list( # random forest arguments for training
            ntree = 300,
            proximity = FALSE
        ),
          verbose = TRUE # allows console message output
        )

# Get MAI imputations
Results[["Imputed_data"]][1:5, 1:5] # show only 5x5
```

```{r}
# Get the estimated mixed missingness parameters
Results[["Estimated_Params"]]
```

These parameters estimate the ratio of MCAR/MAR to MNAR in the data. The parameters $\alpha$ and $\beta$ separate high, medium, and low average abundance metabolites, while the parameter $\gamma$ is used to impose missingness in the medium and low abundance metabolites.  A smaller $\alpha$ corresponds to more MCAR/MAR being present, while larger $\beta$ and $\gamma$ values imply more MNAR values being present. The returned estimated parameters are then used to impose known missingness in the complete subset of the input data. Subsequently, a random forest classifier is trained to classify the known missingness in the complete subset of the input data. Once the classifier is established it is applied to the unknown missingness of the full input data to predict the missingness. Finally, the missing values are imputed using a specific algorithm, chosen by the user, according to the predicted missingness mechanism.

# Using MAI when your data is a SummarizedExperiment (SE) class
```{r}
# Load the SummarizedExperiment package
suppressMessages(
  library(SummarizedExperiment)
)

# Load the example data with missing values
data("untargeted_LCMS_data")
# Turn the data to a SummarizedExperiment
se = SummarizedExperiment(untargeted_LCMS_data)
# Set a seed for reproducibility
## Estimating pattern of missingness involves imposing MCAR/MAR into the data
## these are done at random and as such may slightly change the results of the
## estimated parameters.
set.seed(137690)
# Impute the data using BPCA for predicted MCAR value imputation and
# use Single imputation for predicted MNAR value imputation

Results = MAI(se, # The data with missing values
          MCAR_algorithm = "BPCA", # The MCAR algorithm to use
          MNAR_algorithm= "Single", # The MNAR algorithm to use
          assay_ix = 1, # If SE, designates the assay to impute
          forest_list_args = list( # random forest arguments for training
            ntree = 300,
            proximity = FALSE
        ),
          verbose = TRUE # allows console message output
        )

# Get MAI imputations
assay(Results)[1:5, 1:5] # show only 5x5
```

```{r}
# Get the estimated mixed missingness parameters
metadata(Results)[["meta_assay_1"]][["Estimated_Params"]]
```

# Session Information

```{r}
sessionInfo()
```

# References
Dekermanjian, J.P., Shaddox, E., Nandy, D. et al. Mechanism-aware imputation: a two-step approach in handling missing values in metabolomics. BMC Bioinformatics 23, 179 (2022). https://doi.org/10.1186/s12859-022-04659-1