mixgb: Multiple Imputation Through XGBoost

Yongshi Deng

2026-01-17

Introduction

The mixgb package provides a scalable approach to imputation for large data using XGBoost, subsampling, and predictive mean matching. It leverages XGBoost—an efficient implementation of gradient-boosted trees—to automatically capture complex interactions and non-linear relationships. Subsampling and predictive mean matching are incorporated to reduce bias and to preserve realistic imputation variability. The package accommodates a wide range of variable types and offers flexible control over subsampling and predictive matching settings.

We also recommend our package vismi (Visualisation Tools for Multiple Imputation), which offers a comprehensive set of diagnostics for assessing the quality of multiply imputed data.

Impute missing values with mixgb

We first load the mixgb package and the newborn dataset, which contains 16 variables of various types (integer, numeric, factor, and ordinal factor). Nine of these variables have missing values.

library(mixgb)
str(newborn)
#> tibble [2,107 × 16] (S3: tbl_df/tbl/data.frame)
#>  $ household_size                : int [1:2107] 4 3 5 4 4 3 5 3 3 3 ...
#>  $ age_months                    : int [1:2107] 2 5 10 10 8 3 10 7 2 7 ...
#>  $ sex                           : Factor w/ 2 levels "Male","Female": 2 1 2 2 1 1 2 2 2 1 ...
#>  $ race                          : Factor w/ 3 levels "White","Black",..: 1 1 2 1 1 1 2 1 2 2 ...
#>  $ ethnicity                     : Factor w/ 3 levels "Mexican-American",..: 3 1 3 3 3 3 3 3 3 3 ...
#>  $ race_ethinicity               : Factor w/ 4 levels "Non-Hispanic White",..: 1 3 2 1 1 1 2 1 2 2 ...
#>  $ head_circumference_cm         : num [1:2107] 39.3 45.4 43.9 45.8 44.9 42.2 45.8 NA 40.2 44.5 ...
#>  $ recumbent_length_cm           : num [1:2107] 59.5 69.2 69.8 73.8 69 61.7 74.8 NA 64.5 70.2 ...
#>  $ first_subscapular_skinfold_mm : num [1:2107] 8.2 13 6 8 8.2 9.4 5.2 NA 7 5.9 ...
#>  $ second_subscapular_skinfold_mm: num [1:2107] 8 13 5.6 10 7.8 8.4 5.2 NA 7 5.4 ...
#>  $ first_triceps_skinfold_mm     : num [1:2107] 9 15.6 7 16.4 9.8 9.6 5.8 NA 11 6.8 ...
#>  $ second_triceps_skinfold_mm    : num [1:2107] 9.4 14 8.2 12 8.8 8.2 6.6 NA 10.9 7.6 ...
#>  $ weight_kg                     : num [1:2107] 6.35 9.45 7.15 10.7 9.35 7.15 8.35 NA 7.35 8.65 ...
#>  $ poverty_income_ratio          : num [1:2107] 3.186 1.269 0.416 2.063 1.464 ...
#>  $ smoke                         : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 2 2 1 2 1 ...
#>  $ health                        : Ord.factor w/ 5 levels "Excellent"<"Very Good"<..: 1 3 1 1 1 1 1 1 2 1 ...
colSums(is.na(newborn))
#>                 household_size                     age_months 
#>                              0                              0 
#>                            sex                           race 
#>                              0                              0 
#>                      ethnicity                race_ethinicity 
#>                              0                              0 
#>          head_circumference_cm            recumbent_length_cm 
#>                            124                            114 
#>  first_subscapular_skinfold_mm second_subscapular_skinfold_mm 
#>                            161                            169 
#>      first_triceps_skinfold_mm     second_triceps_skinfold_mm 
#>                            124                            167 
#>                      weight_kg           poverty_income_ratio 
#>                            117                            192 
#>                          smoke                         health 
#>                              7                              0

To impute this dataset, we use the default settings. By default, the number of imputed datasets is set to m = 5. The data do not need to be converted to a dgCMatrix or one-hot encoded format, as these transformations are handled automatically by the package. Supported variable types include numeric, integer, factor, and ordinal factor.

# use mixgb with default settings
imp_list <- mixgb(data = newborn, m = 5)
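The result is a list of m imputed datasets, each a completed copy of the original data. As a quick sanity check (a minimal sketch, assuming the default settings above):

# Extract the first imputed dataset from the list
imp1 <- imp_list[[1]]

# Check that no missing values remain after imputation
colSums(is.na(imp1))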

Customise imputation settings

We can also customise imputation settings:

set.seed(2026)
# Use mixgb with chosen settings
params <- list(
  max_depth = 5,
  subsample = 0.9,
  nthread = 2,
  tree_method = "hist"
)

imp_list <- mixgb(
  data = newborn, m = 10, maxit = 2,
  ordinalAsInteger = FALSE,
  pmm.type = "auto", pmm.k = 5, pmm.link = "prob",
  initial.num = "normal", initial.int = "mode", initial.fac = "mode",
  save.models = FALSE, save.vars = NULL,
  xgb.params = params, nrounds = 200, early_stopping_rounds = 10, print_every_n = 10L, verbose = 0
)
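When save.models = TRUE, the returned object retains the fitted XGBoost models so that new (unseen) data can be imputed later with impute_new(). The sketch below is illustrative only; argument requirements (e.g. a folder for saved models in recent versions) may differ, so consult ?impute_new:

# Train imputation models and keep them for later reuse
mixgb_obj <- mixgb(data = newborn, m = 5, save.models = TRUE)

# Impute a new batch of data with the saved models
# (here the first 100 rows of newborn stand in for new data)
imp_new <- impute_new(object = mixgb_obj, newdata = newborn[1:100, ])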

Tune hyperparameters

Imputation performance can be influenced by the choice of hyperparameters. While tuning a large number of hyperparameters may seem daunting, the search space can often be reduced substantially because many of them are correlated. In mixgb, the function mixgb_cv() is provided to tune the number of boosting rounds (nrounds). As XGBoost does not define a default value for nrounds, users must specify this parameter explicitly. The default setting in mixgb() is nrounds = 100; however, we recommend running mixgb_cv() first to determine an appropriate value.

params <- list(max_depth = 3, subsample = 0.7, nthread = 2)
cv.results <- mixgb_cv(data = newborn, nrounds = 100, xgb.params = params, verbose = FALSE)
cv.results$evaluation.log
#>      iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
#>     <int>           <num>          <num>          <num>         <num>
#>  1:     1       1.2361129     0.01558786      1.2475935    0.06870012
#>  2:     2       1.0176746     0.01717526      1.0424598    0.07874904
#>  3:     3       0.8719048     0.02081757      0.9006647    0.08512384
#>  4:     4       0.7729998     0.02320058      0.8114182    0.09141472
#>  5:     5       0.7106921     0.02228392      0.7594188    0.09764968
#>  6:     6       0.6681453     0.02403651      0.7214191    0.10517030
#>  7:     7       0.6393555     0.02535016      0.6969163    0.10746088
#>  8:     8       0.6172915     0.02457553      0.6825221    0.10873454
#>  9:     9       0.5996640     0.02405186      0.6723316    0.10954086
#> 10:    10       0.5873325     0.02603579      0.6640412    0.10883955
#> 11:    11       0.5767323     0.02478809      0.6597094    0.11041219
#> 12:    12       0.5696605     0.02529658      0.6550031    0.11278109
#> 13:    13       0.5623289     0.02657183      0.6503620    0.11267360
#> 14:    14       0.5576936     0.02601675      0.6485935    0.11352195
#> 15:    15       0.5531543     0.02582060      0.6453916    0.11458535
#> 16:    16       0.5492073     0.02539735      0.6439388    0.11370427
#> 17:    17       0.5447605     0.02462400      0.6409097    0.11258197
#> 18:    18       0.5399025     0.02563486      0.6400586    0.11189189
#> 19:    19       0.5313389     0.02270545      0.6411977    0.11286770
#> 20:    20       0.5292437     0.02296678      0.6421435    0.11267979
#> 21:    21       0.5273530     0.02348606      0.6414809    0.11358972
#> 22:    22       0.5251579     0.02400633      0.6411687    0.11479062
#> 23:    23       0.5228063     0.02426503      0.6416890    0.11348986
#> 24:    24       0.5184316     0.02286996      0.6424514    0.11329055
#> 25:    25       0.5156545     0.02337316      0.6427545    0.11420816
#> 26:    26       0.5121317     0.02295848      0.6436561    0.11287003
#> 27:    27       0.5076351     0.02153164      0.6488316    0.10984152
#> 28:    28       0.5037371     0.02142791      0.6492334    0.10992071
#>      iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
#>     <int>           <num>          <num>          <num>         <num>
cv.results$response
#> [1] "weight_kg"
cv.results$best.nrounds
#> [1] 18

By default, mixgb_cv() randomly selects an incomplete variable as the response and fits an XGBoost model using the remaining variables as predictors, based on the complete cases of the dataset. As a result, repeated runs of mixgb_cv() may yield different results. Users may instead explicitly specify the response variable and the set of covariates via the response and select_features arguments, respectively.

cv.results <- mixgb_cv(
  data = newborn, nfold = 10, nrounds = 100, early_stopping_rounds = 1,
  response = "head_circumference_cm", select_features = c("age_months", "sex", "race_ethinicity", "recumbent_length_cm", "first_subscapular_skinfold_mm", "second_subscapular_skinfold_mm", "first_triceps_skinfold_mm", "second_triceps_skinfold_mm", "weight_kg"), xgb.params = params, verbose = FALSE
)

cv.results$best.nrounds
#> [1] 16

We can then set nrounds = cv.results$best.nrounds in mixgb() to generate five imputed datasets.

imp_list <- mixgb(data = newborn, m = 5, nrounds = cv.results$best.nrounds)
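A typical next step is to analyse each imputed dataset and pool the results with Rubin's rules. A minimal sketch using the mice package (not part of mixgb; the model below is purely illustrative, using variables from newborn):

# Fit the same analysis model to each of the m imputed datasets
library(mice)
fits <- lapply(imp_list, function(dat) {
  lm(weight_kg ~ age_months + sex, data = dat)
})

# Pool the m fits with Rubin's rules; mice::pool() accepts a list of models
summary(pool(fits))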

Inspect multiply imputed values

Older versions of the mixgb package included a few visual diagnostic functions, but these have now been removed from mixgb.

We recommend our standalone package vismi (Visualisation Tools for Multiple Imputation), which provides a comprehensive set of visual diagnostics for evaluating multiply imputed data.

For more details, please visit:

https://agnesdeng.github.io/vismi/

https://github.com/agnesdeng/vismi