The mixgb package provides a scalable approach to imputation for large data using XGBoost, subsampling, and predictive mean matching. It leverages XGBoost—an efficient implementation of gradient-boosted trees—to automatically capture complex interactions and non-linear relationships. Subsampling and predictive mean matching are incorporated to reduce bias and to preserve realistic imputation variability. The package accommodates a wide range of variable types and offers flexible control over subsampling and predictive matching settings.
We also recommend our package vismi (Visualisation Tools for Multiple Imputation), which offers a comprehensive set of diagnostics for assessing the quality of multiply imputed data.
We first load the mixgb package and the newborn dataset, which contains 16 variables of various types (integer/numeric/factor/ordinal factor). Nine of these variables have missing values.
library(mixgb)
str(newborn)
#> tibble [2,107 × 16] (S3: tbl_df/tbl/data.frame)
#> $ household_size : int [1:2107] 4 3 5 4 4 3 5 3 3 3 ...
#> $ age_months : int [1:2107] 2 5 10 10 8 3 10 7 2 7 ...
#> $ sex : Factor w/ 2 levels "Male","Female": 2 1 2 2 1 1 2 2 2 1 ...
#> $ race : Factor w/ 3 levels "White","Black",..: 1 1 2 1 1 1 2 1 2 2 ...
#> $ ethnicity : Factor w/ 3 levels "Mexican-American",..: 3 1 3 3 3 3 3 3 3 3 ...
#> $ race_ethinicity : Factor w/ 4 levels "Non-Hispanic White",..: 1 3 2 1 1 1 2 1 2 2 ...
#> $ head_circumference_cm : num [1:2107] 39.3 45.4 43.9 45.8 44.9 42.2 45.8 NA 40.2 44.5 ...
#> $ recumbent_length_cm : num [1:2107] 59.5 69.2 69.8 73.8 69 61.7 74.8 NA 64.5 70.2 ...
#> $ first_subscapular_skinfold_mm : num [1:2107] 8.2 13 6 8 8.2 9.4 5.2 NA 7 5.9 ...
#> $ second_subscapular_skinfold_mm: num [1:2107] 8 13 5.6 10 7.8 8.4 5.2 NA 7 5.4 ...
#> $ first_triceps_skinfold_mm : num [1:2107] 9 15.6 7 16.4 9.8 9.6 5.8 NA 11 6.8 ...
#> $ second_triceps_skinfold_mm : num [1:2107] 9.4 14 8.2 12 8.8 8.2 6.6 NA 10.9 7.6 ...
#> $ weight_kg : num [1:2107] 6.35 9.45 7.15 10.7 9.35 7.15 8.35 NA 7.35 8.65 ...
#> $ poverty_income_ratio : num [1:2107] 3.186 1.269 0.416 2.063 1.464 ...
#> $ smoke : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 2 2 1 2 1 ...
#> $ health : Ord.factor w/ 5 levels "Excellent"<"Very Good"<..: 1 3 1 1 1 1 1 1 2 1 ...
colSums(is.na(newborn))
#> household_size age_months
#> 0 0
#> sex race
#> 0 0
#> ethnicity race_ethinicity
#> 0 0
#> head_circumference_cm recumbent_length_cm
#> 124 114
#> first_subscapular_skinfold_mm second_subscapular_skinfold_mm
#> 161 169
#> first_triceps_skinfold_mm second_triceps_skinfold_mm
#> 124 167
#> weight_kg poverty_income_ratio
#> 117 192
#> smoke health
#> 7 0

To impute this dataset, we use the default settings. By default, the
number of imputed datasets is set to m = 5. The data do not
need to be converted to a dgCMatrix or one-hot encoded
format, as these transformations are handled automatically by the
package. Supported variable types include numeric, integer, factor, and
ordinal factor.
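For example, imputation with the default settings is a one-liner; the object name imputed.data below is our own choice:

# impute with default settings: returns a list of m = 5 completed datasets
imputed.data <- mixgb(data = newborn)

# inspect the first imputed dataset
head(imputed.data[[1]])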
We can also customise imputation settings:

- The number of imputed datasets: m
- The number of imputation iterations: maxit
- XGBoost hyperparameters and verbosity: xgb.params, nrounds, early_stopping_rounds, print_every_n and verbose
- The subsampling ratio: by default, subsample = 0.7; users can change this value via the xgb.params argument
- Predictive mean matching settings: pmm.type, pmm.k and pmm.link
- Whether ordinal factors should be converted to integers, which may speed up the imputation process: ordinalAsInteger
- Initial imputation methods for different variable types: initial.num, initial.int and initial.fac
- Whether to save models for imputing new data later: save.models and save.vars
set.seed(2026)
# Use mixgb with chosen settings
params <- list(
  max_depth = 5,
  subsample = 0.9,
  nthread = 2,
  tree_method = "hist"
)

imp_list <- mixgb(
  data = newborn, m = 10, maxit = 2,
  ordinalAsInteger = FALSE,
  pmm.type = "auto", pmm.k = 5, pmm.link = "prob",
  initial.num = "normal", initial.int = "mode", initial.fac = "mode",
  save.models = FALSE, save.vars = NULL,
  xgb.params = params, nrounds = 200, early_stopping_rounds = 10,
  print_every_n = 10L, verbose = 0
)

Imputation performance can be influenced by the choice of
hyperparameters. While tuning a large number of hyperparameters may seem
daunting, the search space can often be substantially reduced because
many of them are correlated. In mixgb, the function
mixgb_cv() is provided to tune the number of boosting
rounds (nrounds). XGBoost itself does not define a default value for nrounds, so this parameter must always be supplied; mixgb() uses nrounds = 100 by default. However, we recommend running mixgb_cv() first to obtain a more suitable value.
params <- list(max_depth = 3, subsample = 0.7, nthread = 2)
cv.results <- mixgb_cv(data = newborn, nrounds = 100, xgb.params = params, verbose = FALSE)
cv.results$evaluation.log
#> iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
#> <int> <num> <num> <num> <num>
#> 1: 1 1.2361129 0.01558786 1.2475935 0.06870012
#> 2: 2 1.0176746 0.01717526 1.0424598 0.07874904
#> 3: 3 0.8719048 0.02081757 0.9006647 0.08512384
#> 4: 4 0.7729998 0.02320058 0.8114182 0.09141472
#> 5: 5 0.7106921 0.02228392 0.7594188 0.09764968
#> 6: 6 0.6681453 0.02403651 0.7214191 0.10517030
#> 7: 7 0.6393555 0.02535016 0.6969163 0.10746088
#> 8: 8 0.6172915 0.02457553 0.6825221 0.10873454
#> 9: 9 0.5996640 0.02405186 0.6723316 0.10954086
#> 10: 10 0.5873325 0.02603579 0.6640412 0.10883955
#> 11: 11 0.5767323 0.02478809 0.6597094 0.11041219
#> 12: 12 0.5696605 0.02529658 0.6550031 0.11278109
#> 13: 13 0.5623289 0.02657183 0.6503620 0.11267360
#> 14: 14 0.5576936 0.02601675 0.6485935 0.11352195
#> 15: 15 0.5531543 0.02582060 0.6453916 0.11458535
#> 16: 16 0.5492073 0.02539735 0.6439388 0.11370427
#> 17: 17 0.5447605 0.02462400 0.6409097 0.11258197
#> 18: 18 0.5399025 0.02563486 0.6400586 0.11189189
#> 19: 19 0.5313389 0.02270545 0.6411977 0.11286770
#> 20: 20 0.5292437 0.02296678 0.6421435 0.11267979
#> 21: 21 0.5273530 0.02348606 0.6414809 0.11358972
#> 22: 22 0.5251579 0.02400633 0.6411687 0.11479062
#> 23: 23 0.5228063 0.02426503 0.6416890 0.11348986
#> 24: 24 0.5184316 0.02286996 0.6424514 0.11329055
#> 25: 25 0.5156545 0.02337316 0.6427545 0.11420816
#> 26: 26 0.5121317 0.02295848 0.6436561 0.11287003
#> 27: 27 0.5076351 0.02153164 0.6488316 0.10984152
#> 28: 28 0.5037371 0.02142791 0.6492334 0.10992071
#> iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
#> <int> <num> <num> <num> <num>
cv.results$response
#> [1] "weight_kg"
cv.results$best.nrounds
#> [1] 18

By default, mixgb_cv() randomly selects an incomplete
variable as the response and fits an XGBoost model using the remaining
variables as predictors, based on the complete cases of the dataset. As
a result, repeated runs of mixgb_cv() may yield different
results. Users may instead explicitly specify the response variable and
the set of covariates via the response and
select_features arguments, respectively.
cv.results <- mixgb_cv(
  data = newborn, nfold = 10, nrounds = 100, early_stopping_rounds = 1,
  response = "head_circumference_cm",
  select_features = c(
    "age_months", "sex", "race_ethinicity", "recumbent_length_cm",
    "first_subscapular_skinfold_mm", "second_subscapular_skinfold_mm",
    "first_triceps_skinfold_mm", "second_triceps_skinfold_mm", "weight_kg"
  ),
  xgb.params = params, verbose = FALSE
)
cv.results$best.nrounds
#> [1] 16

We can then set nrounds = cv.results$best.nrounds in mixgb() to generate five imputed datasets, as sketched below.
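A minimal sketch, reusing the tuned value (the object name imputed.data is our own):

# impute with the tuned number of boosting rounds (default m = 5)
imputed.data <- mixgb(data = newborn, nrounds = cv.results$best.nrounds)

# verify that no missing values remain in the first completed dataset
colSums(is.na(imputed.data[[1]]))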
Older versions of the mixgb package included a few visual diagnostic functions; these have now been removed from mixgb. We recommend our standalone package vismi (Visualisation Tools for Multiple Imputation), which provides a comprehensive set of visual diagnostics for evaluating multiply imputed data. For more details, please visit the vismi package website.