---
title: "Summary Functions"
date: "`r format(Sys.Date(), '%Y-%m-%d')`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
    fig_width: 7
    fig_height: 5
    dpi: 600
vignette: >
  %\VignetteIndexEntry{Summary Functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The rCISSVAE package includes summary functions that help you visualize the data clusters and assess how well the model suits the data.

## Per-cluster Summary

The `cluster_summary()` function creates a data summary table stratified by missingness cluster. The function builds on `gtsummary::tbl_summary()`, so gtsummary-style statistics can be used to summarize variables (see the `tbl_summary()` documentation for details).

```{r setup}
library(tidyverse)
library(reticulate)
library(rCISSVAE)
library(kableExtra)
library(gtsummary)

data(df_missing)
data(clusters)

## Integer clusters must be passed in as a factor
cluster_summary(
  data = df_missing,
  factor(clusters$clusters),
  include = setdiff(names(df_missing), "index"),
  statistic = list(
    all_continuous() ~ "{mean} ({sd})",
    all_categorical() ~ "{n} / {N}\n ({p}%)"),
  missing = "always")
```

## Missingness Heatmap

```{r}
cluster_heatmap(
  data = df_missing,
  clusters = paste0("Cluster ", clusters$clusters), ## Adds 'Cluster' to the cluster label
  cols_ignore = "index",
  observed_color = "#23013aff", ## A dark purple
  missing_color = "yellow")
```

## By-cluster imputation loss function

After running the model, you can get the per-cluster imputation loss on the validation set using the `performance_by_cluster()` function. To use `performance_by_cluster()` on the result object, set `return_validation_dataset = TRUE` in the call to `run_cissvae()`. If the validation dataset (`val_data` in the result object) and the imputed validation dataset (`val_imputed` in the result object) are not returned, the imputation loss cannot be calculated.
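
The requirement on the returned validation datasets can be checked before calling `performance_by_cluster()`. The following is a minimal sketch, assuming the result object is a named list containing the `val_data` and `val_imputed` elements mentioned in this section. `check_validation_returns()` is a hypothetical helper written for illustration and is not part of rCISSVAE.

```r
## Hypothetical helper (not part of rCISSVAE): stop early when the result
## object lacks the elements needed to compute the imputation loss.
check_validation_returns <- function(res,
                                     required = c("val_data", "val_imputed")) {
  absent <- setdiff(required, names(res))
  if (length(absent) > 0) {
    stop(
      "Missing from result object: ", paste(absent, collapse = ", "),
      ". Rerun run_cissvae() with return_validation_dataset = TRUE.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

## Toy lists standing in for the output of run_cissvae() (assumed shape)
complete_res <- list(val_data = data.frame(), val_imputed = data.frame())
partial_res  <- list(val_data = data.frame())

## Passes silently for the complete object
check_validation_returns(complete_res)

## Raises an error for the partial object; caught here for demonstration
ok <- tryCatch(
  check_validation_returns(partial_res),
  error = function(e) FALSE
)
```

Failing early in this way gives a clearer message than an error raised later inside the scoring step.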
If `run_cissvae()` was used to generate the clusters, set `return_clusters = TRUE` and the clusters will be included in the returned object. Otherwise, pass the clusters through the `clusters` parameter of `performance_by_cluster()`.

```{r eval=FALSE}
result = run_cissvae(
  data = df_missing,
  index_col = "index",
  val_proportion = 0.1, ## pass a vector for different proportions by cluster
  ## If there are columns in addition to the index that you want to ignore when
  ## selecting the validation set, list them here. In this case, we ignore the
  ## 'demographic' columns because we do not want to remove data from them for
  ## validation purposes.
  columns_ignore = c("Age", "Salary", "ZipCode10001", "ZipCode20002", "ZipCode30003"),
  clusters = clusters$clusters, ## we have precomputed cluster labels so we pass them here
  epochs = 5,
  return_silhouettes = FALSE,
  return_history = TRUE, # Get detailed training history
  verbose = FALSE,
  return_model = TRUE, ## Allows for plotting model schematic
  device = "cpu", # Explicit device selection
  layer_order_enc = c("unshared", "shared", "unshared"),
  layer_order_dec = c("shared", "unshared", "shared"),
  return_validation_dataset = TRUE
)

cat(paste("Check necessary returns:", paste0(names(result), collapse = ", ")))
```

```{r checkres, echo=FALSE}
result = readRDS(system.file("extdata", "demo_run.rds", package = "rCISSVAE"))

cat(paste("Check necessary returns:", paste0(names(result), collapse = ", ")))
```

```{r perf_by_clus}
performance_by_cluster(
  res = result,
  group_col = NULL,
  clusters = clusters$clusters,
  feature_cols = NULL, ## default, all numeric columns excluding group_col & cols_ignore
  by_group = FALSE,
  by_cluster = TRUE,
  cols_ignore = c(
    "index", "Age", "Salary", "ZipCode10001", "ZipCode20002", "ZipCode30003"
  ) ## columns to not score
)
```
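
To make the idea of a per-cluster validation loss concrete, here is a small base R sketch on toy data. It rests on two assumptions that this vignette does not confirm: that the loss is a mean squared error, and that cells not held out for validation are `NA` in the validation data. The objects `val_data`, `val_imputed`, and `toy_clusters` below are made-up stand-ins, and the computation is an illustration rather than the implementation of `performance_by_cluster()`.

```r
## Toy stand-ins for the validation data and its imputed counterpart.
## Assumption: NA in val_data marks cells that were not held out.
val_data <- data.frame(
  Y1 = c(1.0, NA, 3.0, NA),
  Y2 = c(NA, 2.0, NA, 4.0)
)
val_imputed <- data.frame(
  Y1 = c(1.5, 9.9, 2.0, 9.9),
  Y2 = c(9.9, 2.0, 9.9, 6.0)
)
toy_clusters <- c(0, 0, 1, 1)

## Squared error per cell; cells that were not held out stay NA.
sq_err <- (as.matrix(val_data) - as.matrix(val_imputed))^2

## Average the squared error over the held-out cells of each cluster.
mse_by_cluster <- sapply(
  split(seq_len(nrow(sq_err)), toy_clusters),
  function(rows) mean(sq_err[rows, , drop = FALSE], na.rm = TRUE)
)

mse_by_cluster
```

A cluster with a noticeably higher value than the others would suggest that the model imputes that missingness pattern less well, which is the kind of comparison the per-cluster output is meant to support.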