---
title: "Multiple dispatch based on dataframes"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Multiple dispatch based on dataframes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(interfacer)
```

# Rationale

The S3 type system allows for dispatch based on the class of the first argument of a function. When we are developing functions that take dataframes as input, selecting a dispatch function needs to be based on the structure of the input rather than its class. `interfacer` can use `iface` specifications to associate a particular action with a specific input type.

# Dispatch

Dispatching to one of a number of functions based on the nature of a dataframe input is enabled by `idispatch(...)`. This emulates the behaviour of `S3` classes but for dataframes, based on their columns and also their grouping. Consider the following `iface` specifications:

```{r}
i_test = iface(
  id = integer ~ "an integer ID",
  test = logical ~ "the test result"
)

# Extends the i_test to include an additional column
i_test_extn = iface(
  i_test,
  extra = character ~ "a new value",
  .groups = FALSE
)
```

We can create specific handlers for each type of data and decide which function to dispatch to at runtime based on the input dataframe. The handlers are specified in the format `function_name = iface constraint`.

```{r}
# The generic function
disp_example = function(x, ...) {
  idispatch(x,
    disp_example.extn = i_test_extn,
    disp_example.no_extn = i_test
  )
}

# The handler for extended input dataframe types
disp_example.extn = function(x = i_test_extn, ...) {
  message("extended data function")
  return(colnames(x))
}

# The handler for non-extended input dataframe types
disp_example.no_extn = function(x = i_test, ...)
{
  message("not extended data function")
  return(colnames(x))
}
```

If we call `disp_example()` with data that matches the `i_test_extn` specification we get one type of behaviour:

```{r}
tmp = tibble::tibble(
  id = c("1","2","3"),
  test = c(TRUE,FALSE,TRUE),
  extra = 1.1
)

tmp %>% disp_example()
```

But if we call `disp_example()` with data that only matches the `i_test` specification we get different behaviour:

```{r}
# this matches only the i_test specification:
tmp2 = tibble::tibble(
  id = c("1","2","3"),
  test = c(TRUE,FALSE,TRUE)
)

tmp2 %>% disp_example()
```

I've used this mechanism, for example, to configure how plots are produced depending on the input.

The order of the rules provided to `idispatch` is important. In general the more detailed specifications need to be provided first, and the more generic specifications last.

# Grouping based dispatch

It is often useful to have a function that expects a specific grouping but can also handle additional groups. One way of handling these is to use `purrr` and nest columns extensively, nesting the data within the unexpected groups and repeatedly applying the function you want to each nested dataframe. An alternative `dplyr` solution is to use `group_modify`. `interfacer` leverages this second option: it determines the grouping a pipeline function needs from its stated grouping requirements and handles any additional groups automatically, without additional coding in the package.

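
The two manual approaches mentioned above can be sketched as follows. This is only an illustration of the general pattern and does not show how `interfacer` implements it. The helper `mean_price_by_color()` is a hypothetical name introduced here, and the sketch assumes that `dplyr`, `tidyr`, `purrr` and `ggplot2` are installed.

```{r}
library(dplyr)

# Hypothetical helper: written as if `color` were the only grouping needed
mean_price_by_color = function(df) {
  df %>%
    dplyr::group_by(color) %>%
    dplyr::summarise(mean_price = mean(price))
}

# Option 1: nest everything except the unexpected `cut` column, apply the
# helper to each nested dataframe, and unnest the results
nested_result = ggplot2::diamonds %>%
  tidyr::nest(data = -cut) %>%
  dplyr::mutate(result = purrr::map(data, mean_price_by_color)) %>%
  dplyr::select(-data) %>%
  tidyr::unnest(result)

# Option 2: group by the unexpected `cut` column and let `group_modify`
# apply the helper to each subgroup
modified_result = ggplot2::diamonds %>%
  dplyr::group_by(cut) %>%
  dplyr::group_modify(function(d, key) mean_price_by_color(d))

dplyr::glimpse(modified_result)
```

In both versions the caller has to know in advance which groups are the extra ones. With `interfacer` that information comes from the `iface` specification instead.
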
For example, if we have the following `iface`, the input for a function must be grouped only by the `color` column:

```{r}
# This specification requires that the dataframe is grouped only by the color
# column
i_diamond_price = interfacer::iface(
  color = enum(`D`,`E`,`F`,`G`,`H`,`I`,`J`, .ordered=TRUE) ~ "the color column",
  price = integer ~ "the price column",
  .groups = ~ color
)
```

A package developer writing a pipeline function may use this fact to handle possible additional grouping by using `igroup_process(df, ...)`:

```{r}
# An example function which would be exported in a package
# This function expects a dataframe with a colour and price column, grouped
# by colour.
mean_price_by_colour = function(df = i_diamond_price, extra_param = ".") {

  # When called with a dataframe with extra groups `igroup_process` will
  # regroup the dataframe according to the structure
  # defined for `i_diamond_price` and apply the inner function to each group
  # after first calling `ivalidate` on each group.

  igroup_process(df,

    # the real work of this function is provided as an anonymous inner
    # function (but can be any other function e.g. package private function
    # but not a purrr style lambda). Ideally this function's parameters are
    # named the same as those of the enclosing function (here
    # `mean_price_by_colour(df,extra_param)`), however there is some
    # flexibility here. The special `.groupdata` parameter will be populated
    # with the values of the unexpected grouping.

    function(df, extra_param, .groupdata) {
      message(extra_param, appendLF = FALSE)
      if (nrow(.groupdata) == 0) message("N.B. zero length group data")
      return(df %>% dplyr::summarise(mean_price = mean(price)))
    }

  )
}
```

If we pass correctly grouped data conforming to `i_diamond_price` to this function, the inner function is executed once transparently, after the input has been validated:

```{r}
# The correctly grouped dataframe. The `mean_price_by_colour` function
# calculates the mean price for each `color` group.
ggplot2::diamonds %>%
  dplyr::group_by(color) %>%
  mean_price_by_colour(extra_param = "without additional groups... ") %>%
  dplyr::glimpse()
```

If an additionally grouped dataframe is provided by the user, the `mean_price_by_colour` function calculates the mean price for each `cut`, `clarity`, and `color` combination. Data validation happens once per group, which affects the interpretation of uniqueness.

```{r}
ggplot2::diamonds %>%
  dplyr::group_by(cut, color, clarity) %>%
  mean_price_by_colour() %>%
  dplyr::glimpse()
```

The output of this is actually grouped by `cut` and `clarity`, as the `color` column grouping is consumed by the nested function in `igroup_process`.

`igroup_process` can also be used recursively for a very succinct handling of additional grouping. In this case the function being developed calls `igroup_process` without an inner function, and `igroup_process` applies the enclosing function itself. If the input is correctly grouped, `igroup_process` returns and the rest of the function runs; otherwise it splits the input into correctly grouped pieces and processes each one individually:

```{r}
recursive_example = function(df = i_diamond_price) {

  # call enclosing function recursively if additional groups detected
  igroup_process(df)

  # code after this point is only executed if the grouping is correct
  # it will be executed once per additional group.
  # it must return a dataframe
  return(tibble::tibble("rows detected:" = nrow(df)))
}

# this input is grouped as the specification is expecting
# so `igroup_process` does nothing.
ggplot2::diamonds %>%
  dplyr::group_by(color) %>%
  recursive_example() %>%
  dplyr::glimpse()

# this input has additional grouping beyond the specification but is handled
# gracefully
ggplot2::diamonds %>%
  dplyr::group_by(cut,clarity,color) %>%
  recursive_example() %>%
  dplyr::glimpse()
```
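
To make the recursive pattern more concrete, here is a hand-rolled sketch of the same idea using only `dplyr`. It is not how `igroup_process` is implemented and it performs no validation. The function name `count_rows_by_color()` and the hard-coded `expected` grouping are hypothetical choices made for illustration.

```{r}
library(dplyr)

# Hypothetical stand-in for the recursive pattern, NOT interfacer's code
count_rows_by_color = function(df) {
  expected = "color"
  extra = setdiff(dplyr::group_vars(df), expected)

  if (length(extra) > 0) {
    # regroup by the unexpected columns only, then call this function again
    # on each subgroup, grouped by the expected column
    return(
      df %>%
        dplyr::group_by(dplyr::across(dplyr::all_of(extra))) %>%
        dplyr::group_modify(function(d, key) {
          count_rows_by_color(dplyr::group_by(d, color))
        })
    )
  }

  # from here on the grouping contains no unexpected columns
  tibble::tibble("rows detected:" = nrow(df))
}

manual_result = ggplot2::diamonds %>%
  dplyr::group_by(cut, clarity, color) %>%
  count_rows_by_color()

dplyr::glimpse(manual_result)
```

The sketch has to spell out the expected grouping and the regrouping step by hand, which is the bookkeeping that `igroup_process` derives from the `iface` specification.
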