---
title: "Runtime"
output: rmarkdown::html_vignette
bibliography: '`r system.file("REFERENCES.bib", package="SDModels")`'
vignette: >
  %\VignetteIndexEntry{Runtime}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

For this package, we have written methods to estimate regression trees and random forests that minimize the spectral objective:
$$\hat{f} = \text{argmin}_{f' \in \mathcal{F}} \frac{||Q(\mathbf{Y} - f'(\mathbf{X}))||_2^2}{n}$$
The package is currently fully written in @RCoreTeam2024R:Computing, which gets quite slow for larger sample sizes. A faster C++ version might follow in the future, but for now, there are a few ways to speed up the computations if you apply the methods to larger data sets.

## Parallel Processing

Many of our functions support parallel processing; the parameter `mc.cores` controls the number of cores used.

```{r core1, eval=FALSE}
# fits the individual SDTrees in parallel on 22 cores
fit <- SDForest(x = X, y = Y, mc.cores = 22)

# predicts with the individual SDTrees in parallel
predict(fit, newdata = data.frame(X), mc.cores = 10)

# evaluates different strengths of regularization in parallel
paths <- regPath(fit, mc.cores = 10)

# predicts potential outcomes for different values of covariate one in parallel
pd <- partDependence(fit, 1, mc.cores = 10)

# performs cross-validation in parallel
model <- SDAM(X, Y, cv_k = 5, mc.cores = 5)
```

To support parallelization, we use the R package [future](https://future.futureverse.org/) [@RJ-2021-048]. If `mc.cores` is larger than one, `multicore` (forking of processes) is used if possible, and `multisession` otherwise. If `mc.cores` is smaller than two, we process sequentially or use a pre-specified plan. This way, a user can freely choose and set up any [backend](https://future.futureverse.org/articles/future-2b-backend.html).
```{r core2, eval=FALSE}
# predefined plan
future::plan(future::multisession, workers = 2)

# fits the individual SDTrees in parallel on 2 cores
fit <- SDForest(x = X, y = Y)
```

## Approximations

In a few places, approximations perform almost as well as running the whole procedure. In principle, reasonable split points to divide the space $\mathbb{R}^p$ are all values between the observed ones. In practice, with many observations, the number of potential splits grows too large. We therefore evaluate at most `max_candidates` splits, chosen according to the quantiles of the potential split points.

```{r candidates, eval=FALSE}
# approximation of candidate splits
fit <- SDForest(x = X, y = Y, max_candidates = 100)
tree <- SDTree(x = X, y = Y, max_candidates = 50)
```

If we have many observations, we can reduce computing time by sampling only `max_size` observations from the data for each tree instead of $n$. This can dramatically reduce computing time compared to a full bootstrap sample but could also decrease performance.

```{r subsample, eval=FALSE}
# draws maximal 500 samples from the data for each tree
fit <- SDForest(x = X, y = Y, max_size = 500)
```
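As a rough standalone illustration of the candidate-split approximation (a sketch, not code from the package): for a single covariate, the potential split points are the midpoints between consecutive observed values, and at most `max_candidates` of them are kept, spread according to their quantiles.

```{r quantile_sketch, eval=FALSE}
# standalone sketch of quantile-based candidate selection (not package code)
x <- sort(rnorm(10000))                   # observed values of one covariate
potential <- (x[-1] + x[-length(x)]) / 2  # midpoints between consecutive values
max_candidates <- 100

# keep max_candidates splits, chosen as quantiles of the potential ones
candidates <- quantile(potential, probs = seq(0, 1, length.out = max_candidates))
```

This keeps the evaluated splits representative of the distribution of the data while bounding the number of objective evaluations per node.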