---
title: "Runtime"
output: rmarkdown::html_vignette
bibliography: '`r system.file("REFERENCES.bib", package="SDModels")`'
vignette: >
  %\VignetteIndexEntry{Runtime}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
For this package, we have written methods to estimate regression trees and random forests that minimize the spectral objective:
$$\hat{f} = \text{argmin}_{f' \in \mathcal{F}} \frac{||Q(\mathbf{Y} - f'(\mathbf{X}))||_2^2}{n}$$
The package is currently written entirely in R [@RCoreTeam2024R:Computing], and it gets quite slow for larger sample sizes. There might be a faster C++ version in the future, but for now, there are a few ways to speed up the computations if you apply the methods to larger data sets.
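To make the spectral objective above concrete, here is a minimal sketch in base R that evaluates it for a given set of fitted values. The helper `spectral_loss()` is hypothetical and not part of the package; `Q` is simply assumed to be a given $n \times n$ matrix, and `f_X` stands for the vector $f'(\mathbf{X})$.

```{r objective-sketch, eval=FALSE}
# hypothetical helper, not part of SDModels:
# computes ||Q (Y - f(X))||_2^2 / n for a given matrix Q
spectral_loss <- function(Q, Y, f_X) {
  sum((Q %*% (Y - f_X))^2) / length(Y)
}

set.seed(1)
n <- 50
Y <- rnorm(n)
# a constant fit, used here only as a placeholder for f(X)
f_X <- rep(mean(Y), n)

# with the identity matrix as Q, the objective reduces to the mean squared error
all.equal(spectral_loss(diag(n), Y, f_X), mean((Y - f_X)^2))
```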
## Parallel Processing
Many of our functions support parallel processing using the parameter `mc.cores` to control the number of cores used.
```{r core1, eval=FALSE}
# fits the individual SDTrees in parallel on 22 cores
fit <- SDForest(x = X, y = Y, mc.cores = 22)
# predicts with the individual SDTrees in parallel
predict(fit, newdata = data.frame(X), mc.cores = 10)
# evaluates different strengths of regularization in parallel
paths <- regPath(fit, mc.cores = 10)
# performs cross validation in parallel
model <- SDAM(X, Y, cv_k = 5, mc.cores = 5)
# predicts potential outcomes for different values of covariate one in parallel
pd <- partDependence(model, 1, mc.cores = 10)
```
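The core counts above are only examples. One possible way to pick a value for `mc.cores` on a given machine is sketched below, assuming you want to leave one core free for other work; the variable `n_cores` is illustrative, and `parallel::detectCores()` ships with base R.

```{r cores-sketch, eval=FALSE}
# number of cores reported by the system; detectCores() can return NA
n_detected <- parallel::detectCores()
if (is.na(n_detected)) n_detected <- 1L

# leave one core free, but always use at least one
n_cores <- max(1L, n_detected - 1L)

# n_cores can then be passed on, e.g.
# fit <- SDForest(x = X, y = Y, mc.cores = n_cores)
```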
To support parallelization, we use the R package [future](https://future.futureverse.org/) [@RJ-2021-048]. If `mc.cores` is larger than one, `multicore` (forking of processes) is used if possible, and `multisession` otherwise. If `mc.cores` is smaller than two, we process sequentially or, if a plan has been specified beforehand, use that plan. This way, a user can freely choose and set up any [backend](https://future.futureverse.org/articles/future-2b-backend.html).
```{r core2, eval=FALSE}
# predefined plan
future::plan(future::multisession, workers = 2)
# fits the individual SDTrees in parallel on 2 workers
fit <- SDForest(x = X, y = Y)
```
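The backend rule described above can be summarised in a few lines. This is only a sketch of the decision logic as stated in the text, not the package's internal code; `choose_backend()` and its argument `can_fork` are hypothetical names, while `future::supportsMulticore()` is an existing function that reports whether forking is available.

```{r backend-sketch, eval=FALSE}
# hypothetical helper, not part of SDModels
choose_backend <- function(mc.cores = 1,
                           can_fork = future::supportsMulticore()) {
  if (mc.cores > 1) {
    # forking if the platform allows it, separate R sessions otherwise
    if (can_fork) "multicore" else "multisession"
  } else {
    # nothing is changed: sequential, or whatever plan the user has set
    "sequential or pre-specified plan"
  }
}

choose_backend(4, can_fork = TRUE)
choose_backend(4, can_fork = FALSE)
choose_backend(1)
```

After working with a predefined plan, `future::plan(future::sequential)` switches back to sequential processing.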
## Approximations
In a few places, approximations perform almost as well as running the whole procedure. Reasonable split points to divide the space $\mathbb{R}^p$ are, in principle, all values between the observed ones. In practice, with many observations, the number of potential splits grows too large. We therefore evaluate at most `max_candidates` of the potential splits, chosen according to their quantiles.
```{r candidates, eval=FALSE}
# approximation of candidate splits
fit <- SDForest(x = X, y = Y, max_candidates = 100)
tree <- SDTree(x = X, y = Y, max_candidates = 50)
```
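The idea of thinning the potential splits by quantiles can be illustrated for a single covariate in base R. The helper `candidate_splits()` below is hypothetical, and the exact rule used inside the package may differ; the sketch assumes that potential splits are the midpoints between consecutive observed values.

```{r candidates-sketch, eval=FALSE}
# hypothetical helper, not part of SDModels
candidate_splits <- function(x, max_candidates = 100) {
  x_sorted <- sort(unique(x))
  if (length(x_sorted) < 2) return(numeric(0))
  # potential splits: midpoints between consecutive observed values
  mids <- (x_sorted[-1] + x_sorted[-length(x_sorted)]) / 2
  if (length(mids) <= max_candidates) return(mids)
  # keep at most max_candidates of them, spread evenly over the quantiles;
  # type = 1 returns values that actually occur in mids
  probs <- seq(0, 1, length.out = max_candidates)
  unique(unname(quantile(mids, probs = probs, type = 1)))
}

set.seed(1)
x <- rnorm(1000)
splits <- candidate_splits(x, max_candidates = 50)
length(splits)
```

With 1000 distinct observations there are 999 potential splits, of which at most 50 are kept here.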
If we have many observations, we can reduce computing time by sampling only `max_size` observations from the data for each tree instead of $n$. This can dramatically reduce computing time compared to a full bootstrap sample but could also decrease performance.
```{r subsample, eval=FALSE}
# draws at most 500 observations from the data for each tree
fit <- SDForest(x = X, y = Y, max_size = 500)
```
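As a rough illustration of what this subsampling means, the sketch below draws the row indices for one tree. The helper `draw_tree_sample()` is hypothetical, and sampling with replacement is an assumption made here by analogy with the bootstrap; the package's internal sampling may differ.

```{r subsample-sketch, eval=FALSE}
# hypothetical helper, not part of SDModels
draw_tree_sample <- function(n, max_size) {
  # never draw more than n observations; sampling with replacement is assumed
  sample(n, size = min(n, max_size), replace = TRUE)
}

set.seed(1)
# many observations: only max_size indices are drawn
idx_large <- draw_tree_sample(n = 10000, max_size = 500)
length(idx_large)

# few observations: this is a full bootstrap sample of size n
idx_small <- draw_tree_sample(n = 200, max_size = 500)
length(idx_small)
```

Here the first call yields 500 indices and the second 200, so `max_size` only has an effect when $n$ exceeds it.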