# GAReg

## Genetic Algorithms in Regression.

Authors: Mo Li (mo.li@louisiana.edu), QiQi Lu (qlu2@vcu.edu), Robert Lund (rolund@ucsc.edu), Xueheng Shi (xshi11@unl.edu)

## Overview

GAReg is a unified genetic algorithm framework for regression 
problems that require discrete optimization over model spaces with unknown or 
variable dimension, where gradient-based methods and exhaustive search are 
impractical. It provides a compact chromosome representation for tasks such 
as optimal spline knot placement, best-subset variable selection, and multiple 
changepoint detection, along with exact uniform initialization, 
constraint-preserving crossover and mutation, steady-state replacement, 
and optional island-model parallelization. In challenging high-dimensional 
settings, GAReg enables efficient search and delivers near-optimal solutions 
when alternative algorithms are not well-justified.

## Package download and installation

You can install the version of changepointGA from CRAN:

```{r}
install.packages("GAReg")
```

or the development version from Github:

```{r}
# install.packages("remotes")
remotes::install_github("mli171/GAReg", build_vignettes = TRUE, dependencies = TRUE)
```

## Open the Vignette in R for more details

```{r}
browseVignettes("GAReg")
```

## GAReg: Genetic Algorithm Applications in Regression

### Best Subset Variable Selection

We use the simulation function below for subset selection illustration. 
Here, `n` is the number of observations and `p` is the number of predictors. 
For the covairates, `s0` represent number of truly active predictors, valued 
range from `0` to `p`. `magnitudes_range` specifies the range of significantly 
expressed coefficients that corresponding to the truly active predictors. 
If `rho` is specified with some values, the autoregressive structure is 
introduced into the error terms. If `rho=NULL`, we will have independent and 
identically distributed (IID) errors. We can also specify `sigma` for the error 
standard deviation.

```{r}
sim_subset_data <- function(n = 60, p = 50, s0 = 25, sigma = 1.5,
                            magnitudes_range = c(0.5, 2),
                            rho = NULL,
                            seed = NULL) {
  stopifnot(n > 0, p > 0, s0 >= 0, s0 <= p, sigma >= 0)
  if (!is.null(seed)) set.seed(seed)

  X <- matrix(rnorm(n * p), n, p)
  
  # Active set and coefficients
  true_idx    <- if (s0 > 0) sort(sample.int(p, s0)) else integer(0)
  signs       <- if (s0 > 0) sample(c(-1, 1), s0, replace = TRUE) else numeric(0)
  magnitudes  <- if (s0 > 0) runif(s0, magnitudes_range[1], magnitudes_range[2]) else numeric(0)

  beta_true <- numeric(p)
  if (s0 > 0) beta_true[true_idx] <- magnitudes * signs

  if (is.null(rho)) {
    e <- rnorm(n, sd = sigma)
  } else {
    sd_innov <- sigma * sqrt(1 - rho^2)
    burn_in  <- 100
    z        <- rnorm(n + burn_in, sd = sd_innov)
    e_full   <- numeric(n + burn_in)
    for (t in 2:(n + burn_in)) e_full[t] <- rho * e_full[t - 1] + z[t]
    e <- e_full[(burn_in + 1):(burn_in + n)]
  }

  y <- as.numeric(X %*% beta_true + e)

  DF <- data.frame(y = y, as.data.frame(X))
  colnames(DF)[-1] <- paste0("X", seq_len(p))

  list(
    X = X,
    y = y,
    beta_true = beta_true,
    true_idx = true_idx,
    DF = DF,
    rho = if (is.null(rho)) NULL else rho,
    args = list(n = n, p = p, s0 = s0, sigma = sigma,
                magnitudes_range = magnitudes_range,
                rho = rho, seed = seed)
  )
}

sim <- sim_subset_data(n=100, p=50, s0=25, sigma=1.5, rho=NULL, seed=123)
y <- sim$y
X <- sim$X
ga <- gareg_subset(y=y, X=X, gaMethod = "GA", monitor = FALSE,
                   gacontrol=list(popSize=120,
                                  maxiter=20000,
                                  run=4000,
                                  pmutation=0.02))
summary(ga)

# False Discovery Rate and True Positive Rate calculation function
res <- FDRCalc(truelabel=sim$true_idx, predlabel=ga@bestsol, N=50)
# FALSE Discover Rate
res$fdr
# TRUE Positive Rate
res$tpr
```

### Changepoint Detection

The multiple changepoint detection can be conducted through `changepointGA` 
package (Li and Lu, 2024). The BIC penalized function is provided below for IID 
data. The related math details can be found in (Li et al., 2026).

```{r}
BIC.cpt = function(chromosome, Xt){
  
  m   = chromosome[1]
  tau = chromosome[2:(2 + m - 1)]
  N   = length(Xt)
  
  if(m==0){
    mu.hat = mean(Xt)
    sigma.hatsq = sum( (Xt-mu.hat)^2 )/N
    BIC.obj = 0.5*N*log(sigma.hatsq)+ 2*log(N)
  }
  else{
    tau.ext = c(1, tau, N+1)
    seg.len = diff(tau.ext)
    ff = rep(0:m, times=seg.len)
    Xseg = split(Xt, ff)
    mu.seg = unlist(lapply(Xseg,mean), use.names=F)
    mu.hat = rep(mu.seg, seg.len)
    sigma.hatsq = sum( (Xt-mu.hat)^2 )/N
    BIC.obj = 0.5*N*log(sigma.hatsq) + (2*m + 2)*log(N)
  }
  return(BIC.obj)
}

# IID data
set.seed(1234)
n = 200
et = rnorm(n)
Xt = et + rep(c(0,2,0,2), each=n/4)

library(changepointGA)
GA.res <- cptga(
  ObjFunc = BIC.cpt, N = n, popSize = 500, 
  pcrossover = 0.95, pmutation = 0.3, pchangepoint = 10/n,
  Xt = Xt
)
summary(GA.res)
```

### Optimal Knot Selection for Splines

The classic motocycle impact dataset from `MASS` package (Venables & Ripley, 2002) 
is used as example here. 

```{r}
library(MASS)
library(splines)

data(mcycle)
head(mcycle)
```

1. Different GA model set-up

- Varying knots with single population GA;

```{r}
g1 <- gareg_knots(
  y=mcycle$accel, x=mcycle$times,
  minDist = 5,
  gaMethod = "cptga",
  cptgactrl = cptgaControl(popSize=200, pcrossover=0.9, pmutation=0.3),
  ic_method = "BIC"
)
summary(g1)

# knots location
g1@bestsol
```

- Varying knots with island model GA;

```{r}
g2 <- gareg_knots(
  y=mcycle$accel, x=mcycle$times,
  minDist = 5,
  gaMethod = "cptgaisl",
  cptgactrl = cptgaControl(numIslands=5, popSize=200, maxMig=250,
                           pcrossover=0.9, pmutation=0.3),
  ic_method = "BIC"
)
summary(g2)
```

- Fixed knots with single population GA;
```{r}
g3 <- gareg_knots(
  y=mcycle$accel, x=mcycle$times,
  fixedknots = 3,
  minDist = 5,
  gaMethod = "cptga",
  cptgactrl = cptgaControl(popSize=200, pcrossover=0.9, pmutation=0.3),
  ic_method = "BIC"
)
summary(g3)
```

- Fixed knots with island model GA;

```{r}
g4 <- gareg_knots(
  y=mcycle$accel, x=mcycle$times,
  fixedknots = 4,
  minDist = 5,
  gaMethod = "cptgaisl",
  cptgactrl = cptgaControl(numIslands=5, popSize=200, maxMig=250, 
                           pcrossover=0.9, pmutation=0.3),
  ic_method = "BIC"
)
summary(g4)
```

- Comparison;

```{r}
y = mcycle$accel
x = mcycle$times
x_unique = unique(x)

tBIC.vary.ga     = g1@bestsol
tBIC.vary.gaisl  = g2@bestsol
tBIC.fix.3.ga    = g3@bestsol
tBIC.fix.4.gaisl = g4@bestsol

bsfit.vary.ga     = lm(y ~ bs(x, knots=x_unique[g1@bestsol], Boundary.knots = range(x)))
bsfit.vary.gaisl  = lm(y ~ bs(x, knots=x_unique[g2@bestsol], Boundary.knots = range(x)))
bsfit.fix.3.ga    = lm(y ~ bs(x, knots=x_unique[g3@bestsol], Boundary.knots = range(x)))
bsfit.fix.4.gaisl = lm(y ~ bs(x, knots=x_unique[g4@bestsol], Boundary.knots = range(x)))

plot(x, y, xlab = "Time (ms)", ylab = "Acceleration (g)")
ht = seq(min(x), max(x), length.out = 200)
lines(ht, predict(bsfit.vary.ga, data.frame(x = ht)), col="blue", lty = 5, lwd = 2)
lines(ht, predict(bsfit.vary.gaisl, data.frame(x = ht)), col="orange", lty = 4, lwd = 2)
lines(ht, predict(bsfit.fix.3.ga, data.frame(x = ht)), col="purple", lty = 3, lwd = 2)
lines(ht, predict(bsfit.fix.4.gaisl, data.frame(x = ht)), col="#D55E00", lty = 2, lwd = 2)
legend("bottomright", 
       legend = c("Varying knots GA", 
                  "Varying knots island model GA", 
                  "Fixed 3 knots GA", 
                  "Fixed 4 knots island model GA"),
       lty = 5:2, lwd = 2,
       col = c("blue", "orange", "purple", "#D55E00"), bty = "n")
```

2. Spline options: piecewise polynomials, natural cubic, and B-splines

This section illustrates how to build spline design matrices via `splineX()` 
for three common options:

- `type="ppolys"`: degree-d truncated power piecewise polynomials;
- `type="ns"`: degree-3 natural cubic spline (degree will be ignored);
- `type="bs"`: degree-d B-spline basis;

We’ll use the motorcycle acceleration data `MASS::mcycle`, create interior 
knots at quantiles of `times`, and compare how different spline types/degrees 
behave. Here, we only illustrate through `Varying number and locations of knots` 
set-up (Let GA choose both **how many** knots and **where** they go). 

```{r}
g_pp3 <- gareg_knots(
  y = y, x = x,
  minDist   = 5,
  gaMethod  = "cptga",
  ObjFunc   = NULL,          # use default varyknotsIC
  type      = "ppolys",
  degree    = 3,             # degree-3 piecewise polynomial
  intercept = TRUE,
  ic_method = "BIC"
)
summary(g_pp3)

g_ns <- gareg_knots(
  y = y, x = x,
  minDist   = 5,
  gaMethod  = "cptga",
  type      = "ns",          # natural cubic (degree ignored)
  degree    = 3,             # ignored for "ns"
  intercept = TRUE,
  ic_method = "BIC"
)
summary(g_ns)

g_bs1 <- gareg_knots(
  y = y, x = x,
  minDist   = 5,
  gaMethod  = "cptga",
  type      = "bs",
  degree    = 1,             # linear B-splines
  intercept = TRUE,
  ic_method = "BIC"
)
summary(g_bs1)
```

# References

Li, M., & Lu, Q. (2024). changepointGA: An R package for Fast Changepoint Detection via Genetic Algorithm. arXiv preprint arXiv:2410.15571.

Mo Li, QiQi Lu, Robert Lund, & Xueheng Shi. (2026). Genetic Algorithms in Regression. arXiv preprint arXiv:.

Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0.


# Code style

Before pushing changes, please run 

```r
styler::style_pkg()
```

to ensure your code follows the tidyverse style guide.
