--- title: "Beta Regression in R" author: - name: Francisco Cribari-Neto affiliation: Universidade Federal de Pernambuco address: Brazil orcid: 0000-0002-5909-6698 email: cribari@ufpe.br url: https://www.de.ufpe.br/~cribari/ - name: Achim Zeileis affiliation: Universität Innsbruck address: Austria orcid: 0000-0003-0918-3766 email: Achim.Zeileis@R-project.org url: https://www.zeileis.org/ format: html: html-math-method: mathjax toc: true number-sections: true abstract: > This introduction to the R package **betareg** is a (slightly) modified version of @betareg:Cribari-Neto+Zeileis:2010, published in the _Journal of Statistical Software_. A follow-up paper with various extensions is @betareg:Gruen+Kosmidis+Zeileis:2012 -- a slightly modified version of which is also provided within the package as `vignette("betareg-ext", package = "betareg")`\ The class of beta regression models is commonly used by practitioners to model variables that assume values in the standard unit interval $(0, 1)$. It is based on the assumption that the dependent variable is beta-distributed and that its mean is related to a set of regressors through a linear predictor with unknown coefficients and a link function. The model also includes a precision parameter which may be constant or depend on a (potentially different) set of regressors through a link function as well. This approach naturally incorporates features such as heteroskedasticity or skewness which are commonly observed in data taking values in the standard unit interval, such as rates or proportions. This paper describes the **betareg** package which provides the class of beta regressions in the R system for statistical computing. The underlying theory is briefly outlined, the implementation discussed and illustrated in various replication exercises. bibliography: betareg.bib vignette: > %\VignetteIndexEntry{Beta Regression in R} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} %\VignetteDepends{stats,betareg,car,lmtest,sandwich,strucchange} %\VignetteKeywords{beta regression, rates, proportions, R} %\VignettePackage{betareg} --- ```{r preliminaries, include = FALSE} library("betareg") data("GasolineYield", package = "betareg") data("FoodExpenditure", package = "betareg") gy_loglog <- betareg(yield ~ batch + temp, data = GasolineYield, link = "loglog") fe_beta2 <- betareg(I(food/income) ~ income + persons | persons, data = FoodExpenditure) knitr::opts_chunk$set( engine = "R", collapse = TRUE, comment = "##", message = FALSE, warning = FALSE, echo = TRUE ) options(width = 70, prompt = "R> ", continue = "+ ", useFancyQuotes = FALSE, digits = 5) ``` ## Introduction {#sec-intro} How should one perform a regression analysis in which the dependent variable (or response variable), $y$, assumes values in the standard unit interval $(0,1)$? The usual practice used to be to transform the data so that the transformed response, say $\tilde y$, assumes values in the real line and then apply a standard linear regression analysis. A commonly used transformation is the logit, $\tilde y = \log(y/(1-y))$. This approach, nonetheless, has shortcomings. First, the regression parameters are interpretable in terms of the mean of $\tilde y$, and not in terms of the mean of $y$ (given Jensen's inequality). Second, regressions involving data from the unit interval such as rates and proportions are typically heteroskedastic: they display more variation around the mean and less variation as we approach the lower and upper limits of the standard unit interval. Finally, the distributions of rates and proportions are typically asymmetric, and thus Gaussian-based approximations for interval estimation and hypothesis testing can be quite inaccurate in small samples. @betareg:Ferrari+Cribari-Neto:2004 proposed a regression model for continuous variates that assume values in the standard unit interval, e.g., rates, proportions, or concentration indices. Since the model is based on the assumption that the response is beta-distributed, they called their model _the beta regression model_. In their model, the regression parameters are interpretable in terms of the mean of $y$ (the variable of interest) and the model is naturally heteroskedastic and easily accomodates asymmetries. A variant of the beta regression model that allows for nonlinearities and variable dispersion was proposed by @betareg:Simas+Barreto-Souza+Rocha:2010. In particular, in this more general model, the parameter accounting for the precision of the data is not assumed to be constant across observations but it is allowed to vary, leading to the _variable dispersion beta regression model_. The chief motivation for the beta regression model lies in the flexibility delivered by the assumed beta law. The beta density can assume a number of different shapes depending on the combination of parameter values, including left- and right-skewed or the flat shape of the uniform density (which is a special case of the more general beta density). This is illustrated in @fig-beta-distributions which depicts several different beta densities. Following @betareg:Ferrari+Cribari-Neto:2004, the densities are parameterized in terms of the mean $\mu$ and the precision parameter $\phi$; all details are explained in the next section. The evident flexiblity makes the beta distribution an attractive candidate for data-driven statistical modeling. ```{r beta-distributions} #| echo: false #| fig-width: 8.5 #| fig-height: 4.5 #| out-width: 100% #| label: fig-beta-distributions #| fig-cap: "Probability density functions for beta distributions with varying parameters $\\mu = 0.10, 0.25, 0.50, 0.75, 0.90$ and $\\phi = 5$ (left) and $\\phi = 100$ (right)." par(mfrow = c(1, 2), mar = c(4.1, 4.1, 4.1, 0.1)) dbeta2 <- function(x, mu, phi = 1) dbeta(x, mu * phi, (1 - mu) * phi) x <- seq(from = 0.01, to = 0.99, length = 200) xx <- cbind(x, x, x, x, x) yy <- cbind( dbeta2(x, 0.10, 5), dbeta2(x, 0.25, 5), dbeta2(x, 0.50, 5), dbeta2(x, 0.75, 5), dbeta2(x, 0.90, 5) ) matplot(xx, yy, type = "l", xlab = "y", ylab = "Density", main = expression(phi == 5), lty = 1, col = "black", ylim = c(0, 15)) text(0.05, 12 , "0.10") text(0.95, 12 , "0.90") text(0.22, 2.8, "0.25") text(0.78, 2.8, "0.75") text(0.50, 2.3, "0.50") yy <- cbind( dbeta2(x, 0.10, 100), dbeta2(x, 0.25, 100), dbeta2(x, 0.50, 100), dbeta2(x, 0.75, 100), dbeta2(x, 0.90, 100) ) matplot(xx, yy, type = "l", xlab = "y", ylab = "", main = expression(phi == 100), lty = 1, col = "black", ylim = c(0, 15)) text(0.10, 14.5, "0.10") text(0.90, 14.5, "0.90") text(0.25, 9.8, "0.25") text(0.75, 9.8, "0.75") text(0.50, 8.6, "0.50") ``` The idea underlying beta regression models dates back to earlier approaches such as @betareg:Williams:1982 or @betareg:Prentice:1986. The initial motivation was to model binomial random variables with extra variation. The model postulated for the (discrete) variate of interest included a more flexible variation structure determined by independent beta-distributed variables which are related to a set of independent variables through a regression structure. However, unlike the more recent literature, the main focus was to model binomial random variables. Our interest in what follows will be more closely related to the recent literature, i.e., modeling continous random variables that assume values in $(0,1)$, such as rates, proportions, and concentration or inequality indices (e.g., Gini). In this paper, we describe the **betareg** package which can be used to perform inference in both fixed and variable dispersion beta regressions. The package is implemented in the R system for statistical computing [@betareg:R:2009] and available from the Comprehensive R Archive Network (CRAN) at . The initial version of the package was written by @betareg:Simas+Rocha:2006 up to version 1.2 which was orphaned and archived on CRAN in mid-2009. Starting from version 2.0-0, Achim Zeileis took over maintenance after rewriting/extending the package's functionality. The paper unfolds as follows: @sec-model outlines the theory underlying the beta regression model before @sec-implementation describes its implementation in R. @sec-illustrations and @sec-replications provide various empirical applications: The former focuses on illustrating various aspects of beta regressions in practice while the latter provides further replications of previously published empirical research. Finally, @sec-conclusion contains concluding remarks and directions for future research and implementation. ## Beta regression {#sec-model} The class of beta regression models, as introduced by @betareg:Ferrari+Cribari-Neto:2004, is useful for modeling continuous variables $y$ that assume values in the open standard unit interval $(0,1)$. Note that if the variable takes on values in $(a, b)$ (with $a < b$ known) one can model $(y - a)/(b - a)$. Furthermore, if $y$ also assumes the extremes $0$ and $1$, a useful transformation in practice is $(y \cdot (n - 1) + 0.5) / n$ where $n$ is the sample size [@betareg:Smithson+Verkuilen:2006]. The beta regression model is based on an alternative parameterization of the beta density in terms of the variate mean and a precision parameter. The beta density is usually expressed as $$ f(y;p,q) = \frac{\Gamma(p+q)}{\Gamma(p)\Gamma(q)}y^{p-1}(1-y)^{q-1}, \quad 00$ and $\Gamma(\cdot)$ is the gamma function.^[A beta regression model based on this parameterization was proposed by @betareg:Vasconcellos+Cribari-Neto:2005. We shall, however, focus on the parameterization indexed by the mean and a precision parameter.] @betareg:Ferrari+Cribari-Neto:2004 proposed a different parameterization by setting $\mu = p/(p+q)$ and $\phi = p+q$: $$ f(y;\mu,\phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\Gamma((1-\mu)\phi)}y^{\mu\phi-1}(1-y)^{(1-\mu)\phi-1}, \quad 00$. We write $y \,\sim\, \mathcal{B}(\mu, \phi)$. Here, $\text{E}(y) = \mu$ and $\text{Var}(y) = \mu(1-\mu)/(1+\phi)$. The parameter $\phi$ is known as the precision parameter since, for fixed $\mu$, the larger $\phi$ the smaller the variance of $y$; $\phi^{-1}$ is a dispersion parameter. Let $y_1,\ldots,y_n$ be a random sample such that $y_i \sim {\mathcal{B}}(\mu_i,\phi)$, $i=1,\ldots,n$. The beta regression model is defined as $$ g(\mu_i) = x_{i}^\top \beta = \eta_i, $$ where $\beta=(\beta_1,\ldots,\beta_k)^\top$ is a $k \times 1$ vector of unknown regression parameters ($k