---
title: "Data Formatting and Encoding"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Formatting and Encoding}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.retina = 3,
  comment = "#>"
)
```

```{r, child='../man/rmdchunks/header.Rmd'}
```

# Basic required format

```{r, child='../man/rmdchunks/dataFormat.Rmd'}
```

The {logitr} package contains several example data sets that illustrate this data structure. For example, the `yogurt` contains observations of yogurt purchases by a panel of 100 households [@Jain1994]. Choice is identified by the `choice` column, the observation ID is identified by the `obsID` column, and the columns `price`, `feat`, and `brand` can be used as model covariates:

```{r}
library("logitr")

head(yogurt)
```

This data set also includes an `alt` variable that determines the alternatives included in the choice set of each observation and an `id` variable that determines the individual as the data have a panel structure containing multiple choice observations from each individual.

# Continuous versus discrete variables

Variables are modeled as either continuous or discrete based on their data _type_. Numeric variables are by default estimated with a single "slope" coefficient. For example, consider a data frame that contains a `price` variable with the levels $10, $15, and $20. Adding `price` to the `pars` argument in the main `logitr()` function would result in a single `price` coefficient for the "slope" of the change in price.

In contrast, categorical variables (i.e. `character` or `factor` type variables) are by default estimated with a coefficient for all but the first level, which serves as the reference level. The default reference level is determined alphabetically, but it can also be set by modifying the factor levels for that variable. For example, the default reference level for the `brand` variable is `"dannon"` as it is alphabetically first. To set `"weight"` as the reference level, the factor levels can be modified using the `factor()` function:

```{r}
yogurt2 <- yogurt

brands <- c("weight", "hiland", "yoplait", "dannon")
yogurt2$brand <- factor(yogurt2$brand, levels = brands)
```

# Creating dummy coded variables

If you wish to make dummy-coded variables yourself to use them in a model, I recommend using the `dummy_cols()` function from the [{fastDummies}](https://github.com/jacobkap/fastDummies) package. For example, in the code below, I create dummy-coded columns for the `brand` variable and then use those variables as covariates in a model:

```{r}
yogurt2 <- fastDummies::dummy_cols(yogurt2, "brand")
```

The `yogurt2` data frame now has new dummy-coded columns for brand:

```{r}
head(yogurt2)
```

Now I can use those columns as covariates:

```{r}
mnl_pref_dummies <- logitr(
  data    = yogurt2,
  outcome = 'choice',
  obsID   = 'obsID',
  pars    = c(
    'price', 'feat', 'brand_yoplait', 'brand_dannon', 'brand_weight'
  )
)

summary(mnl_pref_dummies)
```

# Validating data before estimation

Before estimating a model, it is often helpful to validate that your data is properly formatted. The `validate_data()` function checks for common formatting errors and provides detailed diagnostic information. This can save time by catching errors before you attempt to estimate a model.

## Basic validation

At a minimum, you should validate the `outcome` and `obsID` columns:

```{r}
validation <- validate_data(
  data = yogurt,
  outcome = "choice",
  obsID = "obsID"
)

validation
```

The function returns a validation object that indicates whether the data is valid and provides summary information about the data structure.

## Validation with parameters

You can also validate specific parameters to check for missing values or other issues:

```{r}
validation <- validate_data(
  data = yogurt,
  outcome = "choice",
  obsID = "obsID",
  pars = c("price", "feat", "brand")
)

validation
```

## Panel data validation

For panel data, you can validate the panel structure:

```{r}
validation <- validate_data(
  data = yogurt,
  outcome = "choice",
  obsID = "obsID",
  pars = c("price", "feat", "brand"),
  panelID = "id"
)

validation
```

## Common errors detected

The `validate_data()` function checks for several common formatting errors:

1. **Multiple choices per observation**: Each `obsID` should have exactly one choice (outcome = 1)
2. **No choice in observation**: Each `obsID` must have at least one choice
3. **Non-contiguous observation blocks**: All rows with the same `obsID` must be grouped together
4. **Invalid outcome values**: The outcome variable must only contain 0 and 1 (or TRUE and FALSE)
5. **Missing values**: Checks for missing values in required columns

Here's an example of detecting an error:

```{r}
# Create problematic data with multiple choices in one observation
bad_data <- yogurt
bad_data$choice[1:2] <- 1

validation <- validate_data(
  data = bad_data,
  outcome = "choice",
  obsID = "obsID"
)

validation
```