---
title: "Getting started with spicy"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with spicy}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(spicy)
```

spicy is an R package for descriptive statistics and data analysis, designed for data science and survey research workflows. It covers variable inspection, frequency tables, cross-tabulations with chi-squared tests and effect sizes, and publication-ready APA-style reporting. Its functionality is similar to that of Stata or SPSS, but within a tidyverse-friendly R environment.

This vignette walks through the core workflow using the bundled `sochealth` dataset, a simulated social-health survey with 1,200 respondents and 24 variables.

## Inspect your data

`varlist()` (or its shortcut `vl()`) gives a compact overview of every variable in a data frame: name, label, representative values, class, number of distinct values, valid observations, and missing values.

In RStudio or Positron, calling `varlist()` without arguments opens an interactive viewer; this is the most common usage in practice. Here we use `tbl = TRUE` to produce static output for the vignette:

```{r varlist}
varlist(sochealth, tbl = TRUE)
```

You can also select specific columns with tidyselect syntax:

```{r varlist-select}
varlist(sochealth, starts_with("bmi"), income, weight, tbl = TRUE)
```

## Frequency tables

`freq()` produces frequency tables with counts, percentages, and (optionally) valid and cumulative percentages.

```{r freq}
freq(sochealth, education)
```

Weighted frequencies use the `weights` argument. With `rescale = TRUE`, the total weighted N matches the unweighted N:

```{r freq-weighted}
freq(sochealth, education, weights = weight, rescale = TRUE)
```

## Cross-tabulations

`cross_tab()` crosses two categorical variables.
By default it shows counts, a chi-squared test, and Cramer's V:

```{r crosstab}
cross_tab(sochealth, smoking, education)
```

Add percentages with `percent`:

```{r crosstab-pct}
cross_tab(sochealth, smoking, education, percent = "col")
```

Group by a third variable with `by`:

```{r crosstab-by}
cross_tab(sochealth, smoking, education, by = sex)
```

When both variables are ordered factors, `cross_tab()` automatically selects an ordinal measure (Kendall's Tau-b) instead of Cramer's V:

```{r crosstab-ordinal}
cross_tab(sochealth, self_rated_health, education)
```

## Association measures

For a quick overview of all available association statistics, pass a contingency table to `assoc_measures()`:

```{r assoc-measures}
tbl <- xtabs(~ smoking + education, data = sochealth)
assoc_measures(tbl)
```

Individual functions such as `cramer_v()`, `gamma_gk()`, or `kendall_tau_b()` return a scalar by default. Pass `detail = TRUE` for the confidence interval and p-value:

```{r cramer-detail}
cramer_v(tbl, detail = TRUE)
```

## APA tables

`table_apa()` builds a publication-ready cross-tabulation report by crossing one grouping variable with one or many row variables. It supports multiple output formats.

With `output = "tinytable"` you get a formatted table suitable for R Markdown or Quarto documents:

```{r table-apa-tt}
table_apa(
  sochealth,
  row_vars = c("smoking", "physical_activity", "dentist_12m"),
  group_var = "education",
  output = "tinytable"
)
```

Use `assoc_ci = TRUE` to display the confidence interval inline:

```{r table-apa-ci}
table_apa(
  sochealth,
  row_vars = "smoking",
  group_var = "education",
  output = "tinytable",
  assoc_ci = TRUE
)
```

Other formats include `"gt"`, `"flextable"`, `"excel"`, `"clipboard"`, and `"word"`.
The `"wide"` and `"long"` formats return data frames for further processing:

```{r table-apa-wide}
table_apa(
  sochealth,
  row_vars = c("smoking", "physical_activity"),
  group_var = "education",
  output = "wide"
)
```

## Row-wise summaries

`mean_n()`, `sum_n()`, and `count_n()` compute row-wise statistics across selected columns, with automatic handling of missing values.

```{r mean-n}
sochealth |>
  dplyr::mutate(
    mean_sat = mean_n(select = starts_with("life_sat")),
    sum_sat = sum_n(select = starts_with("life_sat"), min_valid = 2),
    n_missing = count_n(select = starts_with("life_sat"), special = "NA")
  ) |>
  dplyr::select(starts_with("life_sat"), mean_sat, sum_sat, n_missing) |>
  head() |>
  as.data.frame()
```

## Learn more

- See `?cross_tab` for the full list of arguments (weights, simulation, association measures).
- See `?table_apa` for all output formats and customization options.
- See `?assoc_measures` for the complete list of association statistics.
- See `?code_book` to generate an interactive HTML codebook.
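
For intuition about the default association measure reported by `cross_tab()`, the sketch below computes the textbook, uncorrected Cramer's V from a chi-squared statistic using base R only. The helper `cramers_v_manual()` and the `toy` counts are hypothetical illustrations, not part of spicy or of `sochealth`, and spicy's own `cramer_v()` may apply corrections or options that this sketch omits.

```{r cramer-by-hand}
# Hypothetical helper (not part of spicy): uncorrected Cramer's V,
# defined as sqrt(chi2 / (n * (min(rows, cols) - 1))).
cramers_v_manual <- function(tab) {
  # Pearson chi-squared statistic without Yates' continuity correction
  chi2 <- unname(stats::chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)                      # total number of observations
  k <- min(nrow(tab), ncol(tab))     # smaller table dimension
  sqrt(chi2 / (n * (k - 1)))
}

# Made-up counts for illustration only (not taken from sochealth)
toy <- matrix(
  c(30, 20, 10,
    20, 30, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("a", "b"), level = c("low", "mid", "high"))
)

cramers_v_manual(toy)
```

The same helper accepts a table built with `xtabs()`, such as the `tbl` object from the association measures section, which gives a rough point of comparison for `cramer_v(tbl)`.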