--- title: "Bayesian MAIHDA for sparse intersections" author: "Hamid Bulut" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Bayesian MAIHDA for sparse intersections} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.2, warning = FALSE, message = FALSE ) library(MAIHDA) library(ggplot2) # brms needs Stan and a compiler and takes minutes, so it cannot run on CRAN's or # pkgdown's builders. Its results are precomputed in data-raw/precompute_sparse_vignette.R # and read from a small cache here; the fast lme4 fits run live. pc <- readRDS("sparse_precomputed.rds") pct <- function(x) sprintf("%.0f%%", 100 * x) ``` ## Sparse intersectional cells Intersectional MAIHDA decomposes the between-stratum variation into an additive part (the constituent dimensions' main effects) and an interaction part (the model-implied departure from additivity). That conclusion is defensible only when each intersectional stratum is populated well enough to estimate its effect. In practice it often is not. Crossing several dimensions multiplies the strata while the sample size stays fixed, so cell counts fall away quickly. Estimating an interaction variance from cells like these is hard. ## A dataset with a known interaction `maihda_sparse_data` is simulated for exactly that: 4 dimensions form 36 intersectional strata, and the interaction is constructed to account for **`r pct(pc$truth$gaussian$interaction_share)` of the between-stratum variance** -- on a Gaussian outcome `y` and on the latent scale of a binary outcome `event`. Individuals are allocated to strata unevenly, reproducing the skewed cell sizes of real survey data. ```{r data} data(maihda_sparse_data) d <- maihda_sparse_data cells <- table(interaction(d$gender, d$ethnicity, d$education, d$age_group, drop = TRUE)) summary(as.numeric(cells)) # cell-size distribution sum(cells < 5) # how many strata have < 5 people attr(d, "truth")$gaussian$interaction_share # the true interaction share: 0.40 ``` A third of the strata hold fewer than five individuals. This is not an unusual case, as it is often the situation once more than two or three dimensions are crossed, and it is precisely where maximum likelihood becomes unreliable. ## What lme4 reports We fit the crossed-dimensions decomposition with `engine = "lme4"` on both outcomes. ```{r lme4} m_g <- maihda(y ~ 1 + (1 | gender:ethnicity:education:age_group), data = d, decomposition = "crossed-dimensions", engine = "lme4") m_b <- maihda(event ~ 1 + (1 | gender:ethnicity:education:age_group), data = d, decomposition = "crossed-dimensions", engine = "lme4", family = "binomial") c(gaussian = m_g$decomposition$interaction_share, binary = m_b$decomposition$interaction_share) c(gaussian_singular = isTRUE(m_g$model$diagnostics$singular), binary_singular = isTRUE(m_b$model$diagnostics$singular)) ``` Both fits are singular: maximum likelihood has placed the interaction variance at the boundary. The Gaussian interaction share is pulled down to `r pct(pc$lme4$gaussian$share)`, and the binary share to `r pct(pc$lme4$binary$share)` -- against a true 40%, the binary fit reports a substantial interaction as effectively absent. Crucially, neither estimate carries an interval. The estimate is not only biased but reported with unwarranted precision, because the model has no way to express how little the sparse cells determine. ## What brms adds: posterior uncertainty Refitting with `engine = "brms"` and a weakly-informative prior on the random-effect standard deviations changes the output in one important respect. The prior regularises the between-stratum standard deviations away from the boundary (so the fit is not singular) and the posterior yields a credible interval in place of a single number. The call is shown but not evaluated here (`brms` requires Stan and several minutes; see the note on precomputation below). ```{r brms-call, eval = FALSE} m_g_brms <- maihda( y ~ 1 + (1 | gender:ethnicity:education:age_group), data = d, decomposition = "crossed-dimensions", engine = "brms", prior = brms::set_prior("normal(0, 0.5)", class = "sd"), chains = 4, iter = 2000, warmup = 1000, cores = 4, seed = 1, control = list(adapt_delta = 0.97) ) m_g_brms$decomposition$interaction_share m_g_brms$decomposition$interaction_share_ci ``` The cached posterior summaries: ```{r brms-summary} bg <- pc$brms$gaussian; bb <- pc$brms$binary data.frame( outcome = c("Gaussian", "Binary"), true_share = c(pc$truth$gaussian$interaction_share, pc$truth$binary_latent$interaction_share), brms_share = c(bg$share, bb$share), ci_low = c(bg$ci[1], bb$ci[1]), ci_high = c(bg$ci[2], bb$ci[2]), max_rhat = c(bg$diag$max_rhat, bb$diag$max_rhat), divergences = c(bg$diag$divergences, bb$diag$divergences) ) ``` The posterior point remains low: the interaction is weakly identified at this density, so `brms` does not recover the 40% either, and we should not expect it to. The credible interval, however, runs from near zero to past the true value, reflecting posterior uncertainty about the interaction share. This is the substantive difference. Maximum likelihood reported a singular point as if it were certain; `brms` reports an interval that includes the true share in this simulation and makes the uncertainty explicit. The diagnostics (`max_rhat` close to 1, no divergent transitions) suggest the posterior is well-sampled. ## Comparison ```{r fig-share, fig.height = 3.8} fig <- data.frame( outcome = factor(rep(c("Gaussian", "Binary"), each = 2), levels = c("Gaussian", "Binary")), method = rep(c("lme4 (ML)", "brms (Bayesian)"), 2), share = c(pc$lme4$gaussian$share, pc$brms$gaussian$share, pc$lme4$binary$share, pc$brms$binary$share), lo = c(NA, pc$brms$gaussian$ci[1], NA, pc$brms$binary$ci[1]), hi = c(NA, pc$brms$gaussian$ci[2], NA, pc$brms$binary$ci[2]) ) truth <- pc$truth$gaussian$interaction_share ggplot(fig, aes(method, share, colour = method)) + geom_hline(yintercept = truth, linetype = "dashed") + geom_pointrange(aes(ymin = lo, ymax = hi), na.rm = TRUE, linewidth = 0.8) + geom_point(size = 2.5) + facet_wrap(~ outcome) + scale_y_continuous("Interaction share of between-stratum variance", labels = function(x) paste0(round(100 * x), "%")) + labs(x = NULL, colour = NULL, title = "A singular ML point estimate vs. a brms posterior interval", subtitle = "Dashed line: true interaction share (40%)") + theme_minimal() + theme(legend.position = "none") ``` Both point estimates fall below the true value; neither engine can locate an interaction the sparse data do not pin down. They differ in what is reported alongside the point. The `brms` credible interval spans the true 40%, whereas `lme4` provides a single value with no uncertainty. Reading the `lme4` point at face value (an interaction of roughly 3--23%, or none at all) is the error the singular fit invites. ## Why the interval, not the point, is what changes The per-stratum effects (the stratum random intercepts) are similar under the two engines as both apply partial pooling, shrinking thinly-estimated strata toward the additive prediction. What differs is the treatment of uncertainty. Maximum likelihood commits to a single variance decomposition and, at the boundary, reports it without a standard error. `brms` propagates the full posterior, so the interaction share is accompanied by an interval whose width reflects how little the sparse cells constrain it. Summarising the posterior by its point discards exactly the information that matters: the degree of uncertainty. When intersectional cells are sparse, a regularised Bayesian fit with an explicit interval is the more defensible summary.