---
title: "Bayesian MAIHDA for sparse intersections"
author: "Hamid Bulut"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Bayesian MAIHDA for sparse intersections}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4.2,
  warning = FALSE,
  message = FALSE
)
library(MAIHDA)
library(ggplot2)
# brms needs Stan and a compiler and takes minutes, so it cannot run on CRAN's or
# pkgdown's builders. Its results are precomputed in data-raw/precompute_sparse_vignette.R
# and read from a small cache here; the fast lme4 fits run live.
pc <- readRDS("sparse_precomputed.rds")
pct <- function(x) sprintf("%.0f%%", 100 * x)
```

## Sparse intersectional cells

Intersectional MAIHDA decomposes the between-stratum variation into an additive
part (the constituent dimensions' main effects) and an interaction part (the
model-implied departure from additivity). That conclusion is defensible only
when each intersectional stratum is populated well enough to estimate its
effect. In practice it often is not. Crossing several dimensions multiplies the strata while the
sample size stays fixed, so cell counts fall away quickly. Estimating an interaction variance from
cells like these is hard. 

## A dataset with a known interaction

`maihda_sparse_data` is simulated for exactly that: 4 dimensions form 36 intersectional strata,
and the interaction is constructed to account for **`r pct(pc$truth$gaussian$interaction_share)`
of the between-stratum variance** -- on a Gaussian outcome `y` and on the latent scale of
a binary outcome `event`. Individuals are allocated to strata unevenly, reproducing the
skewed cell sizes of real survey data.

```{r data}
data(maihda_sparse_data)
d <- maihda_sparse_data

cells <- table(interaction(d$gender, d$ethnicity, d$education, d$age_group, drop = TRUE))
summary(as.numeric(cells))                    # cell-size distribution
sum(cells < 5)                                # how many strata have < 5 people
attr(d, "truth")$gaussian$interaction_share   # the true interaction share: 0.40
```

A third of the strata hold fewer than five individuals. This is not an unusual case, as it is often the situation
once more than two or three dimensions are crossed, and it is
precisely where maximum likelihood becomes unreliable.

## What lme4 reports

We fit the crossed-dimensions decomposition with `engine = "lme4"` on both outcomes.

```{r lme4}
m_g <- maihda(y ~ 1 + (1 | gender:ethnicity:education:age_group),
              data = d, decomposition = "crossed-dimensions", engine = "lme4")
m_b <- maihda(event ~ 1 + (1 | gender:ethnicity:education:age_group),
              data = d, decomposition = "crossed-dimensions",
              engine = "lme4", family = "binomial")

c(gaussian = m_g$decomposition$interaction_share,
  binary   = m_b$decomposition$interaction_share)
c(gaussian_singular = isTRUE(m_g$model$diagnostics$singular),
  binary_singular   = isTRUE(m_b$model$diagnostics$singular))
```

Both fits are singular: maximum likelihood has placed the interaction variance at
the boundary. The Gaussian interaction share is pulled down to
`r pct(pc$lme4$gaussian$share)`, and the binary share to `r pct(pc$lme4$binary$share)` --
against a true 40%, the binary fit reports a substantial interaction as effectively
absent. Crucially, neither estimate carries an interval. The estimate is not only biased
but reported with unwarranted precision, because the model has no way to express how
little the sparse cells determine.

## What brms adds: posterior uncertainty

Refitting with `engine = "brms"` and a weakly-informative prior on the random-effect
standard deviations changes the output in one important respect. The prior regularises
the between-stratum standard deviations away from the boundary (so the fit is not
singular) and the posterior yields a credible interval in place of a single number.
The call is shown but not evaluated here (`brms` requires Stan and several minutes; see
the note on precomputation below).

```{r brms-call, eval = FALSE}
m_g_brms <- maihda(
  y ~ 1 + (1 | gender:ethnicity:education:age_group),
  data = d, decomposition = "crossed-dimensions", engine = "brms",
  prior = brms::set_prior("normal(0, 0.5)", class = "sd"),
  chains = 4, iter = 2000, warmup = 1000, cores = 4, seed = 1,
  control = list(adapt_delta = 0.97)
)
m_g_brms$decomposition$interaction_share
m_g_brms$decomposition$interaction_share_ci
```

The cached posterior summaries:

```{r brms-summary}
bg <- pc$brms$gaussian; bb <- pc$brms$binary
data.frame(
  outcome      = c("Gaussian", "Binary"),
  true_share   = c(pc$truth$gaussian$interaction_share, pc$truth$binary_latent$interaction_share),
  brms_share   = c(bg$share, bb$share),
  ci_low       = c(bg$ci[1], bb$ci[1]),
  ci_high      = c(bg$ci[2], bb$ci[2]),
  max_rhat     = c(bg$diag$max_rhat, bb$diag$max_rhat),
  divergences  = c(bg$diag$divergences, bb$diag$divergences)
)
```

The posterior point remains low: the interaction is weakly identified at this
density, so `brms` does not recover the 40% either, and we should not expect it to.
The credible interval, however, runs from near zero to past the true value, reflecting
posterior uncertainty about the interaction share. This is the substantive difference.
Maximum likelihood reported a singular point as if it were certain; `brms` reports an
interval that includes the true share in this simulation and makes the uncertainty
explicit. The diagnostics (`max_rhat` close to 1, no divergent transitions) suggest
the posterior is well-sampled.

## Comparison

```{r fig-share, fig.height = 3.8}
fig <- data.frame(
  outcome = factor(rep(c("Gaussian", "Binary"), each = 2), levels = c("Gaussian", "Binary")),
  method  = rep(c("lme4 (ML)", "brms (Bayesian)"), 2),
  share   = c(pc$lme4$gaussian$share, pc$brms$gaussian$share,
              pc$lme4$binary$share,   pc$brms$binary$share),
  lo      = c(NA, pc$brms$gaussian$ci[1], NA, pc$brms$binary$ci[1]),
  hi      = c(NA, pc$brms$gaussian$ci[2], NA, pc$brms$binary$ci[2])
)
truth <- pc$truth$gaussian$interaction_share

ggplot(fig, aes(method, share, colour = method)) +
  geom_hline(yintercept = truth, linetype = "dashed") +
  geom_pointrange(aes(ymin = lo, ymax = hi), na.rm = TRUE, linewidth = 0.8) +
  geom_point(size = 2.5) +
  facet_wrap(~ outcome) +
  scale_y_continuous("Interaction share of between-stratum variance",
                     labels = function(x) paste0(round(100 * x), "%")) +
  labs(x = NULL, colour = NULL,
       title = "A singular ML point estimate vs. a brms posterior interval",
       subtitle = "Dashed line: true interaction share (40%)") +
  theme_minimal() + theme(legend.position = "none")
```

Both point estimates fall below the true value; neither engine can locate an interaction
the sparse data do not pin down. They differ in what is reported alongside the point. The
`brms` credible interval spans the true 40%, whereas `lme4` provides a single value with
no uncertainty. Reading the `lme4` point at face value (an interaction of roughly
3--23%, or none at all) is the error the singular fit invites.

## Why the interval, not the point, is what changes

The per-stratum effects (the stratum random intercepts) are similar under the two
engines as both apply partial pooling, shrinking thinly-estimated strata toward the
additive prediction. What differs is the treatment of uncertainty. Maximum likelihood
commits to a single variance decomposition and, at the boundary, reports it without a
standard error. `brms` propagates the full posterior, so the interaction share is
accompanied by an interval whose width reflects how little the sparse cells constrain it. Summarising
the posterior by its point discards exactly the information that matters: the degree of
uncertainty. When intersectional cells are sparse,
a regularised Bayesian fit with an explicit interval is the more defensible
summary.