---
title: "MAIHDA for Binary Outcomes (Discriminatory Accuracy)"
author: "Hamid Bulut"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{MAIHDA for Binary Outcomes (Discriminatory Accuracy)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4.5
)
```

## Why binary outcomes?

Many of the original MAIHDA applications target a **binary** health outcome --
obese vs. not, hypertensive vs. not, screened vs. not. Merlo's framing of
*discriminatory accuracy* is, at heart, a question about a binary classifier: how
well do the intersectional strata alone separate people who do and do not have the
outcome?

For a binary outcome the model is a multilevel **logistic** regression:

\[
\text{logit}\,\Pr(y_{ij} = 1) = \beta_0 + u_j, \qquad u_j \sim N(0, \sigma_u^2),
\]

where \(j\) indexes the intersectional stratum. MAIHDA reads off two
complementary quantities from this model:

- the **VPC/ICC**, the share of the (latent) variation that lies *between* strata;
- the **discriminatory accuracy** (e.g. the AUC / C-statistic), how well stratum
  membership predicts the individual outcome.

A high between-stratum VPC can still go with only *moderate* discriminatory
accuracy at the individual level -- that contrast is the whole point of the "DA"
in MAIHDA, and this vignette shows how to get both numbers.

```{r load}
library(MAIHDA)
data("maihda_health_data")

# A two-level outcome: obese (Yes) vs. not (No)
table(maihda_health_data$Obese)
```

## Fitting a logistic MAIHDA model

`fit_maihda()` detects a two-level outcome automatically. If you leave `family` at
its default, the function switches to `family = "binomial"`, recodes the response
to 0/1 for `glmer()`, and warns you so the change is never silent:

```{r autodetect}
model_null <- fit_maihda(
  Obese ~ 1 + (1 | Gender:Race:Education),
  data = maihda_health_data
)
```

The warning is a feature, not a problem. To be explicit (and to silence the
warning), pass `family = "binomial"` yourself -- this is the recommended form for
scripts:

```{r explicit, eval=FALSE}
model_null <- fit_maihda(
  Obese ~ 1 + (1 | Gender:Race:Education),
  data   = maihda_health_data,
  family = "binomial"
)
```

> If you actually want a linear probability model on a 0/1 outcome, ask for it
> explicitly with `family = "gaussian"`; the auto-switch only fires for the
> default family.

## The VPC is on the latent scale

`summary()` reports the VPC/ICC for the logistic model. There is no observed-scale
residual variance for a Bernoulli outcome, so MAIHDA uses the standard
**latent-scale** approximation: the level-1 (within-stratum) variance is fixed at
\(\pi^2/3 \approx 3.29\) for the logit link (and the corresponding constant for a
probit link). The VPC is therefore the between-stratum share of variation on that
underlying latent scale, not on the probability scale.

```{r summary-null}
summary_null <- summary(model_null)
print(summary_null)
```

Read from this null model, the VPC is the **total** between-stratum share of
latent variation in the odds of obesity. As in the Gaussian case, the stratum
random effects capture the *combined* additive + interaction differences across
strata; they isolate the *pure* intersectional (interaction) component only once
the additive main effects of the strata variables are in the model.

## Adjusted model and PCV

A "Model 2" adds individual-level covariates to ask how much of the
between-stratum variation they account for. Here we adjust for age, fit on the
**same analytic sample and strata** as the null model, and compare with
`calculate_pvc()`. PCV compares variances across models, so both must use the same
complete-case sample:

```{r adjusted}
health_complete <- maihda_health_data[complete.cases(
  maihda_health_data[, c("Obese", "Age", "Gender", "Race", "Education")]
), ]

model_null2 <- fit_maihda(
  Obese ~ 1 + (1 | Gender:Race:Education),
  data = health_complete, family = "binomial"
)

# Model 2: adjust for an individual-level covariate (age)
model_adj <- fit_maihda(
  Obese ~ Age + (1 | Gender:Race:Education),
  data = health_complete, family = "binomial"
)

pcv <- calculate_pvc(model_null2, model_adj)
print(pcv)
```

The same caveats as in the continuous case apply, with one extra wrinkle: for
non-Gaussian models the latent residual variance is fixed by the link, so a change
in the between-stratum variance is partly a **rescaling** of the latent scale, not
only "variance explained". Interpret the PCV as a model-dependent, descriptive
change, not a causal decomposition.

> **A note on adjusting for the strata's own categories.** If instead you add the
> *categorical* main effects that define the strata (e.g.
> `Obese ~ Gender + Race + Education + ...`) to recover the additive-vs-interaction
> split shown in the [introduction](introduction.html), be aware that in a logistic
> model those fixed effects are nearly collinear with the stratum random intercept,
> so `glmer()` often reports a convergence or "nearly unidentifiable" note. Scaling
> covariates and using `control = lme4::glmerControl(optimizer = "bobyqa")` usually
> helps.

## Discriminatory accuracy (AUC / C-statistic)

The VPC summarises *variation*; discriminatory accuracy summarises *prediction*.
The package does **not** ship an AUC or median-odds-ratio helper, but the AUC is a
one-liner from the model's predicted probabilities. The AUC equals the
Mann-Whitney U statistic, so it needs no extra package:

```{r auc-helper}
# Rank-based AUC (C-statistic): P(prob for a case > prob for a non-case)
maihda_auc <- function(prob, y) {
  y <- as.numeric(y)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  if (n1 == 0 || n0 == 0) return(NA_real_)
  r <- rank(prob)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
```

Compute it from the predicted probabilities (`scale = "response"`) and the 0/1
outcome stored in the fitted model frame. The strata-only model's AUC is the
discriminatory accuracy of the intersectional strata *themselves* -- Merlo's
central quantity. Comparing it with the adjusted model shows whether individual
information beyond stratum membership sharpens classification:

```{r auc}
y_obs <- as.numeric(lme4::getME(model_null2$model, "y"))

prob_null <- predict_maihda(model_null2, type = "individual", scale = "response")
prob_adj  <- predict_maihda(model_adj,  type = "individual", scale = "response")

c(
  AUC_strata_only = maihda_auc(prob_null, y_obs),
  AUC_adjusted    = maihda_auc(prob_adj,  y_obs)
)
```

An AUC of 0.5 is chance. Here both models sit around 0.6: even a non-trivial
between-stratum VPC translates into only **modest** accuracy at the *individual*
level -- exactly the cautionary message that motivates discriminatory-accuracy
reporting in MAIHDA. Note the adjusted model barely moves the AUC: the
categorical covariates that *define* the strata are already captured by stratum
membership, so only the continuous covariate (`Age`) adds genuinely new
individual-level information.

## Plots adapt to the binomial family

The standard plots recognise the binomial family and switch to the probability
scale and to deviance-based diagnostics automatically:

```{r plot-predicted}
# Predicted probabilities per stratum with intervals
plot(model_adj, type = "predicted")
```

```{r plot-vpc}
# Latent-scale variance partition
plot(model_adj, type = "vpc")
```

```{r plot-prediction-deviation, warning = FALSE}
# For binomial fits the dashboard highlights the largest absolute
# deviance residuals rather than raw deviations from the mean.
plot(model_adj, type = "prediction_deviation")
```

See the [plot interpretation vignette](interpreting_plots.html) for how to read
each of these.

## Count outcomes work the same way

A Poisson (count) outcome follows the identical pattern -- pass
`family = "poisson"` to `fit_maihda()`. The VPC then uses the log-link
latent-scale residual variance, and the summary, PCV, and plotting helpers all
behave as above.

## References

- Merlo, J. (2018). Multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA) within an intersectional framework. *Social Science & Medicine*, 203, 74-80.

- Evans, C. R., Williams, D. R., Onnela, J. P., & Subramanian, S. V. (2018). A multilevel approach to modeling health inequalities at the intersection of multiple social identities. *Social Science & Medicine*, 203, 64-73.

- Merlo, J., Wagner, P., Ghith, N., & Leckie, G. (2016). An original stepwise multilevel logistic regression analysis of discriminatory accuracy: the case of neighbourhoods and health. *PLOS ONE*, 11(4), e0153778.