Many of the original MAIHDA applications target a binary health outcome – obese vs. not, hypertensive vs. not, screened vs. not. Merlo’s framing of discriminatory accuracy is, at heart, a question about a binary classifier: how well do the intersectional strata alone separate people who do and do not have the outcome?
For a binary outcome the model is a multilevel logistic regression:
\[ \text{logit}\,\Pr(y_{ij} = 1) = \beta_0 + u_j, \qquad u_j \sim N(0, \sigma_u^2), \]
where \(j\) indexes the intersectional stratum. MAIHDA reads off two complementary quantities from this model:
A high between-stratum VPC can still go with only moderate discriminatory accuracy at the individual level – that contrast is the whole point of the “DA” in MAIHDA, and this vignette shows how to get both numbers.
fit_maihda() detects a two-level outcome automatically.
If you leave family at its default, the function switches
to family = "binomial", recodes the response to 0/1 for
glmer(), and warns you so the change is never silent:
model_null <- fit_maihda(
Obese ~ 1 + (1 | Gender:Race:Education),
data = maihda_health_data
)
#> Warning: The outcome variable appears to be binary. Automatically switching to
#> family = 'binomial'. To fit a Linear Probability Model, explicitly specify
#> family = 'gaussian'.
#> Binary outcome 'Obese' recoded to 0/1: 'No' = 0 (reference), 'Yes' = 1 (modeled event). Set the factor levels (or supply a 0/1 outcome) to control which level is the event.The warning is a feature, not a problem. To be explicit (and to
silence the warning), pass family = "binomial" yourself –
this is the recommended form for scripts:
model_null <- fit_maihda(
Obese ~ 1 + (1 | Gender:Race:Education),
data = maihda_health_data,
family = "binomial"
)If you actually want a linear probability model on a 0/1 outcome, ask for it explicitly with
family = "gaussian"; the auto-switch only fires for the default family.
summary() reports the VPC/ICC for the logistic model.
There is no observed-scale residual variance for a Bernoulli outcome, so
MAIHDA uses the standard latent-scale approximation:
the level-1 (within-stratum) variance is fixed at \(\pi^2/3 \approx 3.29\) for the logit link
(and the corresponding constant for a probit link). The VPC is therefore
the between-stratum share of variation on that underlying latent scale,
not on the probability scale.
summary_null <- summary(model_null)
print(summary_null)
#> MAIHDA Model Summary
#> ====================
#>
#> Variance Partition Coefficient (VPC/ICC):
#> Estimate: 0.0634
#>
#> Variance Components:
#> component variance sd proportion
#> Between-stratum (random) 0.2227 0.4719 0.06339
#> Within-stratum (residual) 3.2899 1.8138 0.93661
#> Total 3.5125 1.8742 1.00000
#>
#> Fixed Effects:
#> term estimate
#> (Intercept) -0.616
#>
#> Stratum Estimates (first 10):
#> stratum stratum_id label random_effect se
#> 1 1 male × Hispanic × Some College -0.075834 0.3155
#> 2 2 male × Black × College Grad -0.001524 0.3325
#> 3 3 female × White × College Grad -0.360949 0.1186
#> 4 4 male × Hispanic × 8th Grade 0.218443 0.3808
#> 5 5 female × Mexican × 8th Grade 0.437816 0.2809
#> 6 6 male × White × College Grad -0.293351 0.1187
#> 7 7 female × White × High School -0.006285 0.1322
#> 8 8 male × White × Some College 0.502977 0.1077
#> 9 9 female × White × 9 - 11th Grade 0.259733 0.1835
#> 10 10 female × Hispanic × High School -0.006773 0.3208
#> lower_95 upper_95
#> -0.6941 0.54246
#> -0.6533 0.65025
#> -0.5935 -0.12843
#> -0.5280 0.96489
#> -0.1127 0.98833
#> -0.5259 -0.06076
#> -0.2654 0.25283
#> 0.2919 0.71405
#> -0.1000 0.61948
#> -0.6355 0.62197
#> ... and 40 more strataRead from this null model, the VPC is the total between-stratum share of latent variation in the odds of obesity. As in the Gaussian case, the stratum random effects capture the combined additive + interaction differences across strata; they isolate the pure intersectional (interaction) component only once the additive main effects of the strata variables are in the model.
A “Model 2” adds individual-level covariates to ask how much of the
between-stratum variation they account for. Here we adjust for age, fit
on the same analytic sample and strata as the null
model, and compare with calculate_pvc(). PCV compares
variances across models, so both must use the same complete-case
sample:
health_complete <- maihda_health_data[complete.cases(
maihda_health_data[, c("Obese", "Age", "Gender", "Race", "Education")]
), ]
model_null2 <- fit_maihda(
Obese ~ 1 + (1 | Gender:Race:Education),
data = health_complete, family = "binomial"
)
#> Binary outcome 'Obese' recoded to 0/1: 'No' = 0 (reference), 'Yes' = 1 (modeled event). Set the factor levels (or supply a 0/1 outcome) to control which level is the event.
# Model 2: adjust for an individual-level covariate (age)
model_adj <- fit_maihda(
Obese ~ Age + (1 | Gender:Race:Education),
data = health_complete, family = "binomial"
)
#> Binary outcome 'Obese' recoded to 0/1: 'No' = 0 (reference), 'Yes' = 1 (modeled event). Set the factor levels (or supply a 0/1 outcome) to control which level is the event.
pcv <- calculate_pvc(model_null2, model_adj)
print(pcv)
#> Proportional Change in Variance (PCV)
#> =====================================
#>
#> PCV: 0.0167
#>
#> Between-stratum variance:
#> Model 1: 0.222667
#> Model 2: 0.218941
#> Change: 0.003726 (1.67%)
#>
#> Interpretation (PCV is the proportional change in between-stratum
#> variance between the models; it is variance 'explained' only when Model 2
#> nests Model 1 by adding predictors on the same outcome, sample and strata):
#> Between-stratum variance is 1.7% lower in Model 2 than in Model 1.The same caveats as in the continuous case apply, with one extra wrinkle: for non-Gaussian models the latent residual variance is fixed by the link, so a change in the between-stratum variance is partly a rescaling of the latent scale, not only “variance explained”. Interpret the PCV as a model-dependent, descriptive change, not a causal decomposition.
A note on adjusting for the strata’s own categories. If instead you add the categorical main effects that define the strata (e.g.
Obese ~ Gender + Race + Education + ...) to recover the additive-vs-interaction split shown in the introduction, be aware that in a logistic model those fixed effects are nearly collinear with the stratum random intercept, soglmer()often reports a convergence or “nearly unidentifiable” note. Scaling covariates and usingcontrol = lme4::glmerControl(optimizer = "bobyqa")usually helps.
The VPC summarises variation; discriminatory accuracy summarises prediction. The package does not ship an AUC or median-odds-ratio helper, but the AUC is a one-liner from the model’s predicted probabilities. The AUC equals the Mann-Whitney U statistic, so it needs no extra package:
# Rank-based AUC (C-statistic): P(prob for a case > prob for a non-case)
maihda_auc <- function(prob, y) {
y <- as.numeric(y)
n1 <- sum(y == 1); n0 <- sum(y == 0)
if (n1 == 0 || n0 == 0) return(NA_real_)
r <- rank(prob)
(sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}Compute it from the predicted probabilities
(scale = "response") and the 0/1 outcome stored in the
fitted model frame. The strata-only model’s AUC is the discriminatory
accuracy of the intersectional strata themselves – Merlo’s
central quantity. Comparing it with the adjusted model shows whether
individual information beyond stratum membership sharpens
classification:
y_obs <- as.numeric(lme4::getME(model_null2$model, "y"))
prob_null <- predict_maihda(model_null2, type = "individual", scale = "response")
prob_adj <- predict_maihda(model_adj, type = "individual", scale = "response")
c(
AUC_strata_only = maihda_auc(prob_null, y_obs),
AUC_adjusted = maihda_auc(prob_adj, y_obs)
)
#> AUC_strata_only AUC_adjusted
#> 0.6261898 0.6282023An AUC of 0.5 is chance. Here both models sit around 0.6: even a
non-trivial between-stratum VPC translates into only
modest accuracy at the individual level –
exactly the cautionary message that motivates discriminatory-accuracy
reporting in MAIHDA. Note the adjusted model barely moves the AUC: the
categorical covariates that define the strata are already
captured by stratum membership, so only the continuous covariate
(Age) adds genuinely new individual-level information.
The standard plots recognise the binomial family and switch to the probability scale and to deviance-based diagnostics automatically:
# For binomial fits the dashboard highlights the largest absolute
# deviance residuals rather than raw deviations from the mean.
plot(model_adj, type = "prediction_deviation")See the plot interpretation vignette for how to read each of these.
A Poisson (count) outcome follows the identical pattern – pass
family = "poisson" to fit_maihda(). The VPC
then uses the log-link latent-scale residual variance, and the summary,
PCV, and plotting helpers all behave as above.
Merlo, J. (2018). Multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA) within an intersectional framework. Social Science & Medicine, 203, 74-80.
Evans, C. R., Williams, D. R., Onnela, J. P., & Subramanian, S. V. (2018). A multilevel approach to modeling health inequalities at the intersection of multiple social identities. Social Science & Medicine, 203, 64-73.
Merlo, J., Wagner, P., Ghith, N., & Leckie, G. (2016). An original stepwise multilevel logistic regression analysis of discriminatory accuracy: the case of neighbourhoods and health. PLOS ONE, 11(4), e0153778.