--- title: "MAIHDA for Binary Outcomes (Discriminatory Accuracy)" author: "Hamid Bulut" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{MAIHDA for Binary Outcomes (Discriminatory Accuracy)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.5 ) ``` ## Why binary outcomes? Many of the original MAIHDA applications target a **binary** health outcome -- obese vs. not, hypertensive vs. not, screened vs. not. Merlo's framing of *discriminatory accuracy* is, at heart, a question about a binary classifier: how well do the intersectional strata alone separate people who do and do not have the outcome? For a binary outcome the model is a multilevel **logistic** regression: \[ \text{logit}\,\Pr(y_{ij} = 1) = \beta_0 + u_j, \qquad u_j \sim N(0, \sigma_u^2), \] where \(j\) indexes the intersectional stratum. MAIHDA reads off two complementary quantities from this model: - the **VPC/ICC**, the share of the (latent) variation that lies *between* strata; - the **discriminatory accuracy** (e.g. the AUC / C-statistic), how well stratum membership predicts the individual outcome. A high between-stratum VPC can still go with only *moderate* discriminatory accuracy at the individual level -- that contrast is the whole point of the "DA" in MAIHDA, and this vignette shows how to get both numbers. ```{r load} library(MAIHDA) data("maihda_health_data") # A two-level outcome: obese (Yes) vs. not (No) table(maihda_health_data$Obese) ``` ## Fitting a logistic MAIHDA model `fit_maihda()` detects a two-level outcome automatically. If you leave `family` at its default, the function switches to `family = "binomial"`, recodes the response to 0/1 for `glmer()`, and warns you so the change is never silent: ```{r autodetect} model_null <- fit_maihda( Obese ~ 1 + (1 | Gender:Race:Education), data = maihda_health_data ) ``` The warning is a feature, not a problem. To be explicit (and to silence the warning), pass `family = "binomial"` yourself -- this is the recommended form for scripts: ```{r explicit, eval=FALSE} model_null <- fit_maihda( Obese ~ 1 + (1 | Gender:Race:Education), data = maihda_health_data, family = "binomial" ) ``` > If you actually want a linear probability model on a 0/1 outcome, ask for it > explicitly with `family = "gaussian"`; the auto-switch only fires for the > default family. ## The VPC is on the latent scale `summary()` reports the VPC/ICC for the logistic model. There is no observed-scale residual variance for a Bernoulli outcome, so MAIHDA uses the standard **latent-scale** approximation: the level-1 (within-stratum) variance is fixed at \(\pi^2/3 \approx 3.29\) for the logit link (and the corresponding constant for a probit link). The VPC is therefore the between-stratum share of variation on that underlying latent scale, not on the probability scale. ```{r summary-null} summary_null <- summary(model_null) print(summary_null) ``` Read from this null model, the VPC is the **total** between-stratum share of latent variation in the odds of obesity. As in the Gaussian case, the stratum random effects capture the *combined* additive + interaction differences across strata; they isolate the *pure* intersectional (interaction) component only once the additive main effects of the strata variables are in the model. ## Adjusted model and PCV A "Model 2" adds individual-level covariates to ask how much of the between-stratum variation they account for. Here we adjust for age, fit on the **same analytic sample and strata** as the null model, and compare with `calculate_pvc()`. PCV compares variances across models, so both must use the same complete-case sample: ```{r adjusted} health_complete <- maihda_health_data[complete.cases( maihda_health_data[, c("Obese", "Age", "Gender", "Race", "Education")] ), ] model_null2 <- fit_maihda( Obese ~ 1 + (1 | Gender:Race:Education), data = health_complete, family = "binomial" ) # Model 2: adjust for an individual-level covariate (age) model_adj <- fit_maihda( Obese ~ Age + (1 | Gender:Race:Education), data = health_complete, family = "binomial" ) pcv <- calculate_pvc(model_null2, model_adj) print(pcv) ``` The same caveats as in the continuous case apply, with one extra wrinkle: for non-Gaussian models the latent residual variance is fixed by the link, so a change in the between-stratum variance is partly a **rescaling** of the latent scale, not only "variance explained". Interpret the PCV as a model-dependent, descriptive change, not a causal decomposition. > **A note on adjusting for the strata's own categories.** If instead you add the > *categorical* main effects that define the strata (e.g. > `Obese ~ Gender + Race + Education + ...`) to recover the additive-vs-interaction > split shown in the [introduction](introduction.html), be aware that in a logistic > model those fixed effects are nearly collinear with the stratum random intercept, > so `glmer()` often reports a convergence or "nearly unidentifiable" note. Scaling > covariates and using `control = lme4::glmerControl(optimizer = "bobyqa")` usually > helps. ## Discriminatory accuracy (AUC / C-statistic) The VPC summarises *variation*; discriminatory accuracy summarises *prediction*. The package does **not** ship an AUC or median-odds-ratio helper, but the AUC is a one-liner from the model's predicted probabilities. The AUC equals the Mann-Whitney U statistic, so it needs no extra package: ```{r auc-helper} # Rank-based AUC (C-statistic): P(prob for a case > prob for a non-case) maihda_auc <- function(prob, y) { y <- as.numeric(y) n1 <- sum(y == 1); n0 <- sum(y == 0) if (n1 == 0 || n0 == 0) return(NA_real_) r <- rank(prob) (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0) } ``` Compute it from the predicted probabilities (`scale = "response"`) and the 0/1 outcome stored in the fitted model frame. The strata-only model's AUC is the discriminatory accuracy of the intersectional strata *themselves* -- Merlo's central quantity. Comparing it with the adjusted model shows whether individual information beyond stratum membership sharpens classification: ```{r auc} y_obs <- as.numeric(lme4::getME(model_null2$model, "y")) prob_null <- predict_maihda(model_null2, type = "individual", scale = "response") prob_adj <- predict_maihda(model_adj, type = "individual", scale = "response") c( AUC_strata_only = maihda_auc(prob_null, y_obs), AUC_adjusted = maihda_auc(prob_adj, y_obs) ) ``` An AUC of 0.5 is chance. Here both models sit around 0.6: even a non-trivial between-stratum VPC translates into only **modest** accuracy at the *individual* level -- exactly the cautionary message that motivates discriminatory-accuracy reporting in MAIHDA. Note the adjusted model barely moves the AUC: the categorical covariates that *define* the strata are already captured by stratum membership, so only the continuous covariate (`Age`) adds genuinely new individual-level information. ## Plots adapt to the binomial family The standard plots recognise the binomial family and switch to the probability scale and to deviance-based diagnostics automatically: ```{r plot-predicted} # Predicted probabilities per stratum with intervals plot(model_adj, type = "predicted") ``` ```{r plot-vpc} # Latent-scale variance partition plot(model_adj, type = "vpc") ``` ```{r plot-prediction-deviation, warning = FALSE} # For binomial fits the dashboard highlights the largest absolute # deviance residuals rather than raw deviations from the mean. plot(model_adj, type = "prediction_deviation") ``` See the [plot interpretation vignette](interpreting_plots.html) for how to read each of these. ## Count outcomes work the same way A Poisson (count) outcome follows the identical pattern -- pass `family = "poisson"` to `fit_maihda()`. The VPC then uses the log-link latent-scale residual variance, and the summary, PCV, and plotting helpers all behave as above. ## References - Merlo, J. (2018). Multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA) within an intersectional framework. *Social Science & Medicine*, 203, 74-80. - Evans, C. R., Williams, D. R., Onnela, J. P., & Subramanian, S. V. (2018). A multilevel approach to modeling health inequalities at the intersection of multiple social identities. *Social Science & Medicine*, 203, 64-73. - Merlo, J., Wagner, P., Ghith, N., & Leckie, G. (2016). An original stepwise multilevel logistic regression analysis of discriminatory accuracy: the case of neighbourhoods and health. *PLOS ONE*, 11(4), e0153778.