This is not true (and I'm surprised by the bad reporting here from 404). arXiv is no longer accepting *review papers* unless they are peer reviewed. This has no effect on the submission of research articles. See the original post: blog.arxiv.org/2025/10/31/a....
It looks like they updated the article with a correction!
Excited to share that our paper, "Understanding challenges to the interpretation of evaluations of algorithmic fairness" has been accepted to NeurIPS 2025! You can read the paper now on arXiv: arxiv.org/abs/2506.04193.
To summarize, we conducted a deep dive into some of the more challenging conceptual issues when it comes to evaluating machine learning models across subgroups, as is typically done to evaluate fairness or robustness.
A few of the key questions that we grappled with in this work included:
1. Why do models that predict outcomes well (even optimally) for all subgroups still exhibit systematic differences in performance across subgroups?
2. When and why do models that explicitly use subgroup membership information for prediction behave differently from those that do not? (A toy sketch of this appears after this list.)
3. How do model performance and fairness properties change under different assumptions on the data-generating process (reflecting different causal processes and structural causes of disparity) and under different mechanisms of selection bias (which render data misrepresentative of the ideal target population)?
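To build intuition for question 2, here is a minimal synthetic sketch in Python. It is not from the paper: the data-generating process, the coefficients, and the choice of logistic regression and log-loss are all invented for illustration. When the feature-outcome relationship itself differs by group, a model that can use group membership fits each group's relationship, while a group-blind model pools them and pays a price within groups; when the relationship is shared, the two coincide (up to estimation noise).

```python
# Illustrative sketch only (not the paper's code): group-aware vs. group-blind models
# diverge when P(Y | X) differs across subgroups, and coincide when it does not.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)
n = 100_000
g = rng.integers(0, 2, size=n)            # subgroup membership (0 or 1)
x = rng.normal(size=n)                    # a single feature

# Invented setup: the x -> outcome relationship differs by group.
logit = np.where(g == 0, 2.0 * x - 1.0, -0.5 * x + 0.5)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

X_blind = x[:, None]                      # feature only
X_aware = np.column_stack([x, g, x * g])  # feature + group indicator + interaction

for name, X in [("group-blind", X_blind), ("group-aware", X_aware)]:
    model = LogisticRegression().fit(X, y)
    p_hat = model.predict_proba(X)[:, 1]
    losses = [log_loss(y[g == k], p_hat[g == k]) for k in (0, 1)]
    print(f"{name}: per-group log-loss = {losses[0]:.3f}, {losses[1]:.3f}")
```

If the same logit applied to both groups, the two models would perform essentially identically; the gap here comes entirely from pooling heterogeneous groups.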
A few concrete practical takeaways:
1. Our results show that if the goal is to model outcomes well, and those outcomes are disparate across subgroups, we should not in general expect parity in model performance across subgroups (a minimal synthetic sketch of this follows after this list).
2. Observing model performance differences thus motivates deeper investigation to understand the causes of distributional differences across subgroups and to disambiguate them from observational biases (e.g., selection bias) and from model estimation error.
3. We provide evaluation methodology for controlling for confounding and for conditional independence testing. These methods complement standard disaggregated evaluation to provide insight into why model performance differs across subgroups.
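To make takeaways 1 and 3 concrete, here is a minimal synthetic sketch, again not from the paper: the data-generating process, parameters, and metrics are invented for illustration, and the simple binned comparison only loosely stands in for the paper's evaluation methodology. A single Bayes-optimal risk model is evaluated on two groups whose covariate distributions differ; its marginal error rates differ by group, while its error rates conditional on the covariate are approximately equal, so the disaggregated gap reflects distributional differences rather than a model that treats the groups differently.

```python
# Illustrative sketch only (not the paper's code): a Bayes-optimal risk model can show
# unequal marginal performance across subgroups purely because the covariate
# distribution differs by group, while conditional performance is ~equal.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

g = rng.integers(0, 2, size=n)                   # 0 = group A, 1 = group B
x = rng.normal(loc=1.5 * g, scale=1.0, size=n)   # group B has higher risk scores on average

# Outcome depends on x only (not on group), so sigmoid(x) is Bayes-optimal for both groups.
p = 1.0 / (1.0 + np.exp(-x))
y = rng.binomial(1, p)

pred = (p >= 0.5).astype(int)                    # threshold the optimal risk score
err = (pred != y).astype(float)

for grp, label in [(0, "A"), (1, "B")]:
    m = g == grp
    print(f"group {label}: prevalence={y[m].mean():.3f}, error={err[m].mean():.3f}, "
          f"brier={np.mean((p[m] - y[m]) ** 2):.3f}")

# Stratify by deciles of x: within bins, error rates are ~equal across groups, so the
# marginal gap is driven by where each group sits on x, not by group-specific model error.
edges = np.quantile(x, np.linspace(0, 1, 11))
bin_idx = np.clip(np.digitize(x, edges[1:-1]), 0, 9)
for b in range(10):
    m = bin_idx == b
    print(f"x-decile {b}: error A={err[m & (g == 0)].mean():.3f}, "
          f"error B={err[m & (g == 1)].mean():.3f}")
```

Real data adds selection bias and estimation error on top of these distributional differences, which is exactly why observed gaps call for the deeper investigation described in takeaway 2.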
Beyond characterization of modeling implications, we argue that fairness (as well as related concepts such as equity or justice) is best understood not as a property of a model, but rather as a property of a policy or intervention that leverages the model in a specific sociotechnical context.
We advocate for using interdisciplinary expertise and domain knowledge to ground the analytic approach to model evaluation in both:
1. upstream context (e.g., understanding the role of social and structural determinants of disparities and their impact on selection, measurement, and problem formulation)
2. downstream context (the fairness or equity implications that a model has when used as a component of a policy/intervention in a specific context).
This work was a collaboration with Natalie Harris, Chirag Nagpal, David Madras, Vishwali Mhasawade, Olawale Salaudeen, @adoubleva.bsky.social, Shannon Sequeira, Santiago Arciniegas, Lillian Sung, Nnamdi Ezeanochie, Heather Cole-Lewis, @kat-heller.bsky.social, Sanmi Koyejo, Alexander D'Amour.
Causal inference iceberg!
What's missing?
Iceberg meme template. From top to bottom: Correlation does not imply causation; Third variable adjustment; Just run an experiment; Confounding; Collider bias; Selective mortality; Observational longitudinal data; Mediation analysis is messed up; Posttreatment bias in experiments; Generalization as a causal inference problem; Missing data as a causal inference problem; Measurement as a causal inference problem; Causal foundations of applied probability and statistics.
This is (unfortunately) the required style for Nature journals
Check out our new paper "Tackling Algorithmic Bias and Promoting Transparency in Health Datasets: The STANDING Together Consensus Recommendations" jointly published in NEJM AI and The Lancet Digital Health, led by @jaldmn.bsky.social @xiaoliu.bsky.social