EQUITRIAGE: A Fairness Audit of Gender Bias in LLM-Based Emergency Department Triage

A systematic fairness audit reveals that five major LLMs exhibit significant gender bias when deployed for emergency department triage decisions, with flip rates ranging from 9.9% to 43.8% when patient gender is swapped in identical clinical scenarios. The finding matters because hospitals are actively piloting these models as decision support tools in high-stakes settings where bias directly affects patient outcomes. Rather than mitigating known human disparities in triage assessment, current models appear to reproduce or amplify them, raising urgent questions about LLM deployment in clinical workflows before bias mitigation strategies mature.

Modelwire context

Explainer

The study uses MIMIC-IV-ED, a real clinical dataset, as its substrate, which means these aren't hypothetical vignettes but cases drawn from actual emergency department encounters. That methodological choice makes the bias findings harder to dismiss as artifacts of synthetic prompting.

The tension here is direct and uncomfortable. Coverage from May 3rd of the Harvard study showing LLMs outperforming emergency room physicians on diagnostic accuracy created real momentum for faster clinical deployment. EQUITRIAGE arrives two days later and complicates that picture considerably: accuracy on aggregate benchmarks can coexist with systematic demographic bias at the individual decision level. The ethical divergence benchmark covered around the same time, 'Same prompt, different morals' from The Decoder, showed frontier models encoding different value systems across moral domains. Triage is precisely the kind of high-stakes sequential decision context where those divergences compound, and the procedural faithfulness gaps documented in the arXiv step-execution paper from May 1st suggest models may also be inconsistently applying whatever clinical logic they do have.

Watch whether any of the five audited vendors, particularly Google or OpenAI, respond with model cards or updated system prompts specifically addressing gender-neutral clinical framing before hospital pilot programs expand past the proof-of-concept stage. A vendor response within 90 days would signal the audit has regulatory traction; silence would suggest deployment is outpacing accountability.

Coverage we drew on

In Harvard study, AI offered more accurate diagnoses than emergency room doctors · TechCrunch - AI

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.

MentionsGoogle Gemini · Nemotron · DeepSeek · Mistral · OpenAI GPT-4 · MIMIC-IV-ED

Read full story at arXiv cs.CL →(arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Our mission How we write

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.