Moral Safety in LLMs: Exposing Performative Compliance with Puzzled Cues

Researchers have identified a gap, which they term performative compliance, between how LLMs score on fairness benchmarks and how they behave in real-world use. When demographic identity is explicitly labeled during evaluation, models make fair decisions, but fairness degrades measurably when the same identity must be inferred from context. Varying the cue in this way, from explicit label to implicit signal, shows that current safety evaluations substantially overestimate moral robustness: harmful decisions increase by 4.4 percentage points when explicit demographic markers are removed. The finding has direct implications for deployment in high-stakes domains such as healthcare, legal systems, and hiring, where models may appear aligned during testing but fail in production, where identity signals are implicit rather than explicit.
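The comparison behind a figure like the 4.4 percentage point increase can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the `Decision` record, the `harmful_rate` and `cue_gap_pp` helpers, and the toy records are all assumptions introduced here, and the summary does not say how the authors label or grade decisions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One graded model decision (hypothetical record layout)."""
    scenario_id: str  # the underlying scenario, shared by both cue variants
    cue: str          # "explicit" (identity labeled) or "implicit" (identity inferred)
    harmful: bool     # whether a grader judged the decision harmful


def harmful_rate(decisions, cue):
    """Fraction of decisions under one cue condition that were judged harmful."""
    subset = [d for d in decisions if d.cue == cue]
    if not subset:
        raise ValueError(f"no decisions recorded for cue condition {cue!r}")
    return sum(d.harmful for d in subset) / len(subset)


def cue_gap_pp(decisions):
    """Implicit-minus-explicit harmful rate, expressed in percentage points."""
    implicit = harmful_rate(decisions, "implicit")
    explicit = harmful_rate(decisions, "explicit")
    return 100.0 * (implicit - explicit)


# Toy, made-up records purely to exercise the functions; not data from the paper.
toy = [
    Decision("s1", "explicit", False),
    Decision("s1", "implicit", True),
    Decision("s2", "explicit", False),
    Decision("s2", "implicit", False),
    Decision("s3", "explicit", True),
    Decision("s3", "implicit", True),
    Decision("s4", "explicit", False),
    Decision("s4", "implicit", False),
]

gap = cue_gap_pp(toy)
print(f"harmful-rate gap: {gap:.1f} percentage points")
```

Note that the gap is a difference of two rates, not a relative change: a move from a 25% to a 50% harmful rate is 25 percentage points, even though the rate itself has doubled. Pairing both cue variants on the same `scenario_id` keeps the comparison about the cue rather than about the scenario mix.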

Modelwire context

Explainer

The 4.4 percentage point degradation figure is striking not because it is large in absolute terms, but because it reveals that current benchmarks are structurally blind to the gap: they are designed in ways that inadvertently surface the very cues that trigger compliant behavior, making the evaluation environment systematically unlike production.

This connects directly to the certified robustness work covered the same day ('Improving Certified Robustness via Adversarial Distillation'), which identified a parallel problem: formal verification environments do not reliably predict real-world performance. Both papers are pointing at the same structural failure mode, where the conditions under which a model is tested diverge from the conditions under which it operates. The performative compliance finding extends that concern from adversarial inputs to demographic inference, a domain where the stakes are arguably higher because the failures are less visible and harder to audit post-deployment.

Watch whether maintainers of widely used fairness suites (BBQ, WinoBias) issue revised evaluation protocols that incorporate implicit-cue variants within the next two quarters. If they do not, this paper's core critique will remain theoretical rather than corrective.

Coverage we drew on

Improving Certified Robustness via Adversarial Distillation · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: LLMs · fairness evaluation

Read the full story at arXiv cs.CL (arxiv.org)
