"Don't Say It!": Constraints, Compliance, and Communication when Language Models Play Taboo

Researchers are using the Taboo word game as a controlled testbed to understand how language models handle competing constraints at inference time. By intervening at progressively deeper points in the generation pipeline, from prompt engineering to manipulation of internal representations, the work isolates how models balance strict lexical restrictions against communicative effectiveness. This matters because real-world deployment often requires similar trade-offs between safety guardrails and utility, which makes Taboo a proxy for studying whether models can comply with constraints without sacrificing output quality. The evaluation methodology, which combines violation detection with LLM-as-judge scoring, offers a replicable framework for measuring constraint adherence across intervention depths.
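To make the violation-detection half of that evaluation concrete, here is a minimal sketch of a whole-word taboo check. It is an illustration under stated assumptions, not the paper's implementation: the function names `find_violations` and `violation_rate` are hypothetical, and the matching rule (case-insensitive, exact whole words, no handling of plurals or other inflections) is a simplification chosen for brevity.

```python
import re


def find_violations(clue: str, taboo_words: list[str]) -> list[str]:
    """Return the taboo words that appear in `clue` as whole words.

    Assumption: matching is case-insensitive and exact, so "car" does not
    match "cars" or "scarf". A real evaluator would likely also need
    stemming or lemmatization to catch inflected forms.
    """
    hits = []
    for word in taboo_words:
        # \b word boundaries stop "car" from matching inside "scarf".
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, clue, flags=re.IGNORECASE):
            hits.append(word)
    return hits


def violation_rate(clues: list[str], taboo_words: list[str]) -> float:
    """Fraction of clues that contain at least one taboo word."""
    if not clues:
        return 0.0
    violating = sum(1 for clue in clues if find_violations(clue, taboo_words))
    return violating / len(clues)
```

A check like this only covers the lexical side. Whether a compliant clue still communicates the target word is the part the summary attributes to LLM-as-judge scoring, which this sketch does not attempt.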

Modelwire context

Explainer

The paper's core contribution is not just the finding that models struggle with constraints, but a map of where in the generation pipeline compliance breaks down. By comparing prompt engineering against interventions on internal representations, it probes whether constraint violations stem from patterns learned in training or from how models generate tokens at inference time.

This connects directly to the stability and consistency problems surfaced in recent work on persona maintenance and rhetorical interpretation. The Persona Non Grata paper found that LLMs drift unpredictably across task contexts; this Taboo work provides a mechanistic explanation by showing that constraints compete with learned generation patterns at inference time. The framework also echoes the clinical NLP production finding that learned gating rules fail at scale, suggesting that hard lexical constraints (like word bans) may be more reliable than soft behavioral ones when safety matters.

If researchers apply this intervention methodology to actual safety-critical constraints (e.g., refusing to generate malware code or PII) and show that representation-level interventions outperform prompt-based ones by more than 15 percentage points, that would signal a practical path for hardening deployed models without retraining. If the gap is smaller than 5 points, prompt engineering remains sufficient and the added complexity isn't justified.
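The threshold test above is simple arithmetic on two compliance rates, sketched below. The function names, the return labels, and the treatment of the 5-to-15-point band as "inconclusive" are assumptions for illustration; the analysis itself only speaks to gaps above 15 points and below 5.

```python
def compliance_gap_points(prompt_rate: float, repr_rate: float) -> float:
    """Gap in percentage points between two compliance rates.

    Both rates are fractions in [0, 1]. A positive result means the
    representation-level intervention complied more often.
    """
    return (repr_rate - prompt_rate) * 100.0


def interpret_gap(gap_points: float, high: float = 15.0, low: float = 5.0) -> str:
    """Map a gap to the reading suggested in the text.

    Assumption: gaps between `low` and `high` are labeled "inconclusive",
    since the text does not say how to read them.
    """
    if gap_points > high:
        return "representation"  # deeper intervention looks worth the complexity
    if gap_points < low:
        return "prompt"  # prompt engineering remains sufficient
    return "inconclusive"
```

For example, compliance rates of 0.70 (prompt-based) and 0.90 (representation-level) give a gap of about 20 points, which falls on the "representation" side of the rule.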

Coverage we drew on

Persona Non Grata: LLM Persona-Driven Generations in MCQA are Unstable in Distinct Dimensions · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Modelwire Editorial
