Modelwire

IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies


Researchers have identified a critical vulnerability in how large language models handle conflicting instructions across multi-turn conversations, where lower-priority directives systematically override higher-priority ones. The team formalizes this 'role-influence inversion' problem through a Jensen-Shannon Divergence framework and proposes IHDec, a training-free decoding method that detects and corrects hierarchy violations at the token level. Because the correction happens at decoding time, the method targets a meaningful gap in LLM robustness for real-world deployments where instruction priority matters, particularly multi-agent or hierarchical permission systems, without requiring expensive model retraining.
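For readers who want a concrete picture of the two ingredients named above, here is a minimal sketch, not the paper's implementation. It computes the Jensen-Shannon divergence between two next-token distributions and uses that divergence to scale a generic contrastive adjustment of the logits. The names `steer_logits`, `logits_high`, `logits_low`, and `alpha_max`, the linear scaling rule, and the idea of obtaining two sets of logits from two differently weighted views of the conversation are all assumptions made for illustration; the summary does not specify how IHDec builds its contrast or sets its steering strength.

```python
import numpy as np

LN2 = np.log(2.0)  # with natural logs, Jensen-Shannon divergence is at most ln 2


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def kl_divergence(a, b):
    """KL(a || b), skipping entries where a is zero (they contribute nothing)."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))


def js_divergence(p, q):
    """Jensen-Shannon divergence between distributions p and q (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def steer_logits(logits_high, logits_low, alpha_max=1.0):
    """Hypothetical divergence-scaled contrastive step.

    logits_high: next-token logits from a view of the conversation that
                 favors the higher-priority instruction (assumed input).
    logits_low:  next-token logits from a view that favors the
                 lower-priority instruction (assumed input).
    The contrast is amplified in proportion to how far the two
    distributions diverge; identical distributions are left unchanged.
    """
    logits_high = np.asarray(logits_high, dtype=float)
    logits_low = np.asarray(logits_low, dtype=float)
    jsd = js_divergence(softmax(logits_high), softmax(logits_low))
    alpha = alpha_max * jsd / LN2  # normalize the divergence to [0, 1]
    return logits_high + alpha * (logits_high - logits_low)


# Toy three-token vocabulary: the two views prefer different tokens.
adjusted = steer_logits([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
next_token = int(np.argmax(adjusted))
```

In this toy case the two views disagree, so the divergence is nonzero and the adjustment pushes further toward the token the higher-priority view prefers. When the views agree, the divergence is zero and the logits pass through untouched, which is the general appeal of gating a correction on a divergence signal.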

Modelwire context

Explainer

The paper formalizes a specific failure mode: lower-priority instructions systematically winning over higher ones in multi-turn conversations. This isn't just a jailbreak or adversarial attack; it's a structural problem in how models weight conflicting directives across turns.

This connects directly to the broader pattern we've covered this week around hidden competence and missing visibility. The Proprioceptive Dashboard story (VISTA) showed that models have latent capabilities they can't introspect on; IHDec suggests models also have latent instruction-ranking capabilities they fail to apply consistently. Both are training-free interventions that expose or correct behavior the model already possesses. The ParametricSkills work from the same day also touches instruction-following overhead, though from the angle of parametrizing skills rather than resolving conflicts. Where those papers focus on what models know but can't access, IHDec focuses on what models do but shouldn't.

If IHDec's divergence-steering method maintains its correction rate when tested on real hierarchical permission systems (e.g., multi-agent deployments with role-based access control) rather than synthetic multi-turn datasets, that would be strong evidence the method generalizes beyond the lab. If it doesn't, the fix, and possibly the vulnerability as formalized, may be narrower than claimed.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Large Language Models · IHDec · Jensen-Shannon Divergence · Contrastive Decoding


Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
