Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

Researchers audited 18 frontier and open-weight multimodal models for order invariance, a foundational reliability property: shuffling the order of inputs should not change a model's output. Using a five-facet framework spanning option ordering, evidence chunking, document ranking, image sequencing, and cross-modal mixing, they found that none of the models was order-invariant, with flip rates between 24% and 50% per facet. The result exposes a gap between benchmark performance and the real-world robustness that emerging AI safety guidelines now demand. It also signals that current MLLM evaluation misses systematic brittleness that could undermine deployment in high-stakes settings where input presentation varies.
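To make the idea of a flip rate concrete, here is a minimal sketch for the option-ordering facet only. It is not the paper's Facet-Probe code, and the paper's exact flip-rate definition may differ. The sketch assumes a flip means "the answer under a shuffled order differs from the answer under the original order". The function `flip_rate` and both toy models are hypothetical stand-ins for a real model call.

```python
import itertools
from typing import Callable, List, Sequence

# Hypothetical interface: a model takes a question and an ordered list of
# options and returns the TEXT of the option it picks (not its position).
Model = Callable[[str, List[str]], str]


def flip_rate(model: Model, question: str, options: Sequence[str],
              max_perms: int = 24) -> float:
    """Share of reorderings whose answer differs from the original-order answer.

    Assumed definition for illustration; not taken from the paper.
    """
    baseline = model(question, list(options))
    # permutations() yields the original order first, so start at index 1.
    reorderings = list(
        itertools.islice(itertools.permutations(options), 1, max_perms)
    )
    if not reorderings:
        return 0.0  # a single option cannot be reordered
    flips = sum(model(question, list(p)) != baseline for p in reorderings)
    return flips / len(reorderings)


def first_option_model(question: str, options: List[str]) -> str:
    """Toy position-biased model: always picks whatever is listed first."""
    return options[0]


def longest_option_model(question: str, options: List[str]) -> str:
    """Toy order-invariant model: picks the longest option text."""
    return max(options, key=len)


question = "Which city is the capital of France?"
options = ["Paris", "Lyon", "Marseille"]

biased = flip_rate(first_option_model, question, options)
invariant = flip_rate(longest_option_model, question, options)
print(f"position-biased toy model: {biased:.2f}")
print(f"order-invariant toy model: {invariant:.2f}")
```

An order-invariant model scores 0.0 on this measure. The position-biased toy model flips on every reordering that moves a different option into first place. Comparing answer text rather than answer position is the design choice that makes results comparable across orderings.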

Modelwire context

Explainer

The paper's most underreported finding is structural. The five-facet Facet-Probe framework shows that brittleness is not concentrated in one modality or input type; it appears across text, image, and cross-modal combinations. There is therefore no safe subset of multimodal inputs that escapes the problem.

This connects directly to the voice AI reliability paper covered the same day ('Real-Time Voice AI Hears but Does Not Listen'), which found that production systems detect signals they then fail to act on. Both papers document the same underlying gap: models pass capability evaluations while failing on the consistency and reliability properties that deployment actually requires. The order-sensitivity finding extends that concern from a single modality to the full multimodal stack. Together, the two papers suggest that current benchmarking practice is systematically blind to a class of behavioral failures that surface only when input presentation varies.

Watch whether the providers of any of the 18 audited models, particularly Gemini's given that it is named explicitly, publish targeted robustness evaluations or training updates that address order sensitivity within the next two quarters. Silence from vendors after a named audit is itself informative.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Gemini · Facet-Probe · multimodal large language models

Read the full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial


