Opinion & Analysis · Tools & Code · Simon Willison · 3h ago

Claude Opus 4.8 generates invalid tool calls despite model scale


Armin Ronacher discovered that Claude Opus 4.8 frequently generates malformed tool calls by inventing schema fields that don't exist, causing Pi to reject the calls. The pattern contradicts the assumption that larger models produce cleaner outputs, suggesting that capability scaling doesn't guarantee tool-use reliability. This surfaces a critical gap in production LLM deployment: even frontier models can degrade user experience when their outputs don't conform to strict API contracts, forcing developers to implement expensive retry logic or accept silent failures.
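To make the retry cost concrete, here is a minimal sketch of strict argument validation with a bounded retry loop. It is not Pi's actual implementation: the `READ_FILE_SCHEMA` tool, its field names, and the `fake_model` stand-in are all hypothetical, chosen only to illustrate the pattern of rejecting invented fields and feeding the errors back to the model.

```python
from typing import Any, Callable

# Hypothetical tool schema: allowed argument names and their expected types.
READ_FILE_SCHEMA: dict[str, type] = {"path": str, "offset": int, "limit": int}
REQUIRED = {"path"}


def validate_args(args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call is valid."""
    problems = []
    # Fields the model invented that the schema does not define.
    for name in sorted(set(args) - set(READ_FILE_SCHEMA)):
        problems.append(f"unknown field: {name}")
    for name in sorted(REQUIRED - set(args)):
        problems.append(f"missing required field: {name}")
    for name, expected in READ_FILE_SCHEMA.items():
        if name in args and not isinstance(args[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems


def call_with_retry(
    generate: Callable[[list[str]], dict[str, Any]],
    max_attempts: int = 3,
) -> dict[str, Any]:
    """Ask the model for arguments, passing validation errors back on each retry."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        args = generate(feedback)
        feedback = validate_args(args)
        if not feedback:
            return args
    # Fail loudly rather than silently dropping the call.
    raise ValueError(f"tool call still invalid after {max_attempts} attempts: {feedback}")


def fake_model(feedback: list[str]) -> dict[str, Any]:
    """Stand-in for a model: invents a field first, then corrects itself."""
    if not feedback:
        return {"path": "notes.txt", "encoding": "utf-8"}  # "encoding" is not in the schema
    return {"path": "notes.txt"}


print(call_with_retry(fake_model))
```

Each retry is another model round trip, which is where the expense comes from; raising after the last attempt is one way to avoid the silent-failure alternative.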

Modelwire context

Analyst take

The specific failure mode here isn't hallucination in the conversational sense but schema hallucination, where the model confidently generates structurally invalid output that fails schema validation. That distinction matters because it's harder to catch in evals and harder to explain to stakeholders who assume bigger models are safer bets for production.
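One way an eval could surface this failure mode is to score logged tool calls for schema conformance rather than task success alone. The sketch below assumes a simple log format of `(tool_name, args)` pairs and a map from tool name to allowed field names; the tool names and fields are hypothetical.

```python
from collections import Counter
from typing import Any


def conformance_report(
    calls: list[tuple[str, dict[str, Any]]],
    schemas: dict[str, set[str]],
) -> tuple[float, Counter]:
    """Return (share of calls with no invented fields, count of each invented field)."""
    invalid = 0
    invented: Counter = Counter()
    for tool, args in calls:
        # A tool missing from `schemas` has no allowed fields, so every argument counts as invented.
        extra = set(args) - schemas.get(tool, set())
        if extra:
            invalid += 1
            invented.update(f"{tool}.{name}" for name in extra)
    rate = 1 - invalid / len(calls) if calls else 1.0
    return rate, invented


# Hypothetical schemas and logged calls.
schemas = {"read": {"path"}, "edit": {"path", "old", "new"}}
calls = [
    ("read", {"path": "a.txt"}),
    ("read", {"path": "a.txt", "encoding": "utf-8"}),
    ("edit", {"path": "a.txt", "old": "x", "new": "y"}),
    ("edit", {"path": "a.txt", "old": "x", "new": "y", "dry_run": True}),
]
rate, invented = conformance_report(calls, schemas)
print(rate, dict(invented))
```

Tracking the per-field counts alongside the overall rate shows which invented fields recur, which is more actionable than a single pass/fail number.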

This connects directly to the persona instability research we covered on July 1st ('Persona Non Grata'), which found that model scale correlates with instability patterns in structured output tasks. Both stories chip away at the same assumption: that frontier models are production-ready by default. The groupthink piece from MIT Technology Review that same day adds another dimension, noting that these models aren't truly stochastic but constrained in ways developers don't anticipate until something breaks. Taken together, the pattern suggests a reliability debt accumulating beneath capability headlines.

Watch whether Anthropic acknowledges tool-call schema conformance as a named regression between Opus 4.7 and 4.8 in any public changelog or model card update within the next 30 days. Silence would confirm that structured output reliability isn't yet a tracked metric at the model release level.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Claude Opus 4.8 · Armin Ronacher · Pi · Simon Willison

Read full story at Simon Willison → (simonwillison.net)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.


Modelwire summarizes, we don’t republish. Simon Willison originally reported this story as “Better Models: Worse Tools”. The full content lives on simonwillison.net. If you’re a publisher and want a different summarization policy for your work, see our takedown page.