The strength of clinical evidence is recoverable from language model representations but not from their stated grades

Researchers tested whether 22 open-weight LLMs can internally represent clinical evidence strength, separate from factual accuracy. Using 45,134 harmonized medical claims across three grading frameworks, they found that linear probes successfully recovered evidence grades from model activations in every tested model, despite the systems rarely stating confidence levels explicitly when queried. This gap between hidden representational capacity and stated outputs has direct implications for clinical AI deployment, where confidence calibration failures could propagate silently through downstream applications.

Modelwire context

Explainer

The finding isn't just that models know more than they say. It's that this hidden knowledge is linearly decodable: a simple linear classifier trained on the activations can read it off. That means the information is organized and accessible in principle, which makes the silence about confidence a consequence of architecture and training rather than a fundamental limitation. That distinction matters for anyone deciding whether to patch outputs or retrain.
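To make "linearly decodable" concrete, here is a minimal sketch of a linear probe. It is not the paper's pipeline: the activations are synthetic stand-ins generated with NumPy, the three grade classes are arbitrary, and every name (`make_synthetic_activations`, `fit_linear_probe`, `predict`) is hypothetical. The probe is a ridge regression onto one-hot grade labels, which is one common choice among several. A shuffled-label control is included because a probe only counts as evidence if it fails when the labels carry no signal.

```python
import numpy as np


def make_synthetic_activations(n_per_grade=200, dim=32, n_grades=3, seed=0):
    """Hypothetical stand-in for hidden states.

    Assumption: each evidence grade shifts the activation vector along its
    own random direction, with unit Gaussian noise on top.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_grades, dim))
    feats, labels = [], []
    for grade in range(n_grades):
        feats.append(rng.normal(size=(n_per_grade, dim)) + 2.0 * directions[grade])
        labels.append(np.full(n_per_grade, grade))
    X = np.vstack(feats)
    y = np.concatenate(labels)
    perm = rng.permutation(len(y))  # shuffle so the train/test split mixes grades
    return X[perm], y[perm]


def fit_linear_probe(X, y, n_grades, ridge=1.0):
    """Ridge regression onto one-hot labels; returns weights of shape (dim + 1, n_grades)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    Y = np.eye(n_grades)[y]                    # one-hot encode the grades
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)


def predict(W, X):
    """Predict the grade whose linear score is highest."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W).argmax(axis=1)


X, y = make_synthetic_activations()
split = int(0.8 * len(y))

# Probe trained on the true grade labels.
W = fit_linear_probe(X[:split], y[:split], n_grades=3)
probe_acc = (predict(W, X[split:]) == y[split:]).mean()

# Control: same probe, but training labels are shuffled so they carry no signal.
rng = np.random.default_rng(1)
y_shuffled = rng.permutation(y[:split])
W_control = fit_linear_probe(X[:split], y_shuffled, n_grades=3)
control_acc = (predict(W_control, X[split:]) == y[split:]).mean()

print(f"probe accuracy:   {probe_acc:.2f}")
print(f"control accuracy: {control_acc:.2f} (chance is about 0.33)")
```

On real models the rows of `X` would be hidden states extracted at a chosen layer and token position, and the gap between probe and control accuracy is the quantity of interest. Comparing probe accuracy with the accuracy of the grades a model states when asked is what would expose the gap the article describes.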

This connects directly to the ThinkProbe work published the same day, which introduced structural profiling of reasoning traces and found that reasoning patterns are stable model-level signatures. Both papers probe beneath the surface of model outputs to characterize what models internally represent. ThinkProbe asks how models reason structurally; this paper asks what models know about their own epistemic standing. Together they reinforce a broader methodological shift: the output layer is increasingly understood as an unreliable narrator, and the real signal lives in activations and trace structure. That framing is becoming a recurring theme in interpretability research, though clinical AI is a higher-stakes application than most prior work in this vein.

Watch whether the developers of any of the 22 tested models release fine-tuned variants that explicitly surface evidence grades in outputs, using probing results as a training signal. If that happens within the next 12 months, it would confirm that the gap is practically closeable and not just a diagnostic curiosity.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: LLMs · Clinical evidence grading · Model interpretability · Open-weight models

Read full story at arXiv cs.CL → (arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.