Research Tools & Code·arXiv cs.CL·6d ago

ThinkProbe: Beyond Accuracy -- Structural Profiling of Open-Ended LLM Reasoning Traces via Non-Generative Thought Graphs


ThinkProbe introduces a non-generative framework for dissecting how language models reason: it converts reasoning traces into structured thought graphs and scores them with 19 metrics across five cognitive dimensions. Testing on 4,200 traces from seven reasoning models reveals that reasoning patterns are stable model-level signatures, with between-model differences outweighing domain variation by up to 4x. The work matters because it shifts attention from raw accuracy to the structural fingerprints of reasoning, giving practitioners a way to compare and debug reasoning models that benchmark scores alone cannot provide.
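To make the idea of a thought graph concrete, here is a minimal sketch in Python. It is not ThinkProbe's implementation: the trace, the parent-pointer encoding, and the four metrics below (node count, maximum depth, leaf count, mean branching factor) are illustrative assumptions, and the summary above does not say which 19 metrics the paper uses or how it builds its graphs.

```python
from collections import defaultdict

# Hypothetical toy trace (made up for illustration): (step_id, parent_id, text).
# A parent_id of None marks the root step.
TRACE = [
    (0, None, "Restate the problem"),
    (1, 0, "Try approach A"),
    (2, 0, "Try approach B"),
    (3, 1, "Approach A hits a contradiction"),
    (4, 2, "Approach B yields a candidate answer"),
    (5, 4, "Verify the candidate"),
]


def build_graph(trace):
    """Return the root id and a parent -> children adjacency map."""
    children = defaultdict(list)
    root = None
    for step_id, parent_id, _text in trace:
        if parent_id is None:
            root = step_id
        else:
            children[parent_id].append(step_id)
    return root, children


def profile(trace):
    """Compute a few simple structural metrics (assumed, not the paper's)."""
    root, children = build_graph(trace)

    # Depth-first walk from the root, recording each node's depth.
    depth = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            depth[child] = depth[node] + 1
            stack.append(child)

    # Internal nodes are those with at least one child.
    internal = [n for n in depth if children[n]]
    mean_branching = (
        sum(len(children[n]) for n in internal) / len(internal) if internal else 0.0
    )
    return {
        "nodes": len(depth),
        "max_depth": max(depth.values()),
        "leaves": len(depth) - len(internal),
        "mean_branching": mean_branching,
    }


if __name__ == "__main__":
    print(profile(TRACE))
```

The profile is computed directly from the graph, with no further model calls. That is one plausible reading of "non-generative", though the summary does not define the term.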

Modelwire context

Explainer

The most underappreciated finding here is not the framework itself but the stability result. Reasoning patterns function as consistent model-level signatures: a model's 'cognitive style' shifts less from one domain to the next than it differs from the style of other models. That has real implications for how we should interpret benchmark comparisons across tasks.
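One simple way to picture a stability claim like this is to compare how much a structural metric varies across models with how much it varies across domains. The sketch below assumes a single metric averaged per (model, domain) cell. The model names, domain names, and values are synthetic placeholders, and the variance ratio is an assumed stand-in, since the summary does not state which statistic lies behind the "up to 4x" figure.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Synthetic placeholder values for one structural metric (e.g. mean graph depth).
# These are NOT results from the paper.
SCORES = {
    ("model_a", "math"): 2.0, ("model_a", "law"): 2.2,
    ("model_b", "math"): 4.0, ("model_b", "law"): 4.2,
    ("model_c", "math"): 6.0, ("model_c", "law"): 6.2,
}


def group_means(scores, key_index):
    """Average the metric over all cells sharing a model (0) or domain (1)."""
    groups = defaultdict(list)
    for key, value in scores.items():
        groups[key[key_index]].append(value)
    return [mean(values) for values in groups.values()]


def variance_ratio(scores):
    """Return (between-model variance, between-domain variance, their ratio)."""
    between_model = pvariance(group_means(scores, 0))
    between_domain = pvariance(group_means(scores, 1))
    return between_model, between_domain, between_model / between_domain


if __name__ == "__main__":
    model_var, domain_var, ratio = variance_ratio(SCORES)
    print(f"between-model variance:  {model_var:.4f}")
    print(f"between-domain variance: {domain_var:.4f}")
    print(f"ratio: {ratio:.1f}")
```

A ratio well above 1 means the metric tracks which model produced the trace more than which domain the task came from, which is the pattern the stability result describes.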

This is largely disconnected from recent activity in our archive, as Modelwire has no prior coverage to anchor it to. It belongs to a growing body of interpretability and evaluation research that sits adjacent to, but distinct from, standard capability benchmarking. The core tension ThinkProbe addresses is one the field has circled for a while: accuracy scores tell you what a model got right, not how it got there, and those two things can diverge in ways that matter for reliability and debugging. By treating reasoning traces as structured graphs rather than free text, the authors are essentially proposing a new unit of analysis for model comparison.

Watch whether any of the seven tested models' developers formally respond to or adopt these structural metrics in their own evaluation reporting within the next six months. Adoption by even one major lab would signal that structural profiling is moving from academic novelty toward a practical evaluation standard.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: ThinkProbe · Thought Graph · LLM reasoning models

