Research Tools & Code · arXiv cs.CL · 5d ago

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

Researchers have identified a critical weakness in LLM-as-judge systems: they now dominate the evaluation of open-ended AI outputs, yet they fail dramatically on objective-correctness tasks in standard benchmarks. RTLC, a three-stage prompting method rooted in pedagogical scaffolding, substantially improves judge accuracy by orchestrating multiple independent reasoning paths and a self-critique step, without requiring fine-tuning or external infrastructure. This addresses a foundational measurement problem: as AI evaluation shifts from automated metrics to LLM verdicts, the judges themselves must become more reliable. The technique is therefore relevant to anyone building evaluation pipelines or relying on LLM-based quality signals.

Modelwire context

Explainer

The Feynman framing is more than branding: the core insight is that forcing a model to 'teach' a concept back to itself before critiquing exposes gaps in its own reasoning that single-pass prompting masks. The gains come from structured self-inconsistency detection, not just chain-of-thought repetition.
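To make the three stages concrete, here is a minimal Python sketch of a research, teach, critique chain for a pairwise judge. It is an illustration under stated assumptions, not the paper's implementation: the prompt wording, the `LLM` callable type, and the function names `rtlc_judge` and `parse_verdict` are all hypothetical, and the sketch does not model the multiple independent reasoning paths mentioned in the summary.

```python
from typing import Callable

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def parse_verdict(text: str) -> str:
    """Read the verdict from the last non-empty line; expect 'A' or 'B'."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[-1].upper() not in ("A", "B"):
        raise ValueError("no A/B verdict found on the final line")
    return lines[-1].upper()


def rtlc_judge(llm: LLM, question: str, answer_a: str, answer_b: str) -> str:
    """Pick the better answer via three prompts (wording is assumed, not the paper's)."""
    # Stage 1, Research: reason about the question before seeing either answer.
    research = llm(f"Work through this question step by step.\n\nQuestion: {question}")

    # Stage 2, Teach-to-Learn: restate the reasoning as a plain explanation.
    teaching = llm(
        "Explain the following reasoning to a beginner, and flag any step "
        f"you cannot justify.\n\nQuestion: {question}\n\nReasoning: {research}"
    )

    # Stage 3, Critique: check both candidates against the explanation.
    verdict = llm(
        "Using the explanation below, critique each candidate answer, then give "
        "the single letter A or B on the final line.\n\n"
        f"Explanation: {teaching}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
    )
    return parse_verdict(verdict)
```

Any client can be passed as `llm` as long as it takes a prompt string and returns text. Keeping the stages as separate calls means each intermediate output can be logged and inspected, which is where a gap in the model's reasoning would show up.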

This connects directly to the hallucination detection work covered the same day ('Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry'), which also targets the question of where model reasoning fails rather than just whether it fails. Both papers are attacking measurement reliability from different angles: one through geometric analysis of hidden states, the other through prompting scaffolds. Together they suggest a convergent push to make LLM self-assessment trustworthy enough to use in production. The 'Senses Wide Shut' coverage adds a third data point: models can internally represent failures they don't surface in output, which is exactly the failure mode RTLC's self-critique stage is designed to force into the open.

If RTLC's accuracy gains replicate on evaluation benchmarks outside JudgeBench, particularly on adversarial or domain-specific judge tasks, the no-fine-tuning constraint becomes a genuine deployment argument. If gains don't transfer, this is a benchmark-specific result.

Coverage we drew on

Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: JudgeBench · RTLC · Feynman Learning Technique

Read full story at arXiv cs.CL (arxiv.org)
