Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?

Researchers split MMLU into symbolic reasoning and knowledge recall subsets, finding that self-consistency prompting boosts performance on both despite being designed for reasoning alone. The technique achieves 89% accuracy on MMLU, suggesting the mechanism generalizes beyond its original use case.
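Self-consistency, as introduced by Wang et al., samples several chain-of-thought completions at nonzero temperature and keeps the majority final answer. The paper's exact decoding setup isn't specified above, so this is a minimal sketch of that voting step, with `sample_fn` as a hypothetical stand-in for any model call that returns a (reasoning, answer) pair:

```python
from collections import Counter
import itertools

def self_consistency_answer(sample_fn, question, k=10):
    """Sample k chain-of-thought completions and majority-vote the
    final answers. Returns the winning answer and its vote share.

    sample_fn is a hypothetical stand-in for a sampled model call;
    any LLM API with temperature > 0 plays this role in practice.
    """
    answers = [sample_fn(question)[1] for _ in range(k)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / k

# Toy stand-in "model": cycles through canned answers to show the vote.
_fake = itertools.cycle([("...", "B"), ("...", "B"), ("...", "C")])
answer, agreement = self_consistency_answer(lambda q: next(_fake),
                                            "Which option?", k=9)
# With this toy cycle, "B" wins 6 of 9 votes.
```

The vote share returned alongside the answer is exactly the quantity the calibration discussion below treats as a signal in its own right.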
Modelwire context
Explainer
The more interesting finding isn't the 89% headline number — it's the methodological move of splitting MMLU into reasoning versus recall subsets at all, which exposes how conflated that benchmark has always been and raises questions about what prior self-consistency results were actually measuring.
This connects directly to the April 21 paper on unsupervised confidence calibration ('Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation'), which also treats self-consistency as a signal worth distilling, but for a different purpose: estimating model confidence rather than boosting accuracy. Together, the two papers suggest self-consistency is doing something more general than majority-vote correction on symbolic steps — it may be functioning as a soft ensemble that smooths over retrieval noise as much as reasoning errors. That reframing matters because calibration and accuracy are related but distinct goals, and conflating the mechanism risks building confidence estimators on shaky theoretical ground.
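The distinction between calibration and accuracy can be made concrete with expected calibration error (ECE), the standard way to test whether a confidence score (such as a self-consistency vote share) actually tracks accuracy. This is a generic sketch of ECE, not the method of either paper:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare mean confidence to
    empirical accuracy in each bin; a well-calibrated estimator has
    a small weighted gap (ECE)."""
    if not confidences:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A high-accuracy model can still score badly here if its vote shares are systematically over- or under-confident, which is why conflating the two goals is risky.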
If follow-up work replicates the recall gains on MedMCQA's harder clinical reasoning subset (where knowledge and inference are tightly coupled), that would support the generalization claim. If gains collapse there, the effect is probably specific to MMLU's particular knowledge-retrieval structure rather than a property of self-consistency itself.
Coverage we drew on
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: MMLU · GSM8K · MedMCQA · self-consistency · chain-of-thought prompting
Modelwire summarizes — we don’t republish. The full article lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.