Research Models & Releases · arXiv cs.CL · May 2

Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

Researchers demonstrate that rubric-based evaluation with multi-judge filtering outperforms holistic LLM-as-a-judge scoring by removing judge model bias. The work introduces Prosa, a 1,000-conversation Brazilian Portuguese benchmark on which three independent judges achieve perfect rank agreement across 16 models when using structured rubrics, versus agreement on only 7 of 16 under traditional holistic scoring. The rubric approach also increases discriminative power between models by 47 percent, suggesting that decomposing evaluation criteria matters more than which model serves as judge. This challenges a prevailing assumption in LLM benchmarking and offers a replicable methodology for more robust cross-model comparison.
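
As a rough illustration of the idea, the sketch below has three judges mark a single response against binary rubric criteria and keeps only the criteria on which all judges agree. The unanimity rule, the criterion names, the judge labels, and the `filter_unanimous` and `rubric_score` helpers are all assumptions made for illustration; the summary does not specify how Prosa's filtering actually works.

```python
from typing import Dict

# Hypothetical verdicts for one response: judge -> criterion -> passed?
Verdicts = Dict[str, Dict[str, bool]]


def filter_unanimous(verdicts: Verdicts) -> Dict[str, bool]:
    """Keep only criteria on which every judge gave the same verdict.

    Unanimity is an assumed filtering rule, not the paper's documented one.
    """
    judges = list(verdicts)
    criteria = verdicts[judges[0]].keys()
    kept: Dict[str, bool] = {}
    for criterion in criteria:
        votes = {verdicts[judge][criterion] for judge in judges}
        if len(votes) == 1:  # all judges agree
            kept[criterion] = votes.pop()
    return kept


def rubric_score(kept: Dict[str, bool]) -> float:
    """Fraction of retained criteria that were satisfied (0.0 if none retained)."""
    if not kept:
        return 0.0
    return sum(kept.values()) / len(kept)


# Illustrative data only; criterion names are invented.
verdicts: Verdicts = {
    "judge_a": {"answers_question": True, "fluent_portuguese": True,
                "concise": False, "cites_source": False},
    "judge_b": {"answers_question": True, "fluent_portuguese": True,
                "concise": True, "cites_source": False},
    "judge_c": {"answers_question": True, "fluent_portuguese": True,
                "concise": False, "cites_source": False},
}

kept = filter_unanimous(verdicts)  # "concise" is dropped: judges disagree
score = rubric_score(kept)
print(sorted(kept), round(score, 3))
```

In this toy setup the contested criterion is excluded before scoring, so the final score depends only on judgments that do not vary with the choice of judge.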

Modelwire context

Explainer

The finding that judge model identity is less important than rubric structure inverts the usual optimization instinct, where teams spend effort selecting the strongest available judge rather than decomposing the scoring criteria themselves. Prosa also fills a concrete gap in non-English evaluation infrastructure, where Brazilian Portuguese has been largely absent from serious benchmarking efforts.
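
One simple way to check whether judge identity matters is to rank the same models under each judge and count the rank positions at which every judge places the same model. The sketch below does that with made-up scores. The model and judge names, the `ranking` and `matching_positions` helpers, and the position-by-position reading of "rank agreement" are assumptions; the summary does not define the agreement metric the paper uses.

```python
from typing import Dict, List


def ranking(scores: Dict[str, float]) -> List[str]:
    """Order models from best to worst; ties are broken by name for determinism."""
    return sorted(scores, key=lambda model: (-scores[model], model))


def matching_positions(per_judge: Dict[str, Dict[str, float]]) -> int:
    """Count rank positions at which all judges place the same model."""
    rankings = [ranking(scores) for scores in per_judge.values()]
    return sum(1 for slot in zip(*rankings) if len(set(slot)) == 1)


# Illustrative scores only; not taken from the paper.
per_judge = {
    "judge_a": {"model_x": 0.81, "model_y": 0.64, "model_z": 0.52},
    "judge_b": {"model_x": 0.77, "model_y": 0.70, "model_z": 0.41},
    "judge_c": {"model_x": 0.90, "model_y": 0.55, "model_z": 0.58},
}

agreed = matching_positions(per_judge)
print(f"{agreed} of {len(per_judge['judge_a'])} positions agree")
```

Here the judges agree only on the top position, because judge_c swaps the second and third models. Under this reading, perfect agreement would mean the count equals the number of models.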

This connects directly to the evaluation methodology thread running through recent coverage. The MathArena piece from May 1st argued that reliable progress tracking requires moving beyond static, holistic leaderboards, and Prosa makes a complementary case at the scoring level: the problem isn't just what you measure but how you decompose the measurement. Similarly, the Themis work on multilingual code reward models showed that moving beyond single-dimension scoring surfaces gaps that binary metrics hide. Prosa applies that same decomposition logic to open-ended chat evaluation, and the 47 percent discriminative improvement suggests the payoff is substantial.

If other multilingual benchmark efforts, particularly those covering low-resource languages, adopt rubric-based multi-judge filtering and replicate the rank-agreement gains, that would indicate the methodology generalizes beyond Portuguese. If adoption stays narrow, it likely signals that the overhead of rubric design limits uptake outside well-resourced research groups.

Coverage we drew on

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Prosa · WildChat · Brazilian Portuguese

