Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

Researchers demonstrate that rubric-based evaluation with multi-judge filtering outperforms holistic LLM-as-a-judge scoring by removing judge model bias. The work introduces Prosa, a 1,000-conversation Brazilian Portuguese benchmark on which three independent judges reach perfect rank agreement across 16 models when using structured rubrics, compared with agreement on only 7 of 16 under traditional holistic scoring. The rubric approach also increases discriminative power between models by 47 percent, suggesting that decomposing evaluation criteria matters more than which model serves as judge. This challenges a prevailing assumption in LLM benchmarking and offers a replicable methodology for more robust cross-model comparison.
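To make "rank agreement" concrete, here is a minimal sketch of one way it could be computed. This is not Prosa's actual pipeline: the data shape, the function names, the binary rubric verdicts, and the toy numbers are all assumptions made for illustration. Each judge scores a model as the fraction of rubric criteria it marks as satisfied, models are ranked per judge, and agreement is the number of rank positions on which every judge names the same model.

```python
from typing import Dict, List

# Hypothetical data shape (an assumption, not taken from the paper):
# verdicts[judge][model] is a list of 0/1 rubric verdicts, one per criterion.
Verdicts = Dict[str, Dict[str, List[int]]]


def rubric_score(checks: List[int]) -> float:
    """Fraction of rubric criteria the judge marked as satisfied."""
    return sum(checks) / len(checks)


def rank_models(scores: Dict[str, float]) -> List[str]:
    """Model names from highest to lowest score; ties broken by name."""
    return sorted(scores, key=lambda m: (-scores[m], m))


def rank_agreement(verdicts: Verdicts) -> int:
    """Count rank positions where all judges place the same model."""
    rankings = []
    for by_model in verdicts.values():
        scores = {m: rubric_score(c) for m, c in by_model.items()}
        rankings.append(rank_models(scores))
    # zip(*rankings) yields, for each position, the model chosen by each judge.
    return sum(1 for position in zip(*rankings) if len(set(position)) == 1)


# Toy verdicts for three hypothetical judges and three hypothetical models.
toy: Verdicts = {
    "judge_a": {"model_x": [1, 1, 1, 0], "model_y": [1, 0, 0, 0], "model_z": [1, 1, 0, 0]},
    "judge_b": {"model_x": [1, 1, 1, 1], "model_y": [0, 0, 0, 0], "model_z": [1, 1, 1, 0]},
    "judge_c": {"model_x": [1, 1, 0, 0], "model_y": [1, 0, 0, 0], "model_z": [1, 1, 1, 0]},
}

agreed = rank_agreement(toy)
print(f"{agreed} of {len(toy['judge_a'])} rank positions agreed")
```

In this toy data the judges agree only on the last-placed model. Under this definition, "perfect rank agreement on 16 models" would mean the count equals 16.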
Modelwire context
Explainer
The finding that judge model identity matters less than rubric structure inverts the usual optimization instinct, in which teams spend effort selecting the strongest available judge rather than decomposing the scoring criteria themselves. Prosa also fills a concrete gap in non-English evaluation infrastructure, where Brazilian Portuguese has been largely absent from serious benchmarking efforts.
This connects directly to the evaluation methodology thread running through recent coverage. The MathArena piece from May 1st argued that reliable progress tracking requires moving beyond static, holistic leaderboards, and Prosa makes a complementary case at the scoring level: the problem isn't just what you measure but how you decompose the measurement. Similarly, the Themis work on multilingual code reward models showed that moving beyond single-dimension scoring surfaces gaps that binary metrics hide. Prosa applies that same decomposition logic to open-ended chat evaluation, and the 47 percent discriminative improvement suggests the payoff is substantial.
If other multilingual benchmark efforts, particularly those covering low-resource languages, adopt rubric-based multi-judge filtering and replicate the rank-agreement gains, that confirms the methodology generalizes beyond Portuguese. If adoption stays narrow, it likely signals that the overhead of rubric design limits uptake outside well-resourced research groups.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Prosa · WildChat · Brazilian Portuguese
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.