Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Researchers have built Themis-CodeRewardBench, a multilingual evaluation framework that stress-tests reward models (RMs) across eight programming languages and five quality dimensions beyond execution correctness. The work exposes significant gaps in how current RMs assess code, moving the field beyond binary pass/fail metrics toward nuanced preference learning. This matters because code generation is among the highest-stakes real-world deployment settings for LLMs, and reward models are the primary lever for steering post-training. The benchmark profiles 50+ existing RMs, establishing a new standard for what robust code alignment should measure.
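To make the contrast with binary pass/fail scoring concrete, here is a minimal, hypothetical sketch of what multi-criteria scoring of a code completion can look like. The dimension names, weights, and helper functions below are our own illustration, not the paper's actual rubric or API.

```python
# Illustrative sketch only: the dimensions and weights are assumptions,
# not Themis-CodeRewardBench's published scoring scheme.
from dataclasses import dataclass


@dataclass
class CodeJudgment:
    """Per-dimension quality scores for one code completion (each in 0-1)."""
    correctness: float    # does it execute / pass tests
    readability: float    # hypothetical dimension
    efficiency: float     # hypothetical dimension
    security: float       # hypothetical dimension
    documentation: float  # hypothetical dimension


def binary_reward(judgment: CodeJudgment) -> float:
    """Conventional execution-only signal: 1 if the code passes, else 0."""
    return 1.0 if judgment.correctness >= 1.0 else 0.0


def multi_criteria_reward(judgment: CodeJudgment,
                          weights: dict[str, float]) -> float:
    """Weighted blend of quality dimensions, the kind of scoring a
    benchmark like Themis-CodeRewardBench is built to stress-test."""
    total = sum(weights.values())
    return sum(getattr(judgment, dim) * w for dim, w in weights.items()) / total


if __name__ == "__main__":
    # Two candidates: one passes tests but is unreadable and insecure;
    # the other narrowly fails a test but is otherwise strong.
    hacky = CodeJudgment(1.0, 0.2, 0.4, 0.1, 0.0)
    clean = CodeJudgment(0.9, 0.9, 0.8, 0.9, 0.8)
    weights = {"correctness": 0.5, "readability": 0.15, "efficiency": 0.15,
               "security": 0.15, "documentation": 0.05}

    # Binary scoring ranks hacky above clean; the weighted score reverses that.
    print(binary_reward(hacky), binary_reward(clean))
    print(multi_criteria_reward(hacky, weights), multi_criteria_reward(clean, weights))
```

The benchmark's question, as we read it, is whether a reward model's preferences behave like the second function rather than the first across languages and quality dimensions.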
Modelwire context
Analyst take
The benchmark's real leverage isn't the eight languages or five quality dimensions in isolation: it's that profiling 50+ existing reward models simultaneously creates a public ranking that will pressure model developers to optimize specifically for Themis-CodeRewardBench, raising the familiar Goodhart's Law concern that the benchmark becomes the target rather than a proxy for genuine code quality.
This connects directly to two threads in recent coverage. The ChatGPT goblin incident (The Decoder, May 1) illustrated how reward signal misconfiguration produces persistent behavioral artifacts at scale, and Themis is essentially an attempt to make those failure modes visible before deployment rather than after. Separately, the AutoMat benchmark story showed that coding agents collapse on underspecified real-world tasks even when generic benchmarks look strong, which is precisely the gap Themis claims to address by moving past binary pass/fail scoring. Both stories reinforce the same underlying problem: current reward models are optimized for the wrong signal.
Watch whether any of the major post-training labs (Anthropic, DeepSeek, or Xiaomi given MiMo-V2.5-Pro's coding focus) cite Themis-CodeRewardBench in a model card or training report within the next six months. Adoption at that level would confirm it as infrastructure rather than an academic artifact.
Coverage we drew on
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Themis · Themis-CodeRewardBench
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes; we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.