Research Models & Releases · arXiv cs.CL · 3d ago

CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes

Researchers have released CDR-Bench, a 3,462-task evaluation suite that measures whether large language models can reliably execute multi-step data transformation workflows in which operator sequencing and composition directly affect outcomes. The benchmark spans four real-world refinement domains with 29 distinct operators and tests models on atomic, order-agnostic, and order-sensitive scenarios using deterministic validation. Early results across 10+ leading LLMs expose gaps in faithful recipe execution, a shortfall that matters for production data pipelines, where instruction fidelity and procedural reasoning determine correctness.
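To make the order-agnostic versus order-sensitive distinction concrete, here is a minimal sketch of a refinement recipe as a sequence of operators applied to a list of strings, with exact-match validation against a reference execution. The three operators and the helper names (run_recipe, is_order_sensitive, validate) are hypothetical illustrations, not CDR-Bench's actual operators or scoring code.

```python
from itertools import permutations
from typing import Callable, List, Sequence

# Hypothetical operators for illustration; these are NOT CDR-Bench's operators.
Operator = Callable[[List[str]], List[str]]


def strip_whitespace(rows: List[str]) -> List[str]:
    return [r.strip() for r in rows]


def lowercase(rows: List[str]) -> List[str]:
    return [r.lower() for r in rows]


def deduplicate(rows: List[str]) -> List[str]:
    # dict.fromkeys keeps the first occurrence and preserves order.
    return list(dict.fromkeys(rows))


def run_recipe(rows: List[str], recipe: Sequence[Operator]) -> List[str]:
    """Apply each operator in the order the recipe lists it."""
    for op in recipe:
        rows = op(rows)
    return rows


def is_order_sensitive(rows: List[str], recipe: Sequence[Operator]) -> bool:
    """True if some reordering of the recipe changes the output on this input."""
    outputs = {tuple(run_recipe(rows, p)) for p in permutations(recipe)}
    return len(outputs) > 1


def validate(model_output: List[str], rows: List[str],
             recipe: Sequence[Operator]) -> bool:
    """Deterministic check: the output must equal the reference execution."""
    return model_output == run_recipe(rows, recipe)


rows = ["  Apple ", "apple", "Kiwi  "]
agnostic = [strip_whitespace, lowercase]
sensitive = [strip_whitespace, lowercase, deduplicate]

print(is_order_sensitive(rows, agnostic))
print(is_order_sensitive(rows, sensitive))
print(validate(["apple", "kiwi"], rows, sensitive))
```

In this toy case, stripping and lowercasing commute, but deduplicating before normalizing leaves "Apple" and "apple" as separate rows, so a model that reorders the steps produces an output that an exact-match validator rejects.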

Modelwire context

Skeptical read

The benchmark's real contribution is not just that it measures compositional reasoning, but that it does so with deterministic validation across order-sensitive workflows. What the summary does not disclose is whether CDR-Bench itself has been audited for label quality or test-train leakage, or whether the 'gaps' it surfaces reflect genuine model limitations rather than dataset artifacts.

This lands directly in the wake of the RVL-CDIP audit from earlier this week, which exposed how systemic benchmark contamination (12% label corruption, 35% test-train overlap) artificially inflates reported accuracy and propagates false confidence through the field. CDR-Bench's deterministic validation is a step toward rigor, but the field has learned that benchmarks require adversarial scrutiny before practitioners cite them to validate production systems. The question isn't whether CDR-Bench identifies real gaps, but whether the benchmark itself withstands the same audit pressure that just reshaped how we read RVL-CDIP results.
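For readers who want a feel for what a test-train overlap figure measures, the sketch below computes the share of test items that also appear in a training set after light text normalization. It is an assumption-laden illustration using exact-match fingerprints on short text items; it is not the method used in the RVL-CDIP revision, and a real audit would also need near-duplicate detection. The function names and sample items are hypothetical.

```python
import hashlib
from typing import Iterable, List


def fingerprint(text: str) -> str:
    """Hash a case-folded, whitespace-collapsed copy of the text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def overlap_rate(train: Iterable[str], test: List[str]) -> float:
    """Fraction of test items whose fingerprint also occurs in train."""
    if not test:
        return 0.0
    train_hashes = {fingerprint(t) for t in train}
    hits = sum(1 for t in test if fingerprint(t) in train_hashes)
    return hits / len(test)


# Hypothetical items, made up for this example.
train_items = ["Sort rows by date", "Drop empty columns"]
test_items = [
    "sort  rows by DATE",       # same as a train item after normalization
    "Merge duplicate records",
    "Trim whitespace",
    "Drop empty columns ",      # trailing space only
]

print(overlap_rate(train_items, test_items))
```

Exact-match hashing only gives a lower bound on contamination, since paraphrased or lightly edited duplicates slip through.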

If independent researchers reproduce CDR-Bench's reported gaps on a held-out subset of tasks that the benchmark authors did not construct, and those gaps persist, the benchmark earns credibility. If performance improves substantially when the same models are tested on order-sensitive tasks from a different domain, that signals the gaps may be dataset-specific rather than fundamental to procedural reasoning.

Coverage we drew on

Revising RVL-CDIP: Quantifying Errors and Test-Train Overlap · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: CDR-Bench · LLMs

