Research Models & Releases · arXiv cs.CL · 2d ago

AGC-Bench: Measuring Artificial General Creativity

Researchers have assembled the first systematic benchmark for measuring creativity across AI systems, addressing a long-standing gap in LLM evaluation. AGC-Bench consolidates 497 existing creativity benchmarks into a unified framework of 78 datasets spanning brainstorming, problem-solving, STEM, narrative, and humor tasks. The work also introduces Judge Response Theory to correct for bias in LLM-as-judge evaluation, a methodological advance that matters as creativity assessment becomes central to claims about general AI capability. The standardization effort suggests that creativity metrics are moving from niche research into mainstream model evaluation, which could change how labs benchmark and compare systems beyond traditional accuracy-focused tasks.

Modelwire context

Analyst take

The consolidation of 497 prior benchmarks into a single framework is less a research contribution than an infrastructure play. Whoever controls the dominant creativity evaluation standard gains soft influence over how labs define and report creative capability, which is a different kind of power than publishing a model.

This connects directly to two threads in recent coverage. The 'Measuring the Gap Between Human and LLM Research Ideas' piece from the same day exposed a core methodological problem: most creativity scoring happens in isolation, without grounding in what humans actually produce. AGC-Bench doesn't fully solve that, since its 78 datasets are drawn from existing benchmarks rather than live human comparisons. Separately, the MIT Technology Review piece on LLM groupthink found that models cluster toward predictable outputs, which means any creativity benchmark that relies on LLM judges risks rewarding the very statistical consensus that creative output is supposed to depart from. Judge Response Theory is AGC-Bench's answer to that problem, but whether it actually corrects for the bias the groupthink piece describes is an open question the paper will need to answer empirically.

Watch whether any major lab (Anthropic, Google, OpenAI) cites AGC-Bench in a model release or technical report within the next six months. Adoption at that level would confirm it's becoming infrastructure; silence would suggest it remains a research artifact.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: AGC-Bench · HELM · Judge Response Theory

Full story: arXiv cs.CL (arxiv.org)

Modelwire Editorial


