Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

Researchers created PBIG-DATA, a dataset of 3,000 expert scores across 300 patent-based product ideas, to study whether LLM judges should model consensus or individual evaluator preferences when assessing business concepts on six dimensions, including feasibility and market potential.

Modelwire context

The paper's sharpest contribution isn't the dataset itself but the question it forces: when experts systematically disagree on dimensions like feasibility or market potential, averaging their scores into a single 'ground truth' may not be a neutral methodological choice but an active distortion of what evaluation is measuring.
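To make the averaging concern concrete, here is a minimal sketch using invented numbers. The idea names, expert names, and the 1-5 feasibility scale are all assumptions for illustration and are not taken from PBIG-DATA. It shows two ideas that receive the same mean score even though the experts agree on one and split sharply on the other.

```python
from statistics import mean, pstdev

# Hypothetical feasibility scores on an assumed 1-5 scale.
# These values are illustrative only and do not come from PBIG-DATA.
scores = {
    "idea_a": {"expert_1": 3, "expert_2": 3, "expert_3": 3},  # full agreement
    "idea_b": {"expert_1": 1, "expert_2": 3, "expert_3": 5},  # sharp disagreement
}


def summarize(per_expert):
    """Return the aggregate score and the spread across experts."""
    values = list(per_expert.values())
    return {"mean": mean(values), "spread": pstdev(values)}


summary = {idea: summarize(per_expert) for idea, per_expert in scores.items()}

for idea, stats in summary.items():
    print(f"{idea}: mean={stats['mean']:.2f}, spread={stats['spread']:.2f}")
```

A judge trained only on the mean column would treat both ideas as equivalent targets. The spread column is the information that averaging discards.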

This connects directly to the April 16 piece on 'Diagnosing LLM Judge Reliability,' which found that aggregate consistency metrics can look healthy (around 96%) while hiding logical contradictions in a third to two-thirds of individual comparisons. PBIG-DATA extends that concern into a new domain: rather than asking whether a judge is internally consistent, it asks whether the target signal being judged is itself coherent across human raters. Together, the two papers suggest that LLM evaluation has a two-sided reliability problem, one on the model side and one on the ground-truth side. Neither paper resolves the problem the other raises, but both point toward the same uncomfortable conclusion: benchmark scores in subjective domains may be measuring something fuzzier than they appear to.

Watch for follow-on work that tests whether personalized judge models trained on PBIG-DATA's individual expert profiles outperform aggregate baselines on held-out raters. That result would determine whether the personalization framing is practically useful or mainly a theoretical reframe.
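One way such a held-out comparison could be set up is sketched below. This is a toy under stated assumptions, not the paper's method: the data are invented, the "personalized" predictor is a simple per-rater offset added to the aggregate score (a stand-in for a personalized LLM judge), and the calibration/evaluation split is hypothetical. The numbers are constructed so that the offset helps, so the output says nothing about what PBIG-DATA would show.

```python
from statistics import mean

# Hypothetical scores from three training raters per idea (assumed 1-5 scale).
train_scores = {
    "idea_1": [2, 3, 4],
    "idea_2": [3, 4, 5],
    "idea_3": [1, 2, 3],
    "idea_4": [4, 4, 4],
}

# Hypothetical scores from one rater who is held out of training.
held_out = {"idea_1": 4, "idea_2": 5, "idea_3": 3, "idea_4": 5}

# A few of the held-out rater's scores are used to estimate their profile;
# the rest are reserved for evaluation.
calibration = ["idea_1", "idea_2"]
evaluation = ["idea_3", "idea_4"]


def aggregate_prediction(idea):
    """Aggregate baseline: the mean score of the training raters."""
    return mean(train_scores[idea])


# Stand-in for personalization: how far this rater sits above or below
# the aggregate, estimated on the calibration ideas only.
offset = mean(held_out[i] - aggregate_prediction(i) for i in calibration)


def personalized_prediction(idea):
    """Aggregate baseline shifted by the rater's estimated offset."""
    return aggregate_prediction(idea) + offset


def mae(predict, ideas):
    """Mean absolute error against the held-out rater's actual scores."""
    return mean(abs(predict(i) - held_out[i]) for i in ideas)


aggregate_mae = mae(aggregate_prediction, evaluation)
personalized_mae = mae(personalized_prediction, evaluation)
print(f"aggregate MAE: {aggregate_mae}, personalized MAE: {personalized_mae}")
```

The design point is the split: the rater's profile is estimated only from calibration ideas, and the two predictors are compared only on ideas that rater's profile never saw.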

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: PBIG-DATA

Read the full story at arXiv cs.CL (arxiv.org)
