
The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text
Researchers tested whether prompt engineering or model selection does more to improve LLM accuracy when predicting fan experience ratings from open-ended baseball survey text. Prompt changes yielded a gain of only 2 percentage points (67% to 69% accuracy), while GPT-5.2 and GPT-4.1-mini both underperformed the baseline, suggesting diminishing returns on optimization.
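
To make the size of that gain concrete, here is a minimal Python sketch of the arithmetic. It assumes accuracy means the share of responses where the predicted rating exactly matches the respondent's own rating, which the summary does not spell out. The function names and the toy ratings are hypothetical and are not taken from the study.

```python
def accuracy(predicted, actual):
    """Share of responses where the predicted rating equals the actual one.

    Assumption: exact-match accuracy; the study's metric may differ.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one rating")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def percentage_point_gain(baseline_acc, new_acc):
    """Absolute change in accuracy, expressed in percentage points."""
    return (new_acc - baseline_acc) * 100


def relative_gain(baseline_acc, new_acc):
    """Change in accuracy as a percentage of the baseline accuracy."""
    return (new_acc - baseline_acc) / baseline_acc * 100


# Hypothetical toy ratings, for illustration only (not study data).
toy_predicted = [5, 4, 3, 5]
toy_actual = [5, 4, 2, 5]
toy_acc = accuracy(toy_predicted, toy_actual)  # 3 of 4 match

# Figures reported in the summary: 67% baseline, 69% after prompt changes.
pp = percentage_point_gain(0.67, 0.69)
rel = relative_gain(0.67, 0.69)
print(f"{pp:.1f} percentage points, {rel:.1f}% relative to baseline")
```

Keeping the absolute and relative figures separate matters here: a move from 67% to 69% is 2 percentage points, or roughly 3% relative to the baseline, and either way it is small.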




























