Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

Researchers have identified a critical blind spot in using large language models as surrogate models for optimization tasks: their uncertainty estimates and predictions shift dramatically based on how questions are framed and sequenced. The work reveals that prompt structure functions as an implicit prior, different query formats (pointwise vs. joint) produce incompatible belief systems, and confidence updates follow non-monotonic patterns tied to evidence order. These findings matter because they expose a reliability gap in a growing practice, suggesting that practitioners deploying LLMs for low-data optimization may be making acquisition decisions based on unstable, prompt-dependent uncertainty signals rather than genuine model confidence.
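To make the acquisition concern concrete, here is a minimal sketch of how an upper-confidence-bound rule can pick a different candidate when only the elicited uncertainties change. The function name `ucb_choice`, the `beta` value, and every number below are illustrative assumptions, not figures from the paper.

```python
def ucb_choice(beliefs, beta=2.0):
    """Return the index of the candidate with the highest UCB score.

    beliefs: list of (mean, std) pairs, one per candidate.
    beta:    exploration weight (an assumed value for illustration).
    """
    scores = [mean + beta * std for mean, std in beliefs]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical surrogate outputs for the same three candidates under two
# prompt framings. The means are identical; only the stated stds differ.
prompt_a = [(0.60, 0.05), (0.50, 0.20), (0.55, 0.10)]
prompt_b = [(0.60, 0.15), (0.50, 0.08), (0.55, 0.10)]

choice_a = ucb_choice(prompt_a)
choice_b = ucb_choice(prompt_b)
```

In this toy setup the two framings select different candidates, so the next evaluation depends on the prompt rather than on any new evidence.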
Modelwire context
Explainer
The deeper issue isn't just that prompts matter, which practitioners already accept as folk wisdom. It's that the failure is structural: pointwise and joint query formats produce internally inconsistent belief systems, meaning two valid-looking prompting strategies can't even be reconciled into a single coherent uncertainty estimate.
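One way to check this kind of inconsistency in practice is to compare each candidate's pointwise prediction with the marginal implied by a joint query. The sketch below assumes both protocols return Gaussian summaries (mean and standard deviation) per candidate. The `Gaussian` type, the tolerance, and the sample numbers are hypothetical, and the symmetric KL divergence is just one reasonable discrepancy measure, not necessarily the one the paper uses.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Gaussian:
    mean: float
    std: float


def kl(p: Gaussian, q: Gaussian) -> float:
    """KL(p || q) for one-dimensional Gaussians."""
    return (math.log(q.std / p.std)
            + (p.std ** 2 + (p.mean - q.mean) ** 2) / (2 * q.std ** 2)
            - 0.5)


def marginal_gap(pointwise: Gaussian, joint_marginal: Gaussian) -> float:
    """Symmetric KL between the two elicited beliefs (0 means identical)."""
    return kl(pointwise, joint_marginal) + kl(joint_marginal, pointwise)


def incoherent_points(pointwise, joint, tol=0.1):
    """Indices where the pointwise and joint answers disagree beyond tol."""
    return [i for i, (p, j) in enumerate(zip(pointwise, joint))
            if marginal_gap(p, j) > tol]


# Hypothetical answers for two candidates under each protocol.
pointwise = [Gaussian(1.0, 0.5), Gaussian(2.0, 0.5)]
joint = [Gaussian(1.0, 0.5), Gaussian(2.6, 0.2)]
flagged = incoherent_points(pointwise, joint)
```

A coherent belief system would yield an empty list here. Any flagged index marks a candidate for which the two protocols cannot both describe the same underlying distribution.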
This connects directly to the position paper 'agentic AI orchestration should be Bayes-consistent' from early May, which argued that belief maintenance in LLM systems needs principled Bayesian grounding rather than ad-hoc design. That paper assumed the inference layer could produce stable beliefs worth orchestrating. This new work undercuts that assumption at the source: if the uncertainty signals feeding into any Bayesian control layer are prompt-dependent and non-monotonic, the orchestration framework is building on unstable inputs regardless of how principled its architecture is. The 'Adaptive Querying with AI Persona Priors' work from the same period faces a related exposure, since its closed-form updates depend on LLM outputs that may shift based on query framing.
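As a reference point for what Bayes-consistent updating looks like, the sketch below uses a conjugate Normal-Normal update with known noise variance. Under that model the final posterior does not depend on the order of the evidence, and the posterior variance never increases. The `recency_update` function is a deliberately order-sensitive toy stand-in and not a model of any actual LLM. All names and numbers are assumptions for illustration.

```python
from itertools import permutations


def normal_update(belief, y, noise_var=1.0):
    """Conjugate update of a (mean, variance) belief after observing y."""
    mean, var = belief
    post_var = 1.0 / (1.0 / var + 1.0 / noise_var)
    post_mean = post_var * (mean / var + y / noise_var)
    return post_mean, post_var


def recency_update(belief, y):
    """Toy non-Bayesian rule: average the old mean with the newest point."""
    mean, var = belief
    return 0.5 * mean + 0.5 * y, var


def run_sequence(prior, observations, update=normal_update):
    """Apply updates in order; return the final belief and variance trace."""
    belief = prior
    variances = [prior[1]]
    for y in observations:
        belief = update(belief, y)
        variances.append(belief[1])
    return belief, variances


def order_spread(prior, observations, update=normal_update):
    """Range of final means across all orderings of the same evidence."""
    means = [run_sequence(prior, perm, update)[0][0]
             for perm in permutations(observations)]
    return max(means) - min(means)


prior = (0.0, 4.0)           # hypothetical prior mean and variance
evidence = [1.0, 2.5, -0.5]  # hypothetical observations
bayes_spread = order_spread(prior, evidence)
toy_spread = order_spread(prior, evidence, update=recency_update)
```

Replacing `update` with a wrapper around an LLM query (not shown, since its interface would be an assumption) would let the same `order_spread` measure quantify how far a surrogate departs from the order-invariant baseline.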
Watch whether Bayesian optimization benchmarks that use LLM surrogates begin reporting prompt protocol as a controlled variable in their experimental setups. If that standardization appears in major venues within the next two conference cycles, it signals the field has absorbed this finding as a methodological requirement rather than a footnote.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Large Language Models · Bayesian Optimization · Surrogate Models
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.