Using DSPy to evaluate and improve Datasette Agent's SQL system prompts

Simon Willison demonstrates a practical workflow for using DSPy, Stanford's framework for optimizing language model behavior, to systematically evaluate and refine SQL generation prompts in Datasette Agent. The work bridges prompt engineering and formal evaluation, showing how developers can move beyond manual tuning toward measurable improvements in agent reliability. This matters for the broader shift toward reproducible LLM optimization: as agents become production infrastructure, the ability to programmatically test and iterate on system prompts becomes a competitive advantage for teams building data-access tools.
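To make the idea of measurable prompt evaluation concrete, here is a minimal sketch of an execution-accuracy harness. It is not Willison's code and does not call DSPy. The schema, the two prompt variants, the test cases, and the `fake_generate` stub (standing in for a real model call) are all hypothetical, chosen only so the example runs with the Python standard library.

```python
import sqlite3
from typing import Callable, Optional

# Hypothetical candidate system prompts; not taken from Datasette Agent.
PROMPTS = {
    "v1": "Write a SQLite query that answers the question.",
    "v2": "Write a SQLite query that answers the question. "
          "Use only the tables and columns in the schema.",
}

# Hypothetical evaluation cases: (question, reference SQL).
CASES = [
    ("How many plants are in CA?",
     "SELECT count(*) FROM plants WHERE state = 'CA'"),
    ("List plant names alphabetically",
     "SELECT name FROM plants ORDER BY name"),
]


def make_db() -> sqlite3.Connection:
    """Build a small in-memory database to run queries against."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE plants (id INTEGER PRIMARY KEY, name TEXT, state TEXT);
        INSERT INTO plants (name, state)
        VALUES ('Alpha', 'CA'), ('Beta', 'CA'), ('Gamma', 'TX');
    """)
    return conn


def run(conn: sqlite3.Connection, sql: str) -> Optional[list]:
    """Return all rows, or None if the query fails to execute."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def execution_match(conn: sqlite3.Connection, generated: str, reference: str) -> bool:
    """True when the generated SQL runs and returns the same rows, in the
    same order, as the reference SQL."""
    actual = run(conn, generated)
    return actual is not None and actual == run(conn, reference)


def score(prompt: str, generate: Callable[[str, str], str],
          conn: sqlite3.Connection) -> float:
    """Fraction of cases where the generated SQL matches the reference."""
    hits = sum(execution_match(conn, generate(prompt, q), ref) for q, ref in CASES)
    return hits / len(CASES)


def fake_generate(system_prompt: str, question: str) -> str:
    """Stub for a model call. It deliberately gets the table name wrong
    unless the prompt contains the schema reminder."""
    if "How many" in question:
        return "SELECT count(*) FROM plants WHERE state = 'CA'"
    if "Use only the tables" in system_prompt:
        return "SELECT name FROM plants ORDER BY name"
    return "SELECT name FROM plant ORDER BY name"  # nonexistent table


conn = make_db()
scores = {name: score(text, fake_generate, conn) for name, text in PROMPTS.items()}
print(scores)
```

In a real setup the stub would be replaced by a model call, and a framework such as DSPy would use a metric like `execution_match` as the signal for comparing or optimizing prompts. Comparing result rows rather than SQL text means that two differently written but equivalent queries both count as correct.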
Modelwire context
The less-discussed implication here is that Willison is treating system prompts as artifacts to be tested under version control, not as configuration to be tweaked by intuition. DSPy's contribution is providing a feedback loop with measurable signal, which is distinct from simply 'using a framework.'
This connects directly to the Taboo constraint study covered yesterday (arXiv, July 1), which used a game environment as a controlled testbed to isolate how models balance competing instructions at inference time. Willison is doing something structurally similar: creating a repeatable evaluation harness to surface where SQL generation prompts fail under specific conditions. Both approaches treat prompt behavior as something to be measured rather than assumed. The broader pattern across recent coverage is a quiet shift away from prompting as craft toward prompting as engineering discipline, with reproducible test suites replacing intuition.
Watch whether Willison publishes the evaluation dataset and DSPy configuration publicly. If he does, and other Datasette users report measurable accuracy gains on their own schemas within the next few months, that would confirm the method generalizes beyond his specific test cases.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Simon Willison · DSPy · Datasette Agent · Claude · Stanford NLP
Modelwire summarizes, we don’t republish. The full content lives on simonwillison.net.