Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

Researchers benchmarked cloud and local open-source LLMs on system dynamics tasks. Cloud models reached 77-89% accuracy on causal loop diagram (CLD) extraction, while the best local model, Kimi K2.5, matched mid-tier cloud performance. Local models struggled with error-fixing in interactive coaching scenarios, revealing a gap in long-context reasoning.
Mentions: Kimi K2.5 · CLD Leaderboard · Discussion Leaderboard
Read full story at arXiv cs.LG → (arxiv.org)