Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Researchers introduced Chat2Workflow, a benchmark and agentic framework for converting natural language into executable visual workflows, addressing the manual engineering bottleneck in industrial automation. The work tests whether LLMs can automate multi-step workflow design and error correction without human intervention.
Modelwire context
The benchmark's industrial automation framing is the part worth pausing on: this isn't about generating code for a general-purpose runtime, but about producing structured, node-based workflow graphs that industrial systems can actually execute. That raises a stricter correctness bar than most LLM code-generation benchmarks impose.
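To make that bar concrete, here is a minimal sketch of the kind of structural checks an industrial runtime implies: the node registry, graph schema, and specific checks are all assumptions for illustration, not the paper's actual format.

```python
# Hypothetical sketch: a generated workflow is a node graph that must satisfy
# structural execution constraints, not merely parse. Node types, schema, and
# checks below are illustrative assumptions, not Chat2Workflow's real format.
from collections import defaultdict

KNOWN_NODES = {"read_sensor", "threshold_filter", "send_alert"}  # assumed registry

def is_executable(workflow: dict) -> bool:
    """Check a node-graph workflow against the constraints an industrial
    runtime would enforce: known node types, valid wiring, and no cycles
    (so execution order is well defined)."""
    nodes = {n["id"]: n for n in workflow["nodes"]}
    edges = workflow["edges"]  # list of (src_id, dst_id) pairs

    # 1. Every node must be a type the runtime actually implements.
    if any(n["type"] not in KNOWN_NODES for n in nodes.values()):
        return False

    # 2. Every edge must reference declared nodes.
    if any(s not in nodes or d not in nodes for s, d in edges):
        return False

    # 3. The graph must be acyclic: topological sort via Kahn's algorithm.
    indegree = {nid: 0 for nid in nodes}
    successors = defaultdict(list)
    for s, d in edges:
        successors[s].append(d)
        indegree[d] += 1
    frontier = [nid for nid, deg in indegree.items() if deg == 0]
    visited = 0
    while frontier:
        nid = frontier.pop()
        visited += 1
        for nxt in successors[nid]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                frontier.append(nxt)
    return visited == len(nodes)

workflow = {
    "nodes": [
        {"id": "n1", "type": "read_sensor"},
        {"id": "n2", "type": "threshold_filter"},
        {"id": "n3", "type": "send_alert"},
    ],
    "edges": [("n1", "n2"), ("n2", "n3")],
}
print(is_executable(workflow))  # True: known nodes, valid wiring, acyclic
```

A syntactically plausible graph that wires an undeclared node, or closes a cycle, fails these checks even though it "looks right", which is exactly the gap these execution-oriented benchmarks measure.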
The closest parallel in recent coverage is QuantCode-Bench (arXiv, April 16), which tested whether LLMs could generate executable algorithmic trading strategies for a specific framework. Both papers are probing the same underlying question: can models reliably produce outputs that satisfy domain-specific execution constraints, not just syntactic plausibility? That benchmark found the gap between 'looks right' and 'runs correctly' to be significant. Chat2Workflow is essentially asking the same question in a different vertical. The agentic error-correction loop described here also echoes the self-reflection mechanism in MM-WebAgent, though the domains are distinct enough that direct comparison would be a stretch.
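For readers unfamiliar with the pattern, a generic generate-validate-repair loop looks roughly like the sketch below. The function names and round budget are assumptions, not the paper's actual agent design.

```python
# Minimal sketch of an agentic error-correction loop, under the assumption
# that it follows the common generate-validate-repair pattern. The callables
# `generate_workflow` and `validate` are hypothetical stand-ins.

def refine_workflow(prompt: str, generate_workflow, validate, max_rounds: int = 3):
    """Ask the model for a workflow, feed validator errors back as context,
    and retry until the workflow passes or the round budget is exhausted."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate_workflow(prompt, feedback)
        errors = validate(candidate)  # e.g. unknown nodes, bad wiring, cycles
        if not errors:
            return candidate  # passed the execution checks
        # Self-correction: surface the concrete failures to the next attempt.
        feedback = "Previous attempt failed: " + "; ".join(errors)
    return None  # could not produce an executable workflow within budget
```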
The meaningful test will be whether any industrial automation vendor adopts Chat2Workflow as an external evaluation standard within the next 12 months. Benchmark adoption by practitioners, rather than citation count, is what separates a useful measurement tool from an academic artifact.