ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks, by Artificial Analysis and IBM

Artificial Analysis and IBM released ITBench-AA, the first standardized benchmark for evaluating AI agents on real-world enterprise IT operations. The finding that frontier models score below 50% signals a significant capability gap between current LLM performance and production-ready autonomous IT task execution. This benchmark matters because enterprise deployment of agentic systems hinges on reliability thresholds well above current baselines, reshaping expectations around timeline-to-deployment and the types of tasks safe for autonomous delegation in mission-critical infrastructure.
Modelwire context
Explainer
The sub-50% headline obscures a more important design choice: ITBench-AA evaluates agents on multi-step, stateful IT workflows (think incident triage, configuration changes, runbook execution) rather than isolated Q&A, which means errors compound across steps and a 45% task-completion rate likely masks much higher failure rates on the longest, highest-stakes sequences.
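To make the compounding point concrete, here is a minimal sketch in Python. It rests on a simplifying assumption that is ours, not the benchmark's: every step of a workflow succeeds independently with the same probability, so a task with n steps succeeds with probability p^n. The function name `task_success_rate` and the sample values are hypothetical and are not drawn from ITBench-AA results.

```python
def task_success_rate(per_step_success: float, steps: int) -> float:
    """Probability that every step of a workflow succeeds.

    Assumption (illustrative only): steps are independent and share the
    same per-step success probability, so the task succeeds with p ** n.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return per_step_success ** steps


if __name__ == "__main__":
    # Hypothetical per-step reliability of 95%, across workflows of growing length.
    for steps in (1, 5, 10, 20):
        rate = task_success_rate(0.95, steps)
        print(f"{steps:>2} steps: {rate:.3f}")
```

Under these assumptions, an agent that is 95% reliable on each step completes a 20-step workflow only about 36% of the time, which is why an aggregate completion rate can hide much lower success on the longest sequences. Real workflows are unlikely to have independent or equally difficult steps, so treat this as intuition rather than a model of the benchmark.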
This is largely disconnected from recent activity in our archive, as we have no prior coverage to anchor it to. It belongs to a growing cluster of agentic evaluation work that has been quietly advancing in parallel with the more visible model-release cycle. The significance is that enterprise IT is one of the first verticals where agentic deployment has real financial and operational consequences if an agent misconfigures infrastructure or misroutes an incident. Benchmarks like this one set the evidentiary floor that procurement and risk teams will eventually demand before signing off on autonomous delegation.
Watch whether competing benchmark efforts from vendors like ServiceNow or Salesforce adopt ITBench-AA as a shared standard within the next six months. If they build proprietary alternatives instead, it signals the industry is not yet ready to agree on what 'good enough' means for autonomous IT operations.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Artificial Analysis · IBM · ITBench-AA
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on huggingface.co.