ITBench-AA: Top AI Models Flunk Enterprise IT Tasks
IBM and Artificial Analysis just dropped ITBench-AA, the first real test of AI agents on enterprise IT work. Every frontier model scored under 50%.

Every frontier model on the market right now scores below 50% on enterprise IT agent tasks. That's the headline from ITBench-AA, the new benchmark from IBM Research and Artificial Analysis. And honestly? It's a much-needed reality check.
For months, AI labs have been waving around SWE-bench Verified scores in the 70-85% range and acting like agentic AI for the enterprise is basically a solved problem. The ITBench-AA benchmark suggests otherwise. When you take the same models that crush coding puzzles and drop them into real site reliability engineering scenarios, performance collapses.
This matters because enterprise IT is where the actual money lives. Not chatbots. Not vibe-coded side projects. The boring, expensive, mission-critical work of keeping production systems running.
ITBench-AA is the first public benchmark designed to evaluate AI agents on agentic enterprise IT tasks, built jointly by IBM Research and Artificial Analysis. It runs frontier models through realistic site reliability engineering and IT operations scenarios in live environments, then scores them on whether they actually fix the problem. Across all tested models, no system broke 50%.

The benchmark is a subset of IBM's larger ITBench framework, narrowed down by Artificial Analysis into a standardized evaluation suite. It's designed to be reproducible, comparable across model releases, and aligned with how IT work actually happens in production.
And that last point is what makes the scores so interesting.
Most AI benchmarks you see in marketing decks measure isolated capabilities. MMLU tests trivia knowledge. HumanEval tests function completion in a vacuum. Even SWE-bench Verified, which is closer to real engineering, hands the model a clean repo and a well-described bug.
Real IT work is nothing like that.
When a Kubernetes cluster starts dropping pods at 3am, nobody hands you a Markdown file describing the root cause. You get a vague PagerDuty alert, a dashboard with seventeen graphs, and a chat thread of confused engineers. The job isn't writing code. The job is figuring out what's wrong in a noisy, multi-system environment, then taking action that doesn't make things worse.
That's what ITBench-AA actually measures.
The benchmark gives agents access to live IT environments through tool use. Models can run shell commands, query monitoring systems, inspect logs, interact with Kubernetes APIs, and execute remediation steps. Scoring is binary at the task level: did the agent resolve the incident or not?
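Binary task-level scoring is easy to sketch. The snippet below is a minimal illustration, not the benchmark's actual harness: the `TaskResult` type, the `pass_rate` helper, and the task names are all hypothetical. It only shows that a task either counts as resolved or it doesn't, with no partial credit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str    # hypothetical scenario name, not a real ITBench-AA task ID
    resolved: bool  # True only if the incident was actually fixed


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved. A task is all-or-nothing: no partial credit."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.resolved for r in results) / len(results)


# Made-up outcomes for illustration only.
results = [
    TaskResult("pod-crashloop", True),
    TaskResult("misconfigured-ingress", False),
    TaskResult("disk-pressure", False),
    TaskResult("bad-helm-values", True),
]

print(f"{pass_rate(results):.0%}")  # prints "50%"
```

An agent that gathers the right logs and proposes a plausible fix but leaves the incident open scores the same as one that did nothing, which is part of why the numbers look so harsh.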

IBM has open-sourced the environment setup so anyone can reproduce results. Artificial Analysis runs the standardized evaluation across major frontier models and publishes comparable scores. According to the official announcement post, the scenarios cover incident response, configuration troubleshooting, and multi-step remediation workflows.
This isn't a trivia test. It's a job interview.
It gets uncomfortable from here for the AI labs. According to the published ITBench-AA SRE results, Claude Opus 4.7 (Max Effort) leads at 47% (self-reported by Artificial Analysis), with GPT-5.5 at 46% and Qwen3.7 Max at 42%. Every frontier model lands below 50%.
| Benchmark | Best frontier score | What it measures |
|---|---|---|
| MMLU | 92.3% | General knowledge |
| HumanEval | 93.7% | Isolated code completion |
| SWE-bench Verified | ~82% | Repo-level bug fixing |
| GPQA Diamond | 87.7% | Graduate science questions |
| ITBench-AA | <50% | Agentic enterprise IT |
See the gap? Models that ace graduate-level science and write production code can't reliably fix a broken Helm chart. That's the difference between "smart" and "useful in production."
The failure modes are pretty consistent across model families. Based on the benchmark's reported results and IBM Research's commentary, the recurring issues include:
That last one is the killer. Modern frontier models have context windows in the millions of tokens. Doesn't matter. If you can't identify the three lines in the kubelet log that actually point to the problem, all that context is just noise.
Let's put ITBench-AA next to other agentic benchmarks to see the shape of the problem.
| Benchmark | Top model | Score | Domain |
|---|---|---|---|
| SWE-bench Verified | Claude Opus 4.6 | 81.4% (self-reported) | Software engineering |
| ARC-AGI | o3 (high compute) | 87.5% | Abstract reasoning |
| GPQA Diamond | o3 | 87.7% | Science Q&A |
| ITBench-AA | Claude Opus 4.7 (Max Effort) | 47% (self-reported) | Enterprise IT ops |
The scores tell a story. Anything that looks like "closed-form reasoning" or "code in a sandbox" is approaching saturation. The moment you introduce real systems, real noise, and real tools that punish hallucination, performance drops by 20 to 40 points.
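To make the drop concrete, here is a quick arithmetic sketch using only the top scores from the table above. The variable and function names are ours. For these particular rows the gap works out to roughly 34 to 41 points, at or just past the upper end of that range.

```python
# Top scores copied from the comparison table above (percent).
top_scores = {
    "SWE-bench Verified": 81.4,
    "ARC-AGI": 87.5,
    "GPQA Diamond": 87.7,
}
ITBENCH_AA_TOP = 47.0  # Claude Opus 4.7 (Max Effort), per the same table


def point_gap(score: float, baseline: float = ITBENCH_AA_TOP) -> float:
    """Percentage-point difference from the baseline, rounded to one decimal."""
    return round(score - baseline, 1)


gaps = {name: point_gap(score) for name, score in top_scores.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} points vs ITBench-AA")
```

These are percentage-point differences between different benchmarks, so treat them as a rough picture of the gap rather than a like-for-like measurement.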
And that's exactly the work that enterprise buyers care about. AIOps. Incident response. Configuration management. Nobody is paying $5 per million tokens to autocomplete their CSS.
It's tempting to read sub-50% scores as bad news for AI agents. We'd argue the opposite. The benchmark community has been starved for evaluations that aren't already saturated. When every model scores 90%+ on a test, you've stopped measuring capability and started measuring training data overlap.
A benchmark where the best model gets 47% is a benchmark with room to grow. It's a real signal. You can run the same eval six months from now and actually see whether agents got better at the work that pays the bills.
The most useful benchmarks are the ones where current models clearly fail. Saturation is the enemy of progress measurement.
That's the deeper value of ITBench-AA. It gives us a ruler that won't max out by next quarter.
So what does this mean if you're an engineering leader thinking about deploying AI agents in your stack? A few things worth taking seriously.
If a vendor walks in claiming their agent can autonomously resolve incidents, ask for their ITBench-AA score, or their results on any IT-specific eval. The fact that frontier models score under 50% means anything claiming "full autonomy" is either overselling or running on rails so narrow it's not really agentic.
A human-in-the-loop design is almost certainly the right pattern for production IT work in 2026. The agent proposes, a human approves. That's where the real productivity gains are right now.
Claude Code and Cursor are pretty solid at what they do because software engineering inside a single repo is a constrained problem. The codebase is bounded. The tools are well-documented. The failure modes are quick to spot.

IT ops is unbounded. Every Kubernetes cluster is configured differently. Every observability stack has its own quirks. Tool documentation often lies. A coding agent that scores 80%+ on SWE-bench won't come anywhere near that on ITBench-AA. Don't assume the skills transfer.
For a couple of years the message from frontier labs has been: just use the biggest general model and prompt it well. ITBench-AA suggests that approach is hitting a wall for specialized enterprise domains. Expect to see more domain-specific models, scaffolds, and tool wrappers built specifically for ops work over the next year. IBM is clearly positioning to lead that conversation.
A few things stood out as genuinely interesting from the benchmark release.
First, reasoning models don't dominate the way you'd expect. On MATH and GPQA, the gap between reasoning models and non-reasoning models is enormous. On ITBench-AA, the gap shrinks substantially. Long chains of thought don't help much when the bottleneck is interpreting tool output, not planning.
Second, scaffolding seems to matter as much as the underlying model. The same model with different agent harnesses produces very different results. This mirrors what's been happening on SWE-bench, where a good scaffold can add 10+ points to a model's score.
Third, and this one isn't gonna sit well with the open-source crowd, the gap between frontier models and the best open weights appears wider on agentic IT tasks than on traditional benchmarks. The complexity of real tool use exposes weaknesses that don't show up in static evaluations.
The benchmark will only matter if it actually drives progress. A few signals to watch over the next six months:
If scores are still stuck below 50% a year from now, that's a story about the limits of current agent architectures. If they're at 70% by mid-2027, the enterprise AI pitch finally has a leg to stand on.
For now? Enterprise IT remains stubbornly hard for AI agents. And that's actually useful information.
ITBench-AA was built jointly by IBM Research and Artificial Analysis. The underlying ITBench framework is open-sourced on IBM's GitHub at github.com/itbench-hub/ITBench, so teams can reproduce results or extend scenarios. Artificial Analysis maintains the standardized scoring leaderboard for cross-model comparison.
SWE-bench Verified gives the agent a single code repository and a written bug description, then scores whether the patch passes tests. ITBench-AA drops the agent into live IT environments like Kubernetes clusters with monitoring tools, then scores whether it actually resolves an incident. SWE-bench measures code reasoning; ITBench-AA measures multi-system operations work.
Yes. Because the IBM ITBench framework is open source, you can wire up any model that supports tool calling, including Llama 4 and DeepSeek variants. The Artificial Analysis leaderboard primarily tracks frontier closed models, but the eval harness itself is model-agnostic.
The benchmark covers incident response, configuration troubleshooting, and multi-step remediation workflows in environments like Kubernetes. Tasks require agents to inspect logs, query monitoring tools, run shell commands, and modify infrastructure. Each task is scored as a binary pass or fail on whether the underlying issue gets resolved, not just whether the agent produced reasonable-sounding output.
Based on sub-50% benchmark scores, fully autonomous deployment is risky for anything mission-critical. The pragmatic pattern in 2026 is human-in-the-loop: let the agent investigate and propose actions, but require human approval before it touches production. Reserve autonomous action for low-stakes tasks like log enrichment or runbook drafting.
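One way to picture that pattern is an approval gate sitting between the agent and production. The sketch below is purely illustrative: `Risk`, `ProposedAction`, and `execute_with_gate` are hypothetical names, and in a real deployment the `approve` callback would be a ticket, chat prompt, or change-management step rather than a lambda.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Risk(Enum):
    LOW = "low"    # e.g. log enrichment, runbook drafting
    HIGH = "high"  # anything that touches production


@dataclass(frozen=True)
class ProposedAction:
    description: str
    risk: Risk


def execute_with_gate(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    run: Callable[[ProposedAction], None],
) -> str:
    """Run low-risk actions directly; require human approval for the rest."""
    if action.risk is Risk.LOW:
        run(action)
        return "executed"
    if approve(action):
        run(action)
        return "executed-after-approval"
    return "rejected"


# Usage: the approver here stands in for a human and denies everything.
executed: list[str] = []
record = lambda a: executed.append(a.description)
deny_all = lambda a: False

print(execute_with_gate(ProposedAction("draft runbook", Risk.LOW), deny_all, record))
print(execute_with_gate(ProposedAction("restart ingress", Risk.HIGH), deny_all, record))
print(executed)
```

The design choice that matters is that the risk tier is decided outside the agent. If the agent labels its own actions as low-risk, the gate stops protecting anything.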