Who built the ITBench-AA benchmark and is it open source?

ITBench-AA was built jointly by IBM Research and Artificial Analysis. The underlying ITBench framework is open source and hosted on GitHub at github.com/itbench-hub/ITBench, so teams can reproduce results or extend scenarios. Artificial Analysis maintains the standardized scoring leaderboard for cross-model comparison.

How is ITBench-AA different from SWE-bench Verified?

SWE-bench Verified gives the agent a single code repository and a written bug description, then scores whether the patch passes tests. ITBench-AA drops the agent into live IT environments like Kubernetes clusters with monitoring tools, then scores whether it actually resolves an incident. SWE-bench measures code reasoning; ITBench-AA measures multi-system operations work.

Can you run ITBench-AA on open-source models like Llama or DeepSeek?

Yes. Because the IBM ITBench framework is open source, you can wire up any model that supports tool calling, including Llama 4 and DeepSeek variants. The Artificial Analysis leaderboard primarily tracks frontier closed models, but the eval harness itself is model-agnostic.

What kinds of IT scenarios does ITBench-AA actually test?

The benchmark covers incident response, configuration troubleshooting, and multi-step remediation workflows in environments like Kubernetes. Tasks require agents to inspect logs, query monitoring tools, run shell commands, and modify infrastructure. Each task is scored as a binary pass/fail on whether the underlying issue gets resolved, not just whether the agent produced reasonable-sounding output.

Should I deploy AI agents for production IT operations right now?

Based on sub-50% benchmark scores, fully autonomous deployment is risky for anything mission-critical. The pragmatic pattern in 2026 is human-in-the-loop: let the agent investigate and propose actions, but require human approval before it touches production. Reserve autonomous action for low-stakes tasks like log enrichment or runbook drafting.

ITBench-AA: Top AI Models Flunk Enterprise IT Tasks

Every frontier model tested so far scores below 50% on enterprise IT agent tasks. That's the headline from ITBench-AA, the new benchmark from IBM Research and Artificial Analysis. And honestly? It's a much-needed reality check.

For months, AI labs have been waving around SWE-bench Verified scores in the 70-85% range and acting like agentic AI for the enterprise is basically a solved problem. The ITBench-AA benchmark suggests otherwise. When you take the same models that crush coding puzzles and drop them into real site reliability engineering scenarios, performance collapses.

This matters because enterprise IT is where the actual money lives. Not chatbots. Not vibe-coded side projects. The boring, expensive, mission-critical work of keeping production systems running.

What's the ITBench-AA benchmark?

ITBench-AA is the first public benchmark designed to evaluate AI agents on agentic enterprise IT tasks, built jointly by IBM Research and Artificial Analysis. It runs frontier models through realistic site reliability engineering and IT operations scenarios in live environments, then scores them on whether they actually fix the problem. Across all tested models, no system broke 50%.

[Figure: bar chart showing frontier model scores dropping sharply on ITBench-AA compared with other benchmarks]

The benchmark is a subset of IBM's larger ITBench framework, narrowed down by Artificial Analysis into a standardized evaluation suite. It's designed to be reproducible, comparable across model releases, and aligned with how IT work actually happens in production.

And that last point is what makes the scores so interesting.

Why traditional benchmarks miss the point

Most AI benchmarks you see in marketing decks measure isolated capabilities. MMLU tests general knowledge. HumanEval tests function completion in a vacuum. Even SWE-bench Verified, which is closer to real engineering, hands the model a clean repo and a well-described bug.

Real IT work is nothing like that.

When a Kubernetes cluster starts dropping pods at 3am, nobody hands you a Markdown file describing the root cause. You get a vague PagerDuty alert, a dashboard with seventeen graphs, and a chat thread of confused engineers. The job isn't writing code. The job is figuring out what's wrong in a noisy, multi-system environment, then taking action that doesn't make things worse.

That's what ITBench-AA actually measures.

How the methodology works

The benchmark gives agents access to live IT environments through tool use. Models can run shell commands, query monitoring systems, inspect logs, interact with Kubernetes APIs, and execute remediation steps. Scoring is binary at the task level: did the agent resolve the incident or not?
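
To make the scoring concrete, here's a minimal Python sketch of what binary, task-level scoring boils down to. It is not the ITBench harness: the `TaskResult` type, the task IDs, and the assumption that a headline percentage is a plain average of pass/fail outcomes are ours, for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one scenario (hypothetical type, not from ITBench)."""
    task_id: str
    resolved: bool  # True only if the incident was actually fixed


def resolution_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; there is no partial credit per task."""
    if not results:
        raise ValueError("no task results to score")
    return sum(r.resolved for r in results) / len(results)


# Made-up outcomes for four made-up tasks.
results = [
    TaskResult("hypothetical-task-1", True),
    TaskResult("hypothetical-task-2", False),
    TaskResult("hypothetical-task-3", False),
    TaskResult("hypothetical-task-4", True),
]
print(f"{resolution_rate(results):.0%}")  # 50%
```

The point of the sketch is the lack of partial credit: a plausible diagnosis that doesn't fix the incident counts the same as doing nothing.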

IBM has open-sourced the environment setup so anyone can reproduce results. Artificial Analysis runs the standardized evaluation across major frontier models and publishes comparable scores. According to the official announcement post, the scenarios cover incident response, configuration troubleshooting, and multi-step remediation workflows.

This isn't a trivia test. It's a job interview.

The headline numbers

It gets uncomfortable from here for the AI labs. According to the published ITBench-AA SRE results, Claude Opus 4.7 (Max Effort) leads at 47% (as reported by Artificial Analysis), with GPT-5.5 at 46% and Qwen3.7 Max at 42%. Every frontier model tested lands below 50%.

Benchmark	Best frontier score	What it measures
MMLU	92.3%	General knowledge
HumanEval	93.7%	Isolated code completion
SWE-bench Verified	~82%	Repo-level bug fixing
GPQA Diamond	87.7%	Graduate science questions
ITBench-AA	<50%	Agentic enterprise IT

See the gap? Models that ace graduate-level science and write production code can't reliably fix a broken Helm chart. That's the gap between "smart" and "useful in production."

Where models fall down

The failure modes are pretty consistent across model families. Based on the benchmark's reported results and IBM Research's commentary, the recurring issues include:

Agents misinterpret monitoring signals and pursue the wrong hypothesis
Multi-step plans break when intermediate commands produce unexpected output
Models hallucinate command flags or YAML keys that don't exist in the target tool
Recovery from a failed action is weak; agents often double down rather than re-planning
Long context windows don't help when the model can't tell which log lines actually matter

That last one is the killer. Modern frontier models have context windows that run to a million tokens or more. Doesn't matter. If the model can't identify the three lines in the kubelet log that actually point to the problem, all that context is just noise.
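
One of the failure modes listed above, hallucinated command flags, is cheap to guard against before a command ever runs. Here's a rough Python sketch, assuming you maintain an allowlist of flags per tool. The `unknown_flags` helper is hypothetical, and the allowlist below covers only a handful of `kubectl logs` flags, so treat it as illustrative rather than complete.

```python
import shlex

# Illustrative, incomplete allowlist of long flags accepted by `kubectl logs`.
KNOWN_LOGS_FLAGS = {
    "--namespace",
    "--container",
    "--tail",
    "--since",
    "--previous",
    "--follow",
    "--timestamps",
}


def unknown_flags(command: str, known_flags: set[str]) -> list[str]:
    """Return long flags in `command` that are missing from the allowlist."""
    unknown = []
    for token in shlex.split(command):
        if token == "--":  # everything after a bare "--" is positional
            break
        if token.startswith("--"):
            flag = token.split("=", 1)[0]  # "--tail=50" -> "--tail"
            if flag not in known_flags:
                unknown.append(flag)
    return unknown


proposed = "kubectl logs my-pod --tail=50 --grep error"
print(unknown_flags(proposed, KNOWN_LOGS_FLAGS))  # ['--grep']
```

A check like this only catches flags that don't exist. It says nothing about whether a valid command is the right one to run.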

What the comparison reveals

Let's put ITBench-AA next to other agentic benchmarks to see the shape of the problem.

Benchmark	Top model	Score	Domain
SWE-bench Verified	Claude Opus 4.6	81.4% (self-reported)	Software engineering
ARC-AGI	o3 (high compute)	87.5%	Abstract reasoning
GPQA Diamond	o3	87.7%	Science Q&A
ITBench-AA	Claude Opus 4.7 (Max Effort)	47% (reported by Artificial Analysis)	Enterprise IT ops

The scores tell a story. Anything that looks like "closed-form reasoning" or "code in a sandbox" is approaching saturation. The moment you introduce real systems, real noise, and real tools that punish hallucination, the top score sits roughly 35 to 40 points below the other benchmarks in the table.
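
The size of that drop can be checked against the table above. A quick Python sketch, using only the scores listed there; note that it compares top scores across different benchmarks and different models, so it's a rough gauge, not a like-for-like measurement.

```python
# Top scores copied from the comparison table above (percent).
ITBENCH_AA_TOP = 47.0
other_benchmarks = {
    "SWE-bench Verified": 81.4,
    "ARC-AGI": 87.5,
    "GPQA Diamond": 87.7,
}


def gap_points(score: float, baseline: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(score - baseline, 1)


gaps = {name: gap_points(score, ITBENCH_AA_TOP)
        for name, score in other_benchmarks.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.1f} points above the ITBench-AA top score")
# SWE-bench Verified: 34.4 points above the ITBench-AA top score
# ARC-AGI: 40.5 points above the ITBench-AA top score
# GPQA Diamond: 40.7 points above the ITBench-AA top score
```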

And that's exactly the work that enterprise buyers care about. AIOps. Incident response. Configuration management. Nobody is paying $5 per million tokens to autocomplete their CSS.

Why this is actually a healthy result

It's tempting to read sub-50% scores as bad news for AI agents. We'd argue the opposite. The benchmark community has been starved for evaluations that aren't already saturated. When every model scores 90%+ on a test, you've stopped measuring capability and started measuring training data overlap.

A benchmark where the best model gets 47% is a benchmark with room to grow. It's a real signal. You can run the same eval six months from now and actually see whether agents got better at the work that pays the bills.

The most useful benchmarks are the ones where current models clearly fail. Saturation is the enemy of progress measurement.

That's the deeper value of ITBench-AA. It gives us a ruler that won't max out by next quarter.

Practical implications for teams

So what does this mean if you're an engineering leader thinking about deploying AI agents in your stack? A few things worth taking seriously.

Be skeptical of "autonomous SRE" pitches

If a vendor walks in claiming their agent can autonomously resolve incidents, ask them what their ITBench-AA score is. Or any IT-specific eval. The fact that frontier models score under 50% means anything claiming "full autonomy" is either overselling or running on rails so narrow it's not really agentic.

A human-in-the-loop design is almost certainly the right pattern for production IT work in 2026. The agent proposes, a human approves. That's where the real productivity gains are right now.
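
The "agent proposes, human approves" pattern can be sketched in a few lines of Python. Everything here is hypothetical: `ProposedAction`, `execute_with_approval`, the `touches_production` flag, and the deployment name are made up for illustration, and the `run` callback is a stub that executes nothing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    command: str              # what the agent wants to run
    rationale: str            # the agent's explanation, shown to the reviewer
    touches_production: bool  # assumed to be set by the surrounding system


def execute_with_approval(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    run: Callable[[str], str],
) -> str:
    """Run low-stakes actions directly; gate production changes on a human."""
    if action.touches_production and not approve(action):
        return f"rejected: {action.command}"
    return run(action.command)


def fake_run(command: str) -> str:
    """Stand-in executor so the sketch has no side effects."""
    return f"ran: {command}"


restart = ProposedAction(
    command="kubectl rollout restart deployment/checkout",
    rationale="Pods are crash-looping after the last config change.",
    touches_production=True,
)

# The reviewer says no, so nothing runs.
print(execute_with_approval(restart, approve=lambda a: False, run=fake_run))
# rejected: kubectl rollout restart deployment/checkout
```

In a real setup the `approve` callback would be a ticket, a chat prompt, or a CLI confirmation. The design choice that matters is that the gate sits outside the agent, so the agent can't talk its way past it.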

Coding agents and IT agents are different products

Claude Code and Cursor are pretty solid at what they do because software engineering inside a single repo is a constrained problem. The codebase is bounded. The tools are well-documented. The failure modes are quick to spot.

IT ops is unbounded. Every Kubernetes cluster is configured differently. Every observability stack has its own quirks. Tool documentation often lies. A coding agent that scores 80%+ on SWE-bench won't come anywhere near that on ITBench-AA. Don't assume the skills transfer.

For a couple of years the message from frontier labs has been: just use the biggest general model and prompt it well. ITBench-AA suggests that approach is hitting a wall for specialized enterprise domains. Expect to see more domain-specific models, scaffolds, and tool wrappers built specifically for ops work over the next year. IBM is clearly positioning to lead that conversation.

The surprises

A few things stood out as genuinely interesting from the benchmark release.

First, reasoning models don't dominate the way you'd expect. On MATH and GPQA, the gap between reasoning models and non-reasoning models is enormous. On ITBench-AA, the gap shrinks substantially. Long chains of thought don't help much when the bottleneck is interpreting tool output, not planning.

Second, scaffolding seems to matter as much as the underlying model. The same model with different agent harnesses produces very different results. This mirrors what's been happening on SWE-bench, where a good scaffold can add 10+ points to a model's score.

Third, and this one isn't gonna sit well with the open-source crowd, the gap between frontier models and the best open-weight models appears wider on agentic IT tasks than on traditional benchmarks. The complexity of real tool use exposes weaknesses that don't show up in static evaluations.

What to watch next

The benchmark will only matter if it actually drives progress. A few signals to watch over the next six months:

Whether IBM and Artificial Analysis publish quarterly score updates
Whether other labs adopt ITBench-AA as a target metric in their model release notes
Whether scaffolding teams (think the people behind SWE-agent and OpenHands) port their work to IT domains
Whether scores actually improve, or whether we discover this is a much harder ceiling than coding agents have hit

If scores are still stuck below 50% a year from now, that's a story about the limits of current agent architectures. If they're at 70% by mid-2027, the enterprise AI pitch finally has a leg to stand on.

For now? Enterprise IT remains stubbornly hard for AI agents. And that's actually useful information.