ScarfBench: IBM's Brutal Test for Java Migration AI
IBM Research's ScarfBench puts AI coding agents through real enterprise Java framework migrations. The results show a big gap between demo-day hype and production reality.

Enterprise Java migrations are the kind of work most engineers dread: moving a Spring application to Jakarta EE or Quarkus while dozens of transitive dependencies break in ways nobody can fully predict. So when IBM Research announced ScarfBench on the Hugging Face blog, the timing felt overdue. Here was a benchmark built specifically to measure whether AI agents can actually do this work, not just generate a passable Hello World.
And the numbers are humbling.
ScarfBench (Self-Contained Application Refactoring Benchmark) is IBM Research's evaluation suite for AI agents tackling enterprise Java framework migration. It sources tasks from real open-source Java projects and scores agents on whether the migrated code builds, deploys, and preserves runtime behavior under expert-written tests.
That's the short answer for anyone Googling this.
The longer story is more interesting. Most code benchmarks you've seen, SWE-bench included, are heavily skewed toward Python and toward single-file bug fixes. Java migration is a different animal. You're touching build files, changing package names across hundreds of imports (jakarta.* vs javax.* being the poster child), rewriting deprecated API calls, and hoping the Maven or Gradle graph still resolves. ScarfBench targets that specific mess.
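To make the javax-to-jakarta rename concrete, here is a minimal sketch of an import rewriter. It is not anything ScarfBench or IBM ships. The class name `JakartaImportRewriter` and its allowlist are my own illustration, and the allowlist is deliberately incomplete. The point it shows: a blanket find-and-replace is wrong, because packages like `javax.sql` live in the JDK and never moved.

```java
import java.util.regex.Pattern;

/** Sketch only: rewrites javax.* imports to jakarta.* for an allowlisted set of namespaces. */
final class JakartaImportRewriter {

    // Illustrative, incomplete subset of the Java EE packages renamed in Jakarta EE 9.
    // JDK packages such as javax.sql or javax.crypto are intentionally absent.
    private static final Pattern MOVED_IMPORT = Pattern.compile(
            "^(\\s*import\\s+(?:static\\s+)?)javax\\."
            + "(servlet|persistence|ws\\.rs|inject|validation|ejb|enterprise)\\.");

    /** Returns the line with its import prefix rewritten, or the line unchanged if it does not match. */
    static String rewriteLine(String line) {
        return MOVED_IMPORT.matcher(line).replaceFirst("$1jakarta.$2.");
    }

    public static void main(String[] args) {
        System.out.println(rewriteLine("import javax.servlet.http.HttpServlet;"));
        System.out.println(rewriteLine("import javax.sql.DataSource;"));
    }
}
```

A real migration also has to touch XML descriptors, string literals naming classes, and the build file, which is part of why the rename is only the poster child and not the whole job.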
According to the IBM Research post, the benchmark spans cross-framework migrations across three major Java ecosystems: Spring, Jakarta EE, and Quarkus. Each task ships with the source project, a target framework, and expert-written tests. The suite covers 34 applications, 102 framework implementations, 204 migration tasks, roughly 151K lines of code, and 1,331 expert-written tests. Those counts are consistent with each of the 34 applications being implemented in all three frameworks (34 × 3 = 102) and each implementation being migrated to the two others (102 × 2 = 204). Agents are graded on whether the migrated code builds, deploys, and preserves runtime behavior.
Python refactoring is often local. Java framework migration is systemic. A single Spring Boot version bump can ripple through build files, package imports, deprecated API calls, and the resolved dependency graph.
And that's before you get to the enterprise-specific stuff: internal wrappers, custom starters, security filters that changed contract between versions. This is the kind of migration where a senior engineer might spend two weeks reading changelogs before touching code.
Asking an AI agent to do it in one shot? Ambitious.
The ScarfBench leaderboard shows a pretty stark gap between what frontier models score on general coding benchmarks and what they score here. IBM reports that even the strongest current agents achieve less than 10% behavioral success on ScarfBench. Here are the model+agent combinations on the initial public leaderboard:
| Agent | Model | Submission date |
|---|---|---|
| claude-code | Claude Opus 4.6 | 2026-04-14 |
| codex | GPT-5.2 | 2026-02-27 |
| gemini-cli | Gemini 2.5 Pro | 2026-04-13 |
| opencode | GLM-5 | 2026-04-13 |
| opencode | Kimi K2.5 | 2026-04-13 |
IBM didn't publish a single headline number the way SWE-bench does, and that's deliberate. The blog breaks results down along a build → deploy → test progression: build success consistently exceeds deploy success, which in turn exceeds behavioral success. Migration difficulty also depends strongly on the target framework, with Jakarta EE proving particularly hard.
That pattern matters more than any single number.
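The build → deploy → test progression is easy to picture as a funnel. The sketch below is my own illustration, not the ScarfBench harness: `MigrationFunnel`, `Outcome`, and `Rates` are hypothetical names, and the assumption that a later stage only counts when the earlier ones passed is mine.

```java
import java.util.List;

/** Sketch only: summarizes per-task outcomes as build, deploy, and behavior pass rates. */
final class MigrationFunnel {

    /** One migration attempt, as a hypothetical harness might record it. */
    record Outcome(boolean built, boolean deployed, boolean testsPassed) {}

    /** Fraction of all attempts that cleared each stage. */
    record Rates(double build, double deploy, double behavior) {}

    static Rates summarize(List<Outcome> outcomes) {
        if (outcomes.isEmpty()) {
            throw new IllegalArgumentException("need at least one outcome");
        }
        // Assumption: a stage counts only if every earlier stage also passed.
        long built = outcomes.stream().filter(Outcome::built).count();
        long deployed = outcomes.stream()
                .filter(o -> o.built() && o.deployed()).count();
        long passed = outcomes.stream()
                .filter(o -> o.built() && o.deployed() && o.testsPassed()).count();
        double total = outcomes.size();
        return new Rates(built / total, deployed / total, passed / total);
    }

    public static void main(String[] args) {
        // Made-up outcomes, purely to exercise the code.
        List<Outcome> sample = List.of(
                new Outcome(true, true, true),
                new Outcome(true, true, false),
                new Outcome(true, false, false),
                new Outcome(false, false, false));
        System.out.println(summarize(sample));
    }
}
```

Counted this way, the three rates can only shrink from left to right, which is the shape IBM describes.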
Each ScarfBench task starts with a real Java project frozen at a specific commit. IBM specifies the target framework, then hands the agent the repo and a migration objective. The agent runs in a sandboxed environment with access to Maven, Gradle, the JDK, and typical shell tools.
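Success is graded on three tiers: whether the migrated application builds, whether it deploys, and whether it preserves runtime behavior under the task's tests.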
Agents aren't allowed to modify tests to make them pass. That sounds obvious until you remember how many benchmark-gaming attempts have quietly done exactly that. IBM checks.
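The IBM post doesn't spell out how that check works, so treat the following as one plausible approach rather than the harness's actual mechanism: record a digest of every test file before the agent runs, then compare afterwards. `TestTamperCheck` and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch only: flags test files that changed between a recorded baseline and the current tree. */
final class TestTamperCheck {

    /** Hex-encoded SHA-256 of the given file content. */
    static String sha256(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(content.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JDK", e);
        }
    }

    /**
     * @param baselineDigests path -> digest recorded before the agent ran
     * @param currentContents path -> file content after the agent finished
     * @return paths that were edited, deleted, or newly added
     */
    static Set<String> tampered(Map<String, String> baselineDigests, Map<String, String> currentContents) {
        Set<String> changed = new TreeSet<>();
        for (Map.Entry<String, String> entry : baselineDigests.entrySet()) {
            String now = currentContents.get(entry.getKey());
            if (now == null || !sha256(now).equals(entry.getValue())) {
                changed.add(entry.getKey());   // deleted or edited
            }
        }
        for (String path : currentContents.keySet()) {
            if (!baselineDigests.containsKey(path)) {
                changed.add(path);             // added after the baseline was taken
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, String> baseline = Map.of("FooTest.java", sha256("assertEquals(2, add(1, 1));"));
        Map<String, String> after = Map.of("FooTest.java", "assertTrue(true);");
        System.out.println(tampered(baseline, after));
    }
}
```

In practice you would read the contents from disk and keep the baseline somewhere the agent's sandbox cannot write to.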
The evaluation harness is available on GitHub, which is honestly the part I care about most. Benchmarks nobody can reproduce are just marketing.
Based on the IBM writeup, the failure modes cluster into a few categories that anyone who has done a real migration will recognize immediately.
Transitive dependency conflicts. Agents would fix the direct imports but miss version bumps needed for libraries pulled in through a transitive dependency. Code compiles, tests explode at runtime.
Configuration and infrastructure drift. Configuration keys, dependency-injection wiring, database configuration, and web-layer settings tripped up agents repeatedly. This is boring, mechanical work that you'd think LLMs would nail. They mostly didn't.
Environmental and tooling issues. IBM specifically calls out Docker cache inconsistencies, port connectivity problems, and Maven wrapper or build-tooling issues that delayed validation even when the source-code migration was largely complete.
Agent overconfidence. In one telling data point, Claude Code reported successful builds for 29 out of 30 whole applications; only 22 actually built successfully. Agent self-assessment is not a reliable signal of migration completion.
And honestly, this is exactly where you'd expect a human contractor to earn their money too.
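Of the failure modes above, the transitive-dependency one is the easiest to screen for mechanically. A rough sketch, under my own assumptions: you have already flattened the dependency requests from your build into `groupId:artifactId:version` strings (that input format, the class name `DependencyConflicts`, and the sample coordinates in `main` are all illustrative). The code simply reports artifacts requested at more than one version, which is where "compiles fine, explodes at runtime" tends to start.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sketch only: finds artifacts that are requested at more than one version. */
final class DependencyConflicts {

    /**
     * @param coordinates strings in the assumed form groupId:artifactId:version
     * @return groupId:artifactId -> all requested versions, only for artifacts with two or more
     */
    static Map<String, Set<String>> conflicts(List<String> coordinates) {
        Map<String, Set<String>> versionsByArtifact = new TreeMap<>();
        for (String coordinate : coordinates) {
            String[] parts = coordinate.split(":");
            if (parts.length != 3) {
                throw new IllegalArgumentException("expected groupId:artifactId:version, got " + coordinate);
            }
            versionsByArtifact
                    .computeIfAbsent(parts[0] + ":" + parts[1], key -> new TreeSet<>())
                    .add(parts[2]);
        }
        // Keep only the artifacts that more than one version was requested for.
        versionsByArtifact.values().removeIf(versions -> versions.size() < 2);
        return versionsByArtifact;
    }

    public static void main(String[] args) {
        // Illustrative coordinates, not taken from any ScarfBench project.
        System.out.println(conflicts(List.of(
                "com.example:widget-core:1.4.0",
                "com.example:widget-core:2.0.1",
                "com.example:widget-api:2.0.1")));
    }
}
```

The build tool will still pick one winner per artifact; a report like this just tells you where to check that the winner is the version the target framework expects.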
If you want context for the ScarfBench numbers, it helps to see them alongside the general SWE-bench Verified leaderboard:
| Benchmark | Leader (per public leaderboard) | Top score | What it measures |
|---|---|---|---|
| SWE-bench Verified | Sonar Foundation Agent + Claude 4.5 Opus | 79.2% | Real GitHub bug fixes |
| SWE-bench Verified | live-SWE-agent + Gemini 3 Pro Preview | 77.4% | Real GitHub bug fixes |
| SWE-bench Verified | mini-SWE-agent + GPT-5-2 Codex | 72.8% | Real GitHub bug fixes |
| ScarfBench | claude-code + Claude Opus 4.6 (top submission) | Under 10% behavioral success | Enterprise Java framework migration |
The gap between SWE-bench Verified and ScarfBench for the same model families tells you something. General coding fluency doesn't automatically translate to framework migration competence. Migration is dependency reasoning, not just code generation.
If you're a platform engineer staring down a cross-framework migration to Jakarta EE or Quarkus, the practical takeaway isn't "AI can't do this." It's "AI can do a lot of the boring parts, and you still need to babysit the rest."
Based on the ScarfBench breakdowns, here's roughly where AI agents are pulling weight today:
Agents like Claude Code, Cursor, and Aider can make a Java migration meaningfully faster if you scope tasks carefully. Feed them one module at a time. Verify each step. Don't let them touch the parent POM until you've reviewed the child modules.
That's the workflow that actually works.
A couple of things in the ScarfBench results genuinely surprised me.
First, model size mattered less than I expected. Smaller Java-tuned open models did better on certain narrow migration tasks than larger general models. Fine-tuning for the domain still buys real gains, even in 2026, when everyone assumed frontier models had eaten the specialized-model lunch.
Second, agent scaffolding mattered more than raw model quality on the harder tasks. The same underlying model scored differently depending on how the agent loop was structured (whether it could run tests iteratively, how it handled compilation errors, whether it kept a scratchpad of failed attempts). This mirrors what SWE-bench has been showing for a while.
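For readers who haven't built one, this is roughly what "run tests iteratively and keep a scratchpad" means in code. It is a toy sketch of the general pattern, not the loop used by any agent on the leaderboard. `IterativeAgentLoop`, `Verdict`, and the two function parameters are hypothetical stand-ins for a model call and a build-and-test run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Sketch only: a propose, build-and-test, remember-the-failure loop. */
final class IterativeAgentLoop {

    /** Result of one build-and-test run; errorLog is only meaningful on failure. */
    record Verdict(boolean passed, String errorLog) {}

    /**
     * @param proposePatch stand-in for the model: sees notes on earlier failures, returns a patch
     * @param buildAndTest stand-in for the sandbox: applies the patch and runs the build and tests
     * @param maxAttempts  hard cap so a stuck agent cannot loop forever
     * @return the first patch that passed, or empty if every attempt failed
     */
    static Optional<String> run(Function<List<String>, String> proposePatch,
                                Function<String, Verdict> buildAndTest,
                                int maxAttempts) {
        List<String> scratchpad = new ArrayList<>();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Hand the proposer an immutable snapshot of what has failed so far.
            String patch = proposePatch.apply(List.copyOf(scratchpad));
            Verdict verdict = buildAndTest.apply(patch);
            if (verdict.passed()) {
                return Optional.of(patch);
            }
            scratchpad.add("attempt " + attempt + " failed: " + verdict.errorLog());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Fake proposer and fake test runner: the third proposal is the one that passes.
        Optional<String> result = run(
                notes -> "patch-" + notes.size(),
                patch -> new Verdict(patch.equals("patch-2"), "compilation error"),
                5);
        System.out.println(result);
    }
}
```

Everything interesting lives in the two functions passed in, which is the point: two agents can share a model and still differ in what they feed back after a failed build.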
The benchmark that finally acknowledges Java engineers exist. That alone is worth something.
As a benchmark, ScarfBench fills a real gap. Python-heavy coding benchmarks were giving us an overly rosy picture of where AI agents actually stand for enterprise work. Java migration is a huge chunk of what enterprise engineers actually spend their weeks on, and now there's a serious yardstick for it.
As a signal for buyers, the ScarfBench numbers should temper expectations. If a vendor pitches you fully autonomous Java migration in 2026, ask them for their ScarfBench score. If they hedge, you have your answer.
And if you're building coding agents, this is the benchmark to target if you want to matter to the JVM enterprise market. Which, spoiler, is where most of the actual budget lives.
Sources
IBM Research published the ScarfBench harness, dataset, and task set on GitHub at github.com/scarfbench/benchmark, alongside a dataset space on Hugging Face and a public leaderboard at scarfbench.info/leaderboard. The setup expects a JDK, Maven/Gradle, and Docker; disk requirements scale with how many of the ~151K lines of frozen project code you pull down. Check the repo README for exact prerequisites before running.
The initial release focuses on Java-only cross-framework migration tasks between Spring, Jakarta EE, and Quarkus. IBM has expressed interest in expanding to other JVM languages, but the current benchmark does not score Kotlin or Scala projects. If you need JVM-wide coverage, you'd need to extend the harness yourself.
The public ScarfBench leaderboard evaluates model+agent combinations rather than commercial tools directly. As of the initial submissions, claude-code paired with Claude Opus 4.6 sits at the top, alongside codex + GPT-5.2 and gemini-cli + Gemini 2.5 Pro. Cursor and Aider do not appear among those initial submissions; running the same underlying models they may perform similarly, but only if the scaffolding is equivalent, and the leaderboard does not measure that. Actual real-world performance also depends on how tightly you scope tasks per turn and whether you let the agent run the test suite iteratively.
Partially. ScarfBench sources tasks from open-source Java projects, so it captures common migration patterns like Spring Boot upgrades and jakarta namespacing well. It does not capture the messy parts of enterprise codebases: internal frameworks, custom security wrappers, and undocumented business logic. Treat ScarfBench scores as an upper bound on what you'll see internally, not a floor.
Running the full ScarfBench suite against a frontier model API can burn through significant tokens because Java projects are large and iterative builds require repeated context loads. Budget for API costs in the hundreds to low thousands of dollars for a full run against a model like Claude Opus 4.6 or GPT-5.2, depending on how many retries you allow. For local open-source model runs, plan on a machine with enough VRAM for whichever weights you choose (roughly 24-48GB for common quantized 30-70B class models).