Where can I access the ScarfBench dataset and evaluation code?

IBM Research published the ScarfBench harness, dataset, and task set on GitHub at github.com/scarfbench/benchmark, alongside a dataset space on Hugging Face and a public leaderboard at scarfbench.info/leaderboard. The setup expects a JDK, Maven or Gradle, and Docker. Disk usage is likely to be dominated by dependency caches and Docker images rather than by the roughly 151K lines of frozen project code itself. Check the repo README for exact prerequisites before running.

Does ScarfBench cover Kotlin or Scala migrations, or only Java?

The initial release focuses on Java-only cross-framework migration tasks between Spring, Jakarta EE, and Quarkus. IBM has expressed interest in expanding to other JVM languages, but the current benchmark does not score Kotlin or Scala projects. If you need JVM-wide coverage, you'd need to extend the harness yourself.

How does ScarfBench performance compare between Claude Code, Cursor, and Aider on real migrations?

The public ScarfBench leaderboard evaluates model+agent combinations rather than commercial tools directly. As of the initial submissions, claude-code paired with Claude Opus 4.6 sits at the top, alongside codex + GPT-5.2 and gemini-cli + Gemini 2.5 Pro. Cursor and Aider do not appear among those submissions, so there is no direct ScarfBench comparison for them. Running the same underlying models with equivalent scaffolding, they would plausibly land in a similar range, but that is an inference, not a measured result. Real-world performance also depends on how tightly you scope tasks per turn and whether you let the agent run the test suite iteratively.

Can ScarfBench results predict how AI will handle a private enterprise Java codebase?

Partially. ScarfBench sources tasks from open-source Java projects, so it captures common migration patterns like Spring Boot upgrades and jakarta namespacing well. It does not capture the messy parts of enterprise codebases: internal frameworks, custom security wrappers, and undocumented business logic. Treat ScarfBench scores as an upper bound on what you'll see internally, not a floor.

What hardware or budget do you need to run ScarfBench evaluations end to end?

Running the full ScarfBench suite against a frontier model API can burn through significant tokens because Java projects are large and iterative builds require repeated context loads. Budget for API costs in the hundreds to low thousands of dollars for a full run against a model like Claude Opus 4.6 or GPT-5.2, depending on how many retries you allow. For local open-source model runs, plan on a machine with enough VRAM for whichever weights you choose (roughly 24-48GB for common quantized 30-70B class models).
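For a back-of-envelope figure, the estimate is just tasks times attempts times the per-attempt token bill. The sketch below assumes a flat per-token price and a fixed token count per attempt; only the 204-task count comes from IBM's description of the suite, and every other number is a placeholder to swap for your own measurements and your provider's actual pricing.

```python
def estimate_api_cost(tasks, attempts_per_task, input_tokens_per_attempt,
                      output_tokens_per_attempt, usd_per_million_input,
                      usd_per_million_output):
    """Rough API spend in USD for one full benchmark run."""
    per_attempt = (
        input_tokens_per_attempt * usd_per_million_input
        + output_tokens_per_attempt * usd_per_million_output
    ) / 1_000_000
    return tasks * attempts_per_task * per_attempt


# 204 is the reported ScarfBench task count. Every other value is a made-up
# placeholder, not a measured token count or a real price.
cost = estimate_api_cost(
    tasks=204,
    attempts_per_task=2,
    input_tokens_per_attempt=1_500_000,
    output_tokens_per_attempt=50_000,
    usd_per_million_input=1.0,
    usd_per_million_output=5.0,
)
print(f"${cost:,.2f}")
```

Doubling the retry budget doubles the bill, which is why the number of attempts you allow matters as much as the model you pick.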

ScarfBench: IBM's Brutal Test for Java Migration AI

Enterprise Java migrations are the kind of work most engineers dread: moving a Spring application to Jakarta EE or Quarkus while dozens of transitive dependencies break in ways nobody can fully predict. So when IBM Research announced ScarfBench on the Hugging Face blog, the timing felt overdue. Here was a benchmark built specifically to measure whether AI agents can actually do this work, not just generate a passable Hello World.

And the numbers are humbling.

What ScarfBench Actually Measures

ScarfBench (Self-Contained Application Refactoring Benchmark) is IBM Research's evaluation suite for AI agents tackling enterprise Java framework migration. It sources tasks from real open-source Java projects and scores agents on whether the migrated code compiles, passes the original test suite, and preserves runtime behavior.

That's the short answer for anyone Googling this.

The longer story is more interesting. Most code benchmarks you've seen, SWE-bench included, are heavily skewed toward Python and toward single-file bug fixes. Java migration is a different animal. You're touching build files, changing package names across hundreds of imports (jakarta.* vs javax.* being the poster child), rewriting deprecated API calls, and hoping the Maven or Gradle graph still resolves. ScarfBench targets that specific mess.
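To make the javax-to-jakarta rename concrete, here is a minimal Python sketch of the import rewrite. It is not part of ScarfBench or any agent. The prefix list is a small illustrative subset of the namespaces that moved in Jakarta EE 9, and the function name is hypothetical. The point it illustrates is that the rename is not a blind search-and-replace: JDK packages such as javax.sql keep their prefix.

```python
import re

# Namespaces that moved from javax.* to jakarta.* in Jakarta EE 9. This is a
# partial, illustrative list. JDK packages such as javax.sql or javax.crypto
# keep their javax prefix and must not be rewritten.
MOVED_PREFIXES = (
    "javax.servlet",
    "javax.persistence",
    "javax.ws.rs",
    "javax.inject",
    "javax.validation",
)

# Matches "import javax.foo.Bar" and "import static javax.foo.Bar.baz".
IMPORT_RE = re.compile(r"^(\s*import\s+(?:static\s+)?)(javax\.[\w.]+)", re.MULTILINE)


def migrate_imports(java_source: str) -> str:
    """Rewrite import lines whose package moved to the jakarta namespace."""

    def swap(match):
        keyword, name = match.group(1), match.group(2)
        for prefix in MOVED_PREFIXES:
            if name == prefix or name.startswith(prefix + "."):
                return keyword + "jakarta" + name[len("javax"):]
        return match.group(0)  # not in the moved list, leave untouched

    return IMPORT_RE.sub(swap, java_source)
```

Even this easy case needs an allow-list, and it only touches import lines. Fully qualified names in code, XML descriptors, and string literals would need separate handling.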

According to the IBM Research post, the benchmark spans cross-framework migrations across three major Java ecosystems: Spring, Jakarta EE, and Quarkus. Each task ships with the source project, a target framework, and expert-written tests. The suite covers 34 applications, 102 framework implementations, 204 migration tasks, roughly 151K lines of code, and 1,331 expert-written tests. Agents are graded on whether the migrated code builds, deploys, and preserves runtime behavior.
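The reported counts fit together neatly under one reading: each application exists in all three frameworks, and each ordered source-to-target pair is a task. That decomposition is an inference from the numbers, not something quoted from IBM, but the arithmetic checks out:

```python
from itertools import permutations

FRAMEWORKS = ("Spring", "Jakarta EE", "Quarkus")
APPLICATIONS = 34  # reported by IBM

# Assumption: every application is implemented in all three frameworks...
implementations = APPLICATIONS * len(FRAMEWORKS)

# ...and every ordered (source, target) pair of distinct frameworks is a task.
directions = list(permutations(FRAMEWORKS, 2))
tasks = APPLICATIONS * len(directions)

print(implementations, tasks)  # prints: 102 204
```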

Why Java Migration Is a Uniquely Hard Test

Python refactoring is often local. Java framework migration is systemic. A single Spring Boot version bump can require:

Package renames across every import statement
Configuration property name changes
Deprecated API replacements with different call signatures
New annotation semantics that break at runtime, not compile time
Test framework upgrades that ripple through fixtures

And that's before you get to the enterprise-specific stuff: internal wrappers, custom starters, security filters that changed contract between versions. This is the kind of migration where a senior engineer might spend two weeks reading changelogs before touching code.

Asking an AI agent to do it in one shot? Ambitious.

The Results Nobody Expected

The ScarfBench leaderboard shows a pretty stark gap between what frontier models score on general coding benchmarks and what they score here. IBM reports that even the strongest current agents achieve less than 10% behavioral success on ScarfBench. Here are the model+agent combinations on the initial public leaderboard:

Agent	Model	Submission date
claude-code	Claude Opus 4.6	2026-04-14
codex	GPT-5.2	2026-02-27
gemini-cli	Gemini 2.5 Pro	2026-04-13
opencode	GLM-5	2026-04-13
opencode	Kimi K2.5	2026-04-13

IBM didn't publish a single headline number the way SWE-bench does, and that's deliberate. The blog breaks results down along a build → deploy → test progression: compile success consistently exceeds deploy success, which in turn exceeds behavioral success. Migration difficulty also depends strongly on the target framework, with Jakarta EE proving particularly hard.

That pattern matters more than any single number.
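A small sketch shows why the build → deploy → test progression produces strictly shrinking numbers when each stage only counts if the previous one passed. The class, the function, and all twenty outcomes below are invented for illustration. They are not ScarfBench data or IBM's scoring code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    built: bool
    deployed: bool
    behavior_ok: bool


def funnel_rates(results):
    """Return (build, deploy, behavior) rates; each stage requires the previous one."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    built = sum(r.built for r in results)
    deployed = sum(r.built and r.deployed for r in results)
    behaved = sum(r.built and r.deployed and r.behavior_ok for r in results)
    return built / n, deployed / n, behaved / n


# Hypothetical outcomes for twenty tasks, invented for illustration only.
results = (
    [TaskResult(True, True, True)] * 1
    + [TaskResult(True, True, False)] * 7
    + [TaskResult(True, False, False)] * 6
    + [TaskResult(False, False, False)] * 6
)
build_rate, deploy_rate, behavior_rate = funnel_rates(results)
print(f"build {build_rate:.0%}, deploy {deploy_rate:.0%}, behavior {behavior_rate:.0%}")
```

Reporting only the first number would make this hypothetical agent look competent. Reporting only the last would hide where it actually fails.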

The Methodology, In Plain English

Each ScarfBench task starts with a real Java project frozen at a specific commit. IBM specifies the target framework, then hands the agent the repo and a migration objective. The agent runs in a sandboxed environment with access to Maven, Gradle, the JDK, and typical shell tools.

Success is graded on three tiers:

Compile: does the code build without errors?
Test pass: does the original test suite still pass?
Behavioral validation: do runtime checks confirm the migrated application preserves original behavior?

Agents aren't allowed to modify tests to make them pass. That sounds obvious until you remember how many benchmark-gaming attempts have quietly done exactly that. IBM checks.
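The source does not say how IBM enforces this, so the following is only one plausible way to do it: hash every test file before the agent runs and compare afterwards. The function names and the default src/test location (the conventional Maven layout for a single module) are assumptions.

```python
import hashlib
from pathlib import Path


def snapshot_tests(repo_root, test_dir="src/test"):
    """Map each file under the test directory to its SHA-256 digest."""
    root = Path(repo_root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted((root / test_dir).rglob("*"))
        if path.is_file()
    }


def tampered_tests(before, after):
    """Return test files that were changed, deleted, or added since the snapshot."""
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

Take one snapshot before handing the repo to the agent and another when it finishes. Any non-empty result from tampered_tests means the test suite is no longer the one you started with.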

The evaluation harness is available on GitHub, which is honestly the part I care about most. Benchmarks nobody can reproduce are just marketing.

Where Agents Actually Struggled

Based on the IBM writeup, the failure modes cluster into a few categories that anyone who has done a real migration will recognize immediately.

Transitive dependency conflicts. Agents would fix the direct imports but miss version bumps needed for libraries pulled in through a transitive dependency. Code compiles, tests explode at runtime.

Configuration and infrastructure drift. Configuration keys, dependency-injection wiring, database configuration, and web-layer settings tripped up agents repeatedly. This is boring, mechanical work that you'd think LLMs would nail. They mostly didn't.

Environmental and tooling issues. IBM specifically calls out Docker cache inconsistencies, port connectivity problems, and Maven wrapper or build-tooling issues that delayed validation even when the source-code migration was largely complete.

Agent overconfidence. In one telling data point, Claude Code reported successful builds for 29 out of 30 whole applications; only 22 actually built successfully. Agent self-assessment is not a reliable signal of migration completion.

And honestly, this is exactly where you'd expect a human contractor to earn their money too.
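The 29-versus-22 figure is worth turning into rates. The only assumption in this quick calculation is that every build that really succeeded was also one the agent claimed.

```python
TOTAL_APPS = 30
CLAIMED_BUILDS = 29   # builds the agent reported as successful
VERIFIED_BUILDS = 22  # builds that actually succeeded

# Assumption: every verified build is among the ones the agent claimed.
false_claims = CLAIMED_BUILDS - VERIFIED_BUILDS
claim_precision = VERIFIED_BUILDS / CLAIMED_BUILDS

print(f"claimed {CLAIMED_BUILDS / TOTAL_APPS:.0%}, verified {VERIFIED_BUILDS / TOTAL_APPS:.0%}")
print(f"{false_claims} false claims, precision {claim_precision:.0%}")
# prints:
# claimed 97%, verified 73%
# 7 false claims, precision 76%
```

Roughly one in four "it builds" reports was wrong, so the build has to be re-run outside the agent before anyone believes it.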

Comparing to Other Coding Benchmarks

If you want context for the ScarfBench numbers, it helps to see them alongside the general SWE-bench Verified leaderboard:

Benchmark	Leader (per public leaderboard)	Top score	What it measures
SWE-bench Verified	Sonar Foundation Agent + Claude 4.5 Opus	79.2%	Real GitHub bug fixes
SWE-bench Verified	live-SWE-agent + Gemini 3 Pro Preview	77.4%	Real GitHub bug fixes
SWE-bench Verified	mini-SWE-agent + GPT-5-2 Codex	72.8%	Real GitHub bug fixes
ScarfBench	claude-code + Claude Opus 4.6 (top submission)	Under 10% behavioral success	Enterprise Java framework migration

The gap between SWE-bench Verified and ScarfBench for the same model families is the point: agents that resolve more than 70% of SWE-bench Verified issues stay under 10% behavioral success here. General coding fluency doesn't automatically translate to framework migration competence. Migration is dependency reasoning, not just code generation.

What This Means If You Ship Java for a Living

If you're a platform engineer staring down a cross-framework migration to Jakarta EE or Quarkus, the practical takeaway isn't "AI can't do this." It's "AI can do a lot of the boring parts, and you still need to babysit the rest."

Based on the ScarfBench breakdowns, here's roughly where AI agents are pulling weight today:

Renaming imports (jakarta.*, package moves): high success
Property key updates in config files: moderate, needs review
Test framework annotation changes: moderate for standard cases
Custom starter or security filter refactoring: low, needs a human
Multi-module version alignment: low without scaffolding
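The property-key item is a good example of "moderate, needs review". A sketch of a Spring-to-Quarkus key translation might look like the following. The four mappings are well-known equivalents, but the table is deliberately tiny, the function name is hypothetical, and the parser only understands plain key=value lines.

```python
# A few well-known Spring Boot keys and their Quarkus counterparts. The table
# is illustrative only; a real migration needs a much longer one.
SPRING_TO_QUARKUS = {
    "server.port": "quarkus.http.port",
    "spring.datasource.url": "quarkus.datasource.jdbc.url",
    "spring.datasource.username": "quarkus.datasource.username",
    "spring.datasource.password": "quarkus.datasource.password",
}


def migrate_properties(text):
    """Translate known keys and collect the ones that need human review."""
    migrated, needs_review = [], []
    for line in text.splitlines():
        stripped = line.strip()
        # Pass blank lines, comments, and anything that is not key=value through.
        if not stripped or stripped.startswith(("#", "!")) or "=" not in stripped:
            migrated.append(line)
            continue
        key, value = (part.strip() for part in stripped.split("=", 1))
        if key in SPRING_TO_QUARKUS:
            migrated.append(f"{SPRING_TO_QUARKUS[key]}={value}")
        else:
            migrated.append(line)      # unknown key: keep as-is
            needs_review.append(key)   # and flag it for a human
    return "\n".join(migrated), needs_review
```

The needs_review list is the useful output. Keys with no one-to-one equivalent are exactly where a human has to decide what the target framework should do.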

Agents like Claude Code, Cursor, and Aider can make a Java migration meaningfully faster if you scope tasks carefully. Feed them one module at a time. Verify each step. Don't let them touch the parent POM until you've reviewed the child modules.

That's the workflow that actually works.

The Surprises Worth Flagging

A couple of things in the ScarfBench results genuinely surprised me.

First, model size mattered less than I expected. Smaller Java-tuned open models did better on certain narrow migration tasks than larger general models. Fine-tuning for the domain still buys real gains, even in 2026, when everyone assumed frontier models had eaten the specialized-model lunch.

Second, agent scaffolding mattered more than raw model quality on the harder tasks. The same underlying model scored differently depending on how the agent loop was structured (whether it could run tests iteratively, how it handled compilation errors, whether it kept a scratchpad of failed attempts). This mirrors what SWE-bench has been showing for a while.

ScarfBench is also the benchmark that finally acknowledges Java engineers exist. That alone is worth something.

The Verdict on ScarfBench

As a benchmark, ScarfBench fills a real gap. Python-heavy coding benchmarks were giving us an overly rosy picture of where AI agents actually stand for enterprise work. Java migration is a huge chunk of what enterprise engineers actually spend their weeks on, and now there's a serious yardstick for it.

As a signal for buyers, the ScarfBench numbers should temper expectations. If a vendor pitches you fully autonomous Java migration in 2026, ask them for their ScarfBench score. If they hedge, you have your answer.

And if you're building coding agents, this is the benchmark to target if you want to matter to the JVM enterprise market. Which, spoiler, is where most of the actual budget lives.
