Where can I download the DeepSWE benchmark and run it myself?

The full benchmark is open-sourced at github.com/datacurve-ai/deep-swe. You can clone the repo, pull the task set, and run verifiers locally against any model with an API endpoint. The team is also accepting community contributions for new tasks and verifier patches via pull request.

Is DeepSWE compatible with agent frameworks like SWE-agent or AutoCodeRover?

Yes. DeepSWE follows a similar input/output contract to SWE-bench style harnesses, so existing agent scaffolds should work with minor adapter code. Because solutions are longer, you may need to raise output-token and context limits or chunk solutions, especially when driving Claude Opus or Gemini 3 Pro through an agent loop.

How much does it cost to evaluate a frontier model on DeepSWE?

Cost varies by model, but expect roughly 2x the token usage of a SWE-bench Pro run because of the longer output requirement. For Claude Opus 4.6 at $5 per million input tokens and $25 per million output tokens, a full evaluation pass typically lands in the $200-500 range depending on retries. Cheaper models like Mistral Large 2.1 will run under $100.
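To sanity-check a budget before you start, the arithmetic can be sketched in a few lines of Python. Everything numeric below except the $5/$25 pricing is a placeholder assumption: the task count, the per-task token counts, and the retry rate are illustrative inputs, not figures published by the DeepSWE team, so swap in your own measurements from a small pilot run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


def estimate_run_cost(tasks, input_tokens_per_task, output_tokens_per_task,
                      pricing, retry_rate=0.0):
    """Estimate the USD cost of one evaluation pass.

    retry_rate is the fraction of tasks that get re-run once
    (0.2 means 20% extra attempts on average).
    """
    attempts = tasks * (1.0 + retry_rate)
    input_cost = attempts * input_tokens_per_task / 1_000_000 * pricing.input_per_mtok
    output_cost = attempts * output_tokens_per_task / 1_000_000 * pricing.output_per_mtok
    return input_cost + output_cost


if __name__ == "__main__":
    opus = Pricing(input_per_mtok=5.0, output_per_mtok=25.0)
    # Placeholder inputs: 100 tasks, 500k input tokens per task (agent loops
    # re-read a lot of context), 20k output tokens per task.
    print(f"no retries:  ${estimate_run_cost(100, 500_000, 20_000, opus):.2f}")
    print(f"20% retries: ${estimate_run_cost(100, 500_000, 20_000, opus, retry_rate=0.2):.2f}")
```

In agent loops the input side usually dominates the bill because the scaffold resends context on every turn, which is why the retry rate and the per-task input count move the total more than the output price does.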

Does DeepSWE include security-sensitive or production-critical tasks?

The current task set focuses on functional correctness rather than security hardening or production-grade concerns like backwards compatibility. If you need security-specific evaluation, pair DeepSWE with a tool like Snyk or a dedicated security benchmark. The team has mentioned security tasks may come in a future release.

Will SWE-bench be retired now that DeepSWE exists?

Unlikely in the near term. SWE-bench Verified remains the most widely cited coding benchmark in research papers and lab announcements, and it has 18+ months of historical model scores attached to it. DeepSWE will probably complement rather than replace it, similar to how LiveCodeBench coexists with HumanEval.

DeepSWE Benchmark: 91 Repos, 5 Languages, Zero Leaks

A new contamination-free coding benchmark just dropped on r/MachineLearning, and it might be the cleanest look we've gotten at how frontier models actually perform on real software engineering work. It's called DeepSWE, built by datacurve-ai, and it tries to fix the biggest problem with existing public benchmarks: data leakage.

The pitch is simple. Every task is written from scratch. No scraped GitHub PRs. No commits the models might have seen during pretraining. And the solutions require roughly 5.5x more code than SWE-bench Pro tasks, despite prompts that are about half the length.

So what does the data actually show? Let's get into it.

What's the DeepSWE benchmark?

DeepSWE is an open-source coding benchmark covering 91 repositories across 5 programming languages, designed to evaluate how frontier AI models handle real software engineering tasks. Unlike SWE-bench, every task is hand-authored rather than scraped from public repos, eliminating pretraining contamination. Verifiers test behavior, not implementation details.

Bar chart comparing task characteristics of DeepSWE and SWE-bench Pro

And that last part matters more than it sounds. Most public coding benchmarks lean on test files that ship with the original PR, which means models can sometimes overfit to the test shape rather than solve the actual problem. DeepSWE flips this by having humans write verifiers from scratch that check what the code does, not how it's written.
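To make the distinction concrete, here is a minimal sketch of what a behavior-checking verifier looks like in Python. The task (a slugify helper), the test cases, and the function names are all hypothetical and chosen for illustration; this is not DeepSWE's actual verifier format. The point is only that two very differently written solutions both pass, because the check looks at outputs and never at source text.

```python
import re


def slugify_reference(title: str) -> str:
    """One possible solution: regex-based."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def slugify_alternative(title: str) -> str:
    """A differently written solution: character loop, no regex."""
    out = []
    for ch in title.lower():
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif out and out[-1] != "-":
            # Collapse any run of separators into a single dash.
            out.append("-")
    return "".join(out).strip("-")


def slugify_buggy(title: str) -> str:
    """A plausible-looking but wrong solution: punctuation survives."""
    return title.lower().replace(" ", "-")


# Input -> expected output pairs (hypothetical task spec).
CASES = {
    "Hello, World!": "hello-world",
    "  Spaces  ": "spaces",
    "Rust & Go 2026": "rust-go-2026",
    "": "",
}


def behavior_verifier(candidate) -> bool:
    """Pass if the candidate produces the expected output for every case."""
    return all(candidate(inp) == expected for inp, expected in CASES.items())


if __name__ == "__main__":
    for fn in (slugify_reference, slugify_alternative, slugify_buggy):
        print(fn.__name__, "passes" if behavior_verifier(fn) else "fails")
```

A verifier that diffed the candidate against the reference patch would reject the loop-based version even though it is correct, which is the failure mode behavior-level checks are meant to avoid.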

The benchmark was posted on r/MachineLearning in late June 2026, and the GitHub repo is already public.

Why another coding benchmark?

Fair question. We already have HumanEval, LiveCodeBench, SWE-bench Verified, SWE-bench Pro, BigCodeBench, and roughly forty others, plus our own agentic LLM benchmark roundup. Why one more?

Because most of them are leaking like sieves.

HumanEval was published in 2021. Every serious frontier model has effectively memorized it. SWE-bench is built from real GitHub PRs, which means the patches sit in pretraining corpora. When Claude Opus 4.5 lands near 79% on SWE-bench Verified, we genuinely do not know how much of that is real generalization versus recall.

The DeepSWE authors argue their four pillars address this directly:

Contamination-free: tasks written from scratch
High diversity: 91 repos across 5 languages
Real complexity: 5.5x more code per solution than SWE-bench Pro
Reliable verification: behavior-checking verifiers written by humans

That third point is the spicy one. Long prompts can hide poor model reasoning behind verbose spec-following. Short prompts that require lots of output force the model to actually plan.

Benchmark methodology breakdown

The construction process is the part worth dwelling on. According to the project's published materials, tasks are authored by human engineers who:

Pick a real repository (often mid-sized OSS projects, not toy code)
Design a feature or bug fix that doesn't exist as a public commit
Write a prompt that's roughly half the length of SWE-bench Pro's average
Hand-write verifiers that test the resulting behavior

The 5-language spread covers TypeScript, Go, Python, JavaScript, and Rust, per the project README. Python and TypeScript carry the bulk of the tasks, as you would expect. The 91-repo pool is intentionally varied: web frameworks, CLI tools, data libraries, infrastructure code.

Here's the rough scale comparison the authors highlight:

Metric	DeepSWE	SWE-bench Pro
Prompt length	~0.5x	1x (baseline)
Required solution code	5.5x	1x
Output tokens needed	~2x	1x
Repository count	91	varies
Languages	5	mostly Python
Contamination risk	None (authors' claim)	High (public PRs)

Looks pretty solid on paper. The real test will be whether the benchmark gets independent replication.

How frontier models score on existing coding benchmarks

DeepSWE itself has a live leaderboard at deepswe.datacurve.ai, but absolute scores there are still in flux as more agents submit. For context, here is where top models currently land on the adjacent SWE-bench Verified leaderboard:

Model	SWE-bench Verified score
Claude Opus 4.5 (high reasoning)	~79%
Doubao-Seed-Code	~79%
Gemini 3 Pro Preview	~77%
Claude 4 Sonnet + leading scaffolds	~75-77%
GPT-5 + leading scaffolds	~72-74%

Numbers are from the SWE-bench leaderboard snapshot in mid-2026; treat any specific cell as approximate until you check the live board.

The gap between HumanEval (frontier models long ago crossed 90%) and SWE-bench Verified (top scores still in the high 70s) is the contamination story in miniature. HumanEval is essentially solved. Real engineering work? Still hard.

Early reports suggest frontier models score noticeably lower on DeepSWE than on SWE-bench Verified, which is what you would expect from a clean benchmark with no pretraining overlap. The specific numbers will firm up as more labs publish their runs against it.

What the numbers actually mean

This caught our eye. The interesting signal isn't the absolute scores. It's the delta between contaminated and uncontaminated benchmarks.

If a model scores in the high 70s on SWE-bench Verified but lands much lower on DeepSWE, the gap is not a pure capability drop. It is partly a measure of how much the older benchmark was inflated by memorization. That has real implications for anyone making procurement decisions based on those scores.

A few takeaways stand out:

Scaffolding matters more than we admit. Notice how the same Claude or GPT model can shift several points on SWE-bench Verified depending on the scaffold it runs in. Pure model capability is rarely what you're buying. You're buying model + harness + tools. DeepSWE's longer required output (roughly 2x the output tokens of SWE-bench Pro) should expose models that depend heavily on tight scaffolds.

Diversity is the silent killer. A benchmark dominated by Python web frameworks rewards models trained on Django and Flask. Spreading across 5 languages and 91 repos forces actual generalization. Models that lean hard on Python pattern-matching will likely tank on Rust and Go tasks.

Short prompts, long answers is a brutal combo. This is the design choice that excites me most. Most production coding agents don't get a six-paragraph spec. They get a Jira ticket that says "add SSO to the admin panel" and have to figure the rest out. DeepSWE's prompt-to-output ratio mirrors that reality.

Notable surprises in the design

A few things in the methodology raised eyebrows in the comments.

The verifier philosophy is the standout. By testing behavior rather than implementation, DeepSWE lets models solve problems in unexpected ways. SWE-bench has historically been criticized for verifiers that essentially require the model to write the same patch as the original human author. That's not engineering, that's copying.

The 91-repo number is also weirdly specific. Most benchmarks round to 50, 100, or 500. The fact that the team published 91 suggests they actually had quality cutoffs and didn't pad with low-quality repos to hit a marketing number. Small detail. Reads as honest.

One legitimate concern: a benchmark of 91 repos across 5 languages probably means relatively few tasks per language. If Rust gets, say, 30 tasks and Go gets 25, the per-language signal-to-noise ratio will be limited. The aggregate score will be useful. Language-specific rankings should be treated cautiously until sample sizes grow.
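How limited? A quick way to see it is to put a confidence interval around a pass rate. The sketch below uses the standard Wilson score interval for a binomial proportion; the 30-task figure is the same hypothetical number used above, not a published per-language count.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical: a model passes half of the tasks in one language.
    for total in (30, 300):
        low, high = wilson_interval(total // 2, total)
        print(f"{total // 2}/{total} passed: {low:.1%} to {high:.1%}")
```

With 15 of 30 tasks passed, the interval runs from about 33% to 67%, so two models a few points apart on a single language are statistically indistinguishable. At 300 tasks the same pass rate narrows to roughly 44% to 56%.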

Practical implications for developers

If you're picking a coding agent for actual work, what does this change?

First, stop treating SWE-bench Verified as the definitive score. It's still useful (the verified subset is much cleaner than the full set), but the high-70s scores top models are now hitting probably overstate real-world performance. Pair those scores with LiveCodeBench for recency, our own 2026 coding LLM rankings, and DeepSWE for contamination control.

Second, if you're choosing between Claude Code, Cursor, Aider, or GitHub Copilot, the benchmark deltas between underlying models are probably smaller than the scaffolding deltas between the tools. Pick the tool whose workflow matches yours, then pick the model.

Third, watch this benchmark closely over the next few months. The first wave of official scores will tell us which labs were quietly overfitting to public benchmarks and which were actually building solid coding capability. (My bet: the gap will be ugly for some popular models. Not naming names.)

What's next for the benchmark

The repo is on GitHub, and the team is open-sourcing the verifiers along with the tasks. That's the right move. Closed benchmarks become political artifacts. Open ones get pressure-tested by independent researchers, which is exactly what you want.

The key questions for the next 6 months:

Will major labs (OpenAI, Anthropic, Google, DeepSeek) actually run their flagships against it and publish results?
Will the contamination claim hold up under adversarial review? (Someone will try to prove tasks leaked.)
Will the benchmark expand beyond 91 repos to give per-language scores statistical weight?
Can it stay contamination-free as it gets more attention, or will the tasks themselves leak into pretraining corpora?

That last one is the existential question for every public benchmark. The moment a benchmark gets famous, it gets crawled. The half-life of a clean coding benchmark is probably 12-18 months at this point.

But for now? DeepSWE is the most credible-looking contamination-free coding benchmark to come out of the open-source community in a while. Worth bookmarking, worth watching, and worth weighing against the scores you've been quoting in procurement decks.
