DeepSWE Benchmark: 91 Repos, 5 Languages, Zero Leaks
DeepSWE is a fresh contamination-free coding benchmark spanning 91 repos and 5 languages. Here's what the numbers say about frontier coding agents.

A new contamination-free coding benchmark just dropped on r/MachineLearning, and it might be the cleanest look we've gotten at how frontier models actually perform on real software engineering work. It's called DeepSWE, built by datacurve-ai, and it tries to fix the biggest problem with existing public benchmarks: data leakage.
The pitch is simple. Every task is written from scratch. No scraped GitHub PRs. No commits the models might have seen during pretraining. And the solutions require roughly 5.5x more code than SWE-bench Pro tasks, despite prompts that are about half the length.
So what does the data actually show? Let's get into it.
DeepSWE is an open-source coding benchmark covering 91 repositories across 5 programming languages, designed to evaluate how frontier AI models handle real software engineering tasks. Unlike SWE-bench, every task is hand-authored rather than scraped from public repos, eliminating pretraining contamination. Verifiers test behavior, not implementation details.

And that last part matters more than it sounds. Most public coding benchmarks lean on test files that ship with the original PR, which means models can sometimes overfit to the test shape rather than solve the actual problem. DeepSWE flips this by having humans write verifiers from scratch that check what the code does, not how it's written.
The benchmark was posted on r/MachineLearning in late June 2026, and the GitHub repo is already public.
Do we really need another coding benchmark? Fair question. We already have HumanEval, LiveCodeBench, SWE-bench Verified, SWE-bench Pro, BigCodeBench, and roughly forty others, plus our own agentic LLM benchmark roundup.
Because most of them are leaking like sieves.
HumanEval was published in 2021. Every serious frontier model has effectively memorized it. SWE-bench is built from real GitHub PRs, which means the patches sit in pretraining corpora. When Claude Opus 4.5 lands near 79% on SWE-bench Verified, we genuinely do not know how much of that is real generalization versus recall.
The DeepSWE authors argue their four pillars address this directly:
The short-prompt, long-output pillar is the spicy one. Long prompts can hide poor model reasoning behind verbose spec-following. Short prompts that require lots of output force the model to actually plan.
The construction process is the part worth dwelling on. According to the project's published materials, tasks are authored by human engineers who:
The 5-language spread covers TypeScript, Go, Python, JavaScript, and Rust, per the project README. Python and TypeScript carry the bulk of the tasks, as you would expect. The 91-repo pool is intentionally varied: web frameworks, CLI tools, data libraries, infrastructure code.
Here's the rough scale comparison the authors highlight:
| Metric | DeepSWE | SWE-bench Pro |
|---|---|---|
| Prompt length | ~0.5x | 1x (baseline) |
| Required solution code | 5.5x | 1x |
| Output tokens needed | ~2x | 1x |
| Repository count | 91 | varies |
| Languages | 5 | mostly Python |
| Contamination risk | None claimed | High (public PRs) |
Looks pretty solid on paper. The real test will be whether the benchmark gets independent replication.
DeepSWE itself has a live leaderboard at deepswe.datacurve.ai, but absolute scores there are still in flux as more agents submit. For context, here is where top models currently land on the adjacent SWE-bench Verified leaderboard:
| Model | SWE-bench Verified score (approx.) |
|---|---|
| Claude 4.5 Opus (high reasoning) | ~79% |
| Doubao-Seed-Code | ~79% |
| Gemini 3 Pro Preview | ~77% |
| Claude 4 Sonnet + leading scaffolds | ~75-77% |
| GPT-5 + leading scaffolds | ~72-74% |
Numbers are from the SWE-bench leaderboard snapshot in mid-2026; treat any specific cell as approximate until you check the live board.

The gap between HumanEval (frontier models long ago crossed 90%) and SWE-bench Verified (top scores still in the high 70s) is the contamination story in one chart. HumanEval is essentially solved. Real engineering work? Still hard.
Early reports suggest frontier models score noticeably lower on DeepSWE than on SWE-bench Verified, which is what you would expect from a clean benchmark with no pretraining overlap. The specific numbers will firm up as more labs publish their runs against it.
This caught our eye. The interesting signal isn't the absolute scores. It's the delta between contaminated and uncontaminated benchmarks.
If a model scores in the high 70s on SWE-bench Verified but lands much lower on DeepSWE, the gap is not pure capability drop. It is partly a measure of how much the older benchmark was inflated by memorization. That has real implications for anyone making procurement decisions based on those scores.
A few takeaways stand out:
Scaffolding matters more than we admit. Notice how the same Claude or GPT model can shift several points on SWE-bench Verified depending on the scaffold it runs in. Pure model capability is rarely what you're buying. You're buying model + scaffold + tools. DeepSWE's longer required output tokens (roughly 2x SWE-bench Pro) should expose models that depend heavily on tight scaffolds.
Diversity is the silent killer. A benchmark dominated by Python web frameworks rewards models trained on Django and Flask. Spreading across 5 languages and 91 repos forces actual generalization. Models that lean hard on Python pattern-matching will likely tank on Rust and Go tasks.
Short prompts, long answers is a brutal combo. This is the design choice that excites me most. Most production coding agents don't get a six-paragraph spec. They get a Jira ticket that says "add SSO to the admin panel" and have to figure the rest out. DeepSWE's prompt-to-output ratio mirrors that reality.
A few things in the methodology raised eyebrows in the comments.
The verifier philosophy is the standout. By testing behavior rather than implementation, DeepSWE lets models solve problems in unexpected ways. SWE-bench has historically been criticized for verifiers that essentially require the model to write the same patch as the original human author. That's not engineering, that's copying.
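To make the behavior-versus-implementation distinction concrete, here is a minimal sketch in Python. It is not taken from the DeepSWE repo: the task (order-preserving de-duplication), the function names, and the test cases are all hypothetical, chosen only to show how a verifier that checks outputs accepts structurally different solutions.

```python
from typing import Callable, List

# Hypothetical task: remove duplicates from a list while keeping first-seen order.

def dedupe_with_set(items: List[str]) -> List[str]:
    """One possible solution: track seen items in a set."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def dedupe_with_dict(items: List[str]) -> List[str]:
    """A structurally different solution: rely on dict insertion order."""
    return list(dict.fromkeys(items))


def behavioral_verifier(solution: Callable[[List[str]], List[str]]) -> bool:
    """Check observable behavior only: given inputs, are the outputs right?

    Nothing here inspects the source text, helper names, or diff shape,
    so any implementation with the correct behavior passes.
    """
    cases = [
        ([], []),
        (["a", "b", "a"], ["a", "b"]),
        (["x", "x", "x"], ["x"]),
        (["b", "a", "b", "c", "a"], ["b", "a", "c"]),
    ]
    return all(solution(list(inp)) == expected for inp, expected in cases)


if __name__ == "__main__":
    print(behavioral_verifier(dedupe_with_set))   # both correct solutions pass
    print(behavioral_verifier(dedupe_with_dict))
    print(behavioral_verifier(sorted))            # wrong behavior fails
```

A verifier tied to one reference patch would have to reject one of the two correct solutions above. The behavioral one accepts both and still rejects a function that does the wrong thing.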
The 91-repo number is also weirdly specific. Most benchmarks round to 50, 100, or 500. The fact that the team published 91 suggests they actually had quality cutoffs and didn't pad with low-quality repos to hit a marketing number. Small detail. Reads as honest.

One legitimate concern: a benchmark of 91 repos across 5 languages probably means relatively few tasks per language. If Rust gets, say, 30 tasks and Go gets 25, the per-language signal-to-noise ratio will be limited. The aggregate score will be useful. Language-specific rankings should be treated cautiously until sample sizes grow.
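A quick way to see how noisy a small per-language sample is: compute a confidence interval on the pass rate. The sketch below uses the standard Wilson score interval. The 30-task count echoes the illustrative figure above, and the 15 passes are purely hypothetical, not DeepSWE results.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical: a model passes 15 of 30 tasks in one language.
    low, high = wilson_interval(15, 30)
    print(f"n=30:  {low:.3f} to {high:.3f}")

    # Same pass rate with ten times as many tasks, for comparison.
    low, high = wilson_interval(150, 300)
    print(f"n=300: {low:.3f} to {high:.3f}")
```

With 30 tasks and a 50% pass rate, the interval runs from about 0.33 to 0.67, wide enough that two models several points apart are statistically indistinguishable on that language alone.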
If you're picking a coding agent for actual work, what does this change?
First, stop treating SWE-bench Verified as the definitive score. It's still useful (the verified subset is much cleaner than the full set), but the high-70s scores top models are now hitting probably overstate real-world performance. Pair those scores with LiveCodeBench for recency, our own 2026 coding LLM rankings, and DeepSWE for contamination control.
Second, if you're choosing between Claude Code, Cursor, Aider, or GitHub Copilot, the benchmark deltas between underlying models are probably smaller than the scaffolding deltas between the tools. Pick the tool whose workflow matches yours, then pick the model.
Third, watch this benchmark closely over the next few months. The first wave of official scores will tell us which labs were quietly overfitting to public benchmarks and which were actually building solid coding capability. (My bet: the gap will be ugly for some popular models. Not naming names.)
The repo is on GitHub, and the team is open-sourcing the verifiers along with the tasks. That's the right move. Closed benchmarks become political artifacts. Open ones get pressure-tested by independent researchers, which is exactly what you want.
The key questions for the next 6 months:
Whether DeepSWE itself stays clean is the existential one, as it is for every public benchmark. The moment a benchmark gets famous, it gets crawled. The half-life of a clean coding benchmark is probably 12-18 months at this point.
But for now? DeepSWE is the most credible-looking contamination-free coding benchmark to come out of the open-source community in a while. Worth bookmarking, worth watching, and worth weighing against the scores you've been quoting in procurement decks.
The full benchmark is open-sourced at github.com/datacurve-ai/deep-swe. You can clone the repo, pull the task set, and run verifiers locally against any model with an API endpoint. The team is also accepting community contributions for new tasks and verifier patches via pull request.
Existing agent scaffolds should work with DeepSWE with minor adapter code, because it follows a similar input/output contract to SWE-bench style harnesses. The longer required output tokens may push you to increase context limits or chunk solutions, especially when driving Claude Opus or Gemini 3 Pro through an agent loop.
Cost varies by model, but expect roughly 2x the token usage of a SWE-bench Pro run because of the longer output requirement. For Claude Opus 4.6 at $5/$25 per million tokens, a full evaluation pass typically lands in the $200-500 range depending on retries. Cheaper models like Mistral Large 2.1 will run under $100.
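For budgeting, the arithmetic is simple enough to sketch. In the example below, only the $5/$25 per-million-token prices come from the paragraph above. The task count, per-task token counts, and retry rate are hypothetical placeholders you would replace with figures from your own runs.

```python
def estimate_run_cost(
    num_tasks: int,
    input_tokens_per_task: int,
    output_tokens_per_task: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    retry_rate: float = 0.0,
) -> float:
    """Rough dollar cost of one evaluation pass.

    retry_rate is the average number of extra attempts per task
    (0.5 means half the tasks get retried once, on average).
    """
    attempts = num_tasks * (1 + retry_rate)
    input_cost = attempts * input_tokens_per_task / 1_000_000 * input_price_per_mtok
    output_cost = attempts * output_tokens_per_task / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical workload: 100 tasks, 200k input and 20k output tokens each.
    base = estimate_run_cost(100, 200_000, 20_000, 5.0, 25.0)
    with_retries = estimate_run_cost(100, 200_000, 20_000, 5.0, 25.0, retry_rate=0.5)
    print(f"no retries:   ${base:.2f}")
    print(f"50% retries:  ${with_retries:.2f}")
```

Agent loops resend context on every turn, so input tokens usually dominate the bill even when the benchmark's distinguishing feature is long output. Measure both from a small pilot run before committing to a full pass.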
The current task set focuses on functional correctness rather than security hardening or production-grade concerns like backwards compatibility. If you need security-specific evaluation, pair DeepSWE with a tool like Snyk or a dedicated security benchmark. The team has mentioned security tasks may come in a future release.
DeepSWE is unlikely to replace SWE-bench Verified in the near term. SWE-bench Verified remains the most widely cited coding benchmark in research papers and lab announcements, and it has 18+ months of historical model scores attached to it. DeepSWE will probably complement rather than replace it, similar to how LiveCodeBench coexists with HumanEval.