GLM-5.1 Hits 95% of Claude's Coding Score, Open Source
Zhipu AI's GLM-5.1 scores 94.6% of Claude Opus 4.6's coding performance in testing. Built on GLM-5's open-source SWE-bench record of 77.8%, here's what this means for developers.

Zhipu AI's GLM-5.1 just scored 45.3 on a coding evaluation using Claude Code as the testing tool — that's 94.6% of Claude Opus 4.6's 47.9 score. For an open-source model, that's remarkably close to the proprietary frontier.
GLM-5.1 launched on March 27, 2026, as an upgrade to GLM-5, which already held the highest open-source SWE-bench Verified score at 77.8%. The original announcement framed GLM-5.1 as "approaching Claude Opus 4.5 coding ability" — and the numbers back that up. While Claude Opus 4.6 still leads on SWE-bench Verified at 80.8%, the gap between proprietary and open-source coding models is shrinking fast.
So what does this actually mean for developers choosing between open-source and proprietary models?
GLM-5.1 is the latest update from Zhipu AI (now operating as Z.ai), one of China's best-funded AI startups. It builds on GLM-5, a Mixture of Experts (MoE) model — 744 billion total parameters with only 40 billion activated per forward pass. That architecture is key: the model punches at heavyweight levels while running far lighter than a dense model of equivalent quality.
Here's what you're getting:

- 744 billion total parameters, with only 40 billion activated per forward pass (MoE)
- 200K-token context window
- 128K-token maximum output length
- Native MCP (Model Context Protocol) support
- Open weights under an MIT license
That 128K max output is a standout. Most frontier models cap output well before that, which becomes a real bottleneck when you're asking an AI to generate or refactor large code files. And native MCP (Model Context Protocol) support means GLM-5.1 can plug directly into tool-use workflows that developers are already building around Anthropic's protocol standard — a smart interoperability move from Z.ai.
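As a concrete sketch, a request to a GLM-5.1 endpoint might be shaped like the following. The model identifier and parameter names here are assumptions modeled on common OpenAI-compatible chat APIs, not documented Z.ai values; check Z.ai's API docs for the real ones.

```python
# Hypothetical chat-completion payload for an OpenAI-compatible endpoint.
# "glm-5.1" and the field names are illustrative assumptions.
import json

def build_request(prompt: str, max_output_tokens: int = 128_000) -> dict:
    """Build a payload that uses GLM-5.1's full 128K output budget."""
    return {
        "model": "glm-5.1",  # assumed model identifier
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_output_tokens,
    }

payload = build_request("Refactor this module to use dataclasses.")
print(json.dumps(payload, indent=2))
```

The point of the sketch is the `max_tokens` value: most hosted models reject or silently cap a request this large, while GLM-5.1's stated limit allows it.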
Here's where things get concrete. GLM-5.1 was evaluated using Claude Code as the testing framework, scoring 45.3 points — a 28% improvement over GLM-5's 35.4 on the same evaluation. Claude Opus 4.6 scored 47.9 on this test, putting GLM-5.1 at 94.6% of Claude's performance. A month ago, that gap was much wider.
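The percentages above follow directly from the reported raw scores:

```python
# Reported scores from the Claude Code evaluation.
glm_51, glm_5, opus_46 = 45.3, 35.4, 47.9

ratio = glm_51 / opus_46          # fraction of Opus 4.6's score
improvement = glm_51 / glm_5 - 1  # gain over the GLM-5 base model

print(f"{ratio:.1%} of Opus 4.6, {improvement:.0%} over GLM-5")
# → 94.6% of Opus 4.6, 28% over GLM-5
```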

The broader SWE-bench Verified picture shows where the GLM-5 family sits relative to the field. GLM-5, the base model, scored 77.8% — the highest among open-source models. Proprietary models still lead, but look at how tight the margins are getting:
| Model | SWE-bench Verified |
|---|---|
| Claude Opus 4.5 | 80.9% |
| Claude Opus 4.6 | 80.8% |
| Claude Sonnet 4.6 | 79.6% |
| GLM-5 (base for GLM-5.1) | 77.8% |
| o3 | 71.7% |
| DeepSeek R1-0528 | 66.0% |
| GPT-4.1 | 54.6% |
| DeepSeek R1 | 49.2% |
Reaching 95% of Claude Opus 4.6's coding performance as an open-source model is arguably a bigger deal than topping a leaderboard. It means the proprietary advantage in coding is now measured in single-digit percentages.
A few important caveats here. GLM-5.1's coding evaluation score of 45.3 was reported by Zhipu AI using Claude Code as the testing framework. Independent verification hasn't happened yet. SWE-bench scores depend heavily on scaffolding and prompting strategy, so direct comparisons across models require identical testing conditions. As our benchmarks coverage explores, no single number tells the whole story.
Can a single benchmark tell you which model is best? Not really. SWE-bench and coding evaluations measure specific skills: autonomous bug-fixing, code generation, multi-file changes. But coding involves much more — code review, architecture decisions, explaining complex systems, debugging production incidents at 2 a.m. No single benchmark captures all of that.
Claude Opus 4.6 leads on HumanEval at 95% and holds a strong position on MMLU at 91%. Its 200K context window matches GLM-5.1's. But GLM-5.1's 128K max output length is a genuine advantage for code generation tasks — that's an enormous amount of code in a single response. When you're refactoring a large codebase or generating boilerplate across multiple files, output length matters more than most benchmarks measure.
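To make "an enormous amount of code" concrete: a rough heuristic of ~4 characters per token (an assumption, not a real tokenizer) suggests a 400 KB source file fits in a single 128K-token response but would be cut off many times over by a typical 8K output cap.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer

def fits_in_one_response(source: str, max_output_tokens: int = 128_000) -> bool:
    """Estimate whether generated output of this size fits the output cap."""
    est_tokens = len(source) / CHARS_PER_TOKEN
    return est_tokens <= max_output_tokens

big_file = "x" * 400_000  # ~400 KB of code, ~100K estimated tokens

print(fits_in_one_response(big_file))                           # True: under 128K
print(fits_in_one_response(big_file, max_output_tokens=8_000))  # False: over 8K
```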
The real question isn't whether GLM-5.1 matches Claude on one test. It's whether open-source models have crossed the threshold where they're genuinely competitive for production coding work.
GPT-4.1 scored 54.6% on SWE-bench Verified, well behind both Claude and the GLM-5 family. The current OpenAI frontier is GPT-5.2, which scores 80.0% on SWE-bench Verified, showing just how competitive the top tier has become: the top three scores sit within a single percentage point of each other.

Something gets lost in the benchmark discussions: GLM-5.1's architecture is genuinely clever. Think of it like a hospital with 744 specialist doctors, but any given patient only sees 40 of them. You get the breadth of knowledge without paying for everyone to be in the room.
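The hospital analogy is top-k expert routing. Here is a toy sketch of the idea: a gate scores every expert for each token, but only the k highest-scoring experts actually run. The expert count, gate, and k below are illustrative; GLM-5.1's actual router internals are not described in this article.

```python
# Toy top-k MoE routing: score all experts, run only the top k.
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_scores, k):
    """Return (expert_index, renormalized_weight) for the top-k experts."""
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in top])
    return list(zip(top, weights))

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(64)]  # 64 experts in this toy
active = route(scores, k=4)                       # only 4 experts run per token
print(active)
```

The compute saving is the ratio of active to total experts: most of the network's capacity sits idle for any single token, which is how 744B total parameters can run with roughly 40B parameters' worth of compute.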
Why should you care about that? Two practical reasons:
Inference cost. Running 40B active parameters is dramatically cheaper than running a full 744B dense model. As of March 2026, GLM-5.1 is available through Z.ai's Coding Plan starting at $3/month (promotional pricing), with standard plans from $10/month across Lite, Pro, and Max tiers. Compare that to Claude Opus 4.6 at $5 per million input tokens and $25 per million output tokens — for high-volume use cases, the cost difference adds up quickly.
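A back-of-envelope calculation shows how quickly per-token pricing diverges from a flat plan. The monthly token volumes below are assumptions for illustration; the rates are the article's listed Opus 4.6 prices.

```python
# Claude Opus 4.6 rates from the article, in $ per million tokens.
INPUT_PER_M, OUTPUT_PER_M = 5.0, 25.0

def opus_monthly_cost(input_tokens_m: float, output_tokens_m: float) -> float:
    """API cost for a month, token counts given in millions."""
    return input_tokens_m * INPUT_PER_M + output_tokens_m * OUTPUT_PER_M

# Assumed workload: 50M input + 10M output tokens per month.
cost = opus_monthly_cost(50, 10)
print(f"${cost:,.0f}/month vs a $10/month flat Coding Plan")
# → $500/month vs a $10/month flat Coding Plan
```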
Deployment flexibility. While 744B total parameters is still enormous for local hosting, the 40B active count puts it in a more accessible range for enterprise GPU clusters. You won't run this on a gaming PC (744B at FP16 requires ~1.5TB of VRAM), but teams with multi-GPU infrastructure could self-host it under an MIT license with zero API costs. We've seen similar cost-performance tradeoffs play out across the industry as open-source models mature.
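The "~1.5 TB at FP16" figure is just weights times bytes per parameter; note that the 40B active slice saves compute per token, not resident memory, since the full model must still be loaded.

```python
# Weight-memory math behind the ~1.5 TB figure (weights only; KV cache
# and activations need additional memory on top of this).
BYTES_FP16 = 2

def weight_bytes(params_b: float, bytes_per_param: int = BYTES_FP16) -> float:
    """Bytes needed just to hold the weights, given billions of parameters."""
    return params_b * 1e9 * bytes_per_param

total_tb = weight_bytes(744) / 1e12   # the full 744B model
active_gb = weight_bytes(40) / 1e9    # the 40B slice active per token
print(f"full weights ~{total_tb:.2f} TB; active slice ~{active_gb:.0f} GB")
# → full weights ~1.49 TB; active slice ~80 GB
```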
GLM-5.1 joins DeepSeek, Llama 4, and Mistral in a growing roster of open-source models that compete with proprietary options. The GLM-5 family's 77.8% SWE-bench score already set an open-source record, and GLM-5.1 pushes practical coding performance even closer to Claude's level.
For developers who've been waiting to ditch expensive API calls for open-source alternatives, GLM-5.1 is the strongest argument yet that the quality gap is closing fast.

The model supports autonomous multi-step coding tasks, long-context refactoring across large codebases, and full agentic workflows — plan, execute, debug, deliver — with native MCP tool support baked in. If Z.ai's numbers hold up under independent testing, that's a compelling package.
But "if the numbers hold up" is doing heavy lifting in that sentence. The AI community has been burned by inflated benchmark claims before, and healthy skepticism is warranted until third-party evaluations confirm the results.
The pressure is on every frontier lab. An open-source model reaching 95% of Claude Opus 4.6's coding performance — just one month after GLM-5's launch — signals that the pace of open-source improvement is accelerating. Will the next iteration close the gap entirely?
For developers, the practical advice is straightforward: wait for independent benchmarks, then try GLM-5.1 on your actual codebase. Benchmarks tell you what a model can do in controlled conditions. Your messy, real-world code will tell you what it will do when it matters.
Frequently Asked Questions
**Can you run GLM-5.1 on your own hardware?** Technically yes, but you need serious hardware. GLM-5.1 has 744B total parameters, which at FP16 precision requires roughly 1.5TB of VRAM just to load the weights. Even though only 40B parameters activate per inference pass, the full model needs to reside in memory. Enterprise-grade multi-GPU setups (8-16x A100 80GB or H100 GPUs) are the minimum for self-hosting. Most developers will want to use Z.ai's hosted API through the Coding Plan instead.
**Can you use GLM-5.1 commercially?** Yes. GLM-5.1's model weights are released under an MIT license, the same license used for GLM-5. MIT is one of the most permissive open-source licenses available, allowing commercial use, modification, and redistribution. Check Z.ai's official repository for the full license terms before building commercial products.
**How does GLM-5.1 compare to DeepSeek for coding?** On SWE-bench Verified, GLM-5 (the base for GLM-5.1) scores 77.8% versus DeepSeek R1's 49.2% and DeepSeek R1-0528's 66.0%. That's a significant lead. On the Claude Code coding evaluation, GLM-5.1 scored 45.3 points, showing a 28% improvement over GLM-5. For autonomous bug-fixing and multi-file code changes, the GLM-5 family appears stronger based on available benchmarks.
**How much does GLM-5.1 cost?** As of March 2026, Z.ai offers GLM-5.1 through its Coding Plan with three tiers: Lite, Pro, and Max. Promotional pricing starts at $3/month, with standard plans from $10/month. For comparison, Claude Opus 4.6 charges $5 per million input tokens and $25 per million output tokens through Anthropic's API.
**Does GLM-5.1 work with MCP tools?** Yes. GLM-5.1 includes native MCP (Model Context Protocol) support, meaning it's compatible with tools and integrations built around Anthropic's MCP standard. This includes file system access, database queries, API calls, and other tool-use workflows. Native support means you don't need extra scaffolding or adapters — the protocol handling is built into the model's inference pipeline.
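For readers unfamiliar with the protocol, an MCP tool is advertised to clients as a name, a description, and a JSON Schema for its inputs (the shape of a `tools/list` result in the public MCP spec). The `read_file` tool below is a made-up example, not a GLM-5.1 built-in.

```python
# Shape of a tool definition as exposed over MCP. The tool itself is
# hypothetical; only the field layout follows the MCP spec.
read_file_tool = {
    "name": "read_file",
    "description": "Read a file from the project workspace.",
    "inputSchema": {  # standard JSON Schema for the tool's arguments
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

def validate_tool(tool: dict) -> bool:
    """Minimal structural check an MCP client might apply."""
    return (
        isinstance(tool.get("name"), str)
        and tool.get("inputSchema", {}).get("type") == "object"
    )

print(validate_tool(read_file_tool))  # True
```

Because this layout is model-agnostic, a tool written for a Claude-based MCP workflow should be consumable by any model that speaks the protocol natively, which is the interoperability point the article makes.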