Antigravity 2.0 Tops OpenSCAD 3D Benchmark: Full Analysis
Google's Antigravity 2.0 just posted the strongest autonomous result on ModelRift's OpenSCAD LLM benchmark, beating Claude Opus 4.7 and Codex 5.5 on a Pantheon-modeling task.

Google Antigravity 2.0 (running Gemini 3.5 Flash High) just posted the strongest autonomous result on ModelRift's OpenSCAD LLM benchmark, scoring 4.5 out of 5 on a Pantheon-modeling task that pitted it against Codex 5.5 High, Claude Opus 4.7, Claude Sonnet 4.6, Cursor's Composer 2.5, and ModelRift's own human-in-the-loop workflow.
That's the headline. The reality underneath is a lot more interesting, because this benchmark exposes weaknesses you'd never see on HumanEval or SWE-bench.
The ModelRift OpenSCAD LLM benchmark is a small, practical evaluation. Each agent gets two reference images of the Pantheon and a short prompt asking it to produce a working .scad file using the OpenSCAD CLI to render previews and iterate.

OpenSCAD is a programmatic CAD language. You type constructive solid geometry (CSG) operations, and the renderer spits out a 3D model. No mouse-driven sketches, no widgets to drag. Every output is reproducible because everything is code.
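To make the CSG idea concrete, here is a minimal sketch in Python that builds an OpenSCAD source string for a hollow half-dome with a hole at the top. The function name `hollow_dome_scad` and the sample dimensions are illustrative assumptions, not anything taken from ModelRift's runs; `difference()`, `sphere()`, `cube()`, `cylinder()`, and `translate()` are standard OpenSCAD operations.

```python
def hollow_dome_scad(radius, wall, oculus_radius):
    """Return OpenSCAD source for a hemispherical shell with a hole at the top.

    All values are in OpenSCAD's unitless model units.
    """
    inner = radius - wall
    return (
        "difference() {\n"
        f"    sphere(r = {radius});\n"
        f"    sphere(r = {inner});  // hollow out the interior\n"
        f"    translate([0, 0, {-radius}])\n"
        f"        cube({2 * radius}, center = true);  // cut away the lower half\n"
        f"    cylinder(h = {radius + 1}, r = {oculus_radius});  // punch the oculus\n"
        "}\n"
    )


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to produce a readable shape.
    print(hollow_dome_scad(radius=20, wall=1, oculus_radius=4))
```

The whole shape is one `difference()`: the first child is the solid, and every later child is subtracted from it. Changing a single number and re-rendering gives a different but fully reproducible model.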
That property is exactly why it works as an LLM benchmark. The exact prompt ModelRift used:
see two ref images and build .scad file with openscad implementation of pantheon. use openscad CLI (available) to preview your work (by rendering openscad model to .png) and iterate until you are happy with the result.
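The render step the prompt asks for can be sketched as a small Python wrapper around the OpenSCAD command line. This is an assumption about how such a loop might look, not the commands any agent actually ran: the function names, file names, and image size are hypothetical, while `-o`, `--imgsize`, `--autocenter`, and `--viewall` are existing OpenSCAD CLI options.

```python
import shutil
import subprocess
from pathlib import Path


def render_command(scad_path, png_path, size=(800, 600)):
    """Build an OpenSCAD CLI invocation that renders a .scad file to a PNG."""
    return [
        "openscad",
        "-o", str(png_path),                # output format is inferred from the extension
        f"--imgsize={size[0]},{size[1]}",   # preview resolution in pixels
        "--autocenter",                     # center the model in the view
        "--viewall",                        # zoom so the whole model is visible
        str(scad_path),
    ]


def render_preview(scad_path, png_path):
    """Render a preview; return True only if OpenSCAD ran and wrote the PNG."""
    if shutil.which("openscad") is None:
        return False  # OpenSCAD is not installed on this machine
    result = subprocess.run(
        render_command(scad_path, png_path), capture_output=True, text=True
    )
    return result.returncode == 0 and Path(png_path).exists()
```

An agent in this benchmark would call something like `render_preview("pantheon.scad", "preview.png")`, look at the image, edit the `.scad` file, and repeat.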
ModelRift assigns a single subjective quality score from 1 to 5 based on how faithfully the generated geometry captures the Pantheon (the rotunda, dome, oculus, portico, columns, and pediment) and reports run time as a separate axis. This is neither a multi-dimensional rubric nor an automated pass/fail check like HumanEval's unit tests; it's a practical, eyes-on assessment by the ModelRift team.
So this isn't HumanEval dressed up. Generating syntactically valid Python is one thing. Generating spatially coherent CSG operations for a domed building with the right proportions is a different problem entirely.
According to ModelRift's published results, Antigravity 2.0 running Gemini 3.5 Flash High took the top autonomous score. The full leaderboard:
| Rank | Tool / Model | Quality | Time (approximate or relative) |
|---|---|---|---|
| 1 | Google Antigravity 2.0 / Gemini 3.5 Flash High | 4.5/5 | ~12 min |
| 2 | ModelRift / Gemini Flash 3.0 (human-in-the-loop) | 3.8/5 | ~10 min |
| 3 | Claude Code 2.1 / Sonnet 4.6 | 3.4/5 | slowest autonomous |
| 4 | Claude Code 2.1 / Opus 4.7 | 3.0/5 | slow |
| 5 | Codex 5.5 High | 3.0/5 | baseline |
| 6 | Cursor 3.5 / Composer 2.5 | 1.4/5 | fastest |
Exact scores and per-run notes live in the ModelRift writeup. The gap between Antigravity and the next autonomous run (Sonnet 4.6 at 3.4) was 1.1 points, wider than the 0.4-point spread that separates Sonnet, Opus, and Codex in the middle of the table. That's a notable lead for any small benchmark.
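The spacing between the autonomous scores can be checked with a few lines of Python. Only the scores come from the table above; the helper name `adjacent_gaps` is an illustrative choice.

```python
# Quality scores copied from the leaderboard (autonomous runs only).
AUTONOMOUS = [
    ("Antigravity 2.0 / Gemini 3.5 Flash High", 4.5),
    ("Claude Code 2.1 / Sonnet 4.6", 3.4),
    ("Claude Code 2.1 / Opus 4.7", 3.0),
    ("Codex 5.5 High", 3.0),
    ("Cursor 3.5 / Composer 2.5", 1.4),
]


def adjacent_gaps(rows):
    """Return (higher, lower, gap) for each pair of neighbours in score order."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (hi[0], lo[0], round(hi[1] - lo[1], 1))
        for hi, lo in zip(ranked, ranked[1:])
    ]


if __name__ == "__main__":
    for higher, lower, gap in adjacent_gaps(AUTONOMOUS):
        print(f"{higher} -> {lower}: {gap}")
    # Gaps, top to bottom: 1.1, 0.4, 0.0, 1.6
```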
Code benchmarks like HumanEval, MBPP, and even SWE-bench Verified mostly test logical correctness in domains (web apps, algorithms) where the training data is enormous. GitHub hosts a vast amount of public Python. Every model has seen a thousand variations of binary search.
OpenSCAD has, by comparison, almost nothing. The total volume of public OpenSCAD code is small. The volume of public architectural OpenSCAD code is tinier still. So models can't memorize their way to a passing grade. They have to actually reason about geometry.

And geometry is hard. When a model writes:
translate([0, 0, 2700]) cube([4000, 6000, 200]);
it's not enough to know that translate takes a vector and cube takes dimensions. OpenSCAD itself is unitless, but reading these numbers as millimeters, the model has to understand that this places a 4 m × 6 m × 0.2 m slab at 2.7 m elevation, which makes it the ceiling slab over the first floor, which means the first-floor walls need to be 2.7 m tall, which constrains every other element in the building.
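One way to keep that chain of constraints explicit is to derive the slab position from the wall height instead of hard-coding both. The sketch below is a hypothetical illustration, assuming millimeter units and an invented 200 mm wall thickness; it emits OpenSCAD source in which a single `wall_height` variable drives both the walls and the slab.

```python
def storey_scad(width=4000, depth=6000, wall_height=2700, wall=200, slab=200):
    """Return OpenSCAD source for four walls and the slab that rests on them.

    Units are assumed to be millimeters; the wall thickness is a made-up value.
    """
    return f"""\
width = {width};
depth = {depth};
wall_height = {wall_height};  // the slab elevation is derived from this
wall = {wall};
slab = {slab};

// Walls: an outer box minus an inner box, both rising from z = 0.
difference() {{
    cube([width, depth, wall_height]);
    translate([wall, wall, -1])
        cube([width - 2 * wall, depth - 2 * wall, wall_height + 2]);
}}

// Slab: its underside sits exactly on top of the walls.
translate([0, 0, wall_height])
    cube([width, depth, slab]);
"""


if __name__ == "__main__":
    print(storey_scad())
```

Raising `wall_height` to 3000 moves the slab with it, which is the kind of dependency a model has to track mentally when every number is written out literally.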
Most LLMs fall apart somewhere in that chain. ModelRift notes that the harder part was geometric judgment, not the tool plumbing — every agent called the OpenSCAD CLI and rendered PNG previews without setup friction. The differences showed up in the geometry itself: proportions, the relationship between the round drum and the rectangular portico, dome ring placement.
ModelRift's analysis credits Antigravity's result to two things visible in the output. First, it used real Pantheon dimensions rather than making up rough proportions. Second, it was the only agent in the autonomous batch that implemented the building's signature interior coffered ceiling pattern.
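To give a sense of what a coffer pattern involves geometrically, here is a rough sketch that computes coffer centers on the inside of a hemispherical dome. The numbers are commonly cited approximations for the Pantheon (an interior diameter of roughly 43 m, five rings of 28 coffers), not the values Antigravity used, and the ring elevation angles are invented for illustration.

```python
import math

# Assumed, commonly cited approximations; not taken from any benchmark run.
DOME_RADIUS_M = 21.7       # about half of the roughly 43 m interior diameter
RINGS = 5                  # horizontal rings of coffers
COFFERS_PER_RING = 28      # coffers in each ring


def coffer_centers(radius=DOME_RADIUS_M, rings=RINGS, per_ring=COFFERS_PER_RING,
                   first_elev_deg=10.0, last_elev_deg=60.0):
    """Return (x, y, z) centers for coffers on a hemisphere of the given radius.

    The elevation range is a hypothetical choice, spread evenly across rings.
    """
    centers = []
    for ring in range(rings):
        t = ring / (rings - 1) if rings > 1 else 0.0
        elev = math.radians(first_elev_deg + t * (last_elev_deg - first_elev_deg))
        for i in range(per_ring):
            azimuth = 2 * math.pi * i / per_ring
            centers.append((
                radius * math.cos(elev) * math.cos(azimuth),
                radius * math.cos(elev) * math.sin(azimuth),
                radius * math.sin(elev),
            ))
    return centers


if __name__ == "__main__":
    print(len(coffer_centers()))  # 140 coffers in total
```

In OpenSCAD the same layout would typically be two nested `for` loops with `rotate()` and `translate()`, subtracting a small recess at each position from the dome shell.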
The run also took around 12 minutes, slow next to Cursor, but ModelRift's takeaway is that speed didn't predict quality. The fastest run (Cursor, 1.4/5) was the weakest. The slowest run in the original autonomous batch (Sonnet 4.6) produced the cleanest baseline massing.
That lines up with what Google has said publicly about Antigravity being agent-first: more time in the iterate-and-render loop appears to help on this kind of task.
A few things in the data didn't go the way you might expect.
Sonnet 4.6 beat Opus 4.7. That is not the order most people would have predicted. Among the autonomous Claude runs, Sonnet 4.6 scored 3.4/5 with the cleanest overall massing and most plausible proportions, while Opus 4.7 came in at 3.0/5, with better structure than Cursor but a more monochrome and less convincing result than the stronger runs.
Codex tied Opus on quality but lost on export. Codex 5.5 High produced strong detail density during the render loop, including the entablature inscription, but the final exported STL had geometry problems around the portico roof. ModelRift's note: previews and exports are not the same thing, and for anything going to print, the exported mesh needs a separate inspection pass.
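A basic version of that separate inspection pass is a watertightness check: in a closed mesh, every edge is shared by exactly two triangles. The sketch below is a minimal illustration, not the inspection ModelRift performed. It assumes the exported mesh has already been loaded as a list of triangles given by vertex indices (STL loading is left out), and the function name is hypothetical.

```python
from collections import Counter


def problem_edges(triangles):
    """Return edges that are not shared by exactly two triangles.

    `triangles` is an iterable of (a, b, c) vertex-index triples.
    An empty result means the mesh is closed and edge-manifold.
    """
    counts = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[tuple(sorted((u, v)))] += 1  # ignore edge direction
    return sorted(edge for edge, n in counts.items() if n != 2)


if __name__ == "__main__":
    tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
    print(problem_edges(tetrahedron))       # [] because the solid is closed
    print(problem_edges(tetrahedron[:-1]))  # [(0, 2), (0, 3), (2, 3)]
```

A mesh can look fine in a rendered preview and still fail this check, which is why the exported file deserves its own pass before printing. A real pipeline would more likely use a dedicated mesh library or the slicer's repair report.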
Cursor was the speed champion and the quality loser. Cursor 3.5 / Composer 2.5 finished fastest but produced the weakest output. Fast iteration didn't translate to a faithful Pantheon.
That last finding is the one worth chewing on. The past 18 months of model releases have encouraged the assumption that speed and capability scale together. This benchmark suggests that for problems where spatial intuition matters more than throughput, that assumption breaks.
Look at the result thumbnails on ModelRift's page: Antigravity 2.0's output is the only one that captures real Pantheon proportions plus the coffered ceiling. Its lead isn't carried by one outlier feature.

ModelRift is also explicit that none of these outputs would pass as a faithful architectural model. The result is a relative ranking on a single difficult task, not a verdict on which model is "best at CAD." The quality gaps between tools are real, but the floor — every system produced valid, renderable OpenSCAD without hand-written CAD — is higher than the team expected.
Should you swap your tooling based on one benchmark? No, never on one benchmark. But there are some real takeaways.
If you're using LLMs for parametric design work (CAD, generative architecture, 3D printing g-code, Blender Python scripting), this benchmark is a better signal than HumanEval for picking a model. A model that can hold geometric state across a long OpenSCAD generation will probably do fine on FreeCAD macros and Blender bpy scripts too.
The other takeaway is ModelRift's own: fully autonomous generation is not the right workflow for this kind of task yet. The ModelRift / Gemini Flash 3.0 human-in-the-loop run scored 3.8/5 — second overall — by letting a person draw arrows and notes on the rendered output and feed that back to the model. For spatial geometry, that human-in-the-loop step matters even with tier-1 models.
The model that wins your specific task may not be the model that tops the headline scoreboard. Pick benchmarks that match your workload.
One concern worth raising. The ModelRift benchmark is new, small (six runs), and run by the same team that builds ModelRift. It hasn't been independently reproduced. The prompt and reference images are public, which means future model releases could be tuned against them.
So treat this as a snapshot, not a verdict. If Antigravity 2.0 is still on top six months from now, after the next round of Claude, GPT, and Gemini launches, the result will mean a lot more. For now, it's a strong data point in a small dataset.
Antigravity 2.0 leading ModelRift's OpenSCAD benchmark matters for two reasons. First, because spatial reasoning in code is a domain that hasn't been solved. Second, because the gap to the next autonomous run was unusually wide. The benchmark suggests Google's agent-first approach is paying off for tasks where geometric state has to be tracked across long generations.
If you're picking a coding model for general work, Claude Opus 4.7 and Sonnet 4.6 are still defensible defaults. But for anything OpenSCAD-shaped, the data now says try Antigravity 2.0 first — and don't dismiss a human-in-the-loop workflow if you have one available.
Antigravity is Google's agentic coding product. In the ModelRift benchmark it was run with Gemini 3.5 Flash High as the underlying model. Access details and onboarding are at antigravity.google. As of late May 2026, most users hit Antigravity through its dedicated client rather than as a standalone model endpoint.
LLM-generated OpenSCAD should not be used for real building work without significant human review. ModelRift's benchmark scores geometric likeness on a single building, not code compliance, structural engineering, or buildability. Generated OpenSCAD is useful for early-stage massing studies, parametric exploration, and 3D-printed architectural models, but real construction documents still need a licensed architect or engineer in the loop.
On this task, autonomous Sonnet 4.6 produced cleaner massing and more plausible Pantheon proportions than Opus 4.7, even though it took the longest among the autonomous runs. ModelRift's note is that speed did not predict quality — the longer iterate-and-render loop seemed to help models that used the time well. It's a single-task result, not a general claim that Sonnet is stronger than Opus.
A few early efforts exist around code-based 3D generation and shape-from-text, but none are as architecturally specific or as practical as ModelRift's Pantheon evaluation. Expect more domain-specific 3D benchmarks to appear through 2026 as agentic coding tools spread into CAD and parametric design.
Text-to-3D models are not part of this comparison. They generate meshes directly and don't produce editable code, so they aren't comparable. ModelRift's benchmark is strictly a code-generation test using the OpenSCAD CLI. If you need an editable parametric model that an engineer can modify, code-generating LLMs are the right tool. If you need a one-shot mesh for visualization, text-to-3D systems remain faster.