Grok 4.3 vs Grok 4.20: 5 Real Differences That Matter
xAI shipped Grok 4.20 alongside Grok 4.3 with a rebuilt reasoning stack and agentic tool loop. Same 1M context, same base price — where does the switch actually pay off?

xAI has a naming problem. Grok 4.20 isn't a minor patch on 4.3, and the version number makes that genuinely confusing for anyone building on the API. So let's clear it up.
The short answer: Grok 4.20 iterates on Grok 4.3 with a rebuilt agentic tool loop and adaptive reasoning behavior. Both models list the same 1 million token context window in the xAI docs and are priced identically at the base tier, so this is a behavior and reasoning-stack refinement rather than a raw capability jump.
This Grok 4.3 vs Grok 4.20 comparison walks through what actually changed under the hood, where the benchmarks moved, and which one you should be pointing your production traffic at as of mid-2026.
If you're building fresh on xAI and choosing between the two, start on Grok 4.20 — the xAI docs treat it as the newer entry alongside Grok 4.3. If you're already on 4.3, there's no urgent migration pressure: both models share the same 1M context window and the same base pricing.

If you run a high-volume chatbot with short turns, or you're on a tight latency budget for the first-token response, Grok 4.3 is still a reasonable pick. With base pricing identical, the choice comes down to behavior rather than cost.
And if you're just kicking the tires on xAI for the first time, either model is a reasonable starting point since the base tier is identical.
| Feature | Grok 4.3 | Grok 4.20 |
|---|---|---|
| Context window | 1M tokens | 1M tokens |
| Reasoning mode | Configurable | Reasoning and non-reasoning variants |
| Tool calling | Function calling, structured outputs | Function calling, structured outputs, multi-agent variant |
| Vision | Yes | Yes |
| Real-time X data | Yes (server-side search tools) | Yes (server-side search tools) |
| API pricing (input) | $1.25 / 1M tokens | $1.25 / 1M tokens |
| API pricing (output) | $2.50 / 1M tokens | $2.50 / 1M tokens |
| Chatbot Arena Elo | N/A (independent ranking not confirmed) | N/A (independent ranking not confirmed) |
Pricing figures reflect the xAI models documentation at time of writing. Always check current pricing before locking in a contract.
Both Grok 4.3 and Grok 4.20 list a 1 million token maximum prompt length in the xAI docs, so context length is not a differentiator between them. That already puts either model ahead of GPT-4o's 128K ceiling.
As with any long-context model, real-world recall past a few hundred thousand tokens depends on the workload. xAI has not published independent needle-in-a-haystack numbers for either model at the time of writing, so if you rely on retrieval from deep context, benchmark it on your own data before committing.
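If you want to run that kind of recall check yourself, a minimal sketch of the prompt builder might look like the following. Everything here is illustrative: `build_haystack_prompt` and its parameters are hypothetical names, and the filler text is a stand-in for your own documents.

```python
import random


def build_haystack_prompt(
    needle: str,
    filler_sentences: list[str],
    total_sentences: int,
    depth: float,
    seed: int = 0,
) -> str:
    """Return filler text with `needle` inserted at a relative depth.

    depth=0.0 places the needle first, depth=1.0 places it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so runs are repeatable
    haystack = [rng.choice(filler_sentences) for _ in range(total_sentences)]
    haystack.insert(round(depth * total_sentences), needle)
    return " ".join(haystack)
```

Sweep `depth` from 0.0 to 1.0 at several prompt sizes, ask each model to retrieve the needle, and compare hit rates. The sketch counts sentences rather than tokens, so size `total_sentences` against your own tokenizer estimate.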
Grok 4.3 exposes a configurable reasoning setting so you can trade latency for depth on demand. Grok 4.20 ships as two distinct variants in the model list, a reasoning variant and a non-reasoning variant, so you pick the behavior by choosing a model name on each request rather than toggling a setting.
The tradeoff: if your traffic mixes simple lookups and hard planning tasks, you may end up routing between the two 4.20 variants, whereas 4.3 lets one endpoint handle both.
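That routing can be sketched as a small function. This is a rough heuristic under stated assumptions, not xAI's recommended approach: the model names are the example aliases mentioned in this article and should be confirmed against the xAI docs, and `PLANNING_HINTS` and the length cutoff are made-up values you would tune on your own traffic.

```python
# Example aliases from this article; confirm the exact names in the xAI docs.
REASONING_MODEL = "grok-4.20-reasoning"
NON_REASONING_MODEL = "grok-4.20-non-reasoning"

# Hypothetical keyword list; replace with signals from your own workload.
PLANNING_HINTS = ("plan", "prove", "debug", "step by step", "trade-off")


def pick_model(prompt: str, long_prompt_chars: int = 2000) -> str:
    """Send long or planning-heavy prompts to the reasoning variant."""
    text = prompt.lower()
    if len(text) > long_prompt_chars or any(hint in text for hint in PLANNING_HINTS):
        return REASONING_MODEL
    return NON_REASONING_MODEL
```

Keyword matching is crude. A cheap classifier or a per-endpoint default would likely route better in production, but the shape of the decision stays the same.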
Anyone who tried building agents on Grok 4.3 knows the pain. Tool calls worked, but multi-turn tool loops (where the model calls a tool, reads the result, then decides what to call next) were flaky. Anecdotally, about one in five runs would either hallucinate a function that didn't exist or loop on the same call.
Grok 4.20 rebuilt this loop. xAI describes the model as tuned for agentic tool calling with reduced hallucinations, though the company has not published independently verified chained-call accuracy numbers. If your agent chains many tool calls, evaluate 4.20 against Claude Opus 4.7 and current OpenAI models on your own workload.
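Whichever model you pick, it is worth guarding the loop on your side against the two failure modes described above: calls to tools that do not exist and repeated identical calls. Here is a minimal, model-agnostic sketch. `next_call` is a hypothetical stand-in for whatever function asks the model for its next tool call; nothing here is an xAI API.

```python
import json
from typing import Callable, Optional

ToolCall = tuple[str, dict]  # (tool name, JSON-serializable arguments)


def run_tool_loop(
    next_call: Callable[[list], Optional[ToolCall]],
    tools: dict[str, Callable[..., object]],
    max_steps: int = 8,
) -> list:
    """Drive a tool loop, rejecting unknown tools and repeated identical calls."""
    history: list = []
    seen: set[str] = set()
    for _ in range(max_steps):  # hard cap so a confused model cannot loop forever
        call = next_call(history)
        if call is None:  # model signalled that it is done
            break
        name, args = call
        if name not in tools:
            history.append((name, args, "error: unknown tool"))
            continue
        key = name + json.dumps(args, sort_keys=True)
        if key in seen:
            history.append((name, args, "error: repeated call"))
            continue
        seen.add(key)
        history.append((name, args, tools[name](**args)))
    return history
```

The error strings are appended to the history rather than raised, so the model sees them on the next turn and gets a chance to correct itself. Counting those entries per run also gives you a simple flakiness metric to compare models on.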
Both models accept image input alongside text, per the xAI docs. xAI has not published side-by-side vision benchmarks between 4.3 and 4.20 that we could independently verify. If your workload involves parsing screenshots, invoices, or scientific figures, or if OCR quality and chart interpretation matter to you, benchmark both models on representative samples before switching. For rough context on where the frontier sits, Gemini is worth benchmarking against.
The headline feature xAI keeps advertising. Both models can access live X posts via server-side search tools, per the xAI documentation. Concrete latency comparisons between 4.3 and 4.20 have not been published, so measure your own workload if this matters.
This is still the main reason to pick Grok over the alternatives. As far as we know, no other major model provider has legal, licensed access to the full X firehose in real time.
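Since there are no published latency figures for the search tools, a small timing harness is the quickest way to get your own. The sketch below assumes you wrap your actual request in a zero-argument function. `measure_latency` is a hypothetical helper, and it times the whole call, so you would need a streaming variant to capture time to first token.

```python
import statistics
import time
from typing import Callable


def measure_latency(call: Callable[[], object], runs: int = 5) -> dict[str, float]:
    """Time `call` several times and report median and worst-case seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {"median_s": statistics.median(samples), "max_s": max(samples)}
```

Run it with the same search-backed prompt against both models, at the time of day your traffic actually peaks, and compare medians rather than single runs.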
xAI lists both Grok 4.3 and Grok 4.20 at the same base rate in the models documentation.
| Tier | Grok 4.3 | Grok 4.20 |
|---|---|---|
| Input tokens | $1.25 per million | $1.25 per million |
| Output tokens | $2.50 per million | $2.50 per million |
| Vision | Same as text | Same as text |
Base pricing is identical, so cost is not a reason to prefer one over the other. Long-context requests over the 200K threshold are priced separately at a higher rate — verify the current xAI pricing for your specific traffic mix before committing.
And yes, both models require a paid subscription. There's no free tier for the API, unlike DeepSeek or the free tier on some Google models.
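To turn the base rates into a budget number, the arithmetic is simple enough to script. This sketch uses the $1.25 and $2.50 per million figures from the table above and the 200K long-context threshold mentioned in this section. It deliberately refuses to price long-context requests, because the higher rate is not quoted here.

```python
INPUT_USD_PER_M = 1.25   # base input rate quoted in this article
OUTPUT_USD_PER_M = 2.50  # base output rate quoted in this article
LONG_CONTEXT_THRESHOLD = 200_000  # above this, a separate rate applies


def estimate_request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the base-tier cost of one request in US dollars."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        raise ValueError("long-context pricing applies; check current xAI rates")
    return (
        input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    ) / 1_000_000
```

For example, a request with 1,500 input tokens and 400 output tokens costs about $0.002875, so 10,000 such requests come to roughly $28.75 on either model.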
Here is where we have to be honest: xAI hasn't published a full independent benchmark suite for Grok 4.20 at the time of writing, so we're working with the vendor's own materials and early community testing. Always take internal benchmarks with skepticism.
Grok 4.3 and Grok 4.20 will each need independent LMSYS Chatbot Arena rankings before anyone can honestly compare them to Claude and GPT frontier models. Until then, treat any single-source benchmark score as marketing rather than data.

On coding, xAI has not published verified SWE-bench Verified numbers for Grok 4.20 that we could confirm against the official SWE-bench leaderboard. Community estimates circulate, but until an independent submission lands, any specific coding benchmark score for 4.20 is best treated as unverified.
My honest read: Grok 4.20 looks like a solid refinement of 4.3 rather than a frontier leap. If you're already in the xAI ecosystem, or you need the X data integration, it's worth trying. If you're comparison shopping for the hardest coding or reasoning work, run your own evals against Claude and GPT before committing.
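Running your own evals does not need heavy tooling to start. A minimal sketch, assuming you wrap each provider's client in a plain function that takes a prompt and returns text: `compare_models` is a hypothetical helper, and substring matching is a deliberately crude scoring rule that you would replace with task-specific checks.

```python
from typing import Callable


def compare_models(
    models: dict[str, Callable[[str], str]],
    cases: list[tuple[str, str]],
) -> dict[str, float]:
    """Return, per model, the fraction of cases whose answer contains the expected text."""
    scores = {}
    for name, ask in models.items():
        hits = sum(
            expected.lower() in ask(prompt).lower() for prompt, expected in cases
        )
        scores[name] = hits / len(cases)
    return scores
```

Feed it a few dozen prompts drawn from your real workload. A small, representative set tells you more about your use case than a leaderboard score does.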
A fair comparison has to name what's still broken. Grok as a family still has weaker instruction-following on complex system prompts than Claude does. The API SDK is thinner than what OpenAI ships. Documentation, while improved, still lags behind Anthropic's. And the rate limits on lower-tier accounts feel stingy compared to what you get on OpenRouter with the same spend.
None of this is a dealbreaker. But if you're picking a first model for a serious production build, Claude or GPT still get the safer nod. Grok is the interesting option when you need something they can't do, and X data access is basically the only thing in that category right now.
For a broader look at how Grok stacks up against Anthropic's reasoning model, see our Grok 4.3 vs Claude Fable 5 comparison. Comparing across ecosystems? Our GPT vs Claude Opus 4.6 showdown covers the two other frontier options you should be evaluating alongside Grok.
Grok 4.20 is xAI's newer entry, but at the base tier it shares the 1M context window and the $1.25/$2.50 pricing of Grok 4.3. The differences are in reasoning behavior and tool-calling tuning, not a raw capability jump. If you're on 4.3 today and it works, there is no urgent reason to migrate.
Likewise, if you're using Grok as a fast, cheap chat backend and it's working fine, don't fix what isn't broken. Grok 4.3 is still supported and still fast.
And if you're comparison shopping across the whole model space, Grok isn't the top-tier frontier choice for coding or reasoning. It's a solid second-tier model with one genuinely unique feature (X data). For most teams, that's enough to keep it in the rotation. Not enough to make it the default.
Mostly yes. Existing chat completion requests generally work without changes. The Grok 4.20 model family exposes separate reasoning and non-reasoning variants (for example, `grok-4.20-reasoning` and `grok-4.20-non-reasoning`), so if you were toggling reasoning behavior on 4.3, you may want to select the appropriate 4.20 variant at the model level. Always confirm parameter compatibility against the current [xAI docs](https://docs.x.ai/) before migrating.
Yes, both OpenRouter and Poe added Grok 4.20 within weeks of launch. OpenRouter tends to charge a small markup over xAI direct rates but gives you unified billing across models. Poe is better for individual consumer use, not production API traffic.
Not publicly as of July 2026. xAI has hinted at a fine-tuning API in their roadmap but has not shipped it. If you need custom model behavior, you're limited to prompt engineering and RAG. Anthropic and OpenAI both offer more mature customization paths.
Grok 4.3 remains in active support with no announced deprecation date, but xAI historically deprecates older models within 12 to 18 months of a major successor. Plan for a migration window sometime in 2027 if you're building new infrastructure on 4.3 today.
Rate limits vary by tier and by whether you're using the reasoning or non-reasoning variant. The xAI docs list base tier RPM allowances directly on the [models page](https://docs.x.ai/docs/models). If you're planning burst traffic, check the current limits for your specific model alias and consider requesting a limit increase before migrating production workloads.
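Until a limit increase comes through, client-side retries with exponential backoff soften the impact of hitting the cap. The sketch below is generic: `RateLimited` is a hypothetical exception you would raise from your own wrapper when the API returns an HTTP 429, and the delay values are illustrative.

```python
import random
import time
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error raised by your client wrapper on an HTTP 429."""


def call_with_backoff(
    call: Callable[[], object],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Retry `call` on RateLimited, doubling the wait each time plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            delay = base_delay_s * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so you can inject a stub in tests instead of actually waiting.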