Qwen3.5-9B Crushes GPT on Documents—But Has a Glaring Weak Spot
Benchmark data shows Qwen3.5-9B beats frontier models on OCR and field extraction, yet stumbles badly on tables. Here's the honest breakdown.

What if a 9 billion parameter open-source model could outperform GPT and Claude on document understanding tasks? According to recent benchmark results from the LocalLLaMA community, that's exactly what's happening—but only in specific document domains. This Qwen3.5-9B benchmark comparison breaks down exactly where the small model wins and where it falls flat.
As of March 16, 2026, the Qwen3.5 family has been thoroughly tested across 9,000+ real-world documents in nine different task categories, as documented on the IDP Leaderboard. The results are messy. They're contradictory. And they're way more interesting than "small model is slower than big model."
Qwen3.5-9B isn't a universal document champion. But in text extraction and field extraction? It's genuinely competitive with models that cost 100x more to run via API. That's worth understanding if you're building a local document processing pipeline.
| Use Case | Best Choice | Runner-Up | Why |
|---|---|---|---|
| Text extraction (OCR) | Qwen3.5-9B (78.1) | Gemini 3.1 Pro (74.6) | Qwen wins across all sizes; production-ready for messy PDFs |
| Question answering on docs | Gemini 3.1 Pro (85.0) | Qwen3.5-9B (79.5) | Qwen edges GPT-5.4 by 1.3 points; closes frontier gap significantly |
| Invoice/form field extraction | Gemini 3 Flash (91.1) | Qwen3.5-9B (86.5) | Qwen matches Gemini 3.1 Pro; beats GPT mini models |
| Table extraction | Gemini 3.1 Pro (96.4) | Claude Sonnet 4.6 (96.3) | Qwen maxes out at 76.6—clear architectural weakness |
| Handwriting recognition | Gemini 3.1 Pro (82.8) | GPT-4.1 (75.6) | Qwen trails by 17 points; not suitable for handwritten docs |
| Cost-optimized local processing | Qwen3.5-4B (77.2 OCR) | Qwen3.5-2B | 4B model is 95% of 9B's performance at half the VRAM |
The core insight: Qwen3.5-9B punches above its weight class on extraction tasks but hits an invisible ceiling on structured data (tables). This isn't about model size—it's about training data and architecture.
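That insight lends itself to a simple routing rule. The sketch below encodes scores from the tables in this article and prefers the local Qwen model unless a hosted model leads by a clear margin; the `choose_model` helper and the 3-point margin are illustrative assumptions, not a shipped API.

```python
# Task-based router sketch. Scores are from the IDP Leaderboard tables
# in this article; the helper itself and the margin are assumptions.
BENCHMARK = {
    "ocr":   {"qwen3.5-9b": 78.1, "gemini-3.1-pro": 74.6, "gpt-5.4": 73.4},
    "vqa":   {"qwen3.5-9b": 79.5, "gemini-3.1-pro": 85.0, "gpt-5.4": 78.2},
    "kie":   {"qwen3.5-9b": 86.5, "gemini-3-flash": 91.1, "gpt-5.4": 85.7},
    "table": {"qwen3.5-9b": 76.6, "gemini-3.1-pro": 96.4, "gpt-5.4": 94.8},
}

def choose_model(task: str, local_margin: float = 3.0) -> str:
    """Prefer the local Qwen model unless a hosted model leads by more
    than local_margin benchmark points on this task."""
    scores = BENCHMARK[task]
    best = max(scores, key=scores.get)
    if scores[best] - scores["qwen3.5-9b"] <= local_margin:
        return "qwen3.5-9b"  # close enough: stay local and cheap
    return best

print(choose_model("ocr"))    # qwen3.5-9b (it leads outright)
print(choose_model("table"))  # gemini-3.1-pro (19.8-point gap)
```

With a 3-point margin, OCR stays local while VQA, KIE, and tables escalate to hosted models; widen the margin and more work stays on your own GPU.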
This is Qwen's strongest domain. The benchmark tests text extraction from dense PDFs, multi-column layouts, and poor-quality scans—the kind of real-world chaos that breaks most document AI systems.
As of March 16, 2026:
| Model | OlmOCR Score | Delta from Qwen-9B |
|---|---|---|
| Qwen3.5-9B | 78.1 | — |
| Qwen3.5-4B | 77.2 | -0.9 |
| Gemini 3.1 Pro | 74.6 | -3.5 |
| Claude Sonnet 4.6 | 74.4 | -3.7 |
| Qwen3.5-2B | 73.7 | -4.4 |
| GPT-5.4 | 73.4 | -4.7 |
Why does a smaller model beat Gemini and Claude here? The most plausible explanation is specialization: Qwen3.5 was trained with document OCR as a first-class objective, while frontier generalists treat it as one task among hundreds.
And here's the kicker—the 4B model is only 0.9 points behind the 9B. For local deployment, that changes the calculus entirely. You get near-identical OCR performance at a fraction of the memory footprint.
If you're extracting text from scanned documents at scale, Qwen3.5-9B is the best open model available. Full stop.
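In a local deployment, Qwen is typically served behind an OpenAI-compatible endpoint (vLLM and Ollama both expose one). A minimal sketch of building the OCR request payload; the model name `qwen3.5-9b` and the prompt wording are assumptions, while the message shape is the standard vision format such servers accept.

```python
import base64

def ocr_request(image_bytes: bytes, model: str = "qwen3.5-9b") -> dict:
    """Build an OpenAI-compatible chat payload for page-level OCR.
    The model name and prompt wording are assumptions; the message
    shape is the standard vision format vLLM/Ollama accept."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "temperature": 0.0,  # deterministic transcription
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all text on this page in reading order."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = ocr_request(b"\x89PNG...")  # placeholder bytes; use a real scan
```

POST this payload to your server's `/v1/chat/completions` route and the transcription comes back as ordinary chat content.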
This benchmark tests whether the model can answer questions about document content—tables, charts, text blocks, anything visual in a PDF.
As of March 16, 2026:
| Model | VQA Score | Delta |
|---|---|---|
| Gemini 3.1 Pro | 85.0 | — |
| Qwen3.5-9B | 79.5 | -5.5 |
| GPT-5.4 | 78.2 | -6.8 |
| Qwen3.5-4B | 72.4 | -12.6 |
| Claude Sonnet 4.6 | 65.2 | -19.8 |
| Gemini 3 Flash | 63.5 | -21.5 |
This one's surprising. The 9B model is second only to Gemini 3.1 Pro and beats GPT-5.4 by 1.3 points. It crushes Claude Sonnet by 14 points. That's not a rounding error—that's a meaningful, production-relevant gap.

The gap between 4B and 9B here (7.1 points) tells a different story than OCR. Visual reasoning scales harder than pure text extraction. But even the 4B hitting 72.4 is respectable. And you definitely want the 9B for VQA tasks.
KIE tests the model's ability to pull specific fields from documents: invoice numbers, dates, amounts, vendor names, line items.
As of March 16, 2026:
| Model | KIE Score | Notes |
|---|---|---|
| Gemini 3 Flash | 91.1 | Tiny model punching hard |
| Claude Opus 4.6 | 89.8 | Strongest Claude variant here |
| Claude Sonnet 4.6 | 89.5 | Still excellent |
| GPT-5.2 | 87.5 | Mid-tier GPT |
| Gemini 3.1 Pro | 86.8 | — |
| Qwen3.5-9B | 86.5 | Matches Gemini 3.1 Pro |
| Qwen3.5-4B | 86.0 | Essentially production-equivalent |
| GPT-5.4 | 85.7 | — |
Qwen3.5-9B effectively ties Gemini 3.1 Pro (86.5 vs 86.8) and sits ahead of GPT-5.4 and Claude Haiku 4.5. This is the use case that actually matters for most enterprises: invoices, receipts, contracts. A 9B open model that matches frontier performance here is a cost-saving home run.
And again, the 4B model is only 0.5 points lower. That's basically noise in production systems.
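A KIE score in the mid-80s still means roughly one field in eight comes back wrong, so production pipelines usually validate the model's output before trusting it. A hypothetical sketch; the field names (`invoice_number`, `date`, `total`) and the rules are illustrative, not a spec.

```python
import re
from datetime import datetime

def validate_invoice_fields(fields: dict) -> list[str]:
    """Return a list of problems in a KIE result; empty means clean.
    Field names and rules here are illustrative assumptions."""
    problems = []
    if not re.fullmatch(r"[A-Z0-9\-]{4,20}", fields.get("invoice_number", "")):
        problems.append("invoice_number malformed")
    try:
        datetime.strptime(fields.get("date", ""), "%Y-%m-%d")
    except ValueError:
        problems.append("date not ISO formatted")
    try:
        if float(fields.get("total", "x")) <= 0:
            problems.append("total not positive")
    except ValueError:
        problems.append("total not numeric")
    return problems

print(validate_invoice_fields(
    {"invoice_number": "INV-2026-0042", "date": "2026-03-16", "total": "1249.50"}
))  # []
```

Anything the validator flags can be retried or escalated to a frontier model instead of silently entering your ledger.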
This is Qwen's ceiling, and it's depressing to look at.
| Model | GrITS Score | Delta |
|---|---|---|
| Gemini 3.1 Pro | 96.4 | — |
| Claude Sonnet 4.6 | 96.3 | -0.1 |
| Gemini 3 Pro | 95.8 | -0.6 |
| GPT-5.4 | 94.8 | -1.6 |
| GPT-5.2 | 86.0 | -10.4 |
| Gemini 3 Flash | 85.6 | -10.8 |
| Qwen3.5-4B | 76.7 | -19.7 |
| Qwen3.5-9B | 76.6 | -19.8 |
Frontier models score between 85 and 96. Qwen maxes out at 76.7. The truly revealing part: the 4B and 9B scores are essentially identical. This isn't a scale problem. This is an architecture or training data problem.

Either Qwen wasn't trained heavily on tables, or the table data it did see doesn't transfer to complex, real-world layouts. Whatever the reason, tables are a hard no for Qwen3.5 right now.
If table extraction is in your document pipeline, you need a frontier model or a specialized tool. Qwen won't cut it.
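That split suggests a hybrid pipeline: keep everything local by default and escalate only table-heavy pages. A sketch with a deliberately crude table detector standing in for a real layout model; the routing labels are placeholders.

```python
def has_table(page_text: str) -> bool:
    """Crude stand-in for a real table detector (a layout model would
    do this properly): look for repeated grid-like delimiter rows."""
    grid_rows = [ln for ln in page_text.splitlines() if ln.count("|") >= 2]
    return len(grid_rows) >= 3

def route_page(page_text: str) -> str:
    # Escalate table-heavy pages to a frontier model; keep the rest local.
    return "frontier-table-model" if has_table(page_text) else "qwen3.5-9b-local"

plain = "Invoice 42\nTotal: $10\nThanks"
grid = "| a | b |\n| 1 | 2 |\n| 3 | 4 |"
print(route_page(plain))  # qwen3.5-9b-local
print(route_page(grid))   # frontier-table-model
```

Since table pages are usually a minority of a corpus, this keeps most of the volume on the cheap local path while the hard cases get frontier-grade extraction.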
| Model | Handwriting OCR Score |
|---|---|
| Gemini 3.1 Pro | 82.8 |
| Gemini 3 Flash | 81.7 |
| GPT-4.1 | 75.6 |
| Claude Opus 4.6 | 74.0 |
| Claude Sonnet 4.6 | 73.7 |
| GPT-5.4 | 69.1 |
| Ministral-8B | 67.8 |
| Qwen3.5-9B | 65.5 |
| Qwen3.5-4B | 64.7 |
Qwen trails by 17 points versus Gemini 3.1 Pro. Not production-ready for handwritten documents. If your use case involves signed forms, doctor's notes, or handwritten annotations, Qwen isn't the answer.
Why does Qwen dominate text extraction but fail at tables? The answer lies in how Qwen3.5 was built.

According to Alibaba's official documentation, Qwen3.5 was trained with a specific emphasis on document understanding tasks, and the benchmark results above mirror that specialization almost exactly: strongest where the training focus was, weakest where it wasn't.
Frontier models (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro) treat documents as one task among hundreds. They have more model capacity and more diverse training data. They can solve any document task—just not necessarily with the same efficiency as a model purpose-built for that task.
Qwen3.5-9B is the opposite: highly specialized, narrowly excellent.

| Task | Qwen-9B | GPT-5.4 | Claude Sonnet 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| OCR (messy docs) | 78.1 | 73.4 | 74.4 | 74.6 |
| Visual Q&A | 79.5 | 78.2 | 65.2 | 85.0 |
| Field extraction | 86.5 | 85.7 | 89.5 | 86.8 |
| Table extraction | 76.6 | 94.8 | 96.3 | 96.4 |
| Handwriting OCR | 65.5 | 69.1 | 73.7 | 82.8 |
| **Win categories** | **1** | **0** | **1** | **3** |
Qwen wins decisively on OCR and comes in a strong second on VQA. Gemini wins 3 categories outright (VQA, tables, handwriting). Claude takes field extraction. GPT-5.4 doesn't top any category. But factor in cost, and Qwen's value proposition on OCR and field extraction is unmatched.
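The win-category tally can be recomputed directly from the scores, which is handy whenever a new leaderboard snapshot lands:

```python
# Recount the "win categories" row from the head-to-head scores above.
scores = {
    "OCR":         {"Qwen-9B": 78.1, "GPT-5.4": 73.4, "Claude Sonnet 4.6": 74.4, "Gemini 3.1 Pro": 74.6},
    "VQA":         {"Qwen-9B": 79.5, "GPT-5.4": 78.2, "Claude Sonnet 4.6": 65.2, "Gemini 3.1 Pro": 85.0},
    "KIE":         {"Qwen-9B": 86.5, "GPT-5.4": 85.7, "Claude Sonnet 4.6": 89.5, "Gemini 3.1 Pro": 86.8},
    "Tables":      {"Qwen-9B": 76.6, "GPT-5.4": 94.8, "Claude Sonnet 4.6": 96.3, "Gemini 3.1 Pro": 96.4},
    "Handwriting": {"Qwen-9B": 65.5, "GPT-5.4": 69.1, "Claude Sonnet 4.6": 73.7, "Gemini 3.1 Pro": 82.8},
}
wins = {model: 0 for model in scores["OCR"]}
for task_scores in scores.values():
    wins[max(task_scores, key=task_scores.get)] += 1
print(wins)  # {'Qwen-9B': 1, 'GPT-5.4': 0, 'Claude Sonnet 4.6': 1, 'Gemini 3.1 Pro': 3}
```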
Qwen3.5-9B doesn't just win on benchmarks—it wins on economics.
Exact API prices (per 1M input tokens) and 8-bit inference speeds vary by provider and hardware, but the rough economics are stable:
If you process 1 million documents per month with OCR + field extraction, running Qwen3.5-4B locally costs you roughly $200–$500/month in GPU cloud time. The same workload via frontier model APIs could cost $3,000–$10,000/month depending on the provider.
That's not a nice-to-have difference. That's a business model difference.
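Those monthly figures follow from simple arithmetic. The sketch below reproduces them under stated assumptions; the GPU price, throughput, API price, and tokens-per-document numbers are illustrative, chosen to land inside the ranges quoted in this article.

```python
def monthly_cost_local(gpu_hourly_usd: float, docs_per_month: int,
                       docs_per_gpu_hour: int) -> float:
    """Cloud-GPU cost for a locally hosted model; throughput is an assumption."""
    return gpu_hourly_usd * docs_per_month / docs_per_gpu_hour

def monthly_cost_api(price_per_mtok_usd: float, docs_per_month: int,
                     tokens_per_doc: int) -> float:
    """API cost; input tokens per document is an assumption."""
    return price_per_mtok_usd * docs_per_month * tokens_per_doc / 1_000_000

# Illustrative numbers only: a $1.50/hr GPU pushing 4,000 docs/hour with
# the 4B model, vs a $3/MTok API at ~1,500 input tokens per page.
local = monthly_cost_local(1.50, 1_000_000, 4000)
api = monthly_cost_api(3.0, 1_000_000, 1500)
print(round(local), round(api))  # 375 4500
```

Swap in your own provider's pricing and measured throughput; the order-of-magnitude gap survives most reasonable assumptions.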
Qwen3.5-9B doesn't replace frontier models. It displaces them in specific, high-volume use cases where it's demonstrably better at the task and dramatically cheaper.
The honest take from this Qwen3.5-9B benchmark comparison, as of March 16, 2026: if you're building a document processing system, you now have a real choice. You don't have to use an expensive API. You can run 9B parameters locally, get world-class OCR and field extraction, and pocket the cost savings.
That's a meaningful shift in what's practical for enterprise document AI.
One more decision to make in this Qwen3.5-9B benchmark comparison: which size?
| Model | VRAM (8-bit) | OCR Score | VQA Score | KIE Score | Inference Speed | Recommendation |
|---|---|---|---|---|---|---|
| Qwen3.5-2B | ~4–5GB | 73.7 | N/A | ~85 | 150+ tok/s | Budget tier, field extraction only |
| Qwen3.5-4B | ~7–8GB | 77.2 | ~72.4 | 86.0 | 80–100 tok/s | Best cost/performance ratio |
| Qwen3.5-9B | ~12–15GB | 78.1 | 79.5 | 86.5 | 40–60 tok/s | Production-grade OCR + VQA |
Our pick: Qwen3.5-4B for most workloads. The 9B buys you 0.9 points on OCR and 7 points on VQA, but costs 2x the VRAM and runs 2x slower. Unless you're specifically optimizing for maximum accuracy, the 4B is the sweet spot.
But if you have the GPU hardware and your documents are genuinely messy (multi-language, poor scan quality, dense layouts), the 9B's 78.1 OlmOCR score justifies the extra resources.
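The VRAM column above is roughly "one byte per parameter at 8-bit, plus a few gigabytes of headroom for KV cache and runtime". A back-of-the-envelope sketch; the flat 3 GB overhead constant is a rule of thumb, not a measurement.

```python
def vram_gb(params_billion: float, bits: int = 8,
            overhead_gb: float = 3.0) -> float:
    """Weights at the given precision plus a flat headroom allowance
    for KV cache, activations, and runtime. The 3 GB constant is a
    rule of thumb, not a measurement."""
    weight_gb = params_billion * bits / 8  # 1 byte per param at 8-bit
    return weight_gb + overhead_gb

for size in (2, 4, 9):
    print(f"Qwen3.5-{size}B needs roughly {vram_gb(size):.0f} GB")
# Lands near the low end of the table's ranges (5, 7, and 12 GB).
```

Long contexts inflate the KV cache well past the flat allowance, which is why the table's upper bounds sit a few GB higher.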
**Is Qwen3.5-9B actually better than GPT and Claude?** It depends on the task. Qwen3.5-9B beats both on OCR (78.1 vs 73.4 for GPT-5.4 and 74.4 for Claude Sonnet 4.6) and exceeds GPT-5.4 on field extraction (86.5 vs 85.7), though it trails Claude Sonnet 4.6 there (89.5). But it loses decisively on table extraction (76.6 vs 94.8 GPT, 96.3 Claude Sonnet 4.6) and handwriting (65.5 vs 69.1 GPT, 73.7 Claude Sonnet 4.6).
**Should I run the 4B or the 9B?** Use Qwen3.5-4B unless you specifically need maximum OCR or VQA accuracy. The 4B scores 77.2 on OCR (only 0.9 points behind the 9B), runs at 2x the speed, and uses half the VRAM. Cost per token is essentially zero, so speed matters more than model size for most enterprises.
**Why is Qwen3.5 so weak at tables?** This appears to be an architecture or training data limitation, not a scale problem—both Qwen3.5-4B and 9B max out around 76.6 on GrITS. Qwen was trained heavily on OCR and form documents but was not optimized for table understanding. Frontier models with broader training data handle tables much better (95+).
**How much cheaper is running Qwen locally than using APIs?** Qwen3.5-9B local: ~$200–$500/month in cloud GPU time. Claude Sonnet 4.6 API: ~$3,000–$5,000/month for equivalent volume (1M documents). GPT-5.4: ~$2,000–$10,000/month. Local open models save 80–90%+ on high-volume document processing.
**Can Qwen3.5 handle an entire document pipeline on its own?** No. Use Qwen for OCR and field extraction (where it excels), then pass results to a frontier model (Claude, Gemini, GPT) for table extraction or handwritten content. This hybrid approach cuts API costs by 70% while maintaining accuracy where needed.
**Is Qwen3.5-9B good enough for invoice processing?** Yes. On key information extraction (KIE), Qwen3.5-9B scores 86.5, matching Gemini 3.1 Pro and beating GPT-5.4 (85.7). It's production-grade for field extraction workloads and dramatically cheaper than API alternatives.
March 17, 2026 · 10 min read