Rio3.5 vs Qwen3.7: Why This Viral Benchmark Smells Off
A tweet claims Rio de Janeiro's city government built an LLM that beats Qwen3.7. No paper, no leaderboard, no weights. Here's how to read claims like this.

A tweet posted on June 22, 2026 claims that a model called Rio3.5, supposedly trained by Rio de Janeiro's city government, has beaten Qwen3.7 across an unspecified set of benchmarks. The source making the rounds on Hacker News is a single Twitter post from @zenmagnets. No paper. No leaderboard entry. No weights on HuggingFace. No press release from the Prefeitura do Rio.
And yet the claim is everywhere.
So let's slow down. Because this is exactly the kind of viral AI announcement that needs scrutiny before it becomes received wisdom, and the Rio3.5 benchmark claim is the perfect case study in what to look for.
As of late June 2026, the Rio3.5 benchmark claim is unverified. There's no published evaluation methodology, no model card on HuggingFace, no entry on Papers with Code, and no official announcement from any Rio de Janeiro municipal agency. The only primary source is a single social media post.

That doesn't mean it's fake. It means we have nothing to evaluate.
The original post asserts that Rio3.5 outperforms Qwen3.7 on "recent benchmarks." Pay attention to what's missing:
Without those, the headline is meaningless. You could say a calculator app "beats GPT-5 on multiplication benchmarks" and technically be correct.
City governments don't typically train frontier LLMs. The compute alone for a competitive 70B+ model runs into eight figures, and the talent needed sits at Anthropic, OpenAI, Google DeepMind, Alibaba, DeepSeek, and a small handful of well-funded labs. Brazil has real AI research happening at USP and at private players like Maritaca, but a municipal government training a model that outperforms Alibaba's flagship would be genuinely extraordinary.
Extraordinary claims, extraordinary evidence, you know the rest.
The Qwen team has been on a tear. Qwen2.5-72B was already strong, and the Qwen3 series pushed scores into territory that gives Claude and Gemini real competition on coding and multilingual tasks. For a municipal project to leapfrog that without any of the typical signals (preprints, dataset releases, ablations) is the kind of thing you'd want triple-sourced before repeating.
Compare the Rio3.5 announcement to how genuine releases land. When DeepSeek dropped V3, you got a detailed technical report; full benchmark tables across MMLU, HumanEval, MATH, and GSM8K; weights on HuggingFace; and reproducible inference code, all within 48 hours.

Here's what a credible benchmark announcement contains:
| Element | Why it matters |
|---|---|
| Technical report or paper | Describes architecture, training data, eval methodology |
| Public weights or API access | Lets others reproduce results |
| Standard benchmark suite | MMLU, GPQA, HumanEval, SWE-bench, etc., not custom ones |
| Comparison baselines run by the same team | Apples-to-apples, same prompts, same temperature |
| Failure modes documented | Honest labs publish where their model loses |
The Rio3.5 post has zero of these. That's not a small gap.
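The table above can be turned into a small checklist you run against any announcement. The following is a minimal sketch, not a standard tool: the `ClaimEvidence` class, its field names, and `missing_elements` are hypothetical names chosen for illustration, with one flag per table row.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ClaimEvidence:
    """One flag per row of the table above; True means it was published.

    Field names are illustrative assumptions, not an established schema.
    """
    technical_report: bool = False
    public_weights_or_api: bool = False
    standard_benchmark_suite: bool = False
    same_team_baselines: bool = False
    failure_modes_documented: bool = False


def missing_elements(evidence: ClaimEvidence) -> list[str]:
    """Return the names of the elements the announcement did not provide."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


# The Rio3.5 post as described in this article: none of the five elements.
rio_post = ClaimEvidence()
print(missing_elements(rio_post))
```

An empty list means the announcement covers every row. Anything else tells you exactly what to ask for before repeating the claim.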
For context, here's where the verified frontier sits as of June 2026, according to the published SWE-bench Verified leaderboard, which scores models on real coding tasks. Claude 4.5 Opus ranks at the top at roughly 79%, with Gemini 3 Pro Preview close behind near 77% and GPT-5.2 in the same range. DeepSeek V3.2 Reasoner is the strongest open-weights system on the public board.
For MMLU and GPQA Diamond, scores from leading closed models cluster tightly at the top of the public leaderboards, but specific cells move week to week as new variants and reasoning settings get added. Treat any "X% on MMLU" number as version- and prompt-sensitive, and check it directly on Papers with Code before quoting.
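Because scores are version- and prompt-sensitive, one way to stay honest is to store each number together with the settings that produced it and refuse to compare numbers whose settings differ. The sketch below assumes a hypothetical `ReportedScore` record. The field names, model names, and scores are placeholders for illustration, not real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedScore:
    """A benchmark number plus the settings it was measured under."""
    model: str
    model_version: str
    benchmark: str
    score: float            # percent
    prompt_template: str
    temperature: float
    num_shots: int


# Settings that must match before two scores can be compared directly.
SETTINGS = ("benchmark", "prompt_template", "temperature", "num_shots")


def mismatched_settings(a: ReportedScore, b: ReportedScore) -> list[str]:
    """List the evaluation settings on which the two scores disagree."""
    return [name for name in SETTINGS if getattr(a, name) != getattr(b, name)]


def score_gap(a: ReportedScore, b: ReportedScore) -> float:
    """Return a.score - b.score, but only for apples-to-apples settings."""
    diffs = mismatched_settings(a, b)
    if diffs:
        raise ValueError(f"not comparable, settings differ: {diffs}")
    return a.score - b.score


# Placeholder entries; the numbers are made up for the example.
a = ReportedScore("model-a", "v1", "MMLU", 70.0, "plain", 0.0, 5)
b = ReportedScore("model-b", "v2", "MMLU", 65.0, "plain", 0.7, 5)
print(mismatched_settings(a, b))
```

Here the two entries differ in temperature, so `score_gap(a, b)` raises instead of returning a misleading difference.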
Notice what's not on any of these lists? Rio3.5. And notice Qwen3.7 also isn't sitting at the top of any of them, so even the comparison target in the tweet is a bit of a strawman. Beating Qwen3.7 on some unspecified suite (assuming it happened at all) wouldn't make Rio3.5 a frontier model. It would make it a respectable mid-tier release.
When a tweet hits the Hacker News front page with an AI angle, the comment section does about 40% of the verification work that journalists used to do. And honestly, that's usually a good thing. But the headline still spreads faster than the corrections.
A few patterns from past viral AI claims that turned out to be partial or wrong:
This isn't to say Rio3.5 fits that pattern. It's to say the pattern exists and we should require evidence.
Next time something like this trends, run through this checklist before retweeting:
If the claim survives all five, you've got something real. If it fails any of them, you've got marketing or noise.
If you're building on LLMs, picking models based on Twitter screenshots is a bad strategy. The frontier moves fast, but it moves in public, with papers, weights, and reproducible evals. Stick with models that have those.
Claude 4.5 Opus, the GPT-5.2 family, Gemini 3 Pro, and DeepSeek V3.2 Reasoner are the verified top of the SWE-bench Verified stack for general coding work right now. For open weights you can actually run, the Qwen3 family and DeepSeek V3.2 both have published numbers you can trust. Mistral Large 3 is solid for cost-conscious deployments at $2/$6 per million tokens.

And if Rio3.5 turns out to be real, with weights, a paper, and reproducible scores? Great. We'll cover it then.
Until that happens, the right response to a viral benchmark tweet is the same as it's always been: cool story, where's the eval?
Benchmark theater is a recurring problem in AI. Labs cherry-pick suites where they win. Influencers amplify whichever take gets clicks. And the actual signal (how well does this thing work on your specific workload) gets buried under leaderboard chasing.
The Rio3.5 claim might be a complete fabrication, a real but exaggerated municipal pilot, or something in between. We don't know yet, and pretending otherwise would be dishonest. What we do know is that the bar for "X beats Y on benchmarks" should be a paper plus reproducible code, not a screenshot. Hold every model claim to that bar, including the ones you want to believe.
And treat unsourced viral AI tweets the way you'd treat any other unsourced viral tweet. With your skepticism turned all the way up.
Sources
As of June 23, 2026, there are no Rio3.5 weights published on HuggingFace, GitHub, or any official Rio de Janeiro municipal portal. The only public reference is a single Twitter post. If weights are released later, they'll most likely appear on HuggingFace under a verified organization account with a model card and license.
As of June 23, 2026, there is no model named 'Qwen3.7' published by Alibaba's Qwen team on HuggingFace. The current public Qwen3 lineup includes Qwen3-235B (an MoE model), Qwen3-30B variants, and smaller dense models like Qwen3-8B. A 'Qwen3.7' name in a viral tweet should be treated as unverified until it appears on the official Qwen organization page on HuggingFace.
Community verification of a viral benchmark claim typically takes 24 to 72 hours. Independent researchers on Hugging Face and Reddit's r/LocalLLaMA usually attempt replication within a day if weights are available. If no weights drop within a week of a viral claim, the odds of it being legitimate fall sharply. The Reflection 70B episode in 2024 is the canonical example of how fast community verification works.
For general capability, LMSYS Chatbot Arena (now lmarena.ai) tracks blind human preference and is hard to game. For coding, SWE-bench Verified is the gold standard since it uses real GitHub issues. For reasoning, GPQA Diamond and ARC-AGI-2 are the toughest. Cross-reference at least two leaderboards before drawing conclusions about any model.
Government-backed LLMs do exist, but they're rare and usually national, not municipal. Examples include the UAE's Falcon models from TII and Singapore's SEA-LION. France's national AI strategy has supported Mistral, though Mistral itself is a private company. Municipal-government-trained frontier LLMs are unprecedented because the compute and talent costs sit far beyond city budgets. State or national funding is the realistic minimum.