5 Big Upgrades in Google's Gemini 3.1 Flash Live
Google just dropped Gemini 3.1 Flash Live — a real-time audio AI model with 2x longer conversation tracking, 90+ languages, and seriously better noise filtering. Here's what matters.

Gemini 3.1 Flash Live is Google's newest audio-to-audio model built for real-time voice conversations, and it's rolling out today across Gemini Live, Search Live, and the developer API. Google calls it their "highest-quality audio and voice model yet" — a direct replacement for Gemini 2.5 Flash Native Audio with meaningful improvements in latency, noise handling, and conversation memory.
That's a bold claim from a company that's been iterating on voice AI at breakneck speed. But based on the benchmarks and product rollout, this one actually backs it up.
Let's cut straight to what changed. According to Google's official announcement, Gemini 3.1 Flash Live brings five core upgrades over its predecessor:
1. 2x Longer Conversation Tracking. The model can now follow conversation threads for twice as long as 2.5 Flash Native Audio. If you've ever had an AI assistant lose the plot halfway through a brainstorming session, you know why this matters.
2. Better Noise Filtering. It's significantly better at distinguishing your voice from background sounds — traffic, TV, coffee shop chatter. This is the kind of boring-but-critical improvement that makes voice AI actually usable in the real world.
3. Faster Response Times. Lower latency means fewer awkward pauses. Conversations feel more like talking to a person and less like talking to a satellite phone.
4. Dynamic Tone Adjustment. The model now adjusts its response length and tone based on context. If you sound frustrated, it picks up on that. If you're confused, it shifts gears. Pretty impressive emotional intelligence for a language model.
5. Improved Tool Use. This is the one developers should pay attention to. Gemini 3.1 Flash Live is dramatically better at triggering external tools and function calls mid-conversation — a must-have for building useful voice agents.
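To make number five concrete: with the Live API you declare callable functions up front, and the model decides mid-conversation when to invoke them. Below is a minimal sketch of a tool declaration using the google-genai Python SDK's dict-style config; the `get_order_status` function is a hypothetical example, and the exact config shape may shift while the API is in preview.

```python
# Sketch: declaring a tool the model can invoke mid-conversation.
# Assumes the google-genai SDK (pip install google-genai).
# "get_order_status" is a hypothetical example tool, not a real API.
config = {
    "response_modalities": ["AUDIO"],  # speak responses back to the user
    "tools": [{
        "function_declarations": [{
            "name": "get_order_status",
            "description": "Look up the shipping status of a customer order.",
            "parameters": {
                "type": "OBJECT",
                "properties": {
                    "order_id": {
                        "type": "STRING",
                        "description": "The customer's order number.",
                    },
                },
                "required": ["order_id"],
            },
        }],
    }],
}
```

Pass this config when opening a session (a connection sketch appears further down) and the model handles the when-to-call decision on its own.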
Google's real play here isn't just a better voice model — it's building the infrastructure for voice-first AI applications at scale.
And the language support is no joke: 90+ languages for real-time multimodal conversations at launch.
Google published results on two key benchmarks, and the numbers tell an interesting story.

On Scale AI's Audio MultiChallenge — which tests complex instruction following and reasoning through interruptions and hesitations (basically, messy real-world audio) — Gemini 3.1 Flash Live scored 36.1% with thinking enabled. That leads the field. Now, 36.1% might sound low, but this benchmark is intentionally brutal. It simulates the kind of chaotic audio conditions that break most models.
On ComplexFuncBench Audio, which measures multi-step function calling with various constraints, the model hit 90.8%. That's the number that should excite developers building voice-powered apps. Function calling is where voice AI goes from "cool demo" to "production-ready tool."
A 90.8% score on complex function calling means voice agents can finally do real work, not just chat.
These aren't cherry-picked vanity metrics, either. Both benchmarks test the messy, real-world scenarios where previous voice models consistently fell apart.
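What does multi-step function calling look like in practice? Mid-stream, the model pauses, emits a structured call, waits for your result, then resumes speaking. Here's a rough sketch of that loop, assuming an open google-genai Live session (the connection code appears in the rollout section below) and the hypothetical `get_order_status` tool from earlier; preview field names may differ.

```python
# Sketch: answering a mid-conversation tool call so the model can keep going.
# Assumes an open Live API `session` (google-genai SDK) and the hypothetical
# get_order_status tool declared earlier; preview field names may differ.
from google.genai import types

async def handle_turn(session):
    async for message in session.receive():
        if message.tool_call:  # the model paused and is waiting on a result
            responses = []
            for call in message.tool_call.function_calls:
                # Stand-in for a real order lookup keyed on call.args.
                result = {"status": "shipped", "eta": "2 days"}
                responses.append(types.FunctionResponse(
                    id=call.id, name=call.name, response=result,
                ))
            # Hand the results back; the model folds them into its next reply.
            await session.send_tool_response(function_responses=responses)
```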
The rollout is broad — broader than you might expect for a same-day launch.
Gemini Live (Android and iOS): This is the consumer-facing product, and Google is calling this its "biggest upgrade yet." If you use Gemini Live for hands-free conversations, you're getting the new model today.
Search Live: Now powered by Gemini 3.1 Flash Live and expanding to 200+ countries — basically everywhere AI Mode is currently available. This includes both audio and video (Google Lens) conversation capabilities. That's a huge geographic expansion.
Live Translate with Headphones: Works with 70+ languages and is expanding to iOS (it was Android-only before). New country availability includes France, Germany, Italy, Japan, Spain, Thailand, and the UK.
Gemini Live API: Available in preview through Google AI Studio. If you're a developer, you can start building with it right now (a minimal connection sketch follows this list).
Enterprise: Already deployed by Verizon and Home Depot for customer-facing applications. LiveKit has also integrated it into their developer platform for building real-time voice agents.
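Here's that minimal connection sketch: a Live API session that sends one text turn and streams the reply. It assumes the google-genai Python SDK, and the model ID is a guess at the preview name; use whatever ID AI Studio actually lists.

```python
# Sketch: opening a Live API session and sending one text turn.
# Requires: pip install google-genai, and GEMINI_API_KEY in the environment.
# MODEL_ID is an assumption; check Google AI Studio for the published name.
import asyncio
from google import genai

client = genai.Client()  # picks up GEMINI_API_KEY automatically
MODEL_ID = "gemini-3.1-flash-live-preview"  # placeholder, not confirmed

async def main():
    config = {"response_modalities": ["TEXT"]}  # use ["AUDIO"] for voice out
    async with client.aio.live.connect(model=MODEL_ID, config=config) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "What can you do?"}]}
        )
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```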
Here's where it gets interesting — and slightly complicated. The Gemini 3.1 Flash Live API comes with a price bump over its predecessor.
| Modality | Input | Output |
|---|---|---|
| Text | $0.75 / 1M tokens | $4.50 / 1M tokens |
| Audio | $3.00 / 1M tokens | $12.00 / 1M tokens |
| Image/Video | $1.00 / 1M tokens | — |
For context, the previous Gemini 2.5 Flash Native Audio API charged $0.50 per million tokens for text input and $2.00 for text output. So we're looking at a 50% increase on text input and a 125% increase on text output.
Is it worth it? That depends entirely on your use case. If you're building production voice agents where quality directly impacts customer satisfaction (think: Verizon's contact center), the improved reliability and function calling probably pay for themselves. If you're running high-volume, cost-sensitive workloads, you'll want to benchmark carefully.
Google does offer a free tier during the preview period, so you can test before committing.
The price increase is real, but so are the quality gains. For voice-first apps, reliability is worth more than a few extra dollars per million tokens.
In practical terms, audio tokenization runs at roughly 25 tokens per second, which works out to about $0.005 per minute for audio input and $0.018 per minute for audio output. Not cheap for always-on applications, but reasonable for transactional voice interactions.
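If you want to sanity-check those per-minute figures, the arithmetic is simple enough to verify in a few lines:

```python
# Back-of-envelope check on the per-minute audio costs quoted above.
TOKENS_PER_SECOND = 25           # approximate audio tokenization rate
AUDIO_IN = 3.00 / 1_000_000      # $ per audio input token
AUDIO_OUT = 12.00 / 1_000_000    # $ per audio output token

tokens_per_minute = TOKENS_PER_SECOND * 60        # 1,500 tokens
print(f"input:  ${tokens_per_minute * AUDIO_IN:.4f}/min")   # ~$0.0045
print(f"output: ${tokens_per_minute * AUDIO_OUT:.4f}/min")  # $0.0180
```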
Let's zoom out. The real significance of Gemini 3.1 Flash Live isn't just another model update — it's a signal about where the entire industry is heading.

Voice AI has been the "next big thing" for years, but most implementations have been frustrating. They lose context. They can't handle background noise. They break when you try to do anything beyond simple Q&A. And the function calling? Unreliable at best.
What Google is doing here is systematically fixing each of those pain points. The 2x conversation memory addresses context loss. The noise filtering handles real-world conditions. The 90.8% function calling score makes tool use actually dependable.
But Google isn't the only player. OpenAI has been pushing hard on real-time audio through its Realtime API. ElevenLabs dominates text-to-speech quality. And the open-source community keeps closing the gap.
The difference is that Google has distribution. When you ship a model simultaneously to Gemini Live, Search, Translate, and an enterprise API — all on day one — that's not a research paper. That's a platform play. (Google is making similar moves in other domains too — see our breakdown of Google's Lyria 3 API launch.)
The Gemini 3.1 Flash Live API is in preview right now, which means breaking changes and rapid iteration are still on the table. The developer documentation lists it as a "New Preview" model.
Developers building on it should expect the typical preview-to-stable timeline — probably a few months before it hits general availability. But the fact that it's already powering consumer products like Gemini Live and Search Live suggests the model itself is production-grade, even if the API terms aren't.
For the broader market, this raises the bar on what "good enough" looks like for real-time voice AI. If your voice agent can't handle noise, loses context after two minutes, or fumbles function calls, you're now competing against a model that does all three well — and it's backed by Google's distribution machine.
So yeah. Voice AI just got a lot more interesting.
FAQ
How much does the Gemini 3.1 Flash Live API cost?
Audio input costs approximately $0.005 per minute ($3.00 per million tokens at ~25 tokens/second), and audio output costs about $0.018 per minute ($12.00 per million tokens). Text input and output are cheaper at $0.75 and $4.50 per million tokens respectively. Google also offers a free tier during the preview period.
Does Gemini 3.1 Flash Live work on iOS?
Yes. Gemini Live on iOS gets the 3.1 Flash Live upgrade, and Live Translate with Headphones is expanding to iOS for the first time (it was previously Android-only). The Live Translate feature supports 70+ languages on iOS, with availability in France, Germany, Italy, Japan, Spain, Thailand, and the UK at launch.
Can developers build with Gemini 3.1 Flash Live today?
Yes, through the Gemini Live API available in preview via Google AI Studio. The model supports text, audio, and video input with audio and text output. LiveKit has already integrated the API into their platform, providing a RealtimeModel class for building production voice agents. Expect the API to reach general availability within a few months.
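For a sense of what the LiveKit route looks like, here's a rough sketch of a voice agent built on its RealtimeModel wrapper. The import path and model ID are assumptions based on LiveKit's Google plugin layout; verify both against LiveKit's documentation.

```python
# Sketch: a LiveKit voice agent backed by the new model.
# Assumes livekit-agents with the Google plugin installed
# (pip install "livekit-agents[google]"); the import path and model ID
# are assumptions -- verify against LiveKit's documentation.
from livekit import agents
from livekit.plugins import google

async def entrypoint(ctx: agents.JobContext):
    session = agents.AgentSession(
        llm=google.beta.realtime.RealtimeModel(
            model="gemini-3.1-flash-live-preview",  # placeholder ID
        ),
    )
    await session.start(
        room=ctx.room,
        agent=agents.Agent(instructions="You are a concise voice assistant."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```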
How does Gemini 3.1 Flash Live compare to OpenAI's real-time audio?
Google's model leads on Scale AI's Audio MultiChallenge (36.1%) and ComplexFuncBench Audio (90.8%), though direct head-to-head comparisons with OpenAI's real-time API on identical benchmarks are limited. The key differentiator is Google's distribution — Gemini 3.1 Flash Live powers Search Live in 200+ countries on day one, while OpenAI's real-time audio is primarily available through their API and ChatGPT.
Does the Gemini 3.1 Flash Live API support context caching or batch pricing?
No. As of March 2026, the Gemini 3.1 Flash Live API does not support context caching or batch pricing. Billing is per-turn based on the total Session Context Window — meaning accumulated tokens from all previous turns in the conversation count toward your cost. This is important to factor in for long-running voice sessions.
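Because every turn re-bills the accumulated context, long sessions get disproportionately expensive. Here's a rough estimator, assuming each turn is billed on the full context so far (my reading of the per-turn model; actual accounting may differ) and roughly 30 seconds of audio per side per turn:

```python
# Rough cost estimator for per-turn billing over a growing session context.
# Assumption: every turn re-bills the whole accumulated context as input.
# 750 tokens/side/turn ~= 30 seconds of audio at ~25 tokens/second.
AUDIO_IN = 3.00 / 1_000_000    # $ per audio input token
AUDIO_OUT = 12.00 / 1_000_000  # $ per audio output token

def session_cost(turns: int, tokens_in: int = 750, tokens_out: int = 750) -> float:
    total, context = 0.0, 0
    for _ in range(turns):
        context += tokens_in + tokens_out   # context keeps accumulating
        total += context * AUDIO_IN         # full context billed as input
        total += tokens_out * AUDIO_OUT     # plus this turn's output
    return total

print(f"10 turns: ~${session_cost(10):.2f}")  # ~$0.34
print(f"50 turns: ~${session_cost(50):.2f}")  # ~$6.19
```

The takeaway: cost grows roughly quadratically with turn count, so trimming or summarizing context matters for long-running agents.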