Bilingual Voice Agents Hit a Wall: ASR Code-Switch Benchmark
Frontier ASR models stumble when customers mix two languages in one sentence. A new ServiceNow-AI benchmark exposes how badly, and which models cope best.

Real customers don't speak one language at a time. They flip mid-sentence, drop English nouns into Spanish verbs, and swap scripts inside a single phrase. So how do voice agents handle it?
Not great, according to a new benchmark from ServiceNow-AI on the Hugging Face blog. The team evaluated frontier automatic speech recognition systems on code-switched speech, the linguistic reality for a huge slice of bilingual customers, and the results are sobering. Even the strongest models drop accuracy when speakers mix languages, and lower-tier models degrade severely.
And that matters. Because if your contact center promises "24/7 AI voice support in any language," the test isn't whether it handles clean Spanish or clean English. The test is whether it handles a Miami caller who says "necesito cancel my subscription porque the app keeps crashing."
The short version: top frontier systems incur only a small penalty on code-switched speech relative to their monolingual baselines, while lower-ranked models degrade substantially. According to ServiceNow-AI, ElevenLabs Scribe V2, Gemini 3 Flash, and AssemblyAI Universal-3 Pro surfaced as the top models across metrics.

A few headline takeaways from the ServiceNow-AI write-up:
The benchmark is honest about its limits, which is refreshing. It's not claiming a universal ranking. It's claiming that code-switching is a measurable, reproducible failure mode that the industry has been quietly ignoring.
Code-switching is when a speaker alternates between two or more languages within a conversation, often inside the same sentence. Linguists distinguish intra-sentential (switching within a clause) from inter-sentential (switching between sentences). The first one is what wrecks ASR pipelines.
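To make the distinction concrete, here is a minimal sketch that counts both kinds of switch. It assumes every word already carries a language tag (a real pipeline would need a tagger for that), and it treats the sentence rather than the clause as the unit, which is a simplification. The function name and the tags are illustrative, not something from the benchmark.

```python
from typing import Dict, List, Tuple

# (word, language code). The tags are assumed to be given; producing them
# automatically is a separate problem that this sketch does not address.
Token = Tuple[str, str]


def switch_types(sentences: List[List[Token]]) -> Dict[str, int]:
    """Count intra- and inter-sentential switches in language-tagged sentences."""
    intra = 0
    inter = 0
    prev_last_lang = None
    for sentence in sentences:
        langs = [lang for _, lang in sentence]
        if not langs:
            continue
        # Switch inside the sentence: adjacent tokens with different tags.
        intra += sum(1 for a, b in zip(langs, langs[1:]) if a != b)
        # Switch at the boundary: this sentence starts in a different
        # language than the previous sentence ended in.
        if prev_last_lang is not None and langs[0] != prev_last_lang:
            inter += 1
        prev_last_lang = langs[-1]
    return {"intra_sentential": intra, "inter_sentential": inter}


example = [
    [("necesito", "es"), ("cancel", "en"), ("my", "en"), ("subscription", "en"),
     ("porque", "es"), ("the", "en"), ("app", "en"), ("keeps", "en"),
     ("crashing", "en")],
    [("gracias", "es")],
]
counts = switch_types(example)
print(counts)
```

The first sentence flips language three times on its own, which is the intra-sentential pattern that a pick-one-language pipeline cannot represent.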
Most production ASR systems still rely on a language identification step early in the pipeline. Pick a language, load the right acoustic and language model, transcribe. That architecture assumes the speaker stays in one language. When they don't, the system either:
Whisper, the open baseline most teams reach for, was trained with a forced language token. That choice helped multilingual coverage, but it built in an assumption that one utterance equals one language. Newer Large Audio Language Models (LALMs) with native audio input, like Gemini 3 Flash, are inherently more flexible and tend to do better on meaning-sensitive metrics, but they still inherit text-side biases toward language separation.
Around 40 million U.S. residents speak Spanish at home, and bilingual code-switching is the norm, not the exception, in that population. India, Singapore, the Philippines, much of West Africa, large parts of urban Europe: all heavily code-switching markets. If a voice agent fails on Spanglish, Hinglish, or Singlish, it's not failing on an edge case. It's failing on the actual customer base.
And contact centers are the front line. A botched transcript means a botched intent, a botched action, and a transferred call. The ROI math on voice AI collapses fast when the deflection rate drops below the human-agent breakeven.
ServiceNow-AI's evaluation started with an internal corpus of IT support and HR interactions, generating code-switched utterances via an LLM (OpenAI GPT-5) and synthesizing audio with ElevenLabs Multilingual V2. Every utterance was reviewed by a native-speaker linguist. The setup, as described in the post:
This methodology matters because a lot of vendor benchmarks quietly evaluate on monolingual test sets and then claim multilingual support. Measuring the WER delta between monolingual and code-switched conditions is the analytical move that exposes the real failure.
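A minimal sketch of that comparison, assuming you have reference and hypothesis transcripts for both conditions. The sentences below are invented for illustration and are not benchmark data, and the post may normalize text differently before scoring.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r.lower().split(), h.lower().split())
                 for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    if words == 0:
        raise ValueError("reference set is empty")
    return errors / words


# Hypothetical transcripts for one model under both conditions.
mono_refs = ["necesito cancelar mi suscripción", "la aplicación se cierra sola"]
mono_hyps = ["necesito cancelar mi suscripción", "la aplicación se cierra solo"]
cs_refs = ["necesito cancel my subscription", "porque the app keeps crashing"]
cs_hyps = ["necesito cancelar mi subscription", "porque the app keeps crashing"]

mono_wer = wer(mono_refs, mono_hyps)
cs_wer = wer(cs_refs, cs_hyps)
wer_delta = cs_wer - mono_wer  # the code-switching penalty for this model
print(f"monolingual={mono_wer:.3f} code-switched={cs_wer:.3f} delta={wer_delta:.3f}")
```

Reporting the delta per model, rather than the code-switched WER alone, separates "this model is weak in general" from "this model breaks specifically when languages mix."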
The Hugging Face post has the full numbers. The pattern across the leaderboard looks roughly like this:
| Model | Class | Code-switched WER standing | Notes |
|---|---|---|---|
| ElevenLabs Scribe V2 | Frontier ASR | Best overall | Sometimes beats its own L2 baseline |
| AssemblyAI Universal-3 Pro | Frontier ASR | Tied for top on Spanish-English | Trails Scribe by 0.02-0.13 elsewhere |
| Gemini 3 Flash | LALM | Close third on WER | Leads on AER (semantic metric) |
| Deepgram Nova-3, Mistral Voxtral, Nvidia Parakeet | Mid-tier | Middle ranks | Each leads on at least one pair |
| OpenAI Whisper Large V3 Turbo | Open-source | Bottom (WER 0.16-0.61) | Defaults to English translation |
The practical read: no off-the-shelf open model gives you solid code-switched performance for free. Closed frontier systems and LALMs handle code-switching with the smallest penalties; Whisper requires explicit configuration to avoid its translation default.
Let's translate WER into something a product manager cares about. A WER of 10% on monolingual English is roughly usable for intent classification with a well-tuned downstream LLM. Intent detection accuracy degrades non-linearly with WER once you cross about 20%, especially when the embedded-language words carry high information density (product names, order numbers, account types).
So a customer who says "hola, my account got bloqueada y necesito reset the password" is exactly the worst case. The English verbs carry the action. The Spanish words carry the emotional context. Drop either and the agent escalates.

A few things this benchmark suggests builders should actually do:
A few findings from the post stood out as genuinely non-obvious:
The error pattern is counterintuitive: errors concentrate on the English portions of code-switched utterances, not the non-English matrix-language portions. English is normally these models' strongest language. ServiceNow-AI's explanation is that English embedded spans often carry technical vocabulary and named entities that are harder to transcribe, and that the act of switching itself creates a challenging context regardless of which language is embedded.
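If you want to check whether your own transcripts show the same pattern, one rough approach is to align each hypothesis to a language-tagged reference and attribute every substitution or deletion to the language of the reference word it hit. The sketch below assumes the reference words are already tagged, ignores insertions because they have no reference word to pin them on, and uses an invented hypothesis. It is not the benchmark's own scoring code.

```python
from collections import Counter
from typing import Dict, List, Tuple


def errors_by_language(ref: List[Tuple[str, str]], hyp: List[str]) -> Dict[str, float]:
    """Per-language error rate over reference words (substitutions + deletions)."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1][0] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)

    errors: Counter = Counter()
    totals = Counter(lang for _, lang in ref)
    # Walk back through the table to recover one optimal alignment.
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1][0] == hyp[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                if cost:
                    errors[ref[i - 1][1]] += 1  # substitution
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            errors[ref[i - 1][1]] += 1  # deletion
            i -= 1
        else:
            j -= 1  # insertion: no reference word, so not attributed
    return {lang: errors[lang] / totals[lang] for lang in totals}


# Reference with hand-assigned language tags; the hypothesis is hypothetical.
reference = [("hola", "es"), ("my", "en"), ("account", "en"), ("got", "en"),
             ("bloqueada", "es"), ("y", "es"), ("necesito", "es"),
             ("reset", "en"), ("the", "en"), ("password", "en")]
hypothesis = "hola mi account got bloqueada y necesito resetear the password".split()

rates = errors_by_language(reference, hypothesis)
print(rates)
```

In this made-up case both errors land on English words that were pulled toward Spanish, so the English error rate is the one that moves while the Spanish rate stays at zero.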
Another one: Scribe V2 sometimes performs better on code-switched audio than on its own monolingual L2 baseline, suggesting genuine robustness to bilingual input rather than just degradation tolerance. And the gap between top-tier and bottom-tier models on code-switched audio is much larger than the gap on monolingual audio: code-switching primarily exposes differences in robustness rather than uniformly raising difficulty across all models.
If your voice agent only works when the customer commits to a single language, you don't have a multilingual voice agent. You have a monolingual voice agent that occasionally guesses right.
Whether a bilingual voice agent is ready to deploy depends on the language pair, the use case, and how forgiving your fallback is.

For high-stakes flows (banking, healthcare, anything regulated), the honest answer is "not yet," unless you've built and measured a code-switched eval set for your specific market. The reputational risk of a transcript hallucination is too high, and the benchmark makes it clear hallucinations are a real failure mode on mixed-language input.
For lower-stakes deflection (FAQ answering, status lookups, appointment booking), the calculus shifts. The top three systems in the benchmark (Scribe V2, Gemini 3 Flash, and AssemblyAI Universal-3 Pro) handle code-switched speech well enough that, paired with a strong downstream LLM and a graceful escalation path, the resulting voice agent is workable. ServiceNow-AI's own positioning suggests this is the realistic 2026 deployment pattern: code-switched ASR that's good enough for an LLM agent to reason over, not perfect transcription.
The broader takeaway from this code-switched speech recognition benchmark is that the industry needs to stop treating multilingual support as a binary feature. "Supports Spanish" and "supports the way bilingual customers actually talk" aren't the same product. The first is a checkbox. The second is real work, and it requires evaluation infrastructure that almost nobody is publishing.
ServiceNow-AI publishing this benchmark is a good step. The next step is the rest of the industry doing the same.
For monolingual speech, WER under 10% is generally acceptable for intent classification when paired with a strong downstream LLM. Intent detection accuracy degrades non-linearly once WER crosses about 20%, especially when high-information words (account numbers, product names, dates) fall inside the embedded-language span. Top models in the ServiceNow-AI benchmark stayed close to their monolingual baselines on code-switched audio, while weaker models showed substantially larger penalties.
Fine-tuning Whisper on code-switched data is possible, and it's a reasonable intervention if you're committed to a self-hosted open-source path. The Hugging Face transformers library supports Whisper fine-tuning natively, and LoRA-based approaches let you do this on a single high-end GPU. Note that the ServiceNow-AI benchmark found Whisper Large V3 Turbo sits at the bottom on code-switched audio largely because it defaults to translating into English; configuration and fine-tuning both help mitigate this.
Whisper was trained with a forced language token, which means it expects one utterance to equal one language. When called on code-switched audio without an explicit language parameter, the ServiceNow-AI benchmark found it tends to default to translating the audio into English rather than transcribing it in the language spoken. That means the matrix-language content is effectively lost, which destroys WER but is somewhat masked by semantic metrics.
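One way to supply that explicit language parameter, sketched with the Hugging Face transformers pipeline. This assumes you know the matrix language ahead of time (for example, from the caller's account locale), which will not always hold. The helper names are made up for this example, and whether pinning the language fully removes the translation behavior on your audio is something to measure rather than assume.

```python
from typing import Any, Dict

MODEL_ID = "openai/whisper-large-v3-turbo"


def whisper_generate_kwargs(matrix_language: str) -> Dict[str, Any]:
    """Pin Whisper to transcription (not translation) in the matrix language."""
    return {"task": "transcribe", "language": matrix_language}


def transcribe(audio_path: str, matrix_language: str) -> str:
    """Transcribe one audio file with the language and task set explicitly."""
    # Imported lazily so the kwargs helper works without the model stack installed.
    from transformers import pipeline

    # Building the pipeline on every call is wasteful; a real service would
    # construct it once and reuse it.
    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    result = asr(audio_path, generate_kwargs=whisper_generate_kwargs(matrix_language))
    return result["text"]


kwargs = whisper_generate_kwargs("spanish")
print(kwargs)
```

A call would look like `transcribe("call.wav", "spanish")`, where `call.wav` is a placeholder path. Pinning the matrix language does not tell the model what to do with the embedded English spans, so it is worth scoring the output on a code-switched eval set before relying on it.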
The benchmark covers four language pairs: Spanish-English, French-English, Canadian French-English, and German-English. The non-English language served as the matrix in each case, with English embedded at varying lengths. The benchmark does not cover Hindi-English, Mandarin-English, Arabic-English, or other widely code-switched pairs, which is a notable limitation if your customer base speaks those languages.
For new bilingual voice agent builds in 2026, the ServiceNow-AI benchmark suggests Large Audio Language Models like Gemini 3 Flash perform well on semantic metrics (SWER and AER) even when their raw WER trails dedicated ASR systems like ElevenLabs Scribe V2 and AssemblyAI Universal-3 Pro. The tradeoff is typically higher latency and cost per turn. For high-volume, latency-sensitive applications, a top-tier dedicated ASR feeding into a separate LLM is often the cleaner choice.