What word error rate is considered acceptable for a production voice agent?

For monolingual speech, WER under 10% is generally acceptable for intent classification when paired with a strong downstream LLM. Intent detection accuracy degrades non-linearly once WER crosses about 20%, especially when high-information words (account numbers, product names, dates) fall inside the embedded-language span. Top models in the ServiceNow-AI benchmark stayed close to their monolingual baselines on code-switched audio, while weaker models showed substantially larger penalties.

Can I fine-tune Whisper on code-switched data myself?

Yes, and it's a reasonable intervention if you're committed to a self-hosted open-source path. The Hugging Face transformers library supports Whisper fine-tuning natively, and LoRA-based approaches let you do this on a single high-end GPU. Note that the ServiceNow-AI benchmark found Whisper Large V3 Turbo sits at the bottom on code-switched audio largely because it defaults to translating into English; configuration and fine-tuning both help mitigate this.

Why does Whisper perform so badly on code-switched audio?

Whisper was trained with a forced language token, which means it expects one utterance to equal one language. When called on code-switched audio without an explicit language parameter, the ServiceNow-AI benchmark found it tends to default to translating the audio into English rather than transcribing it in the language spoken. That means the matrix-language content is effectively lost, which destroys WER but is somewhat masked by semantic metrics.

Which language pairs were covered by the ServiceNow-AI benchmark?

Four pairs: Spanish-English, French-English, Canadian French-English, and German-English. The non-English language served as the matrix in each case, with English embedded at varying lengths. The benchmark does not cover Hindi-English, Mandarin-English, Arabic-English, or other widely code-switched pairs, which is a notable limitation if your customer base speaks those languages.

Should I use a multimodal LLM with native audio input instead of a separate ASR model?

It depends on your latency and cost budget. For new bilingual voice agent builds in 2026, the ServiceNow-AI benchmark suggests Large Audio Language Models like Gemini 3 Flash perform well on semantic metrics (SWER and AER) even when their raw WER trails dedicated ASR systems like ElevenLabs Scribe V2 and AssemblyAI Universal-3 Pro. The tradeoff is typically higher latency and cost per turn. For high-volume, latency-sensitive applications, a top-tier dedicated ASR feeding into a separate LLM is often the cleaner choice.

Bilingual Voice Agents Hit a Wall: ASR Code-Switch Benchmark

Real customers don't speak one language at a time. They flip mid-sentence, drop English nouns into Spanish verbs, and swap scripts inside a single phrase. So how do voice agents handle it?

Not great, according to a new benchmark from ServiceNow-AI on the Hugging Face blog. The team evaluated frontier automatic speech recognition systems on code-switched speech, the linguistic reality for a huge slice of bilingual customers, and the results are sobering. Even the strongest models drop accuracy when speakers mix languages, and lower-tier models degrade severely.

And that matters. Because if your contact center promises "24/7 AI voice support in any language," the test isn't whether it handles clean Spanish or clean English. The test is whether it handles a Miami caller who says "necesito cancel my subscription porque the app keeps crashing."

Key Findings From the Code-Switched Speech Recognition Benchmark

The short version: top frontier systems incur only a small penalty on code-switched speech relative to their monolingual baselines, while lower-ranked models degrade substantially. According to ServiceNow-AI, ElevenLabs Scribe V2, Gemini 3 Flash, and AssemblyAI Universal-3 Pro surfaced as the top models across metrics.

A few headline takeaways from the ServiceNow-AI write-up:

The cost of code-switching varies by language pair and model. Top models handle it with surprisingly small penalties; weaker ones degrade sharply.
Errors concentrate on the English portions of utterances rather than the matrix-language portions, which is counterintuitive since English is what these models usually handle best.
The benchmark covers four language pairs relevant to enterprise contact centers: Spanish-English, French-English, Canadian French-English, and German-English. The non-English language serves as the matrix, with English embedded.
OpenAI's Whisper Large V3 Turbo sits at the bottom of the rankings. When called on code-switched audio without an explicit language parameter, Whisper defaults to translating into English rather than transcribing — a known limitation that wrecks language preservation.

The benchmark is honest about its limits, which is refreshing. It's not claiming a universal ranking. It's claiming that code-switching is a measurable, reproducible failure mode that the industry has been quietly ignoring.

Why Code-Switching Breaks Voice Agents

Code-switching is when a speaker alternates between two or more languages within a conversation, often inside the same sentence. Linguists distinguish intra-sentential (switching within a clause) from inter-sentential (switching between sentences). The first one is what wrecks ASR pipelines.

Most production ASR systems still rely on a language identification step early in the pipeline. Pick a language, load the right acoustic and language model, transcribe. That architecture assumes the speaker stays in one language. When they don't, the system fails in one of three ways:

Forces the embedded-language words into the dominant language's phoneme space (you get gibberish words that sound vaguely right).
Detects a switch too late and produces a hallucinated transcript for the boundary region.
In Whisper's case, defaults to translation rather than transcription, silently converting non-English spans into English text.
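
To make the structural problem concrete, here is a toy sketch of a language-ID-first pipeline. Everything in it is a stand-in: the "audio" is a list of (word, language) pairs, the detector is a majority vote, and each recognizer simply emits `<unk>` for words outside its language. Real systems fail in messier ways, but the single hard decision at the top is the same.

```python
from collections import Counter

# Toy stand-in for audio: (word, language) pairs. Real ASR never sees these tags.
Utterance = list[tuple[str, str]]


def detect_language(utterance: Utterance) -> str:
    """One utterance-level decision: the majority language wins."""
    return Counter(lang for _, lang in utterance).most_common(1)[0][0]


def monolingual_recognizer(lang: str):
    """Hypothetical recognizer that only knows words from a single language."""
    def recognize(utterance: Utterance) -> list[str]:
        return [word if word_lang == lang else "<unk>"
                for word, word_lang in utterance]
    return recognize


def lid_first_pipeline(utterance: Utterance, recognizers: dict) -> list[str]:
    lang = detect_language(utterance)      # hard choice made once, up front
    return recognizers[lang](utterance)    # every word decoded under that choice


recognizers = {"es": monolingual_recognizer("es"),
               "en": monolingual_recognizer("en")}
caller = [("necesito", "es"), ("cancel", "en"), ("my", "en"),
          ("subscription", "en"), ("porque", "es"), ("the", "en"),
          ("app", "en"), ("keeps", "en"), ("crashing", "en")]
print(lid_first_pipeline(caller, recognizers))
```

Whichever language wins the vote, the other language's words are decoded under the wrong model. No setting of the detector fixes that, because the assumption lives in the architecture.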

Whisper, the open baseline most teams reach for, was trained with a forced language token. That choice helped multilingual coverage, but it built in an assumption that one utterance equals one language. Newer Large Audio Language Models with native audio input (like Gemini 3 Flash) are inherently more flexible and tend to do better on meaning-sensitive metrics, but they still inherit text-side biases toward language separation.

The Bilingual Customer Problem Is Bigger Than You Think

Around 40 million U.S. residents speak Spanish at home, and bilingual code-switching is the norm, not the exception, in that population. India, Singapore, the Philippines, much of West Africa, large parts of urban Europe: all heavily code-switching markets. If a voice agent fails on Spanglish, Hinglish, or Singlish, it's not failing on an edge case. It's failing on the actual customer base.

And contact centers are the front line. A botched transcript means a botched intent, a botched action, and a transferred call. The ROI math on voice AI collapses fast when the deflection rate drops below the human-agent breakeven.

How the Benchmark Was Built

ServiceNow-AI's evaluation started with an internal corpus of IT support and HR interactions. The team generated code-switched utterances from it with an LLM (OpenAI GPT-5) and synthesized the audio with ElevenLabs Multilingual V2. Every utterance was reviewed by a native-speaker linguist. The setup, as described in the post:

Four language pairs tested, all with English as the embedded language and the non-English language as the matrix.
Word Error Rate (WER) as a primary metric, plus Semantic WER (SWER) and Answer Error Rate (AER) — the last measures whether an LLM can answer downstream comprehension questions from the transcript.
Comparison between code-switched and monolingual conditions (both matrix-language-only and English-only) to isolate the switching penalty.
A lineup of seven systems including frontier ASRs, Large Audio Language Models, and open-source ASRs.
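
For readers who want the primary metric pinned down: WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. A minimal sketch, assuming lowercase normalization and whitespace tokenization. The benchmark's own text normalization isn't described here, so treat those two choices as placeholders.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by reference length."""
    ref = reference.lower().split()   # assumed normalization
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


reference = "necesito cancel my subscription porque the app keeps crashing"
hypothesis = "necesito cancelar mi subscription porque the app keeps crashing"
print(f"WER: {wer(reference, hypothesis):.3f}")  # 2 substitutions over 9 words
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason a transcript that has been translated wholesale can score so badly.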

This methodology matters because a lot of vendor benchmarks quietly evaluate on monolingual test sets and then claim multilingual support. Measuring the WER delta between monolingual and code-switched conditions is the analytical move that exposes the real failure.
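
The delta itself is simple arithmetic once you have WER per condition. A sketch, assuming you report the code-switched WER against each monolingual baseline separately. The class and field names are hypothetical, and the numbers are placeholders for illustration, not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class ConditionWer:
    """WER for one model on one language pair, per test condition."""
    matrix_only: float     # monolingual matrix language (e.g. Spanish)
    english_only: float    # monolingual English
    code_switched: float   # mixed utterances


def switching_penalty(scores: ConditionWer) -> dict[str, float]:
    """Code-switched WER minus each monolingual baseline (positive = worse)."""
    return {
        "vs_matrix_only": scores.code_switched - scores.matrix_only,
        "vs_english_only": scores.code_switched - scores.english_only,
    }


# Placeholder values for illustration only; these are NOT benchmark results.
example = ConditionWer(matrix_only=0.125, english_only=0.0625, code_switched=0.25)
print(switching_penalty(example))
```

A negative value would mean the model did better on code-switched audio than on that monolingual baseline.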

Results Snapshot

The Hugging Face post has the full numbers. The pattern across the leaderboard looks roughly like this:

Model	Class	Code-switched WER ranking	Notes
ElevenLabs Scribe V2	Frontier ASR	Best overall	Sometimes beats its own L2 baseline
AssemblyAI Universal-3 Pro	Frontier ASR	Tied for top on Spanish-English	Trails Scribe by 0.02-0.13 elsewhere
Gemini 3 Flash	LALM	Close third on WER	Leads on AER (semantic metric)
Deepgram Nova-3, Mistral Voxtral, Nvidia Parakeet	Mid-tier	Middle ranks	Each leads on at least one pair
OpenAI Whisper Large V3 Turbo	Open-source	Bottom (WER 0.16-0.61)	Defaults to English translation

The practical read: no off-the-shelf open model gives you solid code-switched performance for free. Closed frontier systems and LALMs handle code-switching with the smallest penalties; Whisper requires explicit configuration to avoid its translation default.

What the Numbers Actually Mean for Voice Agent Builders

Let's translate WER into something a product manager cares about. A WER of 10% on monolingual English is roughly usable for intent classification with a well-tuned downstream LLM. Intent detection accuracy degrades non-linearly with WER once you cross about 20%, especially when the embedded-language words carry high information density (product names, order numbers, account types).

So a customer who says "hola, my account got bloqueada y necesito reset the password" is exactly the worst case. The English span carries the action (reset the password). The Spanish words carry the account state and the request (bloqueada, necesito). Drop either and the agent escalates.

Figure: Bar chart showing higher word error rates on code-switched speech across ASR model classes.

A few things this benchmark suggests builders should actually do:

Stop reporting only monolingual WER. If your contact center serves a bilingual market, you need a code-switched eval slice, full stop.
Don't rely on language ID alone. Architectures that force a hard language decision early in the pipeline are structurally bad at this. Look at end-to-end multilingual ASR or native-audio LLMs.
Configure Whisper carefully. The benchmark shows Whisper's default behavior on code-switched audio is to translate, not transcribe. If you're sticking with Whisper, that's a fixable configuration trap.
Measure semantic metrics, not just WER. Gemini 3 Flash beats AssemblyAI on AER even when AssemblyAI wins on WER, because semantic preservation matters for downstream task accuracy.
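
On the Whisper point, the fix amounts to telling the model what to do instead of letting it guess. A minimal sketch using the Hugging Face transformers pipeline, assuming the `openai/whisper-large-v3-turbo` checkpoint and assuming that pinning the language to the matrix language suits your traffic. The benchmark post doesn't prescribe specific settings, and the helper names here are hypothetical, so treat this as a starting point to evaluate rather than a guaranteed fix.

```python
def build_whisper_kwargs(matrix_language: str) -> dict:
    """Generation settings that request transcription in a named language."""
    return {
        "language": matrix_language,  # e.g. "spanish"; an assumption per deployment
        "task": "transcribe",         # explicitly not "translate"
    }


def transcribe(audio_path: str, matrix_language: str) -> str:
    """Run Whisper on one file with explicit language and task settings."""
    # Imported lazily so the settings helper works without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3-turbo",
    )
    result = asr(audio_path, generate_kwargs=build_whisper_kwargs(matrix_language))
    return result["text"]


print(build_whisper_kwargs("spanish"))
```

Calling `transcribe("call.wav", "spanish")` would then run the model with those settings. Pinning the language may still hurt the embedded English spans, so measure code-switched WER before and after the change.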

The Surprises

A few findings from the post stood out as genuinely non-obvious:

The error pattern is counterintuitive: errors concentrate on the English portions of code-switched utterances, not the non-English matrix-language portions. English is normally these models' strongest language. ServiceNow-AI's explanation is that English embedded spans often carry technical vocabulary and named entities that are harder to transcribe, and that the act of switching itself creates a challenging context regardless of which language is embedded.
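
You can check for the same pattern in your own data by attributing errors to the language of the reference word they hit. A sketch, assuming each reference word carries a language tag (the article doesn't describe how the benchmark did its attribution, so this is one plausible approach, not theirs). Substitutions and deletions are charged to the reference word's language, and insertions are counted separately.

```python
from collections import Counter


def errors_by_language(ref_tagged, hyp):
    """ref_tagged: list of (word, lang) pairs; hyp: list of hypothesis words.

    Returns (error rate per language, insertion count).
    """
    n, m = len(ref_tagged), len(hyp)
    # dist[i][j]: edit distance between first i reference and first j hypothesis words
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref_tagged[i - 1][0] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    errors = Counter()
    totals = Counter(lang for _, lang in ref_tagged)
    i, j = n, m
    while i > 0 or j > 0:  # walk one optimal alignment backwards
        if i > 0 and j > 0:
            word, lang = ref_tagged[i - 1]
            cost = 0 if word == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                errors[lang] += cost          # substitution if cost == 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            errors[ref_tagged[i - 1][1]] += 1  # deletion of a reference word
            i -= 1
        else:
            errors["insertion"] += 1           # extra hypothesis word
            j -= 1
    rates = {lang: errors[lang] / totals[lang] for lang in totals}
    return rates, errors["insertion"]


ref = [("hola", "es"), ("my", "en"), ("account", "en"), ("got", "en"),
       ("bloqueada", "es"), ("y", "es"), ("necesito", "es"),
       ("reset", "en"), ("the", "en"), ("password", "en")]
hyp = "hola mi account got bloqueada y necesito resetear the password".split()
rates, insertions = errors_by_language(ref, hyp)
print(rates, insertions)
```

In this made-up transcript both errors land on English words, which is the shape of result the benchmark reports. Your own data may look different.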

Another one: Scribe V2 sometimes performs better on code-switched audio than on its own monolingual L2 baseline, suggesting genuine robustness to bilingual input rather than just degradation tolerance. And the gap between top-tier and bottom-tier models on code-switched audio is much larger than the gap on monolingual audio — code-switching primarily exposes differences in robustness rather than uniformly raising difficulty across all models.

If your voice agent only works when the customer commits to a single language, you don't have a multilingual voice agent. You have a monolingual voice agent that occasionally guesses right.

Practical Implications: Should You Ship a Bilingual Voice Agent Today?

Depends on the language pair, the use case, and how forgiving your fallback is.

For high-stakes flows (banking, healthcare, anything regulated), the honest answer is "not yet," unless you've built and measured a code-switched eval set for your specific market. The reputational risk of a transcript hallucination is too high, and the benchmark makes it clear hallucinations are a real failure mode on mixed-language input.

For lower-stakes deflection (FAQ answering, status lookups, appointment booking), the calculus shifts. The top three systems in the benchmark — Scribe V2, Gemini 3 Flash, and AssemblyAI Universal-3 Pro — handle code-switched speech well enough that, paired with a strong downstream LLM and a graceful escalation path, the resulting voice agent is workable. ServiceNow-AI's own positioning suggests this is the realistic 2026 deployment pattern: code-switched ASR that's good enough for an LLM agent to reason over, not perfect transcription.

The broader takeaway from this code-switched speech recognition benchmark is that the industry needs to stop treating multilingual support as a binary feature. "Supports Spanish" and "supports the way bilingual customers actually talk" aren't the same product. The first is a checkbox. The second is real work, and it requires evaluation infrastructure that almost nobody is publishing.

ServiceNow-AI publishing this benchmark is a good step. The next step is the rest of the industry doing the same.

Sources