Voice Agents Struggle With Code-Switched Speech Across Four Language Pairs

ServiceNow and Hugging Face benchmark ASR models on bilingual customer interactions, revealing significant performance gaps when speakers mix languages mid-sentence.

Code-Switching Emerges as a Real Enterprise ASR Challenge

ServiceNow AI and Hugging Face released the first systematic benchmark of how automatic speech recognition (ASR) systems handle code-switched speech—the natural linguistic behavior where bilingual speakers alternate between languages mid-sentence or even mid-word. According to the Hugging Face Blog, the researchers built the benchmark after a customer asked how voice agents would perform for a largely bilingual customer base. The benchmark covers four language pairs relevant to enterprise use: Spanish-English, French-English, Canadian French-English, and German-English, evaluated across HR and IT service management scenarios including benefits inquiries, password resets, and device troubleshooting.

The benchmark addresses a gap in the ASR field: although over half the world’s population is bilingual, few systematic studies have evaluated how frontier voice models handle code-switched speech in production settings. This matters because ASR sits at the start of the voice-agent pipeline. Transcription errors propagate downstream through intent recognition, entity extraction, and ticket routing, magnifying the business impact of a single misheard phrase in an enterprise helpdesk environment.

Performance Varies Sharply by Language Pair and Model Architecture

According to Hugging Face, the benchmark evaluated seven ASR systems, including Large Audio Language Models (LALMs), frontier commercial systems, and open-source alternatives. The cost of code-switching, defined as the performance gap between code-switched and monolingual speech, was not uniform across language pairs or models. ElevenLabs Scribe V2, Google’s Gemini 3 Flash, and AssemblyAI Universal 3-Pro emerged as top performers across the Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER) metrics.
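The cost of code-switching described above can be sketched as a simple gap between two error rates. This is a minimal illustration, not the benchmark's own scoring code: the function name code_switching_cost is hypothetical, reporting both an absolute and a relative gap is an assumption, and the numbers in the usage lines are placeholders rather than benchmark results.

```python
from typing import Optional


def code_switching_cost(monolingual_error: float,
                        code_switched_error: float) -> dict[str, Optional[float]]:
    """Gap between code-switched and monolingual error rates (hypothetical helper).

    Both inputs are error rates such as WER, expressed as fractions (0.25 == 25%).
    """
    if monolingual_error < 0 or code_switched_error < 0:
        raise ValueError("error rates must be non-negative")

    # Absolute gap: how many additional errors per reference word.
    absolute = code_switched_error - monolingual_error

    # Relative gap: the increase as a fraction of the monolingual baseline.
    # Undefined when the baseline is zero, so report None in that case.
    relative = absolute / monolingual_error if monolingual_error > 0 else None

    return {"absolute": absolute, "relative": relative}


# Placeholder values for illustration only; these are NOT figures from the benchmark.
cost = code_switching_cost(monolingual_error=0.125, code_switched_error=0.25)
print(cost)  # {'absolute': 0.125, 'relative': 1.0}
```

Computing this gap separately for each language pair and each model is one way to see the non-uniformity the benchmark reports.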

The choice of three evaluation metrics reflects a practical insight: exact transcription accuracy (WER) does not always correlate with downstream task success. SWER measures whether the transcription preserves semantic meaning despite minor word-level errors, while AER directly evaluates whether the ASR output allows correct answers to downstream tasks like password-reset requests. This distinction is critical in enterprise settings, where a transcription with minor word-level errors can still route a customer correctly and preserve the intent of a support request, while a phonetically similar but semantically wrong transcription can derail it.
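To make two of these metrics concrete, here is a minimal sketch. WER is the standard word-level edit distance divided by the reference length. The AER function is only one plausible reading of the description above (the fraction of downstream answers that come out wrong) and may differ from how the benchmark actually scores it. SWER is omitted because the article does not specify its semantic normalization. The function names, the lowercase-and-split tokenization, and the intent labels are assumptions; the example phrase is the French-English request quoted in this article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def answer_error_rate(predicted: list[str], expected: list[str]) -> float:
    """Fraction of downstream answers that differ from the expected answer.

    Assumed definition for illustration; the benchmark may score answers differently.
    """
    if not expected or len(predicted) != len(expected):
        raise ValueError("need equal-length, non-empty answer lists")
    wrong = sum(p != e for p, e in zip(predicted, expected))
    return wrong / len(expected)


# One substituted word out of four reference words.
reference = "pouvez-vous réinitialiser mon password"
hypothesis = "pouvez-vous réinitialiser mon passeport"
print(word_error_rate(reference, hypothesis))  # 0.25

# Hypothetical intent labels: one of two downstream answers is wrong.
print(answer_error_rate(["password_reset", "benefits"],
                        ["password_reset", "device"]))  # 0.5
```

In this example a single substituted word yields a WER of only 0.25, yet "passeport" (passport) changes what the caller is asking for. That is the kind of error that answer-level metrics are meant to expose.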

Methodology and Benchmark Release

ServiceNow and Hugging Face released both the benchmark dataset and the evaluation harness, called AU-Harness, for reproducible voice-model evaluation. The benchmark uses the non-English language as the matrix language, meaning the dominant language into which the speaker embeds English segments of varying lengths. This design choice mirrors real customer interactions where, for example, a French-speaking IT support caller might ask “Pouvez-vous réinitialiser mon password?” (Can you reset my password?), mixing French with English technical jargon.

The internal corpus started with IT support and HR interactions from ServiceNow’s customer base, then was expanded and synthesized to cover a wider range of code-switching scenarios. The deliberate focus on enterprise HR and IT domains—rather than casual conversation—emphasizes the operational stakes: a misdirected ticket or misunderstood policy requirement has concrete business consequences.

Why This Matters

Enterprise voice agents are increasingly serving multilingual customer bases, yet the default practice has been to deploy English-trained ASR models and hope for the best. This benchmark provides evidence that code-switching degrades performance in measurable ways, but also that the gap is not insurmountable. Teams deploying voice agents to bilingual customer bases can now use AU-Harness to benchmark their own ASR choices and understand the specific language-pair costs they will incur.

The ranking of ElevenLabs, Google, and AssemblyAI suggests that newer frontier models, particularly those trained on diverse multilingual data at scale, are beginning to handle code-switching with acceptable accuracy. However, the variability across language pairs implies that no single model is a universal solution. Organizations serving Spanish-English customers may find that a different model suits them than one serving French-English or German-English customers, so each team should evaluate candidates on its own language pair. Over time, this benchmark may drive ASR vendors to explicitly optimize for code-switched speech, narrowing the performance gap and raising the baseline for voice-agent quality in multilingual enterprise settings.

Frequently Asked Questions

What is code-switching and why does it matter for voice agents?

Code-switching is the alternation between two languages within a single utterance, a natural behavior among bilingual speakers, who make up over half the world's population. For voice agents handling enterprise customer interactions, transcription errors from code-switched speech propagate through the entire downstream pipeline, affecting intent recognition, entity extraction, and ticket routing.

Which ASR systems performed best on the benchmark?

According to Hugging Face, ElevenLabs Scribe V2, Google Gemini 3 Flash, and AssemblyAI Universal 3-Pro showed the strongest performance across the four language pairs evaluated.

What metrics did the benchmark use to measure ASR performance?

The benchmark measured Word Error Rate (WER) for exact transcription accuracy, Semantic Word Error Rate (SWER) to check whether meaning is preserved despite word-level errors, and Answer Error Rate (AER) to assess downstream task correctness.