#benchmarking

Voice Agents Struggle With Code-Switched Speech Across Four Language Pairs

Research Jun 10, 2026

ServiceNow and Hugging Face benchmark ASR models on bilingual customer interactions, revealing significant performance gaps when speakers mix languages mid-sentence.

Elmer Data's Watch Test Exposes a Gap Between Conversational AI and Visual Reasoning

Research May 18, 2026

A new analysis shows that large language models excel at language tasks but struggle with seemingly simple visual reasoning, such as reading analog clocks.

Hugging Face Adds Private Datasets to the Open ASR Leaderboard to Fight Benchmark Gaming

Research May 6, 2026

Hugging Face introduces private ASR evaluation datasets from Appen Inc. and DataoceanAI to block benchmaxxing, with scores visible via an opt-in toggle.

BlaGPT Brings Modular Language Model Benchmarking to Small-Scale Research

Research May 6, 2026

GitHub user erogol's BlaGPT offers an open-source research sandbox for evaluating LM architectures and components on compact datasets.