#evaluation

OpenAI Outlines Framework for Independent Model Evaluations

Research May 30, 2026

OpenAI shares lessons on designing trustworthy third-party evaluations for frontier AI models, emphasizing the role of task environments and validity checks.

Enterprise AI Hits a Wall: Frontier Models Struggle Below 50% on Real-World IT Operations Tasks

Research May 28, 2026

A new benchmark shows that even the most capable AI systems struggle to diagnose complex infrastructure failures, scoring below 50% on Site Reliability Engineering scenarios.

Hugging Face and IBM Research Launch Open Agent Leaderboard to Measure Real-World System Performance

Research May 18, 2026

A new benchmarking framework evaluates complete AI agent systems, not just models, across six diverse tasks, reporting both quality and cost metrics for practical deployment decisions.

Hugging Face Adds Private Datasets to the Open ASR Leaderboard to Fight Benchmark Gaming

Research May 6, 2026

Hugging Face introduces private ASR evaluation datasets from Appen Inc. and DataoceanAI to deter benchmaxxing (gaming public benchmarks), with scores visible via an opt-in toggle.