OpenAI Outlines Framework for Independent Model Evaluations
OpenAI shares lessons on designing trustworthy third-party evaluations for frontier AI models, emphasizing the role of task environments and validity checks.
Academic papers, novel architectures, training techniques, and fundamental AI research breakthroughs.
25 articles
Research shows LLMs incorporate contradictory statements into reasoning, even when explicitly told the claims are false.
A new benchmark reveals that even the most capable AI systems struggle with diagnosing complex infrastructure failures, scoring below 50% on Site Reliability Engineering scenarios.
Research shows that imperfect LLM-based evaluators can still meaningfully improve AI agent performance, challenging the assumption that evaluation noise is prohibitively harmful.
WIRED's fact-checking team reports that AI systems fail verification more often than most users realize, challenging assumptions about their reliability.
Hackers are moving past crude prompt-injection attacks to exploit how chatbots handle nuanced conversation—a shift that reveals deeper structural weaknesses in AI safety design.
Dharma's DharmaOCR benchmark shows task-specific fine-tuning can beat parameter scale in production AI economics.
At Google I/O, DeepMind CEO Demis Hassabis highlighted the tension between task-specific AI systems like WeatherNext and agentic LLM-based researchers that could eventually operate independently.
OpenAI's new reasoning system claims to have resolved an 80-year-old conjecture in combinatorial geometry, with peer review from top mathematicians—a stark contrast to last year's false victory lap.
A new approach to AI agent architecture prioritizes physical or environmental substrate over language models, challenging the dominant LLM-vectorstore pattern.
Allen Institute releases OlmoEarth v1.1, a more efficient earth-observation model family that maintains v1 performance while reducing compute through shorter token sequences.
Project Genie can now generate interactive simulations anchored to real streets using 280 billion images from 20 years of Street View data collection.
Google DeepMind's AI system helps biologists identify genetic pathways that reverse cellular senescence, validating novel hypotheses in weeks rather than months.
A new analysis shows that large language models excel at language tasks but struggle with seemingly simple visual reasoning—like reading analog clocks.
A new benchmarking framework evaluates complete AI agent systems—not just models—across six diverse tasks, reporting both quality and cost metrics for practical deployment decisions.
Mass-produced studies citing legitimate datasets are overwhelming journal editors, creating a crisis that worsens as AI improves at mimicking competent research.
Hugging Face introduces private ASR evaluation datasets from Appen Inc. and DataoceanAI to block benchmaxxing, with scores visible via an opt-in toggle.
OpenAI and five hardware partners release MRC through the Open Compute Project to reduce congestion and hardware-fault disruptions in large GPU clusters.
GitHub user erogol's BlaGPT offers an open-source research sandbox for evaluating LM architectures and components on compact datasets.
A new architecture called SubQ targets 12 million token context windows while sidestepping the quadratic compute scaling that limits standard transformers.
A ternary-weight 1.7B model achieves 442 tokens per second on Apple M4 Max, demonstrating how ultra-compact weight encoding translates to real-world on-device inference speed.
A new arXiv preprint examines whether known large language model biases can be deliberately exploited to distort AI-generated search summaries.
A peer-reviewed Harvard and Beth Israel study finds OpenAI's o1 model achieved accurate triage diagnoses in 67% of cases versus 50–55% for attending physicians.
Google DeepMind's AI co-clinician made no critical errors in 97 of 98 simulated clinical queries, outperforming tools already in routine physician use.
David Silver, who built AlphaGo at DeepMind, argues large language models are fundamentally capped by human data and has founded Ineffable Intelligence to pursue reinforcement learning instead.