Research News

Academic papers, novel architectures, training techniques, and fundamental AI research breakthroughs.

OpenAI Outlines Framework for Independent Model Evaluations

Research May 30, 2026

OpenAI shares lessons on designing trustworthy third-party evaluations for frontier AI models, emphasizing the role of task environments and validity checks.

Large Language Models Retain False Information Despite Explicit Warnings

Research May 30, 2026

Research shows LLMs incorporate false statements into their reasoning, even when explicitly told the claims are false.

Enterprise AI Hits a Wall: Frontier Models Score Below 50% on Real-World IT Operations Tasks

Research May 28, 2026

A new benchmark reveals that even the most capable AI systems struggle with diagnosing complex infrastructure failures, scoring below 50% on Site Reliability Engineering scenarios.

Noisy LLM Evaluators Prove Effective for Agent Training Despite Imperfection

Research May 27, 2026

Research shows that imperfect LLM-based evaluators can still meaningfully improve AI agent performance, challenging the assumption that evaluation noise is prohibitively harmful.

A Professional Fact-Checker's Assessment: AI Accuracy Gaps Wider Than Public Believes

Research May 26, 2026

WIRED's fact-checking team reports that AI systems fail verification more often than most users realize, challenging assumptions about their reliability.

Chatbot Jailbreaks Evolve Beyond Simple Exploits as AI Systems Learn Conversational Vulnerabilities

Research May 25, 2026

Hackers are moving past crude prompt-injection attacks to exploit how chatbots handle nuanced conversation—a shift that reveals deeper structural weaknesses in AI safety design.

Specialized 3B Models Now Outperform Frontier APIs on Enterprise OCR Tasks at 50x Lower Cost

Research May 24, 2026

Dharma's DharmaOCR benchmark shows task-specific fine-tuning can beat parameter scale in production AI economics.

Google's AI-for-Science Strategy Pivots Toward Autonomous Agents Over Specialized Tools

Research May 24, 2026

At Google I/O, DeepMind CEO Demis Hassabis highlighted the tension between task-specific AI systems like WeatherNext and agentic LLM-based researchers that could eventually operate independently.

OpenAI's Geometry Breakthrough Restores Its Math-Problem Credibility After 2025 Overreach

Research May 22, 2026

OpenAI's new reasoning system claims to have resolved an 80-year-old conjecture in combinatorial geometry, with peer review from top mathematicians—a stark contrast to last year's false victory lap.

Flipping the AI Agent Stack: Why Embodiment Comes Before Language Models

Research May 21, 2026

A new approach to AI agent architecture prioritizes physical or environmental substrate over language models, challenging the dominant LLM-vectorstore pattern.

OlmoEarth v1.1 Cuts Satellite-Imagery Inference Costs by 3x Through Token Optimization

Research May 21, 2026

Allen Institute releases OlmoEarth v1.1, a more efficient earth-observation model family that maintains v1 performance while reducing compute through shorter token sequences.

Google DeepMind Integrates Street View Into Genie World Model for Real-World Simulation

Research May 20, 2026

Project Genie can now generate interactive simulations anchored to real streets using 280 billion images from 20 years of Street View data collection.

DeepMind's Co-Scientist AI Cuts Aging Research Analysis From Months to Days

Research May 20, 2026

Google DeepMind's AI system helps biologists identify genetic pathways that reverse cellular senescence, validating novel hypotheses in weeks rather than months.

Elmer Data's Watch Test Exposes a Gap Between Conversational AI and Visual Reasoning

Research May 18, 2026

A new analysis shows that large language models excel at language tasks but struggle with seemingly simple visual reasoning—like reading analog clocks.

Hugging Face and IBM Research Launch Open Agent Leaderboard to Measure Real-World System Performance

Research May 18, 2026

A new benchmarking framework evaluates complete AI agent systems—not just models—across six diverse tasks, reporting both quality and cost metrics for practical deployment decisions.

AI-Generated Research Papers Are Flooding Academic Publishing, Straining Peer Review

Research May 16, 2026

Mass-produced studies citing legitimate datasets are overwhelming journal editors, creating a crisis that worsens as AI improves at mimicking competent research.

Hugging Face Adds Private Datasets to the Open ASR Leaderboard to Fight Benchmark Gaming

Research May 6, 2026

Hugging Face introduces private ASR evaluation datasets from Appen Inc. and DataoceanAI to block benchmaxxing, with scores visible via an opt-in toggle.

OpenAI Open-Sources MRC: A New Networking Protocol for Supercomputer-Scale AI Training

Research May 6, 2026

OpenAI and five hardware partners release MRC through the Open Compute Project to reduce congestion and hardware-fault disruptions in large GPU clusters.

BlaGPT Brings Modular Language Model Benchmarking to Small-Scale Research

Research May 6, 2026

GitHub user erogol's BlaGPT offers an open-source research sandbox for evaluating LM architectures and components on compact datasets.

SubQ Claims 12-Million-Token Context at Sub-Quadratic Cost

Research May 6, 2026

A new architecture called SubQ targets 12 million token context windows while sidestepping the quadratic compute scaling that limits standard transformers.

Bonsai 1.7B Hits 442 Tokens Per Second on M4 Max: Ternary Weight Efficiency in Practice

Research May 5, 2026

A ternary-weight 1.7B model achieves 442 T/s on Apple M4 Max, demonstrating how ultra-compact weight encoding translates to real-world on-device inference speed.

Can LLM Biases Be Weaponized to Hijack AI Search Overviews?

Research May 5, 2026

A new arXiv preprint examines whether known large language model biases can be deliberately exploited to distort AI-generated search summaries.

Harvard Study: OpenAI's o1 Outdiagnoses Emergency Room Physicians in Blinded Trial

Research May 4, 2026

A peer-reviewed Harvard and Beth Israel study finds OpenAI's o1 model achieved accurate triage diagnoses in 67% of cases versus 50–55% for attending physicians.

DeepMind's AI Co-Clinician Clears Near-Perfect Benchmark, Proposing a New Model for Medical Teamwork

Research May 3, 2026

Google DeepMind's AI co-clinician made no critical errors in 97 of 98 simulated clinical queries, outperforming tools already in routine physician use.

AlphaGo's Creator Says LLMs Are a Dead End — and Raised $1.1 Billion to Prove It

Research Apr 29, 2026

David Silver, who built AlphaGo at DeepMind, argues large language models are fundamentally capped by human data and has founded Ineffable Intelligence to pursue reinforcement learning instead.