Research

Harvard Study: OpenAI's o1 Outdiagnoses Emergency Room Physicians in Blinded Trial

A peer-reviewed Harvard and Beth Israel study finds that OpenAI's o1 model produced the exact or very close diagnosis in 67% of emergency triage cases, versus 50–55% for attending physicians.

OpenAI’s o1 model outperformed attending physicians at emergency triage in a peer-reviewed study published this week in Science. Conducted by a team at Harvard Medical School and Beth Israel Deaconess Medical Center, it is among the most rigorous clinical head-to-head evaluations of large language models to date — though the authors are careful to distinguish strong benchmark performance from readiness for autonomous deployment.

Blinded Against Real ER Cases

According to TechCrunch, researchers drew on 76 actual Beth Israel emergency room cases, presenting the same unprocessed electronic health record data to human physicians and to OpenAI's o1 and GPT-4o models alike. A separate group of attending physicians, unaware of which diagnoses came from humans and which from the models, then scored every result.
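The paper's actual tooling is not described in this article, but the blinding step is easy to picture. The Python sketch below, with entirely hypothetical names and schema, shows one way to strip provenance from a mixed pool of human and model diagnoses before handing them to graders.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    """One proposed diagnosis for one ER case (hypothetical schema)."""
    case_id: int
    diagnosis: str
    source: str  # "o1", "gpt-4o", or "physician"; withheld from graders


def blind(candidates: list[Candidate], seed: int = 0):
    """Shuffle candidates and strip provenance so graders score them blind.

    Returns the anonymized items plus a key, kept only by the study team,
    that maps each anonymous item ID back to its source for unblinding.
    """
    rng = random.Random(seed)
    order = list(range(len(candidates)))
    rng.shuffle(order)
    blinded = [
        {"item_id": f"item-{j:03d}",
         "case_id": candidates[i].case_id,
         "diagnosis": candidates[i].diagnosis}
        for j, i in enumerate(order)
    ]
    key = {f"item-{j:03d}": candidates[i].source for j, i in enumerate(order)}
    return blinded, key
```

Graders would see only the item ID, the case ID, and the diagnosis text; the key stays sealed until scoring is complete.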

The o1 model produced the exact or very close diagnosis at initial triage in 67% of cases; the two attending physicians scored 55% and 50%, respectively. Arjun Manrai, who directs an AI lab at Harvard Medical School and is a lead author on the paper, said the model "eclipsed both prior models and our physician baselines" across virtually every benchmark tested. The performance gap was most pronounced at first contact, precisely when clinical information is scarcest and the stakes are highest.
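With only 76 cases, those headline percentages carry wide uncertainty. As a back-of-the-envelope check (not the paper's own analysis; the exact case counts below are approximated from the reported percentages), Wilson score intervals can be computed in a few lines:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


n = 76  # cases in the study
for label, hits in [("o1", round(0.67 * n)),          # ~51 of 76
                    ("physician A", round(0.55 * n)),  # ~42 of 76
                    ("physician B", round(0.50 * n))]: # 38 of 76
    lo, hi = wilson_interval(hits, n)
    print(f"{label}: {hits}/{n} -> 95% CI ({lo:.2f}, {hi:.2f})")
```

On these rough numbers the intervals overlap, which is one more reason the authors' call for larger prospective trials reads as substance rather than formality.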

What the Study Does Not Cover

The evaluation was limited to text drawn from electronic health records. TechCrunch reports the authors acknowledge that “existing studies suggest that current foundation models are more limited in reasoning over nontext inputs,” placing imaging, physical examination findings, and similar data outside the scope of these conclusions.

Accountability gaps are equally unresolved. Adam Rodman, a Beth Israel physician and co-lead author, told The Guardian there is “no formal framework right now for accountability” around AI diagnoses, and that patients still “want humans to guide them through life or death decisions.”

Why This Matters

TechCrunch reports the study frames its findings as evidence of an "urgent need for prospective trials to evaluate these technologies in real-world patient care settings," not a green light for clinical rollout. That framing matters: the shift from retrospective chart review to live patient trials is the next meaningful evidentiary bar. A 12–17 percentage-point accuracy advantage at triage is not a rounding error; if replicated prospectively and at scale, gains of that magnitude could reshape emergency medicine workflows. The harder constraint now is institutional: building the governance structures that make accountability possible before the technology outruns the frameworks designed to contain it.
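How large would such a prospective trial need to be? A standard two-proportion sample-size calculation, sketched below under textbook assumptions (two-sided alpha of 0.05, 80% power; none of these parameters come from the study itself), gives a sense of scale:

```python
import math
from statistics import NormalDist


def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Cases per arm to detect p1 vs p2, normal approximation for two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_b = NormalDist().inv_cdf(power)          # power quantile
    p_bar = (p1 + p2) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


print(n_per_arm(0.67, 0.55))  # ~259 per arm for the smaller 12-point gap
print(n_per_arm(0.67, 0.50))  # ~131 per arm for the larger 17-point gap
```

Even the optimistic end implies well over a hundred cases per arm, several times the 76 retrospective cases evaluated here.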

Frequently Asked Questions

How accurate was OpenAI's o1 model compared to doctors in the Harvard study?

OpenAI's o1 offered the exact or very close diagnosis in 67% of triage cases, compared to 55% and 50% for the two attending physicians in the study.

Does this mean AI is ready to replace emergency room doctors?

No. The study's authors call for prospective real-world trials before any clinical deployment, and note there is currently no accountability framework for AI diagnoses.

#HealthcareAI #OpenAI #MedicalDiagnosis #ClinicalAI #LLMs