
ChatGPT Health missed more than half of emergencies in alarming new study

OpenAI’s ChatGPT Health tool failed to recognize more than half of true medical emergencies when tested with realistic patient scenarios, according to a peer-reviewed study published in Nature Medicine on February 23, 2026. The finding raises sharp questions about the safety of consumer-facing AI tools that millions of people may already be using to decide whether they need urgent care. When the system got it wrong, it pointed patients toward routine appointments instead of emergency departments, a mistake that in real life could prove fatal. The Nature Medicine authors, whose work has also been highlighted in news coverage, argue that these errors are not rare edge cases but a systematic pattern of underestimating risk.

The study arrives at a moment when AI-driven symptom checkers are being aggressively marketed as convenient front doors to the health system. OpenAI has positioned ChatGPT Health as an assistant that can help users decide whether to seek care, yet the research suggests that in many high-stakes situations it offers false reassurance instead. Experts quoted in the Guardian’s coverage of the study warned that laypeople are likely to treat the chatbot’s answers as de facto medical advice, especially when the tool responds in confident, authoritative language. That dynamic turns statistical misclassification into a potential public health hazard.

How Researchers Stress-Tested the AI

The study, titled “ChatGPT Health performance in a structured test of triage recommendations,” used 60 clinician-authored vignettes designed to cover a range of medical situations, from minor complaints to life-threatening crises. Researchers ran each vignette through 16 factorial conditions, producing 960 total outputs that could be scored against established clinical standards. The design was intentionally adversarial: the vignettes included textbook presentations of conditions like diabetic ketoacidosis and impending organ failure, cases where any trained emergency physician would immediately escalate care. By varying details such as patient age, comorbidities and how symptoms were described, the team probed whether the model’s performance shifted with subtle changes in wording.

Among cases that clinicians classified as gold-standard emergencies, the system under-triaged 52% of them, meaning it recommended a lower level of urgency than the situation demanded. In practical terms, a patient describing symptoms of diabetic ketoacidosis might be told to schedule a doctor’s visit rather than call an ambulance. That gap between what the AI advised and what the patient actually needed is the core danger the researchers set out to quantify, and the results were worse than many in the field expected. Additional analyses published alongside the paper underscore that the errors were not confined to borderline presentations but included clear-cut emergencies.
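To make the arithmetic behind those figures concrete, the sketch below shows how a factorial design of this kind expands 60 vignettes into 960 scored outputs and how an under-triage rate like the 52% figure could be computed. The factor names, urgency tiers, and scoring rule here are illustrative assumptions, not the study's actual code.

```python
# Illustrative sketch only: factor names, urgency tiers, and the scoring
# rule are assumptions for exposition, not the paper's published methods.
from itertools import product

# Hypothetical binary factors; four of them give the 2**4 = 16
# factorial conditions described in the study design.
FACTORS = {
    "age_band": ["younger", "older"],
    "comorbidity": ["absent", "present"],
    "wording": ["plain", "colloquial"],
    "detail": ["sparse", "rich"],
}

# Ordered urgency scale, lowest to highest, used to score each output.
URGENCY = ["self_care", "routine_visit", "urgent_care", "emergency"]


def conditions():
    """Enumerate the 16 factorial variants applied to every vignette."""
    keys = list(FACTORS)
    for combo in product(*(FACTORS[k] for k in keys)):
        yield dict(zip(keys, combo))


def under_triage_rate(results):
    """results: (gold_label, model_label) pairs, one per scored output.

    Counts outputs where the model recommended a lower urgency tier than
    the clinician gold standard, restricted to true emergencies.
    """
    emergencies = [r for r in results if r[0] == "emergency"]
    missed = [r for r in emergencies
              if URGENCY.index(r[1]) < URGENCY.index(r[0])]
    return len(missed) / len(emergencies) if emergencies else 0.0


# 60 vignettes x 16 conditions = 960 scored outputs, matching the study.
assert 60 * len(list(conditions())) == 960
```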

Suicide Risk Assessment Fell Short

The study’s failures were not limited to physical emergencies. Researchers also tested ChatGPT Health against scenarios involving suicidal ideation, using the SAFE-T (Suicide Assessment Five-step Evaluation and Triage) framework published by the Substance Abuse and Mental Health Services Administration as a benchmark. SAFE-T lays out a formal risk stratification that maps specific warning signs, such as a patient expressing a plan or intent, to recommended dispositions ranging from outpatient referral to immediate emergency evaluation. The protocol draws on the Columbia-Suicide Severity Rating Scale, a validated clinical instrument that defines clear thresholds for when suicidal ideation crosses into high-acuity territory requiring immediate intervention.

When ChatGPT Health encountered vignettes that met those high-risk thresholds, it frequently recommended outpatient follow-up rather than emergency evaluation. That pattern is especially troubling because people in acute suicidal crisis are among the least likely to advocate for themselves or push back against advice that tells them to wait. Anyone experiencing a mental health crisis can reach the 988 Suicide & Crisis Lifeline, administered by SAMHSA, for immediate support, but the study’s results suggest that an AI tool sitting between the patient and that resource could actually delay help. In the context of suicide prevention, where minutes can matter, a model that consistently downplays risk is not just inaccurate; it is potentially dangerous.

Why Regulation Has Not Caught Up

A central tension exposed by this research is the gap between what ChatGPT Health does in practice and how federal regulators classify it. The FDA maintains a four-part test to determine whether clinical decision support software qualifies as a medical device subject to oversight. Tools that merely present information without locking users into a specific clinical action can potentially avoid device classification altogether. ChatGPT Health, which frames its outputs as suggestions rather than directives, may fall on the unregulated side of that line, even though millions of users treat its answers as authoritative medical guidance. That ambiguity allows powerful triage tools to be deployed to the public without the kind of premarket evaluation required for far simpler medical devices.

That regulatory architecture was designed for an era when decision-support tools were used by trained clinicians inside hospital systems, not by anxious patients Googling symptoms at 2 a.m. The distinction between “information” and “recommendation” collapses when the person reading the output has no medical training to second-guess it. Until regulators update their frameworks to account for consumer-facing AI triage tools, products like ChatGPT Health will continue to operate in a gray zone where the stakes are high but formal accountability is minimal. The Nature Medicine authors and outside commentators quoted in media reports have called for clearer rules that treat high-impact triage guidance as a regulated function, regardless of whether it is delivered through a friendly chatbot interface.

Who Bears the Greatest Risk

The people most likely to rely on a free AI chatbot for health guidance are often those with the fewest alternatives. Patients in rural areas, uninsured individuals, and people who face long wait times for primary care appointments are precisely the users who might treat ChatGPT Health’s output as a substitute for professional triage. If the tool tells them their chest pain or confusion can wait, many will wait. The 52% under-triage rate documented in the Nature Medicine analysis takes on a different weight when mapped onto populations that already face barriers to emergency care. For these groups, a false sense of reassurance can mean the difference between a survivable event and a catastrophe.

Most coverage of AI health tools focuses on whether the technology can match a doctor’s accuracy. That framing misses the more pressing question: what happens when it confidently gets it wrong? A doctor who under-triages a patient is accountable through malpractice standards, hospital review boards, and licensing requirements. An AI chatbot faces none of those consequences. The patient bears the full cost of the error, and in emergency medicine, that cost can be measured in minutes of oxygen to the brain or hours before insulin reaches the bloodstream. As the Guardian’s reporting on user behavior notes, people often seek online reassurance precisely when they are most vulnerable, amplifying the impact of any misdirection.

What Needs to Change Before the Next Version Ships

The Nature Medicine study does not argue that AI can never play a role in triage; instead, it sets out a roadmap for what would need to change before tools like ChatGPT Health could be considered safe for widespread use. First, developers would have to treat emergency recognition as a safety-critical function, subject to rigorous testing and continuous monitoring rather than informal beta feedback. That would mean systematically stress-testing models on curated vignettes that represent the full spectrum of emergencies, including suicidal crises, and publishing performance metrics that clinicians and regulators can independently scrutinize. It would also require building in conservative defaults: when the model detects even a small chance of a life-threatening condition, it should err on the side of recommending urgent evaluation rather than routine follow-up.
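As a rough illustration of what such a conservative default could look like, the sketch below maps an estimated emergency probability to a disposition that escalates at even a small risk. The threshold values and wording are assumptions for exposition, not anything the study or OpenAI specifies.

```python
# A minimal sketch of a conservative triage default, not OpenAI's design:
# the threshold values and disposition wording are illustrative assumptions.

EMERGENCY_THRESHOLD = 0.05  # escalate even at a small estimated risk


def conservative_disposition(p_emergency: float) -> str:
    """Map an estimated probability of a life-threatening condition to a
    recommendation, biased toward escalation rather than reassurance."""
    if p_emergency >= EMERGENCY_THRESHOLD:
        return "Call emergency services or go to an emergency department now."
    if p_emergency >= 0.01:
        return "Seek same-day urgent care and watch for worsening symptoms."
    return "Schedule a routine appointment and escalate if anything changes."


# Example: a 10% estimated risk of diabetic ketoacidosis should never be
# answered with advice to book a routine doctor's visit.
print(conservative_disposition(0.10))
```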

Second, any future deployment of AI triage tools to the general public will need guardrails that go beyond a disclaimer saying “this is not medical advice.” Clear, plain-language warnings should be embedded directly into high-risk interactions, along with friction that nudges users toward real-world care, such as prominently displaying emergency numbers or crisis hotlines when symptoms or statements cross validated danger thresholds. Regulators, for their part, will need to decide whether tools that generate individualized urgency recommendations should be regulated as medical devices regardless of how their creators label them. Until that alignment between technical design, clinical evidence and oversight is in place, the safest role for systems like ChatGPT Health may be as educational aids under professional supervision, not as unsupervised gatekeepers to life-or-death decisions.
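One minimal sketch of that kind of embedded friction, with hypothetical risk labels and resource text rather than anything ChatGPT Health actually implements, might look like this:

```python
# Illustrative guardrail sketch: the risk labels and resource text are
# assumptions for exposition; a real deployment would rely on validated
# thresholds (e.g., C-SSRS) and locale-appropriate hotlines.

CRISIS_RESOURCES = (
    "If you are in the United States, call or text 988 (Suicide & Crisis "
    "Lifeline), or call your local emergency number right now."
)


def wrap_response(model_reply: str, risk_level: str) -> str:
    """Prepend plain-language warnings and crisis resources to any reply
    generated during a high-risk interaction, rather than relying on a
    generic footer disclaimer."""
    if risk_level == "high":
        return CRISIS_RESOURCES + "\n\n" + model_reply
    if risk_level == "moderate":
        return (
            "This tool cannot rule out an emergency. If symptoms worsen, "
            "seek urgent care.\n\n" + model_reply
        )
    return model_reply
```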


*This article was researched with the help of AI, with human editors creating the final content.