OpenAI’s consumer health tool, ChatGPT Health, failed to recognize life-threatening emergencies in more than half of tested scenarios, according to a peer-reviewed study published in Nature Medicine. The research, which generated 960 responses from clinician-authored medical vignettes, found that the AI system told users they did not need emergency care in 52% of cases where immediate treatment was the correct answer. With the tool reaching millions of users since its January 2026 launch, the findings raise direct questions about whether AI-driven triage can be trusted when seconds matter.
The authors emphasized that their work focused on triage recommendations, not diagnosis or long-term management advice. Even within that narrower scope, the magnitude of under-triage suggests a mismatch between how confidently the product is presented to the public and how cautiously it performs when confronted with high-stakes decisions. For patients and clinicians already wary of algorithmic medicine, the results will likely reinforce concerns that current large language model systems are not ready to replace human judgment in emergency settings.
More Than Half of True Emergencies Went Unrecognized
The study used 60 medical scenarios written by practicing clinicians, spanning 21 specialties and multiple acuity levels to stress-test ChatGPT Health’s ability to sort patients by urgency. Each vignette was run through the system multiple times under varied conditions, producing 960 total responses and allowing researchers to see how stable the tool’s advice was when prompts were slightly altered. When they compared the AI’s recommendations against expert-determined gold-standard triage levels, the results were stark: in cases that genuinely required emergency department visits, ChatGPT Health directed users away from emergency care 52% of the time.
That failure rate is not a marginal shortcoming. Under-triage, the clinical term for telling a sick patient they are less sick than they actually are, can delay treatment for heart attacks, strokes, sepsis, and other conditions in which outcomes worsen with every passing minute. The structured factorial design of the study, which systematically varied demographic details, symptom descriptions, and contextual information, suggests this is not a problem isolated to one type of emergency or a narrow set of prompts. Instead, the pattern held across a broad range of presentations, indicating a systemic gap in how the tool assesses severity and communicates risk to lay users.
Guardrail Failures in Mental Health Scenarios
Beyond general emergency triage, the study exposed specific breakdowns in ChatGPT Health’s safety guardrails during mental health scenarios. According to reporting by medical editor Melissa Davey in the Guardian, the system’s crisis-intervention banners, which are designed to appear when a user describes suicidal thoughts, behaved inconsistently. In suicidal-ideation scenarios that included lab results, the banners appeared as intended. But in other configurations of the same type of scenario—where the distress was just as severe but the input lacked laboratory data—the safety prompts did not trigger at all, leaving users with standard conversational replies.
The study’s lead author described these guardrail failures as dangerous, pointing to the gap between what users expect from a branded health product and what the system actually delivers. When someone types symptoms of a psychiatric emergency into a tool explicitly marketed for health guidance, the absence of a crisis banner is not just a software bug. It is a missed intervention point that could mean the difference between seeking help and staying silent. The inconsistency is especially troubling because it suggests the guardrails respond to surface-level patterns in the text rather than to the clinical gravity of the situation. A user who phrases their distress differently, or who omits certain data points, may receive no safety warning at all, even while believing they are interacting with a medically robust system.
A Pattern Across AI Health Tools
The Nature Medicine findings do not exist in isolation. Separate physician-led red-teaming research examined how multiple publicly available chatbots responded to patient-posed medical questions and found high rates of problematic or unsafe answers. That work, which tested several large language models rather than ChatGPT Health alone, suggests the triage problem is not unique to OpenAI’s product but reflects a broader limitation of general-purpose AI systems handling medical queries without sufficient domain-specific constraints. Across models, the researchers documented cases where chatbots offered false reassurance, incorrect medication guidance, or incomplete safety advice.
Earlier peer-reviewed work published in Scientific Reports documented similar weaknesses in a different context. That study tested ChatGPT’s ability to perform START (Simple Triage and Rapid Treatment) triage, a protocol used in mass casualty incidents, and reported low baseline accuracy before the model received explicit instruction in the triage method. Only after being carefully prompted and constrained did performance improve, underscoring that these systems do not natively “understand” emergency medicine. Taken together with the new Nature Medicine data on emergency triage, the evidence points to a recurring pattern: without targeted training and rigorous external evaluation, AI health tools tend to produce answers that sound reasonable but fall short of the precision and consistency required when lives are at stake.
Millions of Users, No Regulatory Guardrails
ChatGPT Health launched in January 2026 as OpenAI’s consumer-facing health assistant, and it quickly attracted millions of users. That scale makes the 52% under-triage rate a population-level concern rather than an academic curiosity. If even a fraction of those users follow the tool’s advice during a genuine emergency and delay seeking care, the downstream effects on hospital admissions, intensive care utilization, and mortality could be measurable, even if difficult to attribute directly to a single digital product.
No direct response or corrective data from OpenAI addressing the study’s findings has surfaced in the available reporting. The absence of public comment from the company is notable given the severity of the claims and the peer-reviewed status of the research. At the same time, there is still no clear regulatory framework governing AI-powered consumer health tools: agencies that oversee drugs and medical devices have not yet defined when conversational systems cross the line into regulated clinical decision support. In the meantime, the gap between rapid product deployment and slow-moving oversight continues to widen.
What the Study Means for People Using AI for Health Advice
The practical takeaway is blunt: anyone using ChatGPT Health or a similar AI tool to decide whether they need emergency care is relying on a system that, in controlled testing, got that call wrong for true emergencies more often than it got it right. The study did not test real-world usage patterns, and actual patient outcomes tied to AI-guided triage decisions remain unmeasured. But the controlled vignettes were written by clinicians to reflect realistic presentations, and the 52% failure rate held across a wide range of conditions, ages, and comorbidities. For now, clinicians and patient advocates are likely to continue urging people to treat such tools as supplementary information sources rather than as substitutes for nurse hotlines, primary care advice, or emergency services.
For individuals who still choose to consult AI tools, experts quoted in media coverage stress several practical safeguards: when in doubt, err on the side of calling emergency services; treat chatbot responses as starting points for questions to ask a human clinician; and be especially cautious when the system downplays chest pain, breathing difficulties, sudden neurological changes, or expressions of self-harm. Health reporters who have invested in explaining these risks are likely to keep highlighting one core message: until regulation, transparency, and performance benchmarks catch up, AI health chatbots should be treated as experimental tools, not as reliable arbiters of whether a potentially life-threatening emergency room visit can safely wait.
*This article was researched with the help of AI, with human editors creating the final content.