Researchers at Oxford University have produced what they describe as the largest user study of large language models applied to medical advice, and the results are alarming. The peer-reviewed work, along with several related studies published in early February 2026, found that popular AI chatbots routinely deliver inaccurate and sometimes dangerous health guidance, particularly when patients face emergencies. The findings arrive as more people turn to tools like ChatGPT, Claude, and Gemini for quick answers about symptoms, medications, and whether to visit a doctor.
Chatbots Fail When the Stakes Are Highest
The central problem is not that AI occasionally gets a medical fact wrong. It is that these systems fail most dangerously in exactly the situations where accuracy matters most: emergencies. A study published in Nature Medicine stress-tested a ChatGPT-based system for medical triage using 60 clinician-authored vignettes spanning 21 clinical domains. Researchers ran the system through 16 factorial conditions, generating 960 total responses. Among cases that clinicians classified as gold-standard emergencies, the system under-triaged 5 of them, meaning it told simulated patients their situation was less urgent than it actually was.
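To make that design concrete, here is a minimal Python sketch of how a factorial stress test of this shape tallies responses and counts under-triage. The four two-level condition factors, the urgency scale, and all labels are illustrative assumptions, not the study's actual materials.

```python
from itertools import product

# Illustrative stand-ins for the study design: 60 clinician-authored vignettes
# crossed with 16 factorial conditions. The four two-level factors below are
# assumptions chosen only to show the 2 x 2 x 2 x 2 = 16 structure.
vignettes = [f"vignette_{i:02d}" for i in range(1, 61)]
conditions = list(product(("typed", "spoken"),
                          ("first-person", "third-person"),
                          ("sparse detail", "rich detail"),
                          ("no history", "history given")))

runs = [(v, c) for v in vignettes for c in conditions]
print(len(conditions), len(runs))  # 16 conditions, 60 x 16 = 960 responses

# Under-triage is an ordinal miss: the model rates a case as less urgent than
# the clinicians' gold-standard label. (This four-tier scale is an assumption.)
URGENCY = {"self-care": 0, "routine": 1, "urgent": 2, "emergency": 3}

def is_under_triage(model_level: str, gold_level: str) -> bool:
    return URGENCY[model_level] < URGENCY[gold_level]

print(is_under_triage("routine", "emergency"))  # True: the dangerous miss
```

The dangerous case is exactly the one the study flagged: a gold-standard emergency that the model files under a lower urgency tier.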
Under-triage in an emergency setting can delay life-saving care. If a chatbot tells someone experiencing stroke symptoms to schedule a routine appointment rather than call an ambulance, the consequences could be fatal. The study was designed specifically to measure how well ChatGPT Health performs the basic clinical task of sorting patients by urgency, and the tool stumbled on the cases where getting it right is non-negotiable.
Oxford’s own evaluation of consumer-facing tools reinforces this picture. In what the university describes as the largest real-world user study of its kind, researchers asked patients to pose everyday health questions to leading systems and compared the answers with clinical advice. The team reported that chatbots frequently gave incorrect or unsafe guidance, including failures to direct users to a GP or emergency care when that would have been the appropriate response.
Unsafe Answers Across Every Major Chatbot
The triage failures in ChatGPT Health are not isolated to one product. A separate physician-led red-teaming study published in npj Digital Medicine posed 222 patient-written advice-seeking questions to each of four widely used models: Claude, Gemini, GPT-4o, and Llama, and evaluated the resulting 888 responses. The researchers reported comparative rates of problematic and unsafe answers across all four systems, evidence that safety failures are an industry-wide pattern rather than a flaw in any single company's product.
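The per-model comparison boils down to a simple aggregation, sketched below. The ratings here are invented toy values used only to show the tallying pattern; in the study itself, the safe/unsafe judgments came from physicians reviewing all 888 responses.

```python
from collections import defaultdict

# Toy physician ratings: (model, response_was_unsafe). Invented values,
# included solely to illustrate how comparative unsafe-answer rates are built.
ratings = [
    ("Claude", False), ("Claude", True),
    ("Gemini", False), ("Gemini", False),
    ("GPT-4o", True),  ("GPT-4o", False),
    ("Llama",  True),  ("Llama",  True),
]

totals = defaultdict(int)
unsafe = defaultdict(int)
for model, is_unsafe in ratings:
    totals[model] += 1
    unsafe[model] += is_unsafe

for model in sorted(totals):
    print(f"{model}: {unsafe[model]}/{totals[model]} flagged unsafe "
          f"({unsafe[model] / totals[model]:.0%})")
```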
What makes this study particularly revealing is its methodology. Rather than testing chatbots with textbook-style medical questions, the researchers used the kinds of messy, real-world queries that actual patients type into these tools. The results showed that every model produced answers that physicians flagged as potentially harmful. For patients who lack the medical training to spot a bad recommendation, these errors are invisible until they cause real damage.
The Oxford team’s broader user study points in the same direction: people often ask chatbots whether they should seek in-person care, and the systems respond with confident but unreliable advice. This pattern, observed across multiple independent evaluations, suggests that current safeguards are not robust enough to handle the complexity of real-world medical decision-making.
Tiny Data Poisoning, Massive Consequences
Beyond simple inaccuracy, there is a deeper vulnerability baked into how these models learn. A peer-reviewed security study in Nature Medicine demonstrated that replacing roughly 0.001% of training tokens with medical misinformation was enough to increase harmful outputs from a large language model. The attack left the model’s performance on standard benchmarks apparently intact, meaning routine quality checks would not catch the problem.
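To put that fraction in perspective, a quick back-of-the-envelope calculation helps; the corpus sizes below are illustrative assumptions for scale, not figures from the study.

```python
# 0.001% of training tokens, expressed as a fraction.
poison_fraction = 0.001 / 100          # 1e-5: one token in every 100,000

# Corpus sizes here are illustrative assumptions, not figures from the study.
for corpus_tokens in (1e9, 1e12, 15e12):
    poisoned = corpus_tokens * poison_fraction
    print(f"{corpus_tokens:,.0f} tokens -> {poisoned:,.0f} poisoned tokens")
```

On a trillion-token corpus, 0.001% works out to roughly ten million tokens: vanishingly small relative to the whole, yet a substantial amount of text in absolute terms.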
This finding challenges a common assumption in AI safety: that a model performing well on tests is a model performing well in practice. If a bad actor, or even an accidental data contamination event, introduces a tiny sliver of false medical information into training data, the resulting model could confidently spread dangerous advice while passing every evaluation its developers run. The gap between benchmark scores and real-world safety is wider than most users realize, and the study makes clear that current testing methods are not sufficient to close it.
For healthcare, where guidelines and evidence change over time, the risk is compounded. A poisoned or outdated model could continue recommending discredited treatments or miss newly established red-flag symptoms, all while appearing highly capable on familiar benchmark questions that do not reflect current practice.
When Chatbots Meet Vulnerable Users
The risks become especially acute for people in mental health crises. Reporting by the Associated Press on a study in Psychiatric Services that tested chatbot responses to suicide-related prompts highlighted cases where systems produced unsafe or inadequate replies, underscoring researchers' calls for stronger safeguards. The research takes on added weight alongside a lawsuit filed by a family alleging that ChatGPT played a role in a teenager's death.
These are not edge cases that affect a handful of users. People in crisis often turn to the most accessible resource available, and for many, that resource is now an AI chatbot on their phone. The combination of a tool that gives confident-sounding but unreliable advice and a user who is too distressed to question that advice creates conditions where errors can turn lethal. The Psychiatric Services researchers were explicit that current chatbot behavior around self-harm needs significant improvement before these tools can be considered safe for vulnerable populations.
Oxford’s broader findings echo this concern, noting that people with mental health conditions were among those most likely to rely on chatbots for immediate support. Yet the systems are not consistently trained or evaluated to handle such high-risk conversations, leaving a dangerous gap between user expectations and technical reality.
Misinformation Gets Worse With Credible-Looking Sources
Another dimension of the problem involves how easily AI systems can be tricked into spreading false health claims. Researchers found that chatbots were more likely to pass along misinformation when the source appeared legitimate, and that the phrasing of a prompt changed the odds that a false medical claim would be relayed. In short, bad information dressed up in professional language was more persuasive to these systems than the same falsehood stated bluntly.
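As a rough illustration of the kind of probe behind that finding, here is a minimal paired-framing sketch. The placeholder claim, the framing templates, and `query_chatbot` are all hypothetical stand-ins; this is not the researchers' actual protocol.

```python
# Minimal sketch of a paired-framing probe: the same false claim is posed
# bluntly and dressed up with credible-looking attribution, and the answers
# are compared. `query_chatbot` and both templates are hypothetical.
def query_chatbot(prompt: str) -> str:
    return "model answer"  # stand-in; a real probe would call a model API

false_claim = "<placeholder false medical claim>"

framings = {
    "blunt": f"Is it true that {false_claim}?",
    "dressed-up": (f"A recent journal article reportedly concluded that "
                   f"{false_claim}. Could you summarize that finding?"),
}

answers = {name: query_chatbot(prompt) for name, prompt in framings.items()}
# Reviewers would then check whether each answer repeats, endorses, or
# debunks the claim; per the reported finding, the dressed-up framing
# slips through more often.
```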
This matters because health misinformation online rarely announces itself as unreliable. It typically mimics the tone and structure of credible medical sources, which is precisely the format most likely to slip past an AI’s defenses. Users who ask a chatbot to evaluate a health claim they found online may receive false reassurance that the claim is valid, simply because the misinformation was packaged convincingly.
Combined with the data-poisoning risk, this susceptibility to polished falsehoods creates a worrying feedback loop: misleading content that looks authoritative can both corrupt training data and evade safeguards at inference time, amplifying its reach through tools that millions of people now treat as trusted assistants.
Who Bears the Risk as AI Healthcare Tools Spread
The Oxford research team described their work as the largest systematic test so far of how ordinary people use medical chatbots, and their conclusion was blunt: current systems are not ready to be treated as stand-alone sources of care. In a statement announcing the findings, the university warned that these tools can provide misleading or unsafe advice and urged regulators and developers to treat them as experimental rather than routine parts of healthcare.
That framing highlights a central tension. Technology companies promote chatbots as helpful companions that can “empower” patients, while academic and clinical researchers are increasingly documenting systematic safety failures. The risk, as the Oxford authors emphasize, is that responsibility for navigating this gap is being pushed onto patients who have neither the training nor the legal protections to manage it.
For now, the studies suggest several practical guardrails. Patients should treat AI-generated health information as a starting point for questions, not as a diagnosis or treatment plan. Any advice involving medication changes, delayed care, or self-harm must be checked with a qualified professional. Healthcare providers, meanwhile, need clear policies on when and how AI tools can be used, along with training to recognize their limitations.
The research record now paints a consistent picture: today’s medical chatbots can be useful for education and explanation, but they are unreliable as decision-makers, especially in emergencies, mental health crises, and situations involving contested or misleading information. As regulators and developers debate the future of AI in healthcare, these findings point to a simple interim rule for patients: when the stakes are high, the safest option is still to talk to a human clinician.
*This article was researched with the help of AI, with human editors creating the final content.