Ask an AI chatbot whether a persistent headache could signal something serious, and you will likely get a detailed, authoritative-sounding answer. The problem, according to a growing body of research published in The BMJ, is that some of those answers are fabricated. The chatbot does not flag uncertainty or say “I don’t know.” It simply invents a plausible response and delivers it with the same polish it uses for verified facts. Researchers call this failure “hallucination,” and in a medical context, it can mean a patient walks away with a false sense of reassurance or, worse, pursues a treatment that was never appropriate. In one documented test case from the BMJ’s cross-sectional study, a chatbot presented a detailed but wholly inaccurate description of vaccine side-effect profiles, citing mechanisms that do not appear in the medical literature, yet delivered the fabrication in the same fluent, citation-like style it used for accurate responses.
What the BMJ has documented
The BMJ has published a series of peer-reviewed papers that, taken together, build a detailed case against trusting AI chatbots for health guidance without verification.
A BMJ feature article drew a striking parallel: just as doctors sometimes fill gaps in their knowledge with plausible-sounding statements, large language models do the same thing, except at enormous scale and with zero awareness that the information is wrong. The difference is that a doctor can be questioned, corrected, or held accountable. A chatbot simply moves on to the next prompt.
A separate cross-sectional study published in the BMJ tested how major publicly available chatbots handled health-disinformation prompts. The results were uneven. Some tools refused to engage with dangerous questions. Others produced detailed but inaccurate treatment suggestions or mischaracterized the evidence behind vaccines and common medications. Safeguards varied so widely from one platform to the next that switching chatbots offered no guarantee of better accuracy.
An accompanying editorial warned that the consequences would not fall equally. Patients who already face barriers to seeing a doctor, whether because of cost, geography, or long wait times, are more likely to rely on free AI tools. If those tools deliver confident misinformation, existing health inequities could deepen.
When AI invents what it claims to see
The hallucination problem is not limited to text. A preprint posted on arXiv (arXiv:2501.13011, not yet peer-reviewed) described what its authors call “mirage reasoning” in multimodal AI models, which are designed to interpret both text and images. In controlled experiments, researchers gave these models prompts implying that a medical image had been provided when no image was actually attached. Rather than flagging the missing input, the models invented detailed diagnostic narratives: describing lesions, identifying abnormalities, and offering clinical interpretations of images that did not exist.
The implications for diagnostic imaging are serious. A radiologist or emergency physician who trusts a model’s output without independently reviewing the scan could act on a finding that was never there, or miss one that was. The preprint has not yet been validated by independent teams or tested against the specific commercial models entering clinical use, so it should be read as an early warning rather than settled science. But the behavior it documents, an AI confidently analyzing an image that does not exist, is exactly the kind of failure that could cause harm in fast-paced clinical settings.
How major AI developers have responded
As of May 2026, the major companies behind the most widely used large language models have acknowledged the hallucination problem in broad terms but have not published detailed, independently verified data on how often their medical outputs are inaccurate. OpenAI’s usage policies note that ChatGPT “can make mistakes” and advise users not to rely on it for medical advice. Google has added health-specific disclaimers to Gemini responses and has published research on Med-PaLM, a medically tuned model, but has not released public benchmarks showing how frequently its consumer-facing tools hallucinate on clinical questions. Neither company has published a systematic audit of its safeguards against health misinformation comparable to the BMJ’s cross-sectional study. The gap between developer acknowledgments and independently measurable safety data remains wide.
Where regulation stands
The U.S. Food and Drug Administration moved to address part of this landscape on January 6, 2026, when it issued its Clinical Decision Support Software Final Guidance (re-issued January 29). The document draws a line between AI-powered software that qualifies as a regulated medical device and tools classified as “non-device” clinical decision support. A central requirement is transparency: the software must allow healthcare professionals to independently verify the basis of any recommendation it generates.
The FDA held a public town hall on March 11, 2026, to walk developers, hospital administrators, and clinicians through the guidance, a signal that the agency considers this an active enforcement priority rather than a theoretical framework.
What the guidance does not do is name LLM hallucinations as a specific regulated risk category. The device-versus-non-device framework was designed before generative AI entered mainstream clinical discussion, and whether it adequately captures the failure modes unique to these models remains an open question. The transparency standard assumes clinicians have the time and expertise to second-guess every AI suggestion. In an understaffed emergency department or a rural clinic where one physician covers multiple roles, that assumption may not hold.
The gaps that still need filling
No published data yet quantifies how often AI hallucinations have caused direct patient harm in real clinical settings. The BMJ evidence establishes that hallucinations occur and that safeguards are inconsistent, but the studies tested chatbot outputs in controlled research environments. The gap between a laboratory demonstration and a documented clinical injury is significant, and as of May 2026, no aggregated adverse-event data from national safety reporting systems has been published to close it.
It also remains unclear how clinicians are actually using general-purpose chatbots day to day. Without systematic observational studies or mandatory reporting, there is no reliable way to know how often fabricated content ends up in clinical notes, discharge instructions, or informal advice to patients.
Developers, meanwhile, have not standardized how they communicate model limitations. Product interfaces tend to emphasize capabilities while burying caveats about hallucinations in fine print or generic disclaimers. Regulators have not yet specified how prominently risks must be displayed or how to convey the probabilistic nature of AI outputs in terms that are meaningful to both busy clinicians and patients with no technical background.
What patients and clinicians should do before trusting a chatbot’s medical answer
For patients who use AI chatbots to research symptoms or treatment options, the practical takeaway from the BMJ’s work is blunt: no current large language model reliably knows the difference between information drawn from its training data and information it has fabricated on the spot. Treating chatbot output as a starting point for a conversation with a doctor, rather than as a substitute for professional care, is the safest approach. Bringing a printout or screenshot of a chatbot’s answer to an appointment gives a clinician the chance to correct errors before they shape decisions.
For clinicians, the calculus is more layered. AI tools can save time by summarizing long documents, translating jargon into patient-friendly language, or generating first drafts of educational materials. But allowing those outputs to silently shape a diagnosis, a prescription, or a prognostic conversation introduces risk that current evidence cannot fully quantify. The BMJ’s research and the FDA’s guidance both point in the same direction: every AI-generated recommendation should be cross-referenced against up-to-date clinical guidelines and trusted databases, and the use of AI in the decision-making process should be documented.
Until stronger real-world safety data emerges, the most defensible posture is to treat generative AI the way a teaching hospital treats a first-year resident’s work: potentially useful, occasionally insightful, but never to be accepted without supervision.
*This article was researched with the help of AI, with human editors creating the final content.