A randomized trial led by researchers at the University of Oxford found that people who used AI chatbots to assess their health symptoms performed no better, and in some cases worse, than those who relied on their own judgment. The study, which tested 1,298 UK participants across 10 doctor-written medical scenarios, challenges a growing assumption that large language models can serve as reliable first-stop tools for health questions. With millions of people already turning to chatbots like ChatGPT for medical guidance, the results raise serious questions about whether these tools help or harm everyday health decisions.
What the Oxford Trial Actually Measured
The study, reported in a leading medical journal, was a preregistered randomized trial, a design that sets a high bar for reliability in clinical research. Participants were split into groups: some worked through the scenarios without any AI help, while others had access to one of three large language models, specifically GPT-4o, Llama 3, or Command R+. Each scenario was crafted by physicians to simulate real health complaints a patient might bring to a doctor, such as chest pain, persistent cough, or abdominal discomfort, and participants had to decide what conditions might explain the symptoms and how urgent the situation was.
The central finding was stark. Participants who used LLM assistance identified relevant conditions in less than 34.5% of scenarios, a rate no better than that of the control group working without AI support. On some measures, including correct triage for urgent problems, the AI-assisted group actually underperformed those relying on their own reasoning. The result held across all three chatbot models tested, which means the failure was not limited to one company’s product or one type of AI architecture.
This distinction matters because the models themselves, when tested in isolation on medical exam questions, often score well. The gap between a chatbot answering a multiple-choice board exam and a real person trying to use that same chatbot to figure out whether their chest pain needs urgent care turns out to be wide. Lead author Andrew Bean and colleagues argue that the analysis illustrates how interacting with humans poses a challenge “even for top” AI models. The models can generate plausible, detailed answers, but those answers do not reliably translate into safer decisions when filtered through non-expert users.
Why Clinical Knowledge Fails in Conversation
The disconnect between AI performance on standardized tests and real-world use reflects a basic mismatch in how these tools are built versus how people actually use them. Medical board exams present structured, well-defined questions with clear answer sets. A person experiencing symptoms, by contrast, may describe them vaguely, focus on the wrong details, or misunderstand what the chatbot is asking. The AI may generate a technically accurate paragraph that the user then applies incorrectly to their own situation.
A separate physician-led safety evaluation reinforces this concern. That study tested four major chatbots (Claude, Gemini, GPT-4o, and Llama 3) against hundreds of real patient questions and evaluated 888 responses for safety and quality. Even when users asked direct, advice-seeking questions, model outputs were frequently unsafe, whether by downplaying serious symptoms, suggesting inappropriate self-care, or failing to flag the need for urgent attention. The problem was not that the AI lacked medical facts but that its answers, delivered without clinical context or follow-up, could steer patients toward wrong conclusions.
This pattern suggests that the real bottleneck is not primarily the AI’s training data. It is the interaction itself. People do not query chatbots the way a clinician takes a patient history. They skip relevant details, accept the first plausible-sounding answer, and lack the background to judge whether a response truly applies to them. The chatbot, meanwhile, has limited ability to push back, ask clarifying questions in a clinically meaningful sequence, or recognize when a user seems to be heading toward a dangerous misinterpretation. Guardrail language such as “this is not medical advice” does little to counteract the authority implied by confident, fluent text.
Broader Evidence Points the Same Direction
The Oxford trial is not an outlier. A systematic review synthesizing evidence across symptom assessment apps, large language models, and laypeople found that LLM accuracy typically ranged between about 58% and 76% on diagnostic and triage tasks. Those numbers might sound reasonable in isolation, but for a tool that many users treat as a medical authority, a 24% to 42% error rate on triage decisions carries real consequences. A missed urgent condition or a false reassurance about a serious symptom can delay care by hours or days, or push more low-risk cases toward already stretched emergency services.
The systematic review also helps put the Oxford result in proportion. The trial’s finding that chatbot-assisted participants identified relevant conditions in fewer than 34.5% of cases sits well below even the lower bound of standalone LLM accuracy. That gap likely reflects the compounding effect of human error layered on top of AI error. When a chatbot gives an imperfect answer and a non-expert interprets it imperfectly, accuracy does not stay flat. It drops, sometimes sharply, as misunderstandings accumulate.
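A rough back-of-the-envelope calculation shows why the numbers fall rather than hold steady. The probabilities below are illustrative assumptions chosen to sit within the ranges reported above; they are not figures taken from the Oxford trial or the systematic review. If the model's error and the user's misinterpretation are roughly independent, end-to-end accuracy is the product of the two, which quickly lands near the trial's sub-34.5% figure.

```python
# Illustrative only: how AI error and user-interpretation error can compound.
# Both probabilities are assumptions for the sake of the example, not study data.

model_accuracy = 0.65          # assumed standalone LLM accuracy, mid-range of the 58-76% review figures
user_applies_correctly = 0.55  # assumed chance a layperson correctly acts on a correct answer

# Treating the two failure modes as roughly independent, the end-to-end
# success rate is the product of the two probabilities.
end_to_end = model_accuracy * user_applies_correctly
print(f"End-to-end success rate: {end_to_end:.1%}")  # ~35.8%, close to the trial's sub-34.5% result
```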
Across studies, another pattern emerges: models often perform better on diagnosis than on triage, even though triage (deciding how quickly someone needs care) is the piece with the highest stakes. An AI that can list possible causes of a rash but cannot reliably distinguish between “see a doctor this week” and “go to the emergency department now” may still leave patients dangerously exposed.
What Clinicians Say About Patient Safety
Rebecca Payne, a GP and clinical academic who served as clinician lead on the Oxford study, warned about the patient-safety implications, including wrong diagnoses and failure to recognize urgency. Her concern is grounded in what primary care doctors see daily: patients who arrive having already decided what is wrong with them based on an internet search. AI chatbots risk accelerating that pattern by delivering answers with a tone of clinical authority that a simple web search result does not carry.
The trial was designed by a team spanning the Oxford Internet Institute’s digital policy researchers and the university’s medical faculty, with senior author Adam Mahdi and lead author Andrew Bean coordinating the work. That cross-disciplinary structure matters because it means the study was not built solely by computer scientists evaluating AI on AI’s terms. It was framed around whether the technology actually changes health outcomes for the people using it, and whether those changes are beneficial or harmful.
Clinicians involved in the research emphasize that even apparently “good” AI advice can be risky if it displaces professional care. A chatbot that reassures a user about chest discomfort that really needs urgent investigation, or that encourages self-management for symptoms of sepsis, can have consequences far beyond a wrong answer on a test. For general practitioners already working at capacity, a wave of AI-shaped self-diagnoses could also make consultations more complex and time-consuming.
The Gap Most Coverage Misses
Much of the public conversation about AI in healthcare focuses on whether chatbots can pass medical licensing exams or match physician accuracy on curated test sets. That framing flatters the technology by measuring it under ideal conditions. The Oxford trial flips the question: does access to a highly capable chatbot actually improve laypeople’s decisions about their own health? In this experiment, the answer was no.
That gap between system-level performance and user-level outcomes is the piece many headlines miss. It is possible for a model to be “smarter” on paper and yet make its users worse off, because it changes how they behave. People may over-trust a confident answer, delay seeking care, or disregard their own symptoms if the chatbot’s explanation sounds reassuring. Conversely, anxious users may be nudged into unnecessary emergency visits by worst-case scenarios that the model includes “just in case.”
There is also a risk that AI tools widen existing inequalities. People with higher health literacy may be better at spotting when an answer does not quite fit, or at using the chatbot as a prompt for questions to ask a doctor. Those with less background knowledge, or with limited access to in-person care, are more likely to treat the AI as a substitute clinician. The same technology could therefore be mildly helpful for the already well-served and actively harmful for those with the fewest safety nets.
What Responsible Use Might Look Like
None of this means that AI has no role in healthcare. The evidence so far suggests that the safest applications are those where models support professionals rather than replace them. For example, a chatbot might help clinicians draft letters, summarize long records, or generate checklists that a human then reviews. Within Oxford’s broader research ecosystem and similar university settings, teams are also exploring how to embed AI into tightly supervised clinical workflows where outputs are always checked before reaching patients.
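To make the "human then reviews" pattern concrete, here is a minimal sketch of a human-in-the-loop drafting step. It assumes the OpenAI Python client and a GPT-4o model; the function names and the review step are hypothetical illustrations, not the Oxford team's tooling or any real clinical system.

```python
# A minimal human-in-the-loop sketch: the model drafts, a clinician approves.
# Assumes the OpenAI Python client (pip install openai) with an API key in the
# environment. Illustrative only; not a production clinical workflow.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def draft_summary(notes: str) -> str:
    """Ask the model for a draft summary of clinician-written notes."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Draft a concise summary of these clinical notes."},
            {"role": "user", "content": notes},
        ],
    )
    return response.choices[0].message.content


def review_and_release(draft: str) -> str | None:
    """Nothing is released until a human explicitly approves the draft."""
    print(draft)
    decision = input("Approve this draft? [y/N] ").strip().lower()
    return draft if decision == "y" else None
```

The design point is the second function: the model's output is held behind an explicit approval gate, so a clinician, not the chatbot, remains the last step before anything reaches a patient.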
For the public, however, the message from the current evidence base is more cautious. Treat general-purpose chatbots as information tools, not diagnostic engines. Use them to prepare questions for an appointment, to understand terms you have already heard from a professional, or to explore reputable guidelines, but not to decide whether you can safely stay home with worrying symptoms. If in doubt, the priority should still be talking to a qualified clinician or trusted service such as a national health advice line.
The Oxford researchers also argue that regulators and developers need to shift how they evaluate health AI. Benchmarks based on exam performance or static test sets are not enough. Trials that randomize real users, as in this study, reveal a more sobering picture of what happens when technology meets messy human behavior. As interest grows, reflected in initiatives such as the Oxford Internet Institute’s public updates on digital health research, policymakers will need to decide how far to let consumer-facing chatbots stray into clinical territory.
For now, the Oxford trial offers a clear takeaway: simply handing powerful language models to patients and hoping for better decisions is not a safe bet. Until tools are designed, tested, and regulated with real users in mind, the promise of AI-assisted self-care will remain more marketing slogan than medical reality.
*This article was researched with the help of AI, with human editors creating the final content.