OpenAI’s ChatGPT models have shown a sharp jump in accuracy on a high-stakes medical licensing exam, with newer versions outperforming their predecessors by a significant margin. A peer-reviewed study published in Scientific Reports tested three generations of the model on China’s National Medical Licensing Examination and found that GPT-4o delivered substantially better results than GPT-3.5. The findings provide an external benchmark for measuring how model performance is changing on a real-world professional assessment.
Medical Exam Scores Reveal a Generational Leap
The study, published in Scientific Reports (a Nature Portfolio journal), evaluated the performance of GPT-3.5, GPT-4, and GPT-4o on the Chinese National Medical Licensing Examination. This standardized test serves as a gatekeeper for medical practice in China and covers a broad range of clinical knowledge, from diagnostics to pharmacology. By running all three model generations through the same question set, the researchers created a controlled comparison that isolates improvement across versions rather than across tasks. The paper documents the methodology and results in detail and is the primary source for the performance claims discussed here.
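The paper's evaluation code is not reproduced here, but the logic of such a controlled comparison is simple to sketch. The snippet below is a minimal, hypothetical illustration of running several models over one shared multiple-choice question set and scoring each; `ask_model`, the question fields, and the dummy answer logic are placeholders, not anything taken from the study.

```python
# Minimal sketch of a controlled model comparison on a shared question set.
# Hypothetical throughout: `ask_model` stands in for a real API call, and the
# question data is illustrative, not taken from the exam or the study.

def ask_model(model_name: str, stem: str, options: list[str]) -> str:
    """Placeholder for querying one model with one exam item.

    A real harness would call the model's API here; this dummy always
    picks the first option so the sketch runs end to end.
    """
    return options[0]

def score_models(models: list[str], questions: list[dict]) -> dict[str, float]:
    """Run every model on the identical question set and return its accuracy."""
    results = {}
    for model in models:
        correct = sum(
            ask_model(model, q["stem"], q["options"]) == q["key"]
            for q in questions
        )
        results[model] = correct / len(questions)
    return results

# Because every model sees exactly the same items, differences in the
# resulting accuracies reflect the models, not the task.
questions = [
    {"stem": "Example clinical vignette…", "options": ["A", "B", "C", "D"], "key": "B"},
]
print(score_models(["gpt-3.5", "gpt-4", "gpt-4o"], questions))
```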
GPT-4o, the newest model in the study, posted the strongest results. The paper reports that GPT-4o’s accuracy improved by roughly a third (a relative increase) compared with GPT-3.5 on the exam’s multiple-choice questions, a gap large enough to be academically meaningful. The accuracy gains from GPT-3.5 to GPT-4o represent the kind of generational improvement that matters most in fields where wrong answers carry real consequences. A medical licensing exam is not a trivia quiz. It tests applied reasoning, the ability to synthesize patient information, and the capacity to distinguish between similar diagnoses. That GPT-4o handled these demands better than its predecessors suggests that OpenAI’s training pipeline is not just expanding knowledge coverage but also refining reasoning over complex clinical scenarios.
Why a Chinese Medical Exam Is a Strong Benchmark
Most AI accuracy claims rely on English-language tests or internal benchmarks that OpenAI itself designs. The Chinese National Medical Licensing Examination sidesteps both of those limitations. It is a real-world, high-stakes assessment administered to aspiring physicians, and it is written in Chinese rather than English. That makes it a useful stress test for whether AI improvements hold up outside English-centric environments and beyond synthetic question sets crafted by model developers.
The choice of benchmark also raises a question that the broader AI industry has been slow to address. If accuracy gains are measured primarily on English-language tasks, they may overstate how well these models perform for non-English-speaking users. A model that scores well on a Chinese medical exam suggests progress on multilingual reasoning, but a single study cannot confirm whether that progress is consistent across languages, specialties, or cultural contexts. The gap between benchmark performance and real-world clinical utility remains wide, and researchers have flagged this distinction repeatedly in evaluations of medical AI systems.
Earlier work cited within the study, including a hepatology-focused analysis in the Journal of Hepatology, had already identified limitations in GPT-3.5’s ability to handle specialized medical queries. That prior research serves as a baseline, highlighting that performance can vary in domain-specific clinical question-answering and that important limitations remained for the older model. The newer Scientific Reports study builds on that foundation by demonstrating that subsequent model versions have closed some of those gaps on a broad, standardized exam, though not necessarily in every specialty or edge case.
What the Improvement Actually Means in Practice
A roughly 33 percent relative improvement in accuracy sounds dramatic, and in the context of a medical exam, it is. But context matters. The improvement is relative, meaning it measures the percentage change from one model’s score to another’s, not an absolute shift of 33 percentage points. A model moving from a low baseline to a moderately higher score can produce a large relative gain without necessarily reaching expert-level performance. For example, an increase from 60 percent to 80 percent accuracy is a 20-point jump but represents a 33 percent relative improvement. The study’s framing emphasizes this relative comparison to highlight how quickly the underlying architecture and training data have evolved.
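To make that distinction concrete, the short sketch below works through the same arithmetic, using the illustrative 60-to-80 percent figures from the example above rather than the study's actual reported scores.

```python
# Absolute vs. relative improvement, using the article's illustrative
# 60% -> 80% example (not the scores reported in the study).
old_accuracy = 0.60
new_accuracy = 0.80

absolute_gain = new_accuracy - old_accuracy                   # 0.20
relative_gain = (new_accuracy - old_accuracy) / old_accuracy  # ~0.333

print(f"Absolute gain: {absolute_gain * 100:.0f} percentage points")  # 20
print(f"Relative gain: {relative_gain:.0%}")                          # 33%
```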
That distinction is not a knock on the progress. It is a reminder that percentage gains need to be read carefully, especially when they appear in headlines or marketing materials. For a patient relying on an AI-assisted diagnosis, the relevant question is not how much better the model got compared to its predecessor. The relevant question is whether the model is accurate enough to trust in a clinical setting, and under what constraints. The study provides evidence of improvement, not evidence of clinical readiness, and it does not claim that GPT-4o can independently meet the standards required of licensed physicians.
This is where the gap between AI benchmarks and medical practice becomes most apparent. Passing a licensing exam, or even scoring well on one, does not mean a model can replace a physician. Licensing exams test knowledge recall and structured reasoning under controlled conditions. Clinical medicine requires judgment under uncertainty, communication with patients, awareness of social and cultural context, and the ability to integrate information that no multiple-choice question can capture, such as subtle physical findings or evolving symptoms over time. AI models are getting better at the first set of skills. The second set remains largely out of reach because it depends on embodied experience, ethical reflection, and responsibility for outcomes.
The Risk of Overreading Benchmark Gains
One pattern in AI coverage deserves scrutiny: the tendency to treat benchmark improvements as proof that a model is ready for deployment in sensitive fields. This article focuses on one recent peer-reviewed result rather than a broader roundup, and the Scientific Reports study itself is careful to frame its findings as a comparison across model versions, not as an endorsement of unsupervised clinical use. That framing is easy to lose when the results get compressed into a headline or a marketing slide that highlights percentage gains without caveats.
The broader AI industry has a track record of promoting benchmark scores as evidence of general capability. OpenAI, Google, and other major labs routinely publish results showing their models outperforming humans on standardized tests, from bar exams to advanced placement subjects. Those results are real, but they measure a narrow slice of what matters. A model that aces a medical exam in a controlled setting may still hallucinate drug interactions, misinterpret patient history, or fail to flag rare conditions that a human clinician would catch based on pattern recognition developed over years of practice.
For readers evaluating these claims, the most useful question is not whether the model improved but whether the improvement changes what the model can safely do. In the case of GPT-4o’s medical exam performance, the answer is nuanced. The model is clearly better at structured medical reasoning than GPT-3.5 was, and that could make it more useful as a tool for education, exam preparation, or drafting clinical documentation under human supervision. Whether that translates into safer, more reliable AI-assisted healthcare tools depends on factors the study was not designed to measure, including how the model handles ambiguous inputs, how it performs under time pressure, how often it produces confident but incorrect answers, and how it interacts with the messy, incomplete data that defines real clinical encounters.
What This Means for AI in Healthcare
The practical takeaway from this research is twofold. First, OpenAI’s newer models are measurably better at medical knowledge tasks than their predecessors. That is a verifiable, peer-reviewed finding grounded in performance on a demanding, real-world exam. Second, the distance between “better at exams” and “safe for patients” is still significant, and no single benchmark can close it. Bridging that gap will require prospective clinical trials, rigorous post-deployment monitoring, and clear regulatory frameworks that treat AI as a tool to augment clinicians rather than a drop-in replacement.
For healthcare systems considering AI integration, the study offers a useful data point but not a green light. The Chinese National Medical Licensing Examination is a demanding test, and strong performance on it suggests that models like GPT-4o can reliably handle many textbook-style clinical questions. That, in turn, could support applications such as decision-support checklists, draft differential diagnoses for review, or educational platforms that help trainees reason through complex cases. Yet every such use case must be designed around the assumption that the model can still be wrong in unpredictable ways, with clinicians retaining final responsibility for decisions.
Ultimately, the sharp jump in exam scores documented in the Scientific Reports article is best understood as a milestone, not an endpoint. It shows that large language models are moving quickly up the ladder of formal medical knowledge, including in non-English settings, and that earlier limitations identified in specialty-focused studies are not fixed ceilings. At the same time, it underscores how much work remains to turn those raw capabilities into systems that are robust, transparent, and accountable enough for frontline care. The technology is advancing, but the standards for patient safety are higher still, and that is exactly as it should be.
*This article was researched with the help of AI, with human editors creating the final content.