General AI now performs at or above human-expert level on standardized tests in many fields

Large language models built by Google, OpenAI, and academic research teams have matched or exceeded human-expert scores on standardized medical licensing exams, according to multiple peer-reviewed evaluations published over the past year. Med-PaLM 2, developed by Google Research and DeepMind, achieved state-of-the-art results on USMLE-style questions and earned the explicit label “expert-level” from its own creators. GPT-4, tested independently by researchers including Microsoft Research collaborators, cleared the USMLE passing threshold by a wide margin. These results raise a direct question for medical schools, licensing boards, and hiring managers: if a general-purpose AI can pass the same exams used to certify physicians, what does that mean for the professionals who spent years earning those credentials?

Why expert-level AI exam scores create pressure on medical training

The immediate tension is not whether AI can answer multiple-choice questions. It is whether institutions that rely on standardized test scores as gatekeeping tools will be forced to rethink how they evaluate competence. Medical residency programs, for example, use USMLE Step scores to screen thousands of applicants each cycle. When a language model clears the same bar that separates competitive candidates from rejected ones, the signal value of the exam itself comes into question.

A testable version of this concern: if models that already clear expert thresholds on broad, multi-domain evaluation suites keep improving, entry-level diagnostic roles in medicine could face measurable displacement within 24 months, and that displacement would show up in residency application volumes and hiring data. That hypothesis remains unproven. No public dataset yet tracks whether AI-assisted triage or diagnostic support has reduced demand for junior clinicians. But the building blocks are in place. AI systems are already being piloted in radiology reads, pathology screening, and clinical decision support, all areas where early-career physicians traditionally build skills.

The gap between exam performance and clinical deployment is real, but it is narrowing fast enough to force decisions. Licensing bodies have not publicly stated whether AI benchmark scores will change exam design or policy. Medical schools have not announced curriculum changes tied to AI capabilities. That silence is itself a signal: institutions are watching, but none has moved first. As more health systems experiment with AI-based tools in documentation, coding, and preliminary chart review, the pressure on educators to define what uniquely human skills trainees must master will only increase.

Benchmark evidence from MultiMedQA, USMLE, and HELM

The strongest evidence comes from three distinct research efforts, each using different models and evaluation methods, that converge on the same finding. Google Research and DeepMind authors introduced the MultiMedQA benchmark, which aggregates medical question-answering datasets including MedQA (built from USMLE-style questions), PubMedQA, and several others. Their initial model, Med-PaLM, posted strong scores across these datasets but still showed safety and quality gaps flagged by physician evaluators during human review of long-form answers.

The successor system addressed many of those shortcomings. Med-PaLM 2, described by Google Research and DeepMind authors as achieving expert-level performance in medical question answering, reported state-of-the-art results on USMLE-style MedQA. Physician evaluators preferred Med-PaLM 2 answers over earlier model outputs on long-form consumer medical questions, a qualitative signal that goes beyond raw accuracy on multiple-choice items. The study also documented improvements in factuality and reductions in potentially harmful advice, though it emphasized that residual safety issues remained.

Separately, GPT-4 was evaluated on USMLE practice materials covering Steps 1, 2CK, and 3 by researchers including Microsoft Research collaborators. That study found GPT-4 exceeded the USMLE passing threshold by a substantial margin, outperforming earlier models including medically fine-tuned baselines. A peer-reviewed evaluation published in a biomedical journal independently confirmed GPT-4’s strong performance on USMLE-style questions, providing clinical-domain corroboration outside the teams that built the model. Together, these results suggest that general-purpose language models, not just domain-specialized systems, can reach exam-level competence in medicine.

Beyond medicine, Stanford’s Center for Research on Foundation Models developed the HELM framework to evaluate language models across knowledge domains, reasoning tasks, and robustness measures. The HELM methodology tracks performance on dozens of scenarios and metrics, offering a broader lens than any single professional exam. While HELM does not map directly onto official licensing tests outside medicine, it shows that the pattern of expert-range performance extends across subject areas, not just clinical knowledge. Models that do well on HELM scenarios involving scientific QA, reading comprehension, and multi-step reasoning often also post strong scores on specialized benchmarks such as MultiMedQA.

What the exams do not measure and what to watch next

Every result cited above carries the same caveat: these models were tested on practice questions and released datasets, not on live, proctored exam administrations. The USMLE is co-sponsored by the Federation of State Medical Boards and the National Board of Medical Examiners; the latter has not publicly commented on whether AI performance data will alter exam design, scoring, or access policies. No medical licensing body has issued guidance on how to treat AI benchmark scores in credentialing decisions, and there is no indication that current licensure pathways will accept AI-generated exam results as evidence of competence.

The clinical gap is equally important. Passing a written exam is a necessary but insufficient condition for practicing medicine. Standardized tests do not assess bedside manner, procedural skill, or the ability to manage uncertainty in real time with a real patient. Physician evaluators in the MultiMedQA work flagged safety concerns in long-form answers, including occasional fabrication of facts, overconfident statements about uncertain diagnoses, and incomplete counseling on risks and follow-up. These are precisely the areas where human clinicians rely on tacit knowledge, ethical judgment, and interpersonal communication rather than recall of textbook facts.

For now, that gap shapes how health systems are deploying AI. Many pilots position language models as decision-support tools that draft notes, summarize literature, or suggest differential diagnoses, with a licensed clinician responsible for verification and final decisions. In this configuration, exam-level performance is a floor, not a ceiling: the model must be good enough to be useful, but its output is filtered through human judgment. The open question is whether, as models continue to improve, regulators will allow them to act with greater autonomy in narrow domains, such as preliminary triage or follow-up reminders, and what evidence will be required to justify that shift.

Several indicators will show whether exam results are beginning to reshape the medical workforce. One is whether residency programs and employers start to discount standardized test scores, on the assumption that AI tools will be universally available and can boost any candidate’s performance. Another is whether training programs explicitly teach students how to collaborate with AI systems, including when to trust a model’s suggestion, when to override it, and how to document that interaction in the medical record. A third is whether malpractice insurers and hospital risk committees issue guidance on acceptable use of AI in diagnosis and treatment planning.

There is also a risk of overreaction. High scores on benchmarks can create a misleading impression that models are ready to replace clinicians wholesale. The evidence base does not support that conclusion. The same studies that celebrate expert-level accuracy also document nontrivial rates of clinically significant error. Moreover, standardized questions are carefully curated and stripped of the messy context that characterizes real-world cases: incomplete histories, conflicting lab results, language barriers, and patients with multiple overlapping conditions. Until models are tested and validated in those settings, exam performance should be treated as an early milestone, not an endpoint.

Still, the direction of travel is clear enough that stakeholders can no longer ignore it. Medical schools will have to decide whether to treat language models as banned calculators, mandatory tools, or something in between. Licensing boards will need to consider how to preserve the integrity of take-home or unproctored assessments in a world where AI can generate plausible answers to almost any written question. And health systems will face pressure, from cost-conscious administrators, from patients seeking faster access, and from clinicians themselves, to define where AI support is appropriate and where only a human will do.

The arrival of AI systems that can pass medical licensing exams does not, by itself, automate the practice of medicine. It does, however, expose how much of the current credentialing infrastructure is built around tasks that machines can now perform. The next phase will be less about proving that models can answer questions and more about deciding what kinds of questions we still need humans to answer, how to measure those uniquely human abilities, and how to integrate AI safely into the workflows that determine who gets care, and how quickly, in the first place.

This article was researched with the help of AI, with human editors creating the final content.

Author: Dorian Maddox
