Morning Overview

Researchers urge human factors focus for safer AI medical devices

Federal regulators and academic researchers are converging on a shared warning: AI-powered medical devices will not deliver on their clinical promise unless developers and oversight bodies treat the human side of the equation with the same rigor they apply to algorithmic accuracy. A string of recent regulatory actions, consensus guidelines, and federally funded research projects points toward the same gap: the way clinicians actually interact with AI tools in hospitals and exam rooms remains poorly measured and loosely governed. The push to close that gap is gaining momentum across multiple U.S. agencies and international bodies, raising the prospect that human factors testing could soon become a standard requirement for AI device approval and monitoring.

FDA Seeks Input on Real-World AI Performance

The U.S. Food and Drug Administration has opened a formal channel for outside expertise on a problem that lab testing alone cannot solve. Through a new request for comment (docket FDA-2025-N-4203), the agency is asking how to measure and evaluate AI-enabled medical device performance after deployment in real clinical settings. The filing acknowledges that AI performance can shift with changes in clinical practice, patient populations, data inputs, and infrastructure, a set of variables that are deeply tied to human behavior rather than to the software itself.

That framing matters because most current AI device evaluations concentrate on technical benchmarks such as sensitivity, specificity, and area under the curve. Those metrics tell regulators how well an algorithm performs on a curated dataset. They say far less about what happens when a radiologist ignores a flagged finding because the alert arrived during a shift change, or when a primary care physician overtrusts a risk score because the interface offers no easy way to interrogate its reasoning. The FDA’s move to solicit structured input on postmarket evaluation signals that the agency wants methods for capturing exactly those kinds of human–system breakdowns.
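
To make those benchmark terms concrete, the following is a minimal sketch, with invented labels and predictions, of how sensitivity and specificity are computed on a curated test set (AUC would be the ranking analogue). It is exactly the kind of calculation that says nothing about alert timing, interface design, or whether a clinician could interrogate the result.

```python
# Hypothetical illustration of the benchmark metrics named above, computed on
# an invented validation set. None of these numbers reflect any real device.

def sensitivity_specificity(y_true, y_pred):
    """Return sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Invented ground-truth labels and model predictions for a curated test set.
labels      = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]

sens, spec = sensitivity_specificity(labels, predictions)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# These figures say nothing about whether the alert arrived at a usable moment,
# whether the clinician noticed it, or whether its reasoning could be examined.
```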

Advisory Committee Frames the GenAI Challenge

The comment period builds on groundwork laid at the FDA’s Digital Health Advisory Committee meeting held on November 20 and 21, 2024, which centered on total product lifecycle considerations for generative AI-enabled medical devices. The agency asked external experts to advise on premarket evaluation, risk management, and postmarket monitoring for GenAI tools, a class of devices that can produce novel outputs rather than simply classifying inputs.

A supporting discussion paper prepared for that meeting directly addresses what information should be conveyed to users to improve transparency and control risks. It also asks what should be included in premarket evaluation and lifecycle performance characterization, including how developers will monitor for performance drift and unanticipated failure modes. The document’s emphasis on user-facing transparency is notable because it shifts the regulatory conversation from “does the model work?” to “can the clinician tell when it does not?” That distinction has direct safety consequences: a device that performs well on average but fails unpredictably in certain clinical contexts can be more dangerous than a less accurate tool whose limitations are clearly communicated.
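
The discussion paper does not prescribe how drift monitoring should be done. One simple pattern, sketched below with an invented baseline, window size, and threshold, is to compare a rolling window of post-deployment outcomes against the premarket figure and flag when agreement degrades past a tolerated margin.

```python
from collections import deque

# Rough sketch of one way a developer might watch for post-deployment
# performance drift: keep a rolling window of confirmed outcomes and flag
# when agreement with ground truth falls well below the premarket baseline.
# The baseline, window size, and margin are invented for illustration.

BASELINE_ACCURACY = 0.90   # hypothetical premarket figure
WINDOW_SIZE = 200          # number of recent cases to evaluate
DRIFT_MARGIN = 0.05        # tolerated drop before flagging

recent_outcomes = deque(maxlen=WINDOW_SIZE)  # 1 = model agreed with final diagnosis

def record_case(model_correct: bool) -> None:
    """Log whether the model's output matched the confirmed clinical outcome."""
    recent_outcomes.append(1 if model_correct else 0)

def drift_alert() -> bool:
    """Return True when rolling accuracy drops past the tolerated margin."""
    if len(recent_outcomes) < WINDOW_SIZE:
        return False  # not enough post-deployment evidence yet
    rolling_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return rolling_accuracy < BASELINE_ACCURACY - DRIFT_MARGIN
```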

International Principles Target Lifecycle Governance

Regulators outside the United States are moving along a parallel track. The International Medical Device Regulators Forum issued a final set of Good Machine Learning Practice guiding principles in January 2025, and the FDA has situated that document within its own lifecycle framework for AI and machine learning devices. The GMLP principles cover quality management, data representativeness, performance evaluation, and change control, all areas where human factors intersect with technical design choices and deployment decisions.

Separately, an international group of researchers published the FUTURE-AI consensus guideline in a BMJ article, offering a peer-reviewed framework for trustworthy and deployable AI in healthcare. The guideline includes explicit usability components such as defining human–AI interactions and oversight, clarifying responsibility for decisions, and specifying training requirements for clinicians who will use AI tools. Taken together, the IMDRF principles and the FUTURE-AI guideline suggest that the academic and regulatory communities have reached broad agreement on the problem. Where consensus is still lacking is enforcement: how to verify that developers actually conduct meaningful human factors work rather than treating it as a checkbox exercise appended to technical validation.

Federal Research Targets Errors and Burnout

The Agency for Healthcare Research and Quality has put federal dollars behind the effort to answer that question. AHRQ funded a project titled Artificial Intelligence and Human Factors in Healthcare Quality and Safety, an organized research effort to integrate human factors engineering into AI implementation. The project’s stated goals include reducing errors and burnout, two outcomes that depend as much on interface design and workflow integration as on algorithmic precision.

That focus on burnout deserves attention because it reframes the human factors argument beyond patient safety alone. Clinicians who spend extra cognitive effort managing poorly designed AI alerts or second-guessing opaque recommendations face fatigue that compounds over shifts and weeks. If AI tools add friction rather than remove it, the net effect on care quality can be negative even when the underlying model is technically sound. The Department of Health and Human Services has signaled broader interest in this dynamic through digital health initiatives that emphasize usability, equity, and clinician well-being, but translating research findings into binding device requirements remains an open challenge.

Why Technical Accuracy Alone Falls Short

A perspective published in NEJM AI frames the core tension clearly: AI-enabled medical devices show promise for improving health care outcomes, yet real-world effectiveness depends not only on strong technical performance, but also on how people actually use the tools in practice. The authors argue that focusing narrowly on metrics such as accuracy or AUC obscures the ways in which workflow disruptions, ambiguous interfaces, and misaligned incentives can erode benefits or even introduce new safety risks.

In other words, a highly accurate sepsis prediction model that fires dozens of low-precision alerts per shift may train clinicians to ignore it, while a slightly less accurate tool that is tightly integrated into order entry and provides clear rationale might meaningfully change behavior. Human factors engineering offers methods for surfacing these issues before devices are deployed at scale, including usability testing, cognitive walkthroughs, and simulation-based trials. Yet many AI developers still treat such work as optional or defer it to late-stage pilots, where there is limited room to redesign core features.
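
The arithmetic behind that sepsis scenario is worth spelling out. The prevalence, alert volume, and performance figures below are invented, but they show how a model with strong sensitivity and specificity can still produce alerts that are overwhelmingly false when the condition is rare.

```python
# Back-of-the-envelope illustration of the sepsis-alert scenario above.
# Every figure here is invented; the point is the arithmetic, not the values.

patients_per_shift = 400        # patients screened by the model in one shift
true_sepsis_rate = 0.02         # hypothetical prevalence among screened patients
sensitivity = 0.90              # model catches 90% of true cases
specificity = 0.85              # model correctly clears 85% of non-cases

true_cases = patients_per_shift * true_sepsis_rate
non_cases = patients_per_shift - true_cases

true_alerts = true_cases * sensitivity
false_alerts = non_cases * (1 - specificity)
precision = true_alerts / (true_alerts + false_alerts)

print(f"alerts per shift: {true_alerts + false_alerts:.0f}")
print(f"false alerts per shift: {false_alerts:.0f}")
print(f"precision: {precision:.2f}")
# Roughly 66 alerts per shift, about 59 of them false: an "accurate" model
# that clinicians quickly learn to tune out.
```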

Toward Human-Centered Regulatory Expectations

Regulators are now grappling with how to convert these insights into enforceable expectations. One emerging approach is to require that premarket submissions for AI-enabled devices include structured evidence of human factors testing: who the intended users are, what tasks they will perform, what errors are most likely, and how the design mitigates those risks. Another is to mandate postmarket surveillance plans that explicitly track user interaction patterns, override rates, and context-specific failures, rather than relying solely on aggregate performance statistics.
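
What such a surveillance plan might actually compute is easy to sketch. The log schema and site names below are hypothetical, but an override rate broken out by care setting is one example of an interaction-level signal that aggregate performance statistics cannot provide.

```python
from collections import defaultdict

# Hypothetical sketch of an interaction metric a postmarket surveillance plan
# might track: how often clinicians override the device's recommendation,
# broken out by care setting. The log schema and entries are invented.

interaction_log = [
    {"site": "ED",  "recommendation_followed": False},
    {"site": "ED",  "recommendation_followed": True},
    {"site": "ICU", "recommendation_followed": True},
    {"site": "ICU", "recommendation_followed": False},
    {"site": "ED",  "recommendation_followed": False},
]

def override_rates_by_site(log):
    """Fraction of encounters in which the clinician did not follow the device."""
    totals = defaultdict(int)
    overrides = defaultdict(int)
    for entry in log:
        totals[entry["site"]] += 1
        if not entry["recommendation_followed"]:
            overrides[entry["site"]] += 1
    return {site: overrides[site] / totals[site] for site in totals}

print(override_rates_by_site(interaction_log))
# A sustained spike in overrides at one site is a human-system signal that
# aggregate accuracy statistics would not surface.
```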

The FDA’s recent outreach, the IMDRF’s GMLP principles, and academic frameworks like FUTURE-AI all point toward a model in which AI oversight extends across the full lifecycle of a device, from problem selection and data curation to deployment, monitoring, and iterative updates. In that model, human factors are not a final hurdle but a throughline: shaping which clinical problems are worth automating, determining how outputs are presented, and informing when systems should defer to human judgment or require additional confirmation.

For developers, this shift will likely mean earlier and deeper engagement with clinicians, patients, and human factors specialists, as well as more investment in simulation environments that approximate real-world conditions. For health systems, it will require governance structures that treat AI implementation as an organizational change effort rather than a simple software installation. And for regulators, it will demand new review expertise, updated guidance, and data infrastructure capable of capturing how AI devices behave once they leave the lab.

The convergence of regulatory signals and research agendas suggests that the era of evaluating AI-enabled medical devices on technical accuracy alone is drawing to a close. What replaces it is still under construction, but its contours are clear: a human-centered framework in which usability, workflow fit, transparency, and clinician well-being are treated as core safety features, not peripheral concerns. Whether AI ultimately fulfills its promise in medicine may depend less on the next breakthrough model and more on how rigorously the health system learns to design, test, and govern the human interactions that surround it.


*This article was researched with the help of AI, with human editors creating the final content.