Hospitals adopting ambient AI scribes to reduce clinician burnout are running into an uncomfortable side effect: the technology appears to increase billable output per physician, raising costs that insurers are unwilling to absorb without new safeguards. Peer-reviewed research now links these tools to measurable jumps in billing metrics, while federal data on documentation errors gives payers ammunition to push back. The result is a growing standoff between health systems that view AI scribes as legitimate efficiency gains and insurers that see a new vector for inflated claims.
What is verified so far
Two primary studies published in JAMA Network Open form the evidentiary backbone of this dispute. The first is a randomized trial of ambient AI scribes integrated into an electronic health record system, available through the PubMed Central archive. That trial measured documentation time, workflow changes, and clinician experience, providing controlled evidence on what these tools actually alter in daily practice. Its design, a randomized comparison rather than a before-and-after case study, gives it more weight than the vendor-funded pilots that hospitals often cite when justifying adoption.
The second study used event-study and difference-in-differences methods with controls to tie ambient AI scribe adoption to changes in physician financial productivity, specifically relative value units (RVUs) per week. RVUs are the standard currency Medicare and private insurers use to set physician reimbursement. When RVUs rise without a corresponding increase in patient volume or acuity, payers interpret it as documentation inflation rather than genuine clinical work. This study’s translation of RVU shifts into potential payer spend is what makes it central to the cost debate.
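For readers unfamiliar with the method, the sketch below shows the core difference-in-differences logic on synthetic physician-week data. It is not the study's actual specification: the column names, adoption week, and simulated effect size are all invented for illustration.

```python
# Minimal difference-in-differences sketch on synthetic physician-week data.
# Everything here (names, adoption timing, effect size) is illustrative,
# not the published study's model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_physicians, n_weeks = 200, 52
df = pd.DataFrame({
    "physician": np.repeat(np.arange(n_physicians), n_weeks),
    "week": np.tile(np.arange(n_weeks), n_physicians),
})
df["treated"] = (df["physician"] < 100).astype(int)  # adopted an AI scribe
df["post"] = (df["week"] >= 26).astype(int)          # adoption at week 26
# Simulate weekly RVUs: baseline ~60, +2 RVUs/week after adoption.
df["rvus"] = 60 + 2 * df["treated"] * df["post"] + rng.normal(0, 5, len(df))

# The coefficient on treated:post is the DiD estimate of the adoption
# effect, net of the secular trend shared with non-adopters.
model = smf.ols("rvus ~ treated + post + treated:post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["physician"]})
print(round(model.params["treated:post"], 2))  # ~2.0 on this synthetic panel
```

The interaction term is what lets a study of this design attribute RVU shifts to scribe adoption rather than to time trends or to stable differences between adopters and non-adopters.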
Both JAMA Network Open papers sit within the broader ecosystem of peer-reviewed biomedical research indexed by the National Library of Medicine. That context matters because the methodologies they use—randomization in one case and quasi-experimental econometric techniques in the other—are aligned with standard health services research practices rather than bespoke analytics commissioned by technology vendors.
On the payer side, the Centers for Medicare and Medicaid Services published 2025 supplemental improper payment data covering the measurement period from July 1, 2023 through June 30, 2024. The report, accessible through CMS’s error-rate portal, catalogs how often documentation and coding errors lead to improper payments in Medicare fee-for-service claims. While the CMS data is not AI-scribe-specific, it establishes the baseline rate at which documentation problems already generate overpayments, giving insurers a concrete reference point when arguing that AI-assisted notes could worsen existing patterns.
Separately, the Office of the National Coordinator for Health IT released a data brief on hospital trends in predictive AI use between 2023 and 2024, tracking how rapidly facilities are adopting tools that touch documentation and billing workflows. That brief provides institutional context on the speed of deployment, which matters because faster adoption can outpace the development of audit frameworks and governance structures designed to catch misuse.
What remains uncertain
The most significant gap in the current evidence is causation. The JAMA Network Open studies establish that AI scribes change documentation behavior and that billing metrics shift after adoption. But neither study proves that the higher RVU output reflects inappropriate coding rather than more accurate capture of work physicians were already performing. Hospitals argue the latter: that clinicians have historically under-documented complex visits because of time pressure, and that AI scribes simply record what was always happening. Insurers counter that when every note is longer and more detailed by default, coders and algorithms will assign higher-complexity billing levels whether or not the clinical encounter warrants it.
Another unresolved question is how AI scribes interact with existing compliance programs. Many health systems already run internal audits to detect upcoding and documentation gaps, but the research does not yet show whether those programs are being updated to account for AI-generated content. If compliance teams treat AI notes as equivalent to human notes, they may miss systematic patterns (such as boilerplate language that nudges encounters into higher-paying codes) that are specific to the technology.
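As a purely hypothetical illustration of what an AI-aware compliance check could add, the snippet below flags sentences that recur verbatim across a batch of notes, a pattern more typical of templated output than of human dictation. The function name and threshold are invented, not drawn from any deployed compliance tool.

```python
# Hypothetical screen: sentences repeated verbatim across many notes can
# signal boilerplate worth reviewing for code-level impact.
from collections import Counter

def flag_boilerplate(notes: list[str], min_share: float = 0.5) -> list[str]:
    """Return sentences that appear in at least min_share of the notes."""
    counts = Counter()
    for note in notes:
        # Deduplicate within a note so one note cannot inflate a sentence.
        sentences = {s.strip() for s in note.split(".") if s.strip()}
        counts.update(sentences)
    return [s for s, c in counts.items() if c / len(notes) >= min_share]

notes = [
    "Comprehensive review of systems performed. Patient stable.",
    "Comprehensive review of systems performed. Follow up in 2 weeks.",
    "Comprehensive review of systems performed. Labs ordered.",
]
print(flag_boilerplate(notes))  # ['Comprehensive review of systems performed']
```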
No major insurer, including large national carriers, has published a formal policy response specifically targeting AI-scribe-driven billing changes. The friction is visible in trade discussions and secondary reporting, but the absence of official position papers means the scope of proposed fixes, whether audits, rate adjustments, or preauthorization rules, is still speculative. Similarly, no hospital association has released aggregate data defending AI adoption against upcoding accusations, leaving the debate to individual case studies and vendor testimonials.
CMS has not issued guidance or rulemaking that addresses AI-driven RVU increases directly. The federal payment infrastructure for programs such as Medicaid and the Children’s Health Insurance Program, described on the agency’s public portal, and federal coverage standards for private plans, summarized on HealthCare.gov, were designed around human documentation norms. Regulators have not yet signaled how they plan to adapt error-rate testing to account for AI-generated notes. The improper payment data from the July 2023 to June 2024 measurement window predates widespread ambient scribe deployment, so it cannot isolate AI-specific effects.
One analytical question the evidence raises but does not resolve is distributional: if AI scribes produce uniformly thorough documentation across all providers, they may compress the billing distribution upward. Physicians who previously billed at lower complexity levels could now generate notes that justify mid-to-high-level codes. That pattern would look like upcoding in aggregate even if each individual note is accurate. Distinguishing systemic documentation improvement from systemic code inflation will require studies that compare AI-generated notes against independent clinical review, and no such study has been published.
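The mechanics of that compression are easy to simulate. The toy model below assumes human notes capture only 60 to 100 percent of a visit's true complexity while AI notes capture all of it; the coding cutoffs and capture rates are invented, and the point is only that the code distribution shifts upward even when every AI-coded note matches the visit's true complexity.

```python
# Toy simulation of distributional compression. All cutoffs and capture
# rates are invented; this illustrates a mechanism, not measured behavior.
import numpy as np

rng = np.random.default_rng(1)
true_complexity = rng.uniform(0, 1, 10_000)  # latent visit complexity

def code_level(documented):
    # Map documented complexity to E/M-style levels 3-5 via toy cutoffs.
    return np.digitize(documented, [0.4, 0.7]) + 3

human = code_level(true_complexity * rng.uniform(0.6, 1.0, 10_000))  # under-capture
ai = code_level(true_complexity)                                     # full capture

for lvl in (3, 4, 5):
    print(lvl, round((human == lvl).mean(), 3), round((ai == lvl).mean(), 3))
# Level 4 and 5 shares rise under full capture, so aggregate billing climbs
# even though no individual AI-coded note overstates the visit.
```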
How to read the evidence
Readers and policymakers should weigh these sources differently based on their design and scope. The randomized trial cited above is the strongest available evidence on operational effects because it compares clinicians who use AI scribes with those who do not, under controlled conditions. Its findings on reduced documentation time and improved perceived workload support the claim that AI scribes can alleviate burnout, but the study was not powered to evaluate billing outcomes or payer costs.
The RVU-focused study carries more weight on financial questions, but it remains observational. Even with difference-in-differences methods and robust controls, unmeasured factors (such as concurrent changes in coding education or shifts in case mix) could contribute to the observed increases in billed services. Policymakers should therefore treat its estimates of potential spending impact as plausible upper bounds, not definitive forecasts.
The CMS improper payment statistics serve a different role. They are descriptive rather than causal, showing how often documentation and coding errors already lead to overpayments in Medicare fee-for-service. When paired with research on AI scribes, they highlight the risk that any technology that systematically lengthens and structures notes could amplify existing error patterns. However, they do not, on their own, prove that AI will worsen error rates, only that the system is sensitive to documentation quality.
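A back-of-the-envelope calculation, using entirely invented figures, shows why that sensitivity matters to payers:

```python
# Purely illustrative arithmetic; the claim volume, error rates, and
# average overpayment are invented, not CMS figures.
claims = 1_000_000
avg_overpayment = 120.0  # dollars per improperly paid claim
for error_rate in (0.06, 0.07, 0.08):
    print(f"{error_rate:.0%}: ${claims * error_rate * avg_overpayment:,.0f}")
# Each percentage point of error moves spend by $1.2M on this hypothetical
# volume, so even modest AI-driven shifts in error rates scale quickly.
```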
Patients and consumer advocates navigating these debates can turn to federally supported assistance programs that explain coverage rules and appeals processes. CMS maintains a resource hub for community organizations and navigators on the assister portal, which outlines how beneficiaries can challenge denied claims or questionable bills. While that site does not yet address AI-specific disputes, it is part of the infrastructure that will be tested if billing conflicts over AI-generated documentation reach the consumer level.
Policy implications and next steps
In the near term, the most likely response from insurers is targeted auditing rather than broad exclusions of AI-generated notes. Payers already conduct post-payment reviews focused on high-cost services and outlier providers; integrating AI-scribe flags into those workflows is a relatively incremental change. Hospitals, anticipating that scrutiny, may invest in internal audits that compare AI-assisted documentation against encounter recordings or clinician attestations to demonstrate that higher RVUs reflect real work.
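One plausible shape for such a flag is an outlier screen layered onto existing post-payment review. The sketch below uses invented field names, data, and thresholds: it surfaces scribe-using providers whose RVU growth departs sharply from their peers, as candidates for manual chart review rather than automatic denial.

```python
# Hypothetical post-payment screen; all field names and data are invented.
import pandas as pd

def flag_outliers(df: pd.DataFrame, z_cut: float = 1.5) -> pd.DataFrame:
    """Expects columns: provider, pre_rvus, post_rvus, uses_ai_scribe."""
    scribes = df[df["uses_ai_scribe"]].copy()
    scribes["growth"] = scribes["post_rvus"] / scribes["pre_rvus"] - 1
    z = (scribes["growth"] - scribes["growth"].mean()) / scribes["growth"].std()
    return scribes[z > z_cut]  # route to manual chart review

df = pd.DataFrame({
    "provider": ["A", "B", "C", "D", "E", "F"],
    "pre_rvus": [55, 60, 58, 62, 57, 61],
    "post_rvus": [58, 63, 95, 64, 59, 63],  # provider C jumps sharply
    "uses_ai_scribe": [True] * 6,
})
print(flag_outliers(df)["provider"].tolist())  # ['C']
```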
Regulators face a more complex task. Updating improper payment measurement programs to distinguish between legitimate documentation gains and upcoding will require new sampling strategies and possibly new categories of error. One option is to create audit cohorts specifically for encounters documented with AI scribes, allowing CMS and private payers to track whether error rates diverge from human-only baselines. Another is to encourage or require vendors to build audit logs that show how notes were generated and edited, giving investigators a clearer view of clinician oversight.
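No standard schema for such logs exists yet, but a minimal sketch of what a vendor might record, assuming per-event capture of drafts and clinician edits, could look like this:

```python
# Hypothetical audit-log shape for AI-generated notes; no vendor standard
# exists, so the field names and event types here are assumptions.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class NoteAuditEvent:
    note_id: str
    actor: str           # "ai_scribe" or a clinician identifier
    action: str          # e.g. "draft_generated", "section_edited", "signed"
    section: str | None  # e.g. "history_of_present_illness"
    timestamp: datetime
    chars_changed: int   # size of the edit; 0 for generation or signature

@dataclass
class NoteAuditLog:
    note_id: str
    events: list[NoteAuditEvent] = field(default_factory=list)

    def clinician_edit_ratio(self) -> float:
        """Share of events performed by a human, a crude oversight signal."""
        if not self.events:
            return 0.0
        human = sum(1 for e in self.events if e.actor != "ai_scribe")
        return human / len(self.events)
```

A log of this shape would let an auditor answer a question current notes cannot: how much of the final document the clinician actually touched before signing.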
For now, the research record supports two cautious conclusions. First, ambient AI scribes do appear to change documentation in ways that meaningfully affect billing metrics, at least in early adopter settings. Second, the existing federal oversight apparatus was not designed with this technology in mind, leaving a regulatory lag that both hospitals and insurers are trying to navigate. Future studies that directly compare AI-generated notes with independent clinical reviews, coupled with transparent reporting from payers and health systems, will be essential to determining whether the current billing bump reflects overdue recognition of clinical complexity or a new form of digital upcoding.
*This article was researched with the help of AI, with human editors creating the final content.*