When a first-time parent types “when can babies eat honey” into Google, the answer that appears at the top of the page now often comes not from a linked website but from an AI-generated summary. A new academic audit suggests those summaries, known as AI Overviews, sometimes deliver advice about infant care and pregnancy that conflicts with established medical guidance, raising questions about the reliability of a feature Google has placed front and center in its search results.
The study, titled “Auditing Google’s AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy” and published as a preprint on arXiv (arXiv:2511.12920), was led by Ines Zelenkauskaite and colleagues. It represents one of the first systematic efforts to measure how accurately Google’s generative search summaries perform in a high-stakes health category. Its findings, while preliminary, move the conversation beyond viral screenshots of bizarre AI answers and toward structured, repeatable evidence.
What the audit actually measured
Zelenkauskaite and her co-authors built a fixed set of baby care and pregnancy questions and ran them through Google Search, recording when AI Overviews appeared and comparing the content of those summaries against trusted medical sources. They flagged cases where the AI-generated text contradicted established guidance, left out critical safety caveats, or blended accurate and inaccurate statements in ways that could mislead a non-expert reader.
By repeating queries and documenting outputs over time, the audit aimed to distinguish one-off glitches from recurring patterns. The choice of domain was deliberate: people searching for information about infant feeding, sleep safety, or pregnancy warning signs are often making urgent, health-sensitive decisions. A flawed summary in this context carries different weight than an incorrect answer about a pop culture question.
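For readers curious what such an audit looks like in practice, the repeat-and-classify loop the paper describes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `fetch_overview` and `classify` functions here are hypothetical stand-ins for live search scraping and expert medical review.

```python
from collections import Counter
from typing import Callable, Optional

def audit(queries: list[str],
          fetch_overview: Callable[[str], Optional[str]],
          classify: Callable[[str, str], str],
          runs: int = 3) -> dict[str, Counter]:
    """Repeat each query several times and tally outcomes, so that
    one-off glitches can be distinguished from recurring patterns."""
    tally = {q: Counter() for q in queries}
    for q in queries:
        for _ in range(runs):
            overview = fetch_overview(q)  # None means no AI Overview appeared
            if overview is None:
                tally[q]["no_overview"] += 1
            else:
                tally[q][classify(q, overview)] += 1
    return tally

# Stubbed example: a deterministic fake fetcher and classifier stand in
# for real search results and comparison against medical guidance.
queries = ["when can babies eat honey"]
fake_fetch = lambda q: "Honey is safe at any age."  # fabricated output
fake_classify = lambda q, text: ("inconsistent" if "any age" in text
                                 else "consistent")
result = audit(queries, fake_fetch, fake_classify)
```

In a real audit, the classification step is the hard part: it requires human judgment or a carefully validated rubric, which is exactly where two studies with different criteria can diverge.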
The paper found that AI Overviews appeared for a substantial share of the tested queries and that inconsistencies with medical consensus were not rare outliers. The authors argue this makes baby care and pregnancy a useful stress test for the safety of generative search features more broadly: given the volume of these searches, even a modest error rate translates into a large number of people encountering questionable health advice.
Because the study is a preprint, it has not yet undergone formal peer review. That distinction matters: the specific numbers it reports should be treated as provisional until outside reviewers have scrutinized the sample size, query selection, and criteria used to classify outputs as inconsistent. The methodology, however, is transparent and replicable, which allows other researchers to build on or challenge the work.
What Google has and hasn’t said
Google has not published its own error-rate data for AI Overviews in health-related queries, nor has it offered detailed accuracy breakdowns by topic area such as pediatrics or obstetrics. The company has previously stated that AI Overviews are designed to surface helpful information quickly and that the feature is continuously improved through testing and user feedback.
During the initial rollout of AI Overviews in May 2024, Google publicly acknowledged some errors after widely shared examples of bizarre or dangerous suggestions circulated online. Reporting from The Washington Post at the time included expert commentary suggesting the problems might be structural rather than easily patched, tied to how large language models remix web content into confident-sounding summaries regardless of underlying accuracy.
Google has since made updates to AI Overviews, including adding more prominent source citations and refining how the feature handles sensitive health topics. Whether those changes would materially alter the audit’s findings is unclear, since the preprint does not appear to account for the most recent iterations of the product. The company did not provide a direct response to the specific inconsistencies documented in the study.
The gaps that remain
Several important questions sit outside the scope of this audit. The study measures what the system outputs, not what happens afterward. There is no long-term data on whether parents actually follow the advice they see in AI Overviews, whether they scroll past the summary to check traditional links, or whether they cross-reference what they read with a pediatrician. Demonstrating concrete harm would require a different kind of research, one that tracks user behavior and health outcomes rather than search results alone.
Without an internal benchmark from Google, outside researchers are also left to build their own measurement frameworks. Two audits using different query sets or different definitions of “inconsistency” could reach different conclusions about the same feature, even if they are observing similar underlying behavior. That methodological variability is normal in early-stage research, but it means no single study should be treated as the final word.
There is also a broader tension that extends beyond Google. Any search engine that places AI-generated text above traditional links is making an implicit promise to users: this answer is good enough to act on. The baby care audit is one of the first attempts to test that promise with systematic evidence in a domain where the stakes are genuinely high.
What parents and caregivers should know
For anyone using Google’s AI Overviews to look up health information about infants or pregnancy, the practical guidance from medical professionals has not changed: treat any AI-generated summary as a starting point, not a diagnosis or a care plan. Organizations such as the American Academy of Pediatrics and the American College of Obstetricians and Gynecologists maintain publicly accessible guidance on common questions about feeding, sleep, and prenatal care. A licensed healthcare provider remains the most reliable source for decisions that could affect a child’s health or a pregnancy.
The audit does not suggest that every AI Overview on these topics is wrong. It suggests that the error rate is high enough to warrant caution, particularly for users who may not have the medical background to spot when a summary has omitted a safety caveat or blended reliable and unreliable information. Until Google publishes transparent accuracy benchmarks or independent peer-reviewed studies confirm specific error rates, that uncertainty is reason enough to verify before acting.
Where the research goes from here
As of spring 2026, the preprint has not yet appeared in a peer-reviewed journal, though its public availability on arXiv means it is already circulating among AI accountability researchers. If the methodology holds up under review, it could serve as a template for similar audits in other sensitive domains: mental health queries, medication interactions, financial planning, legal rights.
For regulators and platform designers, the emerging body of audit research points toward a few concrete needs: transparent benchmarks for AI-generated health content, clear disclosure to users about the limitations of generative summaries, and interface designs that make it easy to see where the AI’s claims originate. The technology behind AI Overviews is not going away. How rigorously it is tested, and how much responsibility companies accept for its mistakes, is still being worked out.
*This article was researched with the help of AI, with human editors creating the final content.