When a graduate student at a mid-size U.S. research university recently submitted a literature review drafted with the help of ChatGPT, two of the 14 cited papers turned out not to exist. The references looked plausible: real-sounding journal names, believable author combinations, publication dates that fit the timeline. But the articles were fabrications, invented wholesale by a language model trained to produce text that looks right rather than text that is right. The student’s experience is not an outlier. Peer-reviewed evidence of fabricated and overstated output, rising subscription fees, and tightening usage caps are converging to make AI tools less attractive to the scientists who were among their earliest adopters.
Fabricated citations, inflated conclusions, and ballooning costs
Three distinct bodies of evidence spell out the trouble.
Unreliable outputs. A preprint posted on arXiv in early 2025 examined how large language models summarize scientific research and found that LLM-generated summaries were substantially more likely than human-written ones to contain sweeping generalizations of results. In fields like medicine and environmental science, where a single overstated conclusion can warp clinical guidelines or policy decisions, that tendency is not a minor annoyance. It is a liability. The study has not yet been peer-reviewed as of mid-2025, so its specific figures should be treated as preliminary, but the pattern it describes aligns with concerns researchers have raised since generative AI entered the lab.
Fabricated references. A peer-reviewed study published in 2023 in Scientific Reports, part of the Nature Portfolio, documented that ChatGPT fabricates bibliographic citations and produces substantive citation errors at quantifiable rates, broken down by model version. These are not edge cases. The models generate plausible-looking strings of text; they do not retrieve verified records from a database. For any researcher preparing a manuscript, a single phantom reference discovered during peer review can sink the paper’s credibility and delay publication by months. The study tested earlier ChatGPT versions, and newer releases may perform differently, but no comparable peer-reviewed audit of the latest models has yet been published.
Escalating costs. A 2025 working paper from the National Bureau of Economic Research, drawing on real API transaction data from OpenRouter and Microsoft Azure, documents persistent price dispersion across the LLM market. Its central finding: open-source models can be roughly 90 percent cheaper than closed-source alternatives when matched on capability. That gap means labs locked into proprietary platforms are paying a steep premium, one large enough to reshape how principal investigators allocate limited grant funding. For small teams and institutions in low- and middle-income countries, the premium can effectively put frontier tools out of reach.

The major closed-source providers have not published unified academic rate cards. As of May 2026, OpenAI’s ChatGPT Plus subscription sits at $20 per month for individual users, with a higher-tier Pro plan at $200 per month; Anthropic’s Claude Pro costs $20 per month; and Google’s Gemini Advanced runs $20 per month bundled with a Google One AI Premium plan. API pricing varies by model and usage volume, and all three companies adjust rates and token limits frequently, so researchers should check current pricing pages before committing grant funds. LLM pricing shifts quickly, so the specific figures represent a snapshot rather than a fixed reality, but the structural cost divide between open and closed ecosystems has persisted through multiple pricing cycles.
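To see what a 90 percent gap can mean for a single lab’s budget, the back-of-the-envelope sketch below uses hypothetical per-token rates and a hypothetical monthly workload; none of the figures are quoted from any provider, and real rates vary widely by model and contract.

```python
# Illustrative only: hypothetical rates and workload, not actual provider pricing.
closed_rate_per_million_tokens = 10.00  # assumed closed-source API rate (USD per 1M tokens)
open_rate_per_million_tokens = 1.00     # assumed open-source hosting rate, ~90% cheaper
monthly_tokens_millions = 50            # assumed lab workload: 50M tokens per month

closed_monthly = closed_rate_per_million_tokens * monthly_tokens_millions
open_monthly = open_rate_per_million_tokens * monthly_tokens_millions

print(f"Closed-source: ${closed_monthly:,.0f}/month, ${closed_monthly * 12:,.0f}/year")
print(f"Open-source:   ${open_monthly:,.0f}/month, ${open_monthly * 12:,.0f}/year")
print(f"Annual difference: ${(closed_monthly - open_monthly) * 12:,.0f}")
```

Under those assumptions the difference is several thousand dollars a year for one lab, which is the scale of a conference trip or a semester of research-assistant hours.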
What nobody has measured yet
No large-scale institutional survey has quantified how many researchers have scaled back or abandoned AI tool usage because of these combined pressures. Anecdotal accounts from individual labs suggest a pullback: some principal investigators now instruct students to treat AI outputs as rough drafts requiring line-by-line verification. But anecdotes are not data, and the scale of any shift remains unclear without systematic tracking across disciplines and geographies.
Whether AI providers plan to close the affordability gap for academics is equally uncertain. OpenAI launched a ChatGPT Edu tier in 2024, and Google has offered research credits through various programs, but neither initiative has become a universal solution. Many universities negotiate enterprise licenses on their own, and those agreements tend to be opaque, covering some use cases while leaving others unaddressed. Without transparent pricing roadmaps, labs face real difficulty budgeting for multi-year projects that depend on AI-assisted workflows, and early-career researchers may hesitate to build methods around tools whose costs could spike mid-grant.
The financial toll of unreliable outputs is another open question. The citation-fabrication and generalization-bias studies establish that errors occur at measurable rates, but no peer-reviewed research has calculated the downstream cost in researcher hours, retracted papers, or delayed grant deliverables. Isolated case studies exist, such as manuscripts rejected after reviewers flag nonexistent references, but they do not add up to a reliable aggregate figure. Until that work is done, claims about the total economic burden of AI errors on the research enterprise remain hypotheses, not established facts.
There is also the question of whether open-source models can truly substitute for proprietary ones across specialized scientific tasks. The NBER paper establishes dramatic cost savings, but in domains like computational biology, econometrics, or legal scholarship, performance may depend on nuanced domain knowledge that generic models do not yet possess. A 90 percent discount means little if researchers spend equivalent time correcting lower-quality results, or if errors slip through and damage reputations.
How strong is the evidence?
Readers should weigh these sources on a sliding scale. The Scientific Reports study on citation fabrication is the most vetted: formally peer-reviewed, published in a respected journal, and built on quantified, reproducible error rates. The arXiv preprint on generalization bias is methodologically straightforward but has not yet cleared independent review, so its specific numbers are best treated as directional rather than definitive. The NBER working paper carries institutional weight and is grounded in real market transaction data, though it has not undergone traditional journal peer review. All three are stronger foundations for discussion than social media threads or opinion columns, which can capture the mood in academic communities but cannot establish rates of failure or patterns of institutional response.
What researchers and institutions can do before the next grant cycle
For scientists weighing whether to keep investing time and money in AI tools, the most immediate step is verification. Every reference an LLM proposes should be checked against library databases or publisher websites to confirm the article exists, the metadata are correct, and the cited passage actually supports the claim. The citation-fabrication data alone makes this non-negotiable, regardless of which model or version a researcher uses.
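Part of that check can be automated. The minimal sketch below queries the public Crossref REST API for a DOI and compares the registered title against the one the model supplied; the DOI and title in the usage example are hypothetical placeholders, and a lookup failure or mismatch is a signal to verify by hand, not proof of fabrication.

```python
# First-pass citation check against the public Crossref REST API.
# A miss or title mismatch flags the reference for manual review.
import requests

def check_doi(doi: str, expected_title: str) -> str:
    """Return a short verdict for one DOI/title pair proposed by an LLM."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code == 404:
        return "NOT FOUND in Crossref - verify by hand"
    resp.raise_for_status()
    record = resp.json()["message"]
    registered_title = (record.get("title") or [""])[0]
    # Rough substring match; formatting differences still warrant a human look.
    if expected_title.lower().strip() not in registered_title.lower():
        return f"title mismatch - Crossref has: {registered_title!r}"
    return "matches Crossref record"

# Hypothetical example of a DOI and title as an LLM might present them.
print(check_doi("10.1234/example-doi", "A study that may not exist"))
```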
Labs on fixed budgets should benchmark open-source alternatives against the proprietary tools they currently pay for. A structured test, using a small set of representative tasks like summarizing recent papers in a subfield or drafting methods sections, can reveal whether a cheaper model performs well enough once human oversight is factored in. The savings could free up resources for the verification work that unreliable outputs demand, whether that means hiring a research assistant or investing in better reference-management software.
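One way to run that structured test is a small blinded harness like the sketch below. The two callables stand in for whatever API client or local inference wrapper a lab actually uses (they are placeholders, not real library calls), output order is shuffled per task so reviewers cannot guess the source, and scoring stays with human readers.

```python
# Minimal harness for a blinded head-to-head comparison of two models
# on a fixed set of representative tasks; humans score the saved outputs.
import csv
import random
from typing import Callable, List

def run_benchmark(tasks: List[str],
                  model_a: Callable[[str], str],
                  model_b: Callable[[str], str],
                  out_path: str = "benchmark_outputs.csv") -> None:
    rows = []
    for i, prompt in enumerate(tasks):
        outputs = [("A", model_a(prompt)), ("B", model_b(prompt))]
        random.shuffle(outputs)  # randomize order to reduce position bias
        rows.append({
            "task_id": i,
            "prompt": prompt,
            "first_label": outputs[0][0], "first_output": outputs[0][1],
            "second_label": outputs[1][0], "second_output": outputs[1][1],
        })
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical usage: wrap a paid API and an open model behind the callables.
# tasks = ["Summarize the key findings of <paper X> in 150 words", ...]
# run_benchmark(tasks, proprietary_model, open_model)
```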
Institutions and funders also have leverage. Universities can run centralized evaluations of AI tools, publish internal guidance on acceptable uses, and negotiate clearer terms with vendors so that individual labs are not left to navigate opaque pricing alone. Grant-making bodies can encourage applicants to budget realistically for both AI access and the human labor needed to audit AI-assisted work. Until the evidence base expands to cover long-term outcomes, the safest approach is to treat large language models as fallible instruments whose benefits and risks require continuous, honest measurement, not as autonomous research partners that can be trusted on their own.
*This article was researched with the help of AI, with human editors creating the final content.