The biggest sites on the web are now slamming their doors on AI crawlers — charging millions for the data that has quietly been training the world’s chatbots

For years, AI companies sent their crawlers across the open web like combine harvesters through an unguarded field. Forums, news archives, photo libraries, Q&A threads: all of it was scooped up, processed, and fed into the large language models behind ChatGPT, Gemini, Claude, and their competitors. The hosts of that content mostly never saw a dime. That arrangement is collapsing.

Reddit disclosed $203 million in data-licensing contracts in its 2024 SEC registration filing, turning user-generated posts into a distinct revenue stream. Cloudflare, the infrastructure company that handles roughly 20 percent of all web requests, launched a product in mid-2025 that blocks AI crawlers by default unless site owners grant explicit, paid access. News Corp struck a deal with OpenAI reportedly worth more than $250 million over five years. The Associated Press, Shutterstock, and Stack Overflow have all signed their own licensing agreements. Taken together, these moves mark a fundamental shift: the open web that trained the first generation of chatbots is becoming a metered resource, and the companies that built billion-dollar models on freely available data now face a growing bill.

Reddit put a price tag on its archive

The clearest financial benchmark comes from Reddit’s Form S-1, filed with the U.S. Securities and Exchange Commission ahead of the company’s March 2024 IPO. In that document, Reddit reported data-licensing arrangements with an aggregate contract value of $203 million, spanning two to three years. The filing did not name the buyers or break down per-deal pricing, but subsequent quarterly earnings reports confirmed that data licensing had become a material and growing revenue line, separate from advertising.

What made the disclosure significant was not just the dollar figure but the signal it sent. Reddit hosts one of the largest archives of conversational, human-written text on the internet, covering everything from medical advice to product reviews to niche hobby discussions. AI labs prize that kind of data because it reflects how real people actually talk, argue, and explain things. By monetizing access, Reddit established a precedent: platforms sitting on large pools of authentic human content can and will charge for it.

Reddit was not alone. News Corp’s multi-year agreement with OpenAI, announced in May 2024, covered content from The Wall Street Journal, the New York Post, and other titles. The AP signed a licensing deal with OpenAI in July 2023, granting access to portions of its news archive. Stack Overflow, the developer Q&A platform, reached its own agreement with Google. Each deal had different terms, but the pattern was consistent: organizations that had spent years or decades building content libraries recognized those libraries had a new, paying customer base.

Cloudflare handed the same power to smaller sites

Large platforms can negotiate directly with AI labs. A local news site or a niche blog cannot. That gap is what Cloudflare’s Pay Per Crawl product was designed to close.

Launched into private beta on July 1, 2025, Pay Per Crawl is part of Cloudflare’s broader AI Crawl Control suite. It lets site owners set per-crawl prices and choose which AI bots get access. When an unauthorized crawler hits a page, the server returns an HTTP 402 “Payment Required” response, a status code that has existed in the HTTP specification for decades but was almost never used until Cloudflare gave it a practical application. According to Cloudflare’s product announcement, the default setting blocks AI crawlers unless the site owner explicitly permits them, flipping the long-standing assumption that web content is fair game for automated collection.

By mid-2026, the product has moved beyond its initial beta phase, though Cloudflare has not released detailed adoption statistics. The potential reach is substantial: Cloudflare sits in front of millions of domains, meaning a single toggle in a dashboard could cut off crawler access across a wide swath of the web. Whether most site owners will actually flip that switch remains an open question. Some may prefer the visibility that comes with being included in AI-generated answers. Others, particularly publishers that depend on original content for revenue, have strong incentives to demand payment.

The legal landscape is still catching up

These commercial arrangements are taking shape against a backdrop of unresolved legal battles. The most prominent is The New York Times v. OpenAI, filed in December 2023, in which the Times alleges that OpenAI’s models were trained on millions of its copyrighted articles without permission. OpenAI has argued that its use of publicly available text qualifies as fair use under U.S. copyright law. As of mid-2026, the case has not reached a final ruling, and its outcome could either validate or undermine the legal foundation that AI companies have relied on to justify large-scale scraping.

Other cases are working through courts in the U.S. and Europe. Thomson Reuters v. Ross Intelligence, which addressed whether an AI legal research tool could train on Westlaw content, resulted in a jury verdict against Ross in 2024, though appeals and related disputes continue. In the European Union, the AI Act’s transparency provisions, which took effect in August 2025, require AI developers to document and disclose the datasets used to train general-purpose models. That regulatory pressure adds another reason for AI companies to formalize data access through licensing rather than relying on unacknowledged scraping.

Until case law and regulation settle, the commercial deals function as a parallel system. Platforms and publishers are not waiting for courts to define their rights. They are asserting those rights through contracts, technical barriers, and pricing tools, creating facts on the ground that courts and regulators will eventually have to account for.

What this means for the next generation of AI models

For AI developers, the economics of training are shifting from a one-time infrastructure cost (build a crawler, run it across the web) to an ongoing licensing expense. Companies like OpenAI, Google, Anthropic, and Meta now face a choice: concentrate licensing budgets on a handful of high-value sources like major news organizations and social platforms, or invest in systems that manage thousands of micro-payments to smaller sites through tools like Pay Per Crawl.

The first approach is simpler but narrows the diversity of training data. The second preserves breadth but adds operational complexity and cost. Either way, the era when a well-configured crawler and a permissive robots.txt file were enough to build a world-class language model is over.

There are downstream consequences for model quality, too. If large portions of the web become inaccessible to crawlers that do not pay, future models may be trained on a less representative slice of human knowledge. Niche forums, independent blogs, and small-publisher archives, the long tail of the web that gave early models much of their texture, could disappear from training sets unless AI companies are willing to pay for access at scale.

For publishers and site owners weighing their options, the practical starting point is straightforward: check server logs to see which AI crawlers are already visiting, and at what volume. That baseline determines whether charging for access could generate meaningful revenue or whether the traffic is too thin to matter. Cloudflare customers can explore the AI Crawl Control dashboard directly; sites not already on the platform would need to migrate DNS before gaining access to the pricing tools.

The free lunch is over, but the menu is still being written

The direction is unmistakable. Platforms with large archives of human-generated content have discovered that AI companies need their data badly enough to pay for it. Infrastructure providers are building the plumbing to extend that same leverage to smaller publishers. Legal and regulatory frameworks are tightening around unauthorized scraping. Every signal points the same way.

What remains uncertain is the pace and the price. No third-party audit has measured how much AI training costs have risen across the industry. No public dataset tracks the number of domains now charging for crawler access. Reddit’s $203 million figure and News Corp’s reported $250 million-plus deal are confirmed data points, not proxies for the entire market. The full cost of training the next generation of models will depend on how quickly the rest of the web decides to put up its own toll booths, and how much AI companies are willing to pay to get through them.

More from Morning Overview

*This article was researched with the help of AI, with human editors creating the final content.

IG

FB

PIN

LI

X

The biggest sites on the web are now slamming their doors on AI crawlers — charging millions for the data that has quietly been training the world’s chatbots

Reddit put a price tag on its archive

Cloudflare handed the same power to smaller sites

The legal landscape is still catching up

What this means for the next generation of AI models

The free lunch is over, but the menu is still being written

Author

Get weekly updates with the latest news and tips!

More in AI

IG

FB

PIN

LI

X