Morning Overview

The biggest sites on the web are now slamming their doors on AI crawlers and charging millions for the data that has quietly been training the world’s chatbots

For years, AI companies treated the open web like an all-you-can-eat buffet. Crawlers from OpenAI, Google, Anthropic, and others swept through news sites, forums, and knowledge bases, vacuuming up text to train large language models. The sites got nothing in return. That arrangement is collapsing. As of mid-2025, some of the internet’s most trafficked platforms have started locking their content behind licensing deals, lawsuits, and network-level tollbooths, and the price tags are not small.

Reddit put a number on it

The first hard dollar figure came from Reddit. In its Form S-1 registration statement, filed with the U.S. Securities and Exchange Commission in February 2024 ahead of its IPO, Reddit disclosed data-licensing contracts with an aggregate value of $203 million. The deals carried terms of two to three years, meaning the money would arrive as recurring revenue rather than a lump sum.

That filing mattered because it converted rumor into regulatory fact. SEC disclosures carry legal liability for material misstatements, so the $203 million figure is among the most reliable data points in the entire AI-training-data debate. Since going public, Reddit has continued to report revenue from these arrangements. Its quarterly earnings filings through early 2025 show “other revenue,” which includes data licensing, growing steadily, a sign the contracts are producing real cash, not just headline numbers.

Reuters, citing unnamed sources, separately reported that Google had agreed to pay Reddit roughly $60 million per year for access to its content. Neither company has publicly confirmed the specific terms, so the figure should be treated as credible but unverified.
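
For a sense of scale, the disclosed figures can be annualized. The sketch below is back-of-the-envelope arithmetic only: it assumes the $203 million is spread evenly across terms of exactly two or exactly three years, which the filing does not state, and the function name is illustrative.

```python
# Back-of-the-envelope arithmetic on the figures quoted above.
# Assumption (not from the filing): revenue is spread evenly over the term.

AGGREGATE_VALUE_USD = 203_000_000  # aggregate contract value disclosed in the S-1
TERM_YEARS = (2, 3)                # the stated range of contract terms


def annualized(total_usd: int, years: int) -> float:
    """Return the average revenue per year under an even split."""
    return total_usd / years


for years in TERM_YEARS:
    per_year = annualized(AGGREGATE_VALUE_USD, years)
    print(f"{years}-year terms: about ${per_year / 1e6:.1f} million per year")
```

Under that even-split assumption, the contracts work out to roughly $68 million to $102 million per year in total, which puts the reported $60 million annual Google figure in the same range as the filing's overall number.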

Cloudflare built the tollbooth

While Reddit monetized its own archive, Cloudflare went after the problem at the infrastructure layer. Because Cloudflare’s network sits in front of a significant share of global web traffic, its decisions ripple far beyond any single site.

In September 2024, the company launched a one-click toggle that lets any customer block known AI crawlers at the network edge. No config files, no custom rules. One button, and bots from OpenAI, Anthropic, and others lose access.
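
The simplest version of such a block is a check on the User-Agent header. The sketch below is an illustration only, not Cloudflare's implementation: the token list is a small hypothetical sample, and a production system would rely on more signals than a string match.

```python
# Illustrative sketch of user-agent based blocking, not Cloudflare's actual logic.
# The token list is a hypothetical sample of publicly documented crawler names.

AI_CRAWLER_TOKENS = ("gptbot", "claudebot", "ccbot")


def is_blocked(user_agent: str, block_ai: bool = True) -> bool:
    """Return True if the request should be refused as a known AI crawler."""
    if not block_ai:
        # The site owner has left the toggle off, so nothing is blocked.
        return False
    ua = user_agent.lower()
    return any(token in ua for token in AI_CRAWLER_TOKENS)


print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"))
print(is_blocked("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0"))
```

A check like this only catches crawlers that announce themselves. A bot that sends a different User-Agent string passes straight through, which is why string matching alone is a weak gate.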

Then, in March 2025, Cloudflare went further. It introduced a pay-per-crawl system that replaces blocking with billing. The mechanism uses the HTTP 402 “Payment Required” status code and cryptographically signed headers. A site owner sets a price per request. If the crawler agrees to pay, it gets the content. If it doesn’t, the gate stays shut. The design replaces the voluntary request made by robots.txt with an enforceable transaction.
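
From the crawler's side, the exchange might look something like the sketch below. It is a simplified illustration under stated assumptions, not Cloudflare's protocol: the header name `crawler-price`, the `USD 0.01` value format, and the `Decision` type are hypothetical, header keys are assumed to be lower-cased already, and the signing and payment steps are left out entirely.

```python
# Simplified crawler-side decision logic for a pay-per-crawl style gate.
# Assumptions: the header name and "USD <amount>" format are illustrative only.
from dataclasses import dataclass
from typing import Mapping, Optional

PRICE_HEADER = "crawler-price"  # hypothetical header carrying the quoted price


@dataclass
class Decision:
    action: str                        # "use_content", "pay_and_retry", or "skip"
    price_usd: Optional[float] = None  # quoted price when the crawler agrees to pay


def parse_price(value: str) -> Optional[float]:
    """Parse a value like 'USD 0.01'; return None if it is malformed."""
    parts = value.split()
    if len(parts) != 2 or parts[0].upper() != "USD":
        return None
    try:
        price = float(parts[1])
    except ValueError:
        return None
    return price if price >= 0 else None


def decide(status: int, headers: Mapping[str, str], max_price_usd: float) -> Decision:
    """Choose the next step from a response status and the crawler's budget."""
    if status == 200:
        return Decision("use_content")
    if status == 402:
        price = parse_price(headers.get(PRICE_HEADER, ""))
        if price is not None and price <= max_price_usd:
            return Decision("pay_and_retry", price)
    # Any other status, a missing price, or a price over budget: walk away.
    return Decision("skip")


print(decide(402, {"crawler-price": "USD 0.005"}, max_price_usd=0.01))
print(decide(402, {"crawler-price": "USD 0.05"}, max_price_usd=0.01))
```

The point of the design is that the decision sits with the crawler only up to its budget. Above that ceiling, the server never sends the content at all.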

Cloudflare has not released audited adoption numbers for either tool, so it is not yet clear how many sites have activated pay-per-crawl or what prices they are setting. The capability exists. Whether it is reshaping crawler behavior at scale is an open question.

They are not alone

Reddit and Cloudflare are the clearest examples, but the shift extends well beyond them. The New York Times sued OpenAI and Microsoft in December 2023, alleging that its copyrighted journalism was used to train ChatGPT without permission. The case, still working through federal court as of mid-2025, could set precedent for whether scraping copyrighted content for AI training qualifies as fair use.

Other publishers chose licensing over litigation. News Corp announced a deal with OpenAI in May 2024 that grants access to content from The Wall Street Journal, the New York Post, and other properties. The Wall Street Journal reported the deal to be worth more than $250 million over five years. Stack Overflow, the developer Q&A platform, confirmed a licensing agreement with OpenAI as well. And Automattic, the parent company of WordPress.com and Tumblr, struck data deals with both OpenAI and Midjourney, as reported by 404 Media and confirmed by the company.

The pattern is consistent: platforms that once gave away their content passively are now demanding compensation or cutting off access entirely.

Why robots.txt stopped being enough

The old system relied on trust. A website would place a robots.txt file at its root directory, politely asking crawlers not to visit certain pages. Search engines mostly honored those requests because they had reputational incentives to do so. AI crawlers, many of them new and operating under less public scrutiny, did not always follow the same norms.
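
How little robots.txt actually enforces is easy to see in code. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt contents, the example.com URLs, and the `SomeSearchBot` name are made up for illustration, while `GPTBot` is the user-agent token OpenAI documents for its crawler.

```python
# A well-behaved crawler consults robots.txt before fetching.
# The file contents and URLs here are invented for illustration.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("GPTBot", "https://example.com/article"))
print(allowed("SomeSearchBot", "https://example.com/article"))
print(allowed("SomeSearchBot", "https://example.com/private/notes"))
```

Nothing in this check runs on the website's side. The crawler performs it voluntarily, and a crawler that skips the call, or identifies itself under another name, receives the page anyway.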

Data from the licensing platform TollBit and research from the Data Provenance Initiative found that many AI bots ignored robots.txt directives or used user-agent strings that made them difficult to identify. That erosion of the honor system is a key reason Cloudflare’s enforcement tools gained traction so quickly. When voluntary signals fail, technical gates become the fallback.

What AI companies do next will decide the outcome

The gates are going up, but whether they hold depends on how AI labs respond. They have several options, none of them simple. They can pay the asking prices, which would validate data licensing as a durable business model. They can turn to synthetic data or curated datasets that avoid the web entirely, though researchers have raised concerns about model quality when training loops rely too heavily on AI-generated text. They can challenge the legal basis for these restrictions in court, arguing that scraping publicly accessible content is protected under fair use or similar doctrines. Or they can try to route around the barriers technically, using proxies, spoofed headers, or other methods to evade detection.

Each path carries risk. Paying up at the scale the largest publishers are demanding could add hundreds of millions in annual costs to companies already burning through capital. Relying on synthetic data may degrade model performance. Legal battles take years and produce uncertain outcomes. And evasion, if exposed, would invite regulatory backlash at a moment when lawmakers in both the U.S. and the EU are already scrutinizing AI data practices.

What is no longer in doubt, as of June 2025, is the direction of the trend. The era of free, frictionless web scraping for AI training is ending. Reddit proved the data has a price. Cloudflare proved the price can be enforced. The New York Times proved it can be litigated. The only remaining question is how much the new rules will cost, and who will end up paying.


*This article was researched with the help of AI, with human editors creating the final content.