The biggest sites on the web are now slamming their doors on AI crawlers — charging millions for the data that has quietly been training the world’s chatbots

For years, AI companies helped themselves to the open web. Crawlers swept through Reddit threads, news archives, and forums by the billions of pages, feeding the language models behind ChatGPT, Gemini, and their competitors. Nobody asked permission. Nobody paid a dime.

That arrangement is collapsing. By mid-2026, the web’s biggest content platforms and infrastructure providers have drawn a hard line: if you want the data, you negotiate and you pay. Reddit has turned its massive archive of user-generated conversation into a formal licensing business. Cloudflare, which routes and protects traffic for millions of websites, now blocks AI crawlers by default. And a growing list of publishers, from the Associated Press to News Corp, has signed deals worth hundreds of millions of dollars with AI labs willing to write checks rather than face lawsuits.

Together, these moves mark the end of the “public means free” era of AI training and the beginning of something that looks a lot more like a toll road.

Reddit put a price tag on the conversation

Reddit’s shift from open archive to paid data supplier became public in February 2024, when the company filed its S-1 registration statement with the U.S. Securities and Exchange Commission. That filing, available through the SEC’s EDGAR system, disclosed data licensing arrangements carrying an aggregate contract value of $203 million. It was one of the first times a major platform attached a specific, audited dollar figure to the sale of web content for AI training.

The most prominent deal behind that number became clear almost immediately. Bloomberg reported in February 2024 that Google had agreed to pay Reddit roughly $60 million per year for access to its data, a deal that gave Google’s AI models a firehose of real human conversation spanning nearly two decades of posts and comments.

Since going public, Reddit has continued to build on that foundation. The company’s subsequent earnings disclosures have shown data licensing revenue growing as a distinct line of business, separate from its core advertising income. The message to the rest of the industry was unmistakable: platform data has a price, and that price is nine figures.

Cloudflare turned the infrastructure into a gate

If Reddit’s licensing deals represent the content side of the equation, Cloudflare’s moves represent the plumbing. In 2025, Cloudflare, which sits between users and an estimated 20 percent of all websites, announced a default-deny policy for AI crawlers. Under this approach, bots built to scrape training data are blocked unless a site owner explicitly allows access or sets up a payment arrangement.

The company also introduced a product called Pay Per Crawl, initially launched in private beta, which lets publishers require payment from AI crawlers on a per-request basis. Site owners can configure each crawler individually, choosing to block it, charge it, or allow it through. One price applies to all crawlers set to “Charge,” creating a standardized toll system rather than forcing publishers to negotiate separately with every AI lab that comes knocking.
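To make the block, charge, or allow choice concrete, here is a minimal Python sketch of the decision a site's policy might make for each incoming crawler request. It is an illustration of the idea described above, not Cloudflare's implementation: the `SitePolicy` class, the crawler names, the one-cent price, and the HTTP status codes returned are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    CHARGE = "charge"
    BLOCK = "block"


@dataclass
class SitePolicy:
    """Hypothetical model of a publisher's per-crawler settings."""

    # One flat price shared by every crawler set to CHARGE.
    price_per_request_usd: float
    # Per-crawler settings; unlisted crawlers fall back to default_action.
    crawler_actions: dict[str, Action] = field(default_factory=dict)
    # Default-deny: unknown crawlers are blocked unless configured otherwise.
    default_action: Action = Action.BLOCK

    def decide(self, crawler_name: str, agrees_to_pay: bool) -> tuple[int, str]:
        """Return an illustrative (status code, explanation) pair."""
        action = self.crawler_actions.get(crawler_name, self.default_action)
        if action is Action.ALLOW:
            return 200, "served for free"
        if action is Action.CHARGE and agrees_to_pay:
            return 200, f"served, charged ${self.price_per_request_usd:.2f}"
        if action is Action.CHARGE:
            return 402, f"payment required: ${self.price_per_request_usd:.2f} per request"
        return 403, "blocked"


# Example configuration with made-up crawler names and a made-up price.
policy = SitePolicy(
    price_per_request_usd=0.01,
    crawler_actions={
        "ExampleSearchBot": Action.ALLOW,
        "ExampleTrainingBot": Action.CHARGE,
    },
)

for name, pays in [("ExampleSearchBot", False),
                   ("ExampleTrainingBot", False),
                   ("ExampleTrainingBot", True),
                   ("UnknownBot", True)]:
    print(name, pays, policy.decide(name, pays))
```

The point of the single `price_per_request_usd` field is the one the paragraph above makes: the publisher sets one price, and every crawler marked "Charge" faces the same toll.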

The practical effect is significant. Before Cloudflare’s policy change, a site owner who wanted to block AI scrapers had to identify each bot, update a robots.txt file, and hope the crawler respected it. (Many did not.) Now, Cloudflare’s network handles identification and enforcement automatically. AI developers who want data from Cloudflare-protected sites must be recognized and categorized before they can operate at scale.
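For readers unfamiliar with the older approach, the short Python sketch below shows how a well-behaved crawler would consult a robots.txt file before fetching a page, using the standard library's `urllib.robotparser`. The robots.txt contents and the example.com URL are made up for illustration, and GPTBot is used only as an example of a crawler name a site might list. Nothing in this mechanism forces compliance: the check runs on the crawler's side, which is why ignoring it was so easy.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block one named AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed robots.txt permits this agent to fetch url."""
    return parser.can_fetch(user_agent, url)


url = "https://example.com/post/1"
print("GPTBot allowed:", may_fetch("GPTBot", url))
print("Other agent allowed:", may_fetch("Mozilla/5.0", url))
```

A site owner following this approach had to add a new `User-agent` entry for every bot they wanted to exclude, which is the manual upkeep that network-level enforcement is meant to replace.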

The licensing wave goes far beyond Reddit

Reddit was early, but it was not alone. By mid-2026, a string of major content owners has struck deals with AI companies, each one reinforcing the idea that training data is a commodity with real market value.

The Associated Press signed a licensing agreement with OpenAI in July 2023, granting access to portions of its news archive. News Corp, the parent company of The Wall Street Journal, the New York Post, and other publications, reached a deal with OpenAI reportedly worth more than $250 million over five years, as reported by The Wall Street Journal in May 2024. Stack Overflow, the developer Q&A platform, licensed its data to Google. Axel Springer, the European media giant behind Politico and Business Insider, signed with OpenAI as well.

Not every content owner chose the licensing route. The New York Times sued OpenAI and Microsoft in December 2023, alleging that the companies used millions of its articles to train AI models without permission. That case, still working through the courts as of mid-2026, has become the highest-profile legal test of whether large-scale AI training on copyrighted content qualifies as fair use. Its outcome could either validate the paid licensing model or undercut it entirely.

What this means for AI companies and everyone else

The shift from free scraping to paid access creates pressure that flows in multiple directions.

For the largest AI labs, the new costs are manageable. Companies like OpenAI, Google, and Anthropic are backed by billions in funding and can absorb licensing fees as a cost of doing business. For smaller AI startups and open-source research groups, the math is harder. If the best training data sits behind paywalls and toll gates, the barrier to building competitive models rises, potentially concentrating power among the companies that can afford to pay.

For publishers and content creators, the deals represent a new revenue stream at a time when digital advertising has been under pressure and AI-generated summaries threaten to siphon traffic away from original sources. But the economics are uneven. A platform like Reddit, with billions of posts and a direct relationship with AI buyers, has leverage that a mid-sized news outlet or independent blog does not. Cloudflare’s Pay Per Crawl system could help level that playing field by giving smaller publishers a mechanism to charge, but how widely it gets adopted and at what price points remains to be seen.

For ordinary users, the downstream effects are less visible but no less real. The quality and breadth of the data that trains AI models shapes the answers those models give. If certain sources get walled off or priced out of training sets, the models could develop blind spots, skewing toward the content that was cheapest or easiest to license rather than the most accurate or representative.

The questions that haven’t been answered yet

Several critical details remain unresolved. Reddit’s SEC filings disclosed the $203 million aggregate figure but did not break out individual deal terms, counterparty names beyond what was publicly reported, or renewal conditions. Whether that total reflects a handful of large contracts or a broader set of smaller agreements is not fully clear from the public record.

Cloudflare has not released adoption data for Pay Per Crawl: how many publishers have activated it, what prices they have set, or how many AI crawlers have been blocked or charged since the default-deny policy took effect. Whether major AI developers will simply pay the tolls, route around Cloudflare-protected sites, or shift to alternative datasets is an open strategic question.

The legal landscape is equally unsettled. Beyond the Times lawsuit, courts in the U.S. and Europe are weighing multiple cases that touch on AI training, copyright, and consent. The EU’s AI Act, which began phasing in during 2024, includes transparency requirements around training data that could further push AI companies toward documented, licensed sources. If courts or regulators ultimately endorse broad training rights over publicly accessible content, platforms and infrastructure providers might face pressure to justify their tolls. If rulings go the other way, the paid access model hardens into the new default.

What is already clear is the direction of travel. The assumption that the open web was an all-you-can-eat buffet for AI training has been replaced by a system of gates, tolls, and contracts. The data that powers the world’s chatbots is no longer an unmetered resource, and the fight over who controls it, who profits from it, and who gets locked out is just getting started.

*This article was researched with the help of AI, with human editors creating the final content.
