The biggest sites on the web are slamming their doors on AI crawlers — charging millions for the data that has quietly been training the world’s chatbots

For years, AI companies helped themselves to the open web. Crawlers from OpenAI, Google, Anthropic, and others swept through news sites, forums, and social platforms, vacuuming up text that would become the training fuel for a generation of chatbots. The content creators who produced that material were rarely asked and never paid.

That era is ending, and the price tags are now public. Reddit disclosed in its 2024 IPO registration statement that it had signed data licensing contracts worth $203 million in aggregate, converting billions of user-generated posts into a commercial product sold directly to AI labs. Google alone agreed to pay roughly $60 million for access to Reddit’s archive, according to reporting by The Associated Press. By mid-2026, Reddit is far from alone. A wave of publishers, platforms, and media companies have either locked their doors to AI crawlers or started demanding payment to open them.

Reddit put a price on the open web

Reddit’s data licensing deals, entered into in January 2024 and disclosed in the company’s S-1 prospectus filed the following month, represent the clearest public benchmark for what a single platform can charge when it decides AI companies must pay for training data. The $203 million figure is not an estimate. It is a contractual value reviewed by Reddit’s auditors and legal counsel before submission to the SEC, a primary source subject to federal disclosure rules and potential liability for misstatement.

The Google contract, worth approximately $60 million, accounts for a significant share of that total. The identities of Reddit’s other licensees and the per-deal breakdown have not been publicly confirmed. What is clear is that the arrangement formalized a relationship that had previously operated through automated crawlers pulling publicly available pages without permission or payment.

The timing was strategic. Reddit structured these deals as it prepared to go public, and data licensing featured prominently in the company’s revenue narrative for investors. In its subsequent public filings as a listed company, Reddit has continued to report data licensing as a material and growing revenue line, reinforcing that this was not a one-off maneuver but an ongoing business.

The robots.txt rebellion

Reddit’s licensing deals grabbed headlines, but a quieter revolt was already spreading across the web. Starting in mid-2023 and accelerating through 2024 and 2025, thousands of websites updated their robots.txt files to block AI-specific crawlers like OpenAI’s GPTBot, Anthropic’s ClaudeBot, and others. Robots.txt is a decades-old protocol that tells automated bots which parts of a site they may or may not access. It operates on an honor system, but major AI companies have publicly committed to respecting it.

The scale of the blocking has been significant. Research by Originality.ai, which tracked robots.txt changes across the top 1,000 websites, found that a large and growing share of major sites had added restrictions targeting AI crawlers by late 2024. News publishers were among the most aggressive. Organizations including The New York Times, The Washington Post, Reuters, and hundreds of smaller outlets moved to block or restrict AI bots, signaling that the industry’s default posture had shifted from open access to closed doors.

Blocking crawlers is a defensive move, not a revenue strategy. But it sets the stage for negotiation. A publisher that blocks GPTBot can then offer to unblock it, for a price, through a licensing agreement. That dynamic has turned robots.txt into a bargaining chip and transformed what was once a technical courtesy into a commercial lever.

Major publishers are cutting deals

The licensing market that Reddit helped pioneer has expanded rapidly. In May 2024, News Corp announced a multi-year content licensing agreement with OpenAI reportedly valued at more than $250 million over five years, covering publications including The Wall Street Journal, the New York Post, and several Australian outlets, as reported by Reuters. The Associated Press signed its own licensing deal with OpenAI in 2023, one of the earliest such agreements between a major news organization and an AI company.

Dotdash Meredith, the publisher behind brands like People, Better Homes & Gardens, and Investopedia, struck a deal with OpenAI in 2024 to license its content. Axel Springer, the German media giant that owns Politico and Business Insider, reached a similar agreement. In each case, the pattern is the same: a publisher with a large archive of professionally produced content agrees to provide structured access to an AI company in exchange for payment and, in some cases, attribution or traffic-sharing arrangements.

Not every publisher has chosen to negotiate. The New York Times filed a landmark copyright infringement lawsuit against OpenAI and Microsoft in December 2023, alleging that the companies used millions of Times articles without permission to train their AI models. That case, still working through federal court as of mid-2026, could reshape the legal framework for AI training data and influence whether future deals are struck voluntarily or mandated by courts.

The user consent problem

Reddit’s situation highlights a tension that licensing deals alone cannot resolve. The platform’s content is generated almost entirely by unpaid contributors who posted under terms of service granting Reddit broad rights to commercialize their submissions. Those terms were written long before large-scale AI licensing was a realistic business model, and whether they adequately cover the sale of user posts to train chatbots is a question that has drawn scrutiny from privacy advocates and legal scholars.

Reddit users were not consulted before the deals were signed, and they do not share in the revenue. The same dynamic applies, in different forms, across social platforms, forums, and any site where the value of the archive depends on contributions from people who were never told their words might train a machine learning model. For news publishers, the copyright question is more straightforward: the organization owns the articles and can license them. For platforms built on user-generated content, the moral and legal ground is murkier.

The European Union’s AI Act, which began phased implementation in 2024, includes transparency requirements around training data that could force AI companies to disclose more about what they use and where it comes from. Whether those provisions will give individual contributors meaningful leverage, or simply add a compliance layer that large platforms and AI labs navigate without changing their practices, remains to be seen.

What the market looks like now

By mid-2026, the market for AI training data has moved well past the experimental phase. Multiple deals worth tens or hundreds of millions of dollars have been signed. Major publishers have established licensing teams or hired outside negotiators to handle AI-related inquiries. AI companies, for their part, have built out content partnerships divisions and allocated significant budgets for data acquisition, a cost of doing business that barely existed three years ago.

But the market is far from settled. Pricing remains opaque, with deal values varying widely depending on the size of the archive, the quality and recency of the content, and the negotiating leverage of each party. Some AI companies are investing heavily in synthetic data, generated by models themselves rather than scraped from the web, as a way to reduce dependence on licensed content. Others are exploring collective licensing frameworks, similar to those used in the music industry, that would allow AI labs to pay into a pool distributed across many publishers.

The legal landscape is also in flux. Beyond the Times lawsuit, several other copyright cases involving AI training data are moving through U.S. courts, and their outcomes could either validate the licensing model or upend it. If courts rule that training on copyrighted material constitutes fair use, the incentive for AI companies to pay for data could weaken considerably. If they rule the other way, licensing could become not just common but legally required.

Where the leverage actually sits

Reddit’s $203 million in contracts proved something that was previously theoretical: platforms and publishers can convert free crawler access into substantial revenue. But the durability of that leverage depends on factors that are still unfolding. If AI models increasingly rely on synthetic data or on sources that remain freely available, the bargaining power of content owners could erode. If regulation tightens or courts side with copyright holders, it could strengthen.

For now, the clearest takeaway is structural. The relationship between the companies that create and host content and the companies that build AI systems has been permanently renegotiated. What was once an unspoken arrangement, where crawlers took what they wanted and publishers either did not notice or did not object, has been replaced by explicit commercial terms. The doors are not all closed, but they are no longer unlocked. And the price of entry is climbing.

More from Morning Overview

*This article was researched with the help of AI, with human editors creating the final content.

IG

FB

PIN

LI

X