Reddit and the web’s biggest sites are slamming their doors on AI crawlers — charging millions for the data that quietly trains the world’s chatbots

For years, AI companies treated the open web like an all-you-can-eat buffet. They sent automated crawlers across forums, news sites, and social platforms, vacuuming up billions of words to train the large language models behind ChatGPT, Claude, Gemini, and their competitors. The hosts of that content rarely objected, and almost never got paid.

That era is ending. Reddit has sued AI developer Anthropic in federal court, alleging unauthorized scraping of its user-generated posts. The platform simultaneously holds a roughly $60 million licensing deal with Google that grants paid access to the same archive. And Reddit is far from alone: across the web, publishers and platforms are blocking AI crawlers, demanding licensing fees, or heading to court. The message is uniform, even if the tactics vary. If you want our data, pay for it.

Reddit draws the sharpest line

The lawsuit, Reddit, Inc. v. Anthropic PBC, was filed in the United States District Court for the Northern District of California, as confirmed by the official court docket. The case is in its early procedural stages, and no ruling has been issued. The specific legal theories Reddit is pressing, whether copyright infringement, breach of terms of service, or some combination, have not been detailed in publicly available filings as of June 2026.

On the commercial side, Reddit took a very different approach with Google. The platform struck a deal valued at roughly $60 million, according to reporting by the Associated Press. Under that agreement, Google can train AI models on Reddit posts and use the data to improve its products. Reddit, in turn, gains access to some of Google’s AI tools. The deal was disclosed alongside Reddit’s 2024 initial public offering, signaling that data licensing had become central to the company’s revenue pitch to investors.

The two moves are not contradictory. They are complementary. Reddit is not opposed to AI companies training on its content. It is opposed to AI companies training on its content for free. The distinction reframes Reddit’s 18-year archive of human conversation not as an open resource, but as a commercial product with a price tag.

A pattern spreading across the web

Reddit’s strategy fits a much larger shift. Across the internet, major content hosts have moved from passive tolerance of AI crawling to active resistance.

The New York Times sued OpenAI and Microsoft in December 2023, alleging that millions of its articles were used without permission to train GPT models. That case remains active and has become a bellwether for how courts will treat AI training under copyright law. Condé Nast, publisher of Wired, Vogue, and The New Yorker, followed with its own lawsuit against OpenAI in 2024. The Associated Press took a licensing route instead, signing a deal with OpenAI to provide access to its news archive.

Stack Overflow, the question-and-answer site that is a cornerstone of software development, struck its own licensing agreement with Google, monetizing the millions of technical answers its community had contributed over more than a decade. That deal drew sharp criticism from Stack Overflow’s volunteer contributors, many of whom felt their unpaid labor was being sold out from under them.

Beyond lawsuits and deals, the most widespread pushback has been technical. Hundreds of websites have updated their robots.txt files, the simple text documents that tell automated crawlers which pages they may and may not access, to block AI-specific bots like OpenAI’s GPTBot, Anthropic’s ClaudeBot, and others. A 2024 study by the Data Provenance Initiative, a research group tracking AI training data, found that more than 25% of high-quality web sources commonly used in AI training datasets had restricted crawler access. The trend has only accelerated since.

What courts have not yet settled

No definitive federal ruling has established whether large-scale AI training on publicly posted content is legal under U.S. copyright law. The question sits at the intersection of fair use doctrine, terms-of-service enforcement, and novel questions about how transformative a use must be to escape infringement claims.

The Reddit v. Anthropic case could contribute to that legal record, but it is too early to predict whether it will reach a decision on the merits or end in a settlement. Anthropic has not issued a detailed public response to the lawsuit based on available reporting as of June 2026. Whether the company will argue fair use, contest the factual allegations, or seek to resolve the matter privately remains unknown.

The Reddit-Google deal carries its own open questions. The full contract terms, including what restrictions govern how Google can use the data, how long the agreement lasts, and whether Reddit retains any control over outputs generated from its content, have not been publicly disclosed. The $60 million figure is the only confirmed financial metric.

A separate and thorny issue is user consent. Reddit users created the posts at the center of these disputes voluntarily, often under pseudonyms, and many did so expecting their words would be read by other humans, not ingested into corporate AI models. Whether Reddit updated its terms of service before packaging that content for sale, and whether users were given meaningful notice, has not been fully clarified. The question of who actually owns the value in a platform’s archive, the company or the millions of people who filled it, has no clean legal answer yet.

Rising costs and a narrowing field

The practical effect of all this for AI developers is straightforward: training data is getting expensive. Companies that once built datasets by crawling the open web now face a choice between negotiating costly licensing deals or risking litigation. At $60 million for a single platform’s archive, the price of assembling a competitive training corpus is climbing fast.

That cost structure favors the largest players. Google, OpenAI, and a handful of other well-capitalized firms can absorb tens of millions in licensing fees. Smaller AI startups and academic research labs, which historically relied on freely available web data, face a much harder path. The risk is a concentration of advantage: the richest companies get the best data, build the best models, and pull further ahead, while open-source and academic AI efforts lose access to the raw material they need.

For publishers and platforms, the calculus is different but equally consequential. Data licensing represents a new and potentially significant revenue stream at a time when advertising models are under pressure and subscription growth has plateaued for many outlets. Reddit’s $60 million deal, disclosed during its IPO, was explicitly framed as a growth story. Other platforms are watching closely to see whether the market for AI training data matures into something durable or proves to be a one-time windfall.

The web’s open commons is closing

What is unfolding is not just a series of individual disputes. It is a structural transformation of how the internet’s content is valued and controlled. For two decades, the implicit bargain of the web was that content posted publicly could be read, indexed, and linked to by anyone. Search engines built trillion-dollar businesses on that bargain, and users accepted it because search drove traffic back to the source.

AI training breaks that loop. When a language model ingests a Reddit thread or a news article, it does not send readers back to the original page. It absorbs the information, synthesizes it, and serves it up in a chatbot response with no link, no attribution, and no ad impression for the publisher. That asymmetry is what has turned passive tolerance into active resistance.

The lawsuits, the licensing deals, and the robots.txt blocks are all different expressions of the same underlying demand: if AI companies are going to extract value from the web’s content, the people and organizations that created it want a share. How large that share will be, and who will be forced to pay it, are questions that courts, corporate negotiations, and perhaps regulators will be sorting out for years to come. But the direction is already clear. The doors are closing, and the price of entry is going up.

More from Morning Overview

*This article was researched with the help of AI, with human editors creating the final content.

IG

FB

PIN

LI

X

Reddit and the web’s biggest sites are slamming their doors on AI crawlers — charging millions for the data that quietly trains the world’s chatbots

Reddit draws the sharpest line

A pattern spreading across the web

What courts have not yet settled

Rising costs and a narrowing field

The web’s open commons is closing

Author

Get weekly updates with the latest news and tips!

More in AI

IG

FB

PIN

LI

X