
In the race to feed artificial intelligence systems with ever more data, one startup has quietly turned to a very old medium: used print books. Instead of treating those volumes as artifacts to be preserved, it has treated them as raw material to be scanned, shredded, and converted into training data for large language models. The result is a collision between the culture of rare book rooms and the growth-at-all-costs logic of the AI industry.
At the center of this story is Anthropic, the company behind the Claude chatbot, which has spent millions of dollars buying and then destroying physical books to build its models. The company’s approach has forced librarians, authors, and technologists to confront a blunt question: when knowledge is locked in paper, who gets to decide whether the future lies in careful preservation or in industrial-scale digitization and disposal?
The quiet logistics of turning paper into training data
The basic workflow is deceptively simple. Anthropic and its partners bought large quantities of used print books, often in bulk, then ran them through high-speed scanning operations before sending the physical copies to be pulped. According to one detailed account, Anthropic spent millions of dollars to purchase millions of print books, often in used condition, and relied on service providers to digitize them without seeking compensation from, or the explicit consent of, the authors whose work was being ingested. The physical books were treated as a consumable input, not a cultural resource to be stewarded.
Earlier in 2024, executives at Anthropic explored even more aggressive ways to scale this pipeline. Internal proposals described in one report show the company considering deals with major booksellers to access their inventory for scanning and destruction. At one point, Anthropic executives approached the New York bookstore Strand with a plan that would have involved scanning and then disposing of large numbers of volumes. Reached for comment, a Strand spokesperson stressed that the store was still evaluating how to respond, saying only that “we are working on this.” Those discussions, detailed in an account of Anthropic’s outreach to Strand, underscored how the company was willing to treat entire retail inventories as potential fuel for its models.
Why millions of books ended up in the shredder
From Anthropic’s perspective, the logic behind this strategy is straightforward. Large language models like Claude need vast, diverse text corpora, and print books remain one of the richest, most carefully edited sources of long-form writing. By buying used copies, the company could sidestep licensing negotiations and simply own the physical objects outright. One legal analysis notes that Anthropic’s service providers scanned these books and then destroyed them, effectively converting them into a proprietary dataset that could be reused indefinitely. The same analysis emphasizes that this approach allowed Anthropic to ingest millions of works without paying authors or publishers anything beyond the secondhand purchase price, a practice that critics argue exploits gaps in copyright law even as Anthropic insists it is lawful.
The physical destruction has become a symbol of that tension. A widely cited report described how Anthropic shredded millions of books to train its AI, turning entire pallets of volumes into confetti once their pages had been captured. The account framed the shredding as an almost too on-the-nose metaphor for an industry willing to grind up the physical record of human knowledge in order to create digital systems that can mimic it. In that telling, the shredders are not just a practical step in a supply chain; they are a visual shorthand for how AI companies are willing to erase the material traces of culture in pursuit of scale.
Scanning without shredding, and what libraries already knew
What makes Anthropic’s approach so jarring is that it is not the only way to digitize books at scale. Libraries and archives have spent years refining non-destructive methods that preserve the physical object while still creating high-quality digital copies. The Internet Archive, for example, developed a system known as the Table Top Scribe that allows operators to capture page images without cutting spines or damaging bindings. In one description of this work, the Archive explains how special book collections can come online through careful scanning that keeps rare volumes intact and available to future readers.
That contrast matters because it shows that the choice to shred is not technologically inevitable; it is a business decision. Non-destructive scanning is slower and more expensive per book, and it requires a preservation mindset that treats each volume as more than a temporary container for text. By opting for industrial-scale destruction, Anthropic signaled that it valued speed and cost efficiency over the long-term survival of the physical record. Critics point out that this is especially troubling when the books involved include out-of-print titles or niche works that may not exist in many other copies, a concern echoed in reporting that notes how buying used physical copies allowed Anthropic to sidestep licensing while ensuring that the specific copies it acquired would never “live another day” on a shelf.
The legal green light, and why authors are still alarmed
For now, Anthropic’s strategy has received a significant boost from the courts. In a closely watched case, a federal judge concluded that Anthropic’s training on books is protected as fair use, at least under the specific facts presented. The ruling emphasized that the company bought physical books, scanned them, and used the resulting text to train models rather than to sell direct copies, and that this kind of intermediate copying could be lawful. The decision, summarized in a report on how Anthropic’s training on books was treated as fair use, has left many authors more worried than ever, because it suggests that companies can buy their works secondhand and then repurpose them for AI without permission.
Legal scholars caution that this is not the final word. The same analysis that described Anthropic’s spending on used books also stressed that the court victories for Anthropic and Meta leave “ample room for future defeats,” particularly if plaintiffs can show that AI outputs substitute for the original works or that the copying harms specific markets. Authors’ groups argue that even if the scanning and shredding pipeline is technically lawful today, it undermines the economic incentives that copyright is supposed to protect. They point to the fact that authors received neither compensation nor explicit consent requests when their books were turned into training data, and they warn that this model could become a template for other AI firms unless lawmakers intervene.
Inside Claude’s appetite for books, and what comes next
To understand why Anthropic was willing to go this far, it helps to look at how central books are to Claude’s capabilities. One detailed account of the company’s data strategy explains how Claude effectively “clawed through” millions of books to learn patterns of narrative, argument, and style that are hard to capture from web text alone. The same report notes that this book-heavy corpus is part of what allows Claude to handle long-form reasoning and complex instructions, and it frames Anthropic’s investment in used volumes as a deliberate bet that curated print content would yield better models. In that sense, the shredders are the back end of a system that users encounter only as a polished chatbot, a connection made explicit in the description of how Claude AI came to reach its current performance.