
OpenAI is fighting a high-stakes legal battle over how it trained its flagship models, and at the center of the storm is a vanished library of pirated books whose fate the company now seems determined not to explain. Its refusal to fully account for how those datasets were created, used, and then destroyed is turning a copyright lawsuit into a broader test of how far an AI giant can go in hiding its training pipeline.
What began as a dispute over authors’ rights has morphed into a confrontation over transparency, potential evidence destruction, and the future rules for data-hungry AI systems. The more OpenAI resists disclosing what happened to its book datasets, the more judges and authors are treating the missing files as a window into the company’s culture, its risk tolerance, and its view of the law.
The missing library at the heart of OpenAI’s copyright fight
The core allegation is stark: OpenAI is accused of building powerful language models on top of a massive library of pirated books, then quietly erasing that material once lawsuits and public scrutiny closed in. According to detailed accounts, the company relied on a large book-focused dataset of copyrighted works that had been scraped from piracy sites, a shortcut that gave its models deep familiarity with contemporary fiction and nonfiction without paying the authors who wrote them.
That same dataset is now missing, deleted by OpenAI in what the company describes as a routine decision but what plaintiffs portray as a calculated move to wipe away evidence. The disappearance has turned discovery into a pitched battle, with authors and their lawyers arguing that the erased files could have shown exactly how extensively OpenAI leaned on pirated texts and how central that material was to its research methods.
How authors pushed the dispute into court
The fight over the deleted books did not emerge in a vacuum. It grew out of a broader wave of litigation by writers who say their livelihoods are being undercut by AI systems trained on their work without permission. The Authors Guild, a key trade group for writers, has taken a leading role by suing OpenAI and accusing it of illegally using copyrighted books to train its models, a campaign that has been chronicled in detail by reporters such as Darius Rafieyan.
In court filings, the Authors Guild and individual writers argue that OpenAI’s models can summarize, imitate, or even reproduce their books because those works were ingested wholesale into training datasets. They say the company’s decision to delete key training data, combined with the fact that the staff who collected it are no longer at the company, has made it harder to trace exactly which titles were used and how, a gap that now sits at the center of their copyright claims.
OpenAI’s shifting story about why the datasets vanished
OpenAI’s explanation for the missing library has evolved as judges and plaintiffs have pressed for details. In an unsealed letter, the company told the court that certain AI training datasets were removed in 2022 due to “non-use,” presenting the deletion as a mundane housekeeping step rather than a response to legal risk. That account, which surfaced in the same disclosures that noted the staff who collected the data are gone, is now a key point of contention in the authors’ case over destroyed AI training datasets.
Judges have grown skeptical of OpenAI’s insistence that the deletion was routine and that internal discussions about it are shielded from scrutiny. One ruling found that the company could not maintain that it did not willfully infringe authors’ works while simultaneously blocking discovery into communications about the erasure, rejecting OpenAI’s attempt to treat those conversations as privileged and ordering it to disclose its internal discussions about the deletion of the pirated books.
A discovery battle that could reshape OpenAI’s defense
The legal fight over the deleted datasets has become a critical front in discovery, with OpenAI losing key motions that it had hoped would keep its internal deliberations sealed. In a closely watched ruling, judges rejected the company’s attempt to declare all evidence about the erasure privileged, a setback that forces OpenAI to turn over more material about who ordered the deletion, when, and why.
Those rulings do more than open a window into OpenAI’s internal Slack channels and email threads. They also raise the stakes for the company’s legal strategy, because any evidence that executives knew the datasets contained pirated books and deleted them anyway could feed into claims of willful infringement or even intentional destruction of evidence. That is why OpenAI’s resistance to explaining the deletion is now seen not just as a PR problem but as a potential turning point in the entire case.
The financial stakes: billions on the line
Behind the procedural skirmishes over privilege and discovery lies a staggering financial risk. If the authors can show that OpenAI willfully infringed their copyrights, the company could face statutory damages of up to $150,000 per copyrighted work. That figure multiplies quickly: at the statutory maximum, infringement across just 10,000 titles would translate into $1.5 billion in damages. The prospect that internal communications could reveal willful infringement makes every message about the deleted datasets potentially explosive.
Judges are also weighing separate penalties tied to OpenAI’s conduct in discovery itself. One detailed legal analysis notes that the court could impose monetary penalties, limit OpenAI’s defenses, or even enter a default judgment in the plaintiffs’ favor if it concludes that the company mishandled evidence or abused privilege claims, a menu of sanctions that underscores how the privilege fight could leave OpenAI exposed to billions in liability.
Privilege, Slack messages, and the line between caution and cover-up
At the heart of the privilege dispute is a simple but consequential question: when do internal discussions about risky data cross the line from legal consultation into business strategy that must be shared in discovery? OpenAI initially argued that its conversations about deleting the pirated book datasets were protected by attorney-client privilege, but judges have increasingly rejected that sweeping claim, ordering the company to hand over internal Slack messages and other communications that touch on the decision to erase the training data.
Those messages matter because they could reveal whether OpenAI saw the pirated books as a legal time bomb and chose to delete them to reduce liability, a move that plaintiffs say could be construed as intentional destruction of evidence. The same reporting notes that both Anthropic and OpenAI have been accused of training their AI models on copyrighted materials uploaded to a piracy site, and that internal Slack messages about what to do with those datasets are now central to arguments over whether the companies acted in good faith or engaged in a quiet cover-up.
Courts are narrowing AI’s room to maneuver on “fair use”
OpenAI’s reluctance to explain its book datasets also reflects a shifting legal landscape around AI training and fair use. For years, tech companies argued that ingesting copyrighted works to train models was automatically transformative and therefore protected, but recent rulings have started to chip away at that assumption. In one influential decision, Judge Alsup wrote that a company cannot simply “steal a work you could otherwise buy (a book, millions of books)” on the theory that the copy will be used for training, a line that directly challenges the idea that scraping entire libraries is automatically lawful.
That reasoning leaves ample room for future defeats for AI companies that built their models on unlicensed content, particularly when the works at issue are books that can be purchased and that form the core of authors’ livelihoods. It also helps explain why OpenAI might be wary of putting a clear narrative on the record about how it assembled and used its book-focused dataset, since a detailed account could be measured directly against emerging judicial skepticism about treating mass copying as a harmless exercise of training rights.
OpenAI is not alone: a broader industry pattern around pirated books
While OpenAI is currently the most prominent defendant in the authors’ suits, it is not the only tech giant facing questions about pirated books in its training data. Both Anthropic and OpenAI have been accused of relying on copyrighted materials uploaded to piracy sites, a pattern that suggests the industry treated those shadow libraries as a convenient shortcut rather than a legal red line, as detailed in reporting on how both companies approached their training corpora.
The same dynamic is now surfacing in other corners of Big Tech. A proposed class action filed by authors Grady Hendrix and Jennifer Roberson accuses Apple of using their copyrighted books to train Apple Intelligence, alleging that Apple trained its AI on pirated copies of their works and that the system can reproduce the books’ content. That case underscores that the question of pirated book datasets is no longer confined to one company or one model but is instead becoming a defining test of how the entire AI sector sources its data.
Why OpenAI’s silence on its datasets matters beyond this case
OpenAI’s resistance to fully explaining its scrapped book datasets is not just a tactical choice in a single lawsuit. It is also a signal about how the company views transparency in AI development at a moment when regulators, courts, and the public are demanding more clarity about what goes into these systems. By fighting so hard to keep its internal discussions sealed and its data pipeline opaque, OpenAI is effectively betting that it can preserve its competitive edge and avoid setting precedents that might constrain future models, even if that posture fuels suspicion among authors and judges.
The risk is that this strategy backfires, turning a dispute over one deleted library into a broader crisis of trust. Detailed reporting has already framed OpenAI as “desperate to avoid explaining” why it deleted the pirated book datasets, highlighting how that stance could lead to increased fines and higher stakes in court as judges weigh whether the company’s conduct reflects good-faith caution or a pattern of evasion. However the case ends, the unanswered questions around those vanished files are already reshaping how courts, competitors, and creators think about the hidden foundations of generative AI.