
News publishers have forced a rare look inside the black box of generative AI, winning access to 20 million anonymized ChatGPT logs in their copyright fight with OpenAI. The haul is already significant, but the same groups are now pressing judges to compel even deeper disclosure, including logs OpenAI says it has already deleted. At stake is not only who gets paid for the news that trains artificial intelligence, but also how much transparency courts can demand from one of the most powerful tech companies in the world.
The battle over these logs is rapidly becoming a test case for how far copyright plaintiffs can push discovery in AI litigation. It is also a stress test for OpenAI’s promises about user privacy and data retention, as judges weigh whether the public interest in understanding alleged infringement outweighs the company’s objections.
The legal fight that pried open 20 million ChatGPT logs
The current disclosure order grew out of a consolidated copyright lawsuit brought by major news publishers who say OpenAI built its products on their journalism without permission. They argued that only a massive sample of real user interactions could show whether ChatGPT reproduced their articles in a way that proves copying, so they pushed for 20 million anonymized logs that capture prompts and outputs tied to the disputed content. Earlier coverage described how a federal judge directed OpenAI to provide millions of anonymized ChatGPT user logs to help resolve copyright disputes involving AI companies, setting the stage for the larger production that followed.
Those plaintiffs framed the logs as the only practical way to test whether the system spits out news stories almost verbatim when prompted in certain ways. Reports on the discovery fight note that the news outlets insisted the logs were necessary to see if ChatGPT generated infringing outputs that closely tracked their copyrighted works, particularly when users asked for summaries or full articles. That argument persuaded the court that a broad dataset of conversations was proportionate to the stakes, even as OpenAI warned that such a demand would be burdensome and intrusive.
Judge Stein, Magistrate Judge Wang, and a rare discovery loss for OpenAI
The turning point came when U.S. District Judge Stein upheld an earlier order from Magistrate Judge Wang that OpenAI must hand over the 20 million logs. Judge Stein agreed that the plaintiffs in the multidistrict litigation needed access to the conversations to test their claims, and he rejected OpenAI’s attempt to narrow or overturn the ruling. In a detailed account of the decision, one report explains that Judge Stein upheld Magistrate Judge Wang’s directive that OpenAI produce 20 million ChatGPT chat logs to the plaintiffs in the MDL, cementing the scale of the disclosure.
OpenAI had objected that the order went too far, arguing that it would expose sensitive information and disrupt its operations, but the district judge was not persuaded. Another analysis of the ruling notes that Judge Stein concluded OpenAI’s objections about user privacy and the impact on its business did not outweigh the need for evidence, and that the company had not shown the production would be impossible. By affirming Magistrate Judge Wang’s approach, the court signaled that AI developers cannot easily shield large swaths of usage data from discovery when they are accused of systematic copyright violations.
How OpenAI tried to block the logs, and why the court said no
Before losing on appeal, OpenAI mounted an aggressive campaign to avoid turning over the logs, leaning heavily on privacy and legal privilege arguments. The company cited a Second Circuit wiretapping case to claim that Judge Wang had not properly weighed the privacy interests of users whose conversations would be swept into the production. According to a detailed summary of the objections, OpenAI argued that the discovery rulings failed to account for the implications of the Second Circuit precedent and that Judge Wang did not provide adequate safeguards for the anonymized data.
Those arguments did not carry the day. In a separate account of the same dispute, coverage notes that OpenAI’s objections were rejected after the court concluded that the anonymization protocols and protective orders were sufficient to address privacy concerns. The judge also disagreed with OpenAI’s claim that the logs were irrelevant or disproportionate, siding instead with the plaintiffs’ view that only a large sample could reveal how often ChatGPT generated infringing outputs in response to user prompts. The result left OpenAI facing a sweeping production order that it had fought at every step.
News groups’ next demand: deleted logs and possible sanctions
Winning access to 20 million logs has not satisfied the news organizations, which now argue that the dataset is only a starting point. They are pressing the court to require OpenAI to retrieve millions of additional ChatGPT logs that the company says were deleted under its retention policies. One detailed report on the latest filings describes how the news organizations, having won the fight to access the 20 million ChatGPT logs, now want OpenAI to dig up millions of deleted conversations, raising the possibility of sanctions for “mass deletion” if the court finds that relevant evidence was destroyed after litigation was foreseeable.
The plaintiffs’ theory is that if OpenAI continued to purge logs after it knew it faced copyright claims, that could amount to spoliation of evidence. A related summary of the same dispute notes that it appears OpenAI has lost its fight to keep news organizations from digging through the logs and now faces questions about whether any mass deletion of conversations might have erased infringing outputs in the sample. That line of attack raises the stakes beyond simple discovery, potentially exposing the company to penalties if judges conclude it failed to preserve data that should have been kept once the lawsuits were on the horizon.
From 20 million to 120 million: why publishers say the sample is too small
Even as they push for deleted logs, the publishers are also arguing that 20 million conversations are not enough to capture the full scope of alleged copying. Earlier in the litigation, OpenAI had offered 20 million user chats as evidence in the ChatGPT lawsuit, but The New York Times countered that it wanted 120 million conversations to properly test how often its stories were reproduced. A detailed account from Nieman Lab explains that OpenAI offered the 20 million user chats while The New York Times wants 120 million, underscoring how far apart the sides are on what counts as a representative sample.
That gap reflects a deeper disagreement about how generative AI should be audited. The publishers argue that a smaller dataset risks missing many instances where ChatGPT may have output copyrighted news content almost verbatim, especially if those prompts are rare or tied to specific phrasing. OpenAI, by contrast, has insisted that 20 million anonymized logs are already a massive and burdensome production, and that expanding the scope to something like 120 million conversations would be disproportionate. The court’s willingness to entertain further expansion will signal how much empirical scrutiny AI developers must accept when they are accused of training on protected material without permission.
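The dispute is ultimately statistical. As a rough illustration of the publishers’ sampling argument, the short Python sketch below assumes a purely hypothetical incidence rate of one near-verbatim output per ten million conversations; nothing in the filings establishes a real rate. Under that assumption, a 20 million conversation sample would be expected to contain only about two such outputs, while a 120 million conversation sample would reliably surface about a dozen.

```python
# Back-of-the-envelope illustration of the sampling dispute.
# The incidence rate below is an assumption for illustration only;
# the litigation does not establish an actual rate.
rate = 1e-7  # assumed: 1 near-verbatim output per 10 million conversations

for sample_size in (20_000_000, 120_000_000):
    expected_hits = rate * sample_size
    # Probability the sample contains at least one such conversation.
    p_at_least_one = 1 - (1 - rate) ** sample_size
    print(f"{sample_size:>11,} logs: ~{expected_hits:.0f} expected hits, "
          f"P(>=1) = {p_at_least_one:.3f}")
```

The point of the illustration is only that, for rare events, the difference between the two sample sizes is the difference between a handful of anecdotes and a statistically meaningful pattern.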
The core allegation: Lawsuit Claims OpenAI Misused Times Stories
At the heart of the case is a simple but explosive claim: that OpenAI built its models by ingesting and reproducing news content without authorization. The central complaint, often summarized in headlines as “Lawsuit Claims OpenAI Misused Times Stories,” accuses the company of using The New York Times’ work to train AI systems and then allowing ChatGPT to output that work in ways that substitute for the original articles. The publishers argue that this behavior goes beyond transformative use and instead amounts to direct competition with their subscription and licensing businesses.
Reports on the litigation describe how the news outlets say ChatGPT can generate detailed summaries and even near-verbatim passages from their stories when prompted in particular ways. One account notes that the plaintiffs contend OpenAI’s systems were trained on their copyrighted news content without permission and that the resulting outputs can reproduce that material almost verbatim, especially when users ask for specific articles or topics. If the logs confirm that pattern at scale, it would strengthen the argument that the AI products are not just inspired by the news but are, in effect, repackaging it.
What the 20 million logs actually contain
The 20 million logs that OpenAI has been ordered to produce are anonymized records of user interactions, but they still promise an unusually granular look at how ChatGPT behaves in the wild. According to detailed coverage of the order, the logs are expected to include prompts, outputs, and metadata that can be filtered to identify conversations where users requested news content or specific articles from the plaintiff publishers. One report explains that OpenAI must hand over ChatGPT user logs representing 20 million conversations so that news publishers can test whether the system reproduced their material almost verbatim.
Another account of the same ruling notes that OpenAI was ordered to share 20 million ChatGPT user logs with news publishers, with the explicit goal of letting them search for instances where the chatbot generated copyrighted news content without permission. The anonymization is meant to strip out personal identifiers, but the substance of the conversations remains, giving the plaintiffs a vast corpus to mine for examples of alleged infringement. For OpenAI, the production is a double risk: it could reveal problematic outputs that bolster the lawsuit, and it could expose how users actually rely on the system in ways the company has not previously disclosed.
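For a sense of what that mining might look like in practice, here is a minimal Python sketch. The record schema, field names, shingle size, and threshold are all assumptions for illustration; the court filings describe the production only as anonymized conversations with prompts, outputs, and metadata, not its actual format.

```python
# Minimal sketch of how a log production like this could be mined for
# near-verbatim reproduction. The "prompt"/"output" record schema and
# the 8-word shingle size are assumptions, not OpenAI's actual format.

def ngrams(text: str, n: int = 8) -> set:
    """Break text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(output: str, article: str, n: int = 8) -> float:
    """Fraction of the article's n-grams that reappear verbatim in the output."""
    article_grams = ngrams(article, n)
    if not article_grams:
        return 0.0
    return len(article_grams & ngrams(output, n)) / len(article_grams)

def flag_candidates(logs, article: str, threshold: float = 0.5):
    """Yield log records whose output substantially reproduces the article.

    `logs` is assumed to be an iterable of dicts with "prompt" and
    "output" fields; the real production format is not public.
    """
    for record in logs:
        if verbatim_overlap(record.get("output", ""), article) >= threshold:
            yield record
```

Real expert analysis would be far more elaborate, layering metadata filters, fuzzy matching, and deduplication on top, but even a crude shingle overlap like this shows why the plaintiffs treat raw prompt-and-output pairs as the core evidence.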
Magistrate Judge Wang Orders Production of Logs to News Plaintiffs
The scale and specificity of the production order reflect how Magistrate Judge Wang approached the balance between discovery needs and corporate burden. In a detailed summary of the early stages of the dispute, coverage notes that Magistrate Judge Wang ordered production of the 20 million ChatGPT logs to the news plaintiffs, explaining that the judge concluded on Tuesday that the logs were central to evaluating whether OpenAI’s models had been trained on and were reproducing the news publishers’ copyrighted works. That framing treated the logs not as a side issue but as the core evidence in the case.
Judge Wang’s order also set important guardrails. Reports indicate that the production must be anonymized and subject to strict confidentiality protections, limiting how the plaintiffs can use or share the data outside the litigation. At the same time, the judge rejected OpenAI’s attempt to drastically narrow the scope, finding that a smaller sample would not give a reliable picture of how often infringing outputs occur. That reasoning has now been endorsed by Judge Stein, which means Magistrate Judge Wang’s approach is likely to influence how other courts handle similar discovery fights in AI copyright cases.
Appeals, affirmations, and the Court ruling against OpenAI
OpenAI did not accept the discovery order quietly, but its efforts to overturn it have so far failed. A detailed account of the appellate phase describes a court ruling against OpenAI that requires the 20 million ChatGPT logs to be disclosed, emphasizing that the judges were not persuaded by arguments that the production would expose trade secrets or irreparably harm the company’s business. The ruling framed the logs as essential to determining whether copyrighted news content was used for AI training without permission, and it treated that question as weighty enough to justify the burden on OpenAI.
Another report on the same decision notes that OpenAI has lost its U.S. appeal and been ordered to produce the 20 million conversations, and it explains how the court dismissed the company’s warnings about user privacy and competitive harm. The judges concluded that existing protective orders and anonymization protocols were adequate, and that OpenAI had not shown that complying with the order would be impossible or disproportionate. With the appeal resolved, the company now faces a firm deadline to complete the production, even as the plaintiffs push for more data and potential sanctions over deleted logs.
OpenAI Ordered to Hand Over Logs in the NYT Copyright Case
The New York Times has been one of the most visible players in this fight, and its case has helped define the contours of the broader MDL. A detailed account of the ruling notes that OpenAI was ordered to hand over the 20 million ChatGPT logs in the NYT copyright case, describing how the court sided with the newspaper’s argument that only a massive dataset could reveal whether ChatGPT was outputting its copyrighted news content without permission. The order explicitly tied the production to the need to test whether the chatbot’s responses could substitute for the original articles.
That same report highlights how the NYT case has become a bellwether for other publishers, who see the 20 million logs as a template for what they can demand in their own suits. By forcing OpenAI to hand over such a large volume of conversations, the court has signaled that AI companies cannot rely on secrecy about their training data and outputs when they are accused of infringement. For the Times and its peers, the logs are both a litigation tool and a potential roadmap for future licensing or enforcement strategies if they can show that the AI products are built on their work.
Privacy, user trust, and the specter of mass deletion
Behind the legal maneuvering lies a more uncomfortable question for OpenAI and its users: what exactly happens to ChatGPT conversations, and how long are they kept? The company has insisted that the 20 million logs will be anonymized, but the plaintiffs’ push to recover deleted data has raised concerns about whether OpenAI preserved everything it should have once litigation was reasonably anticipated. One detailed analysis of the dispute notes that news organizations want OpenAI to dig up millions of deleted ChatGPT logs and are raising the possibility of sanctions for “mass deletion” if the court finds that relevant evidence was destroyed after the lawsuits were in play.
Those allegations intersect with broader debates about how AI companies handle user data. A separate summary of the same controversy, viewed through a bioethics lens, notes that not only does it appear OpenAI has lost its fight to keep news organizations from digging through the logs, but the company now faces scrutiny over whether any mass deletion of conversations might have wiped out infringing outputs in the sample. Even if the data is anonymized, the idea that millions of user chats are being preserved, produced, or potentially deleted under legal pressure could affect how people think about the privacy of their interactions with AI systems.
Why the outcome matters far beyond OpenAI
The fight over ChatGPT logs is not happening in a vacuum. It is part of a broader reckoning over whether training large language models on copyrighted material is lawful, and what happens if courts decide it is not. A widely discussed comment thread on Hacker News captures the stakes bluntly, with one user arguing that the whole legal premise of these models is that training on copyrighted material is fair use, and warning that if it is not, then companies like Facebook and other AI developers could face existential challenges to their business models. The discovery battle in the OpenAI case is one of the first concrete tests of how that abstract debate plays out in court.
For news organizations, the outcome will shape whether they can force AI companies to pay for access to their archives or at least limit how their work is reused. For OpenAI and its peers, it will determine how much transparency they must accept about their training data and outputs, and how carefully they must preserve user logs once litigation is on the horizon. The 20 million conversations now being turned over are a milestone, but the publishers’ push for deleted logs and a larger 120 million sample shows they are not done pressing for more. Whatever judges decide next will help define the rules of engagement between journalism and generative AI for years to come.