
U.S. District Judge Araceli Martinez-Olguin has granted authors Sarah Silverman, Richard Kadrey, and Christopher Golden access to OpenAI’s internal Slack messages dating back to June 2022. The order, issued in an ongoing copyright lawsuit in the Northern District of California, could expose damaging internal discussions about the use of pirated books in AI training. The messages, spanning more than 2,000 pages, include conversations among OpenAI employees about scraping copyrighted material without permission, and could undermine the company’s defense.
Background of the Authors’ Lawsuit

In July 2023, authors Sarah Silverman, Richard Kadrey, and Christopher Golden filed a class-action lawsuit against OpenAI and Microsoft. They accused OpenAI of using their copyrighted books, obtained through unauthorized datasets like Books3, to train its GPT models, conduct they claim constitutes copyright infringement. Tools like the “Have I Been Trained” database have supported their claims by showing that their books appear in the training data.
OpenAI initially defended itself by claiming fair use under copyright law, and it resisted full disclosure during discovery, prompting the plaintiffs to file a motion to compel. That resistance has now produced a significant court ruling that could weaken OpenAI’s defense.
The Court’s Ruling on Discovery

In August 2023, Judge Martinez-Olguin ordered OpenAI to give the authors access to internal communications, including Slack messages from June 2022 onward. The judge found that OpenAI’s search terms for relevant documents were too narrow, rejected the company’s objections, and ordered it to produce more than 2,000 pages of Slack exports. The messages cover discussions of data sourcing and copyright, and could provide valuable evidence for the plaintiffs.
The scope of the disclosure is limited to messages involving key employees, such as those in research and engineering teams. This limitation is designed to avoid overly broad production and to focus on the most relevant communications.
Revelations from OpenAI’s Internal Slack Messages

Specific Slack conversations reveal that OpenAI employees, including researcher Suchir Balaji, discussed using “shady” torrent sites and pirated books as training data, including the shadow-library dataset Books3. These discussions could undermine OpenAI’s public stance on ethical data practices and provide ammunition for the authors’ infringement claims.
Particularly damaging are admissions in the messages that employees were aware of copyright issues. One engineer wrote in 2022 that “we’re just taking whatever’s out there” for AI development, a statement that contradicts OpenAI’s public positions and could weaken its defense in the lawsuit.
Implications for OpenAI’s Legal Strategy

The exposed Slack messages could undercut OpenAI’s fair use argument by showing deliberate inclusion of copyrighted material without any effort to license it, weakening the company’s defense and strengthening the authors’ case. The ruling could also affect similar lawsuits, including The New York Times’ separate case, where comparable discovery battles are ongoing.
The fallout for OpenAI could be severe, with increased scrutiny from regulators and partners likely. The messages reveal internal debate over the legality of the company’s data practices, which could damage its reputation and invite further legal challenges.
The ongoing lawsuit and the recent court ruling highlight the complex issues surrounding AI training and copyright law. As AI continues to evolve, these issues are likely to become increasingly important, with implications for the entire tech industry. The outcome of this case could set a precedent for future lawsuits and shape how AI companies handle copyrighted material.
For more details, visit Futurism.