In early 2025, a class-action lawsuit against GitHub, Microsoft, and OpenAI over Copilot’s use of open-source training data was still grinding through federal court. By spring 2026, the case remains unresolved, and the tools at the center of the controversy have only grown more powerful. GitHub Copilot, Amazon CodeWhisperer, and a wave of newer AI coding assistants now generate code at industrial scale, drawing on billions of lines of publicly available source code, much of it released under open-source licenses that were written long before anyone imagined a machine learning model could absorb and remix them.
The result is a legal gray zone that affects every developer who publishes code under an open-source license and every company shipping products built with AI-generated snippets. These tools can reproduce patterns, structures, and sometimes near-verbatim blocks from open-source projects, all while their makers argue that the output does not technically violate copyright. Whether that argument holds up is the question the courts, Congress, and the U.S. Copyright Office are still trying to answer.
What the Copyright Office has actually said
The most authoritative guidance so far comes from the U.S. Copyright Office’s multi-part AI report, released in stages between mid-2024 and early 2025. The Office maintains a dedicated policy hub collecting its reports, notices of inquiry, and public comments on AI and copyright.
Part 2 of that report, focused on copyrightability, drew a line that matters for every developer using an AI assistant. According to a Library of Congress press release, the Office concluded that AI-assisted works can receive copyright protection, but only when a human author contributes enough original creative input. Purely AI-generated output, produced without meaningful human direction, selection, or arrangement, falls outside copyright’s reach.
For a developer, the practical distinction looks like this: if you prompt Copilot, review what it produces, and reshape the output with your own logic and structure, the final code can qualify for protection. If you accept a raw suggestion wholesale and ship it unchanged, it likely cannot. That threshold has real consequences for companies building proprietary products on top of AI-generated scaffolding.
The fair-use shield from Google v. Oracle
A second legal pillar gives AI tool makers a potential defense. In 2021, the Supreme Court ruled in Google LLC v. Oracle America, Inc. that Google’s reimplementation of Java APIs in Android qualified as fair use. A Congressional Research Service analysis of that decision confirmed the principle: copying functional elements of code can be lawful when the new use is sufficiently transformative in purpose and character.
AI vendors have seized on this reasoning. If a generative model produces code that reimplements functionality found in open-source projects without copying verbatim expression, the Google v. Oracle logic suggests the output could survive a fair-use challenge. But that case dealt with a human-directed reimplementation of a specific API, not a model trained on millions of repositories producing output at scale. Whether courts will extend the same reasoning to AI-generated code is an open question, and one that the pending Doe v. GitHub litigation may eventually force them to answer.
The gaps no one has filled
For all the guidance the Copyright Office has issued, it has not addressed the specific mechanics of how AI code generators ingest open-source repositories. No official statement has evaluated whether training a model on GPL-licensed or MIT-licensed code, and then generating functionally similar output, constitutes infringement or fair use. The Office’s framework speaks in general terms about human authorship thresholds without drilling into the license-specific obligations that govern most open-source software.
That silence is conspicuous. The GPL, the most widely used copyleft license, was designed around a simple bargain: you can use and modify the code, but derivative works must carry the same license and make their source available. If an AI model trained on GPL code produces output that is functionally derivative but never triggers the license’s distribution requirements, the reciprocal bargain at the heart of copyleft may be effectively neutralized.
A search of the Copyright Office’s public records system as of early 2026 turns up no recorded disputes that squarely test AI code generation against open-source license terms. The Office’s licensing portal does not yet feature a category of compulsory or voluntary licenses tailored to AI training on software. These gaps do not prove conflicts are absent. They indicate that any such disputes have not yet crystallized into the kind of high-profile test case that would force a definitive ruling.
What open-source maintainers are up against
The frustration among open-source developers is not abstract. Organizations like the Software Freedom Conservancy have been investigating Copilot’s use of copyleft-licensed training data since 2022, arguing that the tool’s outputs can reproduce substantial portions of licensed code without preserving downstream obligations. The Doe v. GitHub class-action lawsuit, filed in November 2022 in the Northern District of California, alleges that Copilot violates open-source licenses, the DMCA’s provisions on copyright management information, and developers’ rights under various state laws.
As of spring 2026, that case has survived early motions to dismiss on some claims but has not reached trial. Its outcome could set the first major precedent on whether AI training on open-source code, and the generation of code from those models, triggers license obligations or qualifies as fair use.
Meanwhile, maintainers of widely used projects face a practical dilemma. Their code feeds the training data behind tools that competitors and corporations use freely. Permissive licenses like MIT and Apache 2.0 impose few restrictions, so training on that code may raise fewer legal issues. But copyleft licenses like the GPL were specifically designed to prevent this kind of value extraction without reciprocal contribution. If courts decide that AI-generated output is not a “derivative work” under copyright law, the enforcement mechanism behind copyleft could be significantly weakened.
How the system has handled disruption before
The Library of Congress maintains a collection of historical copyright record books that track how registration practices adapted as new media emerged, from photographs in the 19th century to sound recordings and computer programs in the 20th. The pattern is consistent: an initial period of ambiguity, followed by incremental agency guidance, and eventually legislative action when the existing framework proves insufficient.
Software itself followed this arc. Computer programs were not explicitly covered by U.S. copyright law until the Copyright Act of 1976 and the 1980 amendments that, following the CONTU commission’s recommendations, added a statutory definition of “computer program.” Before that, developers operated in a legal gray zone not unlike the one AI-generated code occupies now. The resolution came not from a single court ruling but from a combination of legislative updates, Copyright Office guidance, and industry practice that gradually clarified what was protectable and what was not.
AI-generated code appears to be on a similar trajectory. The Copyright Office’s staggered reports, the pending litigation, and the growing pressure from open-source communities all point toward a period of incremental clarification rather than a single decisive moment.
Where developers stand right now
For developers working with AI coding tools in 2026, the practical calculus comes down to how much human judgment you layer on top of machine output. The Copyright Office has made clear that human-authored contributions built on AI suggestions can qualify for protection. The Supreme Court’s Google v. Oracle ruling suggests that reimplementing functionality, rather than copying expression, is defensible. But neither of those principles has been tested in a case involving AI-generated code trained on open-source repositories.
Developers who treat AI output as a rough draft, reshaping it with their own structure, logic, and naming conventions, are on firmer legal ground than those who ship unedited suggestions into production. For maintainers of open-source projects, the more pressing concern is not whether their own code is protectable but whether the license terms they chose will be enforceable against a new kind of copying they never anticipated.
The legal system is moving, but it has not arrived. The Copyright Office’s reports, the Doe v. GitHub lawsuit, and the broader policy debate are all pieces of a framework that is still being assembled. Until a court rules on the specific intersection of AI training, code generation, and open-source licensing, both tool makers and developers are navigating by inference rather than precedent. The only certainty is that the answer, when it comes, will reshape how software gets built and who gets credit for it.
This article was researched with the help of AI, with human editors creating the final content.