AI training data lawsuits, explained
Most of the major AI lawsuits target the same conduct: copying copyrighted works into a training corpus. Here is what plaintiffs must prove, what defendants typically argue, and why this question is still legally unresolved.
The phrase "AI training data lawsuit" covers a specific legal theory: that a generative AI company copied copyrighted works (books, articles, images, songs) into a training corpus without authorization, and that the act of copying, not the model's outputs, is itself infringement. This is the theory at the core of Bartz v. Anthropic, Authors Guild v. OpenAI, NYT v. OpenAI, Getty Images v. Stability AI, and most other major AI cases.
What plaintiffs must prove
A typical training-data complaint alleges three elements:
- Copying. The defendant downloaded, scraped, or otherwise copied the plaintiff's copyrighted works.
- Use in training. Those copies were incorporated into a training dataset for a generative model.
- No authorization. The defendant lacked a license, terms-of-service grant, or statutory exception.
The first element is increasingly easy to prove because most major training corpora — Books3, LibGen, Common Crawl, OpenWebText — are publicly documented. The third element is contested: defendants argue the conduct falls within fair use (US), the TDM exception (EU), or analogous safe harbors elsewhere.
The fair-use defense (US)
In the US, defendants argue that training is transformative (the model produces a new functional capability, not a substitute for the original work) and that the per-work copying is incidental to that transformative purpose. The 2025 order in Bartz v. Anthropic held that training itself was transformative but that downloading pirated copies of books to build the training library was not protected by fair use. The two issues are now widely understood as separable.
The cleanest US fair-use ruling against an AI training defendant remains Thomson Reuters v. Ross Intelligence (D. Del. 2025), where the court found that copying Westlaw headnotes to train a competing legal-research product was not fair use because the use was directly substitutive.
The TDM exception (EU)
The EU DSM Directive's Article 4 carves out a text-and-data-mining exception that can cover training, but with important limits. GEMA v. OpenAI, decided at first instance by the Munich Regional Court I in November 2025, held that ChatGPT's reproduction of song lyrics was not protected by Article 4: the court reasoned that the infringement lay in the reproduction of the lyrics in the model's outputs, not in the training itself, and that such output reproduction falls outside the exception. OpenAI has appealed.
What's at stake on appeal
The two questions that will determine the future of training-data law:
- Whether downloading from shadow libraries is fair use when the resulting use is transformative.
- Whether near-verbatim regurgitation by a model defeats a transformative-use defense even when the training itself was lawfully sourced.
Neither question has reached the Ninth Circuit, the Third Circuit, or the Court of Justice of the European Union. Until they do, the law is district-court law, and individual decisions diverge meaningfully.
What this means in practice
Most major AI labs are now restructuring their training pipelines around two principles: only ingest content with documented lawful provenance (license, terms-of-service grant, or public-domain status), and watermark or otherwise constrain outputs to reduce verbatim reproduction risk. The Bartz settlement made the first principle non-optional. The GEMA Munich ruling pushed labs toward the second.
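To make the first principle concrete, here is a minimal sketch of what a provenance gate over a corpus manifest might look like. It is illustrative only: the field names, license labels, and file paths are hypothetical, and it does not describe any lab's actual pipeline, which would involve richer metadata and legal review rather than a single boolean check.

```python
# Illustrative sketch only: a provenance gate over a training-corpus manifest.
# Field names (source_url, license, provenance_doc) and the label set are
# hypothetical assumptions, not any lab's real schema.
from dataclasses import dataclass
from typing import Optional

ALLOWED_LICENSES = {"licensed", "public-domain", "cc0"}  # assumed label set

@dataclass
class CorpusEntry:
    source_url: str
    license: str                   # e.g. "licensed", "public-domain", "unknown"
    provenance_doc: Optional[str]  # pointer to the license or ToS grant, if any

def is_ingestible(entry: CorpusEntry) -> bool:
    """Admit an entry only if its lawful provenance is documented."""
    return entry.license in ALLOWED_LICENSES and entry.provenance_doc is not None

corpus = [
    CorpusEntry("https://example.com/novel.txt", "unknown", None),
    CorpusEntry("https://example.com/essay.txt", "licensed", "deals/essay-license.pdf"),
]

ingest = [e for e in corpus if is_ingestible(e)]
hold = [e for e in corpus if not is_ingestible(e)]
print(f"ingest: {len(ingest)}, hold for review: {len(hold)}")
```

The point of the sketch is the design choice it encodes: content with unknown or undocumented provenance is held for review by default rather than ingested.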
Frequently asked questions
Is training AI on copyrighted material illegal?
In the US, the question is unresolved at the appellate level. Some district courts (Bartz) suggest training itself may be transformative; others (Thomson Reuters) have rejected the fair-use defense. The legality often depends on whether the training corpus was lawfully acquired.
What is a training-data lawsuit?
A lawsuit alleging that a generative AI company copied copyrighted works into a training corpus without authorization. The theory targets the act of copying, not necessarily the model's outputs.
Why are publishers suing AI companies?
Publishers allege that AI companies copied registered articles into training corpora without licensing, that the resulting models reproduce article content nearly verbatim in some cases, and that the AI products substitute for the original publishers' subscription products.
How do AI companies defend training-data lawsuits?
Typical defenses include: fair use (transformative purpose); the EU TDM exception; lawful acquisition (the works were licensed or in the public domain); de minimis copying; and that any reproduction in outputs is incidental rather than systematic.