Training Data Copyright

Was training the model itself an act of infringement?

The doctrine

This is the core battleground of AI litigation. Authors, news publishers, record labels, image libraries, and software developers all claim that their copyrighted works were copied without a license to train large language and diffusion models. In U.S. courts, the near-universal defense is fair use.

U.S. courts apply the four-factor fair-use test of 17 U.S.C. § 107: purpose and character of the use (including transformativeness), nature of the copyrighted work, amount and substantiality of the portion used, and effect on the market for the work. Defendants argue training is highly transformative: the model does not retransmit the work but extracts statistical patterns from it. Plaintiffs argue the use is commercial, copies each work in full, and displaces the licensed training corpora that publishers and licensing collectives are now offering.

The first U.S. summary-judgment ruling, Thomson Reuters v. Ross Intelligence (D. Del. 2025, Bibas, J.), found that training on Westlaw headnotes to build a competing legal-research tool was not fair use, a result driven largely by direct market substitution. Bartz v. Anthropic (N.D. Cal. 2025, Alsup, J.) bifurcated the question: training on legitimately acquired books may be fair use, but downloading and retaining pirated copies from shadow libraries is not. Kadrey v. Meta (N.D. Cal. 2025, Chhabria, J.) also involved books taken from shadow libraries, though that ruling turned mainly on the plaintiffs' failure to show market harm. The corpus's source (licensed, scraped, or pirated) is now a central doctrinal hinge.

Outside the United States, the EU's text-and-data-mining exception (Articles 3 and 4 of Directive (EU) 2019/790, the DSM Directive) shields some training, but the Munich Regional Court in GEMA v. OpenAI held that the exception does not extend to memorized output. The UK High Court in Getty Images v. Stability AI rejected a "model-weights-as-infringing-copies" theory while leaving territorial questions about where training took place open.

Leading cases

Bartz v. Anthropic
N.D. Cal. · Alsup, J. · Settled $1.5B

Bifurcated fair-use ruling and the largest copyright settlement in U.S. history.

New York Times v. OpenAI
S.D.N.Y. · Stein, J. · Active

Flagship publisher case; motion to dismiss denied; a log-preservation order extended discovery.

Authors Guild v. OpenAI
S.D.N.Y. · Consolidated · Active

Ruling on output infringement issued October 2025.

Getty Images v. Stability AI
UK High Court + D. Del.

UK trial decided November 2025; U.S. summary judgment pending.

Key holdings

  • Pirated training data. Downloading and retaining pirated copies is not fair use under Bartz v. Anthropic and creates exposure to willful-infringement damages.
  • Provenance matters. Courts increasingly distinguish licensed, scraped, and pirated corpora at summary judgment.
  • Market substitution. Existence of a licensing market for training data weighs heavily against fair use (Thomson Reuters v. Ross).