Book Publishers Target Meta's Llama Over Alleged Piracy-Sourced Training Data

Five major publishers and author Scott Turow claim Meta built its Llama AI models using books stripped from piracy repositories, a charge that could void a fair-use defense.

Five major book publishers and author Scott Turow have sued Meta, alleging the company built its Llama AI models on copyrighted works sourced from piracy repositories. The case cuts to the heart of AI copyright law: fair use may not shield a company that trained on stolen material.

When “Fair Use” Meets Piracy

The cornerstone of most AI copyright defenses is fair use — the claim that model training transforms copyrighted text into something new. That argument assumes the training data was lawfully acquired. According to The Verge AI, Meta allegedly sourced books and journal articles from piracy repositories including LibGen, Anna’s Archive, Sci-Hub, and Sci-Mag, and from the Common Crawl archive, which plaintiffs describe as riddled with unauthorized copies. The plaintiffs’ position is blunt: you cannot claim transformative use of stolen property.

Publishers, Parties, and Evidence

The plaintiffs — Macmillan, McGraw Hill, Elsevier, Hachette, Cengage, and novelist Scott Turow — span academic and trade publishing. Their clearest evidence: researchers fed Llama an opening passage from a Cengage calculus title and the model continued it nearly verbatim. That output pattern implies memorization of training data rather than synthesis, which is the specific harm publishers need to establish.

Earlier rulings have not made this path easy. A federal judge sided with Meta in a prior author lawsuit but explicitly noted his ruling carried no broader finding that ingesting protected works for AI training is lawful. A separate ruling found that training on legally purchased books could qualify as fair use, yet still allowed a class action over pirated works to proceed — a distinction that proved costly for Anthropic, which settled a pirated-works class action for $1.5 billion. Meta has indicated it intends to contest the current suit on fair-use grounds.

Why This Matters

This case may finally force courts to draw a hard line between training on lawfully held content and training on pirated material. A ruling against Meta on the piracy sourcing question would constrain data-gathering practices across the industry and potentially require AI companies to audit — and disclose — every training corpus. The $1.5 billion Anthropic precedent suggests the financial exposure is real.

Frequently Asked Questions

Which piracy sites did Meta allegedly use to train Llama?

The lawsuit names LibGen, Anna's Archive, Sci-Hub, and Sci-Mag as sources, and also flags Common Crawl — a widely used web archive — as allegedly saturated with unauthorized copies.

How does Anthropic's copyright settlement relate to this Meta lawsuit?

Anthropic settled a class action specifically over works allegedly pirated for AI training — distinct from a separate fair-use ruling covering legally purchased books — for $1.5 billion. The Meta publishers' suit relies on a similar piracy-sourcing theory.

#copyright #Meta #Llama #FairUse #GenerativeAI #publishers #AITrainingData #piracy