Meta's latest legal wheeze is to insist that pirating books is fair use, actually. And it might be working.

artifex@piefed.social · 11 hours ago

ryathal@sh.itjust.works · 10 hours ago

Arguing that training models isn’t fair use is going to be a massive uphill battle; it’s basically reading the book, but with a computer. It’s not actually a big deal to most people, unless you hold the copyright to a ton of works and want a percentage of all the AI income these companies have made.

Torrenting the books is almost certainly copyright infringement, but that has a relatively low payout compared to the money these companies are getting for their models. The training being fair use means that rights holders can’t try to take any money from the model’s use. The statutory damages for infringement, even assessed per work, aren’t significant compared to the legal cost of proving it happened.

FatCrab@slrpnk.net · 8 hours ago

Anthropic pirating books for their training corpus resulted in the biggest copyright settlement in history, well over a billion dollars. That is still being quibbled over, I believe, but they settled because they were likely to pay out more if the case went forward. So I’m not really sure where you’re coming from in saying that infringement via torrenting does not result in monstrously large liability.

ryathal@sh.itjust.works · 7 hours ago

The judge in that case ruled the training wasn’t fair use for pirated books, which left them on the hook for potentially all revenue (likely a court-determined percentage) that the model generated for them, in addition to statutory damages. That is well north of 1.5 billion.

artifex@piefed.social · 7 hours ago

Which is kind of a pity. Anyone who’s ever written something on the net should be getting royalty checks from these fucks. I’m not exactly famous but I’ve written prolifically in my field of work and have gotten nearly word-for-word reproductions of my articles out of every big model I’ve tested since GPT-3.

OfCourseNot@fedia.io · 9 hours ago

There’s an argument to be made that it is, in fact, not ‘reading’. The training of the model could be considered a lossy compression of the data. And streaming movies in a lossy compression format is not fair use, is it?

ryathal@sh.itjust.works · 7 hours ago

The model doesn’t stream out anyone’s content though. The article mentions that the plaintiffs have provided no examples of a prompt that creates anything substantial.

Streaming a lossy compression would generally be infringement, but there is definitely a point where it stops being infringement if it’s lossy enough.

What a model generally stores is factual information that isn’t copyrightable in the first place. It’s storing word counts, sentence lengths, sentiment analysis, and so on.

Fatal@piefed.social · 9 hours ago

It’s not the storage of the information that matters as much as the presentation. Google’s search index stores a huge amount of copyrighted material, even losslessly. But it only presents small snippets at a time, which is not considered copyright infringement. The question really is whether or not the information being presented by the models is in a format that is considered copyright infringement. So far, courts have not found that it is.