
Two days after Judge Alsup released his thoughtful and well-reasoned fair use analysis of Anthropic’s unlicensed use of training data, Judge Chhabria ruled on a similar motion for summary judgment in Kadrey v. Meta on June 25, finding fair use for Meta’s AI training activities. The plaintiffs’ claim that Meta’s use of their works to train Llama AI models was copyright infringement fails, at least for now.
The ruling tracks the comments Judge Chhabria made during and after the summary judgment arguments, where he repeatedly asked whether content created by AI could flood the market for the works used in training: “You are dramatically changing, you might even say obliterating, the market for that person’s work, and you’re saying you don’t even have to pay a license to that person for using their work to create the product that’s destroying the market for their work?”
Judge Chhabria’s thinking changed little in the intervening weeks, and in his ruling he invents a new theory of market harm, “market dilution,” which he acknowledges is not found in any earlier case. The theory suggests that “using copyrighted books to train an LLM might harm the market for those works because it enables the rapid generation of countless works that compete with the originals, even if those works aren’t themselves infringing.”
This marks a sharp departure from longstanding Supreme Court precedent. In Campbell v. Acuff-Rose, the Court made clear that “cognizable market harm” under the fourth factor is limited to “market substitution,” not extending to every possible downstream impact on revenue. By contrast, the market dilution theory treats even indirect, speculative, or long-term effects on potential market value as actionable harm—effectively extending the copyright monopoly well beyond its traditional boundaries.
In Judge Chhabria’s words, such a market dilution theory would encompass practically all competing uses:
Less similar outputs, such as books on the same topics or in the same genres, can still compete for sales with the books in the training data. And by taking sales from those books, or by flooding stores and online marketplaces so that some of those books don’t get noticed and purchased, those outputs would reduce the incentive for authors to create—the harm that copyright aims to prevent.
Perhaps most shocking to copyright scholars, this “market dilution” theory creates a novel proposition that a transformative use can nevertheless be substitutive for the original work under the fourth factor. No prior case has held that a transformative use can lose on fair use due to speculative market disruption alone. If market dilution trumps transformative use, it almost seems inevitable that any new technology or new expression would be challenged as diluting the market of existing works.
There are also serious procedural problems with this approach: the “market dilution” theory sidesteps the substantial similarity analysis entirely. Even if we were to accept that GAI can “dilute” the market for certain rightsholders’ works, it is still impossible to prove that a model consisting entirely of weights and parameters would be substantially similar to any creative work. The theory also leaves open the perplexing question of how one might actually prove market dilution. The variables that determine a loss in sales of one’s original works are numerous; it would be almost impossible to isolate the effects of one AI company’s particular GAI model, with its potential to create works that compete in the broader market for creative works.
Why do we create?
The opinion sets the mood against AI training early on:
[C]ompanies have been unable to resist the temptation to feed copyright-protected materials into their models—without getting permission from the copyright holders or paying them for the right to use their works for this purpose. This case presents the question whether such conduct is illegal. Although the devil is in the details, in most cases the answer will likely be yes. What copyright law cares about, above all else, is preserving the incentive for human beings to create artistic and scientific works. Therefore, it is generally illegal to copy protected works without permission. And the doctrine of “fair use,” which provides a defense to certain claims of copyright infringement, typically doesn’t apply to copying that will significantly diminish the ability of copyright holders to make money from their works.
This line of thought boils down to the claim that companies training GAI on copyrighted works could undermine the market for those works, which in turn, Judge Chhabria asserts, weakens the incentive for human creators to produce new works.
In reality, human interaction with AI technology so far tells a different story. A more capable AI does not necessarily displace human participation in the same field; often, it enhances it. When Deep Blue defeated the world chess champion in 1997, it did not render professional chess obsolete. On the contrary, it elevated the level of play and deepened public interest in the game. Today, even though no human can match AI at chess, professional players continue to attract millions of followers on social media and earn competitive incomes—often outperforming many other professions in both visibility and compensation. An equally plausible prediction, then, when it comes to how human authors will fare, is that “humans will become more smart and capable,” as Professor Sejnowski predicted in his essay “AI will make you smarter.” While some fields may face disruption, the evidence so far suggests that AI tends to enhance—rather than replace—human efforts to research, learn, and create.
While Judge Chhabria recognizes that training AI may be transformative in the abstract, he opines that market harm can outweigh even a transformative purpose.
The opinion states that:
[W]hen it comes to market effects, using books to teach children to write is not remotely like using books to create a product that a single individual could employ to generate countless competing works with a miniscule fraction of the time and creativity it would otherwise take. This inapt analogy is not a basis for blowing off the most important factor in the fair use analysis.
It seems the court is not at all worried about whether a new licensing market for AI training data could ever develop (there is evidence suggesting the cost and logistics would be prohibitive), or whether such a licensing system would even be beneficial for incentivizing human creativity in the first place. The opinion brushes off any concerns:
If using copyrighted works to train the models is as necessary as the companies say, they will figure out a way to compensate copyright holders for it. The upshot is that in many circumstances it will be illegal to copy copyright-protected works to train generative AI models without permission. Which means that the companies, to avoid liability for copyright infringement, will generally need to pay copyright holders for the right to use their materials.
The opinion makes these sweeping statements about the future of human authorship without distinguishing between copyright holders, existing authors and creators, and future authors and creators, whose interests are often ill-aligned. In our view, the far greater threat to authors’ livelihoods is power concentration among big publishers, platforms, and other intermediaries, all of whom stand to profit even more from a licensing market for AI training data.
Such confusion can also easily produce strange lines of reasoning; under the court’s logic, classic examples of fair use such as parody or criticism would not qualify:
If the law allowed people to copy your creations in a way that would diminish the market for your works, this would diminish your incentive to create more in the future. Thus, the key question in virtually any case where a defendant has copied someone’s original work without permission is whether allowing people to engage in that sort of conduct would substantially diminish the market for the original work.
Whether a use merely will “diminish the market” has never been the standard for fair use. Not only would parody and criticism fail this test; had public libraries not existed today, we wonder how Judge Chhabria would rule on their legality. After all, public libraries provide liberal access to culture and knowledge—precisely the kind of access that undermines the maximization of profits by rightsholders. Besides, applying Judge Chhabria’s logic cited above, if libraries wanted to acquire circulation copies, they could always negotiate for the right licensing price—if libraries are desirable, surely a licensing market will develop.
What did Meta do?
The court described Meta’s curation and reproduction of copyrighted works:
At first, Meta wanted to license books and so tried to negotiate licensing deals with several major publishers. Meta’s head of generative AI discussed spending up to $100 million on licensing. But as negotiations proceeded, Meta realized that licensing would be more difficult than anticipated. For one thing, publishers generally do not hold the subsidiary rights to license books for AI training. These rights are instead held by individual authors, and there is no organization for collective licensing of such rights. Even where publishers do hold AI training licensing rights, they do so regionally rather than globally. For another thing, some publishers apparently ignored Meta’s outreach, and only one gave Meta a pricing proposal.
Eventually, Meta began investigating the possibility of procuring the books (and other text) needed for training by downloading them from “shadow libraries.” A shadow library is an online repository that provides things like books, academic journal articles, music, or films for free download, regardless of whether that media is copyrighted. Meta first used a shadow library in October 2022, when it downloaded the Library Genesis (“LibGen”) database to investigate whether there was value in training Llama on the works it contained. If the answer was yes, the plan was to then set up licensing agreements for those or similar works. But in spring 2023, after failing to acquire licenses and following escalation to CEO Mark Zuckerberg, Meta decided to just use the works acquired from LibGen as training data. And after confirming that LibGen contained most of the works available for license from certain publishers with which it had been negotiating, Meta abandoned its licensing efforts. In early 2024, Meta also downloaded Anna’s Archive, a compilation of shadow libraries including LibGen, Z-Library, and others. . . .
Certain torrenting protocols—including the one used by Meta, called BitTorrent—are, by default, configured so that files downloaded via torrenting may also be reuploaded to other computer systems. This reuploading can occur both while files are still being downloaded (which the parties refer to as “leeching”) and after those files have been fully downloaded (which the parties refer to as “seeding”). Some torrenting protocols—including BitTorrent—are designed to prioritize downloads to users who are also uploading. There is no dispute that Meta torrented LibGen and Anna’s Archive, but the parties dispute whether and to what extent Meta uploaded (via leeching or seeding) the data it torrented. A Meta engineer involved in the torrenting wrote a script to prevent seeding, but apparently not leeching. Therefore, say the plaintiffs, because BitTorrent’s default settings allow for leeching, and because Meta did nothing to change those default settings, Meta must have reuploaded “at least some” of the data Meta downloaded via torrent. . . .
Either way, Meta added the books it downloaded to the datasets it used to train the Llama models. It also post-trained its models to prevent them from “memorizing” and outputting certain text from their training data, including copyrighted material. These training efforts, which Meta calls “mitigations,” appear to have been successful. Meta’s expert witness tested them using a method designed to get LLMs to regurgitate material from its training data (which Meta calls “adversarial prompting”). Even using that method, the expert could get no model to generate more than 50 words and punctuation marks (that is, “tokens”) from the plaintiffs’ books. And the plaintiffs’ expert could only get the Llama model best at regurgitation to generate 50 words and punctuation marks from the plaintiffs’ books in 60% of tests. She also testified that Llama was not able to reproduce “any significant percentage” of them. In short, Llama cannot currently be used to read or otherwise meaningfully access the plaintiffs’ books.
Factor one of fair use
The court found that the first fair use factor favored Meta because its use of the plaintiffs’ books—training a large language model (LLM)—was highly transformative. Unlike the plaintiffs’ works, which are read for entertainment or education, Meta’s LLMs are used to perform a wide range of functions like translating text, generating business reports, or assisting with ideation.
Here, the court takes a more balanced approach to deciding whether a use qualifies as fair use:
The plaintiffs are wrong that the fact that Meta downloaded the books from shadow libraries and did not start with an “authorized copy” of each book gives them an automatic win. To say that Meta’s downloading was “piracy” and thus cannot be fair use begs the question because the whole point of fair use analysis is to determine whether a given act of copying was unlawful. See generally Amicus Br. of Electronic Frontier Foundation.
The court, however, did not follow Judge Alsup’s approach in Bartz, and spent some time analyzing the question of bad faith:
But Meta is also wrong to suggest that its use of shadow libraries is irrelevant to whether its copying was fair use. It’s relevant—or at least potentially relevant—in a few different ways.
First, Meta’s use of shadow libraries is relevant to the issue of bad faith, which is “often taken up under the first factor.” The law is in flux about whether bad faith is relevant to fair use. It seems like good faith versus bad faith shouldn’t be especially relevant: The purpose of fair use is to allow new expression that won’t substitute for the original work, and whether a given use was made in good or bad faith wouldn’t seem to affect the likelihood of that use substituting for the original. But even if bad faith is relevant, it doesn’t move the needle here, given the rest of the summary judgment record.
Second, downloading copyrighted material from shadow libraries would be relevant if it benefited those who created the libraries and thus supported and perpetuated their unauthorized copying and distribution of copyrighted works. In the vast majority of cases, this sort of peer-to-peer file-sharing will constitute copyright infringement. Some of the libraries Meta used have themselves been found liable for infringement. So if Meta’s act of downloading propped up these libraries or perpetuated their unlawful activities—for instance, if they got ad revenue from Meta’s visits to their websites—then that could affect the “character” of Meta’s use. But the plaintiffs have not submitted any evidence about this.
Factor two of fair use
The court found factor two to weigh in favor of the plaintiffs, because their works are “highly expressive works.” The court reasoned:
Meta argues that this factor favors it anyway because Meta only used the plaintiffs’ books to gain access to their “functional elements,” not to capitalize on their creative expression. Meta primarily relies on two Ninth Circuit cases involving “intermediate copying.” In both of those cases, a video game company copied a video game console manufacturer’s copyrighted code and reverse-engineered it to understand certain functional elements of that code. This allowed the game companies to build their own products that would work with the plaintiffs’.
But unlike the uses in those cases, Meta’s use of the plaintiffs’ books does depend on the books’ creative expression. As Meta itself notes, LLMs are trained through learning about “statistical relationships between words and concepts” and collecting “statistical data regarding word order, frequencies [what words are used and how often], grammar, and syntax.” Word order, word choice, grammar, and syntax are how people express their ideas.
It does not seem to matter that the resulting AI model consists of weights and parameters that no human can read or understand.
Factor three of fair use
Factor three was found to favor Meta because of its close relationship with the first factor: the amount taken tends to be reasonable when it serves a transformative use:
As an initial matter, the amount copied doesn’t seem especially relevant in this case. In a case involving, for instance, a musical parody, copying large portions of the original song might increase the parody’s “potential for market substitution.” But given that Meta’s LLMs won’t output any meaningful amount of the plaintiffs’ books, it’s not clear how or why Meta’s copying would be less likely to lead to the creation of direct substitutes for the books if Meta had copied less of them.
In any event, this factor favors Meta, even though it copied the plaintiffs’ books in their entirety. The amount that Meta copied was reasonable given its relationship to Meta’s transformative purpose. Everyone agrees that LLMs work better if they are trained on more high-quality material. So feeding a whole book to an LLM does more to train it than would feeding it only half of that book. With this in mind, it was “reasonably necessary” for Meta to “make use of the entirety of the works.”
Finding factor three in favor of AI training is the most predictable outcome. Had the court ruled otherwise, AI models would simply have to train on 1% of each work instead of the full work.
Factor four of fair use
Citing Harper & Row, the court considers the fourth factor the most important fair use factor. Because plaintiffs in this case failed to offer any real evidence on market harm, the court begrudgingly found in favor of Meta.
Still, the opinion explains a path forward for future claims against AI training:
In a case involving the use of copyrighted works to train generative AI models, there are at least three ways a plaintiff might try to argue that the defendant’s copying harmed the market for the works (or that the market would be harmed if that copying were widespread). First, the plaintiff might claim that the model will regurgitate their works (or outputs that are substantially similar), thereby allowing users to access those works or substitutes for them for free via the model. Second, the plaintiff might point to the market for licensing their works for AI training and contend that unauthorized copying for training harms that market (or precludes the development of that market). Third, the plaintiff might argue that, even if the model can’t regurgitate their own works or generate substantially similar ones, it can generate works that are similar enough (in subject matter or genre) that they will compete with the originals and thereby indirectly substitute for them. In this case, the first two arguments fail. The third argument is far more promising, but the plaintiffs’ presentation is so weak that it does not move the needle, or even raise a dispute of fact sufficient to defeat summary judgment.
In their complaint, the plaintiffs asserted only two types of market harm—that users of Llama can reproduce text from their books, and that Meta’s copying harmed the market for licensing copyrighted materials to companies for AI training. As for market dilution—the notion that allowing companies like Meta to copy their works to train products like Llama would inevitably cause the market for the plaintiffs’ works to be flooded with similar works—the plaintiffs never so much as mentioned it in their complaint. Nor did they mention it in their own summary judgment motion.
The court poses several questions that would need to be answered to bring a viable claim:
First, is Llama capable of generating such books? If it isn’t currently, will it be capable of doing so in the near future? Presumably the answer is yes, but that’s not a foregone conclusion. An LLM could, for instance, be configured to be unable to produce book-length or book-style outputs. So the fact that books are being created by some LLM does not automatically mean that Llama can create them or will be able to do so soon.
Second, what are these AI-generated books? Do they compete with Sarah Silverman’s memoir? With plaintiff Matthew Klam’s book of short stories? With Rachel Louise Snyder’s nonfiction work on domestic violence? The plaintiffs provide no analysis of the markets for their books, no discussion of whether these markets are or could be affected by AI-generated books, and no explanation of whether the existing AI-generated books referenced in the expert report compete in these markets.
Third, what impact does this competition actually have on sales of the books it competes with? Does it drown out those books entirely? Does it just chisel at their sales at the margins? Or, as discussed above and seems likely, does it depend on the book—are readers of romance novels happy to buy AI-generated ones, while all the people who want to read Sarah Silverman’s memoir still want to read it over AI-generated comic memoirs? Whatever the effects have been thus far, are they likely to increase in the future, as more and more AI-generated books are written, and as LLMs get better and better at writing human-like text?
Fourth, how does the threat to the market for the plaintiffs’ books in a world where LLM developers can copy those books compare to the threat to the market for the plaintiffs’ books in a world where the developers can’t copy them? There is no hint of that in the briefs or evidence presented by the plaintiffs.
Judge Chhabria’s market dilution analysis imagines that licensing payments from tech companies to copyright holders will preserve human creativity. So far, there isn’t much evidence to support this view, and we remain skeptical. It seems more likely that such licensing schemes will only serve to enrich large content owners, not empower average authors. Trickle-down licensing won’t meaningfully support most authors, and it certainly won’t protect future creators. By prioritizing the economic interests of incumbent publishers, studios, and platforms, this approach locks in existing revenue structures and discourages future authors’ experimentation with new creative tools.
If we want to preserve human authorship in the age of AI, we will need policies that balance support for past and future authors, as well as incumbent rightsholders. Fair use has historically done this by limiting substitutive works. Judge Chhabria’s theory would upend this balance by allowing claims of indirect substitution under a “market dilution” theory that only serves to entrench existing business interests.
Piracy and shadow libraries
The court seems unmoved, at least without further evidence, to judge Meta’s shadow libraries harshly:
Two other issues are relevant to the fourth factor. First, as noted above, is whether Meta’s use of shadow libraries benefited those libraries or their other users. If it did, then this would be relevant to the fourth factor. It would mean that Meta’s copying helped others acquire copyrighted works, potentially including the plaintiffs’ works, without paying for them (and without any indication that those other people were acquiring the works for fair use purposes). But although the plaintiffs discussed Meta’s use of shadow libraries at length, they did not argue that it had these effects or was relevant to the fourth factor beyond allowing Meta to get the books without paying. At the hearing, the plaintiffs’ counsel did suggest that, by using shadow libraries, Meta (and other companies like it) would reduce the stigma associated with shadow libraries and encourage more people to use them. It’s not clear whether this would matter in the overall analysis. But in any event, counsel conceded that the record contains no evidence of this dynamic playing out.
Second is the public benefit associated with Meta’s copying. The plaintiffs say that sanctioning Meta’s conduct would encourage piracy by incentivizing other LLM companies to pirate and to “support and defend” shadow libraries “that make stolen works available for free.” There is no evidence in the record that Meta (or any other LLM developer) is actively supporting or otherwise encouraging widespread use of shadow libraries. As for incentivizing other LLM developers to use shadow libraries, the plaintiffs again beg the question—whether LLM developers should have to pay for the books they use as training data is the issue addressed in this opinion (and, obviously, a fact-specific one that can’t be answered uniformly across the board).
Final Thoughts
Compared with the speculative and unprecedented arguments advanced in this ruling, we find that Judge Alsup’s decision offers a more grounded and sensible approach to fair use. Going forward, we hope more courts will follow Judge Alsup’s example and maintain a realistic assessment of transformativeness and market harm. Fair use should be flexible, but not so flexible that it can accommodate every speculative theory of harm.