
Books are Big AI’s Achilles Heel

Posted May 13, 2024

By Dave Hansen and Dan Cohen

Image of the Rijksmuseum by Michael D Beckwith. Image dedicated to the Public Domain.

Rapidly advancing artificial intelligence is remaking how we work and live, a revolution that will affect us all. While AI’s impact continues to expand, the operation and benefits of the technology are increasingly concentrated in a small number of gigantic corporations, including OpenAI, Google, Meta, Amazon, and Microsoft.

Challenging this emerging AI oligopoly seems daunting. The latest AI models now cost billions of dollars, beyond the budgets of startups and even elite research universities, which have often generated the new ideas and innovations that advance the state of the art.

But universities have a secret weapon that might level the AI playing field: their libraries. Computing power may be one important part of AI, but the other key ingredient is training data. Immense scale is essential for this data—but so is its quality.

Given their voracious appetite for text to feed their large language models, leading AI companies have taken all the words they can find, including from online forums, YouTube subtitles, and Google Docs. This is not exactly “the best that has been thought and said,” to use Matthew Arnold’s pointed phrase. In Big AI’s haphazard quest for quantity, quality has taken a back seat. The frequency of “hallucinations”—inaccuracies currently endemic to AI outputs—is cause for even greater concern.

The obvious way to rectify this lack of quality and tenuous relationship to the truth is by ingesting books. Since the advent of the printing press, authors have published well over 100 million books. These volumes, preserved for generations on the shelves of libraries, are perhaps the most sophisticated reflection of human thinking from the beginning of recorded history, holding within them some of our greatest (and worst) ideas. On average, they have exceptional editorial quality compared to other texts, capture a breadth and diversity of content, a vivid mix of styles, and use long-form narrative to communicate nuanced arguments and concepts.

The major AI vendors have sought to tap into this wellspring of human intelligence to power the artificial, although often through questionable methods. Some companies have turned to an infamous set of thousands of books, apparently retrieved from pirate websites without permission, called “Books3.” They have also sought licenses directly from publishers, using their massive budgets to buy what they cannot scavenge. Meta even considered purchasing one of the largest publishers in the world, Simon & Schuster.

As the bedrock of our shared culture, and as the possible foundation for better artificial intelligence, books are too important to flow through these compromised or expensive channels. What if there were a library-managed collection made available to a wide array of AI researchers, including at colleges and universities, nonprofit research institutions, and small companies as well as large ones?

Such vast collections of digitized books exist right now. Google, by pouring millions of dollars into its long-running book scanning project, has access to over 40 million books, a valuable asset they undoubtedly would like to keep exclusive. Fortunately, those digitized books are also held by Google’s partner libraries. Research libraries and other nonprofits have additional stockpiles of digitized books from their own scanning operations, derived from books in their own collections. Together, they represent a formidable aggregation of texts.

A library-led training data set of books would diversify and strengthen the development of AI. Digitized research libraries are more than large enough, and of substantially higher quality, to offer a compelling alternative to existing scattershot data sets. These institutions and initiatives have already worked through many of the most challenging copyright issues, at least for how fair use applies to nonprofit research uses such as computational analysis. Whether fair use also applies to commercial AI, or models built from iffy sources like Books3, remains to be seen.

Library-held digital texts come from lawfully acquired books—an investment of billions of dollars, it should be noted, just like those big data centers—and libraries are innately respectful of the interests of authors and rightsholders by accounting for concerns about consent, credit, and compensation. Furthermore, they have a public-interest disposition that can take into account the particular social and ethical challenges of AI development. A library consortium could distinguish between the different needs and responsibilities of academic researchers, small market entrants, and large commercial actors. 

If we don’t look to libraries to guide the training of AI on the profound content of books, we will see a reinforcement of the same oligopolies that rule today’s tech sector. Only the largest, most well-resourced companies will acquire these valuable texts, driving further concentration in the industry. Others will be prevented from creating imaginative new forms of AI based on the best that has been thought and said. As they have always done, by democratizing access libraries can support learning and research for all, ensuring that AI becomes the product of the many rather than the few.

Further reading on this topic: “Towards a Books Data Commons for AI Training,” by Paul Keller, Betsy Masiello, Derek Slater, and Alek Tarkowski.

This week, Authors Alliance celebrates its 10th anniversary with an event in San Francisco on May 17 (We still have space! Register for free here) titled “Authorship in an Age of Monopoly and Moral Panics,” where we will highlight obstacles and opportunities of new technology. This piece is part of a series leading up to the event.

Writing About Real People Update: Right of Publicity, Voice Protection, and Artificial Intelligence

Posted March 7, 2024
Photo by Jason Rosewell on Unsplash

Some of you may recall that Authors Alliance published our long-awaited guide, Writing About Real People, earlier this year. One of the major topics in the guide is the right of publicity—a right to control use of one’s own identity, particularly in the context of commercial advertising. These issues have been in the news a lot lately as generative AI poses new questions about the scope and application of the right of publicity. 

Sound-alikes and the Right of Publicity

One important right of publicity question in the genAI era concerns the increasing prevalence of “sound-alikes” created using generative AI systems. The issue of AI-generated voices that mimic real people came to the public’s attention with the apparently convincing “Heart on My Sleeve” song imitating Drake and the Weeknd, and tools that facilitate creating songs imitating popular singers have since increased in number and availability.

AI-generated soundalikes are a particularly interesting use of this technology when it comes to the right of publicity because one of the seminal right of publicity cases, taught in law schools and mentioned in primers on the topic, concerns a sound-alike from the analog world. In 1986, the Ford Motor Company hired an advertising agency to create a TV commercial. The agency obtained permission to use “Do You Wanna Dance,” a song Bette Midler had famously covered, in its commercial. But when the ad agency approached Midler about actually singing the song for the commercial, she refused. The agency then hired a former backup singer of Midler’s to record the song, apparently asking the singer to imitate Midler’s voice in the recording. A federal court found that this violated Midler’s right of publicity under California law, even though her voice was not actually used. Extending this holding to AI-generated voices seems logical and straightforward—it is not about the precise technology used to create or record the voice, but about the end result the technology is used to achieve. 

Right of Publicity Legislation

The right of publicity is a matter of state law. In some states, like California and New York, the right of publicity is established via statute, and in others, it’s a matter of common law (or judge-made law). In recent months, state legislatures have proposed new laws that would codify or expand the right of publicity. Similarly, many have called for the establishment of a federal right of publicity, specifically in the context of harms caused by the rise of generative AI. One driving force behind calls for the establishment of a federal right of publicity is the patchwork nature of state right of publicity laws: in some states, the right of publicity extends only to someone’s name, image, likeness, voice, and signature, but in others, it’s much broader. While AI-generated content and the ways in which it is being used certainly pose new challenges for courts considering right of publicity violations, we are skeptical that new legislation is the best solution. 

In late January, the No Artificial Intelligence Fake Replicas and Unauthorized Duplications Act of 2024 (or “No AI FRAUD Act”) was introduced in the House of Representatives. The No AI FRAUD Act would create a property-like right in one’s voice and likeness, which is transferable to other parties. It targets voice “cloning services” and mentions the “Heart on My Sleeve” controversy specifically. But civil society organizations and advocates for free expression have raised alarm about the ways in which the bill would make it easier for creators to actually lose control over their own personality rights while also impinging on others’ First Amendment rights due to its overbreadth and the property-like nature of the right it creates. While the No AI FRAUD Act contains language stating that the First Amendment is a defense to liability, it’s unclear how effective this would be in practice (and as we explain in the Writing About Real People Guide, the First Amendment is always a limitation on laws affecting freedom of expression). 

The Right of Publicity and AI-Generated Content

In the past, the right of publicity has been described as “name, image, and likeness” rights. What is interesting about AI-generated content and the right of publicity is that a person’s likeness can be used in a more complete way than ever before. In some cases, both their appearance and voice are imitated, associated with their name, and combined in a way that makes the imitation more convincing. 

What is different about this iteration of right of publicity questions is the actors behind the production of the soundalikes and imitations, and, to a lesser extent, the harms that might flow from these uses. A recent use of a different celebrity’s likeness in connection with an advertisement is instructive on this point. Earlier this year, advertisements emerged on various platforms featuring an AI-generated Taylor Swift participating in a Le Creuset cookware giveaway. These ads contained two separate layers of deceptiveness: most obviously, that Swift was AI-generated and did not personally appear in the ad, but more bafflingly, that they were not Le Creuset ads at all. The ads were part of a scam whereby users might pay for cookware they would never receive, or enter credit card details which could then be stolen or otherwise used for improper purposes. Compared to more traditional conceptions of advertising, the unfair advantages and harms caused by the use of Swift’s voice and likeness are much more difficult to trace. Taylor Swift’s likeness and voice were appropriated by scammers to trick the public into thinking they were interacting with Le Creuset advertising. 

It may be that the right of publicity as we know it (and as we discuss it in the Writing About Real People Guide) is not well-equipped to deal with these kinds of situations. But it seems to us that codifying the right of publicity in federal law is not the best approach. Just as Bette Midler had a viable right of publicity claim under California law, Taylor Swift would likely have a viable claim against Le Creuset if her likeness had been used by that company in connection with commercial advertising. The problem is not the “patchwork of state laws,” but that this kind of doubly-deceptive advertising is not commercial advertising at all. On a practical level, it’s unclear what party could even be sued over this kind of use. Certainly not Le Creuset. And it seems to us unfair to say that the creator of the AI technology should be left holding the bag just because someone used it for fraudulent purposes. The real fraudsters—anonymous but likely not impossible to track down—are the ones who can and should be pursued under existing fraud laws. 

Authors Alliance has said elsewhere that reforms to copyright law cannot be the solution to any and all harms caused by generative AI. The same goes for the intellectual property-like right of publicity. Sensible regulation of platforms, stronger consumer protection laws, and better means of detecting and exposing AI-generated content are possible solutions to the problems that the use of AI-generated celebrity likenesses have brought about. To instead expand intellectual property rights under a federal right of publicity statute risks infringing on our First Amendment freedoms of speech and expression.

Authors Alliance Submits Long-Form Comment to Copyright Office in Support of Petition to Expand Existing Text and Data Mining Exemption 

Posted January 29, 2024
Photo by Simona Sergi on Unsplash

Last month, Authors Alliance submitted detailed comments in response to the Copyright Office’s Notice of Proposed Rulemaking in support of our petition to expand the existing Digital Millennium Copyright Act (DMCA) exemptions that enable text and data mining (TDM) as part of this year’s §1201 rulemaking cycle.

To recap: our expansion petitions ask the Copyright Office to modify the existing TDM exemption so that researchers who assemble corpora of ebooks or films on which to conduct text and data mining are able to share that corpus with other academic researchers, where this second group of researchers qualifies under the exemption. Under the current exemption, academic researchers are only able to share their corpora with other qualified researchers for purposes of “collaboration and verification.” This simple change would eliminate the need for duplicative efforts to remove digital locks from ebooks and films, a time and resource-intensive process, broadening the group of academic researchers who are able to use the exemption. 

Our comment argues that the existing TDM exemption has begun to enable valuable digital humanities research and teaching, but that the proposed expansion would go much further towards enabling this research and helping TDM researchers reach their goals. The comment is accompanied by 13 letters of support from researchers, educators, and funding organizations, highlighting the research that has been done in reliance on the exemption, and explaining why this expansion is necessary. Our thanks go out to our stellar clinical team at UC Berkeley’s Samuelson Law, Technology & Public Policy Clinic—law students Mathew Cha and Zhudi Huang, and clinical supervisor Jennifer Urban—for writing and submitting this comment on our behalf. We are also grateful to our co-petitioners, the Library Copyright Alliance and American Association of University Professors, for their support on this comment. 

Ambiguity in “Collaboration”

One reason the expansion is necessary is the uncertainty over what constitutes “collaboration” under the existing exemption. Researchers have open questions about what level of individual contribution to a project would make researchers “collaborators” under the exemption. As our comment explains, collaboration can come in a number of different forms, from “formal collaborations under the auspice of a grant, [to] ad hoc collaborations that result from two teams discovering that they are working on similar material to the same ends, or even discussions at conferences between members of a loose network of scholars working on the same broad set of interests.” But it is not clear which of these activities is “collaboration” for the purposes of the exemption. And this uncertainty has had a chilling effect on the socially valuable research made possible by the exemption. 

Costly Corpora Creation 

Our comment also highlights the vast costs that go into creating a usable corpus for TDM research. Institutions whose researchers are conducting TDM research pursuant to the exemption must lawfully own the works in question, or license them through a license that is not time-limited. But these costs pale in comparison to the required computing resources—a cost which is compounded by the exemption’s strict security requirements—and human labor involved in bypassing technical protection measures and assembling a corpus. Moreover, it’s important to recognize that there is simply not a tremendous amount of grant funding or even institutional support available to TDM researchers. 

Because corpora are so costly to assemble and create, we believe it to be reasonable to permit researchers to share their corpora with researchers at other institutions who want to conduct independent TDM research on these corpora. As the exemption currently stands, researchers interested in pre-existing corpora must duplicate the efforts of the previous researchers, incurring massive costs along the way. We’ve already seen indications that these costs can lead researchers to avoid certain research questions and areas of study altogether. As our comment explains, this “duplicative circumvention” can be avoided by changing the language of the exemption to permit corpora sharing between qualified researchers at separate institutions. 

Equity Issues

Worse still, not all institutions are able to bear these expenses. Our comment explains how the current exemption’s prohibition on sharing beyond collaboration and verification—and consequent duplication of prior labor—“create[s] barriers that can prevent smaller and less-well-resourced institutions from conducting TDM research at all.” This creates inequity in what type of institutions can support TDM projects, and what types of researchers can conduct them. The unfortunate result has been that large institutions that have “the resources to compensate and maintain technical staff and infrastructure” are able to support TDM research under the exemption, while smaller institutions are not. 

Values of Corpora Sharing

Our comment explains how allowing limited sharing of corpora under the exemption would go a long way towards lowering barriers to entry for TDM research and ameliorating the equity issues described above. Since digital humanities is already an under-resourced field, the effects of enabling researchers to share their corpora with other academic researchers could be quite profound. 

Researchers who wrote letters in support of the petition described a multitude of exciting projects, and have built “a rich set of corpora to study, such as a collection of fiction written by African American writers, a collection of books banned in the United States, and a curated corpus of movies and television with an ‘emphasis on racial, ethnic, sexual, and gender diversity.’” Many of those who wrote letters in support of our petition recounted requests they’ve gotten from other researchers to use their corpora, and who were frustrated that the exemption’s prohibition on non-collaborative sharing and their limited capacity for collaboration prevented them from sharing these corpora. 

Allowing new researchers with new research questions to study these corpora could reveal new insights about these bodies of work. As we explain, “in the same way a single literary work or motion picture can evince multiple meanings based on the lens of analysis used, when different researchers study one corpus, they are able to pose different research questions and apply different methodologies, ultimately revealing new and original findings . . . . Enabling broader sharing and thus, increasing the number of researchers that can study a corpus, will allow a body of works to be better understood beyond the initial ‘limited set of research questions.’”

Fair Use

The rulemaking process for exemptions to DMCA § 1201’s prohibition on breaking digital locks requires that the proposed activity be a fair use. In the 2021 proceedings, the Office recognized TDM for research and teaching purposes as a fair use. Because the expansion we’re seeking is relatively minor, our comment explains that the types of uses we are asking the Office to permit researchers to make are also fair uses. Our comment explains that each of the four fair use factors favors fair use in the context of the proposed expansion. We further explain why the enhanced sharing the expansion would provide does not harm the market for the original works under factor four: because institutions must lawfully own (or license under a non-time-limited license) the works that their researchers wish to conduct TDM on, it makes no difference from a market standpoint whether researchers bypass technical protection measures themselves, or share another institution’s corpus. Copyright holders are not harmed when researchers at one institution share a corpus created by researchers at another institution, since both institutions must purchase the works in order to be eligible under the exemption. 

What’s Next?

If there are parties that oppose our proposed expansion, they have until February 20th to submit opposition comments to the Copyright Office. Then, on March 19th, our reply comments to any opposition comments will be due. We will keep our readers and members apprised as the process continues to move forward.

Hachette v. IA Amicus Briefs: Highlight on Privacy and Controlled Digital Lending

Posted January 16, 2024

Photo by Matthew Henry on Unsplash

Over the holidays you may have read about the amicus brief we submitted in the Hachette v. Internet Archive case about library controlled digital lending (CDL), which we’ve been tracking for quite some time. Our brief was one of 11 amicus briefs filed to explain to the court the broader implications of the case. Internet Archive has already published a short overview of the others (which together represent 20 organizations and 298 individuals, mostly librarians and legal experts). 

I thought it would be worthwhile to highlight some of the important issues identified by these amici that did not receive much attention earlier in the lawsuit. This post is about the reader’s privacy issues raised by several amici in support of Internet Archive and CDL. Later this week we’ll have another post focused on briefs and arguments about why the district court inappropriately construed Internet Archive’s lending program as “commercial.” 

Privacy and CDL 

One aspect of library lending that’s really special is the privacy that readers are promised when they check out a book. Most states have special laws that require libraries to protect readers’ privacy, something that libraries enthusiastically embrace (e.g., see the ALA Library Bill of Rights) as a way to help foster free inquiry and learning among readers. Unlike Amazon, which keeps and tracks detailed reader information when you buy an ebook–dates, times, what page you spent time on, what you highlighted–libraries strive to minimize the data they keep on readers to protect their privacy. This protects readers from data breaches and other third-party demands for that data. 

The brief from the Center for Democracy and Technology, Library Freedom Project, and Public Knowledge spends nearly 40 pages explaining why the court should consider reader privacy as part of its fair use calculus. Represented by Jennifer Urban and a team of students at the Samuelson Law, Technology and Public Policy Clinic at UC Berkeley Law (disclosure: the clinic represents Authors Alliance on some matters, and we are big fans of their work), the brief masterfully explains the importance of this issue. From their brief, below is a summary of the argument (edited down for length): 

The conditions surrounding access to information are important. As the Supreme Court has repeatedly recognized, privacy is essential to meaningful access to information and freedom of inquiry. But in ruling against the Internet Archive, the district court did not consider one of CDL’s key advantages: it preserves libraries’ ability to safeguard reader privacy. When employing CDL, libraries digitize their own physical materials and loan them on a digital-to-physical, one-to-one basis with controls to prevent redistribution or sharing. CDL provides extensive, interrelated benefits to libraries and patrons, such as increasing accessibility for people with disabilities or limited transportation, improving access to rare and fragile materials, facilitating interlibrary resource sharing—and protecting reader privacy. For decades, libraries have protected reader privacy, as it is fundamental to meaningful access to information. Libraries’ commitment is reflected in case law, state statutes, and longstanding library practices. CDL allows libraries to continue protecting reader privacy while providing access to information in an increasingly digital age. Indeed, libraries across the country, not just the Internet Archive, have deployed CDL to make intellectual materials more accessible. And while increasing accessibility, these CDL systems abide by libraries’ privacy protective standards. 

Commercial digital lending options, by contrast, fail to protect reader privacy; instead, they threaten it. These options include commercial aggregators—for-profit companies that “aggregate” digital content from publishers and license access to these collections to libraries and their patrons—and commercial e-book platforms, which provide services for reading digital content via e-reading devices, mobile applications (“apps”), or browsers. In sharp contrast to libraries, these commercial actors track readers in intimate detail. Typical surveillance includes what readers browse, what they read, and how they interact with specific content—even details like pages accessed or words highlighted. The fruits of this surveillance may then be shared with or sold to third parties. Beyond profiting from an economy of reader surveillance, these commercial actors leave readers vulnerable to data breaches by collecting and retaining vast amounts of sensitive reader data. Ultimately, surveilling and tracking readers risks chilling their desire to seek information and engage in the intellectual inquiry that is essential to American democracy. 

Readers should not have to choose to either forfeit their privacy or forgo digital access to information; nor should libraries be forced to impose this choice on readers. CDL provides an ecosystem where all people, including those with mobility limitations and print disabilities, can pursue knowledge in a privacy-protective manner. . . . 

An outcome in this case that prevents libraries from relying on fair use to develop and deploy CDL systems would harm readers’ privacy and chill access to information. But an outcome that preserves CDL options will preserve reader privacy and access to information. The district court should have more carefully considered the socially beneficial purposes of library-led CDL, which include protecting patrons’ ability to access digital materials privately, and the harm to copyright’s public benefit of disallowing libraries from using CDL. Accordingly, the district court’s decision should be reversed.

The court below treated CDL copies and licensed ebook copies as essentially equivalent and concluded that the CDL copies IA provided acted as substitutes for licensed copies. Authors Alliance’s amicus brief points out some of the ways that CDL copies actually differ significantly from licensed copies. It seems to me that this additional point about protection of reader privacy–and the protection of free inquiry that comes with it–is exactly the kind of distinguishing public benefit that the lower court should have considered but did not. 

You can read the full brief from the Center for Democracy and Technology, Library Freedom Project, and Public Knowledge here. 

Licensing research content via agreements that authorize uses of artificial intelligence

Posted January 10, 2024
Photo by Hal Gatewood on Unsplash

This is a guest post by Rachael G. Samberg, Timothy Vollmer, and Samantha Teremi, professionals within the Office of Scholarly Communication Services at UC Berkeley Library. 

On academic and library listservs, there has emerged an increasingly fraught discussion about licensing scholarly content when scholars’ research methodologies rely on artificial intelligence (AI). Scholars and librarians are rightfully concerned that non-profit educational research methodologies like text and data mining (TDM) that can (but do not necessarily) incorporate usage of AI tools are being clamped down upon by publishers. Indeed, libraries are now being presented with content license agreements that prohibit AI tools and training entirely, irrespective of scholarly purpose. 

Conversely, publishers, vendors, and content creators—a group we’ll call “rightsholders” here—have expressed valid concerns about how their copyright-protected content is used in AI training, particularly in a commercial context unrelated to scholarly research. Rightsholders fear that their livelihoods are being threatened when generative AI tools are trained and then used to create new outputs that they believe could infringe upon or undermine the market for their works.

Within the context of non-profit academic research, rightsholders’ fears about allowing AI training, and especially non-generative AI training, are misplaced. Newly-emerging content license agreements that prohibit usage of AI entirely, or charge exorbitant fees for it as a separately-licensed right, will be devastating for scientific research and the advancement of knowledge. Our aim with this post is to empower scholars and academic librarians with legal information about why those licensing outcomes are unnecessary, and equip them with alternative licensing language to adequately address rightsholders’ concerns.

To that end, we will: 

  1. Explain the copyright landscape underpinning the use of AI in research contexts;
  2. Address ways that AI usage can be regulated to protect rightsholders, while outlining opportunities to reform contract law to support scholars; and 
  3. Conclude with practical language that can be incorporated into licensing agreements, so that libraries and scholars can continue to achieve licensing outcomes that satisfy research needs.

Our guidance is based on legal analysis as well as our views as law and policy experts working within scholarly communication. While your mileage or opinions may vary, we hope that the explanations and tools we provide offer a springboard for discussion within your academic institutions or communities about ways to approach licensing scholarly content in the age of AI research.

Copyright and AI training

As we have recently explored in presentations and posts, the copyright law and policy landscape underpinning the use of AI models is complex, and regulatory decision-making in the copyright sphere will have ramifications for global enterprise, innovation, and trade. A much-discussed group of lawsuits and a parallel inquiry from the U.S. Copyright Office raise important and timely legal questions, many of which we are only beginning to understand. But there are two precepts that we believe are clear now, and that bear upon the non-profit education, research, and scholarship undertaken by scholars who rely on AI models. 

First, as the UC Berkeley Library has explained in greater detail to the Copyright Office, training artificial intelligence is a fair use—and particularly so in a non-profit research and educational context. (For other similar comments provided to the Copyright Office, see, e.g., the submissions of Authors Alliance and Project LEND). Maintaining this treatment as fair use is essential to protecting research, including text and data mining (TDM). 

TDM refers generally to a set of research methodologies that rely on computational tools, algorithms, and automated techniques to extract revelatory information from large sets of unstructured or thinly structured digital content. Not all TDM methodologies require AI models. For instance, a researcher can find the words that 20th-century fiction authors use to describe happiness simply by searching a corpus for synonyms and variations of words like “happiness” or “mirth,” with no AI involved. But to find examples of happy characters in those books, a researcher would likely need to apply what are called discriminative modeling methodologies, which first train AI on examples of the qualities a happy character demonstrates or exhibits, so that the AI can then search for occurrences within a larger corpus of works. This latter TDM process involves AI, but not generative AI; and scholars have relied non-controversially on this kind of non-generative AI training within TDM for years. 
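The contrast between these two methodologies can be illustrated with a minimal sketch in Python. Everything here (the synonym list, the toy training examples, and the tiny bag-of-words classifier standing in for a discriminative model) is invented for illustration; real TDM research would use far larger corpora and far more sophisticated models.

```python
import re
from collections import Counter

# --- Method 1: pure keyword search (no AI involved) ---
HAPPINESS_TERMS = {"happiness", "happy", "mirth", "joy", "delight"}

def find_happiness_passages(corpus):
    """Return passages containing any listed synonym of 'happiness'."""
    hits = []
    for passage in corpus:
        words = set(re.findall(r"[a-z]+", passage.lower()))
        if words & HAPPINESS_TERMS:
            hits.append(passage)
    return hits

# --- Method 2: a tiny discriminative model (non-generative AI) ---
# Trained on labeled examples, then applied to a larger corpus.
def bag_of_words(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def train_centroids(labeled_examples):
    """Build one word-count 'centroid' per label from (text, label) pairs."""
    centroids = {}
    for text, label in labeled_examples:
        centroids.setdefault(label, Counter()).update(bag_of_words(text))
    return centroids

def classify(text, centroids):
    """Assign the label whose centroid shares the most word mass with the text."""
    words = bag_of_words(text)
    def overlap(label):
        return sum(min(words[w], centroids[label][w]) for w in words)
    return max(centroids, key=overlap)
```

Method 1 is pure string matching: it can only find passages containing the listed words. Method 2 learns from labeled examples, so it can flag passages whose wording resembles the training examples even when no listed synonym appears, which is the kind of non-generative AI training at issue in the TDM cases.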

Previous court cases like Authors Guild v. HathiTrust, Authors Guild v. Google, and A.V. ex rel. Vanderhye v. iParadigms have addressed fair use in the context of TDM and confirmed that the reproduction of copyrighted works to create and conduct text and data mining on a collection of copyright-protected works is a fair use. These cases further hold that making derived data, results, abstractions, metadata, or analysis from the copyright-protected corpus available to the public is also fair use, as long as the research methodologies or data distribution processes do not re-express the underlying works to the public in a way that could supplant the market for the originals. 

For the same reasons that the TDM processes constitute fair use of copyrighted works in these contexts, the training of AI tools to do that text and data mining is also fair use. This is in large part because of the same transformativeness of the purpose (under Fair Use Factor 1) and because, just like “regular” TDM that doesn’t involve AI, AI training does not reproduce or communicate the underlying copyrighted works to the public (which is essential to the determination of market supplantation for Fair Use Factor 4). 

But, while AI training is no different from other TDM methodologies in terms of fair use, there is an important distinction to make between the inputs for AI training and generative AI’s outputs. The overall fair use of generative AI outputs cannot always be predicted in advance: The mechanics of generative AI models’ operations suggest that there are limited instances in which generative AI outputs could indeed be substantially similar to (and potentially infringing of) the underlying works used for training; this substantial similarity is possible typically only when a training corpus is rife with numerous copies of the same work. And a recent case filed by the New York Times addresses this potential similarity problem with generative AI outputs.  

Yet, training inputs should not be conflated with outputs: The training of AI models by using copyright-protected inputs falls squarely within what courts have already determined in TDM cases to be a transformative fair use. This is especially true when that AI training is conducted for non-profit educational or research purposes, as this bolsters its status under Fair Use Factor 1, which considers both transformativeness and whether the act is undertaken for non-profit educational purposes. 

Were a court to suddenly determine that training AI was not fair use, and AI training was subsequently permitted only on “safe” materials (like public domain works or works for which training permission has been granted via license), this would curtail freedom of inquiry, exacerbate bias in the nature of research questions able to be studied and the methodologies available to study them, and amplify the views of an unrepresentative set of creators given the limited types of materials available with which to conduct the studies.

The second precept we uphold is that scholars’ ability to access the underlying content to conduct fair use AI training should be preserved with no opt-outs from the perspective of copyright regulation. 

The fair use provision of the Copyright Act does not afford copyright owners a right to opt out of allowing other people to use their works in any other circumstance, for good reason: If content creators were able to opt out of fair use, little content would be available freely to build upon. Uniquely allowing fair use opt-outs only in the context of AI training would be a particular threat for research and education, because fair use in these contexts is already becoming an out-of-reach luxury even for the wealthiest institutions. What do we mean?

In the U.S., the prospect of “contractual override” means that, although fair use is statutorily provided for, private parties like publishers may “contract around” fair use by requiring libraries to negotiate for otherwise lawful activities (such as conducting TDM or training AI for research). Academic libraries are forced to pay significant sums each year to try to preserve fair use rights for campus scholars through the database and electronic content license agreements that they sign. This override landscape is particularly detrimental for TDM research methodologies, because TDM research often requires use of massive datasets with works from many publishers, including copyright owners who cannot be identified or who are unwilling to grant such licenses. 

So, if the Copyright Office or Congress were to enable rightsholders to opt out of having their works fairly used for training AI for scholarship, then academic institutions and scholars would face even greater hurdles in licensing content for research. Rightsholders might opt out of allowing their works to be used for AI training fair uses, and then turn around and charge AI usage fees to scholars (or libraries)—essentially licensing back fair uses for research. 

Fundamentally, this undermines lawmakers’ public interest goals: It creates a risk of rent-seeking or anti-competitive behavior through which a rightsholder can demand additional remuneration or withhold granting licenses for activities generally seen as being good for public knowledge or that rely on exceptions like fair use. And from a practical perspective, allowing opt-outs from fair uses would impede scholarship by or for research teams who lack grant or institutional funds to cover these additional licensing expenses; penalize research in or about underfunded disciplines or geographical regions; and result in bias as to the topics and regions that can be studied. 

“Fair use” does not mean “unregulated” 

Although training AI for non-profit scholarly uses is fair use from a copyright perspective, we are not suggesting AI training should be unregulated. To the contrary, we support guardrails because training AI can carry risk. For example, researchers have been able to use generative AI like ChatGPT to solicit personal information by bypassing platform safeguards.

Issues of privacy, ethics, and the right of publicity (which governs uses of people’s voices, images, and personas) should be addressed through the adoption of best practices, private ordering, and other regulation. 

For instance, as to best practices, scholar Matthew Sag has suggested preliminary guidelines to avoid violations of privacy and the right to publicity. First, he recommends that AI platforms avoid training their large language models on duplicates of the same work. This would reduce the likelihood that the models could produce copyright-infringing outputs (due to memorization concerns), and it would also lessen the likelihood that any content containing potentially private or sensitive information would be outputted from having been fed into the training process multiple times. Second, Sag suggests that AI platforms engage in “reinforcement learning through human feedback” when training large language models. This practice could cut down on privacy or rights of publicity concerns by involving human feedback at the point of training, instead of leveraging filtering at the output stage.  
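Sag’s first recommendation, deduplicating the training corpus, can be sketched in a few lines. This is a simplified illustration using exact hashing of normalized text; production pipelines typically also perform near-duplicate detection (for instance, MinHash-based methods), which this sketch does not attempt.

```python
import hashlib
import re

def normalize(text):
    """Collapse whitespace and case so trivially reformatted copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(documents):
    """Keep one copy of each distinct document, keyed by a hash of its normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Removing repeated copies before training reduces the chance that a model memorizes, and later reproduces, any single work, which addresses both the copyright and the privacy concerns Sag identifies.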

Private ordering would rely on platforms or communities to implement appropriate policies governing privacy issues, rights of publicity, and ethical concerns. For example, the UC Berkeley Library has created policies and practices (called “Responsible Access Workflows”) to help it decide whether—and how—special collection materials may be digitized and made available online. Our Responsible Access Workflows require review of collection materials across copyright, contracts, privacy, and ethics parameters. Through careful policy development, the Library applies an ethics-of-care approach to making available online collection content that raises ethical concerns. Even if content is not shared openly online, that does not mean it is unavailable to researchers in person; we have simply decided not to make it available in digital formats with lower friction for use. We aim to provide transparent information about our decision-making, and researchers must make informed decisions about how to use the collections, whether or not they are using them in service of AI.

And finally, concerning regulations, jurisdictions like the EU have recently introduced an AI training framework that requires, among other things, the disclosure of source content, and that gives content creators the right to opt out of having their works included in training sets, except when the AI training is being done for research purposes by research organizations, cultural heritage institutions, and their members or scholars. United States agencies could consider implementing similar regulations here. 

But from a copyright perspective, and within non-profit academic research, fair use in AI training should be preserved without the opportunity to opt out, for the reasons we discuss above. Such an approach to copyright would also be consistent with the distinction the EU has made for AI training in academic settings, as the EU’s Digital Single Market Directive bifurcates practices outside the context of scholarly research.

While we favor regulation that preserves fair use, it is also important to note that merely preserving fair use rights in scholarly contexts for training AI is not the end of the story in protecting scholarly inquiry. So long as the United States permits contractual override of fair uses, libraries and researchers will continue to be at the mercy of publishers aggregating and controlling what may be done with the scholarly record, even if authors dedicate their content to the public domain or apply a Creative Commons license to it. So in our view, the real work that should be done is pursuing legislative or regulatory arrangements like those in the approximately 40 other countries that have curtailed the ability of contracts to abrogate fair use and other limitations and exceptions to copyright within non-profit scholarly and educational uses. This is a challenging, but important, mission.

Licensing guidance in the meantime 

While the statutory, regulatory, and private governance landscapes are being addressed, libraries and scholars need ways to preserve usage rights for content when training AI as part of their TDM research methodologies. We have developed sample license language intended to address rightsholders’ key concerns while maintaining scholars’ ability to train AI in text and data mining research. We drafted this language to be incorporated into amendments to existing licenses that fail to address TDM, or into stand-alone TDM and AI licenses; however, it is easily adaptable into agreements-in-chief (and we encourage adapting it). 

We are certain our terms can continue to be improved upon over time or be tailored for specific research needs as methodologies and AI uses change. But in the meantime, we think they are an important step in the right direction.

With that in mind, it is important to understand that in contracts applying U.S. law, more specific language controls over general language. So, even if a license agreement contains a clause preserving fair use, if that clause is later followed by a TDM clause restricting how TDM can be conducted (and whether AI can be used), then the more specific language governs TDM and AI usage under the agreement. This means that libraries and scholars must be mindful when negotiating TDM and AI clauses, as they may be contracting themselves out of rights they would otherwise have had under fair use. 

So, how can a library or scholar negotiate sufficient AI usage rights while acknowledging the concerns of publishers? We believe publishers have attempted to curb AI usage because they are concerned about: (1) the security of their licensed products, and the fear that researchers will leak or release content behind their paywall; and (2) AI being used to create a competing product that could substitute for the original licensed product and undermine their share of the market. While these concerns are valid, they reflect longstanding fears about users’ potential misuse of licensed materials in which they do not hold copyright. But publishers are already able to—and do—impose contractual provisions disallowing the creation of derivative products and the systematic sharing of licensed content with third parties, so additionally banning the use of AI in doing so is, in our opinion, unwarranted.

We developed our sample licensing language to precisely address these concerns by specifying in the grant of license that research results may be used and shared with others in the course of a user’s academic or non-profit research “except to the extent that doing so would substantially reproduce or redistribute the original Licensed Materials, or create a product for use by third parties that would substitute for the Licensed Materials.” Our language also imposes reasonable security protections in the research and storage process to quell fears of content leakage. 

Perhaps most importantly, our sample licensing language preserves the right to conduct TDM using “machine learning” and “other automated techniques” by expressly including these phrases in the definition for TDM, thereby reserving AI training rights (including as such AI training methodologies evolve), provided that no competing product or release of the underlying materials is made. 

The licensing road ahead

As legislation and standards around AI continue to develop, we hope to see express contractual allowance for AI training become the norm in academic licensing. Though our licensing language will likely need to adapt to and evolve with policy changes and research or technological advancements over time, we hope the sample language can now assist other institutions in their negotiations, and help set a licensing precedent so that publishers understand the importance of allowing AI training in non-profit research contexts. While a different legislative and regulatory approach may be appropriate in the commercial context, we believe that academic research licenses should preserve the right to incorporate AI, especially without additional costs being passed to subscribing institutions or individual users, as a fundamental element of ensuring a diverse and innovative scholarly record.

Authors Alliance Submits Amicus Brief to the Second Circuit in Hachette Books v. Internet Archive

Posted December 21, 2023
Photo by Dylan Dehnert on Unsplash

We are thrilled to announce that we’ve submitted an amicus brief to the Second Circuit Court of Appeals in Hachette Books v. Internet Archive—the case about whether controlled digital lending is a fair use—in support of the Internet Archive. Authored by Authors Alliance Senior Staff Attorney Rachel Brooke, the brief reprises many of the arguments we made in our amicus brief in the district court proceedings, elaborates on why and how the lower court got it wrong, and explains why the case matters for our members and other authors who write to be read.

The Case

We’ve been writing about this case for years—since the complaint was first filed back in 2020. But to recap: a group of trade publishers sued the Internet Archive in federal court in the Southern District of New York over (among other things) the legality of its controlled digital lending (CDL) program. The publishers argued that the practice infringed their copyrights, and Internet Archive defended its project on the grounds that it was fair use. We submitted an amicus brief in support of IA and CDL (which we have long supported as a fair use) to the district court, explaining that copyright is about protecting authors, and many authors strongly support CDL.

The case finally went to oral argument before a judge in March of this year. Unfortunately, the judge ruled against Internet Archive, finding that each of the fair use factors favored the publishers. Internet Archive indicated that it planned to appeal, and we announced that we planned to support them in those efforts. Now, the case is before the Second Circuit Court of Appeals. After Internet Archive filed its opening brief last week, we (and other amici) filed our briefs in support of a reversal of the lower court’s decision.

Our Brief

Our amicus brief argues, in essence, that the district court judge failed to adequately consider the interests of authors. While the commercial publishers in the case did not support CDL, those publishers’ interests do not always align with authors’, and they certainly do not speak for all authors. We conducted outreach to authors, including launching a CDL survey, and uncovered a diversity of views on CDL—most of them extremely positive. We offered these authors’ perspectives to show the court that many authors do support CDL, contrary to the representations of the publishers. Since copyright is about incentivizing new creation for the benefit of the public and protecting author interests, we felt these views were important for the Second Circuit to hear. 

We also sought to explain how the district court judge got it wrong when it comes to fair use. One of the key findings in the lower court decision was that loans of CDL scans are direct substitutes for loans of licensed ebooks. We explained that this is not the case: a CDL scan is not the same thing as an ebook; the two look different and have different functions and features. And CDL scans can serve authors conducting research in some key ways that licensed ebooks cannot. Out-of-print books and older editions of books, for example, are often available as CDL scans but not as licensed ebooks.

Another issue from the district court opinion that we addressed was the judge’s finding that IA’s use of the works in question was “commercial.” We strongly disagreed with this conclusion: borrowing a CDL scan from IA’s Open Library is free, and the organization—which is also a nonprofit—actually bears a lot of expenses related to digitization. Moreover, the publishers had failed to establish any concrete financial harm they had suffered as a result of IA’s CDL program. We discussed a recent lawsuit in the D.C. Circuit, ASTM v. PRO, to further push back on the district court’s conclusion on commerciality. 

You can read our brief for yourself here, or find it embedded at the bottom of this post. In the new year, you can expect another post or two with more details about our amicus brief and the other amicus briefs that have been, or soon will be, submitted in this case.

What’s Next?

Earlier this week, the publishers proposed that they file their own brief on March 15, 2024—91 days after Internet Archive filed its opening brief. The court’s rules stipulate that any amici supporting the publishers file their briefs within seven days of the publishers’ filing. Then, the parties can decide to submit reply briefs and will notify the court of their intent to do so. Finally, the parties can choose to request oral argument, though the court may still decide the case “on submission,” i.e., without oral argument. If the case does proceed to oral argument, a three-judge panel will hear from attorneys for each side before rendering its decision. We expect briefing to extend into mid-2024, and it can take quite a while for appeals courts to actually hand down their decisions. We’ll keep our readers apprised of any updates as the case moves forward.


Authors Alliance Releases New Legal Guide to Writing About Real People

Posted December 5, 2023

We are delighted to announce the publication of our brand new guide, the Authors Alliance Guide to Writing About Real People, a legal guide for authors writing nonfiction works about real people. The guide was written by students in two clinical teams at the UC Berkeley Samuelson Law and Public Policy Clinic—Lily Baggott, Jameson Davis, Tommy Ferdon, Alex Harvey, Emma Lee, and Daniel Todd—as well as clinical supervisors Jennifer Urban and Gabrielle Daley, along with Authors Alliance Senior Staff Attorney Rachel Brooke. The guide was edited by Executive Director Dave Hansen and former Executive Director Brianna Schofield. This long list of names is a testament to the fact that it took a village to create this guide, and we are so excited to finally share it with our members, allies, and any and all authors who need it. You can read and download our guide here.

On Thursday, we are hosting a webinar about our guide, where Authors Alliance staff will share more about what went into producing it, the partners and supporters who made it possible, and the particulars of the guide’s contents. Sign up here!

The Writing About Real People guide covers several different legal issues that can arise for authors writing about real people in nonfiction books like memoirs, biographies, and other narrative nonfiction projects. The issues it addresses are “causes of action” (or legal theories someone might sue under) based on state law. The requirements and considerations involved vary from state to state, so the guide highlights trends and commonalities among states. Throughout the guide, we emphasize that even though these causes of action might sound scary, the First Amendment to the U.S. Constitution in most cases empowers authors to write freely about topics of their choosing. The causes of action in this guide are exceptions to that rule, and each is limited in its reach and scope by the First Amendment’s guarantees. 

False Statements and Portrayals

The first section in the Writing About Real People guide concerns false statements and portrayals. This encompasses two different causes of action: defamation and false light. 

You have probably heard of defamation: it’s one of the most common causes of action related to writing about a real person. Defamation occurs when someone makes a false statement about another person that injures that person’s reputation, and the statement is made with some degree of “fault.” The level of fault required turns on what kind of person the statement is made about. For public people—people with some renown or governmental authority—the speaker must act with “actual malice,” meaning knowledge of falsity or reckless disregard as to whether the statement is true. But for private people, the speaker need only be negligent as to whether the statement was true, meaning that the speaker failed to take an ordinary amount of care in verifying the statement’s veracity. An author might expose themselves to defamation liability if they write something untrue about another person that is held up as factual in their published work, that statement injures the person’s reputation, and the author failed to take the requisite level of care to ensure that the statement was factual. 

False light is similar to defamation; indeed, many states do not recognize false light as a separate cause of action because the two are so similar. Where defamation concerns false statements represented as factual, false light concerns false portrayals. It can occur when a speaker creates a misleading impression about a subject through implication or omission, for example. Like defamation, false light requires fault on the part of the speaker, and the public person/private person standards are the same as for defamation. 

Invasions of Privacy

The second section in the Writing About Real People guide concerns invasions of privacy, or violations of a person’s rights to privacy. This covers two related causes of action: intrusion on seclusion and public disclosure of private facts. 

Intrusion on seclusion occurs when someone intentionally intrudes on another’s private place or affairs in a way that is highly offensive—judged from the perspective of an ordinary, reasonable person. For authors, intrusion on seclusion can arise when an author uses invasive research or information-gathering methods. This could include things like entering someone’s home without permission or digging through personal information like health or banking records without permission. Unlike the other causes of action in this guide, intrusion on seclusion tends to be an issue during the research and writing stages of an author’s process, not when the work is actually published.

Public disclosure of private facts occurs when someone makes private facts about a person public, when that disclosure is highly offensive and made with some degree of fault, and when the information disclosed doesn’t relate to a matter of public concern. Essentially, public disclosure of private facts liability exists to address situations where a speaker shares highly private information about a person that the public has no interest in knowing about, and the subject suffers as a result. Like defamation and false light, the level of fault required for a speaker to be liable depends on whether the subject is a public or private person, and these levels are the same as for defamation (actual malice for public people, and negligence for private people). This means that authors have much more leeway to share private information about public people than private people. And the “public concern” piece provides even more protection for speech about public people. 

Right of Publicity and Identity Rights

The third section in the Writing About Real People Guide concerns the right of publicity and unauthorized use of identity. Violations of the right of publicity, or unauthorized uses of identity, can occur when someone uses another person’s identity in a way that is “exploitative” and derives a benefit from that use. Importantly for authors, this excludes merely writing about someone in a book, article, or other piece of writing. The right of publicity is mostly concerned with commercial uses, like using someone’s name or likeness to sell a product without permission, but it can also apply to non-commercial uses that are exploitative, like using someone’s identity to generate attention for a work. In most cases, the right of publicity involves uses of someone’s image or likeness rather than just evoking their identity in text, but this is not necessarily the case. This section might be informative for authors who want to use someone’s image on their book cover or evoke an identity in advertising, but most authors merely writing nonfiction text about a real person do not have to worry too much about the right of publicity. 

Practical Guidance

A final section in our guide covers practical guidance for authors on how to avoid legal liability for the causes of action discussed in the guide in ways that are simple to understand and implement. Using reliable research methods and sources, obtaining consent from subjects where that is practicable, and carefully documenting your research and sources can go a long way towards helping you avoid legal liability while still empowering you to write freely.

Authors Alliance Submits Comment to Copyright Office in Generative AI Notice of Inquiry

Posted November 3, 2023
Photo by erica steeves on Unsplash

We are pleased to announce that we have submitted a comment to the Copyright Office in response to their recent notice of inquiry regarding how copyright law interacts with generative AI. In our comment, we shared our views on copyright and generative AI (which you can read about here) and the stories we heard from authors about how they are using generative AI to support their creative labors, research, and the mundane but important tasks involved in being a working author. The Office received over 10,000 comments in response to its NOI, showing the high level of interest in how copyright regulates AI-generated works and training data for generative AI. We hope the Office will appreciate our perspective as it considers policy interventions to address copyright issues involved in the use of generative AI by creators. You can read our full comment here, or at the bottom of this post. 

You can hear more about our comment, and about contributions from other commenters, at the Berkeley Center for Law and Technology virtual roundtable on Monday, November 13th, where Authors Alliance senior staff attorney Rachel Brooke will be a panelist. The event is free and open to the public, and you can sign up here. 


Since the Copyright Office issued an opinion letter on copyright in a graphic novel containing AI-generated images back in February, the debate about copyright and generative AI has grown to a near fever pitch. Authors Alliance has been engaged in these issues since the decision letter was released: we exist to support authors who want to leverage the tools available in the digital age to see their creations reach broad audiences and create innovative new works, and we see generative AI systems as one such tool that can support authors and authorship. We participated in the Copyright Office’s listening session on copyright issues in AI-generated textual works this spring, and were eager to further weigh in as the Copyright Office wades through the thorny issues involved. 

In late August, the Copyright Office issued a notice of inquiry, asking stakeholders to weigh in on a series of questions about copyright policy and generative AI. These were broken down into general questions, questions about training AI models, questions about transparency and recordkeeping, and various issues related to AI outputs—copyrightability, infringement, and labeling and identification. 

Our Comment

Our comment was devoted in large part to sharing the ways that authors are using generative AI systems and tools to support their creative labors and research. We heard from authors who used generative AI systems for ideation, late-stage editing, and generating text. We also learned that authors are using generative AI systems in ways we wouldn’t have anticipated—like creating books of prompts for other authors to use as inputs for generative AI systems. Generative AI has helped authors who don’t publish with conventional publishers create marketing copy and even generate book covers (despite the common adage, these are pretty important for attracting readers). We also heard from researchers using generative AI for literature reviews as well as to make their writing process more efficient so they can focus on doing the work of researching and innovating. Generative AI also has the potential to lower barriers to entry for scientific researchers who are not native English speakers but want to make contributions to scientific fields in which literature tends to be written in English. 

We also spent some time explaining our views on why the use of copyrighted materials in training datasets for AI models constitutes fair use, and how the fair use analysis applies to such uses. The use of creative works in training datasets is a transformative one with a different purpose than the works themselves—regardless of whether the institutions that develop and deploy these models are commercial or nonprofit. And it’s highly unlikely that a generative AI system could harm the markets for the works in the training sets for the underlying models: a generative AI system is not a substitute for a book a reader is interested in reading, for example. We also explained that the market harm consideration (factor four in fair use analysis) should consider the effect of the use (using training data on AI models) on the market for the specific work in question (i.e., in an infringement action, the work that is alleged to have been infringed), and not the market for that author’s other works, similar works, or anything else.

Our comment also argued that new copyright legislation on AI—either to codify copyright’s human authorship requirement and explain how it applies to AI-generated content or to address other issues related to copyright and generative AI—is not warranted. AI systems, AI models, and the ways creators use them are still evolving. Copyright law is already highly flexible, having adapted to new technologies that weren’t anticipated when the copyright legislation itself was enacted. And legislating around nascent technologies can result in laws that are eventually ill-suited to deal with unexpected challenges that new technologies bring about (recall that the DMCA, which has faced a lot of criticism as a statute intended to regulate copyright online, was passed in 1998). We instead suggest that the Office stick with a “wait and see” approach as generative AI and how we use it continue to develop rather than recommending legislation to Congress. 

Next, we explained why a licensing system for copyrighted works used in AI training data is neither desirable nor practicable. Because we consider the use of copyrighted works in training data to be a fair use, licenses are not necessary in the first place. We also explained the host of problems that either a compulsory licensing regime or a collective licensing scheme would bring about. The large size of datasets for training AI models makes it difficult to envision systematically seeking licenses for each and every copyrighted work in the training dataset, and the “orphan works problem” means that many rightsholders simply cannot be located. It’s also not clear who would administer licensing under such a regime, and we could not think of any appropriate party that exists or is likely to emerge. The Office’s past failed investigations into possible collective rights management organizations (or CMOs) only underscore this point.

Finally, we echoed our support for the substantial similarity test as a way to handle generative AI outputs that look very similar to existing copyrighted works. The substantial similarity test has been around for decades and has been applied across the country in a variety of contexts. It seems to us to be a good way to approach the rare cases in which generative AI outputs are strikingly similar to copyrighted works (so-called “memorization”) such that a rightsholder might sue for infringement. 

What’s Next?

The same day we submitted our comment, the Biden Administration released an executive order on “Safe, Secure, and Trustworthy Artificial Intelligence,” directing federal agencies to take a variety of measures to ensure that the use of generative AI is not harmful to innovation, privacy, labor, and more. Then on Wednesday, representatives from a coalition of countries (including the U.S.) signed “The Bletchley Declaration” following an AI Safety Summit in the U.K., warning of the dangers of generative AI and pledging to work together to find solutions. All of this is to say that how public policy should regulate generative AI, and whether and how the law needs to change to accommodate it, is a live issue that continues to evolve every day. Dozens of lawsuits are pending about the interaction between copyright and the use of generative AI systems, and as these cases move through the courts, judges will have their opportunity to weigh in. As ever, we will keep our readers and members apprised of any new legal developments around copyright and generative AI.


Copyright Office Recommends Renewal of the Existing Text Data Mining Exemptions for Literary Works and Films

Posted October 19, 2023
Photo by Tim Mossholder on Unsplash

Authors Alliance is delighted to announce that the Copyright Office has recommended that the Librarian of Congress renew both of the exemptions to DMCA liability for text and data mining in its Notice of Proposed Rulemaking for this year’s DMCA exemptions, released today. While the Librarian of Congress could technically disagree with the recommendation to renew, this rarely if ever happens in practice. 

Renewal Petitions and Recommendations

Authors Alliance petitioned the Office to renew the exemptions in July, along with our co-petitioners the American Association of University Professors and the Library Copyright Alliance. Then, the Office entertained comments from stakeholders and the public at large who wished to make statements in support of or in opposition to renewal of the existing exemptions, before drawing conclusions about renewal in today’s notice. 

The Office did not receive any comments arguing against renewal of the TDM exemption for literary works distributed electronically; our petition was unopposed. The Office agreed with Authors Alliance and our co-petitioners, AAUP and LCA, observing that “researchers are actively relying on the current exemption” and citing an example of such research that we highlighted in our petition. Apparently agreeing with our statement that there have not been “material changes in facts, law, technology, or other circumstances” since the 1201 rulemaking cycle in which the exemption was originally obtained, the Office stated it intended to recommend that the exemption be renewed.

Our renewal petition for the text and data mining exemption for motion pictures, which is identical to the literary works exemption in all respects but the type of works involved, did receive one opposition comment, but the Copyright Office found that it did not meet the standard for meaningful opposition, and recommended renewal. DVD CCA (the DVD Copyright Control Association) and AACS LA (the Advanced Access Content System Licensing Administrator) submitted a joint comment arguing that a statement in our petition indicated that there had been a change in the facts surrounding the exemption. More specifically, they argued that our statement that “[c]ommercially licensed text and data mining products continue to be made available to research institutions” constituted an admission that new licensed databases for motion pictures had emerged since the previous rulemaking. DVD CCA and AACS LA did not actually offer any evidence of the emergence of new licensed databases for motion pictures. We believed this opposition comment was without merit—while licensed databases for text and data mining of audiovisual works are not as prevalent as those for text-based works, some were available during the 2021 rulemaking and continue to be available today. We are pleased that the Office agreed, citing the previous rulemaking record as supporting evidence.

Expansions and Next Steps

In addition to requesting that the Office renew the current exemptions, we (along with AAUP and LCA) also requested that the Office consider expanding these exemptions to enhance a researcher’s ability to share their corpus with other researchers who are not their direct collaborators. The two processes run in parallel, and today’s announcement means that even if we do not ultimately obtain expanded exemptions, the existing exemptions are very likely to be renewed.

In its NPRM, the Office also announced deadlines for the various submissions that petitions for expansions and new exemptions will require. The first round of comments in support of our proposed expansion—including documentary evidence from researchers who are being adversely affected by the limited sharing permitted under the existing exemptions—will be due December 22. Opposition comments are due February 20, 2024. Reply comments to these opposition comments are then due March 24, 2024. Then, later in the spring, there will be a hearing with the Copyright Office regarding our proposed expansion. We will—as always—keep our readers apprised as the process moves forward.

Call to Action: Share your Experiences with Generative AI!

Posted October 9, 2023
Photo by Patrick Fore on Unsplash

Authors Alliance is currently at work on a submission to the Copyright Office regarding our views on generative AI (which you can read about here). If you’re an author who has used generative AI in your research or writing, we’d love to hear from you! Please reach out to Rachel Brooke, Authors Alliance Senior Staff Attorney, at