Update: 1201 Exemption to Enable Text and Data Mining Research

Posted August 24, 2021

Authors Alliance, joined by the Library Copyright Alliance and the American Association of University Professors, is petitioning the Copyright Office for a new three-year exemption to the Digital Millennium Copyright Act (“DMCA”) as part of the Copyright Office’s eighth triennial rulemaking process. If granted, our proposed exemption would allow researchers to bypass technical protection measures (“TPMs”) in order to conduct text and data mining (“TDM”) research on motion pictures and on literary works that are distributed electronically. Recently, we met with representatives from the U.S. Copyright Office to discuss the proposed exemption, focusing on two topics: the circumstances in which access to corpus content is necessary for verifying algorithmic findings, and ways to address security concerns without undermining the goal of the exemption.

Access to Corpora for Verification

In response to suggestions from opponents that the exemption, if granted, should ban researchers from accessing text in their corpora, David Bamman, associate professor at the School of Information at UC Berkeley, shared circumstances in which a researcher would need to access text in a corpus to verify research findings. Drawing from his co-authored article The Transformation of Gender in English-Language Fiction, Dr. Bamman used two examples to demonstrate why access to the research corpus is necessary to verify anomalous research findings. First, Dr. Bamman executed code that produced all lines of text in Nathaniel Hawthorne’s The Scarlet Letter that include both female gendered pronouns and capitalized words to investigate an algorithm’s failure to identify any female characters in the novel. Second, Dr. Bamman executed code that produced all lines of text that included the word “legs” to investigate why this was one of the objects most associated with male characters in the research corpus.
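To make the two verification queries concrete, here is a minimal sketch in Python of what such line-level lookups might look like. It is not Dr. Bamman’s code: the function names, the pronoun list, the rule for what counts as a capitalized word, and the sample lines (invented placeholders, not quotations from the novel) are all assumptions made for illustration.

```python
import re

# Assumed pronoun list; a real study would choose its own.
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
WORD = re.compile(r"[A-Za-z]+")
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def has_female_pronoun(line):
    """True if the line contains any pronoun from the assumed list."""
    return any(w.lower() in FEMALE_PRONOUNS for w in WORD.findall(line))


def has_capitalized_non_pronoun(line):
    """True if the line has a capitalized word that is not itself a pronoun."""
    return any(w.lower() not in FEMALE_PRONOUNS for w in CAPITALIZED.findall(line))


def lines_with_pronoun_and_capitalized(text):
    """Return (line number, line) pairs containing both kinds of token."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if has_female_pronoun(line) and has_capitalized_non_pronoun(line)
    ]


def lines_with_word(text, word):
    """Return (line number, line) pairs containing the given whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Invented placeholder lines standing in for a corpus text.
sample = """Hester stood at the door and she did not speak.
The crowd waited in the market place.
She turned away.
He stretched his legs by the fire.
Pearl laughed when her mother called."""

for number, line in lines_with_pronoun_and_capitalized(sample):
    print(number, line)

for number, line in lines_with_word(sample, "legs"):
    print(number, line)
```

Both queries return the matching lines themselves, which is the point: a researcher can only judge whether an algorithm missed a character, or why a word is associated with one group, by reading the passages the code surfaces.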

As we have previously explained, while researchers do not need this exemption for the purpose of viewing the full text or images of the works that they or their institutions have already obtained lawfully, researchers must be able to verify their research methods and research results. The scale of many research projects would make verification of anomalous research findings without access to the research corpus prohibitively time-consuming. An outright ban on accessing text in the corpus would make many TDM projects impossible because researchers would not be able to interrogate the conclusions reached by the code they had developed. Moreover, the ability to view corpus text or images is consistent with the research environments of both HathiTrust Data Capsules and Google Book Search, and it is consistent with fair use precedent.

Security Measures

As a threshold matter, we shared our view that the approach of existing § 1201 exemptions, which require reasonable security measures keyed to particular, identified risks, is consistent with the decisions in Google Books and HathiTrust. In both cases, the Second Circuit identified security measures that were reasonable responses to actual risks. This is consistent with past Copyright Office recommendations that identify the risk to be guarded against, but do not prescribe the security controls to guard against it. To that end, we suggested language to add to the exemption to define more specifically the harms that exemption users must guard against when implementing security controls: dissemination, downloading, and unauthorized access. We also explained that we do not object to a requirement that researchers wishing to avail themselves of the exemption consult with their institution’s information technology office. Institutions of higher education are well positioned to provide this kind of advice, and such a requirement would ameliorate some of the opponents’ concerns.

In addition, we discussed the various specific security controls and standards opponents advocated for in their post-hearing letters. We explained that while we continue to believe that the Copyright Office’s reasonableness approach is the right one, the intended exemption beneficiaries would still be able to avail themselves of the exemption if certain controls are imposed. These include encryption on the server, limiting access to the collection to those with a legitimate and authorized need, deletion of the collection upon conclusion of the applicable research need, and mechanisms to detect and prevent downloading of stored materials. Other security controls proposed by opponents, however, would render the exemption unusable, and we explained our concerns with these proposals to the Office.

* * *

We look forward to working with the Copyright Office to address opponents’ concerns without undermining the purpose of the proposed exemption. The Librarian of Congress is expected to issue a final decision on the proposed exemption in October 2021. We will keep our members and readers apprised of any updates on our proposed exemption as the process moves forward. We’re grateful to law students and faculty from the Samuelson Law, Technology & Public Policy Clinic at UC Berkeley Law School for their work supporting our petition for this new exemption.