Yesterday, Authors Alliance, joined by the Library Copyright Alliance and the American Association of University Professors, filed a comment with the Copyright Office in support of a new three-year exemption to the Digital Millennium Copyright Act (“DMCA”) as part of the Copyright Office’s eighth triennial rulemaking process. Our proposed exemption would allow researchers to bypass technical protection measures (“TPMs”) in order to conduct text and data mining research on both motion pictures and literary works that are published electronically.
Background: Section 1201 and Exemptions
Section 1201 of the DMCA prohibits the circumvention of TPMs used by copyright owners to control access to their works. It also prohibits the manufacture or sale of devices or programs designed to circumvent these TPMs. In other words, section 1201 prevents individuals from breaking digital locks on copyrighted works, even when they seek to make a fair use of those copyrighted works or engage in otherwise non-infringing activities.
Because section 1201’s prohibitions can interfere with fair and socially beneficial uses of copyrighted works, the DMCA also provides for a triennial rulemaking process to grant temporary exemptions to the prohibitions. Authors Alliance has participated in each 1201 rulemaking cycle since our founding: during the 2015 and 2018 cycles, we petitioned for exemptions and their renewals to help authors enjoy their rights while ensuring their creations reach new audiences. For the upcoming 2021 rulemaking, we have petitioned for a new exemption that would allow researchers to bypass TPMs on films and on literary works distributed electronically for the purpose of conducting text and data mining (“TDM”) research, in addition to our petition to renew an exemption for multimedia e-books.
Text and Data Mining
Text and data mining refers to automated techniques for analyzing digital text and data in order to generate information that reveals patterns, trends, and correlations in that text or data. TDM has great potential to enable groundbreaking research and contribute to the commons of knowledge. As a highly transformative use of copyrighted works done for purposes of research and scholarship, TDM fits firmly within the ambit of fair use. But the current prohibition on bypassing TPMs in section 1201 makes TDM research on texts and films time-consuming and inefficient (and in some cases, impossible), working against the promotion of the progress of knowledge and the useful arts that copyright law has been designed to incentivize.
Because literary works distributed electronically and motion pictures are protected by TPMs, researchers, unable to bypass these TPMs due to section 1201, can turn instead to works in the public domain for their TDM research. With regard to films, this avenue is effectively unavailable, since works published after 1925 generally remain under copyright. For literary TDM scholars, literary works published before 1925 remain a potential alternative area of study, but focusing TDM on pre-1925 texts “further reinscribes white men as the center of the field and further marginalizes women and people of color.” Authorship was far less diverse in 1925 than it is today, so TDM research on public domain texts ends up privileging white male voices rather than being representative of authors contributing to the commons of knowledge today.
Our petition for a TDM exemption is accompanied by letters of support from 14 separate authors and researchers currently engaged in TDM research on literary works and films whose work has been hampered by section 1201, and two additional letters from experts who support TDM researchers. Here are just a few examples of their experiences:
The Data Sitters Club is a group of scholars under the Stanford University Literary Lab, “a research collective that applies computational criticism, in all its forms, to the study of literature.” The Data Sitters Club explores research questions in relation to the well-known Baby-Sitters Club series, a series for elementary- and middle-school-aged girls that was popular primarily in the 1980s and 1990s. The group would like to use computational analysis to investigate the extent to which the characters have distinct voices and to explore the series’ treatment of religion, race, adoption, divorce, and disability. The Data Sitters Club sees its study as a step towards exploring the worldview of American women in their 30s and 40s who read the Baby-Sitters Club books as children. It also aims to investigate common tropes in the books to explore these questions further.
There are over 200 books in the series, yet literary scholarship on the Baby-Sitters Club is sparse. Due to this gap, the power of TDM to yield new insights from large quantities of text, and the formative effect of children’s literature on its readers, the group sees a particular impetus to explore how the “iconic depiction of girlhood in the upper-middle-class American suburbs” has both mirrored and shaped its readers’ views of the world. Yet because the Baby-Sitters Club books were all written during the latter half of the 20th century, they remain under copyright, and the e-book versions are protected by technical protection measures, as is almost always the case with e-books. Because of section 1201’s prohibition on bypassing TPMs, the Data Sitters Club cannot use the Baby-Sitters Club e-books for its project and is instead forced to manually scan physical books and correct any transcription errors before applying computational analysis to the texts. This limits the number of texts the group can study and detracts from the time it can spend on its important research questions.
Professor Dan Sinykin, an assistant professor at Emory University who teaches English and computational analysis, is currently at work on a book, The Conglomerate Era, which seeks to explore how the conglomeration of U.S. publishing changed fiction. In the 1950s, almost every publisher in the country was independent, but today, despite the continued presence of some independent publishers in the ecosystem, only five multinational media conglomerates dominate the trade market (soon to be four, with the planned merger of Penguin Random House and Simon & Schuster). Professor Sinykin would like to use TDM “to detect patterns of change across thousands of novels across decades” in a groundbreaking exploration of literary history. However, because he seeks to study works published after 1945, which remain protected under copyright, Professor Sinykin’s project is made much more difficult by section 1201’s prohibition on bypassing technical protection measures.
Because he cannot use the e-book versions of late-20th-century novels to do his analysis, Professor Sinykin must use HathiTrust, a digital corpus of works under copyright that scholars can use for TDM purposes with subscriptions or institutional affiliations. Professor Sinykin points out the weaknesses of this approach: HathiTrust’s “data capsules” are cumbersome to use, offer limited computing power, and are difficult to access securely. The HathiTrust capsules are also limited to “holdings of select university libraries” and are not representative of fiction during the time period Sinykin wishes to study. Importantly, HathiTrust is not free, making the type of research Sinykin is currently undertaking inaccessible to scholars with fewer resources. If Professor Sinykin could bypass TPMs on e-books and use those for his project, he could use more representative fiction texts and would thus be enabled to “write a better, truer book about conglomeration.” He could also teach TDM to his students, the next generation of scholars, to ensure that this work continues in the future.
Professor David Bamman is an assistant professor at UC Berkeley whose research focuses on natural language processing and cultural analytics, and whose current TDM project involves films. Professor Bamman also has experience applying natural language processing to a digitized collection of books that he and his team scanned manually (similar to the Data Sitters Club’s workaround) out of concern over section 1201 liability if they instead bypassed TPMs.
In 2018, he became interested in applying TDM techniques, specifically computer vision and video processing, to film, and decided to compile a dataset of films to explore whether directorial style in movies can be measured and quantified. Professor Bamman estimated that a dataset of approximately 10,000 films would allow him to conduct this research and explore how directorial style can be decomposed and measured, such as through the types and lengths of shots and the color palette used in a film. Yet, cognizant of section 1201’s prohibition on bypassing technical protection measures, Professor Bamman purchased individual DVDs and undertook the burdensome process of playing them on a computer and using “screen-capture” software to record each movie as it played in real time. This method does not violate section 1201 but proved insufficient for Professor Bamman’s project: it would apparently have taken a human operator 10 years to manually screen-capture enough films to complete his corpus. As a result, Professor Bamman has abandoned this line of research, despite seeing immense value in research questions around “historical trends in film over the past century.”
We’re grateful to law students from the Samuelson Law, Technology & Public Policy Clinic at UC Berkeley Law School for their work preparing the comment. Responses from commenters who oppose the petition for this exemption are due February 9, 2021, and further comments in support of the petition, or from those who neither support nor oppose the petition, are due March 10, 2021. The Librarian of Congress is expected to issue a final decision on the proposed exemption in October 2021. We will keep our members and readers apprised of any updates on our proposed exemption as the process moves forward.