Does Copyright Require Authorization to Use Data “Subsisting in Copyright Works?”

Posted February 20, 2020

Authors Alliance thanks Matthew Sag, professor at Loyola University Chicago School of Law, for this guest post (originally published on

The World Intellectual Property Organization in Geneva has requested comments on a series of questions about whether “use of the data subsisting in copyright works without authorization for machine learning constitute an infringement of copyright?” I have joined other copyright experts in a submission to WIPO commenting on their questions. This note explains in more detail some of my reservations about use of the phrase “use of the data subsisting in copyright works without authorization” in WIPO’s questions and in our general thinking about the relation between copyright and text and data mining.

The phrase “use of the data subsisting in copyright works without authorization” is unhelpful, to say the least.

To begin with the most obvious problem, the “use” of the data or facts subsisting in copyright works generally requires no authorization. For example, this morning I “used the data” on the weather page of my local newspaper to decide whether I should shovel snow or wait for more snow to fall. No doubt, the newspaper is protected by copyright, but the facts contained therein are not.

Moreover, the second problem with the question WIPO proposed is that my “use” of the weather data required no authorization because it did not involve any action on my part implicating the exclusive rights of the copyright owner. I did not make a copy of the newspaper, I did not publicly perform it, I did not turn it into a digital audio transmission, etc.

Both of these points are Copyright 101, but it is easy to lose sight of the fundamentals when contemplating new and unexpected uses of copyrighted works in a rapidly evolving technological environment. It does not make sense to ask “Should the use of the data subsisting in copyright works without authorization for machine learning constitute an infringement of copyright?” in the abstract. Instead, we need to focus with more precision on the potential copyright issues that are actually raised by AI in particular contexts, and to do that we need to understand the relationship between text data mining, on the one hand, and machine learning and AI, on the other.

Text data mining refers to any computational process for applying structure to unstructured electronic texts, and it generally involves employing statistical methods to discover new information and reveal patterns in the processed data.[1]

Machine learning refers to a cluster of statistical and programming techniques that give computers the ability to “learn” from exposure to data, without being explicitly programmed.[2]

The term AI or artificial intelligence is mostly used to refer to more sophisticated forms of machine learning, or else to describe speculative accounts of what might be possible with future technology. If we put science fiction and hyperbole to one side, we can proceed to talk about machine learning and AI interchangeably in terms of the relevant copyright issues.

If we move beyond the premise that AI is a magical process that defies human understanding, we can see the third fundamental problem with the phrase “the data subsisting in copyright works.” The notion that AI is using data that “subsists” in copyright works reflects a fundamental misunderstanding of the technology at issue. Unless the copyrighted work is something like a book of used car values, the data does not subsist in the work waiting to be extracted. The data is not a subset of the work. In almost every real-world use case of AI and machine learning, the data is derived by making an external observation about the work.

This is an important point: the non-expressive metadata produced by text data mining does not originate from the underlying copyrighted works. It does not subsist in those works. Instead, it is derived from them by acts of external observation.

As I have explained in a recent paper:

“Imagine plagiarism detection software that reports that student term paper B is substantially similar to an earlier paper A. Paper A originated with student author A, but the observation as to its similarity with student B’s term paper does not originate with either A or B. It originates with the software algorithm programmed to detect plagiarism.

Likewise, a word frequency table derived from Moby Dick did not originate with Herman Melville. Melville obviously realized that he would be writing the word “whale” over and over, but presumably he never set out to make an exact count. In both examples, to the extent the metadata about the work owes its origin to anyone, that person would be the person who derived the data, not the author of the underlying work.”[3]
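To make these two examples concrete, here is a minimal Python sketch of the kind of external observation involved. It is an illustration only: the sample sentence is invented (it is not a passage from Moby Dick), the function names are hypothetical, and the vocabulary-overlap score is a deliberately crude stand-in for whatever a real plagiarism detector computes.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text: str) -> Counter:
    """Count how often each word occurs: an observation about the text."""
    return Counter(tokenize(text))


def vocabulary_overlap(text_a: str, text_b: str) -> float:
    """Jaccard similarity of two texts' vocabularies (0.0 to 1.0).

    A crude, assumed proxy for similarity; not how any particular
    plagiarism detection product works.
    """
    vocab_a, vocab_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not vocab_a and not vocab_b:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


# Invented sample text, used only to exercise the functions.
sample = "The whale surfaced. The whale dived. The crew watched the whale."
counts = word_frequencies(sample)
print(counts.most_common(2))
```

Neither the count table nor the overlap score reproduces a sentence of the text it describes. Both are produced by the program and its author, which is the sense in which the metadata is derived from the work rather than subsisting in it.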

The false premise that the non-expressive metadata produced by text data mining already “subsists” in the copyrighted works from which it is derived leads to the false conclusion that when the data is used, something is taken from the original author. On the contrary, producing non-expressive metadata takes nothing from the original author because under any version of the idea-expression distinction, latent facts are not the property of the author. But even if they were, these are not their facts.

Copyright and Machine Learning

Machine learning/AI raises three distinct sets of copyright issues.


One issue is whether the output of the machine learning algorithm infringes anyone’s copyright. For example, if a computer program was given access to the collected works of Taylor Swift and asked to produce a new pop song that turned out to be substantially similar to “Blank Space,”[4] the new song would be an infringing work. There are no new questions of copyright law to determine in this unlikely scenario.[5] Either the end product is infringing because it is too similar to the original, or it is not. Perhaps the use of AI in this fashion will lead to caselaw in which the bar for similarity for musical infringement is lowered.[6] But if so, there is no need for WIPO to get involved at this level of minutiae.

The prospect of a computer being trained on the works of Taylor Swift and producing equally popular new musical works in a Swiftian style that are not substantially similar to any of Swift’s individual copyrighted works is intriguing,[7] but the issues it raises are largely not copyright ones. It may be that renewed attention must be given to data privacy, the right of publicity and analogous concepts in such a scenario, or it may be that after the initial moral panic we realize that computer-generated music complements rather than substitutes for the work of famous artists. Whatever the case, the prospect of AI competition displacing labor in the creative fields raises questions of social policy that are not unique to the copyright industries. Factory workers, commercial drivers, and retail assistants all face similar prospects.


A second issue is whether a machine learning/AI program itself might be thought of as an infringing reproduction or adaptation of some set of copyrighted works. At an abstract level, the answer to this question depends on how low/high the thresholds of originality and similarity are set in a given jurisdiction. They appear to be slightly higher in the United States than in the EU, but it is not clear if the differences are significant. Even assuming very low thresholds of originality and similarity, it is difficult to imagine that a machine learning program could be found to be an infringing reproduction or adaptation of the works it is learning from if we applied traditional standards of copyright law. It is easiest to explain why this is so in the context of text. It is theoretically possible that a text string long enough to infringe copyright could be extracted from a copyrighted work and embedded in a machine learning program, but it’s not very likely. As I have explained in a recent article, “Ngrams where n is larger than five are uncommon for reasons that have nothing to do with copyright law. They are uncommon because they are computationally expensive and not very useful.”[8]
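For readers unfamiliar with the term, an n-gram is simply a run of n consecutive words. The short Python sketch below shows what an n-gram table looks like as n grows. The sentence is invented and the `ngrams` helper is hypothetical. The sketch only illustrates one plausible reason for the quoted observation, namely that longer n-grams rarely repeat and so yield large, sparse tables; that reading is an assumption on my part, not something the sketch demonstrates about any real system.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented toy sentence, used only for illustration.
tokens = "the cat sat on the mat and the cat sat on the rug".split()

for n in (1, 2, 5):
    table = Counter(ngrams(tokens, n))
    # How many distinct n-grams occur more than once in the text.
    repeated = sum(1 for count in table.values() if count > 1)
    print(f"n={n}: {len(table)} distinct n-grams, {repeated} repeated")
```

In this toy sentence most single words recur, while almost every five-word run is unique. A table in which nearly every entry appears once says little about patterns in language, which fits the point that long n-grams are costly to store and of limited statistical use.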

If the content of a machine learning/AI program did constitute a prima facie reproduction or adaptation of some underlying copyrighted work, the answer to whether the program infringes copyright tracks very closely with the third issue. In the United States, it would be answered the same way.


The third copyright issue raised by machine learning and AI is whether employing text data mining tools to convert conventional texts (books, newspaper articles, music, etc.) into the raw data that feeds these processes violates the copyrights in those underlying texts. In other words, does the use of automated techniques to derive metadata from traditional copyrighted works violate the copyright in those works under traditional copyright law principles? This is the primary question that WIPO should consider.

Fair Use and Machine Learning

In the United States, the answer to this question is “no.” Even though it usually involves large amounts of copying, the process of text data mining does not infringe copyright unless the output of that process also infringes copyright. United States courts have applied the fair use doctrine to maintain the viability of copyright’s fundamental distinction between ideas and expression in the modern technological environment.[9] Our courts have recognized that although various (but not all) text data mining tools constitute prima facie acts of reproduction, such uses are authorized under the fair use doctrine.[10]

As I have recently explained, “[a]llowing text mining and other similar non-expressive uses of copyrighted works without authorization is entirely consistent with the fundamental structure of copyright law.” This is because, “at its heart, copyright law is concerned with the communication of an author’s original expression to the public.”

“[Text data mining] and other non-expressive uses do not communicate original expression to the public (i.e., to any human reading audience for the purpose of being read, understood, or appreciated). As such, even though these uses involve technical acts of copying, they do not conflict with the copyright owner’s exclusive rights.”[11]

Because some of the acts required to engage in text and data mining involve prima facie acts of reproduction, jurisdictions that lack the benefit of an open fair use standard may need to create exceptions to allow for text data mining. This may be true whether text and data mining is used in conjunction with machine learning/AI or for more prosaic statistical techniques. Alternatively, they might recognize that statutory prohibitions against reproduction only apply to copying for expressive purposes.[12]

The Impact of Protectionist Approaches on the Field

As I summarized in a recent article,[13] if all copyright law were construed to make the use of automated techniques to derive metadata from traditional copyrighted works unlawful without authorization, text data mining and machine learning/AI would continue, but in a much more limited fashion. Under a protectionist approach, text and data mining research utilizing copyrighted works would likely be restricted to highly centralized domains like modern scientific publishing and to a few special standalone collections that have been cleared by rights holders. But research and development would be severely limited in any domain where rights are held diffusely, such as Internet search, plagiarism detection, the study of twentieth century literature, and the study of social networks. The world would not end if there were no way to reconcile copyright with text mining and machine learning, but it would be a much poorer place.

Given the clarifications in the law of the United States, the EU, Japan, and other countries to permit text and data mining with copyrighted works without authorization, a dystopian future where such activities are prohibited without authorization around the world is unlikely. More likely, and still quite problematic, is a scenario where some (predominantly wealthy) countries permit text data mining research, and many others do not (or leave the issue unclear). Countries that prohibit text mining across the board, or in conjunction with machine learning, are basically abandoning the fundamental copyright distinction between facts and expression in the digital age. Those nations will cede a substantial competitive advantage to academic and commercial researchers in the United States and other jurisdictions where the precedents in favor of text data mining are now firmly established.[14] Such a divide will also raise the specter of substantial cross-border issues – such as whether a corpus of works lawfully created for mining in one country can be lawfully transferred to or used by researchers in another.

Final thoughts

It is good that IP policy makers are taking an interest in the copyright issues relating to text data mining, machine learning and AI. My view is that once they understand the technology and really think about the copyright issues involved, the conclusion that the use of automated techniques to derive metadata from traditional copyrighted works should be permitted (and is already in many jurisdictions) is inescapable. If there are broader issues of social policy that arise in relation to the use of AI and machine learning, we should have an honest conversation about those issues. Those conversations will usually have a lot more to do with inequality than with copyright.

[1] Matthew Sag, The New Legal Landscape for Text Mining and Machine Learning, 66 J. of the Copyright Soc’y of the USA, 3 (2019) (hereinafter, Sag 2019).

[2] Id.

[3] Id.

[4] Taylor Swift, et al. 2014 (Big Machine).

[5] Yes, I am familiar with the “Travis Bott” but the gulf between incoherent imitation and imitating actual communicative language remains vast. See Starr Rhett Rocque, Meet Travis Bott, the Travis Scott twin whose music and lyrics were created with AI, Fast Company, Feb. 13, 2020 (

[6] Arguably, it already has in some cases. See Williams v. Gaye, 885 F.3d 1150 (9th Cir. 2018).

[7] For a real world analog, see MuseNet. Will Knight, This AI-generated Musak Shows Us the Limit of Artificial Creativity, MIT Tech. Rev. (Apr. 26, 2019), (describing MuseNet).

[8] Sag 2019.

[9] See Sega Enterprises Ltd. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992) (software reverse engineering); Sony Computer Entertainment v. Connectix, 203 F.3d 596 (9th Cir. 2000) (same); A.V. ex rel. Vanderhye v. iParadigms, LLC, 562 F.3d 630 (4th Cir. 2009) (plagiarism detection software); Authors Guild v. HathiTrust, 755 F.3d 87 (2d Cir. 2014) (text data mining of a corpus of millions of books for research purposes including search, meta-analysis); Authors Guild v. Google, 804 F.3d 202 (2d Cir. 2015) (text data mining of a corpus of millions of books for a commercial search engine).

[10] Most explicitly, Authors Guild v. HathiTrust, 755 F.3d 87 (2d Cir. 2014) (text data mining of a corpus of millions of books for research purposes including search, meta-analysis) and Authors Guild v. Google, 804 F.3d 202 (2d Cir. 2015) (text data mining of a corpus of millions of books for a commercial search engine).

[11] Sag 2019.

[12] This is the approach suggested by Prof. Abraham Drassinower. See Abraham Drassinower, What’s Wrong With Copying (2015).

[13] Sag 2019.

[14] To see how firmly established, consider the recent case of Fox News v. TVEyes, in which the plaintiff did not attempt to argue that the extensive copying of its broadcasts to create TVEyes’s analytical database amounted to infringement. Fox News Network, LLC v. TVEyes, Inc., 43 F. Supp. 3d 379, 388 (S.D.N.Y. 2014). Instead, Fox News successfully focused its attack on the fact that the video clips of its content to which TVEyes gave its customers access were a full ten minutes long and that various features of the service allowed subscribers to watch, copy, and distribute those clips. Fox News Network, LLC v. TVEyes, Inc., 883 F.3d 169 (2d Cir. 2018).