Github Copilot Class Action Lawsuit (and why authors and researchers should pay attention)

Yesterday there was a pretty interesting class action lawsuit filed against Github and Microsoft. The suit is about Github’s Copilot service, which it advertises as “Your AI pair programmer.” As described by Github, Copilot is “trained on billions of lines of code” and “turns natural language prompts into coding suggestions across dozens of languages.” The suit focuses on Github’s reuse of code deposited with it by programmers, mostly under open source licenses, which Github has used to train the Copilot AI. Those licenses generally allow reuse but commonly come with strings attached, such as requiring attribution and relicensing the new work under the same or similar terms. The class action asserts, among other things, that Github hasn’t followed those terms because it hasn’t attributed the source adequately and has removed copyright-relevant information.

Sounds interesting, but you might be wondering why we care about this lawsuit. For a few reasons. First, it raises some important questions about the extent to which researchers can train AI on, and produce outputs from, datasets of copyrighted materials, even materials thought generally “safe” because they’re available under open licenses. As the suit highlights, openly licensed materials aren’t free of restrictions (most include attribution requirements), and when those materials are aggregated and used to craft new outputs, it can be seriously complicated to find the right way to attribute all the underlying creators. If this suit raises the barrier to using such materials, it could pose real problems for many existing research projects. It could also further narrow the datasets that AI researchers are likely to use, resulting in an even smaller group of materials that include what law professor Amanda Levendowski refers to as “biased, low-friction data” (BLFD), which can lead to some pretty bad and biased results. How and when open license attribution requirements apply is important for anyone doing research with such materials in aggregate.

Second, the suit at least indirectly implicates some of the same legal principles that authors working on text-data mining projects rely on. We’ve argued (successfully, before the U.S. Copyright Office) that such uses are generally not infringing, particularly for research and educational purposes, because fair use allows for them. Several others, such as Professors Michael Carroll and Matthew Sag, have made similar arguments. Of course, Github Copilot has some meaningful differences from text-data mining for academic research; e.g., it is producing textual outputs based on the underlying code for a commercial application. But the fair use issue in this case could have a direct impact on other applications.

Interestingly, the Github Copilot suit doesn’t actually allege copyright infringement, which is how fair use would most naturally be raised as a defense. Instead, the plaintiffs, as class representatives, make two claims that could implicate a fair use defense: 1) a contractual claim that Github has violated the open source licenses covering the underlying code, which generally require attribution, among other things; and 2) a claim that Github has violated Section 1202 of the Digital Millennium Copyright Act by removing copyright management information (“CMI”) (e.g., copyright notices, titles of the underlying works).

The complaint attempts to avoid the fair use issue, asserting that “the Fair Use affirmative defense is only applicable to Section 501 copyright infringement. It is not a defense to violations of the DMCA, Breach of Contract, nor any other claim alleged herein.” The plaintiffs may well be trying to follow the playbook of another recent open source licensing case, Software Freedom Conservancy v. Vizio, in which the plaintiff successfully convinced a federal court that its breach of contract claims, based on an alleged breach of the GPLv2 license, should be considered separate and apart from a copyright fair use defense.

This suit is a little different though. For one, at least five of the eleven licenses at issue explicitly recognize the applicability of fair use; for example, the GNU General Public License version 3 provides that “This License acknowledges your rights of fair use or other equivalent, as provided by copyright law.” It would seem more of a challenge to convince a court that a fair use defense doesn’t matter when almost half of the licenses explicitly say it does. Likewise, while the text of Section 1202 doesn’t explicitly allow for a fair use defense, its restrictions apply to the removal of CMI only when it is done “without the authority of the copyright owner or the law.” The plaintiffs claim that fair use isn’t a defense to allegations of a Section 1202 violation, but that’s far from clear, and it may be that removal of information pursuant to a valid fair use claim should qualify as removal with the “authority . . . of the law.”

The lawsuit is a class action, so it faces some special hurdles that a typical suit would not. For example, the plaintiffs must demonstrate that they can adequately represent the interests of the class, which they have defined as:

All persons or entities domiciled in the United States that, (1) owned an interest in at least one US copyright in any work; (2) offered that work under one of GitHub’s Suggested Licenses; and (3) stored Licensed Materials in any public GitHub repositories at any time between January 1, 2015 and the present (the “Class Period”).


That could pose a challenge, given that at least a portion (if not a sizable portion) of those who contributed code to Github under those open licenses may be more sympathetic to Github’s reuse than to the claims of the plaintiffs. In Authors Guild v. Google, another class action suit involving mass copying to facilitate computer-aided search and outputs like snippet view in Google Books, similar intra-class conflicts posed a challenge to class certification (including objections we raised on behalf of academic authors). The Github Copilot suit also includes a number of other claims, which means it could be resolved without addressing the copyright and licensing issues noted above. For now, we’ll monitor the case and update you on outcomes relevant to authors.

