Anticircumvention Law is Not the Right Solution to Webscraping

CC0 License

Lately, large digital platforms have repeatedly turned to the courts, framing webscraping as a copyright problem in the hope of further enclosing the open web—with judicial blessing. Several new cases are now pending, and judges have an opportunity to put a roadblock in the path of the tech companies’ relentless march toward unchecked control.

In December, in one of the many pending copyright lawsuits against OpenAI, Judge Stein of the Southern District of New York issued an order dismissing a novel theory that—like so many ramshackle claims in AI copyright litigation—sought to expand the purview of copyright law beyond its constitutional limits.

The argument the plaintiffs attempted to make in Ziff Davis v. OpenAI was a simple one: that ignoring robots.txt instructions constitutes a violation of the anti-circumvention provisions of 17 U.S.C. §1201. Judge Stein astutely observed that robots.txt does not qualify as a “technological protection measure” (TPM) that effectively controls access; it is more like a sign on an open lawn that says “keep off the grass.” We believe Judge Stein has it right, because treating simple machine-readable instructions as legally enforceable barriers to access would effectively grant private parties unlimited power to restrict speech at will. If a file like robots.txt or a website’s terms of service could dictate what is allowed on that site, copyright’s protections for users—from their right to use uncopyrightable facts to their right to engage in fair use—would be at the mercy of private website owners.
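To see why robots.txt is closer to a request than a lock, consider how a crawler actually interacts with it. The minimal sketch below uses Python’s standard urllib.robotparser module; the site URL and user-agent string are hypothetical placeholders. The crucial point is that nothing enforces the check—the file’s instructions matter only if the scraper voluntarily consults them.

```python
# Minimal sketch of how a well-behaved crawler consults robots.txt.
# The URL and user-agent below are hypothetical placeholders.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the site's advisory crawling rules

# Nothing technically prevents fetching the page anyway; the check
# below happens only because the crawler chooses to run it.
if parser.can_fetch("ExampleBot/1.0", "https://example.com/some-article"):
    print("robots.txt permits this fetch")
else:
    print("robots.txt asks crawlers not to fetch this URL")
```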

Other equally dubious copyright claims that target webscraping are still pending. Last October, Reddit sued Perplexity AI and several webscraping service providers, including SerpApi, alleging they bypassed Google SearchGuard to harvest Reddit’s publicly available data as displayed in Google search results. Last month, Google directly joined the fray, filing its own lawsuit against SerpApi for circumventing SearchGuard.

We don’t know yet how courts will rule on these experimental §1201 claims in Reddit v. Perplexity or Google v. SerpApi. What we do know is that if courts were to uphold Google SearchGuard as an effective TPM under DMCA §1201, private firms capable of implementing sufficiently complex technical controls would be granted absolute authority to restrict access to, and use of, publicly available data as well as user-generated content.

Allowing a private firm like Google to exert such control over access to public web content raises several fundamental concerns. Most importantly, copyright law, at its core, exists to promote the progress of science and culture. Reddit and Google have amassed tremendous amounts of data and user-generated content, but they are not the copyright owners of that content. Allowing them unilateral control over how the content can be used would choke out competition and make culture stale. For instance, because YouTube is the biggest video platform and many creators post exclusively there, a ToS requiring remixers to get YouTube’s permission—even when a remixer already has the creator’s consent—suppresses creativity. And when only a handful of platforms dominate the market, private control over access can easily lead to private censorship.

But we are not advocating for a webscraping free-for-all. Harmful webscraping exists, and we believe its victims should have legal recourse. Copyright law, however, is simply not the right instrument for that job. The harms associated with webscraping should be addressed through tort and contract claims grounded in demonstrable, actual harm, rather than through a strict-liability copyright regime designed with entirely different purposes in mind.

Scraping can cause serious, tangible harms, particularly for smaller institutions and resource-constrained websites. High volumes of automated requests can slow or even crash servers and disrupt access for legitimate users. This is especially salient when institutions cannot afford more sophisticated solutions such as load balancers (as the GLAM-E Lab reported, “[some] may operate on infrastructure so rickety that it regularly crashed even before bots started showing up.”). Sites such as Google or Amazon can easily absorb thousands of HTTPS requests, whereas a less well-resourced site can be knocked offline far more easily.

As a policy matter, a legal regime that conditions enforceable access restrictions on the sophistication of the technical implementation—as Reddit and Google advocate in their lawsuits—would privilege the most well-resourced actors. Allowing access control only when the barrier is technically sophisticated would give an unfair advantage to large, well-funded companies while leaving the smaller institutions most vulnerable to scraping bots exposed to the same risks. And if the law were to allow private firms to exclude others from their sites at will, that authority should apply uniformly: even a simple “keep off the grass” sign would have to be deemed enforceable against scrapers.

Back in December 2024, the UNC Libraries’ online catalog was overwhelmed by a massive webscraping operation that generated hundreds of simultaneous queries and blocked access for university users. And in February 2025, the biodiversity database Discover Life ground to a halt when bots scraped its image collections, temporarily blocking researchers from accessing the site. Similar incidents have affected GLAM repositories and archives more broadly, where surges in automated traffic have slowed access and even crashed servers.

These are exactly the kinds of tangible harms that tort and contract law are well equipped to address; copyright’s anticircumvention regime is not. In its latest motion to dismiss, Perplexity argued that SearchGuard is not the type of digital lock the law was intended to protect because SearchGuard merely restricts one method of access: large-scale automated copying. Perplexity contends that because Google’s search results remain “freely available to the public over the Internet” via any ordinary query, SearchGuard does not “effectively control access” to the work. While this may be a sound argument resting on the technicalities of §1201, we hope courts will ultimately reject the plaintiffs’ broader premise that webscraping should be restricted by copyright law at all, when the data is uncopyrightable to begin with or is scraped for fair use purposes.

Courts are already demanding that plaintiffs claiming harm from webscraping show actual damages. In X Corp. v. Bright Data, a federal court recently granted summary judgment against the platform formerly known as Twitter, ruling that the plaintiff failed to show that the scraping of public data caused any actual damages, and that without such a showing its breach of contract and trespass to chattels claims could not be sustained. A similar line of reasoning should adequately resolve the §1201 anticircumvention claims advanced by Google and Reddit: courts should focus their analysis on whether there has been actual harm to the plaintiffs’ web infrastructure.

As a practical guiding principle, we have previously urged institutions to evaluate webscraping through the lens of their commitment to Open Access and to the public good. Our shared goal remains to maximize the dissemination and impact of high-quality scholarship in academia and, more broadly, to ensure that creators can directly determine how their work is shared and reused. We hope the judges in these pending suits will ensure that creators and the public alike are not locked out by tech companies’ digital fences—whether ToS or TPMs.

