
We recently received a question regarding AI scraping of Institutional Repositories, by which we mean online digital archives that provide access to the intellectual output of scholars, often affiliated with a specific institution. The question came in two parts: (1) Do open access Institutional Repositories permit the use of the IR’s materials for AI training? And (2) are there legal mechanisms, such as Creative Commons licenses, that would prevent the use of IR materials in AI training?
Below, we’ll take on these two questions and offer a third question, one we think all institutions should take equal care to explore: If expanding the reach and impact of your scholarship is one of your highest priorities, how much should you insist on controlling its use for AI training?
Do open access Institutional Repositories permit the use of the IR’s materials for AI training?
Absent a set of robust and tailored barriers to the use of IR materials, the answer to this question will typically be yes, by default: AI scrapers do not wait for explicit permission, so IR materials are very likely already being used in AI training.
AI bot scraping of the web is a widely known and reported phenomenon, one that creates enormous costs for online communities (“some open source projects now see as much as 97 percent of their traffic originating from AI companies’ bots, dramatically increasing bandwidth costs, service instability, and burdening already stretched-thin maintainers”) as well as enormous backlash (see, e.g., this piece from Mike Masnick, “We’re Walling Off The Open Internet To Stop AI—And It May End Up Breaking Everything Else”).
If you’re hosting resources online, you’re likely experiencing increased web traffic related to artificial intelligence. And if your IR is not currently taking steps to throttle AI bot traffic, then it is (perhaps unwittingly) facilitating the use of IR materials for AI training, regardless of whether your institution explicitly sanctions that use.
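The weakest, and most common, line of defense is the robots.txt protocol: a plain-text file at the root of a site asking crawlers to stay away. Compliance is entirely voluntary, which is why it so often fails against AI bots. A minimal sketch follows; the user-agent strings are ones the relevant companies have publicly documented (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended as Google’s AI-training opt-out token), but they change over time, so verify them against current documentation before relying on this.

```
# robots.txt: advisory only; honored by compliant crawlers, ignored by others
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else may crawl normally
User-agent: *
Allow: /
```

Note that this file expresses a request, not a barrier, which is why the technical measures discussed below matter.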
In the United States, web scraping of publicly available online materials has largely been justified by fair use, which allows for copying without first obtaining permission in certain circumstances, especially when the copy is reused for a “transformative” new purpose. Services that we use every day, from Google’s search engine to the Internet Archive’s Wayback Machine, depend on fair use to scrape the web so their services can work. AI companies have also relied on fair use as they’ve acquired data to train their models.
Are there legal mechanisms, such as Creative Commons licenses, that would prevent the use of IR materials in AI training?
To begin, Creative Commons licenses are not designed to limit or restrict AI training, particularly any AI training that would qualify as a fair use or fall under another exception or limitation to copyright. Creative Commons addresses this very question in Understanding CC Licenses and Generative AI (“the licenses do not supersede existing limitations and exceptions; in other words, as a licensor, you cannot use the licenses to prohibit a use if it is otherwise permitted by limitations and exceptions to copyright.”).
So far, the Anthropic and Meta cases indicate to us that at least some AI training will be considered fair use. So, no, Creative Commons licenses are not robust mechanisms for preventing AI training.
It is possible for a website to limit the use of its materials for AI training by adopting restrictive Terms of Service/Terms of Use designed to control subsequent uses of the materials. Restrictive terms will be effective only to the degree that (1) AI bots and developers pay heed to them (your mileage will vary; it is well known that many AI bots ignore such terms); and (2) you are willing and able to enforce your terms, up to and including litigation, which can be time-consuming and expensive and may ultimately distract from your mission.
As we have explained elsewhere, use of contract terms to limit fair use can have some seriously bad effects on future preservation and research, so we generally recommend against it. Given that many AI bots will ignore terms of service and that a typical IR may not be well positioned to enforce them, this method of control is unlikely to be effective.
Technical measures may be more effective than legal measures, but come with tradeoffs.
Technical measures that prevent the acquisition of IR materials at scale, thus thwarting at least some of the most abusive AI scraping, are ultimately likely to be more effective than restrictive licenses or IR Terms of Service. Technical measures cannot simply be ignored, in the way that contractual restrictions might be.
We don’t pretend to know everything about the latest technical methods for preventing bots from scraping an IR, but it is clear that technical mechanisms are available and in continuous development. Earlier this year, Cloudflare began blocking AI crawlers by default for the sites it serves. IRs can consider a range of approaches to limiting AI scraping, including bot-management services like Cloudflare’s, requiring authentication to access IR materials, and rate-limiting downloads by IP address. Each of these approaches will help limit bot traffic, but each may also erect barriers that prevent or limit desired uses of the IR (e.g., the use of IR materials for a text and data mining project initiated by a scholar at the institution).
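To make that tradeoff concrete, here is a minimal sketch of the per-IP approach, written as a Python WSGI middleware with a token bucket per client address. The names are hypothetical and this is an illustration, not a recommendation: production deployments typically rate-limit at a reverse proxy or CDN (nginx, Cloudflare) rather than in application code.

```python
import time
from collections import defaultdict

class PerIPRateLimiter:
    """Toy token-bucket rate limiter keyed by client IP (illustrative only)."""

    def __init__(self, app, rate=1.0, burst=10):
        self.app = app        # the wrapped WSGI application
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum bucket size
        self.buckets = defaultdict(lambda: (float(burst), time.monotonic()))

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "unknown")
        tokens, last = self.buckets[ip]
        now = time.monotonic()
        # Refill the bucket in proportion to elapsed time, capped at burst.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            # Over the limit: reject. Note the tradeoff: a legitimate
            # text-and-data-mining project behind a single campus IP is
            # throttled exactly like an abusive crawler.
            self.buckets[ip] = (tokens, now)
            start_response("429 Too Many Requests", [("Retry-After", "1")])
            return [b"Rate limit exceeded"]
        self.buckets[ip] = (tokens - 1.0, now)
        return self.app(environ, start_response)
```

The comment in the rejection branch is the crux: because many legitimate campus users share a handful of outbound IP addresses, blunt per-IP limits punish exactly the local scholars an IR exists to serve.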
If expanding the reach and impact of research is your priority, how much should you seek to control the use of materials for AI training?
As we’ve argued elsewhere, we think that Institutional Repositories and the scholars who share their works through them should take a moment to evaluate their commitment to Open Access when considering restricting access to AI developers.
We deeply appreciate that there are myriad reasons to be concerned about AI, and that the infrastructure costs of AI scraping are very real. For some institutions and their constituencies, these considerations may simply be too significant, too important, and too expensive to ignore.
We have argued previously that IRs should also weigh a renewed commitment to Open Access as they think through how they control access to their materials. In the end, a commitment to Open Access is a commitment to the public good, one that views the broad dissemination of ideas and knowledge as foundational to progress. If we are genuinely dedicated to the broadest possible sharing of our work and research outcomes, then overly restrictive approaches to AI development make little sense.
If we want AI to be developed with the best information and research available, then we should work to ensure that AI tools have unencumbered access to the kinds of peer-reviewed scholarship, high-quality datasets, and research library digital collections found in IRs. If we wish for our ideas to propagate and would like to see our work appropriately credited, then designing our repositories to facilitate AI use (e.g., with Retrieval-Augmented Generation (RAG) based systems, sketched below) is more likely to accomplish that goal. If we want to mitigate bias and other harms presented by AI developed from less robust and diverse data sources, then unrestricted access to the full diversity of our IRs is imperative.
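As a toy illustration of the retrieval step in such a RAG pipeline, the sketch below ranks repository records against a query using bag-of-words cosine similarity. A real system would use dense embeddings and a vector index, and the record fields here are hypothetical; the structural point is that each retrieved passage carries its citation, so an AI answer built on it can credit the underlying scholarship.

```python
import math
from collections import Counter

def _vec(text):
    """Bag-of-words term counts (a stand-in for real embeddings)."""
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, records, k=3):
    """Return the k records most similar to the query, citations intact."""
    q = _vec(query)
    ranked = sorted(records, key=lambda r: _cosine(q, _vec(r["text"])), reverse=True)
    return [(r["citation"], r["text"]) for r in ranked[:k]]

# Hypothetical records; real ones would come from the IR's metadata and full text.
records = [
    {"citation": "Doe (2023), handle 1234/567",
     "text": "open access policy and institutional repositories"},
    {"citation": "Roe (2022), handle 1234/890",
     "text": "curating training data for machine learning"},
]
print(retrieve("open access repositories", records, k=1))
```

A repository that exposes clean, citable records in this way makes attribution the path of least resistance, which is precisely the outcome Open Access advocates should want.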