
This is the first in a series of essays about the choices that knowledge institutions face as AI reshapes how their collections and resources are used. The people on various sides of these debates are, for the most part, deeply committed to access to knowledge — the disagreement is about means. Some believe that the best way to protect access to knowledge is to impose new conditions on how AI can use it and for what purposes. I think that approach is more likely to undermine access than to save it. This post in particular is about the risk of libraries and archives reimagining themselves as gatekeepers of AI-era access. We will be publishing these essays as a series titled “AI & Access to Knowledge.”
---
Library and Archives 101, pop quiz:
A patron walks in and wants to use an item in your library or archives. You should:
A) Have the patron agree to a legal contract stating that any subsequent use of the item must follow strict attribution rules that credit the library and archives.
B) Tell them they can only use that item in ways that align with the values and ethics of organizational leadership.
C) Warn the patron that if they engage in any future use of that item that the institution disagrees with for whatever reason, they must stop immediately and promise to forget anything they learned from using the item.
D) Lend the patron the item.
---
As AI mania has swept the world, it has led to some surprising realignment of values, particularly among those who have long supported openness and access to knowledge.
Libraries and archives have not been exempted,[1] and many have experienced collateral damage from a broader set of AI disputes. On one side, publishers attempt to override research freedoms for AI and text data mining researchers by demanding novel contract terms that limit fair use. On the other side, swarms of bots relentlessly scrape digital library collections, raising costs significantly just to keep that infrastructure online. The result has been traumatic and destabilizing. But for the most part, libraries and archives have navigated these challenges well, relying on long-established values that prioritize access to knowledge and free inquiry, while also acknowledging that infrastructure resources are scarce, so in some cases, tools like rate limits or captchas may be necessary to maintain access for the widest possible audience.
But some responses have gone further. Policies are beginning to appear that condition access on institutional control, treat collections as assets to be leveraged, and adapt the language of rights-holder playbooks—consent, credit, and compensation—rather than the language of library values. These frameworks deserve close examination, because the assumptions that underlie them could, if adopted broadly, push libraries and archives away from their foundational commitments to access and free inquiry.
I want to be clear at the outset that I think many of the concerns about AI use of archives that motivate this rethinking are genuine. Archival work in particular carries relational obligations that must be handled with care. Archives hold the records of communities, living people, and donors who entrusted materials under specific conditions. The custodial relationship an archivist maintains with a community is a core part of that work and is in fact critical to the longer-term preservation and access of those collections. My argument is not that these concerns are overstated, but that some of the policy frameworks now being proposed are the wrong response to them.
This impulse is showing up across the cultural heritage sector internationally. For instance, the KB (the National Library of the Netherlands) has restricted access to its major digital collections for commercial AI training—while making the same materials available to a government-backed Dutch language model. Academic scholarship has framed AI training on cultural materials as a form of digital expropriation. And more broadly, institutional discomfort with AI is beginning to translate into conditions on users—not just internal governance about how a library itself deploys AI tools, but external restrictions on who may access collections and for what purposes, based on the institution’s own ethical stance toward the technology.
The most detailed articulation of this approach that I’ve found so far is the University of Virginia Archival AI Protocol, a 19-page document that bills itself as an “AI training and access standard for archival organizations.” I’ll draw on it throughout this essay, not to single it out, but because it makes explicit several principles that are often implicit in these conversations, and I fear these principles may gain traction and wider adoption. The Protocol’s core rule is blunt: “No access without control. Irreversible models do not get access unless provenance and institutional control are real.” It is organized around three pillars—provenance and attribution; donor and community responsibilities; and institutional control—each of which raises important questions about what libraries and archives are for.
Why ‘control’ is the wrong framework for libraries and archives
The UVA Protocol’s third pillar, institutional control, starts from the premise that training large AI models transforms source materials into parameters that “cannot realistically be reversed.” From this it follows, according to the Protocol, that “to maintain institutional control, the archival organization asserts a ‘Right to Stop’ where the organization can order the cessation of materials ingest and demand the decommission or destruction of the model.”
This is an unusual framing for a library or archives and should give us pause. AI has generated dozens of cases about the extent to which copyright holders can flex their rights to prevent AI use, so far with mixed results. Groups like the Association of American Publishers, the Copyright Alliance, and the RIAA have adopted “consent, credit, and compensation” as a mantra. The Protocol’s framework—with its insistence on institutional control, attribution as a precondition to access, and provisions for compensation and benefit-sharing—reads like a version of that same playbook.
“Institutional control” as the core organizing principle for a document about use of special collections materials is jarring because it is so far removed from what libraries and archives do. “Steward,” yes. “Preserve,” of course. But “control”? When I review documents like the Society of American Archivists’ Core Values, or the American Library Association’s Library Bill of Rights, I read about the importance of broad access on a nondiscriminatory basis. I encounter very little that would support the idea of libraries or archives as gatekeepers who condition access on how materials will ultimately be used.
Now, to be fair, the UVA Protocol applies to archives and special collections, not general collections. I’ve in some ways lumped libraries and archives together here, but I don’t think unfairly, because they have essentially the same goals of preserving and providing access to knowledge, though their means vary. For example, archives and special collections sometimes necessarily collect more private information about certain users (e.g., for security purposes, to protect highly valuable physical objects). Special collections and archives also hold a wide variety of materials, including personal letters, papers, and home photographs, that may not have been created for widespread dissemination, raising challenges about how to balance access with potential harms to the creators or subjects of those materials.
In some cases, broad access to certain materials can expose sensitive private information or be ruinous to the personal lives of those who are the subject of the records at issue. I’ve advised on certain collections that are so sensitive that reuse or even mere access could put living individuals in danger of physical harm. And of course, special collections and archives are often acquired under highly negotiated terms that donors may feel necessary to impose, and that sometimes involve tradeoffs. These may include temporary embargoes on access, restrictions on specific materials, or requirements of supervised use.[2] In this regard, it would be disingenuous to suggest that the current thinking about closing off access for AI doesn’t have some historical precedent—many libraries and archives hold materials subject to donor-imposed terms that restrict access or use based on a variety of factors. Some of those terms are less principled than others (for example, restrictions that condition viewing an item on the donor’s individualized, personal permission), and these are not ones to emulate.
Archivists working with oral histories, community records, or the papers of living individuals maintain ongoing relationships with those communities that are genuinely different from, say, a librarian managing a collection of books. These relationships involve trust-building and negotiation over how materials are described and contextualized. That custodial work is real and important to protect.
Archives have long had to implement access controls in certain targeted circumstances, but those have been proportionate responses to identifiable harms to others, following highly contextual review processes (e.g., these excellent “responsible access workflows” by UC Berkeley). I do think that AI has changed that risk calculus given the scale and sophistication of the technology, and it makes sense for institutions to reassess digital collections with an eye toward whether use as training data, or in other ways with AI systems, may pose enhanced risks (for example, whether AI tools could inadvertently or intentionally de-anonymize the subjects of collections in ways that create new dangers for them).
These are important concerns, but they are also tightly connected to specific materials with specific risks, which in some cases must be restricted. An approach that restricts AI use across entire collections, or even an entire digital repository, is more like an on-off switch, and it is incongruent both with past practice and with the actual risk posed. It imposes a restriction triggered not by the sensitivity of particular materials but by the type of technology the user employs.
Provenance matters, but not as a gate to access
A leading justification for exercising control over AI access is that libraries and archives care—and care a lot—about provenance and attribution. The UVA Protocol, for instance, has as its first pillar this threshold: “If the AI cannot cite its source, it cannot use the archival material. Where provenance and attribution cannot be maintained to the organization’s standard, archival material will not be used for AI training or for public-facing AI services.”
The Protocol’s Appendix B actually sets out a detailed and largely sensible standard for how AI outputs should cite archival sources: item-level granularity, persistent identifiers, hyperlinks back to collection records. For retrieval-augmented generation systems that search a collection at the query stage and return results with citations, these are achievable and worthwhile goals. I’d welcome their adoption.
But the Protocol’s core rule goes much further. It demands that provenance and attribution be maintained at the training stage—that is, that the archival organization be able to trace the influence of its materials through the interior of a trained model. This is where the Protocol departs from what is technically possible and, I’d argue, from the kind of access controls we impose on any other user.
The field of training data attribution (sometimes grouped under the umbrella of “explainable AI” or XAI) has made real progress in recent years. Techniques like influence functions, data Shapley values, and TracIn can produce estimates of how much a given training example contributed to a particular model output. But what these methods produce is not attribution or “provenance” in any sense an archivist would recognize. They produce probabilistic approximations—a fuzzy, reverse-engineered guess about the degree to which a training input nudged model weights in a direction that later influenced an output.
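To make concrete why these scores are estimates rather than citations, here is a minimal sketch of a TracIn-style influence calculation on a toy linear model. Everything in it is a hypothetical stand-in: the synthetic data, the tiny model, and the single-checkpoint simplification. Real attribution work runs computations like this across training checkpoints of models with billions of parameters, and still yields only a ranking of likely influences.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "corpus": 100 training examples with 5 features each, and a
# linear model fit to noisy targets. All synthetic and illustrative.
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
w = np.linalg.lstsq(X, y, rcond=None)[0]  # the "trained" model

def grad(x_i, y_i):
    """Gradient of squared-error loss for one example w.r.t. the weights."""
    return 2.0 * (x_i @ w - y_i) * x_i

# TracIn-style score: the dot product between each training example's
# gradient and the gradient at the output we want to attribute. The result
# is an *estimate* of influence, not a citation; there is no record inside
# the model linking any output back to a specific source document.
g_query = grad(X[0], y[0])
scores = np.array([grad(x_i, y_i) @ g_query for x_i, y_i in zip(X, y)])
print("Top 5 'influential' examples (estimated):", np.argsort(scores)[::-1][:5])
```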
The relationship between a training document and a model’s output is not like the relationship between a source and a citation. In this way it is very similar to how many of us learn. Scholars read and build upon all sorts of materials, including facts and ideas that are absorbed and then later relayed, but never expected to be traced cleanly back to a given source. In AI models, too, outputs are mediated by billions of parameters with no direct correspondence to individual training documents, so efforts to identify sourcing are fundamentally probabilistic. There is no chain of custody to trace, because there is no chain.
This matters because the Protocol treats the inability to provide deterministic training-stage attribution as a reason to deny access entirely. If the AI cannot trace its outputs back through its parameters to specific archival items used in training, the material cannot be used. But this sets up a standard that no current technology can meet—and that may reflect a real impossibility rather than merely an engineering gap. Conditioning access on a capability that cannot exist is, in practice, a blanket prohibition. And a blanket prohibition is a very different thing from a provenance standard.
There is also a tension within the Protocol itself. The Protocol’s preferred architecture is Retrieval-Augmented Generation (RAG)—a process that keeps source materials under institutional control and cites them at query time. And the Appendix B citation standard is designed for exactly that kind of system. So the Protocol already contains a workable answer to the attribution problem within its own pages. The training-stage attribution requirement isn’t filling a gap in the Protocol’s own framework; it’s adding a second, much more demanding requirement on top of a first one that already works. That makes the training-stage requirement look less like a genuine provenance concern and more like a mechanism for blocking a category of use the drafters don’t like.
I think attribution is incredibly important if we want AI tools that actually help people learn and verify claims. But the place where attribution does its useful work is at the point of output—when a user asks a question and receives an answer that they need to be able to check. That’s an output-side problem, and it’s one that RAG architectures are increasingly well-suited to address.
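To illustrate what output-side attribution looks like in practice, here is a minimal sketch of a RAG-style lookup with item-level citations. The collection, the identifiers, and the keyword scoring are hypothetical stand-ins; a production system would use vector search and a language model for the generation step. The point is the citation trail: every response links back to a persistent identifier for the item it drew on.

```python
from dataclasses import dataclass

@dataclass
class ArchivalItem:
    persistent_id: str   # e.g., an ARK or Handle resolving to the collection record
    title: str
    text: str

# A hypothetical two-item "collection" standing in for a digital repository.
COLLECTION = [
    ArchivalItem("ark:/00000/item1", "Oral history, 1974", "flood relief efforts in the valley"),
    ArchivalItem("ark:/00000/item2", "Town council minutes", "budget debates over the new library"),
]

def retrieve(query: str, k: int = 1) -> list[ArchivalItem]:
    """Naive keyword-overlap retrieval; a real system would use embeddings."""
    q = set(query.lower().split())
    ranked = sorted(COLLECTION, key=lambda it: -len(q & set(it.text.lower().split())))
    return ranked[:k]

def answer(query: str) -> str:
    sources = retrieve(query)
    # Generation would normally be an LLM call over the retrieved passages;
    # here we simply echo them. Because sources are fetched at query time,
    # the citation is deterministic, not a probabilistic reconstruction.
    body = " ".join(s.text for s in sources)
    cites = "; ".join(f"{s.title} ({s.persistent_id})" for s in sources)
    return f"{body}\n\nSources: {cites}"

print(answer("what happened with the flood relief"))
```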
Libraries and archives typically don’t assert preemptive control over how their materials are used, even when those uses implicate practices we care deeply about (like proper citation), and for good reason. Can you imagine libraries restricting access to content for students with a history of poor citation practices? Or telling an author accused of plagiarism that they are no longer welcome? Of course we don’t support such practices; we recognize that the remedy for misuse lies downstream, in the norms of scholarship, in editorial review, and in academic integrity processes, not in restricting access at the point of entry. Libraries and archives provide materials; other institutions and systems govern how those materials are used. That division of responsibility is not a gap in the system. It works precisely because libraries and archives do not make themselves the enforcers of downstream conduct.
How do we manage scraping?
Beyond provenance and attribution, there is a more immediate and practical problem: the sheer volume of automated scraping hitting digital collections infrastructure. Whatever one’s philosophy on access for AI, the inundation of bots has become so severe for some libraries and archives that it has effectively constituted a DDoS attack. The response has been, very reasonably, to limit crawling just to keep servers available to human readers.
Librarians at the University of North Carolina at Chapel Hill Libraries published a detailed account in Code4Lib Journal of their escalating fight against aggressive AI crawlers. The progression is instructive: they moved from traditional client blocking to request throttling, regional traffic prioritization, novel bot-detection heuristics, commercial web application firewalls, and ultimately Cloudflare Turnstile for in-browser client verification. The crawlers adapted at each stage—rotating through residential proxy networks, spoofing user agents, evading IP-based blocks. Duke University Libraries implemented a similar “proof of work” system. Libraries are spending real staff time and real money keeping their digital collections online in the face of traffic that was never part of the original infrastructure plan.
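For the curious, here is a minimal sketch of what one of these measures, request throttling, looks like under the hood: a token bucket per client. The parameters are illustrative, not recommendations. What matters is that the logic never inspects what the client is requesting or why, only how fast.

```python
import time
from collections import defaultdict

RATE = 2.0    # tokens added per second (sustained requests/sec allowed)
BURST = 10.0  # bucket capacity (short bursts tolerated)

# Per-client state: [available tokens, timestamp of last request].
buckets: dict[str, list[float]] = defaultdict(lambda: [BURST, time.monotonic()])

def allow(client_id: str) -> bool:
    """Return True if this request fits within the client's budget."""
    tokens, last = buckets[client_id]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last request
    if tokens >= 1.0:
        buckets[client_id] = [tokens - 1.0, now]
        return True
    buckets[client_id] = [tokens, now]
    return False

# Usage: a crawler hammering the server exhausts its bucket within a few
# requests, while an ordinary reader never notices the limit exists.
for i in range(15):
    print(i, allow("203.0.113.7"))
```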
But notice what characterizes every one of those responses: they are primarily technical measures, aimed at managing traffic so that collections remain available to all users. Rate limiting, robots.txt directives, CAPTCHAs, proof-of-work challenges—these are content-neutral, proportionate, and analogous to things libraries have always done in the physical world: limiting the number of items that can be checked out at once, requiring appointments for fragile materials, closing stacks during low-staffing hours. None of these measures conditions access on what the patron intends to do with the material. They manage scarce resources so that access remains broadly available. As the UNC experience shows, these defensive measures alone are not a permanent solution—crawlers adapt, and institutions scramble to keep up. But that incompleteness does not require, and should not become the occasion for, a fundamental rethinking of whether and how libraries provide access to their digital collections.
Instead, that incompleteness points to a need for better infrastructure for providing access at scale, not toward conditioning access on what users intend to do with materials. The institutions best positioned to solve the scraping problem sustainably may be the ones willing to lean into access rather than away from it. Wikipedia faced the same challenge—bandwidth consumed by AI crawlers surged roughly 50% since 2024, even as human pageviews declined. The Wikimedia Foundation’s response was not to lock down access but to formalize it: it now provides structured data feeds through its Enterprise program to companies like Amazon, Meta, Microsoft, and Perplexity.
The result is that Wikipedia gets paid to support its infrastructure, the crawlers get the firehose access they wanted without hammering the servers, and the content remains freely available to everyone else. Wikipedia didn’t compromise its open-access mission to get there. Instead, it built a channel that makes the mission financially sustainable precisely because it offers more access, not less. Now, this was not an easy task and I’m not suggesting this can happen overnight—it took Wikimedia years to get it right—but I don’t see a reason why libraries and archives could not accomplish the same thing, especially if working collectively.
So where else is all this coming from?
Libraries and archives collect, preserve, and provide access to information because we believe that access to knowledge is a public good. The value of a collection is not realized when it sits undisturbed on a shelf or in a digital repository; it is realized when someone uses it. This is not a new or controversial idea.[3]
So what is driving libraries and archives toward a framework built on “institutional control” and a “Right to Stop”?
I think part of it is a genuine and understandable feeling of lack of agency. Libraries have watched AI companies scrape digital collections with abandon, consuming bandwidth and infrastructure resources while offering nothing in return. We’ve watched publishers use AI as a pretext to impose new licensing restrictions on materials that researchers already had the right to use. We’ve watched the public discourse around AI veer between utopian hype and apocalyptic dread, with very little room for the kind of careful, contextual thinking that librarians and archivists are trained to do. In that environment, the impulse to assert control is natural. If everyone else is grabbing what they can, why shouldn’t librarians and archivists stake their claim too?
Part of this also seems to be about simply getting a good financial deal from those companies that can afford it, especially in a moment when university budgets are being slashed and administrators are increasingly pressured to find new revenue streams. I admit my “Library and Archives 101” quiz at the beginning looks past the reality that there is a huge logistical and financial difference between an undergraduate making an interlibrary loan request for a few books and an AI company seeking access to thousands of works.
So we encounter framing like the UVA Protocol’s, which aims for compensation and benefit-sharing. With its language about “revenue sharing arrangements,” “competitive advantage,” and the right to “seek financial compensation, cost recovery, or in-kind benefits for AI-related uses,” it’s hard not to see a strategic calculation alongside the ethical one. Archives and special collections have long struggled to fund digitization, processing, and preservation. The AI boom has created a moment where technology companies are eager to acquire training data, and some institutions see an opportunity to leverage their collections into partnerships that could fund the work they’ve never had the resources to do on their own. The Google Books project is an obvious precedent: libraries traded access to their collections for digitized copies they could never have afforded to produce themselves.
I understand the appeal of this approach, and I won’t pretend that the resource constraints are imaginary. But the assumption seems to be that archives should treat cutting off access to their collections as one of their sources of negotiating leverage. While it is true that many institutions hold unique materials, that fact alone does not translate into the kind of bargaining power that libraries and archives should be encouraged to cultivate.
This approach also creates a perverse incentive structure. If access to archival materials for AI use is conditioned on negotiated benefit-sharing, institutions have a financial incentive to restrict access in order to preserve their bargaining position. Materials that are freely available online have no leverage value, and merely layering digital materials with an “access framework” or terms of service does little to dissuade such uses. So we have created an incentive to limit access for everyone as a precondition to monetizing collections. That is exactly backwards from what libraries are supposed to do.
I recognize the intuition that libraries should treat corporate users differently from scholars and students — that there is something uncomfortable about a company profiting from collections built through public investment and donor generosity. But libraries have historically resisted sorting users by purpose, and for good reason. The line between commercial and noncommercial use is far less clean than it appears, and in the AI context, access barriers designed to constrain corporations will fall hardest on the smaller, less-resourced actors that libraries most want to serve. The irony of implementing a complex regulatory framework of formal agreements, compensation negotiations, and decommission rights for access is that such a system is navigable by any organization with a legal department, and prohibitive for almost everyone else.
The problems with becoming the AI police
There is a further problem with using access restriction as the primary tool for managing AI-related harm: it puts the library in the role of policing downstream conduct that existing law already addresses. Take, for example, the problem of an archives restricting access to an oral history collection for fear that it could be used to generate deepfakes. A valid concern. But deepfakes generated from recordings of living people implicate fraud, defamation, and right of publicity law, and non-consensual synthetic media is increasingly addressed by a variety of state laws. These mechanisms place accountability where it belongs — on the actor who causes harm — rather than requiring the archivist to make prospective judgments about what a user might do with materials before access is granted.
That is a role libraries and archives have been reluctant to play and for good reason: it requires a kind of user surveillance and intent-parsing that is both technically unreliable and antithetical to library values. The remedy for misuse, like the remedy for plagiarism or poor citation practice, lies downstream — in law, in professional norms, in community accountability — not in restricting access beforehand just in case.
The assertion of control here also has potentially negative repercussions for other parts of the library’s work, such as effectively licensing works for researchers.[4] Libraries have long advocated, on behalf of their own institutional AI and TDM researchers, that publishers should not impose contractual limits that inhibit those users’ fair use rights. It is difficult to keep making a persuasive case for why publishers should refrain if libraries and archives are doing the very same thing with their own collections.
Beyond the line-drawing problem, administering any such framework, and actually making it meaningful by enforcing it, is itself likely to be costly in a number of different ways. Even if libraries and archives aim only to exclude commercial actors, doing so requires first figuring out who those actors are, which necessarily requires surveilling every user. Once they are identified, enforcing the new rules against them would likely be expensive. Libraries and archives can use website terms of service to impose contractual limitations on access and use of digital collections. They could also resort to DMCA claims for circumvention of technical protection measures. Whether either approach is legally enforceable (I have my doubts, but both tactics are currently being litigated), its effectiveness still depends entirely on how many resources the institution is willing to spend going after violators. Given the current level of library funding, I predict not much. So good actors, like academic researchers, will avoid these uses because they are invested in playing by the rules, even when they disagree with them. And bad actors will continue undeterred.
Finally, there is also a dimension of enforcement with costs that are hard to calculate: many of the institutions who hold materials of value for AI training are public universities. A state institution conditioning access to information based on judgments about who the requester is and what they intend to do with materials raises concerns that go beyond professional values — it implicates the very free speech principles that libraries have historically existed to protect. Libraries are accustomed to thinking of themselves as defenders of free expression against external pressure. They are less accustomed to recognizing when they might themselves become the source of it.
We don’t tell documentary filmmakers, or biographers, or historians that their access to collections is conditioned on our ongoing ability to control what they do with what they find. We don’t demand the right to “decommission” a published book that drew on archival sources if we later decide we don’t like how the material was used. We provide access, and we trust that the broader ecosystem of law, professional norms, and scholarly ethics will govern downstream use.
And yet, the ideas that libraries and archives should condition access on control, that collections are assets to be leveraged, and that the institution’s comfort with a user’s technology should determine whether access is granted are circulating broadly. The SAA and ALA values I mention above have not yet been updated for the age of AI, and I fear that if we internalize these ideas without critically engaging with them, libraries and archives will be pushed away from their role as facilitators of access and use, and toward becoming yet another chokepoint.
So what can we do?
This is not a call for passive acceptance of AI company dominance. I understand the anxieties that motivate efforts to exert more control, and they respond to real problems. But the answer to those anxieties cannot be to remake the library in the image of Elsevier or Disney. Asserting control, demanding compensation, and conditioning access on the institution’s ability to dictate the terms of downstream use—that is not what libraries and archives are for.
There is a more productive path than control, and it’s one that libraries are uniquely positioned to lead. The history of technology is full of moments where incumbents tried to maintain control through restriction, and where the most effective response turned out to be building something better that aligned with our own values. The open community didn’t win anything by attempting to strong-arm Encarta or Encyclopedia Britannica; instead, it built Wikipedia, which succeeded because it provided broader access and motivated a community of contributors to share their knowledge with the world. Our communities’ most consequential victories—HathiTrust, PubMed Central, PLOS, arXiv—happened the same way: we created infrastructures that support open inquiry and hewed closely to values of free and open access. Notably, some of our worst embarrassments, such as the slow conversion of OCLC into a litigious corporate metadata monopoly, have happened when we compromised on those values.
Libraries are not going to out-monopoly the monopolists by asserting control over their collections. But they can do something far more important: they can ensure that the institutions, researchers, and communities they serve have a seat at the table—not by withholding access, but by building the infrastructure that supports open, accountable research of every kind.
[1] For some of this post I’m grouping libraries and archives together but acknowledge that while each may have similar end goals, the means of achieving them can differ significantly.
[2] Even these restrictions often cannot be absolute. The most prominent example of why may be the Belfast Case (records held at Boston College relating to violence in Northern Ireland that were under embargo but were forced to be released through litigation). But even outside of litigation, there may be instances where legal process or even state law (e.g., state public record law) forces release. The Ahmad v. University of Michigan case is one prominent recent example, in which the Michigan Court of Appeals concluded that records relating to Dr. John Tanton, held under embargo by the University of Michigan library, must be released because they are public records. In that instance, I do not think it was wise to let state public record law override donor restrictions (who would ever donate sensitive materials to the University of Michigan again?).
[3] In 1931 S.R. Ranganathan published The Five Laws of Library Science, a book that remains one of the most influential philosophical and practical statements about what libraries are for. His first law—“books are for use”—was a direct response to the custodial mindset that dominated libraries of his era, where books were chained to desks, locked in closed stacks, and treated as objects to be preserved rather than read. Ranganathan spent much of his first chapter cataloguing the ways that institutions had organized themselves around control rather than access, and arguing that every element of library operations—location, hours, staffing, shelving, lending policies—should be redesigned to facilitate use rather than restrict it. His third law—“every book its reader”—pushed further, arguing that libraries should adopt open shelving precisely because it allowed materials to find their users through serendipity and browsing. Under closed-stack systems, a patron had to already know what they wanted. Open access meant the collection could do its own work of connecting materials to the people who needed them. The librarian’s role was to create the conditions for discovery, not to stand between the patron and the materials. Ranganathan wrote against a backdrop of colonial control, where access to knowledge was itself a political question. He understood that the impulse to restrict, to gatekeep, to condition access on the institution’s approval of the user’s purposes, was not a neutral administrative choice. It was a choice with consequences for who gets to learn and who gets to participate in the production of new knowledge. Nearly a century later, that insight has lost none of its force.
[4] It’s worth noting that the UVA Protocol explicitly does not apply to public domain materials, which I think is good. Libraries and archives have not always been consistent about refraining from layering constraints on the use of public domain materials. However, it’s not clear to me why the principles the Protocol articulates should apply differently to copyrighted versus public domain materials, given that it does not identify copyright as a central motivation (and in any event, the library or archives is unlikely to hold most of those rights).
