Last month, a diverse set of stakeholders gathered at New York University Law School to contribute to an implementation plan for The Public Interest Corpus. This workshop built upon the first project workshop held at Northeastern University Libraries in the Spring through (1) continued refinement of project principles and goals, (2) documentation of research and library service use cases, and (3) collective ideation on prospective year 1-3 and year 4-6 activities for an implemented version of The Public Interest Corpus.
Continued Refinement of Principles and Goals
As in the Northeastern University workshop, we began the day with an exercise focused on refining The Public Interest Corpus principles and goals. Participants contributed a broad range of comments, edits, and suggestions that greatly strengthened project principles and goals.
Some preliminary takeaways:
- The Public Interest Corpus should advance an equitable sustainability model that is responsive to variation in resources of potential supporting organizations. Participants emphasized that a public interest resource must have a sustainability model that enables small, medium, and large organizations to sustainably offer benefits to their communities.
- Commercial sector digitization contracts should not inhibit libraries' ability to provide access to and support the use of data produced through digitization partnerships. There was broad recognition that commercial partners would continue to be fundamental to library mass digitization efforts. As libraries maintain and/or enter into new digitization partnerships, they should work to remove clauses including but not limited to data embargoes and clauses that are sufficiently vague so as to cast doubt on allowable uses of data.
- The Public Interest Corpus should concretely encourage values-aligned, downstream use. Examples of measures that concretely encourage values-aligned use include but are not limited to user education on data citation, platform features that automatically generate data citations, and/or data sharing agreements that require data citation in the context of a range of research and development scenarios – e.g., a published paper, a generative AI application, and so on.
Documenting Research and Service Use Cases
Significant effort was dedicated to documenting research and service use cases. A research use case exercise elicited common challenges that disciplinary researchers encounter seeking to gain access to and make use of data. A service use case exercise elicited common challenges that libraries encounter seeking to provide services in this space.
Some preliminary takeaways:
Research Use Cases
- The Public Interest Corpus should develop and provide access to corpora that align with multiple notions of comprehensiveness. Researchers noted that determining collection comprehensiveness was context-dependent. In some cases researchers may be looking for as many books as possible to satisfy their definition of comprehensiveness and in other cases they may be looking for a highly curated set of books that correspond to a specific theme. Given the context-dependent nature of comprehensiveness, The Public Interest Corpus should work with user communities to prioritize the creation of corpora at varying scales for specific purposes.
- “Upstream” risk aversion creates an attritional process for “downstream” users. Researchers noted multiple instances where organizational risk aversion toward computational uses of digitized and/or born-digital collections creates a prolonged process for accessing and making use of library collections. Researchers hope that The Public Interest Corpus can create an environment where there is less perceived or real risk for organizations and smoother access to collections for end users.
- Researchers want to contribute enhanced collections data back to libraries. In many cases researchers are re-OCRing received collections data and/or taking steps to improve metadata (e.g., normalization, enrichment) but have no easy way to contribute enhanced data back to the library for the benefit of other researchers. Researchers are motivated to see that their work on data enhancement benefits a broader community.
Service Use Cases
- AI capacity is growing in libraries, but it remains to be seen what the library community can achieve together at the level of infrastructure, data, and services. A number of participants noted investments in local AI capacity but acknowledged that little of this capacity has been joined at the community level. Participants reflected on existing community investments in efforts like HathiTrust and expressed interest in determining optimal levels of multi-organizational collaboration on something like The Public Interest Corpus.
- Justifying potential investment in a community solution remains challenging, though not insurmountable. The Public Interest Corpus should continue developing a value proposition that concretely makes the case for how addressing the stated challenge at scale most effectively meets local research needs – e.g., comprehensive, high quality corpora are by necessity the product of combining collections from multiple organizations.
Envisioning the Future of The Public Interest Corpus
In remaining workshop activities participants ideated on year 1-3 and year 4-6 activities for an implemented version of The Public Interest Corpus.
Some preliminary takeaways:
- To centralize or decentralize? Or both? As conversation turned toward technical implementation, participants weighed centralization, decentralization, or some combination of both as ways to make The Public Interest Corpus work. Various technical approaches were discussed, including but not limited to MCP, vector stores, and APIs feeding a central repository.
- Balance quantity and quality in corpus development. Participants noted the importance of both quantity and quality in prioritizing future corpus development. In some cases quality could outweigh quantity, in other cases quantity could outweigh quality, and in still others a balance could be struck. The Public Interest Corpus should work with a range of stakeholders at varying degrees of intensity to achieve the right balance in corpus development and releases.
- Plan for future collection scope expansion. Multiple participants suggested that The Public Interest Corpus should plan for expansion of collection scope beyond books data moving forward. Collection scope expansion could include archives and special collections materials. Participants expressed that these collections were of high value and not commonly available in existing training data. They also emphasized the need for deep investigation of potential ethical issues with these materials in the event that collection scope expands.
Next Steps
We offer thanks to our brilliant workshop participants – they trekked through a very hot and humid NYC to help plan for The Public Interest Corpus and somehow maintained good spirits throughout the day!
We have one final workshop this October in Oakland, CA. If you work in the region and are interested in potentially attending please let us know here. If you would simply like to learn more about the project and/or discuss possible collaboration please let us know here.
The project team plans to share The Public Interest Corpus Startup Plan by December 2025.