The Public Interest Corpus Update – Oakland Edition

Center for Library & Instructional Computing Services, Undergraduate Library, 1986

The Public Interest Corpus recently completed the last of three planning workshops. The final workshop was hosted at the University of California Office of the President in Oakland, CA, and built on findings from prior workshops held at Northeastern University and New York University Law School. A diverse group of stakeholders helped sharpen The Public Interest Corpus implementation plan by contributing expert insights on the following topics: (1) Users, Uses, and Managing Legal Risk, (2) Data Development and Access, (3) Multi-stakeholder Governance, and (4) Sustainability.

Users, Uses, and Managing Legal Risk 

We began the day with a presentation and discussion of proposed Public Interest Corpus users, anticipated types of uses, and organizational approaches to managing legal risk. The discussion first addressed which users to prioritize for this corpus. It drew on insights gained from the planning process, on observations about the evolving legal and risk environment (particularly outcomes in cases such as Bartz v. Anthropic and Kadrey v. Meta), and on the growing disparity in access to books for AI applications between the commercial sector and academic and research settings. On that basis, the project team recommended that the implementation phase for The Public Interest Corpus should primarily serve academic users by providing full-text access to open and in-copyright books data for AI training and computational research more generally. In concert with this recommendation, the project team introduced practical measures that could help organizations manage legal risk associated with academic research use of The Public Interest Corpus.

Some takeaways from this workshop discussion:

  • Developing an effective data sharing agreement is key. An effective data sharing agreement must address factors including but not limited to (1) striking a balance between supporting ideal research practices (e.g., reproducible research) and managing legal risk (e.g., prohibitions on in-copyright data sharing), (2) clearly addressing issues pertaining to downstream use (e.g., data use vs. multi-sector use of models developed from those data), and (3) ensuring that the agreement is designed in such a way that it can be readily adopted by research organizations with variable appetite for legal risk.
  • An implementation phase must center researchers in service development. Public Interest Corpus services need to be tested hand in hand with researchers, including additional metadata creation to accompany data releases (e.g., data bias, data quality) and the means to evaluate, select, and securely access data.

Data Development and Access 

Following the Users, Uses, and Managing Legal Risk session, the project team sought feedback on proposed Public Interest Corpus services that could add value to books data (e.g., curation, transformation) as well as the technical means to provide secure full-text access to books data. 

Some preliminary takeaways from this workshop discussion:

  • The Public Interest Corpus should pursue multiple strategies that encourage book data attribution. Attribution is key to author credit and the integrity of research produced using Public Interest Corpus data. Participants discussed a range of attribution options from simple readme files to more granular forms of attribution.
  • The Public Interest Corpus should provide additional metadata with data releases that accounts for data limitations (e.g., linguistic bias, outmoded forms of knowledge present in the corpus), data transformations, and data quality (e.g., OCR quality). Researchers emphasized the need for this metadata because it helps them evaluate the research potential of the data. A variety of models exist to guide the creation of additional contextual metadata, such as Hugging Face's model card and the Data Nutrition Label.
  • The Public Interest Corpus should work with research libraries to assess and plan for how to support researcher use of the Public Interest Corpus. With The Public Interest Corpus primarily focused on data development, access, and use, it is essential to work with research libraries to establish handoffs between Public Interest Corpus services and research library services. 

Multi-Stakeholder Governance 

Time and again, project stakeholders have emphasized the importance of governance. The pace of change in this space is rapid, and the complexity of coordinating effort across multiple roles and sectors requires a thoughtful and effective approach to governance.

Some preliminary takeaways from this workshop discussion:

  • Governance should provide a level playing field for stakeholders to guide The Public Interest Corpus. Given a commitment to advancing the public interest, governance opportunities must provide equitable opportunities for a diverse range of organizations to guide strategy.
  • Governance opportunities as well as advisory opportunities should be provided to stakeholders. As with any well-structured community effort, The Public Interest Corpus will have needs best served by governance and other needs best served by stakeholders operating in an advisory capacity. The Public Interest Corpus should provide both opportunities for engagement.

Sustainability 

In the closing session, we asked workshop participants for feedback on a Public Interest Corpus sustainability model. It was our sense, given past experience combined with an assessment of the financial health of the higher education sector and disruptions to the Federal and private funding environment, that the Public Interest Corpus must diversify funding streams in order to achieve sustainability. 

Some preliminary takeaways from this workshop discussion:

  • Moving from a startup phase to a sustaining phase is likely a 4-5 year effort. Diversification of funding is key to the startup phase, requiring significant up-front investment. Over time it will be essential to reduce reliance on initially diversified funding sources (e.g., Federal funding, private funding, commercial partnerships) by transitioning to a funding model that is majority funded by Public Interest Corpus member contributions.
  • Encouraging broader use of the AI infrastructure. In addition to the training corpus and related services, the workshop also explored using the same infrastructure with the Model Context Protocol (MCP) as another front end that would reach a broader audience and serve other kinds of uses. MCP is an open standard that enables LLMs to connect with tools and resources on demand to help answer queries and run analyses. It could allow for smaller and more nimble uses of the metadata and full text of books within AI environments, such as commercial chatbots and open source AI tools run by universities and researchers.
  • Commercial partnerships must align with The Public Interest Corpus principles and goals. The Public Interest Corpus principles and goals were developed with iterative feedback from community stakeholders throughout the planning process. The principles and goals are intended to ensure that The Public Interest Corpus maintains its commitment to the public interest.

Moving Forward 

In December, we will release (1) the final version of The Public Interest Corpus principles and goals and (2) our core deliverable – lessons learned from the planning phase and the direction we believe The Public Interest Corpus should take as it moves toward an implementation phase.

