Releasing The Public Interest Corpus Principles and Goals

Today, we are pleased to release The Public Interest Corpus Principles and Goals. This release builds on the recap of our final planning workshop and anticipates the release of our final deliverable later this month.

Early in The Public Interest Corpus planning process, our advisory board encouraged us to dedicate significant effort to developing principles and goals informed by multiple stakeholders. The principles and goals are intended to support collective decision making as The Public Interest Corpus moves from a planning phase to an implementation phase.

Over the course of the year, we iteratively developed the principles and goals with feedback from a diverse set of stakeholders (e.g., researchers, authors, librarians, publishers, technologists) across the United States. Hundreds of contributions later, we believe we have landed on a set of principles and goals that align interests and lay a path for delivering concrete value to the communities that libraries aim to serve. Special thanks to The Public Interest Corpus contributors and the advisory board.


Principles and Goals

The Public Interest Corpus works with a growing coalition of stakeholders to develop a service that advances the library community’s ability to support the responsible use of their collections for AI research and development and computational research more generally. The initial focus of the service is on a corpus development, discovery, and access solution for books data (digitized and/or born digital text with metadata) at scale. Some estimates suggest that ~162,000,000 books have been created globally, with ~2,200,000 new books published each year. Collectively, libraries steward the most comprehensive source of human inquiry recorded in book form. 

The Public Interest Corpus is inspired by open corpus development efforts from organizations like Wikipedia and PLEIAS. The Public Interest Corpus is also encouraged by efforts like the Institutional Data Initiative and European Books Data Commons. The Public Interest Corpus builds on these efforts by working to provide access to in-copyright as well as public domain books data on terms that are legal and ethical.

Academic researchers in particular have a pressing need to gain access to books data at scale, but they face numerous challenges. To start, accessing in-copyright books data for AI development is extraordinarily expensive. The public interest is not well served by barriers that de facto restrict books data access to the wealthiest for-profit technology companies. Furthermore, in-copyright books data is typically excluded from broader use on an overly narrow licensing basis, weakening transparency and public trust in AI-driven research. Compounding the difficulty, researchers leveraging computational methods face a piecemeal data access ecosystem, where access to data must be pursued across numerous sources with variable pricing, policies, and licenses. Given significant administrative and financial barriers, researchers working within and outside of higher education find themselves pulled toward data that is of less than optimal quality, exhibits significant biases, lacks comprehensiveness, and is made available in a manner that is both legally and ethically problematic.

AI researchers and developers must also contend with the fact that many books – public domain and in-copyright – are simply not digitized. Without digitization, the potential of one of humanity’s most comprehensive knowledge bases cannot be realized for AI research and development. Digitization of book collections at scale will require significant, ongoing public and private financial investment. It is essential that contracts supporting public and private digitization partnerships contain terms that safeguard the ability of libraries to combine, enhance, and provide access to data produced through these partnerships.

To support research, teaching, learning, and new forms of creativity, books data should be made available to researchers in accordance with the law and normative community expectations such as promoting author attribution and working to ensure that data bias is well documented. Making books data available at scale strengthens researchers’ ability to leverage AI and other computational methods to meet public interest challenges like fighting misinformation, strengthening understanding of the past and present, and fostering an informed citizenry. It also enables the development of more focused corpora that support fine-tuning existing models and/or the development of small models for tailored use cases.

The Public Interest Corpus depends on key partnerships with libraries, publishers, researchers, and minoritized communities represented in collections that form the corpus. In working with The Public Interest Corpus, libraries will advance their research support mission by combining collections from many organizations in order to produce the most comprehensive, high-quality corpora for research use; publishers will advance their mission by multiplying the impact of works created by authors they support; researchers will guide corpora development to ensure that corpora are optimally usable; and communities represented in collections will help ensure that corpora are curated and provisioned in a responsible manner – i.e., documenting the presence of books that contain outdated, stolen, and/or harmful knowledge about minoritized communities.

What principles guide The Public Interest Corpus? 

  1. The Public Interest Corpus … advances equitable access to books data for small, medium, and large organizations.  
  2. The Public Interest Corpus … supports AI research and development and computational research that addresses public interest challenges (e.g., fighting misinformation, advancing understanding of the past and present, fostering a more informed citizenry).
  3. The Public Interest Corpus … addresses corpus limitations (e.g., linguistic bias, outmoded forms of knowledge present in the corpus, and data quality) through production of additional metadata in line with efforts like the Hugging Face Model Card and Data Nutrition Label.
  4. The Public Interest Corpus … commits to transparency with respect to corpus composition, modification, and agreements in order to increase public trust in research that makes use of the corpus. 
  5. The Public Interest Corpus … values the labor of content creators and works to ensure that their work is recognized through promotion of credit and attribution practices. 
  6. The Public Interest Corpus … adopts practices and infrastructure that aim to reduce the environmental impact of corpus development, discovery, and access. 
  7. The Public Interest Corpus … forms partnerships that concretely address long-term collective needs of academic libraries and the communities they serve (e.g., maximizing access, reducing legal encumbrances). 
  8. The Public Interest Corpus … is fundamentally guided by diverse stakeholders including but not limited to researchers, librarians, publishers, authors, and technologists.

What goals should The Public Interest Corpus work to achieve?

  1. Coordinate books data sourcing, discovery, and access across small, medium, and large organizations. 
  2. Create cost efficiencies in access to books data. 
  3. Minimize legal risk for those that seek to provide or make use of books data. 
  4. Curate and provide access to fit-for-purpose books data that exceeds in quality and comprehensiveness what is otherwise available. 
  5. Ensure consistent corpus growth and refinement over time in alignment with user community needs. 
  6. Identify and adopt scalable author credit and attribution methods for authors and rights holders to track reuse. 
  7. Deliver minimum viable solutions.
  8. Adopt a fit-for-purpose governance model.
  9. Develop a sustainability model that reduces barriers to books data access for small, medium, and large organizations on an ongoing basis. 
