Open Data

Acknowledgment

We extend our gratitude to the many people who contributed to this project. We are grateful for the excellent research of former Georgetown Law students Daniel Hemming, Jasmin Jimenez, and Lara Ellenberg as well as the contributions of the Georgetown Intellectual Property and Information Policy Clinic. We are indebted to Carly Robinson, Jake Carlson and Ali Krzton for their review and helpful comments along the way. Finally, we would like to thank all Authors Alliance members: Your support makes all our projects possible.

The fundamentals

How should I use this FAQ, and who is it for?

This FAQ is designed as a practical guide for anyone interested in engaging with open data, whether you are a researcher working with data now or planning to do so in the future, a librarian or research support professional advising others, or anyone navigating questions about data sharing, reuse, and openness in scholarly work. It is meant to provide guidance for the most common questions we have seen, rather than definitive answers for every possible scenario. Most importantly, the FAQ does not apply the law to any individual’s specific circumstances: we do not provide you with legal advice. Laws, institutional policies, and disciplinary norms can vary, and reasonable interpretations may differ. We strongly encourage readers to consult with colleagues and legal counsel at their own institutions to gain additional perspective and support tailored to their particular situation.

Readers can approach this FAQ in two ways. You may read it straight through as a crash course on open data concepts and norms. Alternatively, each question is written to stand on its own, so you can jump directly to the sections most relevant to your immediate needs and return to others later as questions arise. The goal of this FAQ is to help readers make more informed decisions about open data.

What counts as “data” in research and scholarship?
What is open data?
What does “publishing” data mean?
Why is open data important?
When do we not want to make data open?
What are some common expectations around data sharing?
Do I need approval from my employer or collaborators before publishing my data openly?
How do I publish data openly?
Where can I publish my data openly?
Is data in the public domain already open data?
Does making data open affect trade secrets?
Does making data open affect patent rights?
Does making data open affect copyright?
What other legal restrictions must I consider before sharing data openly?

What counts as “data” in research and scholarship?

What counts as “data” varies across disciplines and even individual research projects. Fundamentally, “data” refers to information that forms the evidentiary basis of research findings and scholarly conclusions. In the sciences, data may refer to technical measurements of observable natural phenomena. In the social sciences, data may consist of interview transcripts or systematic collection and statistical analysis of information about human behavior produced by researchers. In the humanities, there is debate over whether data refers to information produced by researchers, the objects that humanities researchers study (such as texts, images, or audio recordings produced by others), or both. In humanities disciplines, data may include archival sources, interviews, recordings, artifacts, or organized collections of qualitative or quantitative information.

Context also matters. Within any research project, data may include not only the raw observations themselves but also the metadata describing how the data were collected, processed, and interpreted. Increasingly, this contextual information is treated as an integral part of a dataset, especially under the FAIR principles, which emphasize findability, accessibility, interoperability, and reusability.

It is also important to look at definitions of data provided by government agencies, universities, or journals that encourage or mandate the open publication of data. For example, the White House Office of Science and Technology Policy, in its guidance about open access to federally funded research, defines data as “recorded factual material commonly accepted in the scientific community as of sufficient quality to validate and replicate research findings.” NIH similarly defines scientific data as the “recorded factual material commonly accepted in the scientific community as of sufficient quality to validate and replicate research findings, regardless of whether the data is used to support scholarly publications.” PLOS requires researchers to submit their “minimal data set” with any publication in the journal, defining “minimal data set” as “the data required to replicate all study findings reported in the article, as well as related metadata and method.” These definitions are intentionally broad. They recognize that the form of data (numerical, textual, or visual) is less important than its function: to document and substantiate research claims.

Ultimately, what counts as data depends not on a universal definition but on the norms of a field and the policies of the institutions, funders, or journals that govern data management and sharing.

What is open data?

Open data refers to data that is freely available, fully discoverable, and reusable under conditions that permit sharing and modification. It denotes both the lack of legal restrictions and the practical accessibility of the data. Please see “How do I publish data openly?” below for more information.

Open data is sometimes confused with open access. The term “open access” is most commonly used to refer to a publishing model for making academic or creative works available free of charge and with few restrictions on their reuse. According to the Budapest Open Access Initiative, open access refers to the “free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.”

Both “open access” and “open data” are used in discussions of the broader open science and open research movements, which seek to promote the free and open sharing of research results and data. But data can be made open even when it is associated with research articles that are not themselves open access, and vice versa. The guiding principle here is to make data relied on by researchers publicly available free of charge, with as few legal restrictions on reuse as possible.

Open data can also be used to describe the movement among academics and other interested parties to make research and scholarly data as open as possible. See “Why is open data important?” below for more information.

What does “publishing” data mean?

Publishing data is the key step in making data open. “Publishing” data generally refers to making research data publicly available in a stable, discoverable, and citable form. Today this almost always happens digitally, typically by depositing datasets in institutional or public repositories.

Once data is published, it cannot truly be “taken back”: even if you remove what you have uploaded, others may already have accessed or downloaded the data. Especially when the data lack copyright or contractual restrictions, downstream users may continue to reuse the data despite your efforts to retract it. This makes it essential to consider your ethical as well as legal responsibilities before publishing any data. See “When do we not want to make data open?” for more information.

Publishing data usually requires some degree of data curation, in which datasets are reviewed to ensure completeness, clarity, and alignment with FAIR principles, and in which metadata may be documented to describe the content of the data as well as how the data were collected and structured. Assigning persistent identifiers (like DOIs) to datasets allows the data to be reliably discovered, cited, and accessed over time.
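As a rough illustration of the curation step described above, the following Python sketch checks a dataset's metadata record for completeness before deposit. The field names, the sample record, and the placeholder DOI are all hypothetical; actual repositories define their own metadata schemas and assign identifiers themselves.

```python
import json

# Hypothetical minimal set of descriptive fields; real repositories define their own.
REQUIRED_FIELDS = ("title", "creators", "description", "license", "identifier")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Illustrative record only; none of these values describe a real dataset.
record = {
    "title": "Stream temperature measurements, 2020-2023",
    "creators": ["Example, Researcher"],
    "description": "Hourly water temperature readings and collection methods.",
    "license": "CC0-1.0",
    "identifier": "",  # left empty until the repository assigns a persistent identifier
}

print(missing_fields(record))  # ['identifier']

# Placeholder DOI, not a real identifier.
record["identifier"] = "https://doi.org/10.xxxx/placeholder"
print(json.dumps(record, indent=2))
```

A check like this is only a convenience for the depositor; it does not replace the review that repository curators perform.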

Please see “Where can I publish my data openly?” for more information.

Why is open data important?

Free public access to the data underlying scientific publications (and to the software or code used to analyze the data) allows research to be verified and, when possible, replicated or reproduced. In addition, open data allows not just academics but also the general public to use the data to further the public interest, including by enabling civic innovation, journalism, and education.

There is also a strong normative argument that publicly funded research should not be shielded from productive uses and access—both the research results and the underlying data should be openly available to the public that helped fund that research. This principle underlies many federal open science policies and is central to the broader movement toward transparency and accountability.

When do we not want to make data open?

While open data is an important goal across many fields, there are some distinct circumstances in which making data fully open is inappropriate or even harmful. Open data principles acknowledge these limitations.

Some data should stay private because of their inherently sensitive nature. The reasons vary and include national security, privacy, confidentiality, or safety concerns. Making such data public can lead to misuse. For example, the release of precise geolocation data for endangered species could expose them to poaching. Even when identifying information is removed, re-identification risks can persist, especially when datasets are large, rich, or can be cross-referenced with other publicly available information. Ethical research norms require minimizing re-identification risks.
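One way to get a rough sense of re-identification risk is to count how many records share the same combination of indirectly identifying attributes, an idea often discussed under the name k-anonymity. This technique is not prescribed by this FAQ; the sketch below is only an illustration, and the rows, column names, and choice of attributes are invented.

```python
from collections import Counter


def smallest_group_size(rows: list[dict], quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values(), default=0)  # 0 when there are no rows


# Made-up rows: names are already removed, but zip code and birth year remain.
rows = [
    {"zip": "20001", "birth_year": 1980, "diagnosis": "A"},
    {"zip": "20001", "birth_year": 1980, "diagnosis": "B"},
    {"zip": "20001", "birth_year": 1975, "diagnosis": "A"},
]

k = smallest_group_size(rows, ["zip", "birth_year"])
print(k)  # 1: one record is unique on zip + birth_year
```

A result of 1 means at least one person could be singled out by anyone who knows those attributes from another source. A higher number lowers that risk but does not eliminate it, so a check like this supplements ethical and legal review rather than replacing it.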

Some data should not be shared because of contractual limitations. A researcher may encounter nondisclosure provisions in research agreements—particularly in agreements with private, commercial entities. In that case, the agreement may obligate the researcher to refrain from disclosing the data, subject to the threat of a trade secret misappropriation lawsuit. For example, a researcher entering into an agreement with a pharmaceutical company might be obligated to keep the research—including the data—secret.

Other data may be poor candidates for open release because of economic considerations. Proprietary or trade-secret information may best be protected by keeping it beyond public reach. Publishing data openly could create a “prior art reference” that may be used as evidence to deny a patent application or invalidate a patent. See “Does making data open affect patent rights?” below for more information.

As you can see, openness is not the only guiding principle here; we must also ensure that any data sharing is ethical, lawful, context-appropriate, and aligned with the public interest.

Making data open

What are some common expectations around data sharing?

Researchers are increasingly expected to share the data generated with federal funding or through grant-supported projects, as well as data that underlies their research publications. Funders, publishers, and other stakeholders often require open data to promote transparency and reproducibility. While requirements may vary, common expectations around data sharing include planning for preservation and access, sharing data underlying publications, following community standards, using trusted repositories, and seeking guidance from appropriate institutional or funder resources.

Researchers will need to be mindful of requirements imposed by repositories. Repositories usually follow the desirable characteristics for data repositories outlined by the National Science and Technology Council (NSTC), which include proper metadata, clear licensing or terms of use, long-term preservation, and mechanisms to support reuse.

If the research is federally funded, then it must follow the Data Management and Sharing Plan (DMSP) submitted to the agency. The DMSP describes how data will be stored, documented, preserved, and shared during and after the project. Many U.S. government funder policies provide broad guidance that allows room for researchers to account for differences in disciplines, methodologies, and types of data. Where more specific standards exist, they typically align with established norms or best practices in the relevant research community.

Practical guidance on meeting expectations for data sharing can often be obtained from program officers at funding agencies, institutional librarians, or through specialized resources. SPARC, for example, maintains a comprehensive guide to federal data sharing policies. Researchers are encouraged to engage with these resources early in their projects to ensure compliance and to facilitate the effective reuse of their data by others.

Do I need approval from my employer or collaborators before publishing my data openly?

Yes—before making data openly available, it is important to confirm that you have the right to do so. Only those who actually own the rights to data can authorize its sharing. Before sharing data openly, it is best practice to obtain consent from all parties who may have ownership or claims to the data.

If your institution holds rights to the data, including the copyright, trade secrets, or potential patent rights in the dataset, there are a few options. One approach is to request that the institution itself apply a permissive license, such as a Creative Commons license, to the data. Alternatively, researchers may seek an assignment of rights from the institution, though the institution is under no obligation to grant such requests. Engaging with your institution, program manager, or peers can help clarify the applicable rules and norms and ensure that your data sharing aligns with both legal and professional expectations.

How do I publish data openly?

Publishing data openly involves making it available to others under terms that clearly communicate how the data can be used, shared, and adapted. Creative Commons (CC) licenses and public domain tools provide simple, standardized ways to grant copyright permissions, ensure attribution, and allow others to copy, distribute, or build upon your work.

CC offers six public licenses, each granting different permissions under specific conditions. Conditions are requirements that downstream users must follow when reusing a work. All six licenses include the BY (Attribution) condition, which requires the user to attribute the work to its author. Licenses with the ND (No Derivatives) condition do not allow remixing or adapting the work, meaning users cannot add original expression to the pre-existing work. NC (NonCommercial) licenses restrict commercial uses, while licenses without the NC condition allow both commercial and noncommercial uses.
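The way these conditions combine can be sketched in a few lines of Python. The list below uses the usual short names of the six CC licenses. The helper functions are hypothetical and deliberately simplified, and they are not a substitute for reading the license text. Note that two of the six licenses also carry an SA (ShareAlike) condition, not discussed above, which requires adaptations to be shared under the same or compatible terms.

```python
# The six CC licenses, listed by their usual short names.
CC_LICENSES = ["CC BY", "CC BY-SA", "CC BY-NC", "CC BY-NC-SA", "CC BY-ND", "CC BY-NC-ND"]


def conditions(license_name: str) -> set[str]:
    """Split a short name such as 'CC BY-NC-ND' into its condition codes."""
    return set(license_name.removeprefix("CC ").split("-"))


def allows_commercial_use(license_name: str) -> bool:
    """Simplified rule: commercial reuse is allowed unless the NC condition is present."""
    return "NC" not in conditions(license_name)


def allows_sharing_adaptations(license_name: str) -> bool:
    """Simplified rule: adapted versions may be shared unless the ND condition is present."""
    return "ND" not in conditions(license_name)


for name in CC_LICENSES:
    print(
        f"{name:12} commercial={allows_commercial_use(name)} "
        f"adaptations={allows_sharing_adaptations(name)}"
    )
```

Running the loop prints one line per license, which makes it easy to see that only CC BY and CC BY-SA permit both commercial use and the sharing of adaptations.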

CC also provides a public domain dedication tool, CC0, designed to place a creator’s work into the public domain. Some researchers use CC0 when the public domain status of the data is uncertain, as it gives downstream users legal certainty about their ability to use the materials freely.

Open Data Commons (ODC), maintained by the Open Knowledge Foundation, provides a set of legal tools and licenses specifically for data. ODC licenses grant users the rights to share, adapt, and create derivative works from datasets, while each license imposes different conditions.

It is inappropriate to apply licenses to public domain data. However, a dataset that is creatively selected and arranged may have its own copyright even if it contains public domain materials. In such cases, it can be appropriate to apply a CC license to the dataset while clearly indicating which elements are covered by the license and which are in the public domain.

If a dataset contains copyrightable works owned by third parties, you cannot license those rights without permission. Researchers must either contact the rightsholders of the underlying works to obtain consent before applying a CC license or specify that the CC license does not cover those third-party materials.

Where can I publish my data openly?

Numerous open data repositories exist for researchers to make their data openly available, including both discipline-specific and cross-disciplinary options. Institutions often maintain their own institutional repositories and have staff to help researchers with their deposit. Funding agencies, such as the NIH, generally recommend that researchers first seek to publish their data in a discipline-specific repository, which can maximize visibility and engagement within the relevant research community. If no appropriate discipline-specific repository exists, researchers may consider generalist repositories that meet open data standards and ideally comply with the FAIR Principles—making data Findable, Accessible, Interoperable, and Reusable. In some cases, funders may specify a particular repository where data must be deposited.

Repository search tools, such as FAIRsharing.org and re3data.org, can help identify suitable discipline-specific repositories. Researchers can also consult lists of recommended repositories maintained by funding institutions or academic journals. Common generalist repositories include Dryad, figshare, and the Harvard Dataverse Network.

It is important to note that simply making data publicly viewable does not make it “open.” See “How do I publish data openly?” above for more information. Researchers should ensure that the data intended to be made open is accompanied by an appropriate open license—such as a Creative Commons or Open Data Commons license—when depositing it into a repository. This ensures that downstream users clearly understand the permissions for reuse, sharing, and adaptation.

Researchers should also consider practical aspects, such as repository longevity, support for metadata standards, and whether the repository assigns persistent identifiers (like DOIs), which help make data citable and discoverable. By evaluating these factors, researchers can choose a repository that best aligns with their goals and the needs of the broader research community.

Is data in the public domain already open data?

The public domain consists of information, data, and other forms of expression that are free from legal restrictions, meaning they can be accessed, used, copied, published, or otherwise utilized without risking liability for intellectual property infringement. It refers to the lack of copyright restrictions. Materials are often in the public domain because they were categorically never subject to any legal restrictions, because their protection has expired, or because the otherwise copyrightable content was created by the federal government.

By contrast, open data requires both the lack of legal restrictions and the practical public availability of the data. Therefore, public domain data is not automatically open data. Even if a dataset is legally in the public domain, it may still be “closed” if contractual or institutional restrictions on its use or sharing bind the parties who agreed to them, or if it is simply not publicly accessible, for example because it has never been formally published or is only available in locations that are difficult to access.

Open data can certainly include public domain material. Some open data may be subject to intellectual property rights, yet it is made openly reusable through licenses or institutional permissions.

Some legal considerations

Does making data open affect trade secrets?

Trade secrets are a form of protection for information that is valuable because it is secret, known only to a limited group of people, and subject to reasonable efforts to maintain its secrecy. Once data is publicly shared or made openly available, it generally no longer meets these requirements and may lose trade secret protection. Therefore, making data open can have a direct effect on trade secret protection. While open data encourages broad sharing of data for public benefit, researchers must balance openness with obligations to protect trade secrets.

Openly sharing data that contains trade secrets usually results in a loss of trade secret protection. Researchers should carefully evaluate any possibilities of trade secrets before deciding to make data open. Seeking appropriate consent from the owner of the trade secret is strongly recommended whenever there is a possibility that the data could be subject to trade secret protection.

Trade secret ownership is determined primarily under state law. In most jurisdictions, a trade secret created by an employee typically belongs to the employer if the trade secret was developed within the scope of employment or under a contract specifying such ownership. Independent contractors may retain ownership of trade secrets in some jurisdictions unless they have agreed otherwise. Because rules vary by state, it is important for researchers to consult applicable state law to understand who owns any trade secrets in their data.

University and institutional policies can also affect trade secret rights. Many research institutions maintain intellectual property policies that clarify ownership of research outputs, including data. Such policies may modify default legal rules by specifying that trade secrets are owned by the institution, shared with the researcher, or assigned back to the researcher under certain conditions.

Before making data open, researchers should check whether any institutional IP or data ownership policies apply and whether consent is required to release data. Non-IP agreements may also restrict disclosure of data. Researchers working under nondisclosure agreements (NDAs) or other contractual obligations with commercial partners may be legally prohibited from publicly sharing data that is considered confidential or a trade secret. Violating such agreements could expose the researcher or their institution to liability for trade secret misappropriation.

Does making data open affect patent rights?

Patents grant their owners exclusive rights to make, use, sell, or offer to sell an invention. However, to be patentable, an invention must be novel and non-obvious. Making research data openly available—even if the data itself is not patentable—can affect patent rights, primarily because public disclosure can create “prior art” that may prevent an invention from being patentable.

Sharing data without considering pending patent applications or inventions in progress can unintentionally jeopardize potential patent rights. For example, if some research data shows the efficacious dosage of a certain drug, it could be used as prior art to invalidate patents on the administration of such a drug. For this reason, researchers may choose to take steps to protect future patent rights, such as filing a patent application prior to (or within one year of, in the case of the US) the publication of relevant data.

Universities and research institutions often specify in their policies who owns patents and patent rights arising from data created by employees or under institutional funding. Policies typically clarify ownership of copyrights, patents, and trade secrets, and in many cases, patent rights are retained by the institution or shared with the researcher. Before making data open, researchers should review these policies to ensure that public disclosure does not conflict with institutional patent rights.

Filing a patent application before publicly sharing data, consulting institutional policies, and understanding international differences in patent law are essential steps to protect patent rights while participating in open data.

Does making data open affect copyright?

Making data openly available does not affect copyright the same way it may undermine patent or trade secret rights, because as soon as a copyrightable work is created, the copyright will subsist whether or not it is shared publicly.

Copyright protects original expression, such as music, literature, software, photographs, and other creative outputs. Purely factual data (numbers, measurements, or unadorned observations) is not subject to copyright. Ideas, methods, processes, systems, and discoveries cannot be copyrighted either.

The line between facts and creative expression in data can be subtle. For instance, 3D scans of dinosaur bones are likely uncopyrightable, but if a researcher modifies the scans in a creative rather than functional way, that added expression may qualify for copyright protection. The selection, arrangement, or compilation of otherwise uncopyrightable data may also be copyrightable if the choices reflect original authorship.

By default, the creator of copyrightable data is the copyright owner. However, the “work made for hire” doctrine may allocate ownership to an employer when copyrightable data is created by an employee within the scope of employment. University IP or data policies, employment agreements, and licensing contracts can further modify copyright ownership rights.

Furthermore, datasets may contain multiple layers of copyrightable content, each potentially owned by a different party. If a dataset includes third-party copyrightable data, the creator of a new dataset may need permission to use those underlying works, even when sharing their dataset openly.

What other legal restrictions must I consider before sharing data openly?

Privacy is an important consideration in addition to potential effects on IP rights discussed above. Research data may include personally identifiable information (PII), sensitive health information, or other types of protected data subject to privacy laws such as HIPAA in the U.S. or the GDPR in the EU. Sharing this information without requisite safeguards could expose the researcher or their institution to legal liability.

Contractual obligations are another key consideration. Research agreements with sponsors, collaborators, or commercial partners may include nondisclosure clauses or other limitations on how data can be shared. Some agreements may require explicit permission before publication or dissemination, and failure to comply could result in claims for breach of contract.

Institutional policies may impose requirements or restrictions on researchers as well. Universities and other research institutions often have rules about data ownership, use, and sharing, which may go beyond intellectual property rights. These policies can affect whether certain datasets can be shared openly and under what conditions. See “Do I need approval from my employer or collaborators before publishing my data openly?” above for more information.

Please also see “When do we not want to make data open?” above for additional information.