Managing Data for Open Science¶
Learning Objectives
After this lesson, you should be able to:
- Recognize data as the foundation of open science and describe the "life cycle of data".
- Describe how modern cloud and AI technologies transform data management.
- Explain the roles of metadata, schemas, and ontologies in making data AI-ready.
- Cite tools and resources to improve your data management practices, including cloud-native formats and AI-driven techniques.
- Recall the meaning of the FAIR, CARE, and TRUST principles.
- Identify the biggest challenge to effective data management.
Why Should You Care About Data Management?¶
Ensuring that data are effectively organized, shared, and preserved is critical to making your science impactful, efficient, and open.
Don't Forget About Data Management
The biggest challenge to effective data management is treating it as an afterthought.
Unfortunately, poor data management doesn't have a high upfront cost. You can do substantial work before realizing you are in trouble. Like a swimmer in a rip current, by the time you realize you are in trouble, you may already be close to drowning.
The solution? Make data management the first thing you consider when starting a research project. It also needs to be a policy you institute right away for your research group.
Scenario 1: Sharing your data
You give unpublished data to a colleague who has not been involved with the project
Could they make sense of it?
Have you included a README.md that describes the structure of the data?
Will they be able to use it properly?
Have you included any user documentation that follows your work to make it reproducible?
It's been five years since you last looked at the data
Let's be honest here: the person who is most likely to handle your unpublished data is YOU.
Don't be afraid to admit how much time you've spent trying to make sense of a project that is five years, five months, or five weeks old.
We have all struggled to understand what file names mean, what units table values are in, or what methods were originally used to generate the data.
By using best practices for data management, you can save your most valuable resource: time!
Well-managed Data Sets:
- Make life much easier for you and your collaborators.
- Benefit the scientific research community by allowing others to reuse your data.
- Are required by most funders, for example:

  "Agencies should encourage depositing raw data and code that contributes to research outcomes in publicly accessible repositories, where appropriate, to facilitate exact replication and support reproducibility through diverse methodological approaches. Agencies should address barriers—such as incomplete reporting or resource constraints—by fostering training, shared infrastructure, and incentives for open science practices."

- Are increasingly expected by journals: leading journals in molecular and cellular biology (MCB) now frequently mandate public data deposition for published research, enhancing reproducibility.

The resources below are key for publishing and citing MCB data.
Journal/Resource | Voluntary Data Publication | Mandatory Data Publication | Provides DOI for Datasets |
---|---|---|---|
Nature | ✔ | ||
Cell | ✔ | ||
Science | ✔ | ||
PNAS | ✔ | ||
eLife | ✔ | ||
PLOS Biology | ✔ | ||
Dryad | ✔ | ✔ | |
Figshare | ✔ | ✔ | |
Zenodo | ✔ | ✔ | |
NCBI | ✔ | ✔ | |
Protein Data Bank (PDB) | ✔ | ✔ |
Scenario 2: Finding your data
When you are ready to publish a paper, is it easy to find all the correct versions of all the data you used and present them in a comprehensible manner?
Where are the data?
Easy: it's lost.
Data loss is an all-too-real problem that strikes researchers at the least expected moments.
Offline backups and mirrored (duplicate) copies
- Keep an online sync of your data
- Storage archives may be more secure, but they become 'dark data' that cannot be readily accessed or reused
- Platforms like CyVerse use an off-site mirror (UArizona + TACC) to keep data secure and safe
Version Controlled
- Code should be maintained on version-controlled platforms like GitHub, GitLab, or HuggingFace.
- Data must also be version controlled.
- Keep local and cloud-hosted copies of version-controlled data; a lightweight approach is sketched below.
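One lightweight way to keep data versions traceable is to record a checksummed manifest for each dataset release, so any change to any file is detectable. The sketch below uses only the Python standard library; the directory and file names are hypothetical, and tools such as Git LFS, DVC, or repository-level versioning offer more complete solutions.

```python
# Minimal sketch: record a checksummed manifest for each dataset version.
# Directory and file names are hypothetical placeholders.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_dir: str, manifest_path: str = "MANIFEST.json") -> None:
    """Write a manifest listing every data file and its checksum."""
    records = {
        str(p.relative_to(data_dir)): sha256sum(p)
        for p in sorted(Path(data_dir).rglob("*"))
        if p.is_file()
    }
    manifest = {
        "created": datetime.now(timezone.utc).isoformat(),
        "files": records,
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))

# Example (hypothetical directory): write_manifest("raw_data/")
```

Committing the manifest alongside your code makes it obvious when a dataset has changed between analyses, even if the data files themselves live in cloud storage.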
The Data Life Cycle: A Modern Approach¶
The Data Life Cycle
Data management is the set of practices that allow researchers to effectively and efficiently handle data throughout the data life cycle. Although typically shown as a circle (below), the actual life cycle of any data item may follow a different path, with branches and internal loops. Being aware of your data's future helps you plan how best to manage them.
Image from Strasser et al.
The summary below integrates traditional best practices, adapted from the excellent DataONE best practices primer, with modern approaches for creating cloud-native, AI-ready data products.
Plan¶
- Describe the data that will be compiled, and how the data will be managed and made accessible throughout its lifetime.
- A good plan considers each of the stages below and is formalized in a Data Management Plan (DMP). A DMP is a formal document that outlines how data are to be handled both during a research project, and after the project is completed.
- Why bother with a DMP?
- Stick: Funders like the NSF require them.
- Carrot: Planning makes your project run more smoothly and helps you avoid surprise costs and errors. Working without a DMP can leave you unable to make publicly funded research open, which is a serious consequence.
- DMP Tools: Make your life easier by creating DMPs with online tools like DMPTool or Data Stewardship Wizard.
- See Example DMPs and a Bishop article on DMPs.
Collect¶
- Have a plan for data organization in place before collecting data.
- Collect and store observation metadata at the same time you collect the data.
- Take advantage of machine-generated metadata (see the sketch below).
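As a minimal illustration of capturing machine-generated metadata, the sketch below records a file's size, modification time, and checksum at the moment of collection, using only the Python standard library. The example file name is hypothetical.

```python
# Minimal sketch: capture machine-generated metadata (size, timestamps,
# checksum) for each observation file at collection time.
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def capture_file_metadata(path: str) -> dict:
    """Collect basic machine-generated metadata for a newly collected file."""
    p = Path(path)
    stat = p.stat()
    return {
        "file": p.name,
        "size_bytes": stat.st_size,
        "modified_utc": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
    }

# Example (hypothetical file):
# print(capture_file_metadata("plot_07_temperature.csv"))
```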
Assure¶
- Record any conditions during collection that might affect the quality of the data.
- Distinguish estimated values from measured values.
- Double-check any data entered by hand.
- Perform statistical and graphical summaries (e.g., max/min, average, range) to check for questionable or impossible values.
- Mark data quality, outliers, missing values, etc. (a minimal example follows this list).
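A minimal example of these quality-assurance checks, assuming a tabular dataset loaded with pandas; the file name, column name, and plausibility thresholds are hypothetical.

```python
# Minimal QC sketch with pandas (assumed installed); names are hypothetical.
import pandas as pd

df = pd.read_csv("measurements.csv")   # hypothetical file

# Statistical summary: min/max, mean, quartiles for every numeric column.
print(df.describe())

# Count missing values per column.
print(df.isna().sum())

# Flag impossible or questionable values instead of silently dropping them.
df["temperature_qc_flag"] = "ok"
df.loc[df["temperature_c"] < -90, "temperature_qc_flag"] = "below_plausible_range"
df.loc[df["temperature_c"] > 60, "temperature_qc_flag"] = "above_plausible_range"
df.loc[df["temperature_c"].isna(), "temperature_qc_flag"] = "missing"

print(df["temperature_qc_flag"].value_counts())
```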
Describe: The Foundation for Automation¶
Comprehensive data documentation (i.e., metadata) is the key to future understanding and use of data. For data to be processed automatically and be AI-ready, computers need to understand not just the content, but also the structure and context.
Metadata: The Language of Your Data¶
Good metadata is the difference between a dataset being a digital artifact and a findable, reusable resource. In a cloud/API context, metadata is not just for humans; it's the machine-readable information that powers search, validation, and integration.
- Descriptive: What is this data? (Title, abstract, author, keywords, ORCID).
- Structural: How is the data organized? (Variable names, data types, relationships, schema location).
- Administrative: How can I use it? (License, version, access rights, owner).
Standards like DataCite, Dublin Core, and SpatioTemporal Asset Catalog (STAC) for geospatial data provide a formal specification for metadata, turning it into a reliable tool for automated discovery.
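To make this concrete, here is a sketch of a machine-readable metadata sidecar written next to a dataset. The field names loosely follow Dublin Core/DataCite conventions and every value is a hypothetical placeholder; a production workflow would follow the full standard and your repository's requirements.

```python
# Minimal sketch: write a machine-readable metadata sidecar for a dataset.
# Field names loosely follow Dublin Core / DataCite conventions; all values
# are hypothetical placeholders.
import json
from pathlib import Path

metadata = {
    "title": "Soil temperature measurements, Plot 7",              # descriptive
    "creator": "Jane Researcher (ORCID: 0000-0000-0000-0000)",
    "keywords": ["soil", "temperature", "long-term monitoring"],
    "description": "Hourly soil temperature at 10 cm depth.",
    "schema": "measurement.schema.json",                           # structural
    "variables": {"timestamp": "ISO 8601", "value": "celsius"},
    "license": "CC-BY-4.0",                                        # administrative
    "version": "1.0.0",
    "rights_holder": "Example University",
}

Path("soil_temperature.metadata.json").write_text(json.dumps(metadata, indent=2))
```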
Schema: The Blueprint for Your Data¶
A schema is a formal, machine-readable definition of your data's structure. It acts as a contract, ensuring that data conforms to expected formats, types, and constraints. This is absolutely critical for automated analysis and for data shared via an API.
- What it defines: Field names, data types (`integer`, `string`), required fields, and value ranges.
- Why it matters: It enables automated data validation, prevents errors, and allows tools to reliably interact with your data without human intervention.
- Common formats: JSON Schema, Avro Schemas, or schemas inherent to data formats like Parquet.
Example: A simple JSON Schema for a measurement
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Scientific Measurement",
  "description": "A single measurement from a sensor.",
  "type": "object",
  "properties": {
    "timestamp": { "type": "string", "format": "date-time" },
    "sensorId": { "type": "string" },
    "value": { "type": "number" },
    "units": { "type": "string", "enum": ["celsius", "pascals", "meters"] }
  },
  "required": ["timestamp", "sensorId", "value", "units"]
}
```
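A schema like this can be enforced automatically. The sketch below validates records against the schema above using the third-party `jsonschema` package (assumed to be installed), with the schema assumed to be saved as `measurement.schema.json`; the records themselves are hypothetical.

```python
# Minimal sketch: validate records against the measurement schema using the
# third-party jsonschema package (pip install jsonschema).
import json
from jsonschema import ValidationError, validate

with open("measurement.schema.json") as f:   # hypothetical: the schema above, saved to disk
    schema = json.load(f)

good_record = {
    "timestamp": "2024-06-01T12:00:00Z",
    "sensorId": "sensor-042",
    "value": 21.7,
    "units": "celsius",
}
bad_record = {"timestamp": "2024-06-01T12:00:00Z", "value": "warm"}  # missing and invalid fields

for record in (good_record, bad_record):
    try:
        validate(instance=record, schema=schema)
        print("valid record:", record)
    except ValidationError as err:
        print("invalid record:", err.message)
```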
Ontologies: the Web of Scientific Knowledge¶
While a schema defines structure, an ontology defines meaning and relationships. It's a formal representation of knowledge in a specific domain, creating a shared vocabulary that links data across different sources.
- What it does: Defines concepts (e.g., "Gene," "Protein") and the relationships between them (e.g., a "Gene" encodes a "Protein").
- Why it matters: Ontologies allow AI systems to understand the scientific context of your data, enabling more powerful semantic searches.
- Examples: FAIRSharing.org lists standards and ontologies for life sciences like the Environment Ontology and Plant Ontology.
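One lightweight way to attach this meaning to a dataset is to annotate each variable with an ontology term identifier in a sidecar file. In the sketch below the term IRIs are placeholders, not real ENVO or PO identifiers; in practice you would look up the correct terms via FAIRSharing.org or the ontology's own browser.

```python
# Minimal sketch: annotate dataset variables with ontology terms so machines
# can resolve their meaning. The term IRIs below are PLACEHOLDERS, not real
# ENVO/PO identifiers.
import json
from pathlib import Path

variable_annotations = {
    "soil_temp_c": {
        "label": "soil temperature",
        "ontology_term": "http://purl.obolibrary.org/obo/ENVO_XXXXXXX",  # placeholder ID
        "units": "degree Celsius",
    },
    "leaf_area_cm2": {
        "label": "leaf area",
        "ontology_term": "http://purl.obolibrary.org/obo/PO_XXXXXXX",    # placeholder ID
        "units": "square centimeter",
    },
}

Path("variables.ontology.json").write_text(json.dumps(variable_annotations, indent=2))
```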
Preserve & Integrate: Data as Cloud-Native Products¶
"Cloud-native" means moving away from monolithic files and toward formats and access patterns optimized for the cloud. The modern paradigm is to treat your dataset as a living product delivered via an API.
Analysis-Ready Data (ARD)¶
ARD is data pre-processed to a state where it's ready for immediate use, minimizing the burden on scientists. This is a cornerstone of preparing data for AI model training.
Key characteristics of ARD formats:
- Chunked: Data is split into smaller blocks, so you only read the piece you need.
- Compressed: Efficiently stored to reduce costs.
- Standardized: Uses common, open formats.
Examples of Cloud-Native ARD formats:
- Cloud-Optimized GeoTIFF (COG): For efficient streaming of geospatial imagery.
- Zarr: For chunked, compressed, N-dimensional arrays (e.g., from simulations or instruments).
- Apache Parquet: A columnar format ideal for tabular data, enabling incredibly fast queries.
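To illustrate why columnar, compressed formats matter, the sketch below writes a small table to Parquet and reads back only the columns it needs, assuming pandas and pyarrow are installed; the file and column names are hypothetical.

```python
# Minimal sketch: columnar access with Apache Parquet (pandas + pyarrow assumed).
import pandas as pd

# Write a small table to Parquet with compression.
df = pd.DataFrame({
    "site": ["A", "A", "B"],
    "timestamp": pd.to_datetime(["2024-06-01", "2024-06-02", "2024-06-01"]),
    "temperature_c": [21.7, 22.4, 19.8],
    "humidity_pct": [41.0, 39.5, 55.2],
})
df.to_parquet("measurements.parquet", compression="snappy")

# Read back only the columns you need; a Parquet reader never has to touch
# the rest of the file, which is what makes cloud-hosted queries fast.
subset = pd.read_parquet("measurements.parquet", columns=["timestamp", "temperature_c"])
print(subset)
```

The same chunk-and-compress idea underlies Zarr for N-dimensional arrays and COGs for imagery.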
Data must be preserved in an appropriate long-term archive (i.e. data center).
- Discipline-specific repositories like NCBI for sequence data.
- General repositories like Data Dryad or CyVerse Data Commons.
- Code repositories like GitHub can get DOIs through Zenodo.
Discover¶
- Good metadata allows you to discover your own data!
- Modern discovery happens via databases, repositories, and search indices.
- Modern APIs like STAC provide a standardized way to search and access data, forming the backbone of cloud-native data discovery.
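As an example of API-driven discovery, the sketch below queries a public STAC API using the `pystac-client` package (assumed to be installed). The endpoint, collection name, and bounding box are examples and may change over time.

```python
# Minimal sketch: search a public STAC API with pystac-client (assumed installed).
from pystac_client import Client

catalog = Client.open("https://earth-search.aws.element84.com/v1")

search = catalog.search(
    collections=["sentinel-2-l2a"],          # example collection
    bbox=[-111.2, 32.0, -110.7, 32.4],       # rough bounding box around Tucson, AZ
    datetime="2024-06-01/2024-06-30",
    max_items=5,
)

# Each item carries its own metadata and links to cloud-hosted assets.
for item in search.items():
    print(item.id, item.datetime, list(item.assets))
```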
Analyze: Leveraging AI with Structured Data¶
Once your data is well-structured and cloud-native, you can unlock powerful new ways to interact with it using AI.
Vector Databases: Searching by Meaning¶
Traditional databases search by keywords. Vector databases search by semantic meaning.
- Embeddings: A deep learning model converts your data (text, images) into a numerical vector. Similar concepts have similar vectors.
- Vector database: These embeddings are stored in a specialized database (e.g., Pinecone, Weaviate, ChromaDB) that can find the "nearest" vectors to a query with extreme speed.
- Use case: Find all datasets semantically related to a paper's abstract, even if they don't share keywords (see the sketch below).
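A minimal sketch of semantic search, assuming the `chromadb` package is installed and using Chroma's default embedding model; the dataset descriptions are hypothetical.

```python
# Minimal sketch of semantic search with ChromaDB (assumed installed).
import chromadb

client = chromadb.Client()                      # in-memory instance
collection = client.create_collection("dataset_abstracts")

collection.add(
    ids=["ds-001", "ds-002", "ds-003"],
    documents=[
        "Hourly soil temperature and moisture from a desert grassland site.",
        "RNA-seq read counts for drought-stressed maize seedlings.",
        "Stream discharge and water chemistry from an alpine catchment.",
    ],
)

# Query by meaning rather than by keyword.
results = collection.query(
    query_texts=["plant gene expression under water stress"],
    n_results=2,
)
print(results["ids"], results["distances"])
```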
Retrieval-Augmented Generation (RAG): Grounding LLMs in Scientific Fact¶
Large Language Models (LLMs) are powerful but can hallucinate. RAG solves this by connecting an LLM to your factual, curated data.
The RAG Workflow:
- Query: A user asks a question.
- Retrieve: The system searches your scientific database (using vector search or other methods) for the most relevant data.
- Augment: The retrieved data is added to the user's original prompt as context.
- Generate: The LLM receives the augmented prompt and generates a factually-grounded answer.
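A minimal, self-contained sketch of this workflow. The tiny corpus, the naive keyword retriever, and `call_llm()` are all hypothetical stand-ins: in practice, retrieval would use a vector database as described above and `call_llm()` would wrap a real LLM client.

```python
# Minimal RAG sketch. Corpus, retriever, and call_llm() are hypothetical
# stand-ins for a vector database and a real LLM client.

CORPUS = {
    "ds-001": "Hourly soil temperature and moisture from a desert grassland site.",
    "ds-002": "RNA-seq read counts for drought-stressed maize seedlings.",
    "ds-003": "Stream discharge and water chemistry from an alpine catchment.",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Stand-in retriever: score documents by words shared with the question."""
    words = set(question.lower().split())
    scored = sorted(CORPUS.values(), key=lambda d: -len(words & set(d.lower().split())))
    return scored[:k]

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder; swap in a real LLM client call."""
    return f"(model answer based on a {len(prompt)}-character augmented prompt)"

def answer_with_rag(question: str) -> str:
    context = "\n".join(retrieve(question))               # Retrieve
    prompt = (                                            # Augment
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)                               # Generate

print(answer_with_rag("Which datasets describe drought stress in plants?"))
```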
Model Context Protocol (MCP): Connecting LLMs to Your Tools¶
While RAG grounds an LLM in your data, MCP is an emerging standard for grounding an LLM in your tools and environment (e.g., local files and scripts). This transforms an LLM from a chatbot into an interactive research assistant that can execute commands like "Run my `analyze.py` script on the latest dataset and summarize the findings."
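A minimal sketch of an MCP server exposing one tool, assuming the official MCP Python SDK (the `mcp` package) is installed; the tool itself is a hypothetical example, and the SDK's details may change as the standard matures.

```python
# Minimal sketch of an MCP server exposing one tool, assuming the official
# MCP Python SDK ("mcp" package). The tool body is a hypothetical example.
from pathlib import Path

from mcp.server.fastmcp import FastMCP

server = FastMCP("research-data-assistant")

@server.tool()
def summarize_dataset(path: str) -> str:
    """Return a very small summary (row count) of a local CSV file."""
    lines = Path(path).read_text().splitlines()
    return f"{path}: {max(len(lines) - 1, 0)} data rows"

if __name__ == "__main__":
    server.run()   # communicates with an MCP-capable LLM client over stdio
```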
Guiding Principles for Data Stewardship¶
Why do we have data principles?
FAIR, CARE, and TRUST are collections of principles from different communities with different objectives.
Ultimately, different communities within different scientific disciplines must work to interpret and implement these principles. Because technologies change quickly, focusing on the desired end result allows these principles to be applied to a variety of situations now and in the foreseeable future.
FAIR Principles¶
In 2016, the FAIR Guiding Principles for scientific data management and stewardship were published in Scientific Data. Read it.
Findable
- F1. (meta)data are assigned a globally unique and persistent identifier
- F2. data are described with rich metadata (defined by R1 below)
- F3. metadata clearly and explicitly include the identifier of the data it describes
- F4. (meta)data are registered or indexed in a searchable resource
Accessible
- A1. (meta)data are retrievable by their identifier using a standardized communications protocol
- A2. the protocol is open, free, and universally implementable
- A3. the protocol allows for an authentication and authorization procedure, where necessary
- A4. metadata are accessible, even when the data are no longer available
Interoperable
- I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
- I2. (meta)data use vocabularies that follow FAIR principles
- I3. (meta)data include qualified references to other (meta)data
Reusable
- R1. meta(data) are richly described with a plurality of accurate and relevant attributes
- R2. (meta)data are released with a clear and accessible data usage license
- R3. (meta)data are associated with detailed provenance
- R4. (meta)data meet domain-relevant community standards
Open vs. Public vs. FAIR
FAIR does not demand that data be open. See one definition of "Open": http://opendefinition.org/
FAIR Assessment
Thinking about a dataset you work with, complete the ARDC FAIR assessment.
What did you learn about yourself and your data?
CARE Principles¶
The CARE Principles for Indigenous Data Governance are for people, and are complementary to the data-centric FAIR principles.
Collective Benefit
- C1. For inclusive development and innovation
- C2. For improved governance and citizen engagement
- C3. For equitable outcomes
Authority to Control
- A1. Recognizing rights and interests
- A2. Data for governance
- A3. Governance of data
Responsibility
- R1. For positive relationships
- R2. For expanding capability and capacity
- R3. For Indigenous languages and worldviews
Ethics
- E1. For minimizing harm and maximizing benefit
- E2. For justice
- E3. For future use
More Resources for CARE & Indigenous Rights
- Applying the 'CARE Principles for Indigenous Data Governance' to ecology and biodiversity Nature Ecology & Evolution, 2023.
- Carroll et al. (2020) established the CARE Principles for Indigenous Data Governance.
- Indigenous Data Sovereignty Networks
- Local Contexts
TRUST Principles¶
Lin et al. (2020), The TRUST Principles for digital repositories.
Transparency
- Terms of use, preservation timeframe, and services offered by the repository.
Responsibility
- Adhering to community standards, stewardship, and managing intellectual property.
User focus
- Implementing relevant data metrics and responding to community needs.
Sustainability
- Planning for risk mitigation, business continuity, and long-term funding.
Technology
- Implementing appropriate standards and security for data management.
Licenses¶
By default, when you make a creative work, that work is under exclusive copyright. If you want your work to be Open and used by others, you need to specify how others can use your work. This is done by licensing your work.
License Examples
GNU General Public License v3.0
This FOSS material is licensed under the Creative Commons Attribution 4.0 International License.
License Options from UArizona Library¶
License options for University of Arizona Research Data Repository (ReDATA)
Additional Info¶
- General guidance on how to choose a license: https://choosealicense.com/
- More good guidance on how to choose a license: https://opensource.guide/legal/
- Licensing options for your GitHub repository
Conclusion & Next Steps¶
Managing scientific data today is an active, dynamic process. By embracing the full data life cycle, building rich machine-readable context with metadata and schemas, and serving data as cloud-native products via APIs, we make our data truly FAIR.
This foundation not only accelerates traditional research but also unlocks the transformative power of modern AI to search, synthesize, and reason about scientific knowledge in ways that were never before possible.
In our next session, we will put these principles into practice in a hands-on workshop where we will define a schema for a dataset, catalog it using STAC, and build a simple RAG-based chatbot to answer questions about it.
References and Resources¶
- DataONE best practices
- Center for Open Science
- The FAIR Principles: https://www.nature.com/articles/sdata201618
- DMPTool General Guidance
- Data Carpentry
- The US Geological Survey Data Management
- Repository registry service: http://www.re3data.org/