Dotmatics

Structured Scientific Data for AI

The article argues that AI initiatives in life sciences often fail because the underlying scientific data is unstructured and fragmented, lacking consistent schemas, relationships, and context. It emphasizes that truly structured data, organized at the point of creation with defined formats and linked elements, is essential for producing trustworthy, verifiable AI outputs.

The Case for Structuring at the Point of Work

Most AI initiatives in life sciences follow a familiar pattern: an impressive demo excites leadership, a budget is approved, and a team is assembled. Some months later, the initiative is quietly deprioritized. The issue is rarely the AI model itself, but rather the data it is given. When AI is applied to real data, the outputs may sound plausible, but they cannot be verified or traced to their sources, so scientists do not trust them. The root cause is that the underlying scientific data is not structured and therefore not ready for AI, and organizations often realize this too late.

What a Fragile Foundation Actually Looks Like

The data problem in life sciences is well-known to researchers. Data is often:

  • Recorded in free-text notebook entries, each written differently
  • Stored across PDFs, spreadsheets, and proprietary instrument exports that are not interoperable
  • Stripped of its context, because the reasoning behind experiments is scattered across memories, chat threads, or unfiled presentations

While the data exists and is accessible, it is semantically opaque to AI. It lacks relationships, lineage, and meaning. AI can generate outputs from such data, but these outputs are not verifiable or trustworthy. Adding more connectors or pipelines does not solve the problem; it only moves fragmented data around faster.

What Structured Data Actually Means

"Structured data" in a lab context means data organized according to a defined schema, with consistent formats, labels, and relationships. For example:

  • Results have defined types
  • Compounds are linked to the assays they were tested in
  • Experiments are connected to protocols and decisions

This structure should be captured at the point of data creation, not retroactively. Data structured after the fact is always a partial reconstruction, with lost context and approximated relationships. Data structured at the point of work carries its context and relationships naturally, as part of the workflow.
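
As a rough illustration of the three properties listed above, the following Python sketch models typed results, compound-to-assay links, and experiment-to-protocol links at the moment a result is recorded. Every name in it (ResultType, Compound, Protocol, Experiment, assays_for) and every identifier and value is hypothetical, invented for this example; it is not the schema of any particular product.

```python
from dataclasses import dataclass, field
from enum import Enum


class ResultType(Enum):
    """Hypothetical controlled vocabulary of result types."""
    IC50_NM = "ic50_nM"
    PERCENT_INHIBITION = "percent_inhibition"


@dataclass(frozen=True)
class Protocol:
    protocol_id: str
    title: str


@dataclass(frozen=True)
class Compound:
    compound_id: str
    smiles: str


@dataclass(frozen=True)
class Result:
    compound: Compound          # explicit link: which compound was tested
    result_type: ResultType     # defined type, not a free-text label
    value: float

    def __post_init__(self):
        # Reject untyped or implausible values at capture time.
        if not isinstance(self.result_type, ResultType):
            raise TypeError("result_type must be a ResultType")
        if self.value < 0:
            raise ValueError("value must be non-negative")


@dataclass
class Experiment:
    experiment_id: str
    protocol: Protocol          # explicit link: which protocol was followed
    rationale: str              # the decision context travels with the data
    results: list = field(default_factory=list)

    def record(self, compound, result_type, value):
        """Capture one result; validation runs before it is stored."""
        result = Result(compound, result_type, float(value))
        self.results.append(result)
        return result


def assays_for(compound_id, experiments):
    """Follow stored links to find every experiment that tested a compound."""
    return [
        e.experiment_id
        for e in experiments
        if any(r.compound.compound_id == compound_id for r in e.results)
    ]


# Illustrative data only.
protocol = Protocol("PROT-001", "Example inhibition assay")
compound = Compound("CMP-001", "CC(=O)OC1=CC=CC=C1C(=O)O")
exp = Experiment("EXP-001", protocol, "Baseline potency screen")
exp.record(compound, ResultType.IC50_NM, 125.0)

print(assays_for("CMP-001", [exp]))  # ['EXP-001']
```

Because the links are stored when the result is recorded, the question "which assays was this compound tested in?" becomes a lookup rather than a reconstruction from notebook text.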

The next level is ontology-backed data. An ontology is a formal map of how concepts in a domain relate to each other. In life sciences, this means defining not just what a piece of data is, but how it relates to everything else. This semantic richness enables AI to answer reasoning questions, not just retrieval questions.
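
The difference between a retrieval question and a reasoning question can be sketched with a toy ontology stored as subject-predicate-object triples. This is a minimal illustration, not a real ontology: the term names, the two predicates (is_a, targets), and the helper functions are assumptions made for the example, and production systems would typically use an established ontology and a dedicated reasoner.

```python
# Toy ontology as (subject, predicate, object) triples. Illustrative only.
ONTOLOGY = [
    ("imatinib", "is_a", "kinase_inhibitor"),
    ("kinase_inhibitor", "is_a", "enzyme_inhibitor"),
    ("imatinib", "targets", "ABL1"),
    ("ABL1", "is_a", "tyrosine_kinase"),
    ("tyrosine_kinase", "is_a", "protein_kinase"),
]


def ancestors(term, triples):
    """Return every class reachable from term by following is_a edges."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == "is_a" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def targets_class(compound, target_class, triples):
    """Reasoning query: does the compound target anything of this class?"""
    for subject, predicate, obj in triples:
        if subject == compound and predicate == "targets":
            if obj == target_class or target_class in ancestors(obj, triples):
                return True
    return False


# Retrieval: only finds facts that were stored literally.
direct_fact = ("imatinib", "targets", "protein_kinase") in ONTOLOGY
print(direct_fact)  # False

# Reasoning: combines a stored "targets" fact with the is_a hierarchy.
print(targets_class("imatinib", "protein_kinase", ONTOLOGY))  # True
```

No record states that the compound targets a protein kinase, so a literal lookup fails. The answer only emerges by following the relationships the ontology defines.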

Why Retrofitting Is So Difficult

Cleaning up historical data is challenging for several reasons:

  1. Irreversibility: Data not structured at capture is often ambiguous, with lost context that cannot be recovered.
  2. Scale: Retroactively structuring years of data is a massive task that competes with ongoing research and often stalls before completion.
  3. Workflow Persistence: If the workflow that produces unstructured data remains unchanged, new data will continue to be unstructured, recreating the problem.

The solution is not a one-time cleanup project, but an architectural decision to capture structured data at the point of work.

What Good Looks Like

An AI-ready data environment has these characteristics:

  • Data is structured at the point of capture: Scientists' workflows naturally produce AI-ready data without extra effort.
  • Relationships are preserved: Connections between entities are stored explicitly, not inferred later.
  • Experimental context travels with the data: Intent, lineage, and rationale are stored alongside results.
  • Answers are verifiable: AI outputs can be traced and independently inspected.
  • The platform is open: External tools can connect to the structured foundation without rebuilding data bridges.

These features require a platform designed for structured scientific work, recognizing that data investment and AI investment are inseparable.
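
One way to picture "answers are verifiable" is an answer object that carries the identifiers of every record it was computed from. The sketch below is a simplified assumption of how that might look: the Measurement and TracedAnswer types, the field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Measurement:
    record_id: str
    compound_id: str
    experiment_id: str
    protocol_id: str
    ic50_nm: float


@dataclass(frozen=True)
class TracedAnswer:
    value: float
    source_record_ids: tuple   # exactly which records produced the value
    experiment_ids: tuple      # lineage a reviewer can follow back


def mean_ic50(compound_id, measurements):
    """Compute a summary value and keep the trail of records behind it."""
    used = [m for m in measurements if m.compound_id == compound_id]
    if not used:
        raise LookupError(f"no measurements for {compound_id}")
    return TracedAnswer(
        value=mean(m.ic50_nm for m in used),
        source_record_ids=tuple(m.record_id for m in used),
        experiment_ids=tuple(sorted({m.experiment_id for m in used})),
    )


# Illustrative data only.
measurements = [
    Measurement("REC-1", "CMP-001", "EXP-001", "PROT-001", 100.0),
    Measurement("REC-2", "CMP-001", "EXP-002", "PROT-001", 140.0),
    Measurement("REC-3", "CMP-002", "EXP-001", "PROT-001", 900.0),
]

answer = mean_ic50("CMP-001", measurements)
print(answer.value)              # 120.0
print(answer.source_record_ids)  # ('REC-1', 'REC-2')
```

A scientist who doubts the value can open the listed records and recompute it independently, which is only possible because the relationships were stored explicitly rather than inferred later.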

The Governance Dimension

In regulated environments, AI outputs must be auditable. Compliance and regulatory reviewers need to trace and reproduce AI-generated answers. Gartner predicts that most agentic AI initiatives in healthcare and life sciences will stall due to lack of traceability and control, not model capability. Governance is not separate from data quality; it is downstream of it. Structured, ontology-backed data inherently supports auditability and compliance.
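
To make the link between structure and auditability concrete, the sketch below records an AI-generated answer together with a fingerprint of the exact inputs behind it, so a reviewer can later check whether those inputs have changed. It is an assumption-laden illustration rather than a compliance mechanism: the entry fields, function names, and sample records are hypothetical, and a real audit trail would also need timestamps, user identity, and tamper-evident storage.

```python
import hashlib
import json


def fingerprint(records):
    """Hash the records in a canonical form, independent of list order."""
    ordered = sorted(records, key=lambda r: r["record_id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_audit_entry(question, answer, source_records, model_version):
    """Bundle an answer with what is needed to trace and re-check it."""
    return {
        "question": question,
        "answer": answer,
        "model_version": model_version,
        "source_record_ids": sorted(r["record_id"] for r in source_records),
        "input_fingerprint": fingerprint(source_records),
    }


def inputs_unchanged(entry, current_records):
    """True if the records on file still match what the answer was based on."""
    return entry["input_fingerprint"] == fingerprint(current_records)


# Illustrative data only.
records = [
    {"record_id": "REC-1", "compound_id": "CMP-001", "ic50_nm": 100.0},
    {"record_id": "REC-2", "compound_id": "CMP-001", "ic50_nm": 140.0},
]
entry = make_audit_entry(
    question="What is the mean IC50 of CMP-001?",
    answer="120.0 nM",
    source_records=records,
    model_version="example-model-v1",
)

print(inputs_unchanged(entry, records))  # True

edited = [dict(records[0], ic50_nm=95.0), records[1]]
print(inputs_unchanged(entry, edited))   # False
```

The fingerprint only works because each record has a stable identifier and defined fields; the same check is not meaningful over free-text notebook entries.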

The Question Worth Asking

Before starting an AI initiative, organizations should ask: What is the AI going to be operating on?

If the answer involves significant data cleanup, connector strategies, or hoping the model will compensate for poor data, the initiative is starting from the wrong place. The organizations seeing real returns from AI are those whose data is structured at the point of work, connected, and semantically rich.

The model is the last mile. The data is the road. Building the road correctly from the start is the investment that makes AI worthwhile.