Article · June 22, 2026 · 5 min

Data Contracts Are the API Versioning Your Data Pipeline Needs

Schema drift keeps breaking pipelines because we're monitoring for changes instead of enforcing contracts. Here's why data contracts are the missing layer between your producers and consumers.

By Andrew Tan


The Problem With Schema Monitoring

Schema monitoring is supposed to catch breaking changes. It doesn't.

A pipeline runs for months without issues. Then an upstream service adds a revenue_v2 field. The old revenue field still exists, but now it's deprecated and always null. The pipeline ingests the nulls happily. No errors. All green lights.

The business metric is just wrong.

This happens because monitoring watches for structural changes, not semantic ones.
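To make the gap concrete, here is a minimal sketch of the revenue scenario. The field names come from the story above; the `order_id` field, the 5% null-rate threshold, and the function names are assumptions for illustration, not anyone's production checks.

```python
# Hypothetical expected schema: field name -> Python type.
EXPECTED_FIELDS = {"order_id": int, "revenue": float}


def structural_check(rows):
    """Pass if every expected field is present and is either null or the right type."""
    for row in rows:
        for name, expected_type in EXPECTED_FIELDS.items():
            if name not in row:
                return False
            value = row[name]
            if value is not None and not isinstance(value, expected_type):
                return False
    return True


def null_rate(rows, field):
    """Fraction of rows where `field` is missing or null."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(field) is None) / len(rows)


def semantic_check(rows, max_null_rate=0.05):
    """Pass only if `revenue` is populated often enough to be trusted."""
    return null_rate(rows, "revenue") <= max_null_rate


# After the upstream change: `revenue` still exists but is always null.
batch = [
    {"order_id": 1, "revenue": None, "revenue_v2": 19.99},
    {"order_id": 2, "revenue": None, "revenue_v2": 5.00},
]

print(structural_check(batch))  # True: nothing structural is wrong
print(semantic_check(batch))    # False: the field no longer carries revenue
```

The structural check is the green light. The null-rate check is the one that would have flagged the metric.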


Why Monitoring Fails

Most teams set up alerts for new columns. Type changes. Missing fields. A human reviews every alert.

After the fiftieth "new optional field" notification, you stop reading. Your brain auto-approves. INT to BIGINT? Harmless. Approve. Move on.

Real problems slip through. The issue above wasn't structural. It was semantic. A new field appeared, which looks safe. The old field still existed. No breaking changes detected.

The contract was broken. Nobody noticed.

Monitoring catches accidents. You need something that catches lies.


Contracts vs. Registries

A schema registry checks structure. Field names, types, nullability. Important. Not enough.

A data contract checks promises.

  • Did you send a number?
  • Does it mean what you said?
  • Is it positive? In range? Referentially intact?

Think about REST APIs. You don't just check that JSON parses. You check that the endpoint does what the docs say. Break that promise and it's a breaking change, even if the JSON is technically valid.

Data pipelines need the same thing. Downstream systems build on implicit promises. When those break, everything breaks.


What Good Contracts Look Like

The teams that do this well define three things for every dataset:

Structural guarantees. But with a twist: any deviation is breaking. New optional field? Version bump. Sounds painful. Eliminates "stealth semantic changes" entirely.
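A rough sketch of that strict rule, assuming schemas are represented as plain field-to-type mappings. The representation and the function names are hypothetical.

```python
def diff_schema(declared: dict[str, str], observed: dict[str, str]) -> list[str]:
    """List every deviation between the declared and the observed schema."""
    deviations = []
    for name in sorted(observed.keys() - declared.keys()):
        deviations.append(f"added field: {name}")
    for name in sorted(declared.keys() - observed.keys()):
        deviations.append(f"removed field: {name}")
    for name in sorted(declared.keys() & observed.keys()):
        if declared[name] != observed[name]:
            deviations.append(
                f"type changed: {name} {declared[name]} -> {observed[name]}"
            )
    return deviations


def requires_version_bump(declared: dict[str, str], observed: dict[str, str]) -> bool:
    """Under the strict rule, any deviation at all is treated as breaking."""
    return bool(diff_schema(declared, observed))


contract_v1 = {"order_id": "int", "revenue": "float"}
observed = {"order_id": "int", "revenue": "float", "revenue_v2": "float"}

print(diff_schema(contract_v1, observed))            # ['added field: revenue_v2']
print(requires_version_bump(contract_v1, observed))  # True
```

A lenient checker would wave the added field through. Here it forces a new version, which forces a conversation.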

Semantic expectations. Business rules as validation. Patient age 0-120. Diagnosis codes must exist in the reference table. Timestamps within 24 hours of file creation.
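The three example rules can be sketched as plain validation code. The record layout, the field names, and the small in-memory set standing in for the reference table are all assumptions.

```python
from datetime import datetime, timedelta

# Stand-in for the diagnosis reference table (illustrative sample only).
VALID_DIAGNOSIS_CODES = {"E11.9", "I10", "J45.909"}


def validate_record(record: dict, file_created_at: datetime) -> list[str]:
    """Return the business rules this record violates (empty list means clean)."""
    violations = []
    if not 0 <= record["patient_age"] <= 120:
        violations.append("patient_age outside 0-120")
    if record["diagnosis_code"] not in VALID_DIAGNOSIS_CODES:
        violations.append("diagnosis_code not in reference table")
    if abs(record["recorded_at"] - file_created_at) > timedelta(hours=24):
        violations.append("recorded_at more than 24 hours from file creation")
    return violations


file_created_at = datetime(2026, 6, 22, 12, 0)
good = {
    "patient_age": 42,
    "diagnosis_code": "I10",
    "recorded_at": datetime(2026, 6, 22, 9, 0),
}
bad = {
    "patient_age": 130,
    "diagnosis_code": "XYZ",
    "recorded_at": datetime(2026, 6, 20, 9, 0),
}

print(validate_record(good, file_created_at))       # []
print(len(validate_record(bad, file_created_at)))   # 3
```

Every value in `bad` has a valid type. A schema registry would accept it.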

Consumer commitments. Downstream systems declare dependencies. Change a field three critical pipelines use? High risk. Even if it looks "safe" structurally.
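One way to picture consumer commitments is a dependency registry that feeds a risk rating. The consumer names, the registry shape, and the thresholds below are invented for illustration.

```python
# Hypothetical registry: field -> consumers that declared a dependency on it.
DEPENDENCIES = {
    "revenue": ["finance_daily_report", "churn_model", "exec_dashboard"],
    "plan_type": ["churn_model"],
}

# Hypothetical set of pipelines the business treats as critical.
CRITICAL_CONSUMERS = {"finance_daily_report", "churn_model", "exec_dashboard"}


def change_risk(field: str) -> str:
    """Rate a proposed change by who depends on the field, not by how it looks."""
    consumers = DEPENDENCIES.get(field, [])
    critical = [name for name in consumers if name in CRITICAL_CONSUMERS]
    if len(critical) >= 3:  # assumed threshold
        return "high"
    if consumers:
        return "medium"
    return "low"


print(change_risk("revenue"))       # high
print(change_risk("plan_type"))     # medium
print(change_risk("signup_source")) # low
```

The rating ignores the shape of the change entirely. It only asks who declared that they rely on the field.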

For those teams, schema changes go from days of coordination to hours. Silent semantic drift largely disappears, because every change is declared.


The Hard Part Is Organizational

Contracts force conversations most people don't want to have.

Producers must promise things about data they don't fully control. The CRM team doesn't know every downstream consumer. The mobile team doesn't know how data science uses their events.

Three patterns for ownership:

Producer-owned. The team making the data defines the contract. Clean in theory. Often fails because producers optimize for convenience, not downstream needs.

Consumer-owned. Downstream defines requirements. Protects consumers, but producers can't always comply. You get contracts on paper that get violated in practice.

Platform-mediated. Central team brokers the conversation. More overhead. Actually works.

Platform-mediated with quarterly reviews is expensive in meeting time. Cheap compared to incidents.


Start Small

You don't need a platform to begin.

Write three things for your critical datasets:

What does this represent? Not field definitions. The business concept. "Daily snapshot of active subscriptions" differs from "table has customer_id, plan_type, renewal_date."

What can people rely on? Nullability, update frequency, retention. The stuff everyone's implicitly assuming.

What happens when it breaks? Who do you call? How fast? What's the rollback?

Start with your three most critical assets. That's it.
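If a document feels too loose, the same three answers fit in a small data structure. This is only a sketch: the class, the field names, and every value in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    dataset: str
    represents: str    # the business concept, not the column list
    guarantees: dict   # what consumers can rely on
    on_breakage: dict  # who to call, how fast, how to roll back


def missing_sections(contract: DataContract) -> list[str]:
    """Name the sections nobody has written down yet."""
    sections = {
        "represents": contract.represents,
        "guarantees": contract.guarantees,
        "on_breakage": contract.on_breakage,
    }
    return [name for name, value in sections.items() if not value]


active_subscriptions = DataContract(
    dataset="active_subscriptions",
    represents="Daily snapshot of active subscriptions",
    guarantees={
        "customer_id_nullable": False,
        "update_frequency": "daily",
        "retention_days": 365,
    },
    on_breakage={
        "contact": "data-platform-oncall",
        "response_time_hours": 4,
        "rollback": "restore the previous day's snapshot",
    },
)

print(missing_sections(active_subscriptions))  # []
```

An empty section is a useful signal on its own: it marks an assumption that is still implicit.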


Contracts Create Problems Too

They ossify. Changing a contract requires coordination. That's the point, since it prevents breaking changes, but it also slows good ones. Teams avoid proposing changes because of the coordination cost.

They lie. A contract is only as good as its validation. Saying "all customer_ids must exist" without checking? Theater. False confidence is worse than none.
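A cheap guard against theater is to list every promise next to the check that enforces it, then flag the ones with no check. The promise names, the known-ID set, and the helper functions here are assumptions for the sketch.

```python
from typing import Callable, Optional

Check = Callable[[list], bool]

# Stand-in for the customer master table.
KNOWN_CUSTOMER_IDS = {"c1", "c2", "c3"}


def customer_ids_exist(rows: list) -> bool:
    """True only if every row references a known customer."""
    return all(row["customer_id"] in KNOWN_CUSTOMER_IDS for row in rows)


# Each promise maps to the check that enforces it, or None if nothing does.
PROMISES: dict[str, Optional[Check]] = {
    "all customer_ids must exist": customer_ids_exist,
    "amount is positive": None,  # promised on paper, never checked
}


def unenforced_promises(promises: dict) -> list[str]:
    """Promises that have no validation behind them."""
    return sorted(name for name, check in promises.items() if check is None)


def run_checks(promises: dict, rows: list) -> dict[str, bool]:
    """Run every promise that actually has a check attached."""
    return {
        name: check(rows)
        for name, check in promises.items()
        if check is not None
    }


rows = [{"customer_id": "c1"}, {"customer_id": "c9"}]
print(unenforced_promises(PROMISES))  # ['amount is positive']
print(run_checks(PROMISES, rows))     # {'all customer_ids must exist': False}
```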

They shift blame. Consumer detects a violation. Response: "producer broke their promise." True. Unhelpful. The goal is fixing the data, not assigning blame. You need recovery procedures, not finger-pointing.


The Tooling

Great Expectations and Soda added contract features. Not full platforms, but they enforce semantic expectations at boundaries.

Data Contract Club and AICP are emerging. First-class contracts with versioning and validation.

Data catalogs — Collibra, Alation, Atlan — have contract management now. Usually workflow-heavy, validation-light. Better for docs than enforcement.

At layline.io we embed contracts into workflows. Define data movement, define the promises. Schema expectations, validation rules, quality thresholds. Enforced at runtime, not checked after the fact.

But you don't need fancy tooling. A JSON Schema file with a validation step is a functioning contract. Organizational practice beats technology.
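Here is roughly what that looks like. The schema and record fields are made up, and the tiny `validate` function is a hand-rolled stand-in that only understands `required`, `type`, and `minimum`. In practice a real validator library such as `jsonschema` would do this job; the stand-in just keeps the sketch dependency-free.

```python
import json

# In a real setup this text would live in a versioned .json file.
SCHEMA_TEXT = """
{
  "type": "object",
  "required": ["customer_id", "plan_type", "monthly_price"],
  "properties": {
    "customer_id": {"type": "string"},
    "plan_type": {"type": "string"},
    "monthly_price": {"type": "number", "minimum": 0}
  }
}
"""

JSON_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}


def validate(record: dict, schema: dict) -> list[str]:
    """Check a record against a small subset of JSON Schema keywords."""
    errors = []
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"missing required field: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in record:
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_stray_bool = isinstance(value, bool) and rules["type"] != "boolean"
        if is_stray_bool or not isinstance(value, JSON_TYPES[rules["type"]]):
            errors.append(f"{name}: expected {rules['type']}")
            continue
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{name}: below minimum {rules['minimum']}")
    return errors


schema = json.loads(SCHEMA_TEXT)
ok = {"customer_id": "c1", "plan_type": "pro", "monthly_price": 29.0}
broken = {"customer_id": "c1", "monthly_price": -5}

print(validate(ok, schema))      # []
print(validate(broken, schema))  # two errors: missing plan_type, negative price
```

Run it as a pipeline step that fails the load when the error list is non-empty, and the file stops being documentation and starts being a contract.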


The Test

Pick a critical data asset. Something that would hurt if wrong.

Suppose upstream changes its format. Technically valid: new fields, same types. Semantically wrong. How long until you notice?

If the answer is "when someone complains," you need contracts.

If it's "we'd catch it in monitoring," dig deeper. Does your monitoring catch semantic changes or just structural ones?

The goal isn't perfect data quality. It's preventing the stupid problems. The ones from assumptions nobody wrote down.


Andrew Tan is a serial entrepreneur and founder of layline.io, building enterprise data processing infrastructure that handles both batch and real-time workloads at scale.
