Article · April 23, 2026 · 8 min

I Thought My Pipeline Was Resilient Until I Asked These Five Questions

Resilience isn't 'it works.' It's 'it works when everything around it breaks.' Here's how to know the difference.

By Andrew Tan


The migration that should have been a disaster

Six months ago, a SaaS company I advise decided to migrate their primary PostgreSQL instance to a new cloud region. The plan was simple: spin up the replica, promote it, update connection strings, verify everything works. Estimated downtime: fifteen minutes.

What actually happened was more interesting. The replica promotion worked. The connection string updates worked. But the data pipeline that fed their customer analytics dashboard — a pipeline that had run without issue for eighteen months — immediately started producing nonsense. Not errors. Nonsense. Row counts looked fine. Schema was intact. But the conversion funnel metrics were off by 12%, and nobody noticed for four hours because the pipeline was "green" on every monitoring dashboard.

The root cause? The pipeline had a hidden dependency on a read replica that wasn't supposed to be part of its data path. Someone had added it two years earlier as a performance optimization and never documented it. When the old region went offline, the optimization became a single point of failure. The pipeline didn't crash. It just quietly consumed stale data and produced garbage.

This is what I mean when I say resilience isn't uptime. That pipeline had 99.9% uptime. It was technically "resilient" by every metric the team tracked. It just wasn't resilient in any way that mattered when something actually went wrong.

Since that incident, I've started asking five questions before I'll call any pipeline production-ready. These aren't theoretical architecture review questions. They're the ones that expose the gap between "works in normal conditions" and "works when the world is on fire."


Question 1: If this component fails, what else breaks?

Most data pipelines are built like Christmas lights: one bulb goes out and the whole string goes dark. Not because the engineers are careless, but because dependencies accumulate organically over time. A pipeline starts simple. Then it needs reference data, so it reads from a shared cache. Then it needs enrichment, so it calls an internal API. Then it needs aggregation, so it writes to a state store that three other pipelines also use. Before anyone has drawn an architecture diagram, you've built a system where every component is load-bearing and nothing fails in isolation.

I call this the blast radius problem. Resilient pipelines have explicit failure domains. When one piece breaks, the damage is contained. The team gets an alert about a specific component. The rest of the system keeps working, possibly in degraded mode, but without cascading failure.

The Christmas light problem is especially common in batch pipelines that have evolved over years. Each new requirement gets bolted onto the existing flow because rewriting the whole thing feels risky. The result is a pipeline where the "failure mode" is always total failure. There's no partial success. No graceful degradation. Just green or red.

To fix this, you need to design for isolation from the start. Separate ingestion from transformation from serving. Use bounded contexts for state. Assume every dependency will fail and ask: if it does, can the rest of the pipeline continue with reduced functionality? If the answer is no, you don't have resilience. You have optimism.
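
Here's a minimal sketch of what "continue with reduced functionality" can look like for a single enrichment step. The names (`StageResult`, `enrich_with_fallback`, the `customer_id` field, the `enrichment-api` label) are hypothetical, and I'm assuming the lookup signals an outage with Python's built-in `ConnectionError` or `TimeoutError`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StageResult:
    record: dict
    degraded: bool = False                 # True when the fallback path was used
    failed_dependency: Optional[str] = None


def enrich_with_fallback(
    record: dict,
    lookup: Callable[[str], dict],
    dependency_name: str = "enrichment-api",  # hypothetical label for alerting
) -> StageResult:
    """Enrich a record, but keep the pipeline moving if the lookup is down."""
    try:
        extra = lookup(record["customer_id"])
    except (ConnectionError, TimeoutError):
        # Contain the failure: pass the record through un-enriched and flag it,
        # instead of letting one dependency take the whole run down.
        return StageResult(
            record=dict(record), degraded=True, failed_dependency=dependency_name
        )
    return StageResult(record={**record, **extra})
```

The important part is the `degraded` flag. Downstream stages and alerts can see that a record took the fallback path, so the degradation is visible rather than silent.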


Question 2: Can this pipeline recover without a human?

The three-in-the-morning test is the one that matters. A pipeline fails at 3 AM. Your on-call engineer gets paged. What happens next?

In most organizations, what happens is a bleary-eyed human opens a laptop, reads some logs, restarts a job, and goes back to bed hoping it doesn't happen again. This isn't recovery. This is delay. The pipeline is down for twenty minutes, an hour, sometimes longer. The data is stale. The downstream systems are producing results based on yesterday's inputs.

Resilient pipelines recover automatically. Not for every failure — some problems genuinely need human judgment — but for the predictable ones. Out-of-memory errors should trigger a retry with adjusted resource limits. Temporary network issues should trigger exponential backoff, not immediate failure. Schema mismatches should route bad records to a dead-letter queue and continue processing the valid ones.
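
As a rough illustration, two of those responses might look like this in Python. `TransientError`, `retry_with_backoff`, and `process_batch` are names I made up for the sketch, the delays are arbitrary defaults, and the dead-letter "queue" is just a list standing in for whatever durable store you'd actually use.

```python
import random
import time
from typing import Callable, Iterable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, resets)."""


def retry_with_backoff(fn: Callable[[], object], max_attempts: int = 5,
                       base_delay: float = 0.5, max_delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep):
    """Call fn, retrying transient failures with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # bounded: give up and surface the error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids retry storms


def process_batch(records: Iterable[dict], transform: Callable[[dict], dict]):
    """Transform valid records; route malformed ones to a dead-letter list."""
    good, dead_letter = [], []
    for record in records:
        try:
            good.append(transform(record))
        except (KeyError, ValueError, TypeError) as exc:
            # Keep the bad record and the reason so a human can inspect it later.
            dead_letter.append({"record": record, "error": repr(exc)})
    return good, dead_letter
```

Passing `sleep` as a parameter is a small design choice that pays off later: tests can inject a fake and verify the backoff schedule without waiting for it.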

The teams that sleep through the night have invested in self-healing. They've classified their failure modes and automated the responses to the ones that don't require creativity. The 3 AM page becomes rare because the system handles its own predictable problems.

This requires more than just adding retries. It requires designing the pipeline to be retry-safe. Idempotent operations. Deterministic outputs. Clear separation between "this failed because of a transient issue" and "this failed because the input data is fundamentally wrong." The first category should heal itself. The second category should fail loudly and specifically, routing the bad data to a place where a human can inspect it during business hours.
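
One way to make a write retry-safe is to key it on the record's content, so replaying a batch changes nothing. This is a sketch under simplifying assumptions: `IdempotentSink` is an in-memory stand-in for a store that supports upsert by key, and `idempotency_key` assumes records are JSON-serializable dictionaries.

```python
import hashlib
import json


def idempotency_key(record: dict) -> str:
    """Derive a stable key from record content (sorted keys keep it deterministic)."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class IdempotentSink:
    """In-memory stand-in for a store that supports upsert by key."""

    def __init__(self):
        self.rows = {}

    def write(self, record: dict) -> bool:
        """Store the record once; return False if it was already written."""
        key = idempotency_key(record)
        if key in self.rows:
            return False  # replayed write: safe to ignore
        self.rows[key] = record
        return True
```

With this shape, a retry after a half-finished run writes only the records that never landed, and a job that runs twice by accident leaves the destination unchanged.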


Question 3: When it fails, do I know what actually happened?

Here's a scenario I've seen more than once. A pipeline fails. The logs say "Exception in worker thread." The monitoring dashboard shows a red dot. The alert says "Job failed." And the engineer who gets paged spends the next hour trying to answer a basic question: what was the pipeline doing when it broke?

Most monitoring tells you that something failed. It doesn't tell you why. It doesn't tell you what the pipeline was processing, what state it was in, or what the downstream impact will be. You know the patient is sick. You don't know the symptoms, the diagnosis, or the treatment.

Resilient pipelines are observable. Not just monitored — observable. The difference matters. Monitoring checks if the job finished. Observability lets you reconstruct what happened when it didn't. Distributed tracing that follows a record through every stage. Structured logging that includes context, not just events. Metrics that expose the health of the data, not just the health of the process.

One team I worked with made a simple change that transformed their debugging: they started logging the input record ID at every transformation stage. When something broke, they could trace the exact record through the pipeline and see which stage produced the error. Before that change, debugging took hours. Afterward, it took minutes. The pipeline itself wasn't any more reliable, but the team responded to failures so much faster that effective downtime dropped by 80%.
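
I don't have that team's code, so treat this as a sketch of the idea rather than their implementation. It uses Python's standard `logging` and `json` modules. The helper names (`log_stage`, `run_stages`) and the assumption that every record carries an `id` field are mine.

```python
import json
import logging

logger = logging.getLogger("pipeline")


def log_stage(stage: str, record_id: str, status: str, **context) -> str:
    """Emit one JSON log line that ties a stage event to a specific record."""
    line = json.dumps(
        {"stage": stage, "record_id": record_id, "status": status, **context},
        sort_keys=True,
    )
    logger.info(line)
    return line


def run_stages(record: dict, stages: list) -> dict:
    """Run (name, fn) stages in order, logging the record ID at each step."""
    for name, fn in stages:
        try:
            record = fn(record)
        except Exception as exc:
            # Log which stage broke and for which record, then re-raise.
            log_stage(name, record["id"], "error", error=repr(exc))
            raise
        log_stage(name, record["id"], "ok")
    return record
```

Because each line is JSON with a `record_id` field, finding everything that happened to one record becomes a filter in your log tooling instead of a grep across servers. Note that the `pipeline` logger emits nothing at INFO level until logging is configured, for example with `logging.basicConfig(level=logging.INFO)`.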

If your debugging process involves SSHing into servers and grepping through unstructured log files, you don't have observability. You have archaeology. And archaeology is expensive at 3 AM.


Question 4: Does it protect data integrity when everything else fails?

There's a special category of failure that keeps me up at night: the pipeline that doesn't fail at all. It runs. It completes. It reports success. And it produces wrong data.

This is worse than a crash. A crash is obvious. Wrong data is subtle. It propagates through your systems. It gets used in decisions. It might be days or weeks before someone notices that the numbers don't match reality. By then, you've shipped features based on bad metrics, sent reports with incorrect figures, and made strategic decisions using data that was quietly corrupted somewhere in your pipeline.

Resilient pipelines treat data integrity as a first-class concern, not an afterthought. They validate inputs before processing. They check invariants at stage boundaries. They maintain checksums or counts that let you verify that what went in matches what came out. And when validation fails, they fail the pipeline — loudly, specifically, and with enough context to diagnose the problem.
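
A boundary check doesn't have to be elaborate. Here's a hedged example that validates rows between two stages. The specific invariants (a unique, non-null `id` and a non-negative numeric `amount`) are placeholders for whatever rules your data actually has, and `IntegrityError` and `check_invariants` are hypothetical names.

```python
class IntegrityError(Exception):
    """Raised when a stage cannot guarantee its output is correct."""


def check_invariants(rows: list, stage: str) -> list:
    """Raise with context instead of passing suspect rows to the next stage."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        if row.get("id") is None:
            problems.append(f"row {i}: missing id")
        elif row["id"] in seen_ids:
            problems.append(f"row {i}: duplicate id {row['id']!r}")
        else:
            seen_ids.add(row["id"])
        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            problems.append(f"row {i}: invalid amount {amount!r}")
    if problems:
        # Include the stage name and the first few violations in the error.
        raise IntegrityError(
            f"{stage}: {len(problems)} violation(s): " + "; ".join(problems[:5])
        )
    return rows
```

The error message names the stage and the offending rows, which is the "enough context to diagnose the problem" part.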

The word I use here is "fail-closed." A fail-closed pipeline stops when it can't guarantee correctness. A fail-open pipeline keeps going and hopes nobody notices. Most pipelines are fail-open by default because that's the path of least resistance. It takes explicit design decisions to make them fail-closed.
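
The stale read replica from the migration story is a good candidate for a fail-closed guard. The sketch below assumes you can obtain a timezone-aware "last updated" timestamp for a source, which depends entirely on your database and replication setup. `StaleSourceError` and `assert_fresh` are illustrative names, not part of any library.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class StaleSourceError(Exception):
    """Raised when a source is too far behind to trust."""


def assert_fresh(source: str, last_updated: datetime, max_lag: timedelta,
                 now: Optional[datetime] = None) -> timedelta:
    """Refuse to read from a source whose data is older than max_lag."""
    now = now or datetime.now(timezone.utc)
    lag = now - last_updated  # both datetimes must be timezone-aware
    if lag > max_lag:
        raise StaleSourceError(
            f"{source} is {lag} behind (limit {max_lag}); refusing to proceed"
        )
    return lag
```

Calling a guard like this before each read turns "quietly consumed stale data" into a loud failure that names the source.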

One practical pattern: add a reconciliation stage at the end of every batch pipeline. Count the input records. Count the output records. Verify that the sum of a key metric matches between source and destination. These checks catch the silent failures — the dropped records, the duplicate writes, the join conditions that silently filter out valid data. They're not free. They add latency. But they turn invisible data corruption into visible, actionable errors.
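
A reconciliation stage along those lines could be as small as the following. It assumes a one-to-one pipeline where every input record should produce exactly one output record. Aggregating pipelines need a different count rule. The `amount` metric and the function name `reconcile` are placeholders.

```python
from decimal import Decimal


class ReconciliationError(Exception):
    """Raised when source and destination disagree after a batch run."""


def reconcile(source: list, dest: list, metric: str = "amount",
              tolerance: Decimal = Decimal("0")) -> dict:
    """Compare record counts and the sum of one metric between two datasets."""
    # Decimal avoids float rounding noise when summing monetary values.
    src_sum = sum((Decimal(str(r[metric])) for r in source), Decimal("0"))
    dst_sum = sum((Decimal(str(r[metric])) for r in dest), Decimal("0"))
    report = {
        "source_count": len(source), "dest_count": len(dest),
        "source_sum": src_sum, "dest_sum": dst_sum,
    }
    if len(source) != len(dest) or abs(src_sum - dst_sum) > tolerance:
        raise ReconciliationError(f"mismatch on {metric!r}: {report}")
    return report
```

In a real pipeline the two lists would usually be replaced by aggregate queries against the source and destination, since pulling every row back just to count it defeats the purpose.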


Question 5: Have I tested what happens when it breaks?

This is the question that separates teams who talk about resilience from teams who actually have it. Have you deliberately broken your pipeline in a controlled environment and watched what happened?

Most teams haven't. They test the happy path exhaustively. They verify that correct inputs produce correct outputs. They run load tests to confirm performance under expected volume. And then they deploy to production and hope the unexpected doesn't happen.

The teams that build genuinely resilient pipelines practice failure injection. They kill database connections mid-job. They introduce latency spikes in API calls. They corrupt input records and verify that the pipeline handles them correctly. They run pipelines with half the allocated memory and watch for graceful degradation instead of abrupt crashes.

This isn't chaos engineering for the sake of it. It's validation that your resilience mechanisms actually work. A circuit breaker that you've never triggered might not trip when you need it to. A retry policy that you've never tested might retry infinitely. A dead-letter queue that you've never inspected might be silently dropping every malformed record.

You don't need a sophisticated chaos engineering platform. You need the discipline to ask: what happens if this dependency is down? What happens if this input is malformed? What happens if this job runs twice by accident? And then you need to actually test those scenarios, not just assume they'll be fine.
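
To show how little tooling this takes, here's a sketch using nothing but Python's built-in `unittest`. `FlakyDependency` is a test double that injects faults, and `load_reference` is a stand-in for whatever code in your pipeline calls the dependency. Both are hypothetical.

```python
import unittest


class FlakyDependency:
    """Test double that fails the first `failures` calls, then succeeds."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def fetch(self, key: str) -> dict:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("injected fault: dependency unavailable")
        return {"key": key, "value": 1}


def load_reference(dep, key: str, max_attempts: int = 3) -> dict:
    """Code under test: a bounded retry around one dependency call."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return dep.fetch(key)
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


class FailureInjectionTests(unittest.TestCase):
    def test_recovers_from_brief_outage(self):
        dep = FlakyDependency(failures=2)
        self.assertEqual(load_reference(dep, "eu")["key"], "eu")
        self.assertEqual(dep.calls, 3)

    def test_gives_up_instead_of_retrying_forever(self):
        dep = FlakyDependency(failures=10**6)
        with self.assertRaises(RuntimeError):
            load_reference(dep, "eu")
        self.assertEqual(dep.calls, 3)
```

The second test is the one most teams skip. It proves the retry policy terminates, which is exactly the kind of assumption that otherwise gets verified for the first time in production.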


The bottom line

Resilience isn't a feature you add to a pipeline after it's built. It's a property that emerges from specific design decisions: isolation boundaries that limit blast radius, self-healing mechanisms that handle predictable failures, observability that makes debugging fast, integrity checks that prevent silent corruption, and tested failure modes that validate your assumptions.

The pipeline from the database migration I described earlier wasn't unlucky. It was built by a team that had never asked these five questions. Had they built explicit answers into their architecture, the story would have gone differently. When the hidden dependency failed, the pipeline wouldn't have silently produced garbage. It would have failed closed, alerted on the specific component, and routed affected records to a human review queue. The damage would have been contained to a delay in one dashboard. No downstream corruption. No bad decisions based on wrong data. No 3 AM emergency.

That's what resilience looks like. Not perfect uptime. Not infinite scalability. Just the confidence that when something breaks — and something always breaks — the system will behave predictably, contain the damage, and tell you exactly what happened.


What's next

If you're looking at your own pipelines right now, start with one question: can I name the five things this pipeline depends on, and do I know what happens when each one fails? If you can't answer that, you've found your starting point.

Pick one dependency. Test its failure mode. Watch what happens. Fix what breaks. Repeat.

Resilience isn't a destination. It's a practice. And the teams that practice it are the ones that sleep through the night.

For teams building streaming pipelines, layline.io provides built-in isolation boundaries, exactly-once processing guarantees, and visual debugging that makes it easier to trace failures when they happen — because they will happen, and what matters is how your system responds.


Andrew Tan is a serial entrepreneur and founder of layline.io, building enterprise data processing infrastructure that handles both batch and real-time workloads at scale.
