The Augmented Work.
An editorial on AI & the future of work
Essay № 10 · Articles

Your script runs. That doesn't make it a pipeline.

What actually separates a script that runs end-to-end from a production data pipeline — and the four-question contract you should write before you touch the code.

Issue May 2026
Read time 7 minutes
Filed under Data Engineering · Production Systems · Career
Length 1,900 words
In brief
Schematic Two pipelines, one promise. The contract is the difference.

"It runs end-to-end and it's scheduled" is a start line, not a finish line.

A script proves that one execution path worked, once, on your machine, against today's data, while you watched. A production data pipeline proves something much harder: that it will keep working on a Tuesday at 3 a.m., when the source schema changed overnight, nobody's looking, and the only person who'll notice the breakage is the analyst whose dashboard is now quietly wrong.

The difference isn't the code. It's the contract. A pipeline is software that makes promises to three parties — the source it reads from, the consumer it feeds, and the business that depends on the result — and answers four questions in writing: what, when, how, and why. Before you write any code, write that contract. Then hold the pipeline to a concrete readiness checklist (configurable, containerized, validated I/O, versioned, logged, idempotent, retryable, tested, infra-decoupled).

When it breaks at 3 a.m., who finds out, and how long after? The worst data failures aren't loud. A pipeline that crashes and pages you is working as intended — it told you. The dangerous one keeps running, keeps writing plausible-looking numbers, and says nothing.

If the honest answer is "the consumer, three days later," you didn't ship a pipeline. You shipped a liability you haven't named yet.

§ 01 · The Contract A pipeline is a contract, not a script — the four questions

The fastest way to look like a senior engineer is to answer four questions before writing code, in a doc, where your team can see them. A script answers none of these. A pipeline answers all four.

What — the schema and semantics. What's the exact shape of the data coming in and going out? Column names, types, value constraints, what a missing value means. Not "a CSV with some user data" — the actual contract: user_id is a non-null integer, signup_date is ISO-8601, revenue is in cents not dollars. This is a promise to your consumer: here is precisely what you'll get.
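
As a rough illustration of what that promise could look like once it leaves the doc, here is a minimal sketch using only the Python standard library. The three fields are the ones named above; the names UserRow, ContractViolation, and parse_user_row are hypothetical, and a real project might reach for a validation library instead.

```python
from dataclasses import dataclass
from datetime import date


class ContractViolation(ValueError):
    """Raised when a row breaks the 'what' contract (hypothetical name)."""


@dataclass(frozen=True)
class UserRow:
    user_id: int        # non-null integer
    signup_date: date   # parsed from an ISO-8601 string
    revenue_cents: int  # cents, never dollars


def parse_user_row(raw: dict) -> UserRow:
    """Turn an untrusted dict into a UserRow, or fail loudly."""
    user_id = raw.get("user_id")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(user_id, int) or isinstance(user_id, bool):
        raise ContractViolation(f"user_id must be a non-null integer, got {user_id!r}")

    try:
        signup = date.fromisoformat(raw["signup_date"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ContractViolation("signup_date must be an ISO-8601 date") from exc

    revenue = raw.get("revenue")
    # A float here usually means someone sent dollars instead of cents.
    if not isinstance(revenue, int) or isinstance(revenue, bool):
        raise ContractViolation(f"revenue must be integer cents, got {revenue!r}")

    return UserRow(user_id, signup, revenue)
```

The design choice worth noticing is that a bad row raises instead of being coerced: a revenue of 19.99 is rejected, not silently rounded.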

When — the schedule and freshness. How often does it run, and how late can the data be before someone downstream is making decisions on stale numbers? Daily at 6 a.m. is a when. "Whenever I remember to run the cell" is not. This is a promise about timeliness — an informal SLA, whether or not anyone calls it that.
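
A freshness promise is easy to state and easy to check. The sketch below assumes a daily cadence with a two-hour grace period; both numbers, and the function name is_stale, are illustrative assumptions, not a standard.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_success: datetime, now: datetime,
             expected_every: timedelta, max_lateness: timedelta) -> bool:
    """True when the newest successful run is older than the contract allows."""
    return now - last_success > expected_every + max_lateness


# Hypothetical contract: one run per day, at most 2 hours late.
DAILY = timedelta(days=1)
GRACE = timedelta(hours=2)

last_ok = datetime(2026, 5, 4, 6, 0, tzinfo=timezone.utc)
on_time = datetime(2026, 5, 5, 7, 30, tzinfo=timezone.utc)  # 25.5 hours later
too_late = datetime(2026, 5, 5, 9, 0, tzinfo=timezone.utc)  # 27 hours later

print(is_stale(last_ok, on_time, DAILY, GRACE))   # False
print(is_stale(last_ok, too_late, DAILY, GRACE))  # True
```

Something has to call this on a schedule of its own and alert when it returns True; the check is only the measurable half of the promise.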

How — the lineage you can trace. If a number looks wrong three steps downstream, can you trace it back to where it came from and what happened to it on the way? Lineage is the end-to-end record of a data asset's journey: its origin, every transformation applied, and where it lands. Without it, debugging a bad number means manually reading logs and code until you find the culprit — usually after the consumer already has.
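
Full lineage tooling is a bigger topic, but the cheapest useful version is a per-step record of rows in and rows out. The sketch below shows one way that could look; run_with_lineage and the two step functions are hypothetical, and the step names are just labels.

```python
import logging

logger = logging.getLogger("lineage")


def run_with_lineage(rows: list, steps: list) -> tuple:
    """Apply each (name, function) step in order and record rows in/out."""
    trail = []
    for name, fn in steps:
        before = len(rows)
        rows = fn(rows)
        entry = {"step": name, "rows_in": before, "rows_out": len(rows)}
        logger.info("lineage %s", entry)
        trail.append(entry)
    return rows, trail


# Hypothetical two-step job: drop rows with no user_id, then convert dollars to cents.
def drop_missing_ids(rows):
    return [r for r in rows if r.get("user_id") is not None]


def dollars_to_cents(rows):
    return [{**r, "revenue": round(r["revenue"] * 100)} for r in rows]


raw = [{"user_id": 1, "revenue": 19.99}, {"user_id": None, "revenue": 5.0}]
out, trail = run_with_lineage(
    raw, [("staged", drop_missing_ids), ("marts", dollars_to_cents)]
)
```

When a number looks wrong downstream, the trail answers the first debugging question (which step lost or changed the rows) without anyone rereading code.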

Why — the business purpose. What decision does this data feed? "It populates the exec revenue dashboard" is a why. "I'm not totally sure who uses it" is a five-alarm fire — it means a thing can break and you won't know how much it matters or who to warn. If you can't name the consumer, you can't size the blast radius.

If your "pipeline" can't answer all four, it isn't one yet. It's a script with a cron job and good intentions.

Pipeline contract · v1: The four questions, before the code
Doc № DE-10-04Q

What (schema · semantics)
  Pins down: Column names, types, value constraints, what a missing value means.
  Promise to: Consumer
  Weak: a CSV with user data
  Strong: user_id: non-null int · revenue in cents

When (schedule · freshness)
  Pins down: Cadence and the max staleness someone downstream can tolerate.
  Promise to: Consumer · Business
  Weak: when I remember to run it
  Strong: daily 06:00 UTC · max 2h late

How (lineage you can trace)
  Pins down: Origin, every transformation, and where it lands.
  Promise to: You, the debugger
  Weak: it just shows up in the warehouse
  Strong: raw → staged → marts · per-step row counts logged

Why (business purpose)
  Pins down: The decision this data feeds, and who makes it.
  Promise to: Business
  Weak: not totally sure who uses it
  Strong: feeds the exec revenue dashboard, refreshed pre-standup

Signed by the engineer · countersigned by the consumer, before any code is written.
Figure 01 The four-question contract — write these before the code.

§ 02 · The Checklist The production-readiness checklist

Here's the concrete gap between "it runs" and "it's production-grade." This is the part you can literally run down before you call something done. None of it is exotic; it's the baseline the field already agreed on. In Fundamentals of Data Engineering (O'Reilly, 2022) — the closest thing the discipline has to a standard reference — Joe Reis and Matt Housley treat most of this list not as best-practice nice-to-haves but as the operational undercurrents that run beneath every production system. In other words: this isn't my bar. It's the field's.

Run it down against whatever you're about to ship. You don't need all eleven on day one for every internal job. But you do need to know which ones you're skipping and why — because each skipped item is a promise in the contract you've quietly decided not to keep.

Pre-flight · production-readiness: Run it down before you call it done

Eleven items in three groups: I · Reproducible, II · Trustworthy, III · Operable.

Figure 02 The production-readiness checklist — know which ones you're skipping, and why.
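
Two of the items, idempotent and retryable, are small enough to sketch. This is a minimal illustration under stated assumptions: an in-memory dict stands in for the warehouse, and write_partition and with_retries are hypothetical helpers, not any library's API.

```python
import time


def write_partition(store: dict, run_date: str, rows: list) -> None:
    """Idempotent write: replace the whole partition for run_date.

    Re-running the same date overwrites instead of appending, so a retry
    or a backfill cannot double-count.
    """
    store[run_date] = list(rows)


def with_retries(fn, attempts: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**n and try again, then re-raise."""
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: fail loudly instead of swallowing it
            sleep(base_delay * 2 ** n)


# Usage: the second write for the same date replaces the first.
warehouse = {}
write_partition(warehouse, "2026-05-04", [{"user_id": 1}])
write_partition(warehouse, "2026-05-04", [{"user_id": 1}])
print(len(warehouse["2026-05-04"]))  # 1
```

The two belong together: retries are only safe when the step being retried is idempotent, otherwise every transient failure risks a duplicate load.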

§ 03 · The 3 a.m. Test When it breaks, who finds out — and when?

Every item on that checklist serves one question: when this breaks, who finds out, and how long after?

This is the test that separates a pipeline from a liability, because the worst data failures aren't loud. A pipeline that crashes at 3 a.m. and pages you is working as intended — it told you. The dangerous one keeps running, keeps writing plausible-looking numbers, and says nothing. By the time someone notices, the bad data has propagated into dashboards, reports, and decisions.

Scenario · same failure, two outcomes: at 03:00 the source schema changes

Pipeline (contracted): found in minutes by a monitor
  03:00 Validation rejects bad row
  03:01 Alert fires · on-call paged
  06:00 Run held · consumer notified
  10:00 Fix shipped · backfill clean

Script (uncontracted): found in days by the consumer
  03:00 Bad row written silently
  06:00 Dashboards refresh on wrong data
  Day +2 Decisions made on the numbers
  Day +4 Consumer flags it · retraction

Figure 03 Two pipelines, same failure at 03:00. One is found in minutes; the other in days.

And "someone" is rarely the engineer. In Monte Carlo's 2023 State of Data Quality survey (200 data professionals, conducted by Wakefield Research — worth noting Monte Carlo sells observability tooling, so read it as directional), 74% of respondents said business stakeholders find data quality issues first, most or all of the time — up from 47% the year before. The people discovering your broken pipeline are the ones consuming its output, not the ones who built it. That's the liability test failing in the wild, at industry scale.

The cost of silent failure isn't hypothetical.

Case study · Unity Technologies · 2022

$110M

One unvalidated input. No boundary check. A nine-figure consequence that surfaced in an earnings call instead of an alert.

What happened
Unity ingested bad data from one large customer into the machine-learning model behind its Audience Pinpointer ad-targeting tool. Nothing caught it at the boundary, so it ran.
The damage
Management estimated the hit at ~$110M in revenue for the year. The stock dropped about 37% on the earnings news.

So run the test on whatever you're about to ship. If it breaks tonight, does a monitor catch it, or does the analyst catch it Thursday? If it writes garbage, does validation reject it, or does it land in the warehouse looking fine?
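
One way to make the "validation rejects it" branch concrete is a batch-level boundary check that holds the run when too many rows fail. The sketch below is an assumption-laden illustration: RunHeld and check_batch are hypothetical names, the 1% threshold is arbitrary, and the paging step is left to whatever alerting the team already has.

```python
class RunHeld(RuntimeError):
    """Raised to stop a run before bad data reaches consumers (hypothetical name)."""


def check_batch(rows: list, required: list, max_reject_ratio: float = 0.01) -> tuple:
    """Split rows into accepted/rejected; hold the run if too many are rejected."""
    if not rows:
        raise RunHeld("empty batch; run held")  # silence from the source is suspicious too

    accepted, rejected = [], []
    for row in rows:
        missing = [col for col in required if row.get(col) is None]
        (rejected if missing else accepted).append(row)

    ratio = len(rejected) / len(rows)
    if ratio > max_reject_ratio:
        raise RunHeld(f"{len(rejected)}/{len(rows)} rows rejected; run held")
    return accepted, rejected


# A source that renames revenue to revenue_usd overnight fails every row,
# so the run stops here instead of landing in the warehouse.
try:
    check_batch([{"user_id": 1, "revenue_usd": 19.99}], ["user_id", "revenue"])
except RunHeld as held:
    print(f"page on-call: {held}")
```

The caller decides what "page on-call" means; the point of the sketch is only that the failure becomes an exception at the boundary, not a plausible-looking row.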

"The consumer, three days later" is not a pipeline. It's a liability with a schedule.

§ 04 · The Notebook Question Notebooks aren't the enemy — promotion-without-refactor is

None of this is an argument against notebooks. Notebooks are the right tool for what they're for: exploration. The tight write-run-see loop, inline plots, poking at a dataset until you understand its shape — a notebook is genuinely better than a .py file for that work, and every good pipeline starts life as one. The exploration is real engineering, not a lesser warm-up act.

The sin isn't writing a notebook. It's promoting the notebook to production unchanged and calling it shipped.

There's hard evidence for why that fails.

Reproducibility · GitHub at scale: What happens when you re-run a million notebooks?

Notebooks analysed (Pimentel et al., from GitHub): 1.16M (100%)
Ran top-to-bottom without error in a clean environment: ~278K (24%)
Reproduced their own results (same numbers, same charts): ~46K (4%)
Figure 04 Pimentel et al. analysed 1.16M Jupyter notebooks. Only 24% ran cleanly; about 4% reproduced their original results.

The cause is the notebook's own nature. Hidden kernel state and out-of-order cell execution mean a notebook that "works" on your screen often can't be cleanly re-run by anyone, including future you. That's fine for exploration. It's disqualifying for a system that has to run unattended, the same way, every night. (A Thoughtworks article published on martinfowler.com makes the same case from the architecture side: notebooks couple presentation, logic, and data into one file and invite manual tinkering, the opposite of what production needs.)

So the professional move, the day exploration code is headed for prod, is to say the refactor out loud. Not "I'll just productionize the notebook" — but "the notebook proved the logic; now I owe the team a refactor into a tested, validated, idempotent pipeline." Naming that debt is what a senior engineer does. Hiding it inside a copied-over .ipynb is how you become the Unity case study.

§ 05 · The Closing Image A pipeline is a logistics shipment

There's merchandise (the data), a tracking number (lineage), a delivery window (the schedule), and a recipient who's promised exactly what's arriving (the schema). All of it contracted, all of it traceable, all of it accountable when something goes wrong.

Closing image · the parcel: One artifact, two versions

With paperwork (source → consumer): DATA MANIFEST = What · WINDOW = When · TRACKING = How · RECIPIENT = Why. Contracted · traceable · accountable.
Without paperwork: source unknown, data unknown. No manifest · no tracking · no recipient. Smuggling.
Same parcel. Same line. The paperwork is the entire difference.
Figure 05 A pipeline is a contracted shipment. Without the paperwork, it's smuggling.

Run the same shipment with no manifest, no tracking, and no one expecting it at the other end, and you don't have logistics. You have smuggling — and it works right up until the moment it very expensively doesn't.

The one thing to remember

Before you ship: write the four questions. Run the checklist. Ask who finds out at 3 a.m.

If you can do all three, you built a pipeline. If you can't, you still have time to fix that, which is the entire reason to ask now instead of in an earnings call.

Sources

  1. Pimentel, J.F. et al. A Large-Scale Study About Quality and Reproducibility of Jupyter Notebooks. 1.16M notebooks; 24% ran without error, ~4% reproduced results. IEEE/MSR 2019; PMC, 2021.
  2. Reis, J. & Housley, M. Fundamentals of Data Engineering. O'Reilly, 2022 — idempotency, retries, and lineage as core operational undercurrents.
  3. Martin Fowler / Thoughtworks. Don't put data science notebooks into production. martinfowler.com, updated Nov 2020.
  4. Monte Carlo / Wakefield Research. 2023 State of Data Quality Survey — 74% report business stakeholders find issues first. BusinessWire / TDWI, May 2023.
  5. IBM Institute for Business Value and Unity Q1 2022 earnings reporting — ~$110M revenue impact from ingested bad data. IBM Think; The Motley Fool, May 2022.
  6. Pydantic. v2, Rust-based validation core. pydantic.dev / GitHub, current.
  7. King, A. Parse, Don't Validate. lexi-lambda.github.io, Nov 2019.