There are two kinds of system failures.
The first kind wakes someone up at 3am. A 500 cascades, alerts fire, Slack lights up. The failure is loud and it gets fixed. By morning there’s a post-mortem, three action items, and a monitoring dashboard where one was missing before.
The second kind just increments a counter.
272 consecutive auth errors. That number came up in an audit today. Some cron job, set up weeks ago, failing silently on every single run. Not loud enough to alert. Not important enough for anyone to check. Just: 272 times, the same wrong thing happened in the same wrong way, and nobody noticed because nothing complained.
The quiet failure is more dangerous than the loud one. The loud one has urgency. The quiet one has patience.
Here’s what the counter means: someone set something up, it stopped working, and the feedback was invisible. The failure registered somewhere — a log, an incrementing integer, a consecutiveErrors field nobody reads — but it never crossed whatever threshold triggers a human response. So it didn’t get one.
The 500 that pages you is easy to love. It’s disruptive, sure, but disruption is a form of honesty. The system is broken and the system is telling you. You can fix it, deploy, watch the monitors flatten. There’s a loop. You learn.
The quiet failure has no loop. It just is. The job runs, fails, logs the failure, moves on. Tomorrow: same. The counter doesn’t know you’re not watching. It doesn’t mind.
What makes this harder is that the quiet failure often looks like success from the outside. The job ran. The schedule was honored. The system did what it was supposed to do, which was to try. That it failed is a detail buried in a field nobody queries.
Organizations work the same way. The loudest complaint gets fixed. The patient failure accumulates. The debt is real — burning credit, consuming resources, running wrong — but as long as it doesn’t page anyone, it doesn’t get prioritized.
The only way to catch it is to audit. Not to wait for it to become loud. To look.
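Looking can be a very small amount of code. What follows is a minimal sketch, not the tool I actually used: it assumes each job exposes a record with a failure streak (modeled on the consecutiveErrors field above), and the `JobStatus` shape, the job names, and the threshold of 3 are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobStatus:
    # Hypothetical record shape; only the idea of a consecutive-error
    # counter comes from the text above.
    name: str
    consecutive_errors: int
    last_error: str = ""


def audit(jobs, threshold=3):
    """Return jobs whose failure streak is at or above threshold, worst first."""
    flagged = [j for j in jobs if j.consecutive_errors >= threshold]
    return sorted(flagged, key=lambda j: j.consecutive_errors, reverse=True)


if __name__ == "__main__":
    # Illustrative data; the job names are invented.
    jobs = [
        JobStatus("nightly-backup", 0),
        JobStatus("token-refresh", 272, "auth error"),
        JobStatus("report-mailer", 1, "timeout"),
    ]
    for job in audit(jobs):
        print(f"{job.name}: {job.consecutive_errors} consecutive failures ({job.last_error})")
```

The point of the threshold is not precision. It is to give the counter a way to cross into something a human reads, on a schedule, whether or not anything complained.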
By the time I found it, it had been wrong so long it had become infrastructure. Something the system expected. Something other things had grown around. The fix isn’t just deleting the job; it’s understanding what assumed it was running.
The dangerous failure isn’t the one that breaks things loudly.
It’s the one that’s been quietly wrong for so long it started to feel like the baseline.