
Observability: you can't fix what you can't see

6 min read · 2 Apr 2026

observability · monitoring · metrics · logging · reliability · architecture

A company I was consulting for had just completed a full system rewrite. The new platform was well built, highly functional, and largely bug-free. The team was experienced, the architecture was sound, and everyone was quietly confident about the upcoming launch.

Then they put it under load. Not even real-world load; roughly a tenth of it. The system ground to a halt.

What followed was more than a month of skilled developers scratching their heads. They tried everything: optimising queries, adjusting thread pools, scaling infrastructure. Nothing moved the needle in any meaningful way, because they were guessing. The system had been built with almost no instrumentation. No detailed metrics. No way to see where the time was actually being spent.

I had to persuade them to stop, go back, and retrofit a proper metrics-capturing layer. It wasn’t a quick conversation; there was understandable resistance to “going backwards” when the deadline had already slipped. But once the instrumentation was in place, the bottlenecks were obvious. The problems were resolved in a couple of hours. Not days. Hours.

Had that observability been baked in from the start, they would have hit their original deadline.

The cost of zero

Here is the thing that still surprises me after thirty years: the cost of building observability into a system from the beginning is close to zero. A few lines of structured logging. Some well-placed timers. A metrics endpoint that feeds a dashboard. It is trivial work compared to the effort of building the system itself.
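
To make "close to zero" concrete, here is roughly what that first layer looks like. A minimal sketch in Python, standard library only; the `timed` decorator and the field names are my own illustration, not a prescription:

```python
import json
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("app")

def timed(operation):
    """Log a structured timing record for every call to the wrapped function."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                log.info(json.dumps({
                    "event": "operation_timed",
                    "operation": operation,
                    "duration_ms": round(elapsed_ms, 2),
                }))
        return wrapper
    return decorator

@timed("load_customer")
def load_customer(customer_id):
    time.sleep(0.05)  # stand-in for a database call
    return {"id": customer_id}

load_customer("c-42")
```

Twenty-odd lines, written once, and every decorated function reports where its time goes.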

And yet team after team skips it. Not out of laziness, but because it feels like a “nice to have.” Something to add later, once the core features are shipped. The problem is that “later” usually arrives in the form of a crisis, and by then you are trying to fight a fire in a building with no windows.

Three things you actually need

Observability gets dressed up in complicated language, but it boils down to three things:

Logs tell you what happened. A user hit an error. A payment failed. A background job timed out. Structured logs, where each entry is a searchable, filterable record rather than a line of free text, are the difference between finding the needle and staring at the haystack. And here is an important point: don’t just log errors and exceptions. Log the normal. Log the routine. When everything is working, those logs form a baseline that shows you what “healthy” looks like. When something does go wrong, that baseline is what sheds light on the darkness. You can’t spot the anomaly if you’ve never recorded the norm.
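
What does "structured" look like in practice? A minimal sketch using Python's standard logging module; the `JsonFormatter` class and the field names are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single searchable JSON object."""
    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        entry.update(getattr(record, "fields", {}))  # structured fields passed via extra=
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Log the routine as well as the failures: the successes are your baseline.
log.info("payment_processed",
         extra={"fields": {"order_id": "A1001", "amount": 42.50, "duration_ms": 87}})
log.error("payment_failed",
          extra={"fields": {"order_id": "A1002", "reason": "card_declined"}})
```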

Metrics tell you how much and how fast. Request rates, response times, error percentages, queue depths. These are the vital signs of your system. When a metric drifts outside its normal range, you know something has changed before your users start complaining.
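
If you run Prometheus or something like it, capturing those vital signs is a handful of lines. A sketch assuming the prometheus_client Python package; the metric names and the simulated work are illustrative:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():                       # records the duration in the histogram
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for the scraper to collect
    while True:
        handle_request()
```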

Traces tell you where. When a single request touches five services and a database, a trace shows you exactly which step took 800 milliseconds instead of 20. Without traces, you are left with finger-pointing between teams, each insisting their service is fine.
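
Tracing sounds heavyweight, but the instrumentation itself is small. A sketch assuming the OpenTelemetry Python SDK, printing finished spans to the console; in production you would export to a collector instead, and the span names here are made up:

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints each finished span, with its duration, to stdout.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("checkout-service")

def checkout():
    with tracer.start_as_current_span("checkout"):
        with tracer.start_as_current_span("reserve_stock"):
            time.sleep(0.02)  # stand-in for an inventory-service call
        with tracer.start_as_current_span("charge_card"):
            time.sleep(0.8)   # the 800-millisecond culprit shows up by name
        with tracer.start_as_current_span("send_receipt"):
            time.sleep(0.02)

checkout()
```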

You need all three. Logs without metrics leave you reacting to individual events without seeing patterns. Metrics without traces show you something is slow but not why. It is a stool with three legs; remove one and it falls over.

Dashboards are not observability

A common trap is to install Grafana, build a few dashboards, and declare the job done. Dashboards that nobody looks at are worse than no dashboards at all, because they create a false sense of security. “We have monitoring” becomes the answer to every audit question, while the actual alerts go to an email inbox that nobody reads.

Observability is only useful if it is actionable. That means alerts that fire when something genuinely needs attention, thresholds that are tuned to your system’s actual behaviour, and a culture where the team responds to signals rather than waiting for customer complaints. Pilots trust their instruments because those instruments are calibrated, maintained, and taken seriously. The same discipline applies here.

Alert fatigue is the other side of this coin. If every minor fluctuation triggers a page, the team learns to ignore alerts entirely. Tuning your alerts so they mean something is not a one-off task; it is ongoing hygiene, like keeping your tests green.
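
One way to make an alert mean something is to derive its threshold from the recorded norm rather than a hunch. A toy sketch; the 99th-percentile-plus-headroom rule and the numbers are purely illustrative, and the point is only that the baseline sets the bar:

```python
import statistics

def alert_threshold(samples_ms, slack=1.5):
    """Page above the 99th percentile of healthy latencies, with headroom for jitter."""
    p99 = statistics.quantiles(samples_ms, n=100)[-1]
    return p99 * slack

# Response times (ms) observed during a healthy week: the recorded norm.
baseline = [32, 35, 38, 41, 44, 47, 52, 58, 63, 71, 85, 110]
print(f"Page only above {alert_threshold(baseline):.0f} ms")
```

Re-run the calculation as the system evolves; yesterday's sensible threshold is tomorrow's noise.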

The business case

For the executives reading this: observability is insurance. You already insure your offices, your equipment, and your people. Observability insures your ability to diagnose and recover from the inevitable failures that come with running software at scale.

The arithmetic is straightforward. Calculate the cost of an hour of downtime for your business: lost transactions, SLA penalties, customer trust, engineering time spent firefighting. Now compare that with the cost of a monitoring stack. For most organisations, the monitoring pays for itself after preventing a single significant incident.
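
A worked example with made-up numbers; substitute your own:

```python
# Back-of-envelope break-even. Both figures are illustrative placeholders.
downtime_cost_per_hour = 25_000    # lost transactions, SLA penalties, firefighting
monitoring_cost_per_year = 60_000  # licences, hosting, engineering time

hours = monitoring_cost_per_year / downtime_cost_per_hour
print(f"Monitoring pays for itself after ~{hours:.1f} hours of avoided downtime a year")
```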

Beyond the insurance argument, good observability gives you strategic visibility. Ask your CTO: can you show me how much each part of the system is actually used? What are the response times for individual functions? How is usage evolving over time? If they can pull up a dashboard and answer those questions in five minutes, you have a team that understands its own product. If the answer is a shrug and a promise to “look into it,” you are making business decisions without data. Observability is not just about catching fires; it is about understanding what your technology is actually doing, every day, in ways that inform product decisions, capacity planning, and investment priorities.

More than that, technical auditors and investors notice this stuff. A company with solid observability signals engineering maturity. A company that cannot answer “how would you know if this system was degrading?” signals risk.

Bake it in

The lesson from that painful month of blind debugging is simple. Observability is not a feature you add after the important work is done. It is part of the important work. The cost of including it from day one is negligible. The cost of not having it, as that team discovered, is measured in missed deadlines, lost confidence, and problems that should have taken hours taking weeks.

If your team is building a new system right now, ask them one question: “If this slows to a crawl under load tomorrow, how will we know where the problem is?” If the answer involves guesswork, you have a gap that is trivially cheap to close today and painfully expensive to close later.


Enjoy this article? You might also like Crisis management: keeping your head when production is down or The cockpit and the code: why tech teams can learn from pilots.
