Crisis Management: Keeping your head when production is down

4 min read · 18 Nov 2025
incident response · observability · leadership · mentorship

In the mid-90s, I was building a database system for an AAA-rated investment bank to track a $10B portfolio. It was high-stakes and high-pressure, and occasionally the entire LAN would simply fall over. When it did, the brokers on the trading floor didn’t send polite emails; they started screaming. Literally.

The hardware teams had replaced every cable and switch in the building. Vendors had been in and out. Eventually, the finger was pointed at my database. I was so convinced my queries couldn’t possibly be the culprit that I decided to prove it. I sat at my desk during the peak morning period and told the IT team I’d demonstrate its stability. I clicked ‘Execute’.

Five seconds later, a crackling American voice erupted from the squawk box on my desk: “It’s gone again!”

It was me. My system was opening too many connections and exhausting the limited slots in the network protocol (remember, this was 30 years ago). I’d proven the theory, certainly, but I’d done it at the worst possible moment without warning a soul.

The inevitability of failure

If your software never fails, I would argue you aren’t moving fast enough. In any commercial setting - outside of healthcare or flight systems - total stability is often a sign of stagnation. Systems fail; it is the nature of the beast. The mark of a seasoned team isn’t the absence of crises, but the methodical way they handle them.

When the red lights start flashing, the most dangerous thing you can do is panic-guess. I’ve seen junior engineers (and quite a few seniors) start changing configurations at random, hoping to hit the right lever. This is “voodoo engineering,” and it usually makes the hole deeper.

The answer is in the data

Nine times out of ten, the reason for the failure is staring you in the face. It is in the logs or the metrics.

We live in an age where tools like Grafana can be integrated in a matter of hours. They are worth their weight in gold. If you aren’t monitoring your systems, you aren’t managing them; you’re just hoping for the best. When things go south, the first step is to validate which metrics are out of line. Be deliberate. Verify your assumptions. If something looks slightly “off,” don’t ignore it: that’s usually the thread that unravels the whole mess.
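To make that concrete, here is a minimal sketch of what “validate which metrics are out of line” can look like in practice: a small Python script that runs a handful of PromQL instant queries against the Prometheus instance that typically sits behind a Grafana dashboard and flags anything above its usual ceiling. The URL, metric names, and thresholds are assumptions for illustration only; substitute whatever your own dashboards already track.

```python
# A sketch of checking metrics deliberately instead of guessing.
# Assumes a Prometheus instance (a common data source behind Grafana)
# is reachable at PROM_URL; the queries and ceilings below are
# hypothetical placeholders, not anyone's real production values.
import requests

PROM_URL = "http://localhost:9090"  # assumption: local Prometheus

# Hypothetical instant queries paired with the ceiling you'd normally expect.
CHECKS = {
    "open DB connections": ("sum(pg_stat_activity_count)", 200),
    "p95 request latency (s)": (
        "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
        0.5,
    ),
    "5xx rate (req/s)": ('sum(rate(http_requests_total{status=~"5.."}[5m]))', 1.0),
}

def instant_value(query: str) -> float | None:
    """Run a PromQL instant query and return the first result, if any."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=5)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else None

if __name__ == "__main__":
    # Print every metric alongside its expected ceiling, so the ones that
    # look only slightly "off" are visible too, not just the obvious blowouts.
    for name, (query, ceiling) in CHECKS.items():
        value = instant_value(query)
        flag = "??" if value is None else ("OUT OF LINE" if value > ceiling else "ok")
        print(f"{name:28s} {value!s:>12s}  (expected <= {ceiling})  {flag}")
```

The point of a script like this isn’t automation for its own sake; it’s that writing your expectations down forces you to verify your assumptions one metric at a time instead of leaping to the first plausible story.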

Managing the “Why” vs. the “When”

During a live incident, your stakeholders - from the C-suite to the people on the floor - will be screaming for reasons. They want to know why it is happening. (For more on managing executive and board communication, see my separate piece on that topic.)

You must resist the urge to give them an answer before the fire is out.

The middle of a crisis is not the time for a post-mortem. Giving out half-baked theories only creates confusion and leads to “I thought you said it was the database?” conversations an hour later. Your communication should be frequent but focused on two things:

  1. We are aware and working on it.
  2. The estimated time to recovery (if known).

Keep the “why” for the report you write once the service is restored.

When you’re back on an even keel

Writing up what happened should be Standard Operating Procedure. A short post-incident report captures the real cause while it is still fresh, shares the lesson with the people who weren’t on the call, and makes it far less likely you’ll fight the same fire twice.

Final thoughts

My ego in the 90s wanted to prove I was right. A more pragmatic version of myself would have waited for a quiet window, warned the folks on the trading floor, and tested the theory methodically.

When production goes down, your job isn’t to be the hero who guesses correctly; it’s to be the professional who follows the data. Keep your head, read the logs, and for heaven’s sake, don’t “test” your fixes in the middle of the morning rush without telling anyone.
