Crisis Management: Keeping your head when production is down

4 min read · 18 Nov 2025
incident response · observability · leadership · mentorship

In the mid-90s, I was building a database system for an AAA-rated investment bank to track a $10B portfolio. It was high-stakes and high-pressure, and occasionally the entire LAN would simply fall over. When it did, the brokers on the trading floor didn’t send polite emails; they started screaming. Literally.

The hardware teams had replaced every cable and switch in the building. Vendors had been in and out. Eventually, the finger was pointed at my database. I was so convinced my queries couldn’t possibly be the culprit that I decided to prove it. I sat at my desk during the peak morning period and told the IT team I’d demonstrate its stability. I clicked ‘Execute’.

Five seconds later, a crackling American voice erupted from the squawk box on my desk: “It’s gone again!”

It was me. My system was opening too many connections and exhausting the limited slots in the network protocol (remember, this was 30 years ago). I’d proven the theory, certainly, but I’d done it at the worst possible moment without warning a soul.

The inevitability of failure

If your software never fails, I would argue you aren’t moving fast enough. In any commercial setting - outside of healthcare or flight systems - total stability is often a sign of stagnation. Systems fail; it is the nature of the beast. The mark of a seasoned team isn’t the absence of crises, but the methodical way they handle them.

When the red lights start flashing, the most dangerous thing you can do is panic-guess. I’ve seen junior engineers (and quite a few seniors) start changing configurations at random, hoping to hit the right lever. This is “voodoo engineering,” and it usually makes the hole deeper.

The answer is in the data

Nine times out of ten, the reason for the failure is staring you in the face. It is in the logs or the metrics.

We live in an age where tools like Grafana can be integrated in a matter of hours. They are worth their weight in gold. If you aren’t monitoring your systems, you aren’t managing them; you’re just hoping for the best. When things go south, the first step is to validate which metrics are out of line. Be deliberate. Verify your assumptions. If something looks slightly “off,” don’t ignore it: that’s usually the thread that unravels the whole mess.
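To make that concrete, here is a minimal sketch of what “validate which metrics are out of line” can look like in practice: a small Python script that runs a handful of PromQL instant queries against the Prometheus instance that typically sits behind a Grafana dashboard and flags anything above its usual ceiling. The URL, metric names, and thresholds are assumptions for illustration only; substitute whatever your own dashboards already track.

```python
# A sketch of checking metrics deliberately instead of guessing.
# Assumes a Prometheus instance (a common data source behind Grafana)
# is reachable at PROM_URL; the queries and ceilings below are
# hypothetical placeholders, not anyone's real production values.
import requests

PROM_URL = "http://localhost:9090"  # assumption: local Prometheus

# Hypothetical instant queries paired with the ceiling you'd normally expect.
CHECKS = {
    "open DB connections": ("sum(pg_stat_activity_count)", 200),
    "p95 request latency (s)": (
        "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
        0.5,
    ),
    "5xx rate (req/s)": ('sum(rate(http_requests_total{status=~"5.."}[5m]))', 1.0),
}

def instant_value(query: str) -> float | None:
    """Run a PromQL instant query and return the first result, if any."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=5)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else None

if __name__ == "__main__":
    # Print every metric alongside its expected ceiling, so the ones that
    # look only slightly "off" are visible too, not just the obvious blowouts.
    for name, (query, ceiling) in CHECKS.items():
        value = instant_value(query)
        flag = "??" if value is None else ("OUT OF LINE" if value > ceiling else "ok")
        print(f"{name:28s} {value!s:>12s}  (expected <= {ceiling})  {flag}")
```

The point of a script like this isn’t automation for its own sake; it’s that writing your expectations down forces you to verify your assumptions one metric at a time instead of leaping to the first plausible story.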

Managing the “Why” vs. the “When”

During a live incident, your stakeholders - from the C-suite to the people on the floor - will be screaming for reasons. They want to know why it is happening. (For more on managing executive and board communication, see my separate piece on that topic.)

You must resist the urge to give them an answer before the fire is out.

The middle of a crisis is not the time for a post-mortem. Giving out half-baked theories only creates confusion and leads to “I thought you said it was the database?” conversations an hour later. Your communication should be frequent but focused on two things:

  1. We are aware and working on it.
  2. The estimated time to recovery (if known).

Keep the “why” for the report you write once the service is restored.

When you’re back on an even keel

Writing up what happened should be Standard Operating Procedure. A short post-incident report captures the real cause while it is still fresh, shares the lesson with the people who weren’t on the call, and makes it far less likely you’ll fight the same fire twice.

Final thoughts

My ego in the 90s wanted to prove I was right. A more pragmatic version of myself would have waited for a quiet window, warned the folks on the trading floor, and tested the theory methodically.

When production goes down, your job isn’t to be the hero who guesses correctly; it’s to be the professional who follows the data. Keep your head, read the logs, and for heaven’s sake, don’t “test” your fixes in the middle of the morning rush without telling anyone.
