The Cockpit and the Code: Why Tech Teams Can Learn From Pilots
In the early days of a startup, “move fast and break things” is a rallying cry. It’s an exhilarating way to build. But as an organisation matures and the stakes move from “minor inconvenience” to “millions in lost revenue,” that approach becomes a liability. Over three decades in the hot seat, I’ve noticed that the most resilient technical teams aren’t necessarily the ones with the most “rockstar” coders; they are the ones that operate with a disciplined, systemic mindset. And, as an aviation enthusiast, I’ve noticed they have a lot in common with a flight crew.
Aviation is perhaps the only industry that has truly mastered the art of failing safely. It has turned the fallibility of the human brain into a science. If we want to build software that is genuinely robust, we need to stop treating engineering as a purely creative craft and start treating it as a high-stakes operation. And no, I am not saying that your e-commerce site falling over for a couple of hours carries the same stakes as flying hundreds of people at 38,000 feet. But there are interesting things we can all learn from aviation.
The “Just Culture” and the Death of Blame
The most significant gift aviation gave to the world is the concept of a “Just Culture.” When a near-miss occurs in the air, the priority isn’t to find a pilot to fire; it’s to understand how the system allowed the error to happen.
In many engineering departments, there is a lingering, unspoken fear: “If I break production, it’s on my head.” This is a recipe for disaster. When people are afraid of blame, they hide mistakes. They ignore the weird logs, and they don’t speak up when a deployment feels off. A pragmatic CTO knows that human error is usually a design error. If a single developer can take down your entire stack with one accidental command, your system, not your developer, is the failure. We need to build guardrails that make it difficult to do the wrong thing and easy to do the right thing.
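As a concrete illustration, here is a minimal sketch of such a guardrail in Python: a destructive operation that refuses to run against a protected environment unless the operator types an explicit confirmation phrase. The function, environment names, and table are all hypothetical.

```python
import sys

PROTECTED_ENVS = {"production", "prod"}

def drop_table(env, table, confirm_phrase=None):
    """A pretend destructive operation, wrapped in a guardrail."""
    expected = f"drop {table} in {env}"
    if env in PROTECTED_ENVS and confirm_phrase != expected:
        sys.exit(
            f"Refusing to drop '{table}' in {env}. "
            f"Re-run with confirm_phrase='{expected}' to proceed."
        )
    print(f"Dropped {table} in {env}")  # the real (dangerous) work would go here

if __name__ == "__main__":
    drop_table("staging", "orders")     # fine: not a protected environment
    drop_table("production", "orders")  # blocked: exits with guidance instead
```

The point is not this particular script but the shape: the dangerous path requires deliberate friction, while the safe path is the default.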
The Swiss Cheese Model of Defence
Talk about aviation safety for a little while and you’ll hear someone mention the “Swiss Cheese Model.” Imagine every layer of your process (your unit tests, your code reviews, your staging environment, your canary deploys) as a slice of Swiss cheese. Each has holes (weaknesses). An accident only happens when the holes in every single slice line up perfectly, allowing a threat to pass through.
Too often, tech teams rely on one “thick” slice, usually a thorough code review or a hero engineer who checks everything. This is a single point of failure. A mature engineering organisation builds “Defence in Depth”: assume the unit tests will miss something; assume the reviewer will be tired. Layer defences so that the holes never align.
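Sketched in Python, the idea looks something like the release gate below: several imperfect, independent checks, any one of which can block a release. The check functions are hypothetical stand-ins for real pipeline stages.

```python
def unit_tests_pass():
    return True   # stand-in: run your test suite

def reviewer_approved():
    return True   # stand-in: query your code-review tool

def staging_smoke_tests_pass():
    return True   # stand-in: run smoke tests against staging

def canary_metrics_healthy():
    return True   # stand-in: compare canary error rates to baseline

# Each slice is imperfect on its own; the gate only opens when every one holds.
SLICES = [unit_tests_pass, reviewer_approved,
          staging_smoke_tests_pass, canary_metrics_healthy]

def release_gate():
    for check in SLICES:
        if not check():
            print(f"Release blocked by {check.__name__}")
            return False
    return True

print("ship" if release_gate() else "hold")
```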
Aviate, Navigate, Communicate: Managing the Crisis
There is a strict hierarchy of tasks for pilots in an emergency: Aviate (keep the plane in the air), Navigate (figure out where you’re going), and only then, Communicate (talk to Air Traffic Control).
I’ve seen far too many senior technologists get this backwards during a major outage. As soon as the site goes down, they are in the “Communicate” phase—answering frantic DMs from the CEO, updating the status page, and taking Zoom calls with stakeholders. Meanwhile, the engineers (who should be “Aviating”) are being distracted by the noise.
In a crisis, the team needs a “Sterile Cockpit.” This is the rule that during critical phases of flight, all non-essential conversation is forbidden. When the system is haemorrhaging money, the only people talking should be the ones fixing it. The CTO’s job is to act as a bulkhead, protecting the engineers from executive pressure until the “plane” is level again.
The Myth of Tribal Knowledge
A Captain with 20,000 hours of flight time still uses a physical checklist for every takeoff. They don’t do this because they’ve forgotten how to fly. They do it because they know that stress, fatigue, and distraction are part of the human condition.
In tech, we tend to rely on “tribal knowledge.” We assume the Senior Dev knows how to handle the database migration because they’ve done it a dozen times. But what happens when that migration takes place at 3 AM on a Sunday? Memory is a point of failure. Every critical process, from onboarding a new hire to a multi-region failover, should have a written, version-controlled checklist. It takes the “heroics” out of the job and replaces them with predictable, repeatable success.
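As a minimal sketch of what “written, version-controlled” can look like in practice, imagine the steps living in a plain-text file in the repo, with a tiny runner that walks the operator through them one at a time. The file path and behaviour here are hypothetical.

```python
from pathlib import Path

def run_checklist(path="runbooks/db-migration.txt"):
    """Walk an operator through a version-controlled checklist, one step at a time."""
    steps = [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]
    for i, step in enumerate(steps, start=1):
        answer = input(f"[{i}/{len(steps)}] {step} -- done? [y/N] ")
        if answer.strip().lower() != "y":
            raise SystemExit(f"Stopped at step {i}: {step!r}. Resolve it, then re-run.")
    print("Checklist complete. Nothing left to memory.")

if __name__ == "__main__":
    run_checklist()
```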
And just as a junior first officer is encouraged to question the seasoned captain, remember that good ideas can come from anywhere. Encourage people, even the most junior, to speak up, voice their concerns and share their thoughts.
Trusting the Instruments (But Don’t Ignore Gut Feel)
When a pilot is flying in zero visibility, they have to ignore what their body is telling them. Their inner ear might say they are level, but the horizon indicator says they are in a steep bank. They are trained to “trust the instruments.”
This is the essence of modern observability. Your gut feel about why a system is slow may well be wrong. Instead, build telemetry you can trust implicitly: Real-User Monitoring, distributed tracing, and clear, actionable logs.
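As a minimal sketch of instrumenting over guessing, here is a standard-library-only Python decorator that times every call and emits the latency as a structured log line. The endpoint name is hypothetical, and a real system would ship this to a tracing or metrics backend rather than stdout.

```python
import json
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("telemetry")

def instrumented(endpoint):
    """Record latency for every call and emit it as a structured log line."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                log.info(json.dumps({
                    "endpoint": endpoint,
                    "latency_ms": round((time.perf_counter() - start) * 1000, 2),
                }))
        return wrapper
    return decorator

@instrumented("checkout")
def handle_checkout():
    time.sleep(0.05)  # stand-in for real work

handle_checkout()  # emits: {"endpoint": "checkout", "latency_ms": 50.xx}
```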
However, there is a caveat: gut feel matters. Experienced pilots sometimes feel a vibration or smell something that the instruments haven’t caught yet. In engineering, this is Junior Curiosity. If a developer says, “This deployment felt a bit slower than usual,” don’t dismiss it. That tiny anomaly is often the first hole appearing in your Swiss cheese. Investigate it before the holes line up.
Building for the Go-Around
Finally, we need the Go-Around mentality. If a pilot’s approach to the runway isn’t within a well-defined set of parameters, they don’t try to force the landing. They hit TOGA (the take-off/go-around button), power up, climb, and try again. There is no shame in it.
In software, this is the power of the Rollback. If your deployment shows a deviation in error rates for the first 1% of users, think about hitting the abort button. Don’t try to fix forward while your customers are suffering. The ability to revert to a known-good state in seconds is a fantastic safety net. And software has an even more modern option, one with no analogue in flying: feature flags.
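Here is a minimal sketch of both ideas in Python: an automated canary gate that chooses between promote and rollback based on error-rate deviation, and a feature flag acting as an even faster abort. The thresholds, flag, and function names are all hypothetical, not any particular vendor’s API.

```python
def canary_gate(baseline_error_rate, canary_error_rate, max_ratio=2.0):
    """Go around if the canary errors at more than max_ratio times the baseline."""
    if canary_error_rate > baseline_error_rate * max_ratio:
        return "rollback"   # hit TOGA: revert to the known-good release
    return "promote"        # approach is stable: continue the rollout

print(canary_gate(baseline_error_rate=0.002, canary_error_rate=0.011))  # rollback

# Feature flags are the even faster abort: the new path ships dark, and a
# single flag flip sends traffic back down the known-good path.
FLAGS = {"new_checkout_flow": False}

def legacy_checkout(cart):
    return f"legacy flow for {cart}"

def new_checkout(cart):
    return f"new flow for {cart}"

def checkout(cart):
    path = new_checkout if FLAGS["new_checkout_flow"] else legacy_checkout
    return path(cart)

print(checkout("cart-42"))  # legacy flow: the flag is off
```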
We don’t need rockstars or ninjas. We need well-honed crews that work well together. We need a culture that prizes discipline over ego, and systems that are designed to survive the inevitable human errors that come with building complex things.