When Good Metrics Go Bad
A few years ago, a department I managed shipped a product that solved a problem the entire company had acknowledged for years but nobody had managed to tackle. We did the victory lap with the board. It was a genuine shift in how the business operated, and ultimately it was about making customers happy.
But the overall market was slowing. Revenue for that product line stayed flat. So the board wanted metrics to prove it was working.
We tried a few. Nothing stuck. The product was clearly valuable - users loved it, it kept us competitive - but we couldn’t tie it to a number that would satisfy a spreadsheet.
My argument was simple: you might not have seen revenues go up after we launched this, but I think you’d have seen them go down significantly without it. And more fundamentally, not everything needs a metric to prove it’s a good idea.
Every year, finance does a budget. Where’s the metric that proves budgeting works? There isn’t one. What about performance reviews? I’ve never seen a firm that measures the ROI of annual appraisals. We do these things because they’re sensible practices that help run a business. The absence of a direct metric doesn’t mean something isn’t valuable.
That’s the first metric problem. The other is known as Goodhart’s Law.
Goodhart’s Law
Charles Goodhart, an economist advising the Bank of England in the 1970s, made an observation that is now usually summarised as: “When a measure becomes a target, it ceases to be a good measure.”
The moment you tell people they’ll be judged by a number, they optimise for the number - not the outcome the number was supposed to represent. This isn’t malice. It’s human nature. And it shows up all the time in software engineering teams.
The Problem Metrics
Velocity points. Teams learn to inflate estimates over time. A “20-point sprint” on one team means something entirely different on another. Velocity was designed to help teams plan their own capacity, not to compare performance across the organisation. The moment it becomes a target, teams game it.
Lines of code. This one should be dead by now, and mostly it is, but some people still use it. Measuring lines of code encourages bloat and penalises elegant solutions. The best code is often the code you delete.
Tickets closed. When you reward closing tickets, you get ticket-splitting and cherry-picking. Engineers gravitate toward quick wins and avoid the gnarly, important problems that take longer.
Code coverage percentage. I’ve seen teams hit 90% coverage with tests that assert nothing meaningful. They technically cover the lines but don’t actually test behaviour - there’s a sketch of what that looks like just after this list. The metric looks healthy while the safety net has holes.
Deployment frequency. A worthy goal in principle, but when it becomes a target, config changes get counted as “deploys” to hit the number. The metric goes up; actual delivery doesn’t change.
Bug counts. These devolve into endless debates about what counts as a bug versus a feature request. Teams learn to reclassify issues rather than fix them.
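To make the coverage point concrete, here’s a minimal sketch of a test that gives a function 100% line coverage without checking a single behaviour. The function and test are made up for illustration; the pattern is what matters.

```python
def apply_discount(price, percent):
    """Return the price after applying a percentage discount."""
    if percent < 0 or percent > 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)


def test_apply_discount():
    # Every line of apply_discount is executed, so coverage reports 100%...
    apply_discount(100, 10)
    try:
        apply_discount(100, 150)
    except ValueError:
        pass
    # ...but nothing is asserted. If the discount maths were silently wrong,
    # this test would still pass, and the dashboard would still look green.
```

A coverage tool only counts executed lines; it has no idea whether the test would notice a regression.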
And it’s not just engineering. I’ve seen sales teams encourage a customer to let an old contract expire and then sign a new one, simply because the commission is higher on a new contract than it is on a renewal. Or the recruiters who create a second profile for the same person because it makes their pipeline look bigger.
Why This Happens
Leadership wants visibility into something inherently difficult to measure. Software development is creative, non-linear work. It doesn’t map neatly onto factory metrics.
There’s also a temptation to compare teams, which is almost always toxic. Different teams work on different problems with different constraints. Comparing their velocity is like comparing their shoe sizes - it’s a number, but it doesn’t tell you anything useful.
Good intentions calcify into gaming. A metric introduced to spot problems becomes a target, and suddenly engineers avoid refactoring because it “doesn’t count” toward sprint goals. Technical debt accumulates because paying it down doesn’t move the numbers that matter to someone’s dashboard.
Again, it can be even worse outside engineering, where incentive schemes are tied to metrics. If you want to watch people get really creative with a metric, tie their remuneration to it.
What to Measure Instead
Measure outcomes, not outputs. Customer satisfaction, time-to-value for users. These are harder to attribute to a single team, but they’re what actually matters. Revenue is often the mother of all metrics, but it’s noisy, laggy and hard to attribute.
The DORA metrics - deployment frequency, lead time, change failure rate, mean time to recovery - are useful precisely because they measure the health of your delivery pipeline rather than individual productivity. But they work when you use them to monitor trends over time. They don’t work, in my opinion, for evaluating people or teams, and they don’t work as targets: Goodhart’s Law.
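For what it’s worth, these numbers are cheap to derive if you already keep a timestamped record of each production deployment. The sketch below assumes a made-up Deployment record - the field names aren’t any standard schema - and produces trend-level figures, not anything per person.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                   # first commit in the change
    deployed_at: datetime                    # when it reached production
    caused_failure: bool = False             # did it trigger an incident?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_summary(deploys: list[Deployment], window_days: int = 30) -> dict:
    """Trend-level DORA figures over a window of recent deployments."""
    if not deploys:
        return {}
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.caused_failure]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lt.total_seconds() for lt in lead_times) / 3600,
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_recovery_hours": mean(r.total_seconds() for r in recoveries) / 3600
        if recoveries else None,
    }
```

The useful version of this is a chart over months, not a leaderboard.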
Qualitative health checks matter. Talk to your teams. Are they proud of what they’re shipping? Do they feel like they’re building something worthwhile? Are they drowning in incidents or able to focus on new work? These conversations tell you more than any dashboard. In my experience, the vast majority of people building software want to work hard and do good work. And the odd one who doesn’t will be apparent by other means.
The best engineering leaders I know spend less time staring at metrics and more time having honest conversations about what’s working and what isn’t.
The Pragmatic Conclusion
Sometimes the right answer is to trust your judgement. If users like it, customers benefit, and it keeps you competitive, it’s probably working - even if you can’t prove it with a number.
Not everything that counts can be counted.