
AI coding: where it goes wrong

6 min read 22 Jan 2026

This is part 4 of 5. Start at part 1.

Tags: AI, agentic, software development, coding

I once helped a company recover from a significant security breach. It was an existential threat and made it on to the BBC News at Ten. Thousands of user passwords were stolen. The root cause was mundane: user input was rendered without proper sanitisation. A developer, likely tired or rushed or simply not thinking about it, had left a gap. It wasn’t malice or even incompetence in any dramatic sense. It was the kind of mistake humans make when they’re under pressure, or bored, or when the task feels routine.
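That class of bug is small enough to show in full. Here is a minimal sketch, assuming a hypothetical comment-rendering helper (the function names and markup are invented for illustration), of the difference between dropping user input straight into a page and escaping it first:

```python
import html

def render_comment_unsafe(user_input: str) -> str:
    # The mundane mistake: user input interpolated straight into markup.
    return f"<div class='comment'>{user_input}</div>"

def render_comment_safe(user_input: str) -> str:
    # Escaping turns markup characters into inert entities.
    return f"<div class='comment'>{html.escape(user_input)}</div>"

payload = "<script>steal(document.cookie)</script>"
print(render_comment_unsafe(payload))  # the script tag survives intact
print(render_comment_safe(payload))    # rendered as harmless text
```

The unsafe version works perfectly in every test that uses ordinary input, which is exactly why the gap goes unnoticed.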

AI can make the same mistake. It has read every piece of documentation on input sanitisation, every article, every security advisory. It still might not apply that knowledge in every context. The difference is that AI doesn’t get tired. It doesn’t get bored. It doesn’t resent the task. But it can still be wrong, confidently and without flagging uncertainty.

This is the honest picture. AI coding tools produce real risks. Those risks are manageable, but not by pretending they don’t exist.


The failure modes

The most common failure isn’t dramatic. It’s code that works, passes tests, and doesn’t do what was actually needed. The specification was ambiguous, or incomplete, or the AI interpreted it differently than intended. The tests validate what was built, not what should have been built. Everything looks green. The problem surfaces later, when users encounter behaviour that nobody expected.
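A deliberately simple illustration of the pattern: suppose a requirement said "round to the nearest whole number", and the generated code reached for Python's built-in `round()`, which uses banker's rounding (halves go to the nearest even number). The function name and the test are invented for this sketch:

```python
def round_price(value: float) -> int:
    # Spec said "round to the nearest whole number". Python's built-in
    # round() uses banker's rounding: 2.5 rounds to 2, not 3.
    return round(value)

# The test validates what was built, not what the stakeholder meant:
assert round_price(2.5) == 2   # passes, but the stakeholder expected 3
assert round_price(3.5) == 4   # passes, and happens to match expectations
```

Everything is green, and the discrepancy only surfaces when a customer queries an invoice.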

This happens with human developers too. The difference with AI is speed. You can generate a lot of code quickly, which means you can generate a lot of wrong code quickly. The verification bottleneck becomes more acute, not less.

Security vulnerabilities are a specific version of this. AI models learned from vast amounts of code, including code with security flaws. They can reproduce patterns that are technically functional but insecure. They won’t necessarily flag that they’re doing so. If you’re not testing for security specifically, you won’t catch it until someone else does.
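As a concrete, deliberately simplified example, here is the kind of technically functional but injectable query pattern that appears throughout public code, next to the parameterised version. The schema and function names are invented; the mechanism is standard SQL injection:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

def find_user_unsafe(name: str):
    # Functional, common in training data, and injectable.
    return conn.execute(
        f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(name: str):
    # Parameterised query: the driver keeps data out of the SQL text.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (name,)).fetchall()

payload = "' OR '1'='1"
print(find_user_unsafe(payload))  # returns every row in the table
print(find_user_safe(payload))    # returns nothing
```

Both functions pass a test that looks up "alice". Only an attack-shaped input distinguishes them, which is why security has to be tested for specifically.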

Then there’s the test suite problem. AI makes it easy to generate tests, which sounds like an unqualified good. But test suites require maintenance. As code evolves, tests break. Some break because something genuinely went wrong. Others break because the test is no longer relevant, or was poorly specified to begin with. When a large test suite has multiple failures, the temptation is to disable the noisy ones and focus on the urgent ones. Over time, you can end up with a test suite that passes because the difficult tests were turned off, not because the code is correct.

What doesn’t work

Line-by-line code review doesn’t scale, and arguably never did. Reviewing every line of AI-generated code to verify correctness would erase most of the productivity gain. It’s also not how serious security assessment works in practice.

When I’ve been through security audits for investment banking software, the auditors didn’t read the codebase line by line. They audited process: how is code reviewed, how are changes tracked, what controls exist, how are vulnerabilities identified and addressed. When they wanted to test actual security, they commissioned penetration testing, which attacks the running system to find weaknesses. That’s testing outcomes, not inspecting inputs.

The same principle applies here. You verify AI-generated code through outcomes: does it do what was specified, does it handle edge cases, does it resist attack. You don’t verify it by reading every line.

What does work

Testing at multiple levels. Unit tests verify individual components. Integration tests verify that components work together. Acceptance tests verify that the system meets requirements. Smoke tests verify that basic functionality hasn’t broken. Each layer catches different problems. None is sufficient alone.
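The layers above can be sketched around a hypothetical pricing module (all names and numbers invented; acceptance tests would sit outside the code, against the written requirements):

```python
def apply_vat(net: float, rate: float = 0.2) -> float:
    """Unit under test: one component, one responsibility."""
    return round(net * (1 + rate), 2)

def checkout(items: list) -> float:
    """Integration point: components working together."""
    return apply_vat(sum(items))

# Unit test: the component in isolation.
assert apply_vat(100.0) == 120.0

# Integration test: components combined.
assert checkout([40.0, 60.0]) == 120.0

# Smoke test: the most basic functionality hasn't broken at all.
assert checkout([]) == 0.0
```

Note what each layer would miss alone: the unit test says nothing about how `checkout` composes, and the smoke test would pass even if VAT were calculated wrongly.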

Requirements traceability. If you can’t map a piece of functionality back to a requirement, you can’t verify it’s correct. This discipline matters more with AI coding, because it’s easy to generate features that nobody asked for, or to implement something that doesn’t quite match what was specified.
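One lightweight way to keep that mapping honest is to make tests declare the requirement they verify. This is a sketch of one possible convention, not a standard; the requirement ID, decorator, and password rule are all invented:

```python
REQUIREMENTS = {
    "REQ-041": "Passwords must be at least 12 characters",
}

def traces(req_id: str):
    """Attach a requirement ID to a test so reports can cross-check both ways."""
    def decorator(test):
        test.requirement = req_id
        return test
    return decorator

def is_valid_password(pw: str) -> bool:
    return len(pw) >= 12

@traces("REQ-041")
def test_short_password_rejected():
    assert not is_valid_password("hunter2")

# A report can now flag tests with no requirement,
# and requirements with no test.
assert test_short_password_rejected.requirement in REQUIREMENTS
```

The point is the cross-check, not the mechanism: any scheme that lets you ask "which requirement does this code serve?" and "which test covers this requirement?" will do.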

Proportionate scrutiny. Code that controls a medical device needs more verification than code that displays cat videos. Code that handles financial transactions needs more scrutiny than code that renders a marketing page. This was always true. AI doesn’t change the principle, just the economics. The question is still: what’s the cost of failure here, and how much verification does that justify?

Readiness to respond. You won’t catch everything. This is true regardless of who or what wrote the code. Some bugs will reach production. Some security vulnerabilities will be discovered by someone other than you. The question is whether you can respond quickly: detect the problem, understand it, fix it, deploy the fix. AI can help here too. A vulnerability that would have taken days to patch can be fixed in minutes.

The uncomfortable truth

If your current development process doesn’t have these disciplines (solid testing, requirements traceability, security validation), AI coding will expose the gap painfully. The speed of AI amplifies whatever process you have. Good process becomes more productive. Weak process becomes more chaotic.

This is actually an argument for AI coding, not against it. The tools force a rigour that perhaps should have been there all along. Teams that adopt them seriously tend to get better at specification, testing, and verification, because the tools make the absence of those things immediately visible.

What this means for you

The risks of AI coding are real, but they’re not novel. They’re the same risks you have with human developers, running at higher speed. The mitigation is process: testing, traceability, security validation, incident readiness.

If you’re avoiding AI coding because of risk concerns, ask whether your current process would actually catch the problems you’re worried about. If it would, then it will catch them in AI-generated code too. If it wouldn’t, you have a process problem regardless.

In the next article, I’ll cover practical adoption: how to start if you haven’t, and how to get serious if you have.

Next: AI coding: getting started (or getting serious)
