Navigate Ways of Working
DevOps Principles
Infrastructure, deployment, and operational practices. Cloud-default hosting, Docker containerisation, IaC, CI/CD pipeline, deployment practices, observability, disaster recovery, and more.
Overview
This document defines the principles, responsibilities, and practices for infrastructure, deployment, and operational concerns. It is deliberately cloud-agnostic and technology-flexible — the principles hold whether the infrastructure is cloud-hosted, co-located, or on-premises, though cloud is the default starting position.
The operating model does not include a dedicated infrastructure team. Infrastructure ownership sits within the platform squad, with one or two DevOps-focused engineers who specialise in infrastructure concerns, cross-system integration, and operational tooling. Product squad developers interact with infrastructure through self-service tooling — they press buttons, they don’t build the buttons.
1. Principles
Default to cloud, but don’t assume it
Cloud hosting is the default. It offers the fastest path to production, the lowest operational overhead at small scale, and the most flexibility. But some use cases require co-located or on-premises infrastructure — regulatory requirements, data sovereignty, latency-sensitive workloads, cost optimisation at scale, or client constraints. The architecture must not be so tightly coupled to a single cloud provider that these options become impossible.
In practice: Abstract infrastructure behind well-defined interfaces. Application code should not contain cloud-provider-specific logic. Infrastructure configuration (IaC) is provider-specific by necessity, but application deployment artefacts (containers) are portable.
Containers as the unit of deployment
Everything runs in containers. Docker is the default runtime. Containers provide consistency across environments, portability across hosting models, and a clean contract between “what the application needs” and “where the application runs.”
Why Docker over Kubernetes: Kubernetes solves problems that most organisations at this scale don’t have yet — multi-region orchestration, complex service meshes, sophisticated autoscaling. It also introduces significant operational complexity, a steep learning curve, and a larger attack surface. Docker Compose or managed container services (ECS, Cloud Run, Azure Container Apps) provide the orchestration needed for most workloads without the overhead. If and when Kubernetes becomes necessary, the containerised architecture makes migration straightforward.
When to reconsider: When the number of independently deployed services exceeds what can be managed with simpler orchestration, when multi-region active-active is required, or when autoscaling needs become genuinely complex. This is a Tier 3 initiative if it happens — it requires a PRD, a feasibility spike, and a go/no-go.
Infrastructure as Code, no exceptions
All infrastructure is defined in code, version-controlled, and deployed through automated pipelines. No manual changes to any environment. No clickops. If it isn’t in the repo, it doesn’t exist.
In practice: Terraform, Pulumi, or equivalent for infrastructure provisioning. Docker Compose or equivalent for service orchestration. Environment configuration managed through code, not portal settings. Secrets managed through dedicated tooling (Vault, AWS Secrets Manager, or equivalent), referenced by code but never stored in it.
Environments: as few as possible, as many as necessary
The default environment strategy is simple: staging + production. Feature flags do the heavy lifting that multiple environments traditionally served — isolating incomplete work, testing with specific user segments, and gradual rollout.
More complex projects may need additional environments — a dedicated integration environment for third-party systems, ephemeral preview environments for UI-heavy work, a performance testing environment with production-like data volumes. These are justified on a case-by-case basis, not mandated by default.
Environment parity: Staging must be architecturally identical to production. Smaller scale is fine — fewer instances, smaller databases — but the same services, same networking topology, same container images. “Works on staging” must mean “will work on production.”
Ephemeral environments: When warranted, these are created and destroyed automatically — spun up for a feature branch or a PR, torn down when merged or abandoned. The platform squad provides the tooling; product squads consume it.
Developer self-service through guardrails
Developers don’t raise tickets to get infrastructure. They use tooling that the platform squad builds and maintains. The self-service model:
- Developers can: spin up new services from templates, deploy to staging, trigger production deployments, view logs and metrics, manage feature flags, create and destroy ephemeral environments (where available).
- Developers cannot: modify networking or security group rules, change IAM policies, alter production database configuration, modify shared infrastructure. These require platform squad involvement.
- The boundary: if it affects only your service, self-service. If it affects shared infrastructure or other squads, go through platform.
The platform squad’s job is to make the right thing easy and the wrong thing hard. If developers routinely need platform squad help with everyday tasks, the self-service tooling has gaps.
You build it, you run it
Product squads own the operational health of their services. On-call responsibility, incident response, and production monitoring sit with the squad that builds the service, not with a centralised operations team. This creates a direct feedback loop: if your code causes incidents, you feel it.
The platform squad provides the tools, patterns, and shared infrastructure that make this viable. Product squads use those tools to monitor and operate their own services.
Observability over monitoring
Monitoring asks “is this metric within an acceptable range?” Observability asks “what is this system doing and why?” Both are needed, but observability is the higher-order goal.
In practice:
- Structured logging — every service logs in a consistent, structured format (JSON). Logs include correlation IDs for tracing requests across services. The platform squad provides logging libraries and patterns; product squads use them.
- Distributed tracing — requests are traceable across service boundaries. Essential for debugging in any system with more than one service.
- Metrics — application-level metrics (request rates, error rates, latency percentiles) and infrastructure metrics (CPU, memory, disk, network). Collected automatically where possible.
- Alerting — alerts on symptoms (error rate up, latency increased), not causes (CPU at 80%). Alert fatigue is actively managed — every alert should be actionable. If an alert fires and nobody needs to do anything, remove it.
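The structured-logging pattern above can be sketched in a few lines. This is illustrative only, not the platform-provided library: the field names (`service`, `correlation_id`) and the `orders` service are assumptions for the example.

```python
import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line with consistent fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            # Correlation ID is attached per-request so the same request can be
            # traced across service boundaries in the log aggregator.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })

logger = logging.getLogger("orders")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter(service="orders"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Attach the correlation ID via `extra`; middleware would normally do this.
logger.info("order created", extra={"correlation_id": str(uuid.uuid4())})
```

In a real service the correlation ID would be read from an incoming header (or generated at the edge) and propagated on every outbound call, which is what makes distributed tracing of a single request possible.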
Security is infrastructure
Security is not a separate concern bolted on after the architecture is defined. It is part of the infrastructure from day one.
In practice:
- Network segmentation by default — services only communicate with what they need to
- Secrets management through dedicated tooling, never in code or environment files
- Container images scanned for vulnerabilities in CI before deployment
- Principle of least privilege for all service accounts and IAM roles
- TLS everywhere, including internal service-to-service communication
- Dependency scanning automated and continuous
- Production database access restricted, audited, read-only for debugging
See the Security Practices section of the Engineering Process for the full security framework.
2. Roles & Responsibilities
Platform Squad — DevOps Engineers
One or two engineers within the platform squad with dedicated infrastructure focus. They are squad members, not a separate team — they participate in platform squad ceremonies, planning, and retrospectives.
Responsibilities:
- Infrastructure as Code — authoring and maintaining all IaC
- CI/CD pipeline design, build, and maintenance
- Container orchestration and deployment tooling
- Self-service tooling for product squads (service templates, deployment buttons, environment creation)
- Observability stack — logging, tracing, metrics, alerting infrastructure
- Secrets management infrastructure
- Cost monitoring and optimisation
- Disaster recovery planning and testing
- Infrastructure security — network config, IAM, container scanning
- Capacity planning and scaling strategy
- Cross-system integration concerns (shared databases, message queues, API gateways)
- On-call for infrastructure-level incidents
What they don’t do:
- Application-level monitoring configuration (product squads own their own dashboards and alerts)
- Deploying product squad code (self-service, automated)
- Making architectural decisions for product squads (advisory, not directive — Staff Engineer owns architecture)
Head of Engineering
Responsibilities (infrastructure-related):
- Infrastructure strategy — hosting model decisions (cloud, co-located, on-premises), provider selection, long-term capacity planning
- Budget ownership for infrastructure spend; approves significant cost changes (reserved capacity, new services, provider changes)
- Disaster recovery strategy — owns RTO/RPO decisions in collaboration with business stakeholders
- Compliance and regulatory requirements that affect infrastructure (data residency, audit logging, access controls)
- Approving overrides to standard practices where justified (e.g., backward-compatible migration requirement — see Database Migrations below)
- Hiring and development of DevOps-focused engineers within the platform squad
- Escalation point for cross-squad infrastructure disputes
- Quarterly infrastructure health review with platform squad (cost, reliability, capacity, maturity progression)
Staff Engineer
Responsibilities (infrastructure-related):
- Architecture decisions that affect infrastructure (service boundaries, data storage choices, communication patterns)
- Ensuring infrastructure patterns are consistent across squads
- Evaluating when infrastructure complexity needs to increase (e.g., the Kubernetes question)
- ADRs for significant infrastructure decisions
Lead Devs
Responsibilities (infrastructure-related):
- Ensuring their squad’s services follow infrastructure patterns (logging, health checks, configuration)
- Reviewing infrastructure-touching changes (database migrations, new service creation, networking changes)
- First escalation point for squad-level operational issues before platform squad involvement
Product Squad Developers
Responsibilities (infrastructure-related):
- Writing services that conform to infrastructure patterns (health check endpoints, structured logging, configuration via environment variables)
- Creating and maintaining application-level monitoring (dashboards, alerts relevant to their services)
- On-call for their squad’s services
- Using self-service tooling for deployments and environment management
- Flagging infrastructure needs to Lead Dev (who coordinates with platform squad if needed)
3. CI/CD Pipeline
Pipeline Stages
Commit → Build → Test → Scan → Deploy (staging) → Smoke → Deploy (production)
The entire pipeline is automated. Zero human intervention on the happy path. See also the Engineering Process Cheatsheet for a quick-reference view of how this fits into the broader development workflow.
Build:
- Container image built from Dockerfile
- Image tagged with commit SHA (not “latest”, not branch name)
- Image pushed to container registry
Test:
- Unit tests
- Integration tests (where applicable — against real dependencies in containers, not mocks)
- Linting and formatting checks
Scan:
- Static analysis (SAST)
- Dependency vulnerability scanning
- Container image vulnerability scanning
- Secrets detection (pre-commit hooks catch most; CI catches the rest)
- AI-assisted code review
Deploy to staging:
- Automated deployment of new image to staging
- Same deployment mechanism as production
Smoke tests:
- Automated tests against staging to verify core paths work
- Health check verification
- Critical integration points validated
Deploy to production:
- Same mechanism as staging deployment
- Behind feature flags for new functionality
- Percentage rollout for non-trivial changes (see Release Process in the Engineering Process)
Pipeline Principles
Fast feedback: The full pipeline should complete in under 15 minutes. If it’s slower, developers stop waiting for it and start batching changes — which defeats the purpose of trunk-based development. Invest in pipeline speed as a first-class concern.
Deterministic builds: Same commit always produces the same artefact. No external dependencies fetched at build time that aren’t pinned to specific versions. Container images are reproducible.
Pipeline as code: Pipeline configuration lives in the repo alongside the application code. Changes to the pipeline go through the same process as code changes.
No environment-specific builds: One container image, configured at runtime through environment variables and secrets. The image deployed to staging is byte-identical to the one deployed to production.
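A minimal sketch of the runtime-configuration side of this principle, read everything from the environment at startup so the same image runs everywhere. The variable names (`APP_ENV`, `DATABASE_URL`, `LOG_LEVEL`) are illustrative, not a mandated schema.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    env: str
    database_url: str
    log_level: str

def load_config(environ=os.environ) -> Config:
    """Build configuration entirely from the process environment.

    The container image stays byte-identical across staging and production;
    only the injected environment values differ.
    """
    return Config(
        env=environ.get("APP_ENV", "staging"),
        database_url=environ["DATABASE_URL"],  # required: fail fast at startup if missing
        log_level=environ.get("LOG_LEVEL", "INFO"),
    )
```

Failing fast on a missing required variable is deliberate: a service that starts with a silently wrong database URL is worse than one that refuses to start.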
4. Container Standards
Every Service Must Have
- Dockerfile — multi-stage build, minimal final image (distroless or Alpine-based where practical), non-root user
- Health check endpoint — /health or equivalent, returns service status and dependency status. Used by orchestration for readiness and liveness.
- Structured logging — JSON format, consistent fields (timestamp, level, service name, correlation ID, message). Use the platform-provided logging library.
- Configuration via environment variables — no config files baked into images, no hardcoded values. Twelve-factor app principles.
- Graceful shutdown — handle SIGTERM, drain connections, complete in-flight requests, exit cleanly. Container orchestration depends on this.
- Resource limits defined — memory and CPU limits specified in orchestration config. Services that don’t declare limits are a risk to everything else on the same host.
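The health check contract above can be sketched with nothing but the standard library. This is a sketch, not the standard implementation: the dependency probes (`check_database`, `check_queue`) are placeholders for real checks.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder dependency probes; a real service would ping its actual
# database, queue, and downstream services here.
def check_database() -> bool:
    return True

def check_queue() -> bool:
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404)
            self.end_headers()
            return
        deps = {"database": check_database(), "queue": check_queue()}
        healthy = all(deps.values())
        body = json.dumps({
            "status": "ok" if healthy else "degraded",
            "dependencies": deps,
        }).encode()
        # 200 tells orchestration to keep routing traffic; 503 tells it to stop.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Suppress default access logs in this sketch; a real service would
        # route them through its structured logger instead.
        pass
```

Running it under `HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()` gives the orchestrator a readiness/liveness target; reporting dependency status alongside overall status is what makes the endpoint useful for diagnosis, not just routing.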
Container Image Hygiene
- Base images pinned to specific digests, not floating tags
- Base images updated regularly (automated via Dependabot/Renovate for Dockerfiles)
- Images scanned in CI — critical and high vulnerabilities block deployment
- No secrets, credentials, or sensitive data in images (build args for secrets are also not acceptable)
- Image size minimised — smaller images deploy faster and have a smaller attack surface
Local Development
The local development environment must mirror production as closely as practical. Docker Compose (or equivalent) for running the service and its dependencies locally. The platform squad maintains a standard local development setup. Target: new developer goes from clone to running locally in under 30 minutes.
5. Deployment Practices
Deployment is not release
Deployment puts code in production. Release makes functionality visible to users. These are decoupled through feature flags. Code is deployed continuously; features are released when Product decides.
Zero-downtime deployments
All deployments are zero-downtime. Rolling deployments or blue/green as the default pattern — at least one instance of the service is always available during deployment. The deployment strategy is defined in the orchestration config, not improvised per deployment.
Database migrations
Database migrations deserve special care because they are the hardest thing to roll back.
Rules:
- Migrations are backward compatible by default — old code must work with the new schema, and new code must work with the old schema during the transition window. This is what makes zero-downtime deployment and safe rollback possible.
- Deploy the migration separately from the code that uses it. Add the column first, deploy the code that writes to it second, remove the old column third.
- Large migrations use online migration tools (pt-online-schema-change, pgroll, or equivalent) to avoid table locking. Platform squad provides tooling and guidance.
- Migration scripts are reviewed by Lead Dev or Staff Engineer before execution (one of the PR-gate exceptions)
- Migrations are tested against a production-like dataset before production execution
- Every migration has a documented rollback plan
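The expand/contract sequence these rules describe can be shown end to end. This is illustrative only: sqlite3 stands in for the production database, and the `customers` table and `full_name` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, first TEXT, last TEXT)")
conn.execute("INSERT INTO customers (first, last) VALUES ('Ada', 'Lovelace')")

# Step 1 (expand): add the new column as nullable. Old code that never
# mentions full_name keeps working unchanged against the new schema.
conn.execute("ALTER TABLE customers ADD COLUMN full_name TEXT")

# Step 2 (dual-write window): new code writes both representations, and a
# backfill fills historical rows. Old readers still see first/last, so a
# code rollback needs no database rollback.
conn.execute(
    "UPDATE customers SET full_name = first || ' ' || last WHERE full_name IS NULL"
)

# Step 3 (contract) happens in a later deployment, only once no running code
# reads first/last any more -- e.g. dropping the old columns.

row = conn.execute("SELECT first, last, full_name FROM customers").fetchone()
```

The same three-phase shape applies regardless of database engine; what changes at scale is the tooling used for step 1 and step 3 (the online migration tools mentioned above).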
Overriding backward compatibility: For large structural migrations — splitting tables, changing fundamental data models, restructuring relationships — full bidirectional compatibility can cost more in complexity and risk than the direct approach. Head of Eng can approve an override of the backward-compatibility requirement. The override carries conditions:
- A documented rollback plan that accounts for the incompatibility (this may mean “restore from backup taken immediately before migration” rather than “revert the migration”)
- A maintenance window if the migration cannot be zero-downtime
- A production database backup verified immediately before execution
- The team explicitly accepts that rollback is now harder and slower than the default case
- The decision is recorded as an ADR
The point is not to block large migrations. It is to ensure that when the safety net is removed, everyone knows what they’re giving up and has a plan for the scenario where things go wrong.
Rollback
- Feature flag toggle: seconds. First line of defence. Always available for flag-controlled features.
- Redeploy previous image: minutes. Container registry retains previous images. Rollback is deploying the last known good image — same mechanism as a forward deployment.
- Database rollback: depends on migration complexity. This is why backward-compatible migrations matter — if the old code works with the new schema, a code rollback doesn’t require a database rollback.
Rollback criteria are defined before deployment, not during an incident.
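Because rollback is just a forward deployment of the previous image, the core of the tooling reduces to selecting the last known good tag. A minimal sketch, with hypothetical deployment-history records where `sha` is the commit-SHA image tag:

```python
def last_known_good(history: list) -> str:
    """Return the most recent image tag that passed its post-deploy checks,
    skipping the currently deployed (failing) release."""
    for deploy in reversed(history[:-1]):  # exclude the current deployment
        if deploy["healthy"]:
            return deploy["sha"]
    raise RuntimeError("no healthy image to roll back to")

# Hypothetical history, oldest first; the last entry is the failing deploy.
history = [
    {"sha": "a1b2c3d", "healthy": True},
    {"sha": "d4e5f6a", "healthy": True},
    {"sha": "0badbad", "healthy": False},
]
```

Tagging images with commit SHAs (never “latest”) is what makes this lookup unambiguous: the tag identifies exactly one build of exactly one commit.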
Fix-Forward
Rollback is the default response to a production issue, but it is not always the right one. In some cases, rolling back destroys the conditions needed to diagnose and permanently fix the problem — and the issue simply reappears next time the code is deployed.
Fix-forward means deliberately keeping the broken code in production while developing and deploying a fix, rather than rolling back first. It is a considered exception, not a default, and it requires explicit authorisation.
When to consider fix-forward:
- The issue is hard or impossible to reproduce outside production. It’s triggered by specific production data, user behaviour, race conditions, or scale that can’t be replicated in staging. Rolling back means losing the conditions you need to understand and fix the problem.
- AND the impact is contained. The incident affects a small number of users, a non-critical flow, or has a viable workaround. The business can tolerate the degraded state for the time it takes to develop a fix.
When fix-forward is not appropriate:
- The service is down or severely degraded for a significant user population
- The impact assessment is uncertain (“we think it only affects a few users” is not sufficient)
- There is a data integrity risk (corruption, loss, or inconsistency that worsens over time)
- The cause is completely unknown and the blast radius could grow
Authorisation:
| Situation | Who Authorises |
|---|---|
| SEV2/SEV3, impact clearly contained and understood | Lead Dev (inform Head of Eng) |
| SEV1 or uncertain impact | Head of Eng or CTO |
| Not reachable in time | Lead Dev makes initial call, escalates as soon as possible |
Severity levels (SEV1/SEV2/SEV3) are defined in the Incident Management section of the Engineering Process.
The Lead Dev can make the initial decision to fix forward rather than blocking on reaching a more senior decision-maker. Waiting 45 minutes for authorisation while an incident is in progress defeats the purpose. But the decision is escalated and reviewed — if Head of Eng or CTO would have made a different call, that’s a learning for next time, not a disciplinary issue.
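The authorisation table above is small enough to encode directly, which is one way to keep it unambiguous in incident tooling. This is a direct transcription of the table, not any real tool; the return strings mirror the document.

```python
def fix_forward_authoriser(
    severity: str,
    impact_contained: bool,
    seniors_reachable: bool = True,
) -> str:
    """Map an incident situation to who authorises a fix-forward decision."""
    if not seniors_reachable:
        # Lead Dev makes the initial call rather than blocking the incident,
        # then escalates as soon as possible.
        return "Lead Dev (initial call, escalate ASAP)"
    if severity == "SEV1" or not impact_contained:
        return "Head of Eng or CTO"
    return "Lead Dev (inform Head of Eng)"
```

Note the ordering: uncertain impact is treated like SEV1, because “we think it only affects a few users” is explicitly not sufficient.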
Conditions when fix-forward is authorised:
- Time box agreed upfront. “We will attempt to fix forward. If a fix is not deployed within [X hours], we roll back regardless.” The time box is set at the point the decision is made, not discovered later. Typical range: 2–4 hours for a SEV2, shorter for anything user-visible.
- Actively capture everything. The whole point of fix-forward is that the production conditions are hard to reproduce. While they exist, instrument heavily — detailed logging, request captures, database state snapshots, user reports. Don’t waste the opportunity.
- Monitor the blast radius continuously. If the impact grows beyond the initial assessment, revert to rollback immediately. The assumption of contained impact must be validated throughout, not assumed once.
- Communicate. Affected users or stakeholders are informed if appropriate. The squad’s PM knows. Head of Eng knows.
After the fix is deployed: the incident still gets a full review. The review should assess whether fix-forward was the right call, whether the time box was respected, and whether the reproduction difficulty should be addressed structurally (better staging data, better test coverage, better observability).
6. Observability Stack
Components
The platform squad builds and maintains the observability stack. Product squads consume it.
| Component | Purpose | Platform Squad Owns | Product Squad Owns |
|---|---|---|---|
| Log aggregation | Centralised, searchable logs | Infrastructure, ingestion, retention | Log content, structured fields, correlation IDs |
| Distributed tracing | Request flow across services | Tracing infrastructure, sampling config | Instrumenting their services |
| Metrics collection | Time-series application and infra metrics | Collection infrastructure, dashboarding platform | Application metrics, squad dashboards |
| Alerting | Notification when things go wrong | Alerting infrastructure, infrastructure alerts | Application alerts for their services |
| Error tracking | Aggregate and deduplicate errors | Tooling provision | Configuration, triage, resolution |
Alerting Principles
- Alert on symptoms, not causes. “Error rate exceeded 1%” not “CPU at 80%.”
- Every alert must have a clear owner and a documented response (even if the response is “investigate and escalate if needed”)
- Alert fatigue is a reliability risk. Review alert noise quarterly. If an alert fires more than once a week and never requires action, remove or adjust it.
- Severity levels match incident severity: SEV1 alerts page immediately, SEV2 alerts notify within 30 minutes, SEV3 alerts go to a dashboard
- Alerts include enough context to start investigating — which service, what threshold was breached, a link to relevant dashboards and logs
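A symptom-based alert check, with the required context attached, might look like the sketch below. The 1% threshold matches the example above; the dashboard and log URLs are invented placeholders, not real endpoints.

```python
from typing import Optional

def error_rate_alert(
    errors: int,
    requests: int,
    service: str,
    threshold: float = 0.01,
) -> Optional[dict]:
    """Fire on the error *rate*, a symptom users feel, rather than on a
    cause-side metric like CPU. Returns None when no alert should fire."""
    if requests == 0:
        return None
    rate = errors / requests
    if rate <= threshold:
        return None
    return {
        "service": service,
        "summary": f"error rate {rate:.1%} exceeded {threshold:.0%}",
        # Hypothetical links; the point is that the alert carries enough
        # context to start investigating immediately.
        "dashboard": f"https://grafana.example.internal/d/{service}",
        "logs": f"https://logs.example.internal/?service={service}",
    }
```

Returning `None` below the threshold is the alert-fatigue principle in miniature: if nothing needs doing, nothing fires.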
7. Disaster Recovery & Business Continuity
Backup Strategy
- Databases: automated backups with defined retention (minimum 30 days). Point-in-time recovery capability for production databases.
- Blob/file storage: replicated or backed up depending on criticality.
- Infrastructure state: IaC is the backup. If the environment is destroyed, it can be recreated from code.
- Secrets: backed up separately from the infrastructure. Recovery procedure documented and tested.
Recovery Testing
Backups that aren’t tested are not backups. Recovery must be tested periodically — at minimum quarterly for databases, annually for full environment recreation.
Recovery time objectives (RTO) and recovery point objectives (RPO) should be defined per service based on business criticality. Not everything needs the same recovery guarantee. Platform squad documents these in collaboration with product PMs and Head of Eng.
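The shape of a restore drill, take the backup, restore it somewhere else, and assert the data survived, can be sketched with sqlite3 standing in for the production database. The `accounts` table is hypothetical; real drills would restore into an isolated environment and run fuller integrity checks.

```python
import sqlite3

# "Production": create some state worth protecting.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
source.execute("INSERT INTO accounts (balance) VALUES (100), (250)")
source.commit()

# Take the backup into a separate database...
restore_target = sqlite3.connect(":memory:")
source.backup(restore_target)

# ...then verify the restored copy actually contains what production held.
# A backup that is never restored and checked is not a backup.
restored_total = restore_target.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]
```

The assertion step is the whole point of the drill: it converts “we have backups” into “we have verified we can recover”.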
Runbooks
Every production service has a runbook. The runbook contains:
- Service overview (what it does, who owns it, dependencies)
- Common failure scenarios and diagnostic steps
- Remediation actions for known issues
- Escalation path
- Contact information for external dependencies
Runbooks are maintained by the owning squad and reviewed after every incident that reveals a gap. See Documentation section of the Engineering Process for runbook ownership details.
8. Cost Management
Cloud infrastructure costs are visible and actively managed. They should not be a surprise on a monthly bill.
Principles
- Visibility: cost data is accessible to Lead Devs and PMs, not just the platform squad. If a squad’s service costs spike, the squad should see it.
- Allocation: costs are attributed to squads/services where possible. This isn’t for billing — it’s for awareness. A squad that knows its service costs $X/month makes better architectural decisions.
- Right-sizing: container resource limits are reviewed periodically. Over-provisioning is common and expensive. Under-provisioning causes performance problems. Platform squad provides tooling to identify both.
- Environment costs: non-production environments are sized down and, where possible, shut down outside working hours.
- Reserved capacity: for predictable workloads, reserved instances or committed use discounts are evaluated by the platform squad and approved by Head of Eng.
Cost reviews
Platform squad includes a cost summary in their monthly reporting. Significant cost changes (>20% month-over-month) are investigated and explained. Cost optimisation is part of the engineering allocation, not a separate initiative.
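The >20% month-over-month check is simple enough to automate as a first pass. A minimal sketch, where service names and figures are invented and `costs_by_month` holds each service’s recent monthly spend oldest-first:

```python
def flag_cost_changes(costs_by_month: dict, threshold: float = 0.20) -> list:
    """Return services whose latest monthly cost moved more than `threshold`
    relative to the previous month, so the change can be investigated."""
    flagged = []
    for service, monthly in costs_by_month.items():
        previous, latest = monthly[-2], monthly[-1]
        # Flag movement in either direction: an unexplained drop can signal a
        # broken workload just as an unexplained rise signals waste.
        if previous > 0 and abs(latest - previous) / previous > threshold:
            flagged.append(service)
    return flagged
```

This is the “investigated and explained” trigger, not the explanation itself; the AI-assisted cost anomaly detection described below would catch subtler patterns than a single month-over-month ratio.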
9. AI in DevOps
AI augments infrastructure and operations work the same way it augments development. See the AI Use Case Catalogue for the full inventory of AI applications across the engineering organisation.
| Application | How | Owner |
|---|---|---|
| Infrastructure code generation | AI generates Terraform/IaC from requirements, following established patterns | DevOps engineer reviews |
| Incident diagnostics | AI analyses logs, identifies related deployments, suggests similar past incidents | On-call engineer uses as input |
| Pipeline optimisation | AI analyses pipeline execution times and suggests parallelisation or caching improvements | Platform squad |
| Cost anomaly detection | AI identifies unusual cost patterns before they appear on the monthly bill | Platform squad |
| Security scanning | AI-assisted review of IaC for security misconfigurations beyond rule-based tools | Automated in CI |
| Capacity forecasting | AI projects resource needs based on traffic trends and planned releases | Platform squad + Lead Devs |
| Runbook generation | AI drafts runbooks from service code, infrastructure config, and monitoring setup | Owning squad reviews |
| Migration analysis | AI analyses database migration scripts for locking, backward compatibility, and data safety | Dev + Lead Dev review |
10. Maturity Progression
Not everything needs to be in place on day one. Infrastructure capability grows with the organisation.
Foundation (must have before first production deployment)
- Containerised services with Dockerfiles
- CI/CD pipeline: build, test, scan, deploy
- Infrastructure as Code for all environments
- Centralised logging
- Health check endpoints on all services
- Secrets management (not in code)
- Automated backups for databases
- Basic alerting (service down, error rate spike)
- Staging + production environments
Established (target within first quarter)
- Structured logging with correlation IDs
- Distributed tracing
- Self-service deployment for product squads
- Service templates (new service from template in under an hour)
- Feature flag infrastructure
- Cost visibility by service
- Runbooks for all production services
- Automated rollback criteria
- Container image scanning in CI
- Ephemeral environments (if needed)
Mature (ongoing investment)
- AI-assisted incident diagnostics
- Automated cost anomaly detection
- Capacity forecasting
- Chaos engineering / resilience testing
- Cross-region disaster recovery (if business requires)
- Pipeline optimisation (sub-10-minute feedback)
- Automated dependency updates with auto-merge for passing patches
- Production traffic replay for testing