
DevOps Principles

Infrastructure, deployment, and operational practices. Cloud-default hosting, Docker containerisation, IaC, CI/CD pipeline, deployment practices, observability, disaster recovery, and more.

Overview

This document defines the principles, responsibilities, and practices for infrastructure, deployment, and operational concerns. It is deliberately cloud-agnostic and technology-flexible — the principles hold whether the infrastructure is cloud-hosted, co-located, or on-premises, though cloud is the default starting position.

The operating model does not include a dedicated infrastructure team. Infrastructure ownership sits within the platform squad, with one or two DevOps-focused engineers who specialise in infrastructure concerns, cross-system integration, and operational tooling. Product squad developers interact with infrastructure through self-service tooling — they press buttons, they don’t build the buttons.


1. Principles

Default to cloud, but don’t assume it

Cloud hosting is the default. It offers the fastest path to production, the lowest operational overhead at small scale, and the most flexibility. But some use cases require co-located or on-premises infrastructure — regulatory requirements, data sovereignty, latency-sensitive workloads, cost optimisation at scale, or client constraints. The architecture must not be so tightly coupled to a single cloud provider that these options become impossible.

In practice: Abstract infrastructure behind well-defined interfaces. Application code should not contain cloud-provider-specific logic. Infrastructure configuration (IaC) is provider-specific by necessity, but application deployment artefacts (containers) are portable.
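A minimal sketch of what this abstraction looks like in application code. The `BlobStore` protocol, `LocalBlobStore`, and `archive_report` names are illustrative, not prescribed — the point is that only the implementations know about a provider:

```python
from __future__ import annotations

from pathlib import Path
from typing import Protocol


class BlobStore(Protocol):
    """Storage contract the application codes against -- no provider types leak in."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class LocalBlobStore:
    """Filesystem-backed implementation, useful for local development and tests."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


# An S3/Azure/GCS implementation would expose the same two methods; only its
# constructor would import provider SDKs and read provider configuration.
def archive_report(store: BlobStore, report_id: str, body: bytes) -> str:
    """Application logic depends on the protocol, not on any cloud SDK."""
    key = f"reports/{report_id}.json"
    store.put(key, body)
    return key
```

Swapping providers then means swapping the object passed in at startup, not touching application code.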

Containers as the unit of deployment

Everything runs in containers. Docker is the default runtime. Containers provide consistency across environments, portability across hosting models, and a clean contract between “what the application needs” and “where the application runs.”

Why Docker over Kubernetes: Kubernetes solves problems that most organisations at this scale don’t have yet — multi-region orchestration, complex service meshes, sophisticated autoscaling. It also introduces significant operational complexity, a steep learning curve, and a larger attack surface. Docker Compose or managed container services (ECS, Cloud Run, Azure Container Apps) provide the orchestration needed for most workloads without the overhead. If and when Kubernetes becomes necessary, the containerised architecture makes migration straightforward.

When to reconsider: When the number of independently deployed services exceeds what can be managed with simpler orchestration, when multi-region active-active is required, or when autoscaling needs become genuinely complex. This is a Tier 3 initiative if it happens — it requires a PRD, a feasibility spike, and a go/no-go.

Infrastructure as Code, no exceptions

All infrastructure is defined in code, version-controlled, and deployed through automated pipelines. No manual changes to any environment. No clickops. If it isn’t in the repo, it doesn’t exist.

In practice: Terraform, Pulumi, or equivalent for infrastructure provisioning. Docker Compose or equivalent for service orchestration. Environment configuration managed through code, not portal settings. Secrets managed through dedicated tooling (Vault, AWS Secrets Manager, or equivalent), referenced by code but never stored in it.

Environments: as few as possible, as many as necessary

The default environment strategy is simple: staging + production. Feature flags do the heavy lifting that additional environments traditionally provided — isolating incomplete work, testing with specific user segments, and enabling gradual rollout.

More complex projects may need additional environments — a dedicated integration environment for third-party systems, ephemeral preview environments for UI-heavy work, a performance testing environment with production-like data volumes. These are justified on a case-by-case basis, not mandated by default.

Environment parity: Staging must be architecturally identical to production. Smaller scale is fine — fewer instances, smaller databases — but the same services, same networking topology, same container images. “Works on staging” must mean “will work on production.”

Ephemeral environments: When warranted, these are created and destroyed automatically — spun up for a feature branch or a PR, torn down when merged or abandoned. The platform squad provides the tooling; product squads consume it.

Developer self-service through guardrails

Developers don’t raise tickets to get infrastructure. They use tooling that the platform squad builds and maintains. The self-service model:

  • Developers can: spin up new services from templates, deploy to staging, trigger production deployments, view logs and metrics, manage feature flags, create and destroy ephemeral environments (where available).
  • Developers cannot: modify networking or security group rules, change IAM policies, alter production database configuration, modify shared infrastructure. These require platform squad involvement.
  • The boundary: if it affects only your service, self-service. If it affects shared infrastructure or other squads, go through platform.

The platform squad’s job is to make the right thing easy and the wrong thing hard. If developers keep coming to the platform squad for help with routine tasks, the self-service tooling has gaps.

You build it, you run it

Product squads own the operational health of their services. On-call responsibility, incident response, and production monitoring sit with the squad that builds the service, not with a centralised operations team. This creates a direct feedback loop: if your code causes incidents, you feel it.

The platform squad provides the tools, patterns, and shared infrastructure that make this viable. Product squads use those tools to monitor and operate their own services.

Observability over monitoring

Monitoring asks “is this metric within an acceptable range?” Observability asks “what is this system doing and why?” Both are needed, but observability is the higher-order goal.

In practice:

  • Structured logging — every service logs in a consistent, structured format (JSON). Logs include correlation IDs for tracing requests across services. The platform squad provides logging libraries and patterns; product squads use them.
  • Distributed tracing — requests are traceable across service boundaries. Essential for debugging in any system with more than one service.
  • Metrics — application-level metrics (request rates, error rates, latency percentiles) and infrastructure metrics (CPU, memory, disk, network). Collected automatically where possible.
  • Alerting — alerts on symptoms (error rate up, latency increased), not causes (CPU at 80%). Alert fatigue is actively managed — every alert should be actionable. If an alert fires and nobody needs to do anything, remove it.
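As an illustration of the structured-logging pattern above, a minimal sketch using only the Python standard library. The service name `checkout` and the exact field set are hypothetical — in practice the platform-provided logging library defines both:

```python
import contextvars
import json
import logging
import sys
import time

# The correlation ID travels with the request via a context variable, so every
# log line emitted while handling that request carries the same ID.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with a consistent field set."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        })


def build_logger(name: str) -> logging.Logger:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    return logger
```

Middleware at the service edge sets `correlation_id` from the incoming request header (or generates one), and everything logged downstream is automatically traceable.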

Security is infrastructure

Security is not a separate concern bolted on after the architecture is defined. It is part of the infrastructure from day one.

In practice:

  • Network segmentation by default — services only communicate with what they need to
  • Secrets management through dedicated tooling, never in code or environment files
  • Container images scanned for vulnerabilities in CI before deployment
  • Principle of least privilege for all service accounts and IAM roles
  • TLS everywhere, including internal service-to-service communication
  • Dependency scanning automated and continuous
  • Production database access restricted, audited, read-only for debugging

See the Security Practices section of the Engineering Process for the full security framework.


2. Roles & Responsibilities

Platform Squad — DevOps Engineers

One or two engineers within the platform squad with dedicated infrastructure focus. They are squad members, not a separate team — they participate in platform squad ceremonies, planning, and retrospectives.

Responsibilities:

  • Infrastructure as Code — authoring and maintaining all IaC
  • CI/CD pipeline design, build, and maintenance
  • Container orchestration and deployment tooling
  • Self-service tooling for product squads (service templates, deployment buttons, environment creation)
  • Observability stack — logging, tracing, metrics, alerting infrastructure
  • Secrets management infrastructure
  • Cost monitoring and optimisation
  • Disaster recovery planning and testing
  • Infrastructure security — network config, IAM, container scanning
  • Capacity planning and scaling strategy
  • Cross-system integration concerns (shared databases, message queues, API gateways)
  • On-call for infrastructure-level incidents

What they don’t do:

  • Application-level monitoring configuration (product squads own their own dashboards and alerts)
  • Deploying product squad code (self-service, automated)
  • Making architectural decisions for product squads (advisory, not directive — Staff Engineer owns architecture)

Head of Engineering

Responsibilities (infrastructure-related):

  • Infrastructure strategy — hosting model decisions (cloud, co-located, on-premises), provider selection, long-term capacity planning
  • Budget ownership for infrastructure spend; approves significant cost changes (reserved capacity, new services, provider changes)
  • Disaster recovery strategy — owns RTO/RPO decisions in collaboration with business stakeholders
  • Compliance and regulatory requirements that affect infrastructure (data residency, audit logging, access controls)
  • Approving overrides to standard practices where justified (e.g., backward-compatible migration requirement — see Database Migrations below)
  • Hiring and development of DevOps-focused engineers within the platform squad
  • Escalation point for cross-squad infrastructure disputes
  • Quarterly infrastructure health review with platform squad (cost, reliability, capacity, maturity progression)

Staff Engineer

Responsibilities (infrastructure-related):

  • Architecture decisions that affect infrastructure (service boundaries, data storage choices, communication patterns)
  • Ensuring infrastructure patterns are consistent across squads
  • Evaluating when infrastructure complexity needs to increase (e.g., the Kubernetes question)
  • ADRs for significant infrastructure decisions

Lead Devs

Responsibilities (infrastructure-related):

  • Ensuring their squad’s services follow infrastructure patterns (logging, health checks, configuration)
  • Reviewing infrastructure-touching changes (database migrations, new service creation, networking changes)
  • First escalation point for squad-level operational issues before platform squad involvement

Product Squad Developers

Responsibilities (infrastructure-related):

  • Writing services that conform to infrastructure patterns (health check endpoints, structured logging, configuration via environment variables)
  • Creating and maintaining application-level monitoring (dashboards, alerts relevant to their services)
  • On-call for their squad’s services
  • Using self-service tooling for deployments and environment management
  • Flagging infrastructure needs to Lead Dev (who coordinates with platform squad if needed)

3. CI/CD Pipeline

Pipeline Stages

Commit → Build → Test → Scan → Deploy (staging) → Smoke → Deploy (production)

The entire pipeline is automated. Zero human intervention on the happy path. See also the Engineering Process Cheatsheet for a quick-reference view of how this fits into the broader development workflow.

Build:

  • Container image built from Dockerfile
  • Image tagged with commit SHA (not “latest”, not branch name)
  • Image pushed to container registry

Test:

  • Unit tests
  • Integration tests (where applicable — against real dependencies in containers, not mocks)
  • Linting and formatting checks

Scan:

  • Static analysis (SAST)
  • Dependency vulnerability scanning
  • Container image vulnerability scanning
  • Secrets detection (pre-commit hooks catch most; CI catches the rest)
  • AI-assisted code review

Deploy to staging:

  • Automated deployment of new image to staging
  • Same deployment mechanism as production

Smoke tests:

  • Automated tests against staging to verify core paths work
  • Health check verification
  • Critical integration points validated

Deploy to production:

  • Same mechanism as staging deployment
  • Behind feature flags for new functionality
  • Percentage rollout for non-trivial changes (see Release Process in the Engineering Process)

Pipeline Principles

Fast feedback: The full pipeline should complete in under 15 minutes. If it’s slower, developers stop waiting for it and start batching changes — which defeats the purpose of trunk-based development. Invest in pipeline speed as a first-class concern.

Deterministic builds: Same commit always produces the same artefact. No external dependencies fetched at build time that aren’t pinned to specific versions. Container images are reproducible.

Pipeline as code: Pipeline configuration lives in the repo alongside the application code. Changes to the pipeline go through the same process as code changes.

No environment-specific builds: One container image, configured at runtime through environment variables and secrets. The image deployed to staging is byte-identical to the one deployed to production.
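The runtime-configuration half of this principle can be sketched as follows — the `Settings` class, the variable names, and the defaults are illustrative assumptions, not a mandated schema:

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """All environment-specific values arrive at runtime; the container image
    itself is byte-identical across staging and production."""

    database_url: str
    feature_flags_url: str
    log_level: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] | None = None) -> Settings:
        env = os.environ if env is None else env
        return cls(
            database_url=env["DATABASE_URL"],            # required: fail fast at startup
            feature_flags_url=env["FEATURE_FLAGS_URL"],  # required
            log_level=env.get("LOG_LEVEL", "INFO"),      # optional, sensible default
        )
```

Required values raise immediately on a misconfigured environment, which surfaces the problem at deploy time rather than on the first request.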


4. Container Standards

Every Service Must Have

  • Dockerfile — multi-stage build, minimal final image (distroless or Alpine-based where practical), non-root user
  • Health check endpoint — /health or equivalent; returns service status and dependency status. Used by orchestration for readiness and liveness.
  • Structured logging — JSON format, consistent fields (timestamp, level, service name, correlation ID, message). Use the platform-provided logging library.
  • Configuration via environment variables — no config files baked into images, no hardcoded values. Twelve-factor app principles.
  • Graceful shutdown — handle SIGTERM, drain connections, complete in-flight requests, exit cleanly. Container orchestration depends on this.
  • Resource limits defined — memory and CPU limits specified in orchestration config. Services that don’t declare limits are a risk to everything else on the same host.
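A minimal sketch of the health check and graceful shutdown requirements, using only the Python standard library. The endpoint shape, payload fields, and port are illustrative — real services would use their framework's equivalents:

```python
import json
import signal
import threading
from http.server import BaseHTTPRequestHandler

# Flipped by the SIGTERM handler: stop reporting healthy so the orchestrator
# drains traffic away, while in-flight requests are allowed to complete.
shutting_down = threading.Event()


def health_payload(deps_ok: bool) -> dict:
    """Body served at /health -- orchestration uses it for readiness/liveness."""
    return {
        "status": "ok" if deps_ok and not shutting_down.is_set() else "degraded",
        "dependencies": {"database": "ok" if deps_ok else "unreachable"},
    }


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps(health_payload(deps_ok=True)).encode()
            self.send_response(503 if shutting_down.is_set() else 200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):  # suppress default logging; real services emit structured JSON
        pass


def handle_sigterm(signum, frame):
    """On SIGTERM: stop accepting new work, finish in-flight requests, exit cleanly."""
    shutting_down.set()


signal.signal(signal.SIGTERM, handle_sigterm)
# To serve: HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

Failing the health check before exiting is what lets rolling deployments drain an instance without dropping requests.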

Container Image Hygiene

  • Base images pinned to specific digests, not floating tags
  • Base images updated regularly (automated via Dependabot/Renovate for Dockerfiles)
  • Images scanned in CI — critical and high vulnerabilities block deployment
  • No secrets, credentials, or sensitive data in images (build args for secrets are also not acceptable)
  • Image size minimised — smaller images deploy faster and have a smaller attack surface

Local Development

The local development environment must mirror production as closely as practical. Docker Compose (or equivalent) for running the service and its dependencies locally. The platform squad maintains a standard local development setup. Target: new developer goes from clone to running locally in under 30 minutes.


5. Deployment Practices

Deployment is not release

Deployment puts code in production. Release makes functionality visible to users. These are decoupled through feature flags. Code is deployed continuously; features are released when Product decides.

Zero-downtime deployments

All deployments are zero-downtime. Rolling deployments or blue/green as the default pattern — at least one instance of the service is always available during deployment. The deployment strategy is defined in the orchestration config, not improvised per deployment.

Database migrations

Database migrations deserve special care because they are the hardest thing to roll back.

Rules:

  • Migrations are backward compatible by default — old code must work with the new schema, and new code must work with the old schema during the transition window. This is what makes zero-downtime deployment and safe rollback possible.
  • Deploy the migration separately from the code that uses it. Add the column first, deploy the code that writes to it second, remove the old column third.
  • Large migrations use online migration tools (pt-online-schema-change, pgroll, or equivalent) to avoid table locking. Platform squad provides tooling and guidance.
  • Migration scripts are reviewed by Lead Dev or Staff Engineer before execution (one of the PR-gate exceptions)
  • Migrations are tested against a production-like dataset before production execution
  • Every migration has a documented rollback plan
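The add-the-column-first sequencing above can be illustrated with a small sketch against SQLite. The table and column names are hypothetical, each step is a separate deployment in practice, and step 3 runs only once no deployed code reads the old columns:

```python
import sqlite3

# Step 1 -- expand: add the new column. Old code ignores it; new code can
# start writing to it. Backward compatible in both directions.
EXPAND = "ALTER TABLE users ADD COLUMN full_name TEXT"

# Step 2 -- backfill existing rows (new code dual-writes in the meantime).
BACKFILL = (
    "UPDATE users SET full_name = first_name || ' ' || last_name "
    "WHERE full_name IS NULL"
)

# Step 3 -- contract: remove the old columns only after the transition window.
# (Not executed here; SQLite supports DROP COLUMN from 3.35, other engines
# have their own equivalents.)
CONTRACT = [
    "ALTER TABLE users DROP COLUMN first_name",
    "ALTER TABLE users DROP COLUMN last_name",
]


def run_expand_and_backfill(conn: sqlite3.Connection) -> None:
    conn.execute(EXPAND)
    conn.execute(BACKFILL)
    conn.commit()
```

Because steps 1 and 2 leave the old columns untouched, rolling the code back during the window requires no database rollback at all.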

Overriding backward compatibility: For large structural migrations — splitting tables, changing fundamental data models, restructuring relationships — full bidirectional compatibility can cost more in complexity and risk than the direct approach. Head of Eng can approve an override of the backward-compatibility requirement. The override carries conditions:

  • A documented rollback plan that accounts for the incompatibility (this may mean “restore from backup taken immediately before migration” rather than “revert the migration”)
  • A maintenance window if the migration cannot be zero-downtime
  • A production database backup verified immediately before execution
  • The team explicitly accepts that rollback is now harder and slower than the default case
  • The decision is recorded as an ADR

The point is not to block large migrations. It is to ensure that when the safety net is removed, everyone knows what they’re giving up and has a plan for the scenario where things go wrong.

Rollback

  • Feature flag toggle: seconds. First line of defence. Always available for flag-controlled features.
  • Redeploy previous image: minutes. Container registry retains previous images. Rollback is deploying the last known good image — same mechanism as a forward deployment.
  • Database rollback: depends on migration complexity. This is why backward-compatible migrations matter — if the old code works with the new schema, a code rollback doesn’t require a database rollback.

Rollback criteria are defined before deployment, not during an incident.

Fix-Forward

Rollback is the default response to a production issue, but it is not always the right one. In some cases, rolling back destroys the conditions needed to diagnose and permanently fix the problem — and the issue simply reappears next time the code is deployed.

Fix-forward means deliberately keeping the broken code in production while developing and deploying a fix, rather than rolling back first. It is a considered exception, not a default, and it requires explicit authorisation.

When to consider fix-forward:

  • The issue is hard or impossible to reproduce outside production. It’s triggered by specific production data, user behaviour, race conditions, or scale that can’t be replicated in staging. Rolling back means losing the conditions you need to understand and fix the problem.
  • AND the impact is contained. The incident affects a small number of users, a non-critical flow, or has a viable workaround. The business can tolerate the degraded state for the time it takes to develop a fix.

When fix-forward is not appropriate:

  • The service is down or severely degraded for a significant user population
  • The impact assessment is uncertain (“we think it only affects a few users” is not sufficient)
  • There is a data integrity risk (corruption, loss, or inconsistency that worsens over time)
  • The cause is completely unknown and the blast radius could grow

Authorisation:

| Situation | Who Authorises |
| --- | --- |
| SEV2/SEV3, impact clearly contained and understood | Lead Dev (inform Head of Eng) |
| SEV1 or uncertain impact | Head of Eng or CTO |
| Not reachable in time | Lead Dev makes initial call, escalates as soon as possible |

Severity levels (SEV1/SEV2/SEV3) are defined in the Incident Management section of the Engineering Process.

The Lead Dev can make the initial decision to fix forward rather than blocking on reaching a more senior decision-maker. Waiting 45 minutes for authorisation while an incident is in progress defeats the purpose. But the decision is escalated and reviewed — if Head of Eng or CTO would have made a different call, that’s a learning for next time, not a disciplinary issue.

Conditions when fix-forward is authorised:

  • Time box agreed upfront. “We will attempt to fix forward. If a fix is not deployed within [X hours], we roll back regardless.” The time box is set at the point the decision is made, not discovered later. Typical range: 2–4 hours for a SEV2, shorter for anything user-visible.
  • Actively capture everything. The whole point of fix-forward is that the production conditions are hard to reproduce. While they exist, instrument heavily — detailed logging, request captures, database state snapshots, user reports. Don’t waste the opportunity.
  • Monitor the blast radius continuously. If the impact grows beyond the initial assessment, revert to rollback immediately. The assumption of contained impact must be validated throughout, not assumed once.
  • Communicate. Affected users or stakeholders are informed if appropriate. The squad’s PM knows. Head of Eng knows.

After the fix is deployed: the incident still gets a full review. The review should assess whether fix-forward was the right call, whether the time box was respected, and whether the reproduction difficulty should be addressed structurally (better staging data, better test coverage, better observability).


6. Observability Stack

Components

The platform squad builds and maintains the observability stack. Product squads consume it.

| Component | Purpose | Platform Squad Owns | Product Squad Owns |
| --- | --- | --- | --- |
| Log aggregation | Centralised, searchable logs | Infrastructure, ingestion, retention | Log content, structured fields, correlation IDs |
| Distributed tracing | Request flow across services | Tracing infrastructure, sampling config | Instrumenting their services |
| Metrics collection | Time-series application and infra metrics | Collection infrastructure, dashboarding platform | Application metrics, squad dashboards |
| Alerting | Notification when things go wrong | Alerting infrastructure, infrastructure alerts | Application alerts for their services |
| Error tracking | Aggregate and deduplicate errors | Tooling provision | Configuration, triage, resolution |

Alerting Principles

  • Alert on symptoms, not causes. “Error rate exceeded 1%” not “CPU at 80%.”
  • Every alert must have a clear owner and a documented response (even if the response is “investigate and escalate if needed”)
  • Alert fatigue is a reliability risk. Review alert noise quarterly. If an alert fires more than once a week and never requires action, remove or adjust it.
  • Severity levels match incident severity: SEV1 alerts page immediately, SEV2 alerts notify within 30 minutes, SEV3 alerts go to a dashboard
  • Alerts include enough context to start investigating — which service, what threshold was breached, a link to relevant dashboards and logs
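A sketch of symptom-based alert evaluation following the principles above. The threshold, minimum-traffic guard, service name, and dashboard link are illustrative assumptions:

```python
from typing import Optional


def error_rate_alert(errors: int, requests: int, threshold: float = 0.01,
                     min_requests: int = 100) -> Optional[dict]:
    """Evaluate a symptom-based alert: error rate over a time window.

    Returns an alert payload with enough context to start investigating,
    or None. The min_requests guard suppresses noisy alerts on tiny traffic.
    """
    if requests < min_requests:
        return None
    rate = errors / requests
    if rate <= threshold:
        return None
    return {
        "alert": "error_rate_exceeded",          # the symptom, not a cause like CPU%
        "service": "checkout",                   # hypothetical service name
        "observed": round(rate, 4),
        "threshold": threshold,
        "dashboard": "https://grafana.example/d/checkout",  # illustrative link
    }
```

The alert names the symptom, states the breached threshold, and points at the relevant dashboard — exactly the context an on-call engineer needs to start investigating.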

7. Disaster Recovery & Business Continuity

Backup Strategy

  • Databases: automated backups with defined retention (minimum 30 days). Point-in-time recovery capability for production databases.
  • Blob/file storage: replicated or backed up depending on criticality.
  • Infrastructure state: IaC is the backup. If the environment is destroyed, it can be recreated from code.
  • Secrets: backed up separately from the infrastructure. Recovery procedure documented and tested.

Recovery Testing

Backups that aren’t tested are not backups. Recovery must be tested periodically — at minimum quarterly for databases, annually for full environment recreation.

Recovery time objectives (RTO) and recovery point objectives (RPO) should be defined per service based on business criticality. Not everything needs the same recovery guarantee. Platform squad documents these in collaboration with product PMs and Head of Eng.

Runbooks

Every production service has a runbook. The runbook contains:

  • Service overview (what it does, who owns it, dependencies)
  • Common failure scenarios and diagnostic steps
  • Remediation actions for known issues
  • Escalation path
  • Contact information for external dependencies

Runbooks are maintained by the owning squad and reviewed after every incident that reveals a gap. See Documentation section of the Engineering Process for runbook ownership details.


8. Cost Management

Cloud infrastructure costs are visible and actively managed. They should not be a surprise on a monthly bill.

Principles

  • Visibility: cost data is accessible to Lead Devs and PMs, not just the platform squad. If a squad’s service costs spike, the squad should see it.
  • Allocation: costs are attributed to squads/services where possible. This isn’t for billing — it’s for awareness. A squad that understands their service costs $X/month makes better architectural decisions.
  • Right-sizing: container resource limits are reviewed periodically. Over-provisioning is common and expensive. Under-provisioning causes performance problems. Platform squad provides tooling to identify both.
  • Environment costs: non-production environments are sized down and, where possible, shut down outside working hours.
  • Reserved capacity: for predictable workloads, reserved instances or committed use discounts are evaluated by the platform squad and approved by Head of Eng.

Cost reviews

Platform squad includes a cost summary in their monthly reporting. Significant cost changes (>20% month-over-month) are investigated and explained. Cost optimisation is part of the engineering allocation, not a separate initiative.


9. AI in DevOps

AI augments infrastructure and operations work the same way it augments development. See the AI Use Case Catalogue for the full inventory of AI applications across the engineering organisation.

| Application | How | Owner |
| --- | --- | --- |
| Infrastructure code generation | AI generates Terraform/IaC from requirements, following established patterns | DevOps engineer reviews |
| Incident diagnostics | AI analyses logs, identifies related deployments, suggests similar past incidents | On-call engineer uses as input |
| Pipeline optimisation | AI analyses pipeline execution times and suggests parallelisation or caching improvements | Platform squad |
| Cost anomaly detection | AI identifies unusual cost patterns before they appear on the monthly bill | Platform squad |
| Security scanning | AI-assisted review of IaC for security misconfigurations beyond rule-based tools | Automated in CI |
| Capacity forecasting | AI projects resource needs based on traffic trends and planned releases | Platform squad + Lead Devs |
| Runbook generation | AI drafts runbooks from service code, infrastructure config, and monitoring setup | Owning squad reviews |
| Migration analysis | AI analyses database migration scripts for locking, backward compatibility, and data safety | Dev + Lead Dev review |

10. Maturity Progression

Not everything needs to be in place on day one. Infrastructure capability grows with the organisation.

Foundation (must have before first production deployment)

  • Containerised services with Dockerfiles
  • CI/CD pipeline: build, test, scan, deploy
  • Infrastructure as Code for all environments
  • Centralised logging
  • Health check endpoints on all services
  • Secrets management (not in code)
  • Automated backups for databases
  • Basic alerting (service down, error rate spike)
  • Staging + production environments

Established (target within first quarter)

  • Structured logging with correlation IDs
  • Distributed tracing
  • Self-service deployment for product squads
  • Service templates (new service from template in under an hour)
  • Feature flag infrastructure
  • Cost visibility by service
  • Runbooks for all production services
  • Automated rollback criteria
  • Container image scanning in CI
  • Ephemeral environments (if needed)

Mature (ongoing investment)

  • AI-assisted incident diagnostics
  • Automated cost anomaly detection
  • Capacity forecasting
  • Chaos engineering / resilience testing
  • Cross-region disaster recovery (if business requires)
  • Pipeline optimisation (sub-10-minute feedback)
  • Automated dependency updates with auto-merge for passing patches
  • Production traffic replay for testing