Engineering Process
The operational reference for day-to-day work. Sprint cadence, refinement, planning, technical design, development workflow, incident management, releases, metrics, onboarding, and more.
1. Organisation Structure
Squads
Product Squads (one per product area)
- 1 PM, 1 Lead Dev, 1 QA, 3-6 Devs
- Each owns a distinct product area end-to-end
Platform Squad (one)
- 1 PM, 1 Lead Dev, 1 QA, 3-4 Devs (platform squads tend to be smaller)
- Owns: CI/CD, shared libraries, AI tooling, observability, cross-cutting tech debt
Staff Engineer (one per org, no squad)
- Architecture, code quality standards, AI playbook, cross-squad consistency
- PM Discovery & Stakeholders: Discovery, roadmap, and stakeholder management (pairs with Head of Product)
- PM Analytics & Experimentation: Analytics, experimentation, and feedback loops
For more detail on roles, squad composition, and reporting lines, see Team Structure.
Leadership
- Head of Product: Roadmap, prioritisation, stakeholder alignment, final call on what gets built
- Head of Engineering: Technical strategy, delivery health, hiring, final call on how things get built
2. How Work Enters the System
| Source | Entry Point | Prioritised By |
|---|---|---|
| Product roadmap | Squad PM | Head of Product |
| Customer feedback / support | Squad PM | Squad PM |
| Sales requests | Head of Product | Head of Product |
| Internal stakeholders | Squad PM (or Head of Product for cross-cutting) | Head of Product arbitrates conflicts |
| Engineering-initiated (tech debt, reliability, tooling) | Lead Dev | Head of Eng (15-25% standing allocation, no PM approval needed) |
| Production incidents | On-call engineer | Severity level determines response (see Section 8) |
Rules:
- Only the PM can add items to a squad’s backlog
- Only the PM can inject work into a sprint mid-flight, after consulting Lead Dev on capacity impact
- No one goes directly to a developer with requests
- PMs have visibility of engineering allocation work but do not approve or reject it
- Head of Product is the escalation point for priority conflicts and executive-driven interruptions
AI usage in intake: AI triages and categorises incoming requests, linking to existing backlog items. AI drafts initial assessments (rough sizing, squad ownership, roadmap overlap). AI synthesises customer feedback from support tickets to surface patterns so PMs work from data, not anecdote. For a full catalogue of AI applications across workflows, see the AI Use-Case Catalogue.
3. Sprint Cadence
Sprint length: 2 weeks
| When | What | Who | Duration |
|---|---|---|---|
| Monday (sprint start) | Sprint planning | Whole squad | 30-45 mins |
| Mid-week (weekly) | Refinement | PM, Lead Dev, QA, 1-2 devs | 45-60 mins |
| Thursday | Mid-sprint check | Squad | 15 mins |
| Friday (sprint end) | Demo | Squad (stakeholders invited) | 30 mins |
| Every 4 weeks | Retrospective | Squad only (no leadership) | 45 mins |
| Monthly | Architecture review | Lead Devs, Staff Engineer, Head of Eng | 60 mins |
| Monthly | Product strategy sync | PMs, Head of Product | 60 mins |
| Quarterly | Lightweight planning | Leadership + squads | Half day |
Daily standups are not mandated. Squads use async updates. A squad may choose to hold a daily sync — their call.
4. Refinement
When: Weekly, mid-week. Refine work 1-2 sprints ahead, never the current sprint.
Attendees: PM (mandatory), Lead Dev (mandatory), QA (mandatory), 1-2 devs likely to build it.
PM prepares:
- Problem statement: what is the user problem or business objective?
- Evidence: how do we know this is real?
- Success criteria: measurable outcomes
- Constraints: budget, timeline, dependencies, compliance
The session produces:
- Clear problem statement (PM owns)
- Agreed high-level approach (Lead Dev owns)
- Known risks and open questions
- Rough size: small / medium / large
- Acceptance criteria (PM + QA collaborate)
AI usage: PM arrives with AI-drafted problem statement and acceptance criteria as starting point. Lead Dev has AI-generated preliminary technical assessment. QA arrives with AI-generated test scenarios. Human value is in the debate, edge cases, and trade-off decisions.
Rules:
- PM presents the problem, not the solution
- If an item can’t be refined in one session, it’s too big - split or spike
- No estimation theatre - no planning poker, no story points
- No current sprint work discussed
- Only refine what’s likely built in next 1-2 sprints
This refinement process feeds into the broader product process, which describes how ideas move from discovery through to delivery.
5. Sprint Planning
When: First morning of the sprint.
Attendees: Whole squad.
PM prepares:
- Sprint goal (one sentence)
- Candidate items in priority order (all previously refined)
The session produces:
- Sprint goal (one sentence)
- Committed items (priority ordered)
- Known risks and dependencies
- Capacity constraints noted
Process:
- PM presents sprint goal and candidate items
- Lead Dev facilitates: squad assesses capacity (holidays, on-call, overhead)
- Squad pulls items from priority list until realistic
- Devs self-assign or Lead Dev assigns
- QA confirms they can keep pace - if not, that constrains the sprint
Rules:
- If an item isn’t understood, it goes back to refinement - don’t refine in planning
- PM prioritises; engineering assesses capacity; the intersection is the sprint
- PM does not assign work to individual devs
- Commitment is a forecast, not a contract
AI usage: AI suggests realistic item count based on recent throughput history. AI flags potential cross-squad code-level conflicts in committed items. PM uses AI to draft sprint goal before the meeting.
6. Technical Design Review
When: After an item is picked up, before coding starts.
Process:
- Dev spends 30-60 minutes thinking about approach
- Dev writes it down (Slack message, short doc, bullet points): key architectural choices, data flow, state management, error handling, observability approach
- Dev shares with Lead Dev
- Lead Dev responds: “good, go” or has a 15-30 minute conversation about alternatives and risks
- For cross-cutting or infrastructure work, Staff Engineer is the reviewer instead
Output: Agreed approach, captured on the ticket or as an ADR for significant decisions.
Calibration: 2-minute Slack exchange for experienced devs on straightforward items. Longer, more structured conversation for junior devs or complex work. Lead Dev uses judgment.
AI usage: Dev pressure-tests approach with AI before the Lead Dev conversation (“here’s my plan, what are the failure modes I haven’t considered?”). AI raises the floor; Lead Dev adds codebase history, operational context, and team capability judgment that AI can’t provide.
7. Development & Deployment
Developer Workflow
- Dev commits to main behind a feature flag
- Small, frequent commits - each passes CI
- CI runs automatically: tests, linting, static analysis, security scan, AI code review
- Green CI -> auto-deploys to staging -> smoke tests -> promotes to production
- Target: commit to production in under 30 minutes
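The flag-guarded workflow above can be sketched in a few lines. This is a minimal illustration only; the in-memory flag store and the function names are hypothetical, not our actual tooling:

```python
# Illustrative sketch: incomplete work ships to main disabled behind a flag.
# `flags` stands in for a real flag service; flipped at release, not deploy.
flags = {"new-checkout-flow": False}

def legacy_checkout(cart):
    return {"path": "legacy", "total": sum(cart)}

def new_checkout(cart):
    return {"path": "new", "total": sum(cart)}

def checkout(cart):
    # New code path is invisible to users until the flag is flipped,
    # so small commits can land on main and deploy continuously.
    if flags.get("new-checkout-flow", False):
        return new_checkout(cart)
    return legacy_checkout(cart)
```

Because the old path stays intact until the flag flips, rollback is a flag toggle rather than a redeploy.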
For more on CI/CD pipeline design, container strategy, and deployment principles, see DevOps Principles.
AI usage in development: Code generation, test writing, documentation. AI-assisted code review runs as part of CI, catching style issues, common bugs, security anti-patterns, and deviation from codebase conventions.
Feature Ownership
- One developer owns a feature end-to-end
- Pairing for complex/risky work (Lead Dev’s call)
- Junior devs own their features with closer Lead Dev oversight
- Features spanning multiple services: one owner coordinates, others contribute
Pull Requests
PRs are not the default. Code goes to main via CI gates.
PR-like review required for:
- Shared platform code / APIs consumed by other squads
- Database migrations
- Security-sensitive changes (auth, payment, PII)
- New joiners (first few weeks, as coaching)
For these: one reviewer, 4-hour SLA, auto-merge if no response.
Post-Merge Review
Lead Devs and Staff Engineer review committed code asynchronously - not a gate, but a learning and consistency mechanism. Concerns become follow-up conversations or new commits.
8. Incident Management & On-Call
Rotation: Weekly, per squad. Platform squad has its own rotation.
Escalation:
- On-call engineer triages and fixes/contains
- Not resolved in 30 mins -> Lead Dev
- Cross-squad -> platform squad on-call
- Customer-facing + severe -> Head of Eng
Severity and response:
| Level | Definition | Response |
|---|---|---|
| SEV1 | Service down or critically degraded | All hands, immediate, stakeholder comms within 30 mins |
| SEV2 | Significant degradation, workaround exists | On-call responds within 30 mins, fix during working hours |
| SEV3 | Minor, no user impact | Logged, enters normal backlog |
Sprint impact: SEV1 drops everything, sprint scope renegotiated. SEV2 handled from buffer; if >half day, Lead Dev and PM discuss displacement. SEV3 enters backlog normally.
Capacity buffer: 10-15% of sprint capacity assumed lost to unplanned work.
Incident reviews: Held for every SEV1 and for systemic SEV2s, within 3 working days. Maximum 3 actions, each with an owner and a date. Output stored alongside ADRs.
Compensation: Time off in lieu.
AI usage: During incidents, AI assists with log analysis, identifying related recent deployments, and suggesting similar past incidents. For incident reviews, AI drafts initial timeline from alerting data and Slack threads. Over time, AI identifies recurring incident themes across squads.
9. Release Process
Feature Flags
| Type | Owner | Purpose |
|---|---|---|
| Engineering | Dev building the feature | Hide incomplete work during development |
| Release | PM decides flip; engineering confirms readiness | Control feature visibility to users |
| Operational | Engineering | Kill switches for production issues (no PM approval needed) |
Release Decision
- Lead Dev confirms technical readiness
- QA confirms quality (exploratory testing complete)
- PM confirms business readiness (docs, support, comms)
- PM flips the flag (or requests engineering to)
Async check (Slack/ticket checklist), not a meeting.
Rollout
- Default: Percentage rollout. Example (vary case by case): 5-10% -> monitor 24-48 hrs -> 25% -> 50% -> 100%
- Targeted: User-segment rollout for beta/specific tiers
- Trivial changes: Full rollout immediately
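A percentage rollout is typically implemented by bucketing users deterministically, so the same user stays enabled as the percentage widens. A minimal sketch (the function and flag names are illustrative, not our flag service's API):

```python
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Bucket a user into one of 100 stable buckets; enable the first `percent`.

    Hashing flag + user keeps the bucket fixed, so a user enabled at 5%
    remains enabled at 25%, 50%, and 100% as the rollout widens.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent
```

The targeted variant replaces the hash bucket with an explicit segment check (beta cohort, pricing tier).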
Rollback
- Feature flag toggle: Immediate (seconds). First line of defence.
- Code rollback: Revert commit, redeploy (~30 mins). When flag toggle is insufficient.
- Rollback criteria defined before release: “If error rate exceeds X or latency exceeds Y within 24 hours, roll back.”
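Pre-agreed rollback criteria can be expressed as a simple threshold check that monitoring (human or automated) evaluates during the watch window. The metric names and limits below are hypothetical placeholders for whatever the squad agrees before release:

```python
# Hypothetical thresholds agreed before release; names are illustrative.
ROLLBACK_CRITERIA = {"error_rate": 0.02, "p95_latency_ms": 800}

def should_roll_back(observed: dict) -> bool:
    """True if any observed metric breaches its agreed threshold."""
    return any(
        observed.get(name, 0) > limit
        for name, limit in ROLLBACK_CRITERIA.items()
    )
```

Writing the criteria down before release means the auto-revert described in the AI usage note has an unambiguous trigger.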
Database Migrations
- Must be backward compatible (old code works with new schema and vice versa)
- Deploy migration separately from code
- Large migrations in stages using online migration tools
- Second pair of eyes required (Lead Dev or Staff Engineer)
Flag Cleanup
- Feature fully rolled out or killed -> dev removes flag and conditional/dead code within one sprint
- Platform squad reports flags older than 30 days monthly
- Flags older than 60 days without documented reason escalated to Lead Dev
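The monthly stale-flag report reduces to a date comparison over flag creation times. A sketch, assuming the flag store can list flags with their creation dates (the data shape here is invented for illustration):

```python
from datetime import date, timedelta

def stale_flags(flags, today, max_age_days=30):
    """Return names of flags older than `max_age_days`, oldest first.

    `flags` maps flag name -> creation date, as a stand-in for
    whatever the real flag store exposes.
    """
    cutoff = today - timedelta(days=max_age_days)
    old = {name: created for name, created in flags.items() if created < cutoff}
    return sorted(old, key=old.get)
```

Running the same function with `max_age_days=60` yields the escalation list for Lead Devs.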
AI usage in releases: AI monitors rollout metrics in real-time and auto-reverts flags if thresholds breached. AI generates release notes from commits and tickets. AI flags stale feature flags and associated dead code. For database migrations, AI analyses scripts against current schema for locking, data loss, and backward compatibility issues before human review.
10. QA Workflow
| Phase | QA Activity |
|---|---|
| Refinement | Attends session. Asks about testability and edge cases. Collaborates on acceptance criteria. |
| During development | Writes automated tests in parallel with dev work. Pairs with devs on testability. |
| At merge | Owns quality gate criteria. Automated suites run in CI. Exploratory testing for complex/risky changes. |
| At release | Confirms quality as part of release decision. |
| Post-release | Owns production quality monitoring - error rates, regressions, anomaly detection. |
QA does not have a veto on release. QA has a documented voice - if overruled, the risk is acknowledged and logged.
AI usage in QA: AI-assisted test generation - QA reviews and curates rather than writing every test. AI-powered anomaly detection on production metrics. Automated pre/post-release behaviour comparison. AI-generated test scenarios as starting material for refinement.
11. Cross-Squad Dependencies
Platform squad requests:
- “I need this to exist” -> enters platform squad backlog, platform PM prioritises
- “I need this unblocked now” -> draws from platform squad’s 15% unplanned buffer
Weekly sync (15 mins): Platform PM + product squad PMs align on priorities. Head of Product arbitrates conflicts.
Squad-to-squad: Requests go through the receiving squad’s PM. Small favours (<half day) handled informally. Large requests become roadmap items.
Staff Engineer is the early warning system - spots cross-squad collisions and raises them proactively.
Shared services: Owned by platform squad with defined API contracts. Platform squad owns backward compatibility and coordinated migration.
AI usage: AI analyses sprint commitments across squads and flags code-level conflicts. Dependency tracking over time surfaces structural problems. Platform squad uses AI to auto-generate API documentation and migration guides when contracts change.
12. Metrics
Collection
All metrics derived from existing tooling. No manual data entry.
Human disciplines required:
- Move items through project tracker states accurately and promptly
- Tag unplanned items entering a sprint
- Tag engineering-initiated items
- Tag rollbacks/hotfixes in deployment tooling
Delivery Metrics (Squad Level)
| Metric | Source | Target |
|---|---|---|
| Throughput | Project tracker (items to “done” per sprint) | Trend over 4-6 sprints |
| Cycle time | Project tracker (“in progress” to “done”) | Most items within one sprint |
| Deployment frequency | CI/CD pipeline logs | Daily or better |
| Change failure rate | Deployment tooling (rollbacks, hotfixes, emergency flag toggles) | Trending down |
| MTTR | Alerting tool to incident resolution | Minutes (via flag toggle) |
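Cycle time as defined above (“in progress” to “done”) is derivable directly from tracker state-change events, which is why the human discipline of moving tickets promptly matters. A sketch, assuming events arrive as (state, ISO timestamp) pairs; the state names follow the table but the event format is hypothetical:

```python
from datetime import datetime

def cycle_time_days(events):
    """Days from first 'in progress' timestamp to last 'done' timestamp.

    `events` is a list of (state, ISO-8601 timestamp) pairs exported
    from the project tracker. Returns None if the item never started
    or never finished.
    """
    starts = [datetime.fromisoformat(t) for s, t in events if s == "in progress"]
    ends = [datetime.fromisoformat(t) for s, t in events if s == "done"]
    if not starts or not ends:
        return None
    return (max(ends) - min(starts)).total_seconds() / 86400
```

No manual entry is involved: the metric falls out of accurately maintained tracker states.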
Org-Level Metrics
| Metric | Signal |
|---|---|
| Planned vs unplanned ratio | >20% unplanned consistently = systemic problem |
| Cross-squad dependency wait time | Trending up = platform squad under-resourced |
| Engineering allocation usage | Consistently surrendered = invisible tech debt accumulation |
Product Metrics
Defined per feature at refinement. Measured via analytics tooling. PM owns.
Reporting
- Head of Eng: delivery health trends with context (weekly dashboard review)
- Head of Product: adoption, outcomes, roadmap progress
- Both: quarterly outcome-oriented report to leadership
Do not measure: Individual velocity, story points, hours worked, sprint commitment accuracy.
AI usage in metrics: Automated collection and dashboarding from Git, CI/CD, and deployment tooling. AI-generated weekly squad summaries. Anomaly detection when metrics deviate from trend. Correlation analysis over time.
13. Retrospectives
Frequency: Every 2 sprints (4 weeks). Duration: 45 minutes. Attendees: Squad only. No leadership. Facilitator: Lead Dev or rotating squad member.
Format:
- What’s slowing us down?
- What should we change?
- What’s working that we should protect?
Output: Maximum 2 actions with named owner and date. Reviewed at start of next retro. Same issue 3 retros running with no change -> escalates to Head of Eng.
Cross-squad retro: Quarterly - Lead Devs, Staff Engineer, Head of Eng on systemic issues.
AI usage: Minimal. One application: pattern detection across retros over time (“this theme has appeared in 3 of the last 4 retros”).
14. Responsibilities Matrix
| Decision | Who Decides | Who Has Input | Who Has Visibility |
|---|---|---|---|
| What to build / priority | Head of Product / PM | Head of Eng, Lead Dev | Squads, stakeholders |
| How to build it | Lead Dev / Staff Engineer | PM (constraints), devs | PM |
| Engineering allocation (15-25%) | Head of Eng | Lead Devs | PMs (informed, not approval) |
| Tech debt prioritisation | Lead Dev (within allocation) | Devs | PM |
| Release timing | PM | Lead Dev (readiness), QA (quality) | Head of Product |
| Rollback | On-call engineer / Lead Dev | - | PM, Head of Eng |
| Sprint scope change | PM | Lead Dev (capacity impact) | Squad |
| Hiring | Head of Eng + Head of Product jointly | Lead Devs | - |
| Squad composition | Head of Eng + Head of Product jointly | Lead Devs | - |
| Architecture decisions | Staff Engineer / Lead Devs | Devs | Head of Eng |
| Security standards | Staff Engineer | Platform squad, Head of Eng | All |
| Cross-squad priority conflicts | Head of Product | PMs, Head of Eng | Squads |
15. Onboarding Timeline
| When | Activity | Owner |
|---|---|---|
| Pre-arrival | Laptop, accounts, tooling, AI tools provisioned | Platform squad |
| Day 1 | Environment setup (target: under 2 hours to build/deploy locally) | Platform squad tooling |
| Days 1-5 | Codebase walkthrough (2-3 hours across week) | Lead Dev |
| Days 1-5 | Read AI playbook, “how we work” summary | New joiner |
| Days 1-5 | Shadow refinement and planning | New joiner |
| Days 2-3 | First real commit (bug fix, config change). PR-reviewed by Lead Dev. | New joiner + Lead Dev |
| Weeks 2-4 | Small refined items. Close Lead Dev involvement in design. Pair on at least 1 item. | Lead Dev oversees |
| Week 1 end | Check-in: what’s confusing, blocking, missing? | Lead Dev |
| Week 2 end | Check-in | Lead Dev |
| Week 4 end | Check-in | Lead Dev |
| Month 2 | Medium items independently. On-call shadowing. Post-merge review replaces PR review. | Lead Dev monitors |
| Month 2 end | Full squad member at appropriate autonomy level | Lead Dev confirms |
AI usage in onboarding: New joiners use AI to explore the codebase (“explain what this service does”, “show me authentication flow”). AI assists first contributions. AI generates personalised codebase orientation based on areas the new dev will work in. Supplements Lead Dev walkthrough, doesn’t replace it.
16. Documentation Requirements
| Document | Owner | Location | Updated When |
|---|---|---|---|
| Architecture overview (1 page) | Staff Engineer | Wiki | Architecture changes; verified at monthly sync |
| ADRs | Dev who made the decision | Repo | At decision time |
| API contracts | Owning squad | Repo (auto-generated from code) | API changes |
| Runbooks | Owning squad + QA | Repo | Service launch; after every incident revealing a gap |
| “How we work” summary | Head of Eng | Wiki | Process changes; annually; onboarding feedback |
| AI playbook | Staff Engineer | Wiki | New patterns or tool changes |
| Incident reviews | Lead Dev | Repo / wiki | After every SEV1 and systemic SEV2 |
AI usage in documentation: AI generates first drafts of runbooks, API docs, and architecture descriptions from codebase. AI keeps API docs in sync with code, flagging or auto-generating updates when endpoints change. AI identifies undocumented services by comparing codebase against existing docs. New joiners query AI as a living, queryable documentation layer.
17. Security Requirements
CI Pipeline (Non-Negotiable)
| Check | Action on Finding |
|---|---|
| Dependency scanning | Block on critical/high; medium/low to backlog |
| SAST | Block on critical |
| Secrets detection (+ pre-commit hook) | Block immediately |
| Container/image scanning | Block on critical |
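The secrets-detection gate (CI plus pre-commit hook) is essentially pattern matching over staged content. A heavily simplified sketch; real scanners ship far larger, maintained rule sets, and the three patterns below are illustrative only:

```python
import re

# Illustrative patterns only; production scanners use maintained rule sets.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key id shape
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|secret)\s*[:=]\s*['\"][^'\"]{16,}['\"]"),
]

def find_secrets(text: str) -> list:
    """Return matched substrings; any hit blocks the commit immediately."""
    hits = []
    for pattern in SECRET_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits
```

In the pipeline, a non-empty result fails the check with no severity triage: secrets block immediately, per the table above.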
Access Control
- Principle of least privilege everywhere
- Production DB: read-only for debugging; write access requires Lead Dev approval + logging
- Secrets in dedicated tooling (Vault / AWS Secrets Manager) - never in code, env files, or Slack
Periodic Activities
| Activity | Frequency | Owner |
|---|---|---|
| Dependency audit | Quarterly | Platform squad |
| Penetration test | Annual (external firm) | Head of Eng commissions |
| Security incident review | Per event | Lead Dev runs; Head of Eng attends |
AI usage in security: AI code review in CI targets security anti-patterns beyond rule-based SAST. AI monitors dependency advisories and cross-references against codebase. AI helps devs write secure code as part of normal workflow. AI analyses access logs during incident investigation. AI generates security documentation (data flow diagrams showing PII locations and movement).
18. AI Tooling Standards
Provision: Every dev, PM, and QA gets AI tooling (Claude Pro/Team or equivalent) as standard.
AI Playbook: Maintained by Staff Engineer. Documents where AI assists and where it doesn’t. Squad members contribute patterns. See the AI Use-Case Catalogue for a comprehensive breakdown by workflow stage.
Platform squad builds internal AI tooling: custom prompts, RAG over codebase/docs, automated code review.
| Activity | AI Application |
|---|---|
| Refinement | PM: AI-drafted problem statements, acceptance criteria. Lead Dev: preliminary technical assessment. QA: test scenarios. |
| Technical design | Dev pressure-tests approach with AI before Lead Dev conversation |
| Development | Code generation, test writing, documentation |
| CI/CD | Automated code review, security pattern detection |
| QA | Test generation, anomaly detection, pre/post-release comparison |
| Incident response | Log analysis, related deployment identification, timeline drafting |
| Metrics | Automated dashboards, weekly summaries, anomaly detection |
| Documentation | First drafts of runbooks/API docs, staleness detection, sync with code changes |
| Releases | Rollout monitoring, auto-revert on threshold breach, release note generation, stale flag detection |
19. Head of Product / Head of Eng Operating Model
Weekly 1:1 (30-45 mins): Roadmap alignment, delivery health, cross-squad prioritisation, people issues, upcoming decisions.
Shared dashboards: Both see same delivery and product metrics.
Joint quarterly planning: Head of Product brings demand; Head of Eng brings supply. Reconciled together.
Decision authority:
| Domain | Final Call |
|---|---|
| What to build, priority order | Head of Product |
| How to build, technical approach | Head of Eng |
| Engineering allocation (15-25%) | Head of Eng |
| People (hiring, squad composition) | Joint |
| Unresolvable disagreements | Escalate to CTO/CEO (should be rare) |