Drawing the Line Between SRE and Platform: Responsibilities, KPIs, and Org Design

Primary KW: SRE vs Platform / Responsibilities
Goal: Turn fuzzy boundaries into an operational, plug‑and‑play template (RACI, KPIs, migration steps, interview prompts).


Key Takeaways

  • SRE is the owner of service reliability objectives (SLOs): production operations, incident response, SLO/SLI practice, and change‑risk governance.

  • Platform provides the shared foundation as a product: designs a Golden Path and improves Developer Experience (DevEx) and organizational throughput.

  • Boundary principle: SRE = accountable for operational outcomes of a service; Platform = accountable for supplying mechanisms that make safe operation easy.

  • Anti‑patterns: SRE as a catch‑all manual ops desk; Platform as a ticket factory. Counter with self‑service and a clear Error Budget Policy.


RACI (Roles & Responsibilities)

Legend: R=Responsible (does the work) / A=Accountable (final owner) / C=Consult / I=Inform

1) Operations & Reliability

| Item | SRE | Platform | Product Dev | Security | Infra/Network |
| --- | --- | --- | --- | --- | --- |
| SLO/SLI design & operations | A/R | C | C | I | I |
| Error budget policy (release gating) | A/R | C | C | I | I |
| Production incident response | A/R | C | C | C | C |
| Change management (safety valves/CAB) | A/R | C | C | C | I |
| Capacity planning | A/R | C | C | I | C |
| Postmortems & learning system | A/R | C | C | I | I |

2) Platform & DevEx

| Item | SRE | Platform | Product Dev | Security | Infra/Network |
| --- | --- | --- | --- | --- | --- |
| IDP/Developer portal & service catalog | I | A/R | C | C | I |
| CI/CD templates & pipelines | C | A/R | C | I | I |
| IaC modules & environment self‑service | C | A/R | C | C | I |
| Observability stack (logs/metrics/traces) | C | A/R | I | C | I |
| Kubernetes/runtime platform | C | A/R | I | C | C |
| Golden Path (reference arch/SDKs) | C | A/R | C | I | I |

3) Security & Governance

| Item | SRE | Platform | Product Dev | Security | Infra/Network |
| --- | --- | --- | --- | --- | --- |
| IAM/Secrets management (the mechanism) | I | A/R | I | C | C |
| Vulnerability mgmt (runtime/libs) | A/R | C | C | C | I |
| Policy as Code (e.g., OPA) | C | A/R | I | C | I |
| Audit & compliance evidence | C | A/R | I | A/R | I |

Note: In small orgs SRE and Platform may be one team. Even then, keep A (ultimate ownership) explicit per item.
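A lightweight way to keep that A explicit, even when one team wears both hats, is to check a machine-readable ownership map into the repo and validate it in CI. A minimal sketch in Python; the `ownership.py` module and the item entries are hypothetical illustrations, not a prescribed schema:

```python
# ownership.py -- hypothetical machine-readable RACI map (illustrative sketch).
# Goal: every item names exactly one Accountable (A) owner,
# even when SRE and Platform are a single team.

RACI = {
    "SLO/SLI design & operations": {"A": "SRE", "R": ["SRE"], "C": ["Platform", "Product Dev"]},
    "Golden Path (reference arch/SDKs)": {"A": "Platform", "R": ["Platform"], "C": ["SRE", "Product Dev"]},
    "IAM/Secrets management (mechanism)": {"A": "Platform", "R": ["Platform"], "C": ["Security", "Infra/Network"]},
}

def items_missing_accountable(raci: dict) -> list[str]:
    """Return items that do not name an Accountable owner."""
    return [item for item, roles in raci.items() if not roles.get("A")]

if __name__ == "__main__":
    missing = items_missing_accountable(RACI)
    if missing:
        raise SystemExit(f"RACI check failed, no A for: {missing}")
    print("RACI check passed: every item has an explicit A.")
```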


KPIs (with target examples)

SRE KPIs

| Metric | Definition / How to measure | Target example |
| --- | --- | --- |
| SLO attainment | Quarterly attainment of availability/latency SLOs | ≥ 99.9% (service‑specific) |
| Error budget burn | Share of the error budget consumed: (1 − attainment) / (1 − SLO target) | ≤ 50% at mid‑quarter, max 80% |
| MTTA / MTTR | Mean time from detection to acknowledgement / to restoration | MTTA ≤ 5 min, MTTR ≤ 30 min (by severity) |
| Change failure rate | Changes causing incidents / all changes | ≤ 10–15% |
| Incident frequency | Count by severity per week/month | Downward trend |
| Toil ratio | Manual, repetitive ops time / total engineering time | ≤ 30% (hard cap 50%) |
| Postmortem completion | Postmortems completed on time / total incidents requiring one | ≥ 90% |
| Action item completion | Preventive actions completed on time | ≥ 85% |
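As a sanity check on the definitions above, here is a minimal sketch of the arithmetic behind attainment, error budget burn, and change failure rate. The counts are illustrative, and a request-based availability SLI is assumed:

```python
# Illustrative KPI arithmetic, assuming a request-based availability SLI.
SLO_TARGET = 0.999                     # 99.9% quarterly availability target

good_requests = 9_996_000
total_requests = 10_000_000

attainment = good_requests / total_requests          # SLO attainment
error_budget = 1 - SLO_TARGET                        # allowed unreliability (0.1%)
budget_burned = (1 - attainment) / error_budget      # share of the budget consumed

changes, failed_changes = 420, 38
change_failure_rate = failed_changes / changes

print(f"SLO attainment:      {attainment:.4%}")
print(f"Error budget burned: {budget_burned:.1%}  (policy: <= 50% mid-quarter, max 80%)")
print(f"Change failure rate: {change_failure_rate:.1%} (target: <= 10-15%)")
```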

Platform KPIs (Platform as a Product)

| Metric | Definition / How to measure | Target example |
| --- | --- | --- |
| Adoption rate | Share of teams using the IDP/templates/modules | ≥ 80% |
| Time to First Deploy (TTFD) | Time from project start to first stage/prod deploy | ≤ 1 day (ideally hours) |
| Env provisioning time | From template request to runnable environment | ≤ 30 min |
| Deployment frequency (DORA) | Team median | ≥ 1/week (elite: multiple per day) |
| Lead time for changes (DORA) | PR creation → production | ≤ 24 hours |
| Platform‑caused incidents | Incidents attributable to the platform | Downward trend; zero Sev‑1 |
| Exception rate | Deploys outside the Golden Path / all deploys | ≤ 10% |
| DevEx / NPS | Quarterly developer survey | ≥ +30 |
| Cost efficiency | Platform cost per service | Year‑over‑year improvement |
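TTFD and the DORA metrics reduce to medians and ratios over timestamps the portal and CI/CD system already record. A short sketch with hypothetical records and counts:

```python
# Illustrative TTFD / DORA arithmetic over hypothetical timestamped records.
from datetime import datetime
from statistics import median

# (pr_created, deployed_to_prod) pairs exported from the CI/CD system
lead_times = [
    (datetime(2024, 5, 1, 9, 0),  datetime(2024, 5, 1, 15, 30)),
    (datetime(2024, 5, 2, 11, 0), datetime(2024, 5, 3, 10, 0)),
    (datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 18, 45)),
]
print(f"Median lead time for changes: {median(done - start for start, done in lead_times)}")

# Time to First Deploy for a newly scaffolded service
project_created = datetime(2024, 5, 6, 9, 0)
first_deploy = datetime(2024, 5, 6, 16, 20)
print(f"TTFD: {first_deploy - project_created} (target: <= 1 day)")

# Adoption rate: teams on the Golden Path / all teams
teams_total, teams_on_golden_path = 25, 21
print(f"Adoption rate: {teams_on_golden_path / teams_total:.0%} (target: >= 80%)")
```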

Migration Steps (As‑Is → Desired Boundary)

  1. As‑Is assessment: Service inventory, SLO presence, manual ops hotspots, tool sprawl, platform capability audit. Deliverables: RACI draft and issue backlog.

  2. Governance: Codify SLO standards, the Error Budget Policy, an incident severity taxonomy, and change safety valves (staged rollout/rollback).

  3. Platform charter: Target users, value hypotheses, roadmap. Publish the first IDP/service catalog.

  4. Golden Path MVP: Minimal set—CI/CD templates, IaC modules, observability baseline, a standard runtime (e.g., Kubernetes + service mesh).

  5. Engagement model: Self‑service by default. Exceptions via internal SRE review. Showback/chargeback for cost visibility.

  6. Rollout: Lighthouse teams to prove value → scale out. Review adoption/KPIs every sprint.

  7. Eliminate manual tickets: Productize frequent requests (API/CLI/portal); see the self-service sketch after this list. Prioritize by toil ratio.

  8. Security integration: Policy as Code, SBOM/signing, standardized secrets. Track exceptions via risk acceptance.

  9. Continuous improvement: Quarterly RACI/KPI review. Platform backlog prioritized by NPS and behavioral data.

Anti‑pattern early‑warning: (1) Platform becomes an approval desk; (2) SRE becomes manual ops staff. Watch self‑service ratio and toil ratio.
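For step 7, productizing frequent requests usually means putting the most common ticket types behind an API or CLI that the portal can call. A minimal sketch of what such a self-service entry point might look like; the module name, function, and URL are hypothetical placeholders for the platform's real IaC pipeline:

```python
# envctl.py -- hypothetical self-service CLI replacing "please create an environment" tickets.
import argparse

def provision_environment(team: str, template: str, ttl_hours: int) -> str:
    """Placeholder for the real call into the platform's IaC pipeline."""
    # A real implementation would trigger the IaC module published by the
    # Platform team and return a tracking URL from the developer portal.
    return f"https://portal.example.internal/envs/{team}-{template}?ttl={ttl_hours}h"

def main() -> None:
    parser = argparse.ArgumentParser(description="Self-service environment provisioning")
    parser.add_argument("--team", required=True)
    parser.add_argument("--template", default="web-service", help="Golden Path template name")
    parser.add_argument("--ttl-hours", type=int, default=72, help="auto-teardown after N hours")
    args = parser.parse_args()
    print(f"Environment requested; track progress at "
          f"{provision_environment(args.team, args.template, args.ttl_hours)}")

if __name__ == "__main__":
    main()
```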


Interview Prompts

For SRE roles

  1. SLO design case: An API misses its p99 latency SLO. Propose SLI/targets, error budget & release policy.

    • Good signals: User‑centric SLOs, observability design, staged rollout, evidence‑based trade‑offs.

  2. Reconstructing incident timelines: From ambiguous logs, rebuild causality. What extra telemetry would you add?

    • Good signals: Hypothesis testing, correlation vs causation, quick experiments, systematized learning.

  3. Reducing change risk: Safely ship hundreds of weekly deploys (gates, feature flags, rollout strategies).

  4. Toil reduction example: Automating a 30‑minute daily task; ROI calculation (see the worked example after this list).

  5. Capacity planning: Seasonality/campaigns; prioritization when SLOs are at risk.
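For prompt 4, the ROI arithmetic a candidate is expected to walk through is short; a worked sketch with illustrative numbers:

```python
# Illustrative ROI for automating a 30-minute daily manual task.
minutes_per_day = 30
workdays_per_year = 250
hours_saved_per_year = minutes_per_day * workdays_per_year / 60   # 125 hours/year

automation_cost_hours = 40                                          # assumed one-time build effort
payback_workdays = automation_cost_hours / (minutes_per_day / 60)   # ~80 workdays to break even

print(f"Hours saved per year: {hours_saved_per_year:.0f}")
print(f"Payback period: about {payback_workdays:.0f} workdays")
```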

For Platform roles

  1. Multi‑tenant Kubernetes: Isolation, cost sharing, upgrade strategy.

    • Good signals: Namespace/network isolation, OPA/policy, compatibility testing, auto‑rollback.

  2. IDP/Service catalog: Design the journey to first deploy and how you’ll measure it (TTFD).

  3. Operating the Golden Path: Handling teams that need to diverge (exception process/enablement/support).

  4. Migration plan: Phasing VM‑era services to Kubernetes with minimal downtime; observability first.

  5. Platform as a Product: Customer definition, KPIs/NPS, backlog prioritization (data/experiments/declarative SLOs).


Org Design Patterns (by scale)

  • Small (≤ ~50 engineers): SRE and Platform within one team. RACI still names A per item. On‑call rotates weekly.

  • Mid (50–200): Platform has a product manager. SRE uses a mix of embedded (near product teams) and central (standards).

  • Large (200+): Split into IDP, runtime platform, data platform, and central SRE. Add an SLO governance board and partner tightly with FinOps.


Mini‑Templates

Error Budget Policy (example)

  • Period: quarterly. Thresholds: >70% burned → feature freeze; >95% → full stop.

  • Exit criteria: Two consecutive weeks meeting SLOs and preventive fixes shipped.
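These thresholds are easy to enforce mechanically. A minimal sketch of a release gate a CI pipeline could call before promoting a change; the burn figure would come from the monitoring stack, and the function is a hypothetical illustration rather than a prescribed implementation:

```python
# Illustrative release gate implementing the example thresholds above.
FEATURE_FREEZE_THRESHOLD = 0.70   # >70% of budget burned: feature freeze
FULL_STOP_THRESHOLD = 0.95        # >95% of budget burned: full stop

def release_gate(budget_burned: float, is_reliability_fix: bool) -> bool:
    """Return True if the change may ship under the example policy."""
    if budget_burned > FULL_STOP_THRESHOLD:
        return False                          # full stop: block all changes
    if budget_burned > FEATURE_FREEZE_THRESHOLD:
        return is_reliability_fix             # freeze: only reliability work ships
    return True                               # budget healthy: normal releases

assert release_gate(0.40, is_reliability_fix=False)
assert not release_gate(0.80, is_reliability_fix=False)
assert release_gate(0.80, is_reliability_fix=True)
assert not release_gate(0.97, is_reliability_fix=True)
```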

Postmortem checklist (essentials)

  • Summary / Impact / Timeline / Root cause / Preventive actions / Verification plan / Owner & deadline.

Release safety valves (standard)

  • Automated tests / Canary / Progressive delivery / Health gates / Instant rollback / Change audit logs.
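A sketch of the health-gate and instant-rollback pieces: compare the canary's error rate against the stable baseline and roll back when it regresses beyond a tolerance. Metric sources, thresholds, and function names are illustrative assumptions:

```python
# Illustrative canary health gate: promote only when the canary is not
# meaningfully worse than the stable baseline.
BASELINE_TOLERANCE = 1.5    # canary error rate may be at most 1.5x the baseline
ABSOLUTE_CEILING = 0.02     # and never above a 2% error rate outright

def next_step(canary_error_rate: float, baseline_error_rate: float) -> str:
    """Decide the next progressive-delivery step for the change."""
    healthy = (canary_error_rate <= ABSOLUTE_CEILING
               and canary_error_rate <= baseline_error_rate * BASELINE_TOLERANCE)
    return "promote" if healthy else "rollback"   # rollback = instant revert to last known-good

print(next_step(canary_error_rate=0.004, baseline_error_rate=0.003))   # promote
print(next_step(canary_error_rate=0.031, baseline_error_rate=0.003))   # rollback
```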


Boundary Self‑Check (10 Yes/No)

  1. Is A for SLOs clearly with SRE?

  2. Is A for the Golden Path clearly with Platform?

  3. Is on‑call supported by systems, not heroics?

  4. Do you measure toil quarterly and set reduction targets?

  5. Do you measure Time to First Deploy?

  6. Is there an exception process with enablement options?

  7. Are change failure rate and rollback time visible?

  8. Do you tag platform‑caused incidents and feed them back?

  9. Is security integrated as Code from design time?

  10. Do you review & publish RACI/KPIs every quarter?


One‑Line Summary

  • SRE owns operational outcomes; Platform owns the tools that make them safe and fast. Anchor both in self‑service and SLO governance, then run by KPIs.
