Drawing the Line Between SRE and Platform: Responsibilities, KPIs, and Org Design
Primary KW: SRE vs Platform / Responsibilities
Goal: Turn fuzzy boundaries into an operational, plug‑and‑play template (RACI, KPIs, migration steps, interview prompts).
Key Takeaways
- SRE is the owner of service reliability objectives (SLOs): production operations, incident response, SLO/SLI practice, and change-risk governance.
- Platform provides the shared foundation as a product: it designs a Golden Path and improves Developer Experience (DevEx) and organizational throughput.
- Boundary principle: SRE is accountable for the operational outcomes of a service; Platform is accountable for supplying the mechanisms that make safe operation easy.
- Anti-patterns: SRE as a catch-all manual ops desk; Platform as a ticket factory. Counter both with self-service and a clear Error Budget Policy.
 
RACI (Roles & Responsibilities)
Legend: R=Responsible (does the work) / A=Accountable (final owner) / C=Consult / I=Inform
1) Operations & Reliability
| Item | SRE | Platform | Product Dev | Security | Infra/Network | 
|---|---|---|---|---|---|
| SLO/SLI design & operations | A/R | C | C | I | I | 
| Error budget policy (release gating) | A/R | C | C | I | I | 
| Production incident response | A/R | C | C | C | C | 
| Change management (safety valves/CAB) | A/R | C | C | C | I | 
| Capacity planning | A/R | C | C | I | C | 
| Postmortems & learning system | A/R | C | C | I | I | 
2) Platform & DevEx
| Item | SRE | Platform | Product Dev | Security | Infra/Network | 
|---|---|---|---|---|---|
| IDP/Developer portal & service catalog | I | A/R | C | C | I | 
| CI/CD templates & pipelines | C | A/R | C | I | I | 
| IaC modules & environment self‑service | C | A/R | C | C | I | 
| Observability stack (logs/metrics/traces) | C | A/R | I | C | I | 
| Kubernetes/runtime platform | C | A/R | I | C | C | 
| Golden Path (reference arch/SDKs) | C | A/R | C | I | I | 
3) Security & Governance
| Item | SRE | Platform | Product Dev | Security | Infra/Network | 
|---|---|---|---|---|---|
| IAM/Secrets management (the mechanism) | I | A/R | I | C | C | 
| Vulnerability mgmt (runtime/libs) | A/R | C | C | C | I | 
| Policy as Code (e.g., OPA) | C | A/R | I | C | I | 
| Audit & compliance evidence | C | A/R | I | A/R | I | 
Note: In small orgs SRE and Platform may be one team. Even then, keep A (ultimate ownership) explicit per item.
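Even when SRE and Platform sit in one team, the single-A rule can be checked mechanically. Below is a minimal Python sketch (the item and role names are illustrative, not a prescribed schema) that flags any RACI row without exactly one Accountable owner.

```python
# Minimal sketch: encode the RACI matrix as data and validate that every item
# names exactly one Accountable owner. Items and roles here are illustrative.
RACI = {
    "SLO/SLI design & operations": {"SRE": "A/R", "Platform": "C", "ProductDev": "C"},
    "CI/CD templates & pipelines": {"SRE": "C", "Platform": "A/R", "ProductDev": "C"},
    "Policy as Code":              {"SRE": "C", "Platform": "A/R", "ProductDev": "I"},
}

def validate(raci: dict) -> list[str]:
    """Return the items whose count of 'A' assignments is not exactly one."""
    problems = []
    for item, roles in raci.items():
        a_count = sum(1 for assignment in roles.values() if "A" in assignment.split("/"))
        if a_count != 1:
            problems.append(f"{item}: {a_count} accountable roles")
    return problems

if __name__ == "__main__":
    for problem in validate(RACI):
        print("RACI violation:", problem)
```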
KPIs (with target examples)
SRE KPIs
| Metric | Definition / How to measure | Target example | 
|---|---|---|
| SLO attainment | Quarterly availability/latency etc. | ≥ 99.9% (service‑specific) | 
| Error budget burn | Share of budget consumed: (1 − actual attainment) / (1 − SLO target) | ≤ 50% mid‑quarter, max 80% | 
| MTTA / MTTR | Detect→ack / restore | MTTA ≤ 5 min, MTTR ≤ 30 min (by severity) | 
| Change failure rate | Changes causing incidents / all changes | ≤ 10–15% | 
| Incident frequency | Count by severity / week, month | Downward trend | 
| Toil ratio | Manual work time / total time | ≤ 30% (cap 50%) | 
| Postmortem completion | On‑time PMs / total | ≥ 90% | 
| Action item completion | Preventive actions done on time | ≥ 85% | 
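To make the error-budget rows concrete, here is a minimal Python sketch of the burn calculation; the SLO target and request counts are illustrative assumptions.

```python
# Minimal sketch: compute SLO attainment and error-budget burn for a window.
# The SLO target and the request counts are illustrative, not real data.

SLO_TARGET = 0.999          # 99.9% availability objective for the quarter

def error_budget_burn(good_events: int, total_events: int, slo_target: float = SLO_TARGET) -> dict:
    """Return attainment and the fraction of the error budget consumed."""
    attainment = good_events / total_events
    budget = 1.0 - slo_target                 # allowed failure fraction (0.1%)
    consumed = (1.0 - attainment) / budget    # share of that allowance used
    return {"attainment": attainment, "budget_consumed": consumed}

# Example: 10,000,000 requests so far this quarter, 6,500 of them failed.
stats = error_budget_burn(good_events=10_000_000 - 6_500, total_events=10_000_000)
print(f"Attainment: {stats['attainment']:.4%}")            # 99.9350%
print(f"Budget consumed: {stats['budget_consumed']:.0%}")  # 65% -> above the 50% mid-quarter line
```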
Platform KPIs (Platform as a Product)
| Metric | Definition / How to measure | Target example | 
|---|---|---|
| Adoption rate | Share of teams using IDP/templates/modules | ≥ 80% | 
| Time to First Deploy | Time from project start to first prod/stage deploy | ≤ 1 day (ideally hours) | 
| Env provisioning time | From template to runnable env | ≤ 30 min | 
| Deployment frequency (DORA) | Team median | ≥ 1/week at minimum, trending toward multiple per day | 
| Lead time for changes (DORA) | PR creation → prod | ≤ 24 hours | 
| Platform‑caused incidents | Incidents attributable to platform | Downward trend; zero Sev‑1 | 
| Exception rate | Deploys outside Golden Path / all deploys | ≤ 10% | 
| DevEx / NPS | Quarterly survey | ≥ +30 | 
| Cost efficiency | Platform cost per service | Year‑over‑year improvement | 
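As a sketch of how adoption rate and Time to First Deploy could be derived from deployment records (the record layout and timestamps are assumed for illustration, not any particular tool's output):

```python
# Minimal sketch: derive adoption rate and Time to First Deploy (TTFD)
# from per-team records. The record layout is an assumed, illustrative schema.
from datetime import datetime

teams = [
    # (team, uses the Golden Path templates?, project start, first production deploy)
    ("payments",   True,  datetime(2024, 4, 1, 9, 0),  datetime(2024, 4, 1, 15, 30)),
    ("search",     True,  datetime(2024, 4, 2, 10, 0), datetime(2024, 4, 3, 9, 0)),
    ("legacy-crm", False, datetime(2024, 4, 1, 9, 0),  datetime(2024, 4, 8, 17, 0)),
]

adoption_rate = sum(1 for _, on_path, _, _ in teams if on_path) / len(teams)

ttfd_hours = sorted(
    (first_deploy - start).total_seconds() / 3600 for _, _, start, first_deploy in teams
)
median_ttfd = ttfd_hours[len(ttfd_hours) // 2]  # simple midpoint for an odd-sized sample

print(f"Adoption rate: {adoption_rate:.0%}")      # 67% -> below the 80% target
print(f"Median TTFD:   {median_ttfd:.1f} hours")  # 23.0 hours
```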
Migration Steps (As‑Is → Desired Boundary)
1. As-Is assessment: Service inventory, SLO presence, manual ops hotspots, tool sprawl, platform capability audit. Deliverables: RACI draft and issue backlog.
2. Governance: SLO standards, Error Budget Policy, incident severity taxonomy, and change safety valves (staged rollout/rollback) codified.
3. Platform charter: Target users, value hypotheses, roadmap. Publish the first IDP/service catalog.
4. Golden Path MVP: A minimal set of CI/CD templates, IaC modules, an observability baseline, and a standard runtime (e.g., Kubernetes + service mesh).
5. Engagement model: Self-service by default. Exceptions via internal SRE review. Showback/chargeback for cost visibility.
6. Rollout: Lighthouse teams to prove value → scale out. Review adoption/KPIs every sprint.
7. Eliminate manual tickets: Productize frequent requests (API/CLI/portal). Prioritize by toil ratio.
8. Security integration: Policy as Code, SBOM/signing, standardized secrets. Track exceptions via risk acceptance.
9. Continuous improvement: Quarterly RACI/KPI review. Platform backlog prioritized by NPS and behavioral data.
 
Anti‑pattern early‑warning: (1) Platform becomes an approval desk; (2) SRE becomes manual ops staff. Watch self‑service ratio and toil ratio.
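Both early-warning signals can be computed from the request stream. The sketch below assumes a simple illustrative ticket log rather than a specific ticketing system's API, and also ranks request types by manual hours to guide the productization in step 7.

```python
# Minimal sketch: self-service ratio and toil ratio from a request log,
# plus a ranking of request types by manual hours (candidates to productize).
# Ticket records, hour estimates, and the time window are illustrative assumptions.
from collections import Counter

requests = [
    # (request type, fulfilled via self-service?, manual engineer-hours spent)
    ("new environment", True,  0.0),
    ("new environment", True,  0.0),
    ("db credential",   False, 0.5),
    ("firewall change", False, 1.5),
    ("dns record",      True,  0.0),
]
total_engineer_hours = 40.0  # hours worked by the team in the same window

self_service_ratio = sum(1 for _, auto, _ in requests if auto) / len(requests)
toil_hours = sum(hours for _, _, hours in requests)
toil_ratio = toil_hours / total_engineer_hours

print(f"Self-service ratio: {self_service_ratio:.0%}")  # 60%
print(f"Toil ratio:         {toil_ratio:.0%}")          # 5%

# Rank request types by aggregate manual hours to pick the next automation target.
by_type = Counter()
for req_type, _, hours in requests:
    by_type[req_type] += hours
print(by_type.most_common(2))  # [('firewall change', 1.5), ('db credential', 0.5)]
```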
Interview Prompts
For SRE roles
- SLO design case: An API misses its p99 latency SLO. Propose SLIs/targets, an error budget, and a release policy.
  - Good signals: User-centric SLOs, observability design, staged rollout, evidence-based trade-offs.
- Reconstructing incident timelines: From ambiguous logs, rebuild causality. What extra telemetry would you add?
  - Good signals: Hypothesis testing, correlation vs. causation, quick experiments, systematized learning.
- Reducing change risk: Safely ship hundreds of weekly deploys (gates, feature flags, rollout strategies).
- Toil reduction example: Automating a 30-minute daily task; ROI calculation (see the worked example after this list).
- Capacity planning: Seasonality/campaigns; prioritization when SLOs are at risk.
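For the toil-reduction prompt, a worked ROI calculation might look like this; the task duration, automation effort, and loaded rate are purely illustrative assumptions.

```python
# Minimal sketch: ROI of automating a 30-minute daily manual task.
# All figures (work days, automation effort, loaded rate) are illustrative.

task_minutes_per_day = 30
working_days_per_year = 250
hours_saved_per_year = task_minutes_per_day * working_days_per_year / 60   # 125 h

automation_effort_hours = 40   # one-time cost to build and document the automation
loaded_rate_per_hour = 100     # fully loaded engineer cost, in your currency

annual_saving = hours_saved_per_year * loaded_rate_per_hour   # 12,500
build_cost = automation_effort_hours * loaded_rate_per_hour   # 4,000
payback_months = build_cost / (annual_saving / 12)            # ~3.8 months
roi_first_year = (annual_saving - build_cost) / build_cost    # ~2.1x

print(f"Hours saved/year: {hours_saved_per_year:.0f}")
print(f"Payback: {payback_months:.1f} months, first-year ROI: {roi_first_year:.1f}x")
```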
 
For Platform roles
- Multi-tenant Kubernetes: Isolation, cost sharing, upgrade strategy.
  - Good signals: Namespace/network isolation, OPA/policy, compatibility testing, auto-rollback.
- IDP/Service catalog: Design the journey to first deploy and how you'll measure it (TTFD).
- Operating the Golden Path: Handling teams that need to diverge (exception process/enablement/support).
- Migration plan: Phasing VM-era services to Kubernetes with minimal downtime; observability first.
- Platform as a Product: Customer definition, KPIs/NPS, backlog prioritization (data/experiments/declarative SLOs).
 
Org Design Patterns (by scale)
- Small (≤ ~50 engineers): SRE and Platform within one team. RACI still names an A per item. On-call rotates weekly.
- Mid (50–200): Platform has a product manager. SRE uses a mix of embedded (near product teams) and central (standards).
- Large (200+): Split into IDP, runtime platform, data platform, and central SRE. Add an SLO governance board and partner tightly with FinOps.
 
Mini‑Templates
Error Budget Policy (example)
- Period: quarterly. Thresholds: >70% burned → feature freeze; >95% → full stop (encoded as a simple gate below).
- Exit criteria: Two consecutive weeks meeting SLOs and preventive fixes shipped.
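Encoded as a simple release gate (thresholds copied from the bullets above; the decision labels are illustrative), the policy might look like:

```python
# Minimal sketch: turn the example Error Budget Policy into a release gate.
# Thresholds mirror the bullets above; the decision labels are illustrative.

def release_decision(budget_consumed: float) -> str:
    """Map the share of error budget consumed (0.0-1.0) to a release posture."""
    if budget_consumed > 0.95:
        return "full-stop"        # only reliability fixes ship
    if budget_consumed > 0.70:
        return "feature-freeze"   # features blocked, fixes and rollbacks allowed
    return "normal"               # standard release cadence

assert release_decision(0.40) == "normal"
assert release_decision(0.80) == "feature-freeze"
assert release_decision(0.97) == "full-stop"
```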
 
Postmortem checklist (essentials)
- Summary / Impact / Timeline / Root cause / Preventive actions / Verification plan / Owner & deadline.
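A lightweight way to track the postmortem and action-item KPIs from the earlier table is to keep each postmortem as structured data; the field names and example values below are illustrative, not a prescribed schema.

```python
# Minimal sketch: postmortem records with a completeness check.
# Field names, severities, and dates are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date
    done: bool = False

@dataclass
class Postmortem:
    summary: str
    impact: str
    root_cause: str
    actions: list[ActionItem] = field(default_factory=list)

    def complete(self) -> bool:
        """Essentials filled in and at least one preventive action with an owner and deadline."""
        return all([self.summary, self.impact, self.root_cause, self.actions])

pm = Postmortem(
    summary="Checkout latency SLO breach",
    impact="p99 > 2s for 40 min, ~3% of sessions affected",
    root_cause="Connection pool exhaustion after config change",
    actions=[ActionItem("Add pool saturation alert", "alice", date(2024, 7, 1))],
)
print("Postmortem complete:", pm.complete())  # True
```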
 
Release safety valves (standard)
- Automated tests / Canary / Progressive delivery / Health gates / Instant rollback / Change audit logs (see the canary gate sketch below).
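As an illustration of the health-gate plus instant-rollback pair, the sketch below compares a canary's error rate against the stable baseline and decides whether to promote or roll back; the tolerance and metric source are assumptions, not a specific delivery tool's API.

```python
# Minimal sketch: a canary health gate for progressive delivery.
# Error-rate inputs, the tolerance, and the decision names are illustrative;
# a real setup would pull these from your metrics backend.

def canary_gate(canary_error_rate: float, baseline_error_rate: float,
                tolerance: float = 0.005) -> str:
    """Promote the canary only if it is not meaningfully worse than baseline."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return "rollback"   # instant rollback: shift traffic back to the stable version
    return "promote"        # advance to the next traffic step (e.g., 5% -> 25% -> 100%)

# Example: canary at 1.9% errors vs. 0.4% baseline -> roll back.
print(canary_gate(canary_error_rate=0.019, baseline_error_rate=0.004))  # rollback
print(canary_gate(canary_error_rate=0.005, baseline_error_rate=0.004))  # promote
```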
 
Boundary Self‑Check (10 Yes/No)
1. Is the A for SLOs clearly with SRE?
2. Is the A for the Golden Path clearly with Platform?
3. Is on-call supported by systems, not heroics?
4. Do you measure toil quarterly and set reduction targets?
5. Do you measure Time to First Deploy?
6. Is there an exception process with enablement options?
7. Are change failure rate and rollback time visible?
8. Do you tag platform-caused incidents and feed them back?
9. Is security integrated as code from design time?
10. Do you review and publish RACI/KPIs every quarter?
 
One‑Line Summary
- SRE owns operational outcomes; Platform owns the tools that make them safe and fast. Anchor both in self-service and SLO governance, then run by KPIs.
 