Drawing the Line Between SRE and Platform: Responsibilities, KPIs, and Org Design
Primary KW: SRE vs Platform / Responsibilities
Goal: Turn fuzzy boundaries into an operational, plug‑and‑play template (RACI, KPIs, migration steps, interview prompts).
Key Takeaways
- SRE owns service reliability objectives (SLOs): production operations, incident response, SLO/SLI practice, and change-risk governance.
- Platform provides the shared foundation as a product: it designs the Golden Path and improves Developer Experience (DevEx) and organizational throughput.
- Boundary principle: SRE is accountable for the operational outcomes of a service; Platform is accountable for supplying the mechanisms that make safe operation easy.
- Anti-patterns: SRE as a catch-all manual ops desk; Platform as a ticket factory. Counter both with self-service and a clear Error Budget Policy.
RACI (Roles & Responsibilities)
Legend: R=Responsible (does the work) / A=Accountable (final owner) / C=Consult / I=Inform
1) Operations & Reliability
Item | SRE | Platform | Product Dev | Security | Infra/Network |
---|---|---|---|---|---|
SLO/SLI design & operations | A/R | C | C | I | I |
Error budget policy (release gating) | A/R | C | C | I | I |
Production incident response | A/R | C | C | C | C |
Change management (safety valves/CAB) | A/R | C | C | C | I |
Capacity planning | A/R | C | C | I | C |
Postmortems & learning system | A/R | C | C | I | I |
2) Platform & DevEx
Item | SRE | Platform | Product Dev | Security | Infra/Network |
---|---|---|---|---|---|
IDP/Developer portal & service catalog | I | A/R | C | C | I |
CI/CD templates & pipelines | C | A/R | C | I | I |
IaC modules & environment self‑service | C | A/R | C | C | I |
Observability stack (logs/metrics/traces) | C | A/R | I | C | I |
Kubernetes/runtime platform | C | A/R | I | C | C |
Golden Path (reference arch/SDKs) | C | A/R | C | I | I |
3) Security & Governance
Item | SRE | Platform | Product Dev | Security | Infra/Network |
---|---|---|---|---|---|
IAM/Secrets management (the mechanism) | I | A/R | I | C | C |
Vulnerability mgmt (runtime/libs) | A/R | C | C | C | I |
Policy as Code (e.g., OPA) | C | A/R | I | C | I |
Audit & compliance evidence | C | R | I | A/R | I |
Note: In small orgs SRE and Platform may be one team. Even then, keep A (ultimate ownership) explicit per item.
KPIs (with target examples)
SRE KPIs
Metric | Definition / How to measure | Target example |
---|---|---|
SLO attainment | Quarterly attainment of availability/latency SLOs | ≥ 99.9% (service-specific) |
Error budget burn | Share of the error budget consumed: (1 − attainment) / (1 − SLO target) | ≤ 50% by mid-quarter, ≤ 80% max |
MTTA / MTTR | Time from detection to acknowledgement / to restoration | MTTA ≤ 5 min, MTTR ≤ 30 min (by severity) |
Change failure rate | Changes causing incidents / all changes | ≤ 10–15% |
Incident frequency | Count by severity / week, month | Downward trend |
Toil ratio | Manual work time / total time | ≤ 30% (cap 50%) |
Postmortem completion | On‑time PMs / total | ≥ 90% |
Action item completion | Preventive actions done on time | ≥ 85% |
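To make the first two rows concrete, here is a minimal Python sketch of the SLO attainment and error budget burn math; the quarterly target and the request counts are illustrative assumptions, not real figures.

```python
# Minimal sketch: SLO attainment and error budget burn for a request-based SLI.
# SLO_TARGET and the traffic numbers below are hypothetical placeholders.

SLO_TARGET = 0.999            # quarterly availability target (99.9%)
total_requests = 120_000_000  # quarter-to-date requests (illustrative)
bad_requests = 54_000         # failed or too-slow requests (illustrative)

attainment = 1 - bad_requests / total_requests    # SLO attainment
error_budget = 1 - SLO_TARGET                     # allowed failure fraction
budget_burn = (1 - attainment) / error_budget     # share of the budget consumed

print(f"Attainment:  {attainment:.4%}")   # 99.9550%
print(f"Budget burn: {budget_burn:.0%}")  # 45% -> within the <=50% mid-quarter cap
```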
Platform KPIs (Platform as a Product)
Metric | Definition / How to measure | Target example |
---|---|---|
Adoption rate | Share of teams using IDP/templates/modules | ≥ 80% |
Time to First Deploy | Time from project start to first prod/stage deploy | ≤ 1 day (ideally hours) |
Env provisioning time | From template to runnable env | ≤ 30 min |
Deployment frequency (DORA) | Team median deploys to production | Elite: multiple per day; floor: ≥ 1/week |
Lead time for changes (DORA) | PR creation → prod | ≤ 24 hours |
Platform‑caused incidents | Incidents attributable to platform | Downward trend; zero Sev‑1 |
Exception rate | Deploys outside Golden Path / all deploys | ≤ 10% |
DevEx / NPS | Quarterly survey | ≥ +30 |
Cost efficiency | Platform cost per service | Year‑over‑year improvement |
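As a rough illustration, adoption rate and Time to First Deploy could be derived from a service-catalog export along these lines; the record fields, timestamps, and the `golden_path` flag are hypothetical.

```python
from datetime import datetime
from statistics import median

# Hypothetical service-catalog export: creation time, first production deploy,
# and whether the service was bootstrapped from Golden Path templates.
services = [
    {"name": "checkout",   "created": datetime(2024, 5, 2, 9, 0),  "first_deploy": datetime(2024, 5, 2, 14, 30), "golden_path": True},
    {"name": "search",     "created": datetime(2024, 5, 6, 10, 0), "first_deploy": datetime(2024, 5, 7, 9, 15),  "golden_path": True},
    {"name": "legacy-crm", "created": datetime(2024, 5, 1, 8, 0),  "first_deploy": datetime(2024, 5, 9, 17, 0),  "golden_path": False},
]

adoption_rate = sum(s["golden_path"] for s in services) / len(services)
ttfd_hours = median(
    (s["first_deploy"] - s["created"]).total_seconds() / 3600 for s in services
)

print(f"Adoption rate: {adoption_rate:.0%}")               # target: >= 80%
print(f"Median Time to First Deploy: {ttfd_hours:.1f} h")  # target: <= 24 h
```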
Migration Steps (As‑Is → Desired Boundary)
1. As-Is assessment: service inventory, SLO coverage, manual-ops hotspots, tool sprawl, platform capability audit. Deliverables: a RACI draft and an issue backlog.
2. Governance: codify SLO standards, the Error Budget Policy, an incident severity taxonomy, and change safety valves (staged rollout/rollback).
3. Platform charter: target users, value hypotheses, roadmap. Publish the first IDP/service catalog.
4. Golden Path MVP: a minimal set of CI/CD templates, IaC modules, an observability baseline, and a standard runtime (e.g., Kubernetes + service mesh).
5. Engagement model: self-service by default; exceptions go through internal SRE review; showback/chargeback for cost visibility.
6. Rollout: prove value with lighthouse teams, then scale out. Review adoption and KPIs every sprint.
7. Eliminate manual tickets: productize frequent requests (API/CLI/portal). Prioritize by toil ratio.
8. Security integration: Policy as Code, SBOM/signing, standardized secrets management. Track exceptions via risk acceptance.
9. Continuous improvement: quarterly RACI/KPI review. Prioritize the platform backlog by NPS and behavioral data.
Anti‑pattern early‑warning: (1) Platform becomes an approval desk; (2) SRE becomes manual ops staff. Watch self‑service ratio and toil ratio.
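A simple way to operationalize that early warning is to compute both ratios from existing ticket and time-tracking data; the categories and numbers below are hypothetical, the 30% toil threshold matches the KPI table above, and the 80% self-service threshold is an assumed example.

```python
# Early-warning check on the self-service ratio and toil ratio.
# Ticket counts and hour splits are hypothetical placeholders.

platform_requests = {"self_service": 412, "manual_ticket": 58}   # last month
sre_hours = {"toil": 46, "engineering": 114}                     # last month

self_service_ratio = platform_requests["self_service"] / sum(platform_requests.values())
toil_ratio = sre_hours["toil"] / sum(sre_hours.values())

if self_service_ratio < 0.80:
    print("Warning: Platform is drifting toward a ticket/approval desk")
if toil_ratio > 0.30:
    print("Warning: SRE is drifting toward manual ops (toil above the 30% target)")

print(f"Self-service ratio: {self_service_ratio:.0%}, toil ratio: {toil_ratio:.0%}")
```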
Interview Prompts
For SRE roles
- SLO design case: an API misses its p99 latency SLO. Propose SLIs, targets, and an error budget and release policy.
  - Good signals: user-centric SLOs, observability design, staged rollout, evidence-based trade-offs.
- Reconstructing incident timelines: rebuild causality from ambiguous logs. What extra telemetry would you add?
  - Good signals: hypothesis testing, separating correlation from causation, quick experiments, systematized learning.
- Reducing change risk: how to safely ship hundreds of deploys per week (gates, feature flags, rollout strategies).
- Toil reduction example: automating a 30-minute daily task, including the ROI calculation.
- Capacity planning: seasonality and campaigns; how to prioritize when SLOs are at risk.
For Platform roles
- Multi-tenant Kubernetes: isolation, cost sharing, upgrade strategy.
  - Good signals: namespace/network isolation, OPA/policy, compatibility testing, automatic rollback.
- IDP/service catalog: design the journey to first deploy and how you'll measure it (TTFD).
- Operating the Golden Path: handling teams that need to diverge (exception process, enablement, support).
- Migration plan: phasing VM-era services onto Kubernetes with minimal downtime; observability first.
- Platform as a Product: customer definition, KPIs/NPS, backlog prioritization (data, experiments, declarative SLOs).
Org Design Patterns (by scale)
- Small (≤ ~50 engineers): SRE and Platform within one team. The RACI still names an A for each item. On-call rotates weekly.
- Mid (50–200): Platform gets a product manager. SRE mixes embedded engineers (close to product teams) with a central team (standards).
- Large (200+): split into IDP, runtime platform, data platform, and central SRE teams. Add an SLO governance board and partner tightly with FinOps.
Mini‑Templates
Error Budget Policy (example)
- Period: quarterly. Thresholds: > 70% of the budget burned → feature freeze; > 95% → full release stop (see the gate sketch below).
- Exit criteria: two consecutive weeks meeting SLOs and preventive fixes shipped.
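As a sketch, the thresholds above could be encoded as a small release gate that CI or the deploy pipeline consults before shipping features; the mode names and the function are illustrative, and only the 70%/95% thresholds come from the example policy.

```python
from enum import Enum

class ReleaseMode(Enum):
    NORMAL = "normal releases"
    FEATURE_FREEZE = "feature freeze: reliability work and fixes only"
    FULL_STOP = "full stop: emergency changes only"

def release_gate(budget_burn: float) -> ReleaseMode:
    """Map error budget burn (0.0 = untouched, 1.0 = fully spent) to a release mode."""
    if budget_burn > 0.95:
        return ReleaseMode.FULL_STOP
    if budget_burn > 0.70:
        return ReleaseMode.FEATURE_FREEZE
    return ReleaseMode.NORMAL

# Example: 78% of the quarterly budget burned -> feature freeze
print(release_gate(0.78).value)
```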
Postmortem checklist (essentials)
- Summary / Impact / Timeline / Root cause / Preventive actions / Verification plan / Owner & deadline.
Release safety valves (standard)
- Automated tests / Canary / Progressive delivery / Health gates / Instant rollback / Change audit logs.
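To illustrate the canary/health-gate valves, here is a minimal promotion check that compares the canary's error rate against the baseline; the margin, minimum sample size, and example counts are assumptions, not recommended values.

```python
# Hypothetical canary health gate: promote only if the canary's error rate stays
# within an allowed margin of the baseline; otherwise trigger an instant rollback.

def canary_healthy(canary_errors: int, canary_total: int,
                   baseline_errors: int, baseline_total: int,
                   max_relative_increase: float = 1.5,
                   min_sample: int = 1_000) -> bool:
    """Return True to promote the canary, False to roll it back."""
    if canary_total < min_sample:
        return False  # not enough canary traffic yet to judge
    canary_rate = canary_errors / canary_total
    baseline_rate = max(baseline_errors / baseline_total, 1e-6)  # avoid zero baseline
    return canary_rate <= baseline_rate * max_relative_increase

# Example: 0.12% canary errors vs 0.10% baseline -> within the margin, promote
print(canary_healthy(12, 10_000, 100, 100_000))  # True
```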
Boundary Self‑Check (10 Yes/No)
- Is the A for SLOs clearly with SRE?
- Is the A for the Golden Path clearly with Platform?
- Is on-call supported by systems, not heroics?
- Do you measure toil quarterly and set reduction targets?
- Do you measure Time to First Deploy?
- Is there an exception process with enablement options?
- Are change failure rate and rollback time visible?
- Do you tag platform-caused incidents and feed them back?
- Is security integrated as code from design time?
- Do you review and publish RACI/KPIs every quarter?
One‑Line Summary
- SRE owns operational outcomes; Platform owns the tools that make them safe and fast. Anchor both in self-service and SLO governance, then run by KPIs.