Case-Based “Oops” Collection: 10 Platform Engineering Anti-Patterns
Each anti-pattern opens with a short case, followed by four parts: Symptoms → Causes → Remedies → References.
1) Tooling as the Goal (Tool-Shopping Platform)
Case: Company A adopted the latest CI/CD and IaC tools one after another, yet developer experience did not improve.
- Symptoms: Tool sprawl; no standards; tools abandoned after rollout; low adoption.
- Causes: Zero user research; no product mindset or roadmap.
- Remedies: Treat the platform as an internal product; assign a PM; define adoption and effective-use metrics plus SLOs; run a build–measure–learn loop.
- References: Team Topologies; Accelerate (The Science of Lean Software and DevOps).
2) Ticket-Driven Support (No Self-Service)
Case: Company B requires tickets for every environment request; queues never clear.
- Symptoms: Longer lead times; heavy dependence on specific individuals; after-hours burnout.
- Causes: No self-service design; slow progress on APIs and automation.
- Remedies: Provide a self-service portal and published APIs; standardize with catalogs (templates/modules); shift approvals to policy automation.
- References: Site Reliability Engineering; The DevOps Handbook.
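The shift from ticket approvals to policy automation can be illustrated with a small sketch. The catalog entries, the cost limit, and the `EnvRequest` fields are all hypothetical; a real platform would plug in its own rules.

```python
from dataclasses import dataclass

# Hypothetical policy inputs; real values would come from the platform team.
ALLOWED_TEMPLATES = {"web-service", "worker", "static-site"}
MAX_AUTO_APPROVED_MONTHLY_COST = 500  # arbitrary currency unit


@dataclass(frozen=True)
class EnvRequest:
    team: str
    template: str
    environment: str  # e.g. "dev", "staging", "prod"
    estimated_monthly_cost: int


def decide(req: EnvRequest) -> str:
    """Return a decision without a human in the loop for the common case."""
    if req.template not in ALLOWED_TEMPLATES:
        return "rejected: template is not in the catalog"
    # Only risky or expensive requests fall back to a human review.
    if (
        req.environment == "prod"
        or req.estimated_monthly_cost > MAX_AUTO_APPROVED_MONTHLY_COST
    ):
        return "needs-review"
    return "auto-approved"


print(decide(EnvRequest("payments", "web-service", "dev", 120)))
print(decide(EnvRequest("payments", "web-service", "prod", 120)))
```

The point of the design is that the routine request never enters a queue; only exceptions reach a person.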
3) Over-Governance (Golden Handcuffs)
Case: Company C’s rules are mostly “don’ts,” driving workarounds and shadow IT.
- Symptoms: Frequent exceptions, review bottlenecks, lower psychological safety.
- Causes: Blanket enforcement; no phased rollout or exception design.
- Remedies: Offer a Paved Road; use a scorecard and phased mandates (optional → required); make the recommended path the safest and fastest.
- References: Building Evolutionary Architectures; Team Topologies.
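One way to picture a scorecard with phased mandates is the sketch below. The check names and the two-phase `Phase` enum are assumptions made for illustration, not a prescribed scheme.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    OPTIONAL = "optional"  # reported on the scorecard, never blocks
    REQUIRED = "required"  # a failure blocks the release


@dataclass(frozen=True)
class Check:
    name: str
    phase: Phase


# Hypothetical checks; promote one from OPTIONAL to REQUIRED once adoption is high.
CHECKS = [
    Check("uses_paved_road_template", Phase.REQUIRED),
    Check("has_slo_defined", Phase.OPTIONAL),
    Check("images_signed", Phase.OPTIONAL),
]


def score(results: dict[str, bool]) -> dict:
    """Build a scorecard: pass ratio plus the list of blocking failures."""
    passed = sum(1 for c in CHECKS if results.get(c.name, False))
    blocking = [
        c.name
        for c in CHECKS
        if c.phase is Phase.REQUIRED and not results.get(c.name, False)
    ]
    return {
        "score": passed / len(CHECKS),
        "blocking": blocking,
        "allowed": not blocking,
    }


card = score({"uses_paved_road_template": True, "has_slo_defined": False})
print(card)
```

Because optional checks only lower the score, teams see where they stand before a rule becomes mandatory.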
4) Customerless Platform (Solves No One’s Pain)
Case: Company D built “for everyone,” so no one used it.
- Symptoms: Low adoption; divergent stacks; expectation gaps.
- Causes: No personas; no developer-experience (DX) research.
- Remedies: Map developer journeys; choose a priority segment; manage outcomes with a North-Star metric set (Lead Time, Deployment Frequency, Change Failure Rate, MTTR).
- References: Accelerate; Team Topologies.
5) Neglected Golden Path (Overgrown)
Case: Company E’s templates went a year without updates; drift began on day one.
- Symptoms: Outdated starters; per-project manual fixes.
- Causes: No lifecycle ownership; no SLOs.
- Remedies: Version templates/modules; publish deprecation policies; run periodic “path health” reviews and announcements.
- References: The SRE Workbook; Release It!.
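A periodic "path health" review could start from something as simple as the sketch below. The version numbers, end-of-support dates, and project names are invented for the example.

```python
from datetime import date

# Hypothetical deprecation policy: template version -> end-of-support date.
DEPRECATIONS = {
    "1.0": date(2024, 1, 31),
    "1.1": date(2024, 9, 30),
}
LATEST = "2.0"


def path_health(projects: dict[str, str], today: date) -> dict[str, str]:
    """Classify each project by the template version it was generated from."""
    report = {}
    for name, version in projects.items():
        if version == LATEST:
            report[name] = "current"
        elif version in DEPRECATIONS and today > DEPRECATIONS[version]:
            report[name] = "unsupported"
        else:
            report[name] = "upgrade-available"
    return report


report = path_health(
    {"billing": "2.0", "search": "1.1", "legacy-api": "1.0"},
    today=date(2024, 6, 1),
)
print(report)
```

The output gives the review a concrete agenda: which projects are past end of support, and which merely have an upgrade waiting.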
6) Metrics Theater
Case: Company F boasts “environments created” and “job counts,” yet business impact is unclear.
- Symptoms: No visible outcomes; ROI can’t be explained.
- Causes: Vanity metrics unlinked to value hypotheses.
- Remedies: Use the four DORA metrics (Lead Time, Deployment Frequency, Change Failure Rate, MTTR) plus adoption and DevEx satisfaction to make outcomes visible.
- References: Accelerate; The DevOps Handbook.
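As a rough illustration of how the four metrics might be derived from deployment records, here is a minimal sketch. The `Deployment` record shape and the sample data are assumptions; real pipelines would pull these timestamps from CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_metrics(deploys: list[Deployment], window_days: int) -> dict:
    """Compute the four metrics over one observation window."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deploys if d.failed]
    restore_hours = [
        hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "lead_time_hours_median": median(
            hours(d.deployed_at - d.committed_at) for d in deploys
        ),
        "deploys_per_day": len(deploys) / window_days,
        "change_failure_rate": len(failures) / len(deploys),
        # None means no restored failure was observed in the window.
        "mttr_hours": sum(restore_hours) / len(restore_hours) if restore_hours else None,
    }


t0 = datetime(2024, 1, 1, 9, 0)
day = timedelta(days=1)
deploys = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0 + day, t0 + day + timedelta(hours=4), failed=True,
               restored_at=t0 + day + timedelta(hours=5)),
    Deployment(t0 + 2 * day, t0 + 2 * day + timedelta(hours=6)),
    Deployment(t0 + 3 * day, t0 + 3 * day + timedelta(hours=8)),
]
metrics = dora_metrics(deploys, window_days=7)
print(metrics)
```

Unlike "environments created," each of these numbers can be tied to a value hypothesis, for example "the new template cuts median lead time."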
7) YAML Hell (Overexposed Low-Level Config)
Case: Company G teams copy-paste similar YAML; tiny tweaks break pipelines.
- Symptoms: Duplicated configs; misconfig incidents; higher learning cost.
- Causes: Insufficient abstraction; lack of reusable modules.
- Remedies: Provide opinionated abstractions (reusable IaC/CI templates, a Platform API, Backstage plugins) with safe defaults.
- References: Terraform: Up & Running; Kubernetes Patterns.
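An opinionated abstraction with safe defaults can be sketched as a small spec that expands into the full low-level config. The `ServiceSpec` fields and the rendered config shape below are hypothetical and do not correspond to any real manifest format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """The only fields a team has to think about (all names hypothetical)."""
    name: str
    image: str
    replicas: int = 2


def render(spec: ServiceSpec) -> dict:
    """Expand the small spec into a full config, filling in safe defaults."""
    if spec.replicas < 1:
        raise ValueError("replicas must be >= 1")
    # Look for the tag only in the last path component, so a registry port
    # such as "registry:5000/app" is not mistaken for a tag.
    last_component = spec.image.rsplit("/", 1)[-1]
    if ":" not in last_component or last_component.endswith(":latest"):
        raise ValueError("image must be pinned to an explicit tag, not 'latest'")
    return {
        "service": spec.name,
        "image": spec.image,
        "replicas": spec.replicas,
        # Defaults the platform owns, so teams do not copy-paste them.
        "resources": {"cpu": "250m", "memory": "256Mi"},
        "security": {"run_as_non_root": True, "read_only_root_fs": True},
        "health_check": {"path": "/healthz", "interval_seconds": 10},
    }


config = render(ServiceSpec("checkout", "registry.example.com/checkout:1.4.2"))
print(config)
```

Teams edit three fields instead of a page of YAML, and a change to the defaults happens in one place.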
8) One-True-Answer: “Kubernetes for Everything”
Case: Company H forces every workload onto Kubernetes; cost and delay increase.
- Symptoms: Misfit for batch/data workloads; operational complexity; worse TCO.
- Causes: Technology monoculture; no workload classification.
- Remedies: Embrace runtime diversity (K8s, serverless, batch, data, edge); define a workload taxonomy and fitness criteria.
- References: Kubernetes: Up & Running; Cloud Native Patterns.
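A workload taxonomy with fitness criteria might look like the sketch below. The attributes, the 15-minute threshold, and the runtime labels are illustrative assumptions; each organization would define its own criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Workload:
    name: str
    long_running: bool
    stateful: bool
    event_driven: bool
    max_runtime_minutes: Optional[int] = None


def recommend_runtime(w: Workload) -> str:
    """Apply ordered, hypothetical fitness rules; first match wins."""
    short_lived = w.max_runtime_minutes is not None and w.max_runtime_minutes <= 15
    if w.event_driven and not w.stateful and short_lived:
        return "serverless"
    if not w.long_running:
        return "batch"
    if w.stateful:
        return "managed-data-service"
    return "kubernetes"


workloads = [
    Workload("thumbnailer", long_running=False, stateful=False,
             event_driven=True, max_runtime_minutes=2),
    Workload("nightly-report", long_running=False, stateful=False,
             event_driven=False, max_runtime_minutes=90),
    Workload("orders-db", long_running=True, stateful=True, event_driven=False),
    Workload("checkout-api", long_running=True, stateful=False, event_driven=False),
]
recommendations = {w.name: recommend_runtime(w) for w in workloads}
print(recommendations)
```

Writing the rules down makes Kubernetes one answer among several instead of the default for every workload.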
9) Bolt-On Security (Gate, Not Guardrails)
Case: Company I requires final-stage security approvals, creating bottlenecks.
- Symptoms: Audit findings; approval queues; more workarounds.
- Causes: No shift-left; preference for manual gates over automated guardrails.
- Remedies: Policy-as-Code (e.g., OPA); SBOM and signing; secrets management; embed scans into the pipeline by default.
- References: Securing DevOps; OpenSSF guides.
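The difference between a manual gate and an automated guardrail can be shown with a small check that runs in the pipeline. This is a plain-Python stand-in, not OPA itself (OPA policies are written in Rego); the release fields and the `secretref:` prefix are hypothetical.

```python
import re

# Environment variable names that suggest a secret (heuristic, not exhaustive).
SECRET_PATTERN = re.compile(r"(password|secret|token|api[_-]?key)", re.IGNORECASE)


def evaluate(release: dict) -> list[str]:
    """Return policy violations; an empty list means the release may proceed."""
    violations = []
    if not release.get("image_signed", False):
        violations.append("image is not signed")
    if not release.get("sbom_path"):
        violations.append("SBOM is missing")
    for key, value in release.get("env", {}).items():
        # Values that do not reference the secrets manager count as plaintext.
        if SECRET_PATTERN.search(key) and not str(value).startswith("secretref:"):
            violations.append(f"env var {key} holds a plaintext secret")
    return violations


release = {
    "image_signed": True,
    "sbom_path": "",
    "env": {
        "LOG_LEVEL": "info",
        "DB_PASSWORD": "hunter2",
        "API_TOKEN": "secretref:payments/api-token",
    },
}
violations = evaluate(release)
print(violations)
```

Running this on every build gives developers feedback in minutes, which is what removes the final-stage approval queue.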
10) Build-and-Abandon (No Operations Design)
Case: Company J ships the platform without defining operations or RACI; incidents overwhelm the organization.
- Symptoms: No one on call; long incidents.
- Causes: Missing support model, SLA/SLOs, and runbooks.
- Remedies: Clarify RACI; layer support (L1/L2/L3); design on-call/rosters; standardize runbooks and post-incident reviews.
- References: The SRE Workbook; Incident Management for Operations.
Appendix: Quick Symptom Checklist
- Tool sprawl; falling adoption; ticket queues; policy bypasses; low utilization.
- Stale templates; config copy-paste; K8s-as-hammer; approval bottlenecks; no on-call.
How to Use
- Have each team mark applicable Symptoms; prioritize anti-patterns with the most hits for quarterly OKRs.
- Run book clubs for the references and translate takeaways directly into template, policy, and platform improvements.
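The tallying step can be automated with a few lines. The symptom-to-anti-pattern mapping below is one possible reading of the checklist and the team names are invented; adjust both to your own context.

```python
from collections import Counter

# Hypothetical mapping from checklist symptoms to anti-patterns.
SYMPTOM_TO_ANTIPATTERN = {
    "tool sprawl": "1) Tooling as the Goal",
    "ticket queues": "2) Ticket-Driven Support",
    "policy bypasses": "3) Over-Governance",
    "stale templates": "5) Neglected Golden Path",
    "config copy-paste": "7) YAML Hell",
    "approval bottlenecks": "9) Bolt-On Security",
    "no on-call": "10) Build-and-Abandon",
}


def prioritize(team_marks: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Count hits per anti-pattern across teams, most frequent first."""
    hits = Counter(
        SYMPTOM_TO_ANTIPATTERN[symptom]
        for symptoms in team_marks.values()
        for symptom in symptoms
        if symptom in SYMPTOM_TO_ANTIPATTERN
    )
    return hits.most_common()


ranking = prioritize({
    "payments": ["ticket queues", "stale templates"],
    "search": ["ticket queues", "config copy-paste"],
    "mobile": ["ticket queues", "stale templates"],
})
print(ranking)
```

The top entries of the ranking are the candidates for next quarter's OKRs.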