Case-Based “Oops” Collection: 10 Platform Engineering Anti-Patterns

Each anti-pattern is presented as a short case followed by a four-part set: Symptoms → Causes → Remedies → References.

1) Tooling as the Goal (Tool-Shopping Platform)

Case: Company A adopted the latest CI/CD and IaC one after another, yet developer experience did not improve.

  • Symptoms: Tool sprawl, no standards, abandoned post-rollout, low adoption.
  • Causes: Zero user research; no product mindset or roadmap.
  • Remedies: Treat the platform as an internal product; assign a PM; define adoption and effective-use metrics plus SLOs (a measurement sketch follows this list); run a build–measure–learn loop.
  • References: Team Topologies; Accelerate: The Science of Lean Software and DevOps.
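
To make the measurement half of that remedy concrete, here is a minimal Python sketch that computes an adoption rate and an effective-use rate from a hypothetical list of team records and compares them to illustrative SLO-style targets. The TeamUsage record, field names, and thresholds are assumptions for illustration, not a real platform schema.

    # Minimal sketch: adoption / effective-use metrics for an internal platform.
    # All field names and targets are illustrative assumptions, not a real schema.
    from dataclasses import dataclass

    @dataclass
    class TeamUsage:
        team: str
        onboarded: bool          # team has at least one service on the platform
        weekly_deploys: int      # deploys through the platform in the last week

    def adoption_metrics(teams: list[TeamUsage]) -> dict[str, float]:
        total = len(teams)
        onboarded = sum(t.onboarded for t in teams)
        active = sum(t.onboarded and t.weekly_deploys >= 1 for t in teams)
        return {
            "adoption_rate": onboarded / total,    # share of teams onboarded
            "effective_use_rate": active / total,  # share of teams actually deploying
        }

    # Illustrative SLO-style targets the platform team commits to.
    TARGETS = {"adoption_rate": 0.70, "effective_use_rate": 0.50}

    if __name__ == "__main__":
        sample = [
            TeamUsage("payments", True, 12),
            TeamUsage("search", True, 0),
            TeamUsage("mobile", False, 0),
        ]
        for name, value in adoption_metrics(sample).items():
            status = "OK" if value >= TARGETS[name] else "BELOW TARGET"
            print(f"{name}: {value:.0%} ({status})")

Tracking a pair of numbers like these per quarter tends to be a better build–measure–learn signal than counting tools rolled out.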

2) Ticket-Driven Support (No Self-Service)

Case: Company B requires tickets for every environment request; queues never clear.

  • Symptoms: Longer lead times, heavy person-dependence, after-hours burnout.
  • Causes: Lack of self-service design; slow API-ization and automation.
  • Remedies: Provide a self-service portal and public APIs; standardize with catalogs (templates/modules); shift approvals to policy automation (a sketch follows this list).
  • References: Site Reliability Engineering; The DevOps Handbook.
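
As a minimal sketch of replacing the ticket queue with policy-backed self-service, the snippet below models an environment request that is approved automatically when codified rules pass and only falls back to human review for genuine exceptions. The catalog entries, size rules, and the request_environment function are hypothetical.

    # Sketch: self-service environment requests gated by automated policy,
    # not by a ticket queue. Catalog names and policy rules are hypothetical.
    ALLOWED_SIZES = {"dev": ["small", "medium"], "staging": ["medium"], "prod": ["medium", "large"]}
    CATALOG = {"web-service", "worker", "postgres"}   # approved templates

    def policy_check(env: str, template: str, size: str) -> tuple[bool, str]:
        """Return (approved, reason) based on codified rules."""
        if template not in CATALOG:
            return False, f"template '{template}' is not in the catalog"
        if size not in ALLOWED_SIZES.get(env, []):
            return False, f"size '{size}' is not allowed in '{env}'"
        return True, "policy satisfied"

    def request_environment(team: str, env: str, template: str, size: str) -> str:
        approved, reason = policy_check(env, template, size)
        if approved:
            # In a real platform this would call the provisioning API / IaC pipeline.
            return f"provisioning {template}/{size} in {env} for {team}"
        # Exceptions, not the default path, go to humans.
        return f"queued for manual review: {reason}"

    print(request_environment("payments", "dev", "web-service", "small"))
    print(request_environment("payments", "prod", "gpu-cluster", "large"))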

3) Over-Governance (Golden Handcuffs)

Case: Company C’s rules are mostly “don’ts,” driving workarounds and shadow IT.

  • Symptoms: Frequent exceptions, review bottlenecks, lower psychological safety.
  • Causes: Blanket enforcement; no phased rollout or exception design.
  • Remedies: Offer a Paved Road; use a scorecard and phased mandates (optional → required); make the recommended path the safest and fastest (a scorecard sketch follows this list).
  • References: Building Evolutionary Architectures; Team Topologies.
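
One way to run the optional → required phase-in is a scorecard that grades services against paved-road checks rather than blocking them. The sketch below uses invented check names and weights; in an early phase the required ratio is zero (purely advisory) and later phases raise it.

    # Sketch: a paved-road scorecard. Check names and weights are illustrative.
    PAVED_ROAD_CHECKS = {
        "uses_golden_pipeline": 3,
        "uses_catalog_template": 2,
        "has_owner_on_call": 2,
        "dependencies_up_to_date": 1,
    }

    def score(service_checks: dict[str, bool]) -> tuple[int, int]:
        """Return (points earned, points possible) for one service."""
        earned = sum(w for name, w in PAVED_ROAD_CHECKS.items() if service_checks.get(name))
        return earned, sum(PAVED_ROAD_CHECKS.values())

    def grade(service_checks: dict[str, bool], required_ratio: float = 0.0) -> str:
        earned, possible = score(service_checks)
        ratio = earned / possible
        # Phase 1: required_ratio = 0.0 (advisory only). Later phases raise it.
        return f"{earned}/{possible} ({'pass' if ratio >= required_ratio else 'needs work'})"

    print(grade({"uses_golden_pipeline": True, "has_owner_on_call": True}, required_ratio=0.5))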

4) Customerless Platform (Solves No One’s Pain)

Case: Company D built “for everyone,” so no one used it.

  • Symptoms: Low adoption; divergent stacks; expectation gaps.
  • Causes: No personas; no developer-experience (DX) research.
  • Remedies: Map developer journeys; choose a priority segment; manage outcomes with a North-Star metric set (Lead Time, Deployment Frequency, Change Failure Rate, MTTR).
  • References: Accelerate; Team Topologies.

5) Neglected Golden Path (Overgrown)

Case: Company E’s templates went a year without updates; drift began on day one.

  • Symptoms: Outdated starters; per-project manual fixes.
  • Causes: No lifecycle ownership; no SLOs.
  • Remedies: Version templates/modules; publish deprecation policies; run periodic “path health” reviews and announcements (a deprecation-check sketch follows this list).
  • References: The Site Reliability Workbook; Release It!.
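
A small sketch of lifecycle ownership in code: given a published deprecation policy (the versions and dates below are hypothetical), a periodic check flags projects pinned to starter versions that are past end of support.

    # Sketch: flag projects on deprecated golden-path template versions.
    # Versions, dates, and the registry format are illustrative assumptions.
    from datetime import date

    # Published deprecation policy for a hypothetical "web-service" starter.
    SUPPORT_ENDS = {
        "1.x": date(2024, 6, 30),
        "2.x": date(2025, 6, 30),
    }

    def major_line(version: str) -> str:
        return version.split(".")[0] + ".x"

    def check_project(name: str, template_version: str, today: date) -> str:
        end = SUPPORT_ENDS.get(major_line(template_version))
        if end is None:
            return f"{name}: unknown template line {template_version}"
        if today > end:
            return f"{name}: {template_version} is past end of support ({end}), please migrate"
        return f"{name}: {template_version} supported until {end}"

    today = date(2025, 1, 15)
    for project, version in [("billing", "1.4.2"), ("catalog", "2.1.0")]:
        print(check_project(project, version, today))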

6) Metrics Theater

Case: Company F boasts “environments created” and “job counts,” yet business impact is unclear.

  • Symptoms: No visible outcomes; ROI can’t be explained.
  • Causes: Vanity metrics unlinked to value hypotheses.
  • Remedies: Use the four DORA metrics (Lead Time, Deployment Frequency, Change Failure Rate, MTTR) plus adoption and DevEx satisfaction to make outcomes visible (a calculation sketch follows this list).
  • References: Accelerate; The DevOps Handbook.
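
As a sketch of moving from activity counts to outcome metrics, the snippet below computes the four DORA measures from a small, made-up set of deployment and incident records. The record shape is an assumption; in practice the data would come from the CI/CD and incident systems.

    # Sketch: the four DORA metrics from deployment/incident records.
    # The record shape is a simplifying assumption, not a real schema.
    from datetime import datetime, timedelta
    from statistics import median

    deployments = [  # (commit time, deploy time, caused_failure)
        (datetime(2025, 1, 6, 9), datetime(2025, 1, 6, 15), False),
        (datetime(2025, 1, 7, 10), datetime(2025, 1, 8, 11), True),
        (datetime(2025, 1, 9, 8), datetime(2025, 1, 9, 12), False),
    ]
    restore_times = [timedelta(hours=3)]  # time to restore for each failure

    days_observed = 7
    lead_times = [deploy - commit for commit, deploy, _ in deployments]

    print("Deployment frequency:", len(deployments) / days_observed, "per day")
    print("Median lead time:", median(lead_times))
    print("Change failure rate:", sum(failed for *_, failed in deployments) / len(deployments))
    print("Mean time to restore:", sum(restore_times, timedelta()) / len(restore_times))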

7) YAML Hell (Overexposed Low-Level Config)

Case: Company G teams copy-paste similar YAML; tiny tweaks break pipelines.

  • Symptoms: Duplicated configs; misconfig incidents; higher learning cost.
  • Causes: Insufficient abstraction; lack of reusable modules.
  • Remedies: Provide opinionated abstractions (reusable IaC/CI templates, a Platform API, Backstage plugins) with safe defaults (a generator sketch follows this list).
  • References: Terraform: Up & Running; Kubernetes Patterns.
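
A sketch of what an opinionated abstraction with safe defaults can look like: teams supply a few high-level fields and a generator emits the repetitive low-level config, so it is produced rather than copy-pasted. The input fields, defaults, registry path, and emitted keys are invented for illustration and do not correspond to any specific CI schema.

    # Sketch: generate low-level pipeline config from a small, opinionated input.
    # Field names and emitted keys are illustrative, not a real CI schema.
    import json

    SAFE_DEFAULTS = {
        "timeout_minutes": 15,
        "run_tests": True,
        "run_security_scan": True,
        "deploy_strategy": "rolling",
    }

    def render_pipeline(service: str, language: str, overrides: dict | None = None) -> str:
        config = {
            "service": service,
            "image": f"registry.internal/base-{language}:stable",  # hypothetical registry
            **SAFE_DEFAULTS,
            **(overrides or {}),   # expose a few knobs, not the whole config surface
        }
        return json.dumps(config, indent=2)

    print(render_pipeline("payments-api", "python"))
    print(render_pipeline("search", "go", overrides={"deploy_strategy": "canary"}))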

8) One-True-Answer: “Kubernetes for Everything”

Case: Company H forces every workload onto Kubernetes; cost and delay increase.

  • Symptoms: Misfit for batch/data workloads; operational complexity; worse TCO.
  • Causes: Technology monoculture; no workload classification.
  • Remedies: Embrace runtime diversity (K8s, serverless, batch, data, edge); define a workload taxonomy and fitness criteria (a classification sketch follows this list).
  • References: Kubernetes: Up & Running; Cloud Native Patterns.
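
A minimal sketch of a workload taxonomy expressed as code: classify workloads by a few attributes and map each class to a recommended runtime. The categories and rules below are examples to be adapted, not a prescription.

    # Sketch: workload taxonomy -> recommended runtime. Rules are illustrative.
    def recommend_runtime(workload: dict) -> str:
        if workload.get("event_driven") and workload.get("short_lived"):
            return "serverless functions"
        if workload.get("scheduled_batch"):
            return "managed batch service"
        if workload.get("stateful_data"):
            return "managed database / data platform"
        if workload.get("long_running_service"):
            return "Kubernetes (paved-road cluster)"
        return "needs review against the taxonomy"

    workloads = {
        "image-thumbnailer": {"event_driven": True, "short_lived": True},
        "nightly-billing": {"scheduled_batch": True},
        "orders-api": {"long_running_service": True},
    }
    for name, attrs in workloads.items():
        print(f"{name}: {recommend_runtime(attrs)}")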

9) Bolt-On Security (Gate, Not Guardrails)

Case: Company I requires final-stage security approvals, creating bottlenecks.

  • Symptoms: Audit findings; approval queues; more workarounds.
  • Causes: No shift-left; preference for manual gates over automated guardrails.
  • Remedies: Policy-as-Code (e.g., OPA); SBOM and signing; secrets management; embed scans into the pipeline by default (a guardrail sketch follows this list).
  • References: Securing DevOps; OpenSSF guides.
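
A sketch of guardrails embedded in the pipeline instead of a final gate: a hypothetical pre-deploy step that fails fast when an SBOM, an image signature, or a clean secrets/CVE scan is missing. The artifact fields are assumptions; in a real setup these checks would query the registry, signing service, and scanners, or be written as Policy-as-Code (e.g., OPA/Rego) inside the default templates.

    # Sketch: automated pre-deploy guardrails instead of a manual security gate.
    # The artifact fields are illustrative; real checks would query the registry,
    # the signing service, and the scanners, or be expressed as OPA/Rego policy.
    def guardrail_violations(artifact: dict) -> list[str]:
        violations = []
        if not artifact.get("sbom_attached"):
            violations.append("missing SBOM")
        if not artifact.get("image_signed"):
            violations.append("image is not signed")
        if artifact.get("secrets_found", 0) > 0:
            violations.append(f"{artifact['secrets_found']} secret(s) detected in the repo")
        if artifact.get("critical_cves", 0) > 0:
            violations.append(f"{artifact['critical_cves']} critical CVE(s) unresolved")
        return violations

    artifact = {"sbom_attached": True, "image_signed": False, "critical_cves": 2}
    problems = guardrail_violations(artifact)
    if problems:
        raise SystemExit("deploy blocked: " + "; ".join(problems))
    print("guardrails passed")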

10) Build-and-Abandon (No Operations Design)

Case: Company J ships the platform without defining operations or a RACI; when incidents hit, no one owns the response and the organization is overwhelmed.

  • Symptoms: No one on pager duty; long incidents.
  • Causes: Missing support model, SLA/SLOs, and runbooks.
  • Remedies: Clarify RACI; layer support (L1/L2/L3); design on-call rotations and rosters (a roster sketch follows this list); standardize runbooks and post-incident reviews.
  • References: The Site Reliability Workbook; Incident Management for Operations.
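
As a small sketch of the on-call part of that remedy, the snippet below generates a round-robin weekly roster of primary/secondary pairs from a hypothetical engineer list; the names and rotation length are placeholders to review alongside the RACI and runbooks.

    # Sketch: a simple round-robin weekly on-call roster (names are placeholders).
    from datetime import date, timedelta

    engineers = ["aiko", "ben", "chen", "dana"]   # hypothetical L2 rotation
    start = date(2025, 2, 3)                      # a Monday

    for week in range(4):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        week_start = start + timedelta(weeks=week)
        print(f"{week_start}: primary={primary}, secondary={secondary}")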

Appendix: Quick Symptom Checklist

  • Tool sprawl; falling adoption; ticket queues; policy bypasses; low utilization.
  • Stale templates; config copy-paste; K8s-as-hammer; approval bottlenecks; no on-call.

How to Use

  • Have each team mark applicable Symptoms; prioritize anti-patterns with the most hits for quarterly OKRs.
  • Run book clubs for the references and translate takeaways directly into template, policy, and platform improvements.
