Case-Based “Oops” Collection: 10 Platform Engineering Anti-Patterns

Each anti-pattern is presented as a four-part set: Symptoms → Causes → Remedies → Reference Books.

1) Tooling as the Goal (Tool-Shopping Platform)

Case: Company A adopted the latest CI/CD and IaC one after another, yet developer experience did not improve.

  • Symptoms: Tool sprawl, no standards, abandoned post-rollout, low adoption.
  • Causes: Zero user research; no product mindset or roadmap.
  • Remedies: Treat the platform as an internal product; assign a PM; define adoption and effective-use metrics plus SLOs; run a build–measure–learn loop.
  • References: Team Topologies; Accelerate (The Science of Lean Software and DevOps).

2) Ticket-Driven Support (No Self-Service)

Case: Company B requires tickets for every environment request; queues never clear.

  • Symptoms: Longer lead times, heavy person-dependence, after-hours burnout.
  • Causes: Lack of self-service design; slow API-ization and automation.
  • Remedies: Provide a self-service portal and public APIs; standardize with catalogs (templates/modules); shift approvals to policy automation.
  • References: Site Reliability Engineering; The DevOps Handbook.
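The remedy above can be sketched in a few lines: instead of a ticket queue, a provisioning request is validated against a catalog of approved templates, with approval expressed as an automated policy check. This is a minimal illustration; the catalog entries, field names, and limits are hypothetical.

```python
# Sketch: self-service provisioning validated against a template catalog.
# Approval is a policy decision computed on the spot, not a human ticket.
# Catalog contents and request fields are illustrative, not a real API.

CATALOG = {
    "python-service": {"max_cpu": 4, "allowed_envs": {"dev", "staging", "prod"}},
    "batch-job": {"max_cpu": 16, "allowed_envs": {"dev", "staging"}},
}

def provision(template: str, env: str, cpu: int) -> dict:
    """Validate a request against the catalog; return a decision immediately."""
    spec = CATALOG.get(template)
    if spec is None:
        return {"approved": False, "reason": f"unknown template {template!r}"}
    if env not in spec["allowed_envs"]:
        return {"approved": False, "reason": f"{template!r} not allowed in {env!r}"}
    if cpu > spec["max_cpu"]:
        return {"approved": False, "reason": f"cpu {cpu} exceeds limit {spec['max_cpu']}"}
    return {"approved": True, "resource": f"{template}-{env}"}
```

The point is the shape of the interaction: every rejection carries a machine-readable reason, so the portal can tell the developer how to fix the request without anyone touching a queue.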

3) Over-Governance (Golden Handcuffs)

Case: Company C’s rules are mostly “don’ts,” driving workarounds and shadow IT.

  • Symptoms: Frequent exceptions, review bottlenecks, lower psychological safety.
  • Causes: Blanket enforcement; no phased rollout or exception design.
  • Remedies: Offer a Paved Road; use a scorecard and phased mandates (optional → required); make the recommended path the safest and fastest.
  • References: Building Evolutionary Architectures; Team Topologies.
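A scorecard with phased mandates can be as simple as tagging each check with a rollout phase, so "optional" checks inform the score while only "required" checks gate compliance. The check names and phases below are hypothetical examples, not a recommended rule set.

```python
# Sketch of a paved-road scorecard with phased mandates: checks start as
# "optional" and are promoted to "required" over time. Names are illustrative.

CHECKS = [
    {"name": "uses-golden-template", "phase": "required"},
    {"name": "has-owner-metadata", "phase": "required"},
    {"name": "dashboards-linked", "phase": "optional"},
]

def score(service_facts: dict) -> dict:
    """Score a service against all checks; only failed 'required' checks block."""
    passed = [c["name"] for c in CHECKS if service_facts.get(c["name"], False)]
    failed_required = [
        c["name"] for c in CHECKS
        if c["phase"] == "required" and c["name"] not in passed
    ]
    return {
        "score": len(passed) / len(CHECKS),
        "compliant": not failed_required,
        "failed_required": failed_required,
    }
```

Promoting a check from optional to required is then a one-line change with an announcement, rather than a new blanket rule.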

4) Customerless Platform (Solves No One’s Pain)

Case: Company D built “for everyone,” so no one used it.

  • Symptoms: Low adoption; divergent stacks; expectation gaps.
  • Causes: No personas; no developer-experience (DX) research.
  • Remedies: Map developer journeys; choose a priority segment; manage outcomes with a North-Star metric set (Lead Time, Deployment Frequency, Change Failure Rate, MTTR).
  • References: Accelerate; Team Topologies.

5) Neglected Golden Path (Overgrown)

Case: Company E’s templates went a year without updates; drift began on day one.

  • Symptoms: Outdated starters; per-project manual fixes.
  • Causes: No lifecycle ownership; no SLOs.
  • Remedies: Version templates/modules; publish deprecation policies; run periodic “path health” reviews and announcements.
  • References: The SRE Workbook; Release It!.
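A published deprecation policy for versioned templates can be expressed as a tiny status function that "path health" reviews run against every consumer. The two-major-version support window below is an assumed policy for illustration, not a standard.

```python
# Sketch of a template lifecycle policy: a consumer's template version is
# "supported", "deprecated" (works, but warns), or "unsupported", based on
# how far it lags the latest major release. The window is an assumed policy.

def lifecycle_status(template_major: int, latest_major: int) -> str:
    lag = latest_major - template_major
    if lag <= 0:
        return "supported"      # current (or ahead, e.g. a pre-release)
    if lag == 1:
        return "deprecated"     # still works; announce and nudge upgrades
    return "unsupported"        # outside the support window; block or escalate
```

Running this over a registry of services gives the periodic review a concrete drift report instead of anecdotes.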

6) Metrics Theater

Case: Company F boasts “environments created” and “job counts,” yet business impact is unclear.

  • Symptoms: No visible outcomes; ROI can’t be explained.
  • Causes: Vanity metrics unlinked to value hypotheses.
  • Remedies: Use DORA Four (Lead Time, Deployment Frequency, Change Failure Rate, MTTR) plus adoption and DevEx satisfaction to make outcomes visible.
  • References: Accelerate; The DevOps Handbook.
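Replacing vanity counts with the DORA four is mostly a data-plumbing exercise; the computation itself is small. The record shape below is assumed for illustration; in practice these fields come from CI metadata and incident tooling.

```python
# Sketch: compute the DORA four from a list of deployment records.
# Each record is assumed to carry a lead time, a failure flag, and (for
# failures) a time-to-restore. The field names are illustrative.
from statistics import median

def dora_metrics(deploys: list, period_days: int) -> dict:
    n = len(deploys)
    failures = [d for d in deploys if d["failed"]]
    return {
        "deploys_per_day": n / period_days,
        "median_lead_time_hours": (
            median(d["lead_time_hours"] for d in deploys) if n else None
        ),
        "change_failure_rate": len(failures) / n if n else None,
        "mttr_hours": (
            median(f["restore_hours"] for f in failures) if failures else None
        ),
    }
```

Note that "environments created" appears nowhere: every output here ties to a delivery outcome a stakeholder cares about.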

7) YAML Hell (Overexposed Low-Level Config)

Case: Company G teams copy-paste similar YAML; tiny tweaks break pipelines.

  • Symptoms: Duplicated configs; misconfig incidents; higher learning cost.
  • Causes: Insufficient abstraction; lack of reusable modules.
  • Remedies: Provide opinionated abstractions (reusable IaC/CI templates, a Platform API, Backstage plugins) with safe defaults.
  • References: Terraform: Up & Running; Kubernetes Patterns.
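The essence of the remedy is that teams declare a tiny high-level spec and the platform expands it into full pipeline config with safe defaults baked in. The spec fields and the emitted YAML shape below are illustrative, not any specific CI system's schema.

```python
# Sketch of an opinionated abstraction: a small spec is expanded into full
# pipeline YAML with safe defaults, so teams never copy-paste low-level
# config. Field names and output shape are illustrative.

DEFAULTS = {"timeout_minutes": 15, "run_security_scan": True}

def render_pipeline(spec: dict) -> str:
    """Merge team spec over platform defaults and emit pipeline YAML."""
    cfg = {**DEFAULTS, **spec}  # team values override defaults explicitly
    lines = [
        f"name: {cfg['name']}",
        f"timeout: {cfg['timeout_minutes']}m",
        "steps:",
        "  - build",
        "  - test",
    ]
    if cfg["run_security_scan"]:
        lines.append("  - security-scan")
    lines.append("  - deploy")
    return "\n".join(lines)
```

Opting out of a default (here, the security scan) is still possible, but it has to be stated explicitly, which makes the divergence visible and reviewable.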

8) One-True-Answer: “Kubernetes for Everything”

Case: Company H forces every workload onto Kubernetes; cost and delay increase.

  • Symptoms: Misfit for batch/data workloads; operational complexity; worse TCO.
  • Causes: Technology monoculture; no workload classification.
  • Remedies: Embrace runtime diversity (K8s, serverless, batch, data, edge); define a workload taxonomy and fitness criteria.
  • References: Kubernetes: Up & Running; Cloud Native Patterns.
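A workload taxonomy with fitness criteria can start as nothing more than an explicit lookup table, which forces the "which runtime?" conversation to happen once, in policy, instead of per project. The categories and mappings below are example policy for illustration, not a recommendation.

```python
# Sketch of a workload taxonomy: classify each workload by kind and runtime
# profile, then map the pair to a fitting platform instead of defaulting to
# Kubernetes. Categories and mappings are illustrative policy.

RUNTIME_FOR = {
    ("service", "long-running"): "kubernetes",
    ("job", "bursty"): "serverless",
    ("job", "scheduled"): "batch",
    ("pipeline", "data-heavy"): "data-platform",
}

def pick_runtime(kind: str, profile: str) -> str:
    """Return the fitting runtime, or flag the workload for a human review."""
    return RUNTIME_FOR.get((kind, profile), "needs-architecture-review")
```

The explicit fallback matters: anything the taxonomy doesn't cover triggers a review rather than silently landing on the house hammer.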

9) Bolt-On Security (Gate, Not Guardrails)

Case: Company I requires final-stage security approvals, creating bottlenecks.

  • Symptoms: Audit findings; approval queues; more workarounds.
  • Causes: No shift-left; preference for manual gates over automated guardrails.
  • Remedies: Policy-as-Code (e.g., OPA); SBOM and signing; secrets management; embed scans into the pipeline by default.
  • References: Securing DevOps; OpenSSF guides.
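The guardrail idea can be shown in plain Python, though production setups typically express such rules in a dedicated engine like OPA/Rego so they are evaluated uniformly across pipelines. The specific rules and field names below are illustrative.

```python
# Sketch of a Policy-as-Code guardrail evaluated in the pipeline by default:
# deployments are rejected unless the image comes from an approved registry
# and is signed, and no credential is inlined in env vars. Rules and field
# names are illustrative; real setups often use OPA/Rego for this.

APPROVED_REGISTRIES = {"registry.internal.example"}

def evaluate(deploy: dict) -> list:
    """Return a list of violations; an empty list means the guardrail passes."""
    violations = []
    registry = deploy["image"].split("/")[0]
    if registry not in APPROVED_REGISTRIES:
        violations.append(f"image registry {registry!r} not approved")
    if not deploy.get("signed", False):
        violations.append("image is not signed")
    if any("password" in key.lower() for key in deploy.get("env", {})):
        violations.append("plaintext credential in env vars")
    return violations
```

Because the check runs on every pipeline execution, the "gate" disappears: compliant deployments pass without waiting, and non-compliant ones get an immediate, specific reason.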

10) Build-and-Abandon (No Operations Design)

Case: Company J ships the platform without defining operations or RACI; incidents overwhelm the organization.

  • Symptoms: No one on pager duty; long incidents.
  • Causes: Missing support model, SLA/SLOs, and runbooks.
  • Remedies: Clarify RACI; layer support (L1/L2/L3); design on-call/rosters; standardize runbooks and post-incident reviews.
  • References: The SRE Workbook; Incident Management for Operations.

Appendix: Quick Symptom Checklist

  • Tool sprawl; falling adoption; ticket queues; policy bypasses; low utilization.
  • Stale templates; config copy-paste; K8s-as-hammer; approval bottlenecks; no on-call.

How to Use

  • Have each team mark applicable Symptoms; prioritize anti-patterns with the most hits for quarterly OKRs.
  • Run book clubs for the references and translate takeaways directly into template, policy, and platform improvements.
