IT lists

August 22, 2025

Case-Based “Oops” Collection: 10 Platform Engineering Anti-Patterns

Each anti-pattern opens with a short case, followed by four parts: Symptoms → Causes → Remedies → References.

1) Tooling as the Goal (Tool-Shopping Platform)

Case: Company A adopted the latest CI/CD and IaC one after another, yet developer experience did not improve.

Symptoms: Tool sprawl, no standards, tools abandoned after rollout, low adoption.
Causes: Zero user research; no product mindset or roadmap.
Remedies: Treat the platform as an internal product; assign a PM; define adoption and effective-use metrics plus SLOs; run a build–measure–learn loop.
References: Team Topologies; Accelerate: The Science of Lean Software and DevOps.

2) Ticket-Driven Support (No Self-Service)

Case: Company B requires tickets for every environment request; queues never clear.

Symptoms: Longer lead times, heavy dependence on specific individuals, after-hours burnout.
Causes: No self-service design; slow progress on exposing APIs and automating requests.
Remedies: Provide a self-service portal and public APIs; standardize with catalogs (templates/modules); shift approvals to policy automation.
References: Site Reliability Engineering; The DevOps Handbook.
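
As a minimal sketch of "shift approvals to policy automation," the snippet below decides whether an environment request can be approved without a ticket. Everything in it is an assumption made for illustration: the `EnvRequest` fields, the `decide` function, the auto-approve environments, and the cost threshold are hypothetical, not part of any specific portal product.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical limits; a real platform would load these from versioned policy.
AUTO_APPROVE_ENVS = {"dev", "staging"}
MAX_AUTO_APPROVE_MONTHLY_COST = 500.0  # assumed budget threshold in USD


@dataclass(frozen=True)
class EnvRequest:
    team: str
    env: str                 # e.g. "dev", "staging", "prod"
    template: str            # name of a catalog template
    est_monthly_cost: float  # estimated cost in USD


def decide(request: EnvRequest, catalog: set[str]) -> tuple[str, str]:
    """Return (decision, reason); decision is 'approve', 'review', or 'reject'."""
    if request.template not in catalog:
        return "reject", f"template '{request.template}' is not in the catalog"
    if request.env not in AUTO_APPROVE_ENVS:
        return "review", f"environment '{request.env}' needs human approval"
    if request.est_monthly_cost > MAX_AUTO_APPROVE_MONTHLY_COST:
        return "review", "estimated cost exceeds the auto-approve budget"
    return "approve", "within policy"
```

Only requests that fall outside the policy reach a human, so routine requests no longer sit in the queue.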

3) Over-Governance (Golden Handcuffs)

Case: Company C’s rules are mostly “don’ts,” driving workarounds and shadow IT.

Symptoms: Frequent exceptions, review bottlenecks, lower psychological safety.
Causes: Blanket enforcement; no phased rollout or exception design.
Remedies: Offer a Paved Road; use a scorecard and phased mandates (optional → required); make the recommended path the safest and fastest.
References: Building Evolutionary Architectures; Team Topologies.
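
One way to picture a scorecard with phased mandates is sketched below. The check names, the `Phase` enum, and the `evaluate` function are hypothetical; the point is only that a failed optional check produces advice, while a failed required check blocks.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    OPTIONAL = "optional"  # failure is reported as advice only
    REQUIRED = "required"  # failure blocks the change


@dataclass(frozen=True)
class Check:
    name: str
    phase: Phase


def evaluate(checks, results):
    """Return (score, blocking, advisory) for one service.

    `results` maps check name -> bool; a missing entry counts as failed.
    """
    failed = [c for c in checks if not results.get(c.name, False)]
    score = (len(checks) - len(failed)) / len(checks) if checks else 1.0
    blocking = [c.name for c in failed if c.phase is Phase.REQUIRED]
    advisory = [c.name for c in failed if c.phase is Phase.OPTIONAL]
    return score, blocking, advisory
```

Promoting a check from `OPTIONAL` to `REQUIRED` is then a one-line, announced change, which is one way to implement the optional → required rollout.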

4) Customerless Platform (Solves No One’s Pain)

Case: Company D built “for everyone,” so no one used it.

Symptoms: Low adoption; divergent stacks; expectation gaps.
Causes: No personas; no developer-experience (DX) research.
Remedies: Map developer journeys; choose a priority segment; manage outcomes with a North-Star metric set (Lead Time, Deployment Frequency, Change Failure Rate, MTTR).
References: Accelerate; Team Topologies.

5) Neglected Golden Path (Overgrown)

Case: Company E’s templates went a year without updates; drift began on day one.

Symptoms: Outdated starters; per-project manual fixes.
Causes: No lifecycle ownership; no SLOs.
Remedies: Version templates/modules; publish deprecation policies; run periodic “path health” reviews and announcements.
References: The SRE Workbook; Release It!.
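
A "path health" review can be partly automated. The sketch below assumes a hypothetical deprecation policy expressed as a mapping from template version to end-of-support date (`None` meaning still supported); the `path_health` function and its status names are illustrative, not an existing tool.

```python
from __future__ import annotations

from datetime import date
from typing import Optional


def parse(version: str) -> tuple[int, ...]:
    """Turn '2.10.0' into (2, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def path_health(used: str, end_of_support: dict[str, Optional[date]], today: date) -> str:
    """Classify the template version a project was generated from."""
    if used not in end_of_support:
        return "unknown"
    eos = end_of_support[used]
    if eos is not None:
        # A published end-of-support date means the version is on its way out.
        return "unsupported" if today >= eos else "deprecated"
    latest = max(end_of_support, key=parse)
    return "current" if used == latest else "outdated"
```

Running this across all repositories gives the review a concrete list of projects to nudge before a version's support ends.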

6) Metrics Theater

Case: Company F boasts “environments created” and “job counts,” yet business impact is unclear.

Symptoms: No visible outcomes; ROI can’t be explained.
Causes: Vanity metrics unlinked to value hypotheses.
Remedies: Use the four DORA metrics (Lead Time, Deployment Frequency, Change Failure Rate, MTTR) plus adoption and DevEx satisfaction to make outcomes visible.
References: Accelerate; The DevOps Handbook.
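
To make the four metrics concrete, here is a rough sketch of how they could be computed from deployment records. The `Deployment` record shape and the `dora_metrics` function are assumptions for illustration; the choice of median for lead time and mean for recovery time is also an assumption, and real data would come from CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # did this change cause a failure?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def dora_metrics(deployments, period_days):
    """Compute the four metrics over a reporting period of `period_days` days."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "lead_time_hours": median(lt.total_seconds() for lt in lead_times) / 3600,
        "deploys_per_day": len(deployments) / period_days,
        "change_failure_rate": len(failures) / len(deployments),
        # None when there were no recovered failures in the period.
        "mttr_hours": (
            sum(restores, timedelta()).total_seconds() / len(restores) / 3600
            if restores else None
        ),
    }
```

Unlike "environments created," each of these numbers can be tied to a value hypothesis, such as "the new pipeline template shortens lead time."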

7) YAML Hell (Overexposed Low-Level Config)

Case: Company G teams copy-paste similar YAML; tiny tweaks break pipelines.

Symptoms: Duplicated configs; misconfig incidents; higher learning cost.
Causes: Insufficient abstraction; lack of reusable modules.
Remedies: Provide opinionated abstractions (reusable IaC/CI templates, a Platform API, Backstage plugins) with safe defaults.
References: Terraform: Up & Running; Kubernetes Patterns.
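
The idea of an opinionated abstraction with safe defaults can be sketched as follows: teams fill in a handful of fields, and the platform expands them into the full low-level config. The `ServiceSpec` fields, the default values, and `render_deployment` are hypothetical; the output is shaped like a Kubernetes Deployment manifest purely as an example target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    name: str
    image: str
    replicas: int = 2       # assumed default: survive one instance failing
    cpu: str = "250m"
    memory: str = "256Mi"


def render_deployment(spec: ServiceSpec) -> dict:
    """Expand a small spec into a full manifest with hardened defaults."""
    tag = spec.image.rsplit("/", 1)[-1].partition(":")[2]
    if not tag or tag == "latest":
        raise ValueError("pin the image to an explicit, non-latest tag")
    if spec.replicas < 1:
        raise ValueError("replicas must be at least 1")
    labels = {"app": spec.name}
    resources = {"cpu": spec.cpu, "memory": spec.memory}
    container = {
        "name": spec.name,
        "image": spec.image,
        "resources": {"requests": resources, "limits": resources},
        "securityContext": {
            "runAsNonRoot": True,
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": spec.name, "labels": labels},
        "spec": {
            "replicas": spec.replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }
```

A team then writes something like `ServiceSpec("orders", "registry.example.com/orders:1.4.2")` instead of copy-pasting dozens of YAML lines, and a fix to the defaults reaches every service on the next render.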

8) One-True-Answer: “Kubernetes for Everything”

Case: Company H forces every workload onto Kubernetes; cost and delay increase.

Symptoms: Misfit for batch/data workloads; operational complexity; worse TCO.
Causes: Technology monoculture; no workload classification.
Remedies: Embrace runtime diversity (K8s, serverless, batch, data, edge); define a workload taxonomy and fitness criteria.
References: Kubernetes: Up & Running; Cloud Native Patterns.
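
A workload taxonomy can start as a few explicit rules. The sketch below is deliberately simplistic: the `Workload` attributes and the rules in `recommend` are illustrative assumptions, cover only three of the runtimes mentioned above, and would need to be replaced with an organization's own fitness criteria.

```python
from dataclasses import dataclass
from enum import Enum


class Runtime(Enum):
    KUBERNETES = "kubernetes"
    SERVERLESS = "serverless"
    BATCH = "batch"


@dataclass(frozen=True)
class Workload:
    name: str
    long_running: bool     # serves traffic continuously
    event_driven: bool     # triggered by events or bursty requests
    scheduled: bool        # runs on a timer and then exits
    stateful: bool = False


def recommend(w: Workload) -> Runtime:
    """Map a workload to a runtime using illustrative fitness rules."""
    if w.scheduled and not w.long_running:
        return Runtime.BATCH
    if w.event_driven and not w.long_running and not w.stateful:
        return Runtime.SERVERLESS
    # Default for long-running or stateful services in this sketch.
    return Runtime.KUBERNETES
```

Writing the rules down, even crudely, turns "Kubernetes for everything" into a decision that can be reviewed and challenged.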

9) Bolt-On Security (Gate, Not Guardrails)

Case: Company I requires final-stage security approvals, creating bottlenecks.

Symptoms: Audit findings; approval queues; more workarounds.
Causes: No shift-left; preference for manual gates over automated guardrails.
Remedies: Adopt Policy-as-Code (e.g., OPA); generate SBOMs and sign artifacts; use secrets management; embed scans into the pipeline by default.
References: Securing DevOps; OpenSSF guides.
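
The difference between a gate and a guardrail is that a guardrail runs automatically on every build and tells the developer exactly what to fix. The sketch below is a Python stand-in for that idea; with OPA the rules would be written in Rego rather than Python. The `Artifact` fields and rule names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    name: str
    has_sbom: bool
    signed: bool
    critical_vulns: int     # count reported by an image or dependency scan
    plaintext_secrets: int  # count reported by a secret scanner


# Each rule is (name, predicate); the predicate returns True when the rule passes.
RULES = [
    ("sbom-present", lambda a: a.has_sbom),
    ("artifact-signed", lambda a: a.signed),
    ("no-critical-vulns", lambda a: a.critical_vulns == 0),
    ("no-plaintext-secrets", lambda a: a.plaintext_secrets == 0),
]


def violations(artifact: Artifact, rules=RULES) -> list:
    """Return the names of all rules the artifact fails (empty list = pass)."""
    return [name for name, passes in rules if not passes(artifact)]
```

A pipeline step would fail the build when `violations(...)` is non-empty and print the list, so feedback arrives minutes after the commit instead of at a final-stage review.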

10) Build-and-Abandon (No Operations Design)

Case: Company J ships the platform without defining operations or RACI; incidents then overwhelm the organization.

Symptoms: No one on pager duty; long incidents.
Causes: Missing support model, SLA/SLOs, and runbooks.
Remedies: Clarify RACI; layer support (L1/L2/L3); design on-call/rosters; standardize runbooks and post-incident reviews.
References: The SRE Workbook; Incident Management for Operations.
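
As a starting point for on-call design, a weekly round-robin roster with a primary and a secondary can be generated in a few lines. This is only a sketch: `build_roster` is a hypothetical helper, and it ignores holidays, time zones, and swaps that a real scheduling tool would handle.

```python
from datetime import date, timedelta


def build_roster(engineers, start, weeks):
    """Return a list of (week_start, primary, secondary) tuples.

    The secondary for one week becomes the primary for the next, so every
    handover goes to someone who already has context.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    roster = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        roster.append((start + timedelta(weeks=week), primary, secondary))
    return roster
```

For example, `build_roster(["ana", "bo", "cy"], date(2025, 9, 1), 4)` covers four consecutive weeks with no unassigned slot, which addresses the "no one on pager duty" symptom directly.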

Appendix: Quick Symptom Checklist

Tool sprawl; falling adoption; ticket queues; policy bypasses; low utilization.
Stale templates; config copy-paste; K8s-as-hammer; approval bottlenecks; no on-call.

How to Use

Have each team mark the Symptoms that apply to it, then prioritize the anti-patterns with the most hits in quarterly OKRs.
Run book clubs for the references and translate takeaways directly into template, policy, and platform improvements.
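
The tallying step can be sketched as below. The symptom-to-anti-pattern mapping is an assumption derived from the checklist wording, and `prioritize` is a hypothetical helper; each anti-pattern is counted once per team that reports it.

```python
from collections import Counter

# Assumed mapping from checklist items to the anti-patterns in this post.
SYMPTOM_TO_PATTERN = {
    "tool sprawl": "Tooling as the Goal",
    "ticket queues": "Ticket-Driven Support",
    "policy bypasses": "Over-Governance",
    "low utilization": "Customerless Platform",
    "stale templates": "Neglected Golden Path",
    "config copy-paste": "YAML Hell",
    "k8s-as-hammer": "Kubernetes for Everything",
    "approval bottlenecks": "Bolt-On Security",
    "no on-call": "Build-and-Abandon",
}


def prioritize(team_marks, top=3):
    """Return the `top` anti-patterns as (name, number_of_teams) pairs.

    `team_marks` maps a team name to the list of symptoms that team marked.
    """
    hits = Counter()
    for symptoms in team_marks.values():
        # Use a set so a team counts at most once per anti-pattern.
        patterns = {SYMPTOM_TO_PATTERN[s] for s in symptoms if s in SYMPTOM_TO_PATTERN}
        hits.update(patterns)
    return hits.most_common(top)
```

The top entries become candidates for the next quarter's OKRs.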
