DevOps Digest
curated · weekly · signal only

This week in production.

Twelve field-tested essays from engineers running real infrastructure. No listicles, no vendor fluff — just the writing your on-call rotation will thank you for.

Kubernetes

Designing Multi-Tenant Kubernetes Clusters at Scale

A pragmatic teardown of namespace isolation, vCluster patterns, and network policies used to safely host 400+ engineering teams on shared infrastructure.

Advanced · 12 min · #kubernetes #multi-tenancy
Priya Nair
ITNEXT
Read on Medium
CI/CD

Cutting Monorepo CI Times From 47 to 6 Minutes

How a platform team used remote caching, affected-graph detection, and merge queues to compress a bloated GitHub Actions pipeline into a fast dev loop.

Intermediate · 9 min · #ci/cd #github-actions
Marcus Chen
Better Programming
Read on Medium
Platform Engineering

The Internal Developer Platform Nobody Uses

An honest retrospective on shipping a Backstage-based IDP that engineers ignored — and the golden-path shift that turned adoption around in one quarter.

Intermediate · 14 min · #backstage #idp
Ada Okafor
Platform Engineering
Read on Medium
AWS

VPC Lattice in Anger: A Year of Service-to-Service Routing

Field notes on replacing an internal ALB mesh with VPC Lattice — including the auth policy footguns, IAM sprawl, and where it actually beats a service mesh.

Advanced · 11 min · #aws #networking
Jordan Reyes
AWS in Plain English
Read on Medium
Observability

OpenTelemetry Without the Vendor Tax

A step-by-step migration off a proprietary agent to a self-hosted OTel Collector fleet — retaining traces, cutting egress spend by 62%, and keeping alerts calm.

Intermediate · 10 min · #opentelemetry #tracing
Sofia Lindberg
The New Stack
Read on Medium
Security

Signing, SBOMs, and the Boring Parts of Supply Chain Security

Why cosign plus SLSA Level 3 is finally table stakes, and how a fintech implemented artifact attestation without adding a release-blocking security checkpoint.

Advanced · 13 min · #supply-chain #sigstore
Ethan Park
Level Up Coding
Read on Medium
Terraform

Terraform Modules That Don't Rot

Composition patterns, versioning discipline, and the interface contracts that keep a shared module library maintainable across 30+ product teams.

Beginner · 8 min · #terraform #iac
Naomi Ferreira
HashiCorp Blog
Read on Medium
SRE

SLOs That Actually Change Behavior

Most SLO programs become dashboard theater. Here's the error-budget policy structure that makes product managers care about latency percentiles.

Beginner · 7 min · #slo #reliability
Ravi Subramanian
SRE Weekly
Read on Medium
Kubernetes

GitOps at the Edge: 12,000 Clusters, One Repo

How a CDN provider orchestrates Argo CD across a globally sharded fleet, with progressive rollouts, drift alerts, and a surprisingly small platform team.

Advanced · 15 min · #gitops #argo-cd
Lena Fischer
CNCF Blog
Read on Medium
AWS

The FinOps Playbook: Cutting $2.4M From an AWS Bill

A month-by-month breakdown of savings plans, Graviton migration, S3 tiering, and the internal chargeback model that let teams self-serve their own optimization.

Intermediate · 11 min · #finops #cost
Daniel Alvarez
AWS in Plain English
Read on Medium
CI/CD

Progressive Delivery Beyond the Canary Buzzword

Feature flags, weighted routing, and automated rollback wired into a single delivery contract — with the metrics that actually determine promotion.

Intermediate · 9 min · #canary #feature-flags
Yuki Tanaka
Better Programming
Read on Medium
Platform Engineering

Golden Paths, Not Golden Cages

The design philosophy behind opinionated defaults that engineers thank you for — and the escape hatches that keep the platform team out of every code review.

Beginner · 6 min · #golden-paths #dx
Isabelle Moreau
Platform Engineering
Read on Medium