Spotlight

From Chaos to 99.9% Uptime: Rebuilding a Kubernetes Platform for GPU Workloads

Mateen Ali Anjum

This case study describes rebuilding a fragile Kubernetes infrastructure into a production-grade platform for GPU-based ML workloads, improving deployment frequency from weekly to 10+ times daily.

More articles →

Tools and utilities

  • kubesdk: Kubernetes SDK

    kubesdk is a fully typed, async-first Python Kubernetes client with a CLI that generates models from any live cluster or CRD, achieving over 1000 RPS on large, multi-cluster workloads.

  • Sgl-project/rbg: AI inference orchestrator

    RoleBasedGroup is a Kubernetes API written in Go for orchestrating distributed, stateful AI inference workloads with multi-role collaboration and built-in service discovery. It treats inference services as role-based groups rather than isolated workloads.

  • GoKubeDownscaler: workload autoscaler

    GoKubeDownscaler is a horizontal autoscaler for Kubernetes workloads, written in Go, that scales down Deployments, StatefulSets, and other resources on time-based schedules to save costs.

  • Radar: Kubernetes visibility

    Radar provides Kubernetes cluster visibility through topology graphs, event timelines, and service traffic visualization. It runs as a single binary that connects directly to the Kubernetes API, with no cluster-side installation.

  • cek: Container Exploration Kit

    cek is a command-line tool for exploring OCI container image filesystems, reading file contents, and inspecting layer mechanics without running containers. It reads images by connecting to container daemons or pulling from registries.

More projects →

Events starting soon

Discover more events on Kube Events →

Migrating to Karpenter: Fun Stories

Running multiple Kubernetes clusters on AWS with the cluster autoscaler? Every four months, you face the same grind: upgrading Kubernetes versions, recreating auto scaling groups, and hoping instance type changes stick.

Adhi Sutandi, DevOps Engineer at Beekeeper by LumApps, shares how his team migrated from the cluster autoscaler to Karpenter across eight EKS clusters — and the hard lessons they learned along the way.

In this episode:

  • Why AWS auto scaling groups are immutable and how that creates upgrade bottlenecks at scale
  • How the latest AMI tag accidentally turned less critical clusters into chaos engineering environments, dropping SLOs before anyone realized Karpenter was the cause
  • Why pre-stop sleep hooks solved pod restartability problems that Quarkus's built-in graceful shutdown couldn't
  • The case for pod disruption budgets over Karpenter annotations when protecting critical workloads during node rotations
  • How Karpenter's implicit 10% disruption budget caught the team off guard — and the explicit configuration that fixed it
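
To see why an implicit 10% budget can surprise a team, it helps to work through the arithmetic. The sketch below is not taken from the episode. It assumes the rule described in Karpenter's disruption documentation: a percentage budget is rounded up against the node count, and nodes already being deleted or not ready are subtracted. The function name and its parameters are hypothetical and used only for illustration.

```python
def allowed_disruptions(total_nodes, budget="10%", deleting=0, not_ready=0):
    """Estimate how many nodes may be disrupted at once under one budget.

    Assumption: percentage budgets round up, so even a small node pool
    allows at least one disruption unless the budget is set to 0.
    """
    if budget.endswith("%"):
        percent = int(budget[:-1])
        # Integer ceiling of total_nodes * percent / 100.
        limit = (total_nodes * percent + 99) // 100
    else:
        # A plain number is treated as an absolute node count.
        limit = int(budget)
    # Nodes already leaving or unhealthy count against the budget.
    return max(0, limit - deleting - not_ready)


if __name__ == "__main__":
    for nodes in (5, 20, 100):
        print(nodes, "nodes ->", allowed_disruptions(nodes), "disruptable at once")
```

Under these assumptions, a 5-node pool still permits one disruption at a time, which is 20% of its capacity. An explicit budget (for example a fixed node count, or "0" during business hours) makes that limit visible instead of implied.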

Learn from production

More case studies →

Matching jobs

    • DevOps Engineer with Planet

    • Salary: $14.28M to $20.32M a year

    • Location: remote from

    • Tech stack: Kubernetes, GCP, SQL, Python, JavaScript, Go, Shell, Terraform, Grafana

    • DevOps Engineer with Precision Medicine Group

    • Salary: $147.6K to $324.28K a year

    • Location: fully remote

    • Tech stack: Kubernetes, AWS, Helm, Docker, Python, Shell, Terraform, GitLab, AWS CloudWatch

    • DevSecOps Engineer with Pinterest

    • Salary: $155.58K to $320.32K a year

    • Location: remote from

    • Tech stack: Kubernetes, AWS, Go, Python, C++, TypeScript, Terraform, Puppet

    • DevSecOps Engineer with Rise8

    • Salary: $163.12K to $203.9K a year

    • Location: remote from

    • Tech stack: Kubernetes, Shell, Python, PowerShell, Terraform, Jenkins, Ansible, Puppet, Chef

    • DevSecOps Engineer with Schonfeld

    • Salary: $120K to $135K a year

    • Location: fully remote

    • Tech stack: Kubernetes, Python, PowerShell

Discover more Kubernetes jobs on Kube Careers →

Subscribe to Learn Kubernetes Weekly

Trusted by 77K engineers. Delivered 173 issues and counting.


Build something

More tutorials →

Call for Papers closing soon

  1. SREday San Francisco 2026 (closes in 7 days)

    The Call for Papers is open until 16 March 2026 (GMT-4). More info →
    • Location: San Francisco, CA, USA
    • In-person conference organized by SREday.
    • The conference starts on 15 April 2026.
    • Apply here
  2. SREday Seattle 2026 (closes in 7 days)

    The Call for Papers is open until 16 March 2026 (GMT-4). More info →
    • Location: Seattle, WA, USA
    • In-person conference organized by SREday.
    • The conference starts on 20 April 2026.
    • Apply here
  3. Cloud Native Days Amsterdam (closes in 11 days)

    The Call for Papers is open until 20 March 2026 (GMT-4). More info →
    • Location: Amsterdam, NL
    • In-person conference organized by Cloud Native Amsterdam.
    • The conference starts on 22 May 2026.
    • Apply here
  4. Cloud Native Telco Day Europe (closes in 14 days)

    The Call for Papers is open until 23 March 2026 (GMT-4). More info →
    • Location: Amsterdam, NL
    • In-person conference organized by CNCF.
    • The conference starts on 23 March 2026.
    • Apply here
  5. Cloud Native AI + Kubeflow Day Europe (closes in 14 days)

    The Call for Papers is open until 23 March 2026 (GMT-4). More info →
    • Location: Amsterdam, NL
    • In-person conference organized by CNCF.
    • The conference starts on 23 March 2026.
    • Apply here
  6. Cloud Native 2026 (closes in 14 days)

    The Call for Papers is open until 23 March 2026 (GMT-4). More info →
    • This is a virtual event
    • Online conference organized by Conf42.
    • The conference starts on 23 April 2026.
    • Apply here
  7. Data on Kubernetes Day (closes in 17 days)

    The Call for Papers is open until 26 March 2026 (GMT-4). More info →
    • Location: Amsterdam, NL
    • In-person conference organized by CNCF.
    • The conference starts on 26 March 2026.
    • Apply here

Thanks to our sponsors who make Kube Today possible

Find out more about being a sponsor →

More articles

Even more articles →