Spotlight

Slurm on Kubernetes (SUNK): Modernizing HPC and AI workload management

Dave Davies

This article explains how Slurm on Kubernetes combines Slurm job scheduling with Kubernetes orchestration so AI and HPC teams can modernize GPU-heavy infrastructure without forcing researchers into raw Kubernetes workflows.

More articles →

Tools and utilities

  • K8up

    K8up is a Kubernetes Operator that helps you:

  • X.509 Certificate Exporter

    X.509 Certificate Exporter is a Go-based Prometheus exporter that monitors certificate expiration inside Kubernetes clusters or as a standalone service, helping teams alert before TLS certificates expire.

  • RootCause

    RootCause is a local-first MCP server for Kubernetes that turns natural language into evidence-backed incident analysis, safe operation checks, and ecosystem diagnostics for tools like Argo CD, Flux, Cilium, and Helm.

  • Helm exporter

    Helm-exporter exports Helm releases, charts, and version statistics in the Prometheus format.

  • Cilium Policy Generator

    Cilium Policy Generator watches dropped flows in real time and auto-generates CiliumNetworkPolicy YAML files to allow them, so you stop writing policies by hand in default-deny Cilium clusters.

More projects →

Events starting soon

Discover more events on Kube Events →

Intelligent Kubernetes Load Balancing

You're running gRPC services in Kubernetes and load balancing looks fine on the dashboard, but some pods are burning at 80% CPU while others sit idle, and adding more replicas only partially helps.

Rohit Agrawal, a Staff Software Engineer on the traffic platform team at Databricks, explains why this happens and how his team replaced Kubernetes's default networking with a proxy-less, client-side load-balancing system built on the xDS protocol.

In this episode:

  • Why kube-proxy's Layer 4 routing breaks down under high-throughput gRPC: it picks a backend once per TCP connection, not per request
  • How Databricks built an Endpoint Discovery Service (EDS) that watches Kubernetes directly and streams real-time pod metadata to every client
  • How zone-aware spillover cut cross-availability-zone costs without sacrificing availability
  • Why CPU-based routing failed (monitoring lag creates oscillation) and what signals to use instead
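
To make the first point concrete, here is a toy model of the difference between picking a backend once per connection and picking one per request. It is only an illustrative sketch, not the Databricks implementation: the function names, pod names, and request counts are hypothetical, and connections are assigned round-robin (an assumption made to keep the result deterministic; real Layer 4 balancing typically picks at random).

```python
from collections import Counter


def balance_per_connection(client_requests, backends):
    """Layer 4 style: each client's long-lived connection is pinned to one backend."""
    load = Counter({b: 0 for b in backends})
    for i, n in enumerate(client_requests):
        backend = backends[i % len(backends)]  # chosen once, when the connection opens
        load[backend] += n  # every request on that connection lands on the same pod
    return load


def balance_per_request(client_requests, backends):
    """Client-side style: each client rotates through all backends per request."""
    load = Counter({b: 0 for b in backends})
    for n in client_requests:
        for r in range(n):
            load[backends[r % len(backends)]] += 1
    return load


if __name__ == "__main__":
    backends = ["pod-a", "pod-b", "pod-c", "pod-d"]  # hypothetical pod names
    client_requests = [1000, 20, 20, 20]  # one busy client, three quiet ones
    print(dict(balance_per_connection(client_requests, backends)))
    print(dict(balance_per_request(client_requests, backends)))
```

In this model the connections are spread perfectly evenly (one per pod), yet one pod still receives almost all of the requests, because gRPC multiplexes many requests over a single long-lived HTTP/2 connection. Balancing per request evens out the load regardless of how busy each client is.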

The system has been running in production for three years across hundreds of services, handling millions of requests.

Learn from production

More case studies →

Matching jobs

  • Data Engineer with Strava

    • Salary: $67.5K to $407K a year

    • Location: based in the office (and remote from home) in San Francisco, CA, USA

    • Tech stack: Kubernetes, AWS, Azure, GCP, Go, Java, Python, Ruby, Scala, SQL

  • DevOps Engineer with Strava

    • Salary: $67.5K to $539K a year

    • Location: based in the office (and remote from home) in San Francisco, CA, USA

    • Tech stack: Kubernetes, Docker, Go, Java, Python, Ruby, Scala, Redis, Cassandra, MySQL

  • Machine Learning Engineer with Iambic Therapeutics, Inc

    • Salary: $57.6K to $462K a year

    • Location: remote from

    • Tech stack: Kubernetes, AWS, Python

  • Machine Learning Engineer with applike group

    • Salary: US$175.5K to US$289.85K a year

    • Location: based in the office (and remote from home) in Hamburg, DE

    • Tech stack: Kubernetes, AWS, Python, SQL, Airflow

  • Platform Engineer with Gen Digital Inc.

    • Salary: $135K to $277.2K a year

    • Location: remote from

    • Tech stack: Kubernetes, Azure, Python, SQL, Snowflake, Kafka

Discover more Kubernetes jobs on Kube Careers →

Subscribe to Learn Kubernetes Weekly

Trusted by 77K engineers. Delivered 179 issues and counting.


Build something

More tutorials →

Call for Papers closing soon

  1. Open Conf 2026 (2 days left)

    The Call for Papers is open until 19 April 2026 (GMT-4). More info →
    • Location: Athens, GR

    • In-person conference organized by Open Conf.

    • The conference starts on 21 November 2026.

    • Apply here
  2. SREday Munich 2026 (4 days left)

    The Call for Papers is open until 21 April 2026 (GMT-4). More info →
    • Location: Munich, DE

    • In-person conference organized by SREday.

    • The conference starts on 15 May 2026.

    • Apply here
  3. CLC26 (4 days left)

    The Call for Papers is open until 21 April 2026 (GMT-4). More info →
    • Location: Mannheim, DE

    • In-person conference organized by Rheinwerk Verlag.

    • The conference starts on 11 November 2026.

    • Apply here
  4. Tech Fuse Des Moines 2026 (13 days left)

    The Call for Papers is open until 30 April 2026 (GMT-4). More info →
    • Location: Des Moines, IA, USA

    • In-person conference organized by Tech Fuse DSM.

    • The conference starts on 16 October 2026.

    • Apply here
  5. Devopsdays Graz (13 days left)

    The Call for Papers is open until 30 April 2026 (GMT-4). More info →
    • Location: Graz, AT

    • In-person conference organized by Devopsdays.

    • The conference starts on 4 September 2026.

    • Apply here
  6. bit summit 2026 (13 days left)

    The Call for Papers is open until 30 April 2026 (GMT-4). More info →
    • Location: Hamburg, DE

    • In-person conference organized by bit summit.

    • The conference starts on 23 September 2026.

    • Apply here
  7. IT-Tage (13 days left)

    The Call for Papers is open until 30 April 2026 (GMT-4). More info →
    • Location: Frankfurt, DE

    • In-person conference organized by Alkmene Verlag.

    • The conference starts on 10 December 2026.

    • Apply here

Thanks to our sponsors who make Kube Today possible

Find out more about being a sponsor →

More articles

Even more articles →