Spotlight

Designing for Failure: Chaos Engineering Best Practices

Rajat Gupta

This article explains chaos engineering practices for Kubernetes, covering how to design resilient systems by proactively testing failure scenarios like pod crashes, network failures, and resource exhaustion using tools like Chaos Mesh and LitmusChaos.

More articles →

Tools and utilities

  • Schednex: AI scheduler

    Schednex enables the smartest placement of your workloads by drawing on telemetry from K8sGPT and context awareness from AI.

  • k3k: nested k3s

    A Kubernetes-in-Kubernetes tool, k3k provides a way to run multiple embedded, isolated k3s clusters on your Kubernetes cluster.

  • Kubernetes MCP Server: AI Kubernetes management

    This tool exposes Kubernetes cluster operations through the Model Context Protocol, allowing AI agents and tools to safely read cluster state and execute controlled actions like kubectl commands.

  • KubeTidy

    KubeTidy helps you clean, merge, and optimize your Kubernetes configurations effortlessly.

  • external-dns-provider-mikrotik – ExternalDNS Webhook for MikroTik DNS

    This project provides a webhook provider for ExternalDNS that lets Kubernetes automatically manage DNS records on a MikroTik RouterOS via its API.

More projects →

Events starting soon

Discover more events on Kube Events →

Intelligent Kubernetes Load Balancing

You're running gRPC services in Kubernetes and load balancing looks fine on the dashboard, but some pods are burning at 80% CPU while others sit idle, and adding more replicas only partially helps.

Rohit Agrawal, a Staff Software Engineer on the traffic platform team at Databricks, explains why this happens and how his team replaced Kubernetes's default networking with a proxy-less, client-side load-balancing system built on the xDS protocol.

In this episode:

  • Why kube-proxy's Layer 4 routing breaks down under high-throughput gRPC: it picks a backend once per TCP connection, not per request, so every call multiplexed over a long-lived connection lands on the same pod
  • How Databricks built an Endpoint Discovery Service (EDS) that watches Kubernetes directly and streams real-time pod metadata to every client
  • How zone-aware spillover cut cross-availability-zone costs without sacrificing availability
  • Why CPU-based routing failed (monitoring lag creates oscillation) and what signals to use instead
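
The first point can be illustrated with a toy simulation. This is only a sketch, not the system described in the episode: the function `simulate`, the pod names, and the client and request counts are all hypothetical, and random choice stands in for whatever policy a real balancer uses. It compares pinning each client to one backend for the life of its connection with choosing a backend for every request.

```python
import random
from collections import Counter


def simulate(num_clients, requests_per_client, backends, per_request, seed=0):
    """Return a Counter mapping each backend to the requests it handled."""
    rng = random.Random(seed)
    load = Counter({b: 0 for b in backends})
    for _ in range(num_clients):
        # Connection-level balancing: the backend is chosen once, when the
        # long-lived connection is opened, and every request reuses it.
        pinned = rng.choice(backends)
        for _ in range(requests_per_client):
            # Request-level balancing: a backend is chosen for each call.
            target = rng.choice(backends) if per_request else pinned
            load[target] += 1
    return load


if __name__ == "__main__":
    pods = [f"pod-{i}" for i in range(8)]
    for mode, per_request in (("per-connection", False), ("per-request", True)):
        load = simulate(4, 1000, pods, per_request)
        print(mode, "busiest:", max(load.values()), "idlest:", min(load.values()))
```

With fewer long-lived connections than pods, the per-connection mode necessarily leaves some pods with no traffic at all, which is also why adding replicas does not help much until clients reconnect.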

The system has been running in production for three years across hundreds of services, handling millions of requests.

Learn from production

More case studies →

Matching jobs

  • DevOps Engineer with AbhiBus

    • Salary: $1.27L to $3.03L a year

    • Location: based in the office in Hyderabad, IN

    • Tech stack: Kubernetes, AWS, Helm, Docker, Python, PostgreSQL, MySQL, CloudFormation, Terraform, GitLab

  • DevOps Engineer with IDT

    • Salary: PLN 4.65K to PLN 521.84K a year

    • Location: remote from

    • Tech stack: Kubernetes, AWS, ArgoCD, Docker, Go, Shell, Terraform, GitHub Actions, Jenkins

  • Developer Advocate with Nuon, Inc.

    • Salary: $127.26K to $330K a year

    • Location: based in the office in San Francisco, CA, USA

    • Tech stack: Kubernetes, AWS, C#, Go, Java, JavaScript, Python, Shell, SQL, TypeScript

  • Head of Platform Engineering with Miovision

    • Salary: $135K to $297K a year

    • Location: remote from

    • Tech stack: Kubernetes, AWS, ArgoCD, Docker, Java, Python, Shell, Snowflake, Terraform, GitLab

  • Platform Engineer with Iliad - Free

    • Salary: $126K to $275K a year

    • Location: based in the office in Paris, FR

    • Tech stack: Kubernetes, On-premise, Helm, ArgoCD, Docker, Go, Python, Terraform, Ansible, Grafana

Discover more Kubernetes jobs on Kube Careers →

Subscribe to Learn Kubernetes Weekly

Trusted by 77K engineers. Delivered 178 issues and counting.

Build something

More tutorials →

Call for Papers closing soon

  1. SREday Austin 2026 (closes in 2 days)

    The call for papers is open until 12 April 2026 (GMT-4). More info →

    • Location: Austin, TX, USA

    • In-person conference organized by SREday.

    • The conference starts on 6 May 2026.

    • Apply here

  2. Open Conf 2026 (closes in 9 days)

    The call for papers is open until 19 April 2026 (GMT-4). More info →

    • Location: Athens, GR

    • In-person conference organized by Open Conf.

    • The conference starts on 21 November 2026.

    • Apply here

  3. SREday Munich 2026 (closes in 11 days)

    The call for papers is open until 21 April 2026 (GMT-4). More info →

    • Location: Munich, DE

    • In-person conference organized by SREday.

    • The conference starts on 15 May 2026.

    • Apply here

  4. CLC26 (closes in 11 days)

    The call for papers is open until 21 April 2026 (GMT-4). More info →

    • Location: Mannheim, DE

    • In-person conference organized by Rheinwerk Verlag.

    • The conference starts on 11 November 2026.

    • Apply here

  5. Tech Fuse Des Moines 2026 (closes in 20 days)

    The call for papers is open until 30 April 2026 (GMT-4). More info →

    • Location: Des Moines, IA, USA

    • In-person conference organized by Tech Fuse DSM.

    • The conference starts on 16 October 2026.

    • Apply here

  6. Devopsdays Graz (closes in 20 days)

    The call for papers is open until 30 April 2026 (GMT-4). More info →

    • Location: Graz, AT

    • In-person conference organized by Devopsdays.

    • The conference starts on 4 September 2026.

    • Apply here

  7. bit summit 2026 (closes in 20 days)

    The call for papers is open until 30 April 2026 (GMT-4). More info →

    • Location: Hamburg, DE

    • In-person conference organized by bit summit.

    • The conference starts on 23 September 2026.

    • Apply here

Thanks to our sponsors who make Kube Today possible

Find out more about being a sponsor →

More articles

Even more articles →