Spotlight

GPU Starvation in Kubernetes: How Dynamic MIG Partitioning Saved Our GPU Budget

Sphoorthi Charan Nayakudugari

This case study explains how the authors used dynamic MIG partitioning to split large GPUs such as the NVIDIA A100 and H100 into multiple isolated slices, letting many small jobs share a single GPU efficiently.

More articles →

Tools and utilities

  • KubeDiagrams

    KubeDiagrams is a tool that automatically generates visual architecture diagrams from Kubernetes manifests, Helm charts, and live clusters.

  • lazydocker: docker TUI

    lazydocker is a terminal UI for managing Docker containers and services, with log and metric graph viewing, container attachment, and execution of common Docker commands.

  • Radius: application platform

    Radius bridges developer and operator workflows by enabling cloud-neutral application deployment with infrastructure recipes and Dapr integration that simplify building, configuring, and managing modern cloud applications across platforms.

  • Sloth: Prometheus SLO Generator

    Sloth generates Prometheus Service Level Objectives with reliable SLI recording rules and multi-window, multi-burn-rate alerts from simple YAML specs.

  • Dynamo: distributed LLM inference

    NVIDIA Dynamo is a datacenter-scale distributed LLM inference framework supporting disaggregated prefill/decode, KV-aware routing, and dynamic GPU scheduling across vLLM, SGLang, and TensorRT-LLM.

More projects →

Events starting soon

Discover more events on Kube Events →

How We Cut Build Debugging Time by 75% with AI

Build failures in Kubernetes CI/CD pipelines are a silent productivity killer. Developers spend 45+ minutes scrolling through cryptic logs, often just hitting rerun and hoping for the best.

Ron Matsliah, DevOps engineer at Next Insurance, built an AI-powered assistant that cut build debugging time by 75% — not as a dashboard, but delivered directly in Slack where developers already work.

In this episode:

  • Why combining deterministic rules with AI produces better results than letting an LLM guess alone
  • How correlating Kubernetes events with build logs catches spot instance terminations that produce misleading errors
  • Why integrating into existing workflows and building feedback loops from day one drove adoption
  • The prompt engineering lessons learned from testing with real production data instead of synthetic examples

The takeaway: simple rules plus rich context consistently outperform complex AI queries on their own.

Learn from production

More case studies →

Matching jobs

    • Data Engineer with Kasada

    • Salary: USD 0 to USD 412.61K a year

    • Location: based in the office in Sydney, AU

    • Tech stack: Kubernetes, AWS, Docker, Java, Python, Scala, SQL, Kafka, Airflow, Pulumi

    • Data Engineer with SoFi

    • Salary: $54K to $286K a year

    • Location: based in the office in San Francisco, CA, USA

    • Tech stack: Kubernetes, Docker, SQL, Python, Snowflake, Terraform, Datadog

    • DevOps Engineer with Accesa & RaRo

    • Salary: $115.96K to $255.42K a year

    • Location: remote from

    • Tech stack: Kubernetes, AWS, Azure, GCP, OpenShift, Docker, Terraform

    • DevOps Engineer with Egen

    • Salary: $49.5K to $539K a year

    • Location: remote from

    • Tech stack: Kubernetes, GCP, Helm, Docker, Shell, PostgreSQL, MySQL, Terraform, Azure DevOps, Jenkins

    • DevOps Engineer with HavocAI

    • Salary: $49.5K to $539K a year

    • Location: fully remote

    • Tech stack: Kubernetes, AWS, Docker, Go, Python, Terraform

Discover more Kubernetes jobs on Kube Careers →

Subscribe to Learn Kubernetes Weekly

Trusted by 77K engineers. Delivered 175 issues and counting.


Build something

More tutorials →

Call for Papers closing soon

  1. 3 days

    Cloud Native Telco Day Europe

    The Call for Papers is open until 23 March 2026 (GMT-4). More info →
    • Location: Amsterdam, NL

    • In-person conference organized by CNCF.

    • The conference starts on 23 March 2026.

    • Apply here
  2. 3 days

    Cloud Native AI + Kubeflow Day Europe

    The Call for Papers is open until 23 March 2026 (GMT-4). More info →
    • Location: Amsterdam, NL

    • In-person conference organized by CNCF.

    • The conference starts on 23 March 2026.

    • Apply here
  3. 3 days

    Cloud Native 2026

    The Call for Papers is open until 23 March 2026 (GMT-4). More info →
    • This is a virtual event

    • Online conference organized by Conf42.

    • The conference starts on 23 April 2026.

    • Apply here
  4. 3 days

    DevDays 2026

    The Call for Papers is open until 23 March 2026 (GMT-4). More info →
    • Location: Iași, RO

    • In-person conference organized by DevDays Conf.

    • The conference starts on 23 September 2026.

    • Apply here
  5. 5 days

    Kubernetes Community Days New York 2026

    The Call for Papers is open until 25 March 2026 (GMT-4). More info →
    • Location: New York, NY, USA

    • In-person conference organized by KCD New York.

    • The conference starts on 10 June 2026.

    • Apply here
  6. 6 days

    Data on Kubernetes Day

    The Call for Papers is open until 26 March 2026 (GMT-4). More info →
    • Location: Amsterdam, NL

    • In-person conference organized by CNCF.

    • The conference starts on 26 March 2026.

    • Apply here
  7. 7 days

    DeveloperWeek New York 2026

    The Call for Papers is open until 27 March 2026 (GMT-4). More info →
    • Location: New York, NY, USA

    • In-person conference organized by DeveloperWeek New York.

    • The conference starts on 10 June 2026.

    • Apply here

Thanks to our sponsors who make Kube Today possible

Find out more about being a sponsor →

More articles

Even more articles →