Learn Kubernetes Weekly issue 132 · 21 May 2025

OpenAI's Incident and Mitigation · policies saved us a thousand headaches · We're leaving Kubernetes · Reducing Pod Startup Time for Java

This newsletter is brought to you by Dagger — build software engineering workflows and environments with Dagger.

Articles

  1. An In-Depth Analysis of the OpenAI’s Incident and Mitigation Strategies

    midbai.com

    This post analyzes the root cause and mitigation strategies of OpenAI’s Dec 2024 outage, where a misconfigured telemetry service triggered API overload across clusters, crashing Kubernetes control planes and disrupting DNS.

  2. Agents in your software factory

    dagger.io

    How to build software engineering workflows — like code reviews and builds — with LLMs inside.

    sponsored

  3. Taming the wild west of research computing: how policies saved us a thousand headaches

    alessandropomponio.medium.com

    By leveraging Kyverno, Kueue, and Argo CD, IBM Research transformed chaotic GPU resource sharing into a policy-driven, fair computing environment—solving GPU hogging, scheduling conflicts, and administrative overhead in research computing.

  4. We're leaving Kubernetes

    gitpod.io

    This case study highlights the challenges of unpredictable resource usage, complex networking, and workload isolation in multi-tenant Kubernetes platforms.

  5. Resource management in Kubernetes

    medium.com

    This guide shows how to right-size Kubernetes pod resources (CPU, memory, ephemeral storage) using real Prometheus metrics, Go runtime tuning through environment variables (GOMAXPROCS, GOMEMLIMIT), and node-level capacity planning.

  6. Reducing Pod Startup Time for Java Application on EKS

    medium.com

    This article explains how to reduce pod cold-start time for Java apps on EKS.

    It covers in-place JVM boot optimization, image prefetching with AWS EventBridge and SSM, and low-priority pause pods that keep spare node capacity warm ahead of real autoscaling events.
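The Resource management in Kubernetes guide (item 5 above) mentions tuning GOMAXPROCS and GOMEMLIMIT for Go services. As a rough illustration only, and not the guide's own code, the Python sketch below turns a container's CPU and memory limits into candidate values for those two environment variables. Assumptions: the helper names are hypothetical, only a subset of Kubernetes quantity suffixes is parsed, GOMAXPROCS is the CPU limit rounded down (minimum 1), and GOMEMLIMIT is 90% of the memory limit to leave headroom for non-heap memory. The 90% figure is a common rule of thumb, not a number taken from the article.

```python
import math

# Quantity suffixes handled by this sketch: a subset of the Kubernetes
# resource.Quantity format (binary and decimal multiples only).
MEMORY_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_cpu(quantity: str) -> float:
    """Parse a CPU quantity such as "1500m" or "2" into cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Parse a memory quantity such as "512Mi" or "1G" into bytes."""
    # Check two-letter suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(MEMORY_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * MEMORY_SUFFIXES[suffix])
    return int(quantity)  # plain byte count


def go_runtime_env(cpu_limit: str, memory_limit: str, mem_fraction: float = 0.9) -> dict:
    """Suggest GOMAXPROCS and GOMEMLIMIT values for one container (hypothetical helper).

    GOMAXPROCS: the CPU limit rounded down, never below 1.
    GOMEMLIMIT: mem_fraction of the memory limit, in bytes (0.9 is an
    assumed headroom for non-heap memory, not a value from the article).
    """
    procs = max(1, math.floor(parse_cpu(cpu_limit)))
    soft_limit = int(parse_memory(memory_limit) * mem_fraction)
    return {"GOMAXPROCS": str(procs), "GOMEMLIMIT": str(soft_limit)}


if __name__ == "__main__":
    print(go_runtime_env("1500m", "512Mi"))
```

For a container limited to 1500m CPU and 512Mi memory, this suggests GOMAXPROCS=1 and GOMEMLIMIT=483183820 (bytes). Rounding down keeps the Go scheduler from running more threads in parallel than the CPU quota can serve, which is the throttling problem GOMAXPROCS tuning aims to avoid.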


Build your modern software factory

Define software delivery workflows and dev environments with reusable components — including LLMs — and run them anywhere.

Built by the creators of Docker.



Tutorials

  1. Linux container from Scratch

    michalpitr.substack.com

    This article guides you through using terminal commands to build a Linux container from the ground up.

  2. Mastering Compute Efficiency: Dynamic GPU Partitioning Strategies for Kubernetes-Based ML Systems

    medium.com

    This article explores three GPU sharing techniques—Time Slicing, Multi-Instance GPU (MIG), and Multi-Process Service (MPS)—to enhance GPU utilization in Kubernetes-managed machine learning workloads.

  3. Standardizing App Delivery with Flux and Generic Helm Charts

    medium.com

    This tutorial explains how Flux and Generic Helm Charts standardize Kubernetes app delivery using reusable tech-specific charts, automated OCI deployments, and Kustomize for environment customization.

  4. flux2-multi-tenancy: Automated Tenant Onboarding with Flux and Kyverno

    github.com/fluxcd

    flux2-multi-tenancy provides GitOps templates and Kyverno policies to automate tenant onboarding.

    It provisions namespaces, RBAC, and policy controls in Kubernetes using pull requests, enabling secure multi-tenant cluster management from Git.

  5. Rewriting Docker image registries with Kyverno

    blog.oponomarov.com

    This article shows how to use Kyverno policies and Helm to rewrite container image registry URLs at admission time for every container type in a pod (containers, initContainers, and ephemeralContainers).

    Because the mutation happens at admission, teams can migrate to a new registry namespace by namespace without editing every manifest.
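Kyverno expresses registry rewriting as a mutate policy written in YAML, and the article's own policies are not reproduced here. To illustrate just the transformation such a policy performs, the Python sketch below rewrites every image in containers, initContainers, and ephemeralContainers from one registry to another. The function names and registry hostnames are hypothetical, and registry detection follows the common Docker convention: the first path segment is a registry only if it contains a "." or ":" or is "localhost".

```python
import copy

# Pod spec fields that hold containers; the rewrite must cover all of them.
CONTAINER_FIELDS = ("containers", "initContainers", "ephemeralContainers")


def registry_of(image: str) -> str:
    """Return the registry host of an image reference.

    Docker convention: the first path segment is a registry only if it
    contains '.' or ':' or is 'localhost'; otherwise the image lives on docker.io.
    """
    first, sep, _ = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def rewrite_image(image: str, old: str, new: str) -> str:
    """Point an image at registry `new` if it currently comes from `old`."""
    if registry_of(image) != old:
        return image  # different registry: leave untouched
    first, sep, rest = image.partition("/")
    if sep and first == old:
        return f"{new}/{rest}"  # explicit prefix, e.g. docker.io/bitnami/redis
    # Implicit Docker Hub reference: bare names like "nginx:1.27" are
    # official images that live under the library/ namespace.
    path = image if sep else f"library/{image}"
    return f"{new}/{path}"


def mutate_pod_spec(spec: dict, old: str, new: str) -> dict:
    """Return a copy of a pod spec with every container image rewritten."""
    patched = copy.deepcopy(spec)
    for field in CONTAINER_FIELDS:
        for container in patched.get(field, []):
            container["image"] = rewrite_image(container["image"], old, new)
    return patched
```

With old="docker.io" and new="mirror.example.com", a container using nginx:1.27 becomes mirror.example.com/library/nginx:1.27, while an init container pulling from ghcr.io passes through unchanged. The original spec is deep-copied, so the function has no side effects on its input.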

More tutorials:

Managing 100s of Kubernetes Clusters using Cluster API

Discover how to manage Kubernetes at scale with declarative infrastructure and automation principles.

Zain Malik shares his experience managing multi-tenant Kubernetes clusters with up to 30,000 pods across clusters capped at 950 nodes. He explains how his team transitioned from Terraform to Cluster API for declarative cluster lifecycle management, contributing upstream to improve AKS support while implementing GitOps workflows.

You will learn:

  • How to address challenges in large-scale Kubernetes operations, including node pool management inconsistencies and lengthy provisioning times
  • Why Cluster API provides a powerful foundation for multi-cloud cluster management, and how to extend it with custom operators for production-specific needs
  • How implementing GitOps principles eliminates manual intervention in critical operations like cluster upgrades
  • Strategies for handling production incidents and bugs when adopting emerging technologies like Cluster API

Kubernetes jobs

    • Site Reliability Engineer with CoW DAO

    • Salary: €90K to €120K a year

    • Location: remote from Europe

    • Tech stack: Kubernetes, AWS, Flux, Docker, Go, Python, Rust, PostgreSQL, Elasticsearch, Pulumi

    • Data Engineer with Chartbeat

    • Salary: $128K to $147K a year

    • Location: remote from the United States

    • Tech stack: Kubernetes, Python, PostgreSQL, Snowflake, Kafka

    • Software Engineer with Crusoe

    • Salary: $245K to $290K a year

    • Location: hybrid, based in San Francisco, CA, USA

    • Tech stack: Kubernetes, Go, Java, Rust, C++, C, Ceph, Terraform, Ansible, Puppet

    • Platform Engineer with Lyft

    • Salary: CA$108K to CA$135K a year

    • Location: hybrid, based in Toronto, ON, CA

    • Tech stack: Kubernetes, AWS, Docker, Go, Python, Kafka, Terraform, CloudFormation, Ansible, Puppet

    • Test Automation Engineer with Palo Alto Networks

    • Salary: $104K to $185.5K a year

    • Location: based in the office in Santa Clara, CA, USA

    • Tech stack: Kubernetes, AWS, Azure, GCP, Docker, Python, JavaScript, GitLab

Discover more Kubernetes jobs on Kube Careers →

Code & tools

  1. Coroot: Observability Platform

    coroot.com

    Coroot is an eBPF-powered observability tool that maps service dependencies, request paths, errors, and latency in real time without code changes or sidecars.

  2. Dagger: runtime for composable workflows

    github.com/dagger

    Dagger is an open-source runtime for composable workflows.

    It's perfect for systems with many moving parts and a strong need for repeatability, modularity, observability and cross-platform support.

    sponsored

  3. Khronoscope: Time Travel for Troubleshooting and Debugging

    github.com/hoyle1974

    Khronoscope snapshots your cluster's resource states in memory and lets you inspect changes over time with VCR-like controls.

    Without persistent storage or agent overhead, you can view logs, rewind crashes, and trace dependencies across namespaces.

  4. Kilo: WireGuard network overlay

    github.com/squat

    Kilo is a multi-cloud network overlay built on WireGuard and designed for Kubernetes.

  5. Kraken registry

    github.com/uber

    Kraken is a P2P-powered Docker registry that focuses on scalability and availability.

    It is designed for Docker image management, replication, and distribution in a hybrid cloud environment.
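Khronoscope (item 3 above) is described only at a high level here, so the following is a generic sketch of the idea behind in-memory time-travel debugging, not Khronoscope's implementation or API: keep time-ordered snapshots of a resource and answer "what did it look like at time t?" with a binary search. The Timeline class and its methods are hypothetical.

```python
import bisect
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Timeline:
    """Hypothetical in-memory history of one resource's state, ordered by time."""

    times: list = field(default_factory=list)
    states: list = field(default_factory=list)

    def record(self, t: float, state: dict) -> None:
        """Append a snapshot; timestamps must arrive in non-decreasing order."""
        if self.times and t < self.times[-1]:
            raise ValueError("snapshots must be recorded in time order")
        self.times.append(t)
        self.states.append(dict(state))  # shallow copy so later edits don't leak in

    def at(self, t: float) -> Optional[dict]:
        """Return the latest snapshot taken at or before time t, or None."""
        i = bisect.bisect_right(self.times, t)
        return self.states[i - 1] if i else None

    def changes(self, t0: float, t1: float) -> dict:
        """Map each key whose value differs between t0 and t1 to (before, after)."""
        before, after = self.at(t0) or {}, self.at(t1) or {}
        keys = before.keys() | after.keys()
        return {
            k: (before.get(k), after.get(k))
            for k in keys
            if before.get(k) != after.get(k)
        }
```

Because snapshots are appended in time order, "rewinding" to any instant is an O(log n) lookup, while memory grows with every recorded state. That is the trade-off a storage-free, in-memory design accepts.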


Subscribe to Learn Kubernetes Weekly

Trusted by 77K engineers. Delivered 150 issues and counting.


Upcoming Kubernetes events

  1. May

    21

    Cloud Native, the Hard Way: Mistakes from Our VM to Kubernetes Journey

    In-person meetup organized by Cloud Native Vilnius.

    • Location: Vilnius, LT

    • This is a free event.

  2. May

    22

    Kubernetes Community Days Seoul 2025

    In-person conference organized by KCD South Korea.

    • Location: Seoul, KR

    • This event requires an entrance fee.

  3. May

    22

    On-Prem Kubernetes at Scale with metal-stack.io & AI Workloads on Kubernetes

    Online & in-person meetup organized by Cloud Native Night Munich.

    • Location: München, DE and virtual

    • This is a free event.

  4. May

    23

    Kubernetes Community Days Istanbul 2025

    In-person conference organized by KCD Istanbul.

    • Location: İstanbul, TR

    • This event requires an entrance fee.

  5. May

    24

    Mission O11y Possible: Panic at the Pod

    In-person meetup organized by Cloud Native Noida.

    • Location: Noida, IN

    • This is a free event.

  6. Jun

    26

    Advanced Kubernetes course

    Online workshop organized by Learnk8s.

    • This is a virtual event.

    • This event requires an entrance fee.

Discover more Kubernetes events on Kube Events →

Thanks to our sponsors who make Kube Today possible

  • LearnKube
  • Akamai
  • Fairwinds
  • Densify
Find out more about being a sponsor →

Kubernetes call for papers

  1. expired

    Kubernetes Community Days Washington DC 2025

    The call for papers was open until 26 May 2025 (UTC).

    • Location: Washington, D.C., USA

    • In-person conference organized by KCD Washington DC.

    • The conference starts on 16 September 2025.

  2. expired

    Cloud Native Days Austria

    The call for papers was open until 31 May 2025 (UTC).

    • Location: Vienna, AT

    • In-person conference organized by CNDA Austria.

    • The conference starts on 8 October 2025.

  3. expired

    KubeCon + CloudNativeCon North America 2025

    The call for papers was open until 28 May 2025 (UTC).

    • Location: Atlanta, GA, USA

    • In-person conference organized by Linux Foundation.

    • The conference starts on 10 November 2025.

  4. expired

    Cloud Native Denmark 2025

    The call for papers was open until 16 June 2025 (UTC).

    • Location: Aarhus, DK

    • In-person conference organized by CND.

    • The conference starts on 17 April 2025.

  5. expired

    Kubernetes Community Days Porto 2025

    The call for papers was open until 30 June 2025 (UTC).

    • Location: Porto, PT

    • In-person conference organized by KCD Porto.

    • The conference starts on 4 November 2025.

  6. expired

    Kubernetes Community Days Warsaw 2025

    The call for papers was open until 16 June 2025 (UTC).

    • Location: Warsaw, PL

    • In-person conference organized by KCD Warsaw.

    • The conference starts on 9 October 2025.

  7. expired

    Texas Linux Festival 2025

    The call for papers was open until 3 August 2025 (UTC).

    • Location: Austin, TX, USA

    • In-person conference organized by TXLF.

    • The conference starts on 4 October 2025.

  8. expired

    Devopsdays Tel Aviv

    The call for papers was open until 15 June 2025 (UTC).

    • Location: Tel Aviv, IL

    • In-person conference organized by Devopsdays.

    • The conference starts on 11 December 2025.

  9. expired

    Open Source Summit Japan 2025

    The call for papers was open until 4 August 2025 (UTC).

    • Location: Tokyo, JP

    • In-person conference organized by Linux Foundation.

    • The conference starts on 10 December 2025.

Until next time!

— Dan
