Jack Lindamood
This case study shows how the Linux OOM killer terminated a critical network daemon on Kubernetes nodes, causing a network outage.
It covers debugging via serial console and implementing memory reservations to prevent system-critical process termination.
Kalyan Josyula
This case study shows how a team traced repeated pod OOM kills in ASP.NET Core to native memory growth from zombie SignalR connections, glibc fragmentation, and kernel socket buffers.
Nick Roan
This case study shows how a single RAG chunk size change collapsed vLLM prefix-cache hit rate from 85% to 4%, triggering an 80% GPU replica increase while latency stayed flat.
It also includes the fix: adding a two-phase cache replay gate in CI.
Dat Ton
This case study explains how cURL 65 errors and DNS resolution failures on AWS EKS were traced to Linux kernel network limits being exceeded, and how raising the netdev_budget, netdev_budget_usecs, and netdev_max_backlog parameters resolved them.
Matt Camp
This case study shows how Unitary built Osmia, an open-source orchestration layer on EKS, to run autonomous AI coding agents safely at scale using pod isolation, Karpenter, IRSA-based secrets, and real-time trajectory scoring.
More Case Studies
Aditya Suryawanshi
This is a war story about a 3-person startup that replaced a $14,850/month over-engineered Kubernetes setup on AWS with Fly.io for $680, cutting P99 latency from 320ms to 180ms and deploy time from 8 minutes to 45 seconds.
Ejiroghene Laurel Dafe
This case study shows how one engineer resolved two real Kubernetes production incidents involving an overly aggressive Ingress rate limit and Istio breaking non-HTTP socket traffic.
Maxim Nazarenko
This case study explains how to migrate bound Kubernetes volumes from deprecated in-tree Azure Disk provisioning to CSI with in-place PVC re-binding, minimal restarts, and no data loss across production disks.
DV Engineering
This case study shows how DoubleVerify built a Kubernetes and Ray serving platform to deploy and scale ML models in production.
It also covers RayService wrapped with Helm, fault tolerance with external Redis, and platform gains like 30% lower GPU cost.
Varun Arora
This case study shows how to build a centralized multi-account AWS monitoring platform that manages 25+ accounts, using Python Boto3 to fetch resource configurations into MongoDB, with a Flask API and a Next.js frontend, achieving $30k in annual savings.