Kubernetes Device Failures: Causes &#038; Solutions &#8211; A Comprehensive Guide

socially assistive robotics supporting coverage of socially assistive robotics

– The AI/ML boom and its impact on Kubernetes

The rise of AI/ML workloads has brought new challenges to Kubernetes. These workloads often rely heavily on specialized hardware, and any device failure can significantly impact performance and lead to frustrating interruptions. As highlighted in the 2024 Llama paper, hardware issues, particularly GPU failures, are a major cause of disruption in AI/ML training. You can also learn how much effort NVIDIA spends on handling devices failures and maintenance in the KubeCon talk by Ryan Hallisey and Piotr Prokop All-Your-GPUs-Are-Belong-to-Us: An Inside Look at NVIDIA’s Self-Healing GeForce NOW Infrastructure (recording) as they see 19 remediation requests per 1000 nodes a day! We also see data centers offering spot consumption models and overcommit on power, making device failures commonplace and a part of the business model.

However, Kubernetes’s view on resources is still very static. The resource is either there or not. And if it is there, the assumption is that it will stay there fully functional – Kubernetes lacks good support for handling full or partial hardware failures. These long-existing assumptions combined with the overall complexity of a setup lead to a variety of failure modes, which we discuss here.

Understanding AI/ML workloads
The generally, all AI/ML workloads require specialized hardware, have challenging scheduling requirements, and are expensive when idle. AI/ML workloads typically fall into two categories – training and inference. Here is an oversimplified view of those categories’ characteristics, which are different from traditional workloads like web services:

Training

These workloads are resource-intensive, often consuming entire machines and running as gangs of pods. Training jobs are usually "run to completion" – but that could be days, weeks or even months. Any failure in a single pod can necessitate restarting the entire step across all the pods.

Inference

These workloads are usually long-running or run indefinitely, and can be small enough to consume a subset of a Node’s devices or large enough to span multiple nodes. They often require downloading huge files with the model weights.

These workload types specifically break many past assumptions:

Before	Now
Can get a better CPU and the app will work faster.	Require a specific device (or class of devices) to run.
When something doesn’t work, just recreate it.	Allocation or reallocation is expensive.
Any node will work. No need to coordinate between Pods.	Scheduled in a special way – devices often connected in a cross-node topology.
Container images are slim and easily available.	Container images may be so big that they require special handling.
Long initialization can be offset by slow rollout.	Initialization may be long and should be optimized, sometimes across many Pods together.
Compute nodes are commoditized and relatively inexpensive, so some idle time is acceptable.	Nodes with specialized hardware can be an order of magnitude more expensive than those without, so idle time is very wasteful.

The existing failure model was relying on old assumptions. It may still work for the new workload types, but it has limited knowledge about devices and is very expensive for them. In some cases, even prohibitively expensive. You will see more examples later in this article.

Why Kubernetes still reigns supreme
This article is not going deeper into the question: why not start fresh for AI/ML workloads since they are so different from the traditional Kubernetes workloads. Despite many challenges, Kubernetes remains the platform of choice for AI/ML workloads. Its maturity, security, and rich ecosystem of tools make it a compelling option. While alternatives exist, they often lack the years of development and refinement that Kubernetes offers. And the Kubernetes developers are actively addressing the gaps identified in this article and beyond.

The current state of device failure handling
This section outlines different failure modes and the best practices and DIY (Do-It-Yourself) solutions used today. The next session will describe a roadmap of improving things for those failure modes.

Failure modes: K8s infrastructure
In order to understand the failures related to the Kubernetes infrastructure, you need to understand how many moving parts are involved in scheduling a Pod on the node. The sequence of events when the Pod is scheduled in the Node is as follows:

Device plugin is scheduled on the Node
Device plugin is registered with the kubelet via local gRPC
Kubelet uses device plugin to watch for devices and updates capacity of the node
Scheduler places a user Pod on a Node based on the updated capacity
Kubelet asks Device plugin to Allocate devices for a User Pod
Kubelet creates a User Pod with the allocated devices attached to it

This diagram shows some of those actors involved:

As there are so many actors interconnected, every one of them and every connection may experience interruptions. This leads to many exceptional situations that are often considered failures, and may cause serious workload interruptions: