
Kubernetes 1.35 Introduces Container Restart Feature for AI Workloads

The new "Restart All Containers" alpha feature in Kubernetes 1.35 enables efficient in-place restarts, potentially saving over $100,000 monthly on large GPU clusters.

TechDrop Editorial

Kubernetes 1.35 introduces "Restart All Containers," a new alpha feature that enables efficient in-place restarts of all containers within a Pod. For AI/ML workloads running on large GPU clusters, this could save over $100,000 per month by allowing fast recovery from checkpoints instead of full Pod recreation.

The Problem with Pod Restarts

Traditional Kubernetes recovery requires deleting and recreating a Pod whenever its containers need restarting. For GPU-intensive AI/ML workloads, this creates significant overhead (a sketch of the delete-and-recreate flow follows the list):

  • GPU resources must be released and reallocated
  • Container images must be pulled again
  • Application state must be rebuilt from scratch
  • Training jobs lose progress since last checkpoint
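To make that overhead concrete, here is a minimal sketch of the traditional delete-and-recreate path using the official Kubernetes Python client. The namespace, Pod name, and label selector are placeholders, not part of any real workload.

```python
# Sketch of the traditional recovery path: delete the failed Pod and wait for
# its replacement (recreated by a Job or Deployment controller) to become Ready.
# Names are placeholders; requires the official `kubernetes` Python client.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "ml-training"        # placeholder namespace
FAILED_POD = "trainer-abc123"    # placeholder Pod name

# 1. Delete the failed Pod. Its GPUs are released, the image cache may be cold
#    on the next node, and the training process loses all in-memory state.
v1.delete_namespaced_pod(name=FAILED_POD, namespace=NAMESPACE)

# 2. Wait for the controller to schedule a replacement and for it to become
#    Ready. On large GPU nodes this can take minutes (scheduling, image pull,
#    initialization) before training can even begin restoring its checkpoint.
w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace=NAMESPACE,
                      label_selector="job-name=trainer", timeout_seconds=600):
    pod = event["object"]
    conditions = pod.status.conditions or []
    if any(c.type == "Ready" and c.status == "True" for c in conditions):
        print(f"replacement pod {pod.metadata.name} is Ready")
        w.stop()
```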

How Restart All Containers Helps

The new feature allows restarting all containers in a Pod without recreating the Pod (a checkpoint-resume sketch follows the list):

  • GPU retention: GPU allocations remain attached to the Pod
  • Faster recovery: Skip image pulling and resource allocation
  • Checkpoint resume: Quickly restart from last saved checkpoint
  • Cost savings: Reduced idle time on expensive GPU resources
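The savings come from pairing in-place restarts with frequent checkpointing: when the containers come back up, the training process reloads the newest checkpoint instead of starting from step zero. Below is a minimal, framework-agnostic sketch of that resume logic; the checkpoint directory and file format are illustrative assumptions, not part of the Kubernetes feature.

```python
# Minimal checkpoint-resume loop. After an in-place container restart, the
# process starts again, finds the latest checkpoint on the Pod's volume, and
# resumes from that step instead of step 0. Paths and formats are illustrative.
import json
from pathlib import Path

CKPT_DIR = Path("/checkpoints")   # assumed to be a volume that survives restarts
CKPT_DIR.mkdir(parents=True, exist_ok=True)

def latest_checkpoint() -> dict | None:
    ckpts = sorted(CKPT_DIR.glob("step-*.json"))   # zero-padded names sort correctly
    return json.loads(ckpts[-1].read_text()) if ckpts else None

def save_checkpoint(step: int, state: dict) -> None:
    (CKPT_DIR / f"step-{step:08d}.json").write_text(json.dumps({"step": step, **state}))

state = latest_checkpoint() or {"step": 0, "loss": None}
start_step = state["step"]
print(f"resuming from step {start_step}")

for step in range(start_step, 10_000):
    # ... one training step would go here ...
    if step % 500 == 0:
        save_checkpoint(step, {"loss": 0.0})   # placeholder metrics
```

In practice the checkpoint directory would sit on storage that outlives the container restart, such as an emptyDir or persistent volume mounted into the Pod.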

Dynamic Resource Allocation Beta

Kubernetes 1.35 also promotes Dynamic Resource Allocation (DRA) binding conditions to beta status. DRA improves handling of specialized hardware like GPUs, making it easier for developers to manage demanding AI/ML workloads.
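For context, a DRA-based GPU request flows through a ResourceClaim rather than the classic nvidia.com/gpu extended resource. The sketch below creates such a claim with the Kubernetes Python dynamic client; the device class name and namespace are placeholders, and the exact schema depends on which resource.k8s.io API version your cluster serves (the v1beta1 form is assumed here).

```python
# Rough sketch: create a DRA ResourceClaim for one GPU via the dynamic client.
# Device class and namespace are placeholders; the ResourceClaim schema varies
# by resource.k8s.io API version, so treat this as illustrative only.
from kubernetes import config, dynamic
from kubernetes.client import api_client

dyn = dynamic.DynamicClient(api_client.ApiClient(configuration=config.load_kube_config()))
claims = dyn.resources.get(api_version="resource.k8s.io/v1beta1", kind="ResourceClaim")

claim = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaim",
    "metadata": {"name": "trainer-gpu", "namespace": "ml-training"},
    "spec": {
        "devices": {
            "requests": [
                # deviceClassName refers to a DeviceClass installed by the GPU
                # vendor's DRA driver; "gpu.example.com" is purely illustrative.
                {"name": "gpu", "deviceClassName": "gpu.example.com"}
            ]
        }
    },
}
claims.create(body=claim, namespace="ml-training")
```

A Pod then references the claim by name under spec.resourceClaims; the binding-conditions work now in beta lets the scheduler hold off binding the Pod to a node until the claimed device is actually ready to attach.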

The GPU-First Future

As GPUs become the default hardware for AI workloads, Kubernetes monitoring must evolve. Traditional "golden signals" like CPU and memory usage are less relevant for GPU-bound applications. The community is developing new observability patterns for GPU utilization, memory bandwidth, and tensor core usage.
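As a rough illustration of what GPU-centric signals look like, the sketch below samples per-device utilization and memory through NVIDIA's NVML bindings (pynvml); deeper counters such as tensor-core activity and memory bandwidth typically come from DCGM or the profiling APIs and are not shown.

```python
# Sample per-GPU utilization and memory via NVML (pip install pynvml).
# These are basic GPU "golden signals"; tensor-core activity and memory
# bandwidth need DCGM or the profiling APIs instead.
import time
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    for _ in range(3):                      # three samples, one per second
        for i in range(count):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(h)   # .gpu / .memory in %
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)          # bytes
            print(f"gpu{i}: util={util.gpu}% mem={mem.used / mem.total:.0%}")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```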

Adoption Considerations

As an alpha feature, Restart All Containers is disabled by default and requires explicit opt-in via its feature gate; it should be tested thoroughly before any production use. Organizations running large GPU clusters for AI training should weigh the potential cost savings against the risks of relying on pre-release functionality.
