Kubernetes 1.35 Introduces Container Restart Feature for AI Workloads
The new "Restart All Containers" alpha feature in Kubernetes 1.35 enables efficient in-place restarts, potentially saving over $100,000 monthly on large GPU clusters.
Kubernetes 1.35 introduces "Restart All Containers," an alpha feature that restarts every container in a Pod in place, without deleting and recreating the Pod. For AI/ML workloads on large GPU clusters, where every minute of recovery leaves expensive accelerators idle, faster recovery from checkpoints could save over $100,000 per month.
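How much a cluster actually saves depends on its size, GPU pricing, and failure rate. The sketch below shows the arithmetic only; the function name and every input value are hypothetical and are not taken from the Kubernetes release.

```python
def monthly_idle_cost_saved(gpus, usd_per_gpu_hour, restarts_per_month,
                            minutes_saved_per_restart):
    """Estimate GPU spend avoided when each restart finishes sooner.

    Assumes every GPU in the job sits idle for the full recovery time.
    """
    hours_saved = restarts_per_month * minutes_saved_per_restart / 60
    return gpus * usd_per_gpu_hour * hours_saved


# Hypothetical inputs for illustration, not measured figures.
estimate = monthly_idle_cost_saved(
    gpus=1000,
    usd_per_gpu_hour=2.0,
    restarts_per_month=60,
    minutes_saved_per_restart=10,
)
print(f"${estimate:,.0f} per month")
```

Plugging in a cluster's own GPU count, hourly rate, and observed restart frequency gives a more useful number than any headline figure.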
The Problem with Pod Restarts
Until now, restarting every container in a Pod together has meant deleting and recreating the Pod. For GPU-intensive AI/ML workloads, this creates significant overhead:
- GPU resources must be released and reallocated
- Container images may need to be pulled again if the new Pod lands on a node that has not cached them
- Application state must be rebuilt from scratch
- Training jobs lose all progress made since the last checkpoint
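The last point is why checkpoint frequency matters regardless of how a restart happens. A minimal sketch of checkpoint-and-resume follows, assuming a JSON file stands in for a real model checkpoint; the helper names (`load_step`, `save_step`, `train`) and the simulated crash are illustrative only.

```python
import json
import os
import tempfile


def load_step(path):
    """Return the last checkpointed step, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_step(path, step):
    """Write to a temp file, then rename, so a crash mid-write
    does not leave a truncated checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def train(path, total_steps, checkpoint_every, crash_at=None):
    """Resume from the last checkpoint; optionally simulate a crash."""
    step = load_step(path)
    while step < total_steps:
        step += 1  # stand-in for one real training step
        if step == crash_at:
            raise RuntimeError(f"simulated crash at step {step}")
        if step % checkpoint_every == 0:
            save_step(path, step)
    return step


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt, total_steps=100, checkpoint_every=10, crash_at=47)
except RuntimeError:
    pass  # the container would exit here and be restarted

resumed_from = load_step(ckpt)  # work after this step was lost
final = train(ckpt, total_steps=100, checkpoint_every=10)
print(resumed_from, final)
```

The work lost per failure is bounded by the checkpoint interval; the time to get back to that checkpoint is the part a faster restart path can shorten.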
How Restart All Containers Helps
The new feature allows restarting all containers in a Pod without Pod recreation:
- GPU retention: GPU allocations remain attached to the Pod
- Faster recovery: Skip image pulling and resource allocation
- Checkpoint resume: Restart quickly from the last saved checkpoint
- Cost savings: Reduced idle time on expensive GPU resources
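To make this concrete, here is a sketch of a Pod manifest that opts in to the behavior, built as a Python dictionary and printed as JSON. It assumes the alpha API extends container restart rules with a `RestartAllContainers` action and that the feature gate is named `RestartAllContainersOnContainerExits`; both should be confirmed against the Kubernetes 1.35 documentation. The Pod name, image, and exit code are placeholders.

```python
import json

# Hypothetical exit code the training process uses to request a full restart.
RESTART_EXIT_CODE = 88

# Assumed alpha API shape; verify field names against the 1.35 docs.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "trainer"},  # placeholder name
    "spec": {
        "restartPolicy": "Never",
        "containers": [
            {
                "name": "train",
                "image": "registry.example/train:latest",  # placeholder image
                "restartPolicy": "Never",
                "restartPolicyRules": [
                    {
                        # Restart every container in the Pod, in place,
                        # when this container exits with the chosen code.
                        "action": "RestartAllContainers",
                        "exitCodes": {
                            "operator": "In",
                            "values": [RESTART_EXIT_CODE],
                        },
                    }
                ],
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ],
    },
}

manifest = json.dumps(pod, indent=2)
print(manifest)
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, on a test cluster where the alpha gate has been enabled.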
Dynamic Resource Allocation Beta
Kubernetes 1.35 also promotes Dynamic Resource Allocation (DRA) binding conditions to beta status. DRA improves handling of specialized hardware like GPUs, making it easier for developers to manage demanding AI/ML workloads.
The GPU-First Future
As GPUs become the default hardware for AI workloads, Kubernetes monitoring must evolve. Traditional resource metrics like CPU and memory usage say little about GPU-bound applications. The community is developing new observability patterns for GPU utilization, memory bandwidth, and tensor core usage.
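As a small illustration of what GPU-side signals look like, the sketch below parses the CSV that `nvidia-smi` prints for a utilization and memory query and flags underused devices. The sample readings and the 30% threshold are made up, and this covers only utilization and memory occupancy; memory bandwidth and tensor core activity need other tooling such as NVIDIA DCGM.

```python
import csv
import io

# The CSV shape below matches this nvidia-smi invocation:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
SAMPLE = "0, 97, 71234, 81920\n1, 12, 2048, 81920\n"  # made-up readings


def parse_gpu_stats(text):
    """Turn nvidia-smi CSV rows into dicts with percentage fields."""
    rows = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, util, used, total = row
        rows.append({
            "gpu": int(index),
            "util_pct": int(util),
            "mem_pct": round(100 * int(used) / int(total), 1),
        })
    return rows


def underused(rows, util_threshold=30):
    """Return indices of GPUs whose utilization is below the threshold."""
    return [r["gpu"] for r in rows if r["util_pct"] < util_threshold]


stats = parse_gpu_stats(SAMPLE)
idle_gpus = underused(stats)
print(stats)
print(idle_gpus)
```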
Adoption Considerations
As an alpha feature, Restart All Containers requires explicit opt-in and should be tested thoroughly before production use. Organizations running large GPU clusters for AI training should evaluate the potential cost savings against the risks of using pre-release functionality.