
Kubernetes 1.35 Introduces Container Restart Feature for AI Workloads

The new "Restart All Containers" alpha feature in Kubernetes 1.35 enables efficient in-place restarts, potentially saving over $100,000 monthly on large GPU clusters.

TechDrop Editorial

Kubernetes 1.35 introduces "Restart All Containers," a new alpha feature that enables efficient in-place restarts of all containers within a Pod. For AI/ML workloads running on large GPU clusters, this could save over $100,000 per month by allowing fast recovery from checkpoints instead of full Pod recreation.

The Problem with Pod Restarts

Traditional Kubernetes recovery requires deleting and recreating a Pod whenever its containers need restarting. For GPU-intensive AI/ML workloads, this creates significant overhead (a sketch of the delete-and-recreate flow follows the list):

  • GPU resources must be released and reallocated
  • Container images must be pulled again
  • Application state must be rebuilt from scratch
  • Training jobs lose progress since last checkpoint
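To make that overhead concrete, here is a minimal sketch of the traditional delete-and-recreate path using the official Kubernetes Python client. The namespace, Pod name, and label selector are placeholders, not part of any real workload.

```python
# Sketch of the traditional recovery path: delete the failed Pod and wait for
# its replacement (recreated by a Job or Deployment controller) to become Ready.
# Names are placeholders; requires the official `kubernetes` Python client.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "ml-training"        # placeholder namespace
FAILED_POD = "trainer-abc123"    # placeholder Pod name

# 1. Delete the failed Pod. Its GPUs are released, the image cache may be cold
#    on the next node, and the training process loses all in-memory state.
v1.delete_namespaced_pod(name=FAILED_POD, namespace=NAMESPACE)

# 2. Wait for the controller to schedule a replacement and for it to become
#    Ready. On large GPU nodes this can take minutes (scheduling, image pull,
#    initialization) before training can even begin restoring its checkpoint.
w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace=NAMESPACE,
                      label_selector="job-name=trainer", timeout_seconds=600):
    pod = event["object"]
    conditions = pod.status.conditions or []
    if any(c.type == "Ready" and c.status == "True" for c in conditions):
        print(f"replacement pod {pod.metadata.name} is Ready")
        w.stop()
```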

How Restart All Containers Helps

The new feature allows restarting all containers in a Pod without recreating the Pod (a checkpoint-resume sketch follows the list):

  • GPU retention: GPU allocations remain attached to the Pod
  • Faster recovery: Skip image pulling and resource allocation
  • Checkpoint resume: Quickly restart from last saved checkpoint
  • Cost savings: Reduced idle time on expensive GPU resources
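The savings come from pairing in-place restarts with frequent checkpointing: when the containers come back up, the training process reloads the newest checkpoint instead of starting from step zero. Below is a minimal, framework-agnostic sketch of that resume logic; the checkpoint directory and file format are illustrative assumptions, not part of the Kubernetes feature.

```python
# Minimal checkpoint-resume loop. After an in-place container restart, the
# process starts again, finds the latest checkpoint on the Pod's volume, and
# resumes from that step instead of step 0. Paths and formats are illustrative.
import json
from pathlib import Path

CKPT_DIR = Path("/checkpoints")   # assumed to be a volume that survives restarts
CKPT_DIR.mkdir(parents=True, exist_ok=True)

def latest_checkpoint() -> dict | None:
    ckpts = sorted(CKPT_DIR.glob("step-*.json"))   # zero-padded names sort correctly
    return json.loads(ckpts[-1].read_text()) if ckpts else None

def save_checkpoint(step: int, state: dict) -> None:
    (CKPT_DIR / f"step-{step:08d}.json").write_text(json.dumps({"step": step, **state}))

state = latest_checkpoint() or {"step": 0, "loss": None}
start_step = state["step"]
print(f"resuming from step {start_step}")

for step in range(start_step, 10_000):
    # ... one training step would go here ...
    if step % 500 == 0:
        save_checkpoint(step, {"loss": 0.0})   # placeholder metrics
```

In practice the checkpoint directory would sit on storage that outlives the container restart, such as an emptyDir or persistent volume mounted into the Pod.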

Dynamic Resource Allocation Beta

Kubernetes 1.35 also promotes Dynamic Resource Allocation (DRA) binding conditions to beta status. DRA improves handling of specialized hardware like GPUs, making it easier for developers to manage demanding AI/ML workloads.
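For context, a DRA-based GPU request flows through a ResourceClaim rather than the classic nvidia.com/gpu extended resource. The sketch below creates such a claim with the Kubernetes Python dynamic client; the device class name and namespace are placeholders, and the exact schema depends on which resource.k8s.io API version your cluster serves (the v1beta1 form is assumed here).

```python
# Rough sketch: create a DRA ResourceClaim for one GPU via the dynamic client.
# Device class and namespace are placeholders; the ResourceClaim schema varies
# by resource.k8s.io API version, so treat this as illustrative only.
from kubernetes import config, dynamic
from kubernetes.client import api_client

dyn = dynamic.DynamicClient(api_client.ApiClient(configuration=config.load_kube_config()))
claims = dyn.resources.get(api_version="resource.k8s.io/v1beta1", kind="ResourceClaim")

claim = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaim",
    "metadata": {"name": "trainer-gpu", "namespace": "ml-training"},
    "spec": {
        "devices": {
            "requests": [
                # deviceClassName refers to a DeviceClass installed by the GPU
                # vendor's DRA driver; "gpu.example.com" is purely illustrative.
                {"name": "gpu", "deviceClassName": "gpu.example.com"}
            ]
        }
    },
}
claims.create(body=claim, namespace="ml-training")
```

A Pod then references the claim by name under spec.resourceClaims; the binding-conditions work now in beta lets the scheduler hold off binding the Pod to a node until the claimed device is actually ready to attach.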

The GPU-First Future

As GPUs become the default hardware for AI workloads, Kubernetes monitoring must evolve. Traditional "golden signals" like CPU and memory usage are less relevant for GPU-bound applications. The community is developing new observability patterns for GPU utilization, memory bandwidth, and tensor core usage.
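As a rough illustration of what GPU-centric signals look like, the sketch below samples per-device utilization and memory through NVIDIA's NVML bindings (pynvml); deeper counters such as tensor-core activity and memory bandwidth typically come from DCGM or the profiling APIs and are not shown.

```python
# Sample per-GPU utilization and memory via NVML (pip install pynvml).
# These are basic GPU "golden signals"; tensor-core activity and memory
# bandwidth need DCGM or the profiling APIs instead.
import time
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    for _ in range(3):                      # three samples, one per second
        for i in range(count):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(h)   # .gpu / .memory in %
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)          # bytes
            print(f"gpu{i}: util={util.gpu}% mem={mem.used / mem.total:.0%}")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```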

Adoption Considerations

As an alpha feature, Restart All Containers is disabled by default and requires explicit opt-in via its feature gate; it should be tested thoroughly before any production use. Organizations running large GPU clusters for AI training should weigh the potential cost savings against the risks of relying on pre-release functionality.
