
ZDNET Highlights
- This program ensures that users can migrate AI workloads between Kubernetes distributions.
- Kubernetes will support rollback, so administrators can return to a working cluster if an upgrade goes wrong.
- Several other improvements will make Kubernetes even more suited for AI workloads.
More than a decade ago, there were many alternatives to Kubernetes for container orchestration. Today, unless you have been into cloud-native computing for a long time, you would be hard pressed to name any of them. This is because Kubernetes was clearly the best choice.
At the time, containers, thanks to Docker, were the new technology. Fast forward a decade, and the technology everyone is working on is AI. To that end, the Cloud Native Computing Foundation (CNCF) launched the Certified Kubernetes AI Conformance Program (CKACP) at KubeCon North America 2025 in Atlanta as a standardized way to deploy AI workloads on Kubernetes clusters.
A secure, universal platform for AI workloads
The goal of CKACP is to create a community-defined, open standard for running AI workloads consistently and reliably across diverse Kubernetes environments.
Chris Aniszczyk, CTO of CNCF, said, "This conformance program will create shared standards to ensure that AI workloads behave predictably across environments. It is based on the same successful, community-driven process we have used with Kubernetes, which helped bring consistency to more than 100 certified Kubernetes systems, as AI adoption scales."
Specifically, this initiative is designed to:
- Ensure portability and interoperability for AI and machine learning (ML) workloads across public clouds, private infrastructure, and hybrid environments, allowing organizations to avoid vendor lock-in while moving AI workloads wherever they are needed.
- Reduce fragmentation by setting a shared baseline of capabilities and configuration that the platform should support, making it easier for enterprises to adopt and scale AI with confidence on Kubernetes.
- Give vendors and open-source contributors a clear goal for compliance to ensure that their technologies work together and support production-ready AI deployments.
- Enable end users to innovate faster, with the assurance that certified platforms have implemented best practices for resource management, GPU integration, and key AI infrastructure needs, tested and validated by CNCF.
- Foster a trusted, open ecosystem for AI development, where standards make it possible to efficiently scale, optimize, and manage AI workloads as use grows across industries.
In short, this initiative focuses on providing both enterprises and vendors with a common, tested framework to ensure that AI runs reliably, securely, and efficiently on any certified Kubernetes platform.
If this approach sounds familiar, well, it should, as it is based on CNCF's successful Certified Kubernetes Conformance Program. Thanks to that 2017 program, if you are not happy with, say, Red Hat OpenShift, you can pick up your containerized workloads and move them to Mirantis Kubernetes Engine or Amazon Elastic Kubernetes Service without worrying about incompatibilities. This portability, in turn, is why Kubernetes is the foundation of many hybrid clouds.
With 58% of organizations already running AI workloads on Kubernetes, CNCF’s new program is expected to significantly streamline teams’ deployment, management, and innovation in AI. By offering common test benchmarks, reference architectures, and validated integrations for GPU and accelerator support, the program aims to make AI infrastructure more robust and secure in multi-vendor, multi-cloud environments.
As Jago MacLeod, Kubernetes and GKE engineering director at Google Cloud, said at KubeCon, "At Google Cloud, we've certified for Kubernetes AI conformance because we believe stability and portability are essential to scaling AI. By aligning early to this standard, we're making it easier for developers and enterprises to build AI applications that are production-ready, portable, and efficient, without having to reinvent the infrastructure for every deployment."
Understanding Kubernetes improvements
That wasn't the only thing MacLeod had to say about the future of Kubernetes. Google and CNCF have other plans for the market-leading container orchestrator. Key improvements coming include rollback support, the ability to skip updates, and new low-level controls for GPUs and other AI-specific hardware.
In his keynote, MacLeod explained that, for the first time, Kubernetes users have a reliable minor-version rollback feature. This feature means that a cluster can be safely returned to a known-good state after an upgrade. This capability eliminates the long-standing "one-way street" problem of Kubernetes control-plane upgrades. Rollback will greatly reduce the risk of adopting critical new features or urgent security patches.
Along with this improvement, Kubernetes users can now skip specific updates. This approach gives administrators greater flexibility and control when planning version migrations or reacting to production incidents.
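The upgrade semantics described above can be sketched as a simple state machine. This is illustrative Python, not a real Kubernetes API: the class and method names are hypothetical, and a real control plane would track far more state than a version string.

```python
# Illustrative sketch of minor-version rollback and update skipping.
# NOT a real Kubernetes API; all names and logic here are hypothetical.

class ClusterVersionManager:
    def __init__(self, version: str):
        self.version = version               # currently running control-plane version
        self.known_good: str | None = None   # last version confirmed healthy

    def upgrade(self, target: str, skip: set[str] = frozenset()) -> str:
        """Upgrade to target, remembering the current version as a
        rollback point, and refusing versions the admin chose to skip."""
        if target in skip:
            raise ValueError(f"version {target} is marked as skipped")
        self.known_good = self.version       # record the known-good state
        self.version = target
        return self.version

    def rollback(self) -> str:
        """Return the cluster to the last known-good version."""
        if self.known_good is None:
            raise RuntimeError("no known-good version recorded")
        self.version = self.known_good
        self.known_good = None
        return self.version
```

The key idea is that the control plane snapshots a known-good state before upgrading, so a failed upgrade is no longer a one-way street.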
In addition to CKACP, Kubernetes is being fundamentally reimagined to support AI workload demands. This support means Kubernetes will give users detailed control over hardware such as GPUs, TPUs, and custom accelerators. This capability also addresses the huge diversity and scale requirements of modern AI hardware.
Additionally, new APIs and open-source features, such as Agent Sandbox and multi-tier checkpointing, were announced as part of the program. These features will further accelerate inference, training, and agentic AI operations within the cluster. Innovations such as node-level resource allocation, dynamic GPU provisioning, and scheduler optimization for AI hardware are becoming foundational for both researchers and enterprises running multi-tenant clusters.
Agent Sandbox is an open-source framework and controller that enables the management of isolated, secure environments, also known as sandboxes, designed to run stateful, singleton workloads, such as autonomous AI agents, code interpreters, and development tools. The main features of Agent Sandbox are:
- Isolation and security: Each sandbox is strongly isolated at both the kernel and network levels using technologies such as gVisor or Kata Containers. It is therefore safe to run untrusted code (for example, code generated by large language models) without compromising the integrity of the host system or cluster.
- Declarative API: Users can declare sandbox environments and templates using Kubernetes-native resources (Sandbox, SandboxTemplate, SandboxClaim), enabling rapid, repeatable creation and management of individual instances.
- Scale and performance: Agent Sandbox supports thousands of concurrent, stateful sandboxes with fast, on-demand provisioning. This capability is well suited to AI agent workloads, code execution, and persistent developer environments.
- Snapshot and recovery: On Google Kubernetes Engine (GKE), Agent Sandbox can use pod snapshots for fast checkpointing, hibernation, and instant restart, dramatically reducing startup latency and optimizing resource utilization for AI workloads.
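To make the declarative model above concrete, here is a sketch of what a Sandbox manifest might look like, built as a Python dict for illustration. The API group, version, and field names are assumptions, not the project's actual schema; consult the Agent Sandbox documentation for the real CRD definitions.

```python
# Hypothetical Sandbox manifest, built as a Python dict for illustration.
# The apiVersion and field names are assumptions based on the resource
# kinds named above; the real Agent Sandbox CRD schema may differ.
import json

def make_sandbox_manifest(name: str, template: str) -> dict:
    """Build a Kubernetes-style manifest for an illustrative Sandbox resource."""
    return {
        "apiVersion": "agents.example.io/v1alpha1",  # illustrative group/version
        "kind": "Sandbox",
        "metadata": {"name": name},
        "spec": {
            # Reference a SandboxTemplate that would define the isolation
            # runtime (e.g., gVisor) and resource limits for the sandbox.
            "templateRef": {"name": template},
        },
    }

manifest = make_sandbox_manifest("llm-code-runner", "python-interpreter")
print(json.dumps(manifest, indent=2))
```

The point of the declarative pattern is that each sandbox is just another Kubernetes resource, so it can be templated, claimed, and garbage-collected like any other object in the cluster.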
Today, multi-tier checkpointing in Kubernetes is mainly available on GKE. In the future, this mechanism will enable reliable storage and management of checkpoints during the training of large-scale ML models.
Here’s a quick explanation of how multi-tier checkpointing works:
- Multiple storage tiers: Checkpoints are first stored in fast, local storage (such as an in-memory volume or a local disk on the node) for quick access and fast recovery.
- Replication across nodes: Checkpoint data is replicated to peer nodes in the cluster to protect against node failures.
- Persistent cloud storage backup: Periodically, checkpoints are backed up to durable cloud storage to provide a reliable fallback in case of cluster-wide failures or when local copies are unavailable.
- Systematic management: The system automates checkpoint saving, replication, backup, and restoration, minimizing manual intervention during training.
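The tiered flow above can be sketched in a few lines of Python. This is a toy model, not GKE's implementation: the tiers are plain dicts standing in for local volumes, peer replicas, and cloud object storage.

```python
# Illustrative sketch of the multi-tier checkpoint flow described above.
# Tiers are modeled as in-memory dicts; a real system would use node-local
# storage, peer-to-peer replication, and durable object storage.

class MultiTierCheckpointer:
    def __init__(self, cloud_backup_every: int = 3):
        self.local = {}   # tier 1: fast node-local storage
        self.peer = {}    # tier 2: replica held on a peer node
        self.cloud = {}   # tier 3: durable cloud object storage
        self.cloud_backup_every = cloud_backup_every

    def save(self, step: int, state: bytes) -> None:
        self.local[step] = state              # write to fast local storage first
        self.peer[step] = state               # replicate to a peer node
        if step % self.cloud_backup_every == 0:
            self.cloud[step] = state          # periodic durable backup

    def restore(self, step: int) -> bytes:
        # Prefer the fastest tier that still holds the checkpoint.
        for tier in (self.local, self.peer, self.cloud):
            if step in tier:
                return tier[step]
        raise KeyError(f"no checkpoint found for step {step}")
```

Restoring walks the tiers from fastest to most durable, which is why a training job can usually resume from local storage in seconds and only falls back to cloud storage after a cluster-wide failure.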
The advantage for AI and ML workloads is that multi-tier checkpointing enables training to be quickly restarted from the last checkpoint without losing significant progress. Because checkpoints are safely stored and replicated, the mechanism also provides fault tolerance, protecting training jobs from repeated interruptions.
Apart from all this, multi-tier checkpointing provides scalability by supporting large distributed training jobs running on thousands of nodes. Finally, this feature works with major AI frameworks, such as JAX and PyTorch, and integrates with their checkpointing mechanisms.
With rollback, selective update skipping, and production-grade AI hardware management, Kubernetes is ready to power the world’s most demanding AI and enterprise platforms. The launch of the Kubernetes AI Conformance Program by CNCF is further strengthening the ecosystem’s role in setting the standards for interoperability, reliability, and performance for the near future of cloud-native AI.
The first decade of Kubernetes was about moving IT from bare metal and virtual machines (VMs) to containers. Its next decade will be defined by the ability to manage AI at planetary scale by providing security, speed, and resiliency for a new class of workloads.

