Kubectl Checkpoint #5091
Comments
Hi @adrianreber, please fill out the Discussion Link section of the issue, which indicates that you've spoken to a SIG about opening this KEP. Also, please identify the sponsoring SIG for this KEP. Thanks
/sig api-machinery
See #2008 for the corresponding kubelet changes.
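For context, the kubelet side from #2008 is already usable today: with the ContainerCheckpoint feature gate enabled, the kubelet serves POST /checkpoint/{namespace}/{pod}/{container}. Below is a minimal Go sketch of calling it directly; the node name, pod/container names, and client-certificate paths are placeholders, not real values.

```go
// Minimal sketch of checkpointing via the kubelet API from #2008.
// Assumes the ContainerCheckpoint feature gate is enabled; node name,
// pod/container names, and certificate paths are placeholders.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// The kubelet only serves this to an authorized client; here we
	// present a client certificate it trusts.
	cert, err := tls.LoadX509KeyPair("client.crt", "client.key")
	if err != nil {
		log.Fatal(err)
	}
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{
			Certificates: []tls.Certificate{cert},
			// Demo only; verify the kubelet serving certificate in real use.
			InsecureSkipVerify: true,
		},
	}}

	// POST /checkpoint/{namespace}/{pod}/{container} on the kubelet port.
	resp, err := client.Post(
		"https://node01:10250/checkpoint/default/mypod/mycontainer",
		"application/json", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// On success the kubelet responds with the path of the checkpoint
	// archive it wrote under its checkpoints directory.
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```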
Hello everyone. We at Weaversoft are developing Grus, a solution focused on container checkpointing, restoration, and migration within Kubernetes environments. Our use case benefits from an API-exposed checkpoint interface, as it enables automated and remote checkpoint management, which is crucial for seamless workload migration across clusters without direct pod access. We appreciate the ongoing discussions and would love to contribute further. Looking forward to feedback!
+1 for exposing the checkpoint API at the kube-apiserver.

At StackRox I've done an analysis of how we could potentially implement container checkpointing, and I'm sharing my results here. In my opinion, exposing the checkpoint API at the kube-apiserver is a feature that aligns with Kubernetes' architectural goals and enhances its extensibility. Below, I compare two potential approaches to enabling checkpointing functionality, highlighting why API exposure at the kube-apiserver is the preferred option.

Approaches:
If the API endpoint is not exposed, an agent is required to keep the latency low between the checkpoint API request and CRIU triggering the checkpoint on the node.

1) Third-party Agent Architectures

This approach involves deploying a custom checkpointing service that communicates with the kubelet on each node. A typical setup would include a per-node agent (for example, a privileged DaemonSet) that calls the kubelet's checkpoint endpoint, plus a component that routes incoming checkpoint requests to the agent on the correct node; a minimal sketch of such an agent follows.
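For illustration, here is a minimal sketch of such a per-node agent that forwards requests to the local kubelet's checkpoint endpoint. The agent's route, port, and environment variable names are invented for the example, and credential handling is reduced to a bearer token.

```go
// Illustrative per-node agent (the DaemonSet approach discussed above).
// It accepts checkpoint requests and forwards them to the local kubelet.
// Route, port, env var names, and auth handling are assumptions.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Client for the local kubelet API (port 10250).
	kubelet := &http.Client{Transport: &http.Transport{
		// Demo only; a real agent should verify the kubelet serving cert.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	nodeIP := os.Getenv("NODE_IP")      // typically injected via the downward API
	token := os.Getenv("KUBELET_TOKEN") // credentials the kubelet must authorize

	// The agent's own API, e.g. POST /checkpoint?ns=default&pod=p&container=c
	http.HandleFunc("/checkpoint", func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query()
		url := fmt.Sprintf("https://%s:10250/checkpoint/%s/%s/%s",
			nodeIP, q.Get("ns"), q.Get("pod"), q.Get("container"))
		req, err := http.NewRequest(http.MethodPost, url, nil)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := kubelet.Do(req)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body) // relay the kubelet's response (checkpoint archive path)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Every node has to run one of these, which is exactly the kubelet-credential surface, routing work, and resource overhead the concerns below describe.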
Concerns with Third-party Agent Architectures

a) Agent access to the kubelet API
Importantly, exposing the checkpointing functionality at the kube-apiserver would not result in any additional security trade-offs compared to these alternatives; instead, it would consolidate checkpointing access into a central, already-secured API layer.

b) Node discovery and routing complexity

c) Resource consumption and operational overhead
d) Maintaining upstream alignment

2) Direct API Exposure in kube-apiserver

Exposing the checkpoint API directly at the kube-apiserver provides a cleaner and more scalable solution, with clear benefits:

a) Simplified security model
b) Unified request routing
c) Reduced resource overhead
d) Enabling third-party integrations

Conclusion

I would love to see the checkpoint API at the kube-apiserver. It provides us with a robust and standardized way to integrate checkpointing functionality without introducing unnecessary complexity or resource overhead. By addressing the concerns of agent-based architectures, this approach makes checkpointing simpler, safer, and more maintainable for the Kubernetes ecosystem.
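To make option 2) concrete, here is a purely hypothetical client-go sketch of what a "checkpoint" pod subresource might look like if the kube-apiserver served one, analogous to pods/exec or pods/log. The subresource name and the "container" parameter are assumptions; no such API exists in Kubernetes today.

```go
// Purely hypothetical: invoking a "checkpoint" pod subresource on the
// kube-apiserver via client-go, analogous to pods/exec. The subresource
// name and parameter are assumptions; this API does not exist today.
package main

import (
	"context"
	"fmt"
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Standard RBAC would gate this as, e.g., "create" on pods/checkpoint,
	// which is the simplified security model argued for above.
	raw, err := cs.CoreV1().RESTClient().Post().
		Namespace("default").
		Resource("pods").
		Name("mypod").
		SubResource("checkpoint"). // hypothetical subresource
		Param("container", "mycontainer").
		Do(context.TODO()).
		Raw()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(raw))
}
```

A kubectl checkpoint sub-command, as proposed in this issue, could then be a thin wrapper over a call like this, with the apiserver proxying the request to the right kubelet.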
Hello, just wanted to voice our company's (beam cloud) support for this enhancement. Being able to checkpoint containers using kubectl/kube-apiserver would benefit us during development and in production. Due to spikes in demand and varying availability of GPUs, being able to easily migrate workloads to different nodes is highly desirable. In our opinion, it is a natural and worthwhile extension of existing functionality.
+1. We work in an academic environment, where resources are free of charge for users, which introduces some challenges. With a working checkpoint/restore mechanism at the API level, we could address the following use cases:

Resource Usage Rebalancing

With checkpoint/restore in place, we could seamlessly rebalance Pod allocation across nodes without disrupting computation. The process would involve checkpointing the computation, moving the Pod to a different node, and resuming the long-running task. A key use case is freeing up a node to run a high-load Pod (e.g., one requiring a large number of CPU cores). Currently, achieving this without evicting and destroying running Pods is difficult. Only a few applications support checkpointing on their own, and when a Pod is evicted or killed, previously consumed resources are wasted since it must restart from scratch.

Easier System Upgrades & Hardware Maintenance

Upgrading core system components, such as the kernel or Kubernetes daemon versions, or performing hardware maintenance (e.g., replacing a faulty memory module), typically requires rebooting nodes, which forces workloads to be terminated. With checkpoint/restore, workloads could be saved and later restored, avoiding unnecessary resource wastage and improving the overall user experience.

JupyterHub Notebooks Suspend & Resume

Jupyter notebooks provide checkpoints for code and outputs, but they do not save running processes or changes to the container image. We have already implemented a proof of concept (PoC) that enables full checkpointing/restoration of notebook images. However, using the Kubernetes API for checkpointing would be far more convenient than relying on the kubelet API, which requires running a privileged DaemonSet to interact with the kubelet directly.

Improved Infrastructure Reliability

Some of our workloads run for extremely long durations (e.g., 3 to 6 months). Checkpointing them periodically would enhance reliability by ensuring computations can eventually complete, even in the case of node failures or maintenance needs.
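As a sketch of that last use case, here is a small driver that checkpoints a long-running container on a fixed interval. The interval and all names are placeholders, and checkpointOnce is a stub for whichever mechanism is available (the kubelet endpoint today, or a kube-apiserver call if this KEP lands).

```go
// Sketch of periodic checkpointing for long-running workloads.
// checkpointOnce is a stub; interval and names are placeholders.
package main

import (
	"log"
	"time"
)

func checkpointOnce(namespace, pod, container string) error {
	// Stub: e.g. POST the kubelet's /checkpoint/{ns}/{pod}/{container}.
	log.Printf("checkpointing %s/%s (container %s)", namespace, pod, container)
	return nil
}

func main() {
	ticker := time.NewTicker(6 * time.Hour) // checkpoint every 6h (assumption)
	defer ticker.Stop()
	for range ticker.C {
		if err := checkpointOnce("hpc", "simulation-0", "solver"); err != nil {
			log.Printf("checkpoint failed, will retry next tick: %v", err)
		}
	}
}
```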
Enhancement Description

- One-line enhancement description: Add a kubectl checkpoint sub-command to allow users to checkpoint a container.
- KEP (k/enhancements) update PR(s):
- Code (k/k) update PR(s):
- Docs (k/website) update PR(s):