Kubectl Checkpoint #5091
Comments
Hi @adrianreber, please fill out the Discussion Link section of the issue, which indicates that you've spoken to a SIG about opening this KEP. Also, please identify the sponsoring SIG for this KEP. Thanks
/sig api-machinery
See #2008 for the corresponding kubelet changes.
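For context, the kubelet side from #2008 is already usable today: with the ContainerCheckpoint feature gate enabled, the kubelet serves POST /checkpoint/{namespace}/{pod}/{container}. Below is a minimal Go sketch of calling it directly; the node name, pod/container names, and client-certificate paths are placeholders, not real values.

```go
// Minimal sketch of checkpointing via the kubelet API from #2008.
// Assumes the ContainerCheckpoint feature gate is enabled; node name,
// pod/container names, and certificate paths are placeholders.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// The kubelet only serves this to an authorized client; here we
	// present a client certificate it trusts.
	cert, err := tls.LoadX509KeyPair("client.crt", "client.key")
	if err != nil {
		log.Fatal(err)
	}
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{
			Certificates: []tls.Certificate{cert},
			// Demo only; verify the kubelet serving certificate in real use.
			InsecureSkipVerify: true,
		},
	}}

	// POST /checkpoint/{namespace}/{pod}/{container} on the kubelet port.
	resp, err := client.Post(
		"https://node01:10250/checkpoint/default/mypod/mycontainer",
		"application/json", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// On success the kubelet responds with the path of the checkpoint
	// archive it wrote under its checkpoints directory.
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```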
Hello everyone. We at Weaversoft are developing Grus, a solution focused on container checkpointing, restoration, and migration within Kubernetes environments. Our use case benefits from an API-exposed checkpoint interface, as it enables automated and remote checkpoint management, which is crucial for seamless workload migration across clusters without direct pod access. We appreciate the ongoing discussions and would love to contribute further. Looking forward to feedback!
+1 for exposing the checkpoint API at the kube-apiserver.

At StackRox I've done an analysis of how we could potentially implement container checkpointing, and I'm sharing my results here. In my opinion, exposing the checkpoint API at the kube-apiserver is a feature that aligns with Kubernetes' architectural goals and enhances its extensibility. Below, I compare two potential approaches to enabling checkpointing functionality, highlighting why API exposure at the kube-apiserver is the preferred option.

Approaches:
If the API endpoint is not exposed, an agent is required to keep the latency low between the checkpoint API request and CRIU triggering the checkpoint on the node.

1) Third-party Agent Architectures

This approach involves deploying a custom checkpointing service that communicates with the kubelet on each node. A typical setup would include a per-node agent (for example, a privileged DaemonSet) that calls the kubelet's checkpoint endpoint, plus a component that routes incoming checkpoint requests to the agent on the correct node; a minimal sketch of such an agent follows.
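For illustration, here is a minimal sketch of such a per-node agent that forwards requests to the local kubelet's checkpoint endpoint. The agent's route, port, and environment variable names are invented for the example, and credential handling is reduced to a bearer token.

```go
// Illustrative per-node agent (the DaemonSet approach discussed above).
// It accepts checkpoint requests and forwards them to the local kubelet.
// Route, port, env var names, and auth handling are assumptions.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Client for the local kubelet API (port 10250).
	kubelet := &http.Client{Transport: &http.Transport{
		// Demo only; a real agent should verify the kubelet serving cert.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	nodeIP := os.Getenv("NODE_IP")      // typically injected via the downward API
	token := os.Getenv("KUBELET_TOKEN") // credentials the kubelet must authorize

	// The agent's own API, e.g. POST /checkpoint?ns=default&pod=p&container=c
	http.HandleFunc("/checkpoint", func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query()
		url := fmt.Sprintf("https://%s:10250/checkpoint/%s/%s/%s",
			nodeIP, q.Get("ns"), q.Get("pod"), q.Get("container"))
		req, err := http.NewRequest(http.MethodPost, url, nil)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := kubelet.Do(req)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body) // relay the kubelet's response (checkpoint archive path)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Every node has to run one of these, which is exactly the kubelet-credential surface, routing work, and resource overhead the concerns below describe.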
Concerns with Third-party Agent Architectures

a) Agent access to the kubelet API
Importantly, exposing the checkpointing functionality at the kube-apiserver would not result in any additional security trade-offs compared to these alternatives; instead, it would consolidate checkpointing access into a central, already-secured API layer.

b) Node discovery and routing complexity

c) Resource consumption and operational overhead
d) Maintaining upstream alignment

2) Direct API Exposure in kube-apiserver

Exposing the checkpoint API directly at the kube-apiserver provides a cleaner and more scalable solution, with clear benefits:

a) Simplified security model
b) Unified request routing
c) Reduced resource overhead
d) Enabling third-party integrations

Conclusion

I would love to see the checkpoint API at the kube-apiserver. It provides us with a robust and standardized way to integrate checkpointing functionality without introducing unnecessary complexity or resource overhead. By addressing the concerns of agent-based architectures, this approach makes checkpointing simpler, safer, and more maintainable for the Kubernetes ecosystem.
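To make option 2) concrete, here is a purely hypothetical client-go sketch of what a "checkpoint" pod subresource might look like if the kube-apiserver served one, analogous to pods/exec or pods/log. The subresource name and the "container" parameter are assumptions; no such API exists in Kubernetes today.

```go
// Purely hypothetical: invoking a "checkpoint" pod subresource on the
// kube-apiserver via client-go, analogous to pods/exec. The subresource
// name and parameter are assumptions; this API does not exist today.
package main

import (
	"context"
	"fmt"
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Standard RBAC would gate this as, e.g., "create" on pods/checkpoint,
	// which is the simplified security model argued for above.
	raw, err := cs.CoreV1().RESTClient().Post().
		Namespace("default").
		Resource("pods").
		Name("mypod").
		SubResource("checkpoint"). // hypothetical subresource
		Param("container", "mycontainer").
		Do(context.TODO()).
		Raw()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(raw))
}
```

A kubectl checkpoint sub-command, as proposed in this issue, could then be a thin wrapper over a call like this, with the apiserver proxying the request to the right kubelet.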
Hello, just wanted to voice our company's (beam cloud) support for this enhancement. Being able to checkpoint containers using kubectl/kube-apiserver would benefit us during development and in production. Due to spikes in demand and varying availability of GPUs, being able to easily migrate workloads to different nodes is highly desirable. In our opinion, it is a natural and worthwhile extension of existing functionality.
+1. We work in an academic environment, where resources are free of charge for users, which introduces some challenges. With a working checkpoint/restore mechanism at the API level, we could address the following use cases:

Resource Usage Rebalancing

With checkpoint/restore in place, we could seamlessly rebalance Pod allocation across nodes without disrupting computation. The process would involve checkpointing the computation, moving the Pod to a different node, and resuming the long-running task. A key use case is freeing up a node to run a high-load Pod (e.g., one requiring a large number of CPU cores). Currently, achieving this without evicting and destroying running Pods is difficult. Only a few applications support checkpointing on their own, and when a Pod is evicted or killed, previously consumed resources are wasted since it must restart from scratch.

Easier System Upgrades & Hardware Maintenance

Upgrading core system components, such as the kernel or Kubernetes daemon versions, or performing hardware maintenance (e.g., replacing a faulty memory module), typically requires rebooting nodes, which forces workloads to be terminated. With checkpoint/restore, workloads could be saved and later restored, avoiding unnecessary resource wastage and improving the overall user experience.

JupyterHub Notebooks Suspend & Resume

Jupyter notebooks provide checkpoints for code and outputs, but they do not save running processes or changes to the container image. We have already implemented a proof of concept (PoC) that enables full checkpointing/restoration of notebook images. However, using the Kubernetes API for checkpointing would be far more convenient than relying on the kubelet API, which requires running a privileged DaemonSet to interact with the kubelet directly.

Improved Infrastructure Reliability

Some of our workloads run for extremely long durations (e.g., 3 to 6 months). Checkpointing them periodically would enhance reliability by ensuring computations can eventually complete, even in the case of node failures or maintenance needs.
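As a sketch of that last use case, here is a small driver that checkpoints a long-running container on a fixed interval. The interval and all names are placeholders, and checkpointOnce is a stub for whichever mechanism is available (the kubelet endpoint today, or a kube-apiserver call if this KEP lands).

```go
// Sketch of periodic checkpointing for long-running workloads.
// checkpointOnce is a stub; interval and names are placeholders.
package main

import (
	"log"
	"time"
)

func checkpointOnce(namespace, pod, container string) error {
	// Stub: e.g. POST the kubelet's /checkpoint/{ns}/{pod}/{container}.
	log.Printf("checkpointing %s/%s (container %s)", namespace, pod, container)
	return nil
}

func main() {
	ticker := time.NewTicker(6 * time.Hour) // checkpoint every 6h (assumption)
	defer ticker.Stop()
	for range ticker.C {
		if err := checkpointOnce("hpc", "simulation-0", "solver"); err != nil {
			log.Printf("checkpoint failed, will retry next tick: %v", err)
		}
	}
}
```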
Enhancement Description

- One-line enhancement description: Add a kubectl checkpoint sub-command to allow users to checkpoint a container.
- KEP (k/enhancements) update PR(s):
- Code (k/k) update PR(s):
- Docs (k/website) update PR(s):