fix: when all replicas of a deployment are on one node, restart the deployment instead of evicting it #1685
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: andyblog. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Hi @andyblog. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I like this idea.
One of the drawbacks is: … So we'd better introduce a feature flag, defaulting to false.
Pull Request Test Coverage Report for Build 11236341113
💛 - Coveralls
/ok-to-test
@yxxhero: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message. In response to this: "/ok-to-test"
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
if deployment.Spec.Template.Annotations == nil {
    deployment.Spec.Template.Annotations = make(map[string]string)
}
restartedNode, exists := deployment.Spec.Template.Annotations["kubectl.kubernetes.io/restartedNode"]
Why did we choose the name kubectl.kubernetes.io/restartedNode instead of something like x.karpenter.sh/xxx?
It is OK to use "karpenter.sh/restartedNode". I was following the Kubernetes restart convention, but switching to a Karpenter-style key makes more sense here. Can you review and point out which other parts are unreasonable and need changes? I will fix them together.
Sure, I will take a look asap.
Otherwise LGTM.
@@ -49,6 +49,8 @@ const (
	NodePoolHashAnnotationKey                  = apis.Group + "/nodepool-hash"
	NodePoolHashVersionAnnotationKey           = apis.Group + "/nodepool-hash-version"
	NodeClaimTerminationTimestampAnnotationKey = apis.Group + "/nodeclaim-termination-timestamp"
	// When a deployment is restarted, this annotation is used to mark which node was terminated and restarted.
	DeploymentRestartNodeAnnotationKey         = apis.Group + "/restart-node"
How about DeploymentDrainedPolicyAnnotationKey = apis.Group + "/drained-policy"? It's about node draining.
return fmt.Errorf("get deployment and drain pod from node: %w", err)

var drainPods []*corev1.Pod
var restartDeployments []*appsv1.Deployment
deletionDeadline := node.GetDeletionTimestamp().Add(5 * time.Minute)
Could this time be made configurable? For spot instances, a 5-minute default is too long.
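One way to do that, as a minimal sketch: read the deadline from a hypothetical RESTART_DRAIN_DEADLINE environment variable (invented for illustration, not part of this PR). It assumes the node is already terminating, so DeletionTimestamp is set, as at the call site above.

```go
package drain

import (
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// drainDeadline computes the deadline after which remaining pods are
// evicted instead of waiting on the rolling restart. The
// RESTART_DRAIN_DEADLINE variable is invented for illustration.
func drainDeadline(node *corev1.Node) time.Time {
	d := 5 * time.Minute // the PR's hard-coded default
	if raw := os.Getenv("RESTART_DRAIN_DEADLINE"); raw != "" {
		if parsed, err := time.ParseDuration(raw); err == nil {
			d = parsed // e.g. "90s" for spot capacity
		}
	}
	// Assumes DeletionTimestamp is non-nil, as in the code under review.
	return node.GetDeletionTimestamp().Add(d)
}
```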
    deployment.Spec.Template.Annotations = make(map[string]string)
}
restartedNode, exists := deployment.Spec.Template.Annotations[v1.DeploymentRestartNodeAnnotationKey]
if exists && restartedNode == nodeName {
Why skip if restartedNode == nodeName? And wouldn't another annotation value be better, like true?
Why not take the annotation from deployment.Annotations?
Updating deployment.Spec.Template.Annotations restarts the deployment, but updating deployment.Annotations does not trigger a restart.
Here deployment.Spec.Template.Annotations is used as a marker of whether the deployment has already been restarted: if the annotation already holds this nodeName, the previous Reconcile restarted the deployment, so it is skipped this time.
Testing shows this works. Could you check whether the approach is suitable?
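For context, a minimal sketch of the restart-and-skip pattern described above, using a controller-runtime client. The restartOnce helper and the literal key are illustrative: the key mirrors this PR's DeploymentRestartNodeAnnotationKey (apis.Group is "karpenter.sh"), and the mechanism is the same one kubectl rollout restart relies on with kubectl.kubernetes.io/restartedAt.

```go
package restart

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Mirrors the PR's DeploymentRestartNodeAnnotationKey.
const restartNodeKey = "karpenter.sh/restart-node"

// restartOnce sets an annotation on the pod template (not on the
// Deployment's own metadata), which changes the template and makes the
// Deployment controller perform a rolling restart. The annotation value
// doubles as an idempotency marker.
func restartOnce(ctx context.Context, c client.Client, d *appsv1.Deployment, nodeName string) error {
	if d.Spec.Template.Annotations == nil {
		d.Spec.Template.Annotations = map[string]string{}
	}
	// If the template already records this node, a previous Reconcile
	// restarted the Deployment; skip to avoid restarting it twice.
	if d.Spec.Template.Annotations[restartNodeKey] == nodeName {
		return nil
	}
	d.Spec.Template.Annotations[restartNodeKey] = nodeName
	return c.Update(ctx, d)
}
```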
if deployment != nil {
    key := deployment.Namespace + "/" + deployment.Name
    if nodeDeploymentReplicas[key] >= *deployment.Spec.Replicas {
        // If a deployment has multiple pods on this node, it will appear here multiple times, so deduplication is required.
If the deployment is in the middle of scaling up, this check may be wrong. How about checking the ready replicas instead?
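A sketch of that suggestion, with allReplicasOnNode as a hypothetical helper (not code from this PR): it compares the pods counted on the node against status.ReadyReplicas instead of spec.Replicas, so a Deployment whose spec was just raised but whose new pods are not yet placed is not misclassified.

```go
package check

import appsv1 "k8s.io/api/apps/v1"

// allReplicasOnNode reports whether every ready replica of the
// Deployment sits on the node being terminated.
func allReplicasOnNode(d *appsv1.Deployment, podsOnNode int32) bool {
	if d.Status.ReadyReplicas == 0 {
		return false // nothing is serving traffic anywhere; restarting cannot help
	}
	return podsOnNode >= d.Status.ReadyReplicas
}
```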
// Once the restart begins, the number of replicas of the deployment on this node will gradually decrease.
// This situation needs to be handled separately.
t.RLock()
_, exists := t.nodeRestartDeployments[nodeName][key]
Managing nodeRestartDeployments involves quite a bit of code. Maybe it's time for a refactor?
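One possible shape for such a refactor, sketched as a small synchronized set type so the lock handling lives in one place instead of being repeated at every call site (the type and method names are invented, not from this PR):

```go
package tracker

import "sync"

// restartTracker records which deployments have been restarted per node,
// keyed by nodeName -> "namespace/name".
type restartTracker struct {
	mu     sync.RWMutex
	byNode map[string]map[string]struct{}
}

// Mark records that a deployment on the node has been restarted.
func (t *restartTracker) Mark(nodeName, key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.byNode == nil {
		t.byNode = map[string]map[string]struct{}{}
	}
	if t.byNode[nodeName] == nil {
		t.byNode[nodeName] = map[string]struct{}{}
	}
	t.byNode[nodeName][key] = struct{}{}
}

// Has reports whether the deployment was already restarted for the node.
func (t *restartTracker) Has(nodeName, key string) bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	_, ok := t.byNode[nodeName][key]
	return ok
}
```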
t.Unlock()

deployment.Spec.Template.Annotations[v1.DeploymentRestartNodeAnnotationKey] = nodeName
if err := t.kubeClient.Update(ctx, deployment); err != nil {
We've discussed this in the past. This is a significant expansion to Karpenter's threat model, as Karpenter will now have permission to mutate every deployment in the cluster. Previously, we've discussed needing a fix upstream to support "surge" eviction.
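To make the threat-model point concrete: expressed as a client-go rbacv1 rule, Karpenter's ClusterRole would need roughly the following cluster-wide write access (an illustration of the objection, not a manifest from this PR):

```go
package rbacsketch

import rbacv1 "k8s.io/api/rbac/v1"

// deploymentWriteRule is the extra permission this approach implies:
// mutate access to every Deployment in the cluster.
var deploymentWriteRule = rbacv1.PolicyRule{
	APIGroups: []string{"apps"},
	Resources: []string{"deployments"},
	Verbs:     []string{"get", "list", "watch", "update", "patch"},
}
```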
Indeed, restarting deployments would give Karpenter too many permissions.
Is upstream support for "surge" eviction likely to be available in the near term?
I'm also wondering if this is better suited for a standardized upstream termination flow, rather than implementing this behavior in Karpenter. Effectively we'll be doing a lot more checks, and querying pods/deployments across the cluster for every pod we're terminating, but only doing something different in a smaller subset of cases, where all pods in the deployment live on the node. It's nice to let the deployment controller handle the rollouts, but I'm not sure the juice is worth the squeeze here. If you feel confident about the change and its drawbacks, can you write up an RFC detailing the trade-offs so I can better understand?
This situation should be more common in non-production environments, such as test and pre-release environments, which are usually used for functional verification. Most services there run a single replica. If service interruption time can be minimized when a node is terminated, the developer experience will be much better. Currently there seems to be no better way to solve this problem than restarting the deployment. What about making it a feature that is turned off by default? Our company is currently hitting this problem, and other companies likely hit it as well. I will try to write an RFC so we can discuss the advantages, disadvantages, and feasibility.
I'm preparing a cluster for dev/staging and facing this same issue. We have deployments with … If Karpenter should not alter Deployments/ReplicaSets, maybe an alternative is to use a PodDisruptionBudget and have Karpenter just create an annotation on the node for that case: "I want to disrupt this node but I can't". That would allow an external controller to do: …
The new pods would not be scheduled to the same node; once they are Ready, the old ones get deleted, the PDB is respected, and Karpenter would be able to disrupt the node on its next check. I guess that would require minimal changes to this project. It could even be implemented by the user from a manifest in the Karpenter documentation, referencing another GitHub project. A hypothetical sketch of such a controller follows.
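This sketch uses an invented annotation key (Karpenter does not set such an annotation today) and assumes the Deployments on the blocked node have already been discovered elsewhere, via Pod -> ReplicaSet -> Deployment owner references. It triggers the same rolling restart kubectl rollout restart performs.

```go
package external

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Invented for illustration; not an annotation Karpenter sets today.
const wantsDisruptionAnnotation = "example.com/karpenter-wants-disruption"

// restartDeploymentsOnBlockedNode reacts to a node marked "wants to
// disrupt but cannot" and rolling-restarts each affected Deployment, so
// new pods surge onto other nodes before the old ones are deleted and
// the PDB stays satisfied.
func restartDeploymentsOnBlockedNode(ctx context.Context, c client.Client, node *corev1.Node, deployments []*appsv1.Deployment) error {
	if _, ok := node.Annotations[wantsDisruptionAnnotation]; !ok {
		return nil // node is not blocked; nothing to do
	}
	for _, d := range deployments {
		if d.Spec.Template.Annotations == nil {
			d.Spec.Template.Annotations = map[string]string{}
		}
		// Same template-annotation trick kubectl rollout restart uses.
		d.Spec.Template.Annotations["kubectl.kubernetes.io/restartedAt"] = time.Now().Format(time.RFC3339)
		if err := c.Update(ctx, d); err != nil {
			return err
		}
	}
	return nil
}
```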
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
Fixes #1674
Description
When all replicas of a Deployment are on the same node (for example, a Deployment with 2 pods, both on the node being terminated), both pods are evicted when the node is terminated. From the time the 2 pods are evicted until they are recreated and running successfully on a new node, the Deployment has no pods to serve traffic.
This also happens when a Deployment has only one replica.
During eviction, a check is made: if all replicas of the Deployment are on this node, or the Deployment has only one replica, restarting the Deployment is more graceful than evicting its pods. The restart first creates a pod on a new node, waits for it to run successfully, and only then terminates the old pod, which reduces service interruption time.
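Why the restart avoids the outage: under a RollingUpdate strategy, the Deployment controller surges the replacement pod first and deletes the old one only after the new one is ready. Expressed in Go types (values chosen for illustration, not taken from this PR):

```go
package strategy

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/utils/ptr"
)

// zeroDowntimeStrategy guarantees create-before-delete ordering during
// the restart triggered above.
var zeroDowntimeStrategy = appsv1.DeploymentStrategy{
	Type: appsv1.RollingUpdateDeploymentStrategyType,
	RollingUpdate: &appsv1.RollingUpdateDeployment{
		MaxSurge:       ptr.To(intstr.FromInt32(1)),     // create the new pod first
		MaxUnavailable: ptr.To(intstr.FromString("0%")), // never drop below desired replicas
	},
}
```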
How was this change tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.