fix: when all replicas of a deployment are on one node, restart the deployment instead of evicting it #1685
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: andyblog. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Hi @andyblog. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I like this idea.
One of the drawbacks is: … So we'd better introduce a feature flag, defaulting to false.
Pull Request Test Coverage Report for Build 11236341113
💛 - Coveralls
/ok-to-test
@yxxhero: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message. In response to this: "/ok-to-test"
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
if deployment.Spec.Template.Annotations == nil {
    deployment.Spec.Template.Annotations = make(map[string]string)
}
restartedNode, exists := deployment.Spec.Template.Annotations["kubectl.kubernetes.io/restartedNode"]
Why did we choose the name kubectl.kubernetes.io/restartedNode instead of something like x.karpenter.sh/xxx?
It is OK to use "karpenter.sh/restartedNode". I was following the Kubernetes restart convention, but switching to a Karpenter-style key makes more sense here. Can you review and point out which other parts are unreasonable and need changes? I will fix them together.
Sure, I will take a look asap.
Otherwise LGTM.
@@ -49,6 +49,8 @@ const (
	NodePoolHashAnnotationKey                  = apis.Group + "/nodepool-hash"
	NodePoolHashVersionAnnotationKey           = apis.Group + "/nodepool-hash-version"
	NodeClaimTerminationTimestampAnnotationKey = apis.Group + "/nodeclaim-termination-timestamp"
	// When a deployment is restarted, this annotation is used to mark which node was terminated and restarted.
	DeploymentRestartNodeAnnotationKey         = apis.Group + "/restart-node"
How about DeploymentDrainedPolicyAnnotationKey = apis.Group + "/drained-policy"? It's about node draining.
return fmt.Errorf("get deployment and drain pod from node: %w", err)

var drainPods []*corev1.Pod
var restartDeployments []*appsv1.Deployment
deletionDeadline := node.GetDeletionTimestamp().Add(5 * time.Minute)
Could this time be made configurable? For spot instances, a 5-minute default is too long.
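One way to do that, as a minimal sketch: read the deadline from a hypothetical RESTART_DRAIN_DEADLINE environment variable (invented for illustration, not part of this PR). It assumes the node is already terminating, so DeletionTimestamp is set, as at the call site above.

```go
package drain

import (
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// drainDeadline computes the deadline after which remaining pods are
// evicted instead of waiting on the rolling restart. The
// RESTART_DRAIN_DEADLINE variable is invented for illustration.
func drainDeadline(node *corev1.Node) time.Time {
	d := 5 * time.Minute // the PR's hard-coded default
	if raw := os.Getenv("RESTART_DRAIN_DEADLINE"); raw != "" {
		if parsed, err := time.ParseDuration(raw); err == nil {
			d = parsed // e.g. "90s" for spot capacity
		}
	}
	// Assumes DeletionTimestamp is non-nil, as in the code under review.
	return node.GetDeletionTimestamp().Add(d)
}
```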
    deployment.Spec.Template.Annotations = make(map[string]string)
}
restartedNode, exists := deployment.Spec.Template.Annotations[v1.DeploymentRestartNodeAnnotationKey]
if exists && restartedNode == nodeName {
Why skip if restartedNode == nodeName? And wouldn't another annotation value be better, like true?
Why not take the annotation from deployment.Annotations?
Updating deployment.Spec.Template.Annotations restarts the deployment, but updating deployment.Annotations does not trigger a restart.
Here deployment.Spec.Template.Annotations is used as a marker of whether the deployment has already been restarted: if the annotation already holds this nodeName, the previous Reconcile restarted the deployment, so it is skipped this time.
Testing shows this works. Could you check whether the approach is suitable?
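For context, a minimal sketch of the restart-and-skip pattern described above, using a controller-runtime client. The restartOnce helper and the literal key are illustrative: the key mirrors this PR's DeploymentRestartNodeAnnotationKey (apis.Group is "karpenter.sh"), and the mechanism is the same one kubectl rollout restart relies on with kubectl.kubernetes.io/restartedAt.

```go
package restart

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Mirrors the PR's DeploymentRestartNodeAnnotationKey.
const restartNodeKey = "karpenter.sh/restart-node"

// restartOnce sets an annotation on the pod template (not on the
// Deployment's own metadata), which changes the template and makes the
// Deployment controller perform a rolling restart. The annotation value
// doubles as an idempotency marker.
func restartOnce(ctx context.Context, c client.Client, d *appsv1.Deployment, nodeName string) error {
	if d.Spec.Template.Annotations == nil {
		d.Spec.Template.Annotations = map[string]string{}
	}
	// If the template already records this node, a previous Reconcile
	// restarted the Deployment; skip to avoid restarting it twice.
	if d.Spec.Template.Annotations[restartNodeKey] == nodeName {
		return nil
	}
	d.Spec.Template.Annotations[restartNodeKey] = nodeName
	return c.Update(ctx, d)
}
```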
if deployment != nil {
    key := deployment.Namespace + "/" + deployment.Name
    if nodeDeploymentReplicas[key] >= *deployment.Spec.Replicas {
        // If a deployment has multiple pods on this node, it will appear here multiple times, so deduplication is required.
If the deployment is in the middle of scaling up, this check may be wrong. How about checking the ready replicas instead?
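A sketch of that suggestion, with allReplicasOnNode as a hypothetical helper (not code from this PR): it compares the pods counted on the node against status.ReadyReplicas instead of spec.Replicas, so a Deployment whose spec was just raised but whose new pods are not yet placed is not misclassified.

```go
package check

import appsv1 "k8s.io/api/apps/v1"

// allReplicasOnNode reports whether every ready replica of the
// Deployment sits on the node being terminated.
func allReplicasOnNode(d *appsv1.Deployment, podsOnNode int32) bool {
	if d.Status.ReadyReplicas == 0 {
		return false // nothing is serving traffic anywhere; restarting cannot help
	}
	return podsOnNode >= d.Status.ReadyReplicas
}
```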
// Once the restart begins, the number of replicas of the deployment on this node will gradually decrease.
// This situation needs to be handled separately.
t.RLock()
_, exists := t.nodeRestartDeployments[nodeName][key]
Managing nodeRestartDeployments involves quite a bit of code. Maybe it's time for a refactor?
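One possible shape for such a refactor, sketched as a small synchronized set type so the lock handling lives in one place instead of being repeated at every call site (the type and method names are invented, not from this PR):

```go
package tracker

import "sync"

// restartTracker records which deployments have been restarted per node,
// keyed by nodeName -> "namespace/name".
type restartTracker struct {
	mu     sync.RWMutex
	byNode map[string]map[string]struct{}
}

// Mark records that a deployment on the node has been restarted.
func (t *restartTracker) Mark(nodeName, key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.byNode == nil {
		t.byNode = map[string]map[string]struct{}{}
	}
	if t.byNode[nodeName] == nil {
		t.byNode[nodeName] = map[string]struct{}{}
	}
	t.byNode[nodeName][key] = struct{}{}
}

// Has reports whether the deployment was already restarted for the node.
func (t *restartTracker) Has(nodeName, key string) bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	_, ok := t.byNode[nodeName][key]
	return ok
}
```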
t.Unlock()

deployment.Spec.Template.Annotations[v1.DeploymentRestartNodeAnnotationKey] = nodeName
if err := t.kubeClient.Update(ctx, deployment); err != nil {
We've discussed this in the past. This is a significant expansion to Karpenter's threat model, as Karpenter will now have permission to mutate every deployment in the cluster. Previously, we've discussed needing a fix upstream to support "surge" eviction.
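To make the threat-model point concrete: expressed as a client-go rbacv1 rule, Karpenter's ClusterRole would need roughly the following cluster-wide write access (an illustration of the objection, not a manifest from this PR):

```go
package rbacsketch

import rbacv1 "k8s.io/api/rbac/v1"

// deploymentWriteRule is the extra permission this approach implies:
// mutate access to every Deployment in the cluster.
var deploymentWriteRule = rbacv1.PolicyRule{
	APIGroups: []string{"apps"},
	Resources: []string{"deployments"},
	Verbs:     []string{"get", "list", "watch", "update", "patch"},
}
```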
Indeed, restarting deployments would give Karpenter too many permissions.
Is upstream support for "surge" eviction likely to be available in the near term?
I'm also wondering if this is better suited for a standardized upstream termination flow, rather than implementing this behavior in Karpenter. Effectively we'll be doing a lot more checks, and querying pods/deployments across the cluster for every pod we're terminating, but only doing something different in a smaller subset of cases, where all pods in the deployment live on the node. It's nice to let the deployment controller handle the rollouts, but I'm not sure the juice is worth the squeeze here. If you feel confident about the change and its drawbacks, can you write up an RFC detailing the trade-offs so I can better understand?
This situation should be more common in non-production environments, such as test and pre-release environments, which are usually used for functional verification. Most services there run a single replica. If service interruption time can be minimized when a node is terminated, the developer experience will be much better. Currently there seems to be no better way to solve this problem than restarting the deployment. What about making it a feature that is turned off by default? Our company is currently hitting this problem, and other companies likely hit it as well. I will try to write an RFC so we can discuss the advantages, disadvantages, and feasibility.
I'm preparing a cluster for dev/staging and facing this same issue. We have deployments with … If Karpenter should not alter Deployments/ReplicaSets, maybe an alternative is to use a PodDisruptionBudget and have Karpenter just create an annotation on the node for that case: "I want to disrupt this node but I can't". That would allow an external controller to do: …
The new pods would not be scheduled to the same node; once they are Ready, the old ones get deleted, the PDB is respected, and Karpenter would be able to disrupt the node on its next check. I guess that would require minimal changes to this project. It could even be implemented by the user from a manifest in the Karpenter documentation, referencing another GitHub project. A hypothetical sketch of such a controller follows.
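This sketch uses an invented annotation key (Karpenter does not set such an annotation today) and assumes the Deployments on the blocked node have already been discovered elsewhere, via Pod -> ReplicaSet -> Deployment owner references. It triggers the same rolling restart kubectl rollout restart performs.

```go
package external

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Invented for illustration; not an annotation Karpenter sets today.
const wantsDisruptionAnnotation = "example.com/karpenter-wants-disruption"

// restartDeploymentsOnBlockedNode reacts to a node marked "wants to
// disrupt but cannot" and rolling-restarts each affected Deployment, so
// new pods surge onto other nodes before the old ones are deleted and
// the PDB stays satisfied.
func restartDeploymentsOnBlockedNode(ctx context.Context, c client.Client, node *corev1.Node, deployments []*appsv1.Deployment) error {
	if _, ok := node.Annotations[wantsDisruptionAnnotation]; !ok {
		return nil // node is not blocked; nothing to do
	}
	for _, d := range deployments {
		if d.Spec.Template.Annotations == nil {
			d.Spec.Template.Annotations = map[string]string{}
		}
		// Same template-annotation trick kubectl rollout restart uses.
		d.Spec.Template.Annotations["kubectl.kubernetes.io/restartedAt"] = time.Now().Format(time.RFC3339)
		if err := c.Update(ctx, d); err != nil {
			return err
		}
	}
	return nil
}
```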
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
Fixes #1674
Description
When all replicas of a Deployment are on the same node (for example, a Deployment with 2 pods, both on the node being terminated), both pods are evicted when the node is terminated. From the time the 2 pods are evicted until they are recreated and running successfully on a new node, the Deployment has no pods to serve traffic.
This also happens when a Deployment has only one replica.
During eviction, a check is made: if all replicas of the Deployment are on this node, or the Deployment has only one replica, restarting the Deployment is more graceful than evicting its pods. The restart first creates a pod on a new node, waits for it to run successfully, and only then terminates the old pod, which reduces service interruption time.
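Why the restart avoids the outage: under a RollingUpdate strategy, the Deployment controller surges the replacement pod first and deletes the old one only after the new one is ready. Expressed in Go types (values chosen for illustration, not taken from this PR):

```go
package strategy

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/utils/ptr"
)

// zeroDowntimeStrategy guarantees create-before-delete ordering during
// the restart triggered above.
var zeroDowntimeStrategy = appsv1.DeploymentStrategy{
	Type: appsv1.RollingUpdateDeploymentStrategyType,
	RollingUpdate: &appsv1.RollingUpdateDeployment{
		MaxSurge:       ptr.To(intstr.FromInt32(1)),     // create the new pod first
		MaxUnavailable: ptr.To(intstr.FromString("0%")), // never drop below desired replicas
	},
}
```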
How was this change tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.