fix: when all replicas of a deployment are on one node, restart the deployment instead of evicting it #1685

Open · wants to merge 1 commit into base: main

Conversation

andyblog

Fixes #1674

Description
When all replicas of a Deployment are on the same node, eviction causes a service outage. For example, if a Deployment has 2 pods on the node being terminated, both pods are evicted at once; from the moment they are evicted until the replacement pods are created and become ready on a new node, the Deployment has no pods serving traffic. The same problem occurs when a Deployment has only one replica.

During eviction, this change adds a check: if all replicas of the Deployment are on the node being terminated, or the Deployment has only one replica, the Deployment is restarted instead of evicted, which is more graceful. A restart first creates a pod on another node, waits for the new pod to become ready, and only then terminates the old pod, which reduces the service interruption time.
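
For context, a minimal sketch (not the code in this PR) of how such a restart can be triggered with client-go: mutating the pod template, the same mechanism kubectl rollout restart uses via the kubectl.kubernetes.io/restartedAt annotation, makes the Deployment controller perform a rolling update, so with the default RollingUpdate strategy a replacement pod is surged up before the old pod is removed.

// Sketch only: assumes a client-go Clientset; the function and annotation value
// are illustrative, not this PR's implementation.
package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func restartDeployment(ctx context.Context, cs *kubernetes.Clientset, namespace, name string) error {
	deploy, err := cs.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if deploy.Spec.Template.Annotations == nil {
		deploy.Spec.Template.Annotations = map[string]string{}
	}
	// Changing the pod template triggers a rolling update; kubectl uses this
	// exact annotation for "kubectl rollout restart".
	deploy.Spec.Template.Annotations["kubectl.kubernetes.io/restartedAt"] = time.Now().Format(time.RFC3339)
	_, err = cs.AppsV1().Deployments(namespace).Update(ctx, deploy, metav1.UpdateOptions{})
	return err
}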

How was this change tested?

  • make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andyblog
Once this PR has been reviewed and has the lgtm label, please assign gjtempleton for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


linux-foundation-easycla bot commented Sep 18, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Sep 18, 2024
@k8s-ci-robot
Contributor

Hi @andyblog. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 18, 2024
@andyblog andyblog changed the title Restart deployment when all replicas of a deployment are on one node, restart the deployment instead of evicting it Sep 18, 2024
@andyblog andyblog changed the title when all replicas of a deployment are on one node, restart the deployment instead of evicting it fix: when all replicas of a deployment are on one node, restart the deployment instead of evicting it Sep 18, 2024
@jwcesign
Contributor

I like this idea

@jwcesign
Contributor

jwcesign commented Sep 18, 2024

One of the drawbacks: this generates a lot of ReplicaSets, and if you want to roll back to a specific version via its ReplicaSet, you may no longer be able to.

So we had better introduce a feature flag, defaulting to false.
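
To make the suggestion concrete, a minimal sketch of such a gate; the flag name and wiring are hypothetical and not part of Karpenter's options or this PR:

// Sketch only: a boolean feature flag that defaults to false; the flag name
// is made up for this example.
package example

import "flag"

var restartCoLocatedDeployments = flag.Bool(
	"restart-co-located-deployments",
	false, // default off, as suggested above
	"Restart a Deployment instead of evicting its pods when all of its replicas are on the node being terminated.",
)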

@coveralls

coveralls commented Sep 20, 2024

Pull Request Test Coverage Report for Build 11236341113

Details

  • 62 of 134 (46.27%) changed or added relevant lines in 1 file are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.4%) to 80.418%

Changes Missing Coverage: Covered Lines / Changed or Added Lines (%)
  pkg/controllers/node/termination/terminator/terminator.go: 62 / 134 (46.27%)

Files with Coverage Reduction: New Missed Lines (%)
  pkg/controllers/node/termination/controller.go: 2 (68.59%)

Totals:
  Change from base Build 11225763241: -0.4%
  Covered Lines: 8550
  Relevant Lines: 10632

💛 - Coveralls

@yxxhero

yxxhero commented Sep 20, 2024

/ok-to-test

@k8s-ci-robot
Contributor

@yxxhero: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


github-actions bot commented Oct 6, 2024

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 6, 2024
if deployment.Spec.Template.Annotations == nil {
	deployment.Spec.Template.Annotations = make(map[string]string)
}
restartedNode, exists := deployment.Spec.Template.Annotations["kubectl.kubernetes.io/restartedNode"]
Contributor

@jwcesign jwcesign Oct 6, 2024

Why did we choose the name kubectl.kubernetes.io/restartedNode instead of something like x.karpenter.sh/xxx?

Author

It is OK to use "karpenter.sh/restartedNode"; I was following the way Kubernetes restarts a Deployment. Changing it to the Karpenter annotation style is more reasonable here. Could you help review and point out which other parts are unreasonable and need to be modified? I will change them together.

Contributor

Sure, I will take a look asap.

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 7, 2024
Contributor

@jwcesign jwcesign left a comment

Otherwise lgtm

@@ -49,6 +49,8 @@ const (
NodePoolHashAnnotationKey = apis.Group + "/nodepool-hash"
NodePoolHashVersionAnnotationKey = apis.Group + "/nodepool-hash-version"
NodeClaimTerminationTimestampAnnotationKey = apis.Group + "/nodeclaim-termination-timestamp"
// When a deployment is restarted, this annotation is used to mark which node was terminated and restarted.
DeploymentRestartNodeAnnotationKey = apis.Group + "/restart-node"
Contributor

How about DeploymentDrainedPolicyAnnotationKey = apis.Group + "/drained-policy"? It's about node draining.

return fmt.Errorf("get deployment and drain pod from node %w", err)
var drainPods []*corev1.Pod
var restartDeployments []*appsv1.Deployment
deletionDeadline := node.GetDeletionTimestamp().Add(5 * time.Minute)
Contributor

Make this time configurable? For spot instances, a 5-minute default is too long.
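
For illustration, interruption notice on EC2 Spot is typically around two minutes, so operators would likely want to shorten this. A small, hypothetical sketch of making the deadline configurable (the environment variable name is made up for this example):

// Sketch only; the setting name is hypothetical, not part of this PR.
package example

import (
	"os"
	"time"
)

func restartDeadline() time.Duration {
	// Allow operators to override the deadline, e.g. RESTART_DEADLINE=90s.
	if v := os.Getenv("RESTART_DEADLINE"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 5 * time.Minute // default used in the current revision of this PR
}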

	deployment.Spec.Template.Annotations = make(map[string]string)
}
restartedNode, exists := deployment.Spec.Template.Annotations[v1.DeploymentRestartNodeAnnotationKey]
if exists && restartedNode == nodeName {
Contributor

@jwcesign jwcesign Oct 8, 2024

Why skip when restartedNode == nodeName? And maybe a different annotation value, such as true, would be better?

Contributor

Why not take the annotation from deployment.Annotations?

Author

@andyblog andyblog Oct 8, 2024

Updating deployment.Spec.Template.Annotations restarts the Deployment, but updating deployment.Annotations does not trigger a restart.

Here deployment.Spec.Template.Annotations is used as a marker of whether the Deployment has already been restarted. If the annotation already contains this nodeName, the previous Reconcile already restarted the Deployment, so it is skipped this time.

I have tested this and it works. Could you take a look and see whether it is suitable?
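
To illustrate the point, a small sketch (not the PR's exact code) of the skip check described above; the annotation key is hard-coded here because the sketch does not import the PR's apis package, so its value is an assumption:

// Sketch only: mirrors the idea, not the PR's implementation.
package example

import (
	appsv1 "k8s.io/api/apps/v1"
)

// Assumed to match the constant added in this PR (apis.Group + "/restart-node").
const deploymentRestartNodeAnnotationKey = "karpenter.sh/restart-node"

// needsRestart reports whether the Deployment still has to be restarted for
// nodeName. Only pod-template changes roll the Deployment, so the marker is
// kept in Spec.Template.Annotations rather than in deployment.Annotations.
func needsRestart(deploy *appsv1.Deployment, nodeName string) bool {
	if deploy.Spec.Template.Annotations == nil {
		return true
	}
	return deploy.Spec.Template.Annotations[deploymentRestartNodeAnnotationKey] != nodeName
}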

if deployment != nil {
	key := deployment.Namespace + "/" + deployment.Name
	if nodeDeploymentReplicas[key] >= *deployment.Spec.Replicas {
		// If a deployment has multiple pods on this node, the same deployment appears here multiple times, so deduplication is required.
Contributor

If the Deployment is in the middle of scaling up, this check may be wrong. How about checking the ready replicas instead?
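
For illustration, a sketch (not the PR's code) of the suggestion: compare against the Deployment's observed ready replicas instead of Spec.Replicas, so a Deployment that is still scaling up is not treated as fully co-located on the node. The podsOnNode count is assumed to come from the per-node tally the terminator already builds.

// Sketch only.
package example

import (
	appsv1 "k8s.io/api/apps/v1"
)

func allReadyReplicasOnNode(deploy *appsv1.Deployment, podsOnNode int32) bool {
	// Status.ReadyReplicas reflects what is actually serving traffic, while
	// Spec.Replicas is only the desired count and can lag during a scale-up.
	return deploy.Status.ReadyReplicas > 0 && podsOnNode >= deploy.Status.ReadyReplicas
}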

// When the restart begins, the number of replicas of the deployment on this node gradually decreases.
// This situation needs to be handled separately.
t.RLock()
_, exists := t.nodeRestartDeployments[nodeName][key]
Contributor

Managing nodeRestartDeployments involves quite a bit of code. Maybe it's time for a refactor?

t.Unlock()

deployment.Spec.Template.Annotations[v1.DeploymentRestartNodeAnnotationKey] = nodeName
if err := t.kubeClient.Update(ctx, deployment); err != nil {
Contributor

We've discussed this in the past. This is a significant expansion to Karpenter's threat model, as Karpenter will now have permission to mutate every deployment in the cluster. Previously, we've discussed needing a fix upstream to support "surge" eviction.

Author

Indeed, restarting Deployments gives Karpenter too many permissions.
Is upstream support for "surge" eviction likely to be available any time soon?

@njtran
Contributor

njtran commented Oct 8, 2024

I'm also wondering whether this is better suited to a standardized upstream termination flow, rather than implementing this behavior in Karpenter. Effectively we'll be doing a lot more checks, querying pods and deployments across the cluster for every pod we're terminating, but only doing something different in a smaller subset of cases, where all pods of a deployment live on the node being terminated. It's nice to let the deployment controller handle the rollouts, but I'm not sure the juice is worth the squeeze here.

If you feel confident about the change here and the drawbacks, can you write up an RFC detailing the trade-offs so I can better understand?

@andyblog
Author

andyblog commented Oct 8, 2024

This situation should be more common in non-production environments, such as test and pre-release environments, which are usually used for functional verification. Most services there run a single replica. If the service interruption time can be minimized when a node is terminated, the developer experience will be better.

Currently there seems to be no better way to solve this problem other than restarting the Deployment. What about making it a feature that is off by default?

Our company is currently hitting this problem, and other companies likely are as well.

I will try to write an RFC, and we will discuss the advantages and disadvantages, as well as the feasibility.

@Cristobal-M

I'm preparing a cluster for dev/staging and facing this same issue. We have deployments with replicas: 1, and if the only pod gets evicted, the service has downtime. I am not an expert and maybe what I am about to say is dumb.

If Karpenter should not alter Deployments/ReplicaSets, maybe an alternative is to use a PodDisruptionBudget and have Karpenter just add an annotation on the node for that case: "I want to disrupt this node but I can't".

That would allow an external controller to:

  • Taint the node
  • Find the Deployments that own the pods on that node
  • Restart those Deployments (see the sketch after this comment)

The new pods would not be scheduled onto the same node; once they are Ready, the old ones get deleted, the PDB is respected, and Karpenter would be able to disrupt the node at its next check.

I guess that would require minimal changes to this project. It could be something users implement themselves from a manifest in the Karpenter documentation, referencing another GitHub project.
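
A rough, hypothetical sketch of such an external controller (not something Karpenter provides today; the triggering node annotation and the cordon/taint step are assumed to happen elsewhere). It lists the pods on the blocked node, walks the ReplicaSet owner references up to their Deployments, and triggers a rollout restart for each one via the pod-template annotation, as in the earlier sketch in this thread:

// Sketch only: error handling is trimmed for brevity.
package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func restartDeploymentsOnNode(ctx context.Context, cs *kubernetes.Clientset, nodeName string) error {
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	restarted := map[string]bool{}
	for _, pod := range pods.Items {
		for _, ref := range pod.OwnerReferences {
			if ref.Kind != "ReplicaSet" {
				continue
			}
			rs, err := cs.AppsV1().ReplicaSets(pod.Namespace).Get(ctx, ref.Name, metav1.GetOptions{})
			if err != nil {
				continue
			}
			for _, rsRef := range rs.OwnerReferences {
				key := pod.Namespace + "/" + rsRef.Name
				if rsRef.Kind != "Deployment" || restarted[key] {
					continue
				}
				restarted[key] = true
				deploy, err := cs.AppsV1().Deployments(pod.Namespace).Get(ctx, rsRef.Name, metav1.GetOptions{})
				if err != nil {
					continue
				}
				if deploy.Spec.Template.Annotations == nil {
					deploy.Spec.Template.Annotations = map[string]string{}
				}
				// Same mechanism as "kubectl rollout restart".
				deploy.Spec.Template.Annotations["kubectl.kubernetes.io/restartedAt"] = time.Now().Format(time.RFC3339)
				if _, err := cs.AppsV1().Deployments(pod.Namespace).Update(ctx, deploy, metav1.UpdateOptions{}); err != nil {
					return err
				}
			}
		}
	}
	return nil
}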


This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2024
Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
lifecycle/stale - Denotes an issue or PR has remained open with no activity and has become stale.
needs-ok-to-test - Indicates a PR that requires an org member to verify it is safe to test.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pod eviction will cause service interruption
8 participants