Feature: LowNodeUtilization can soft taint overutilized nodes #1626

Open
tiraboschi opened this issue Feb 7, 2025 · 1 comment · May be fixed by #1625

@tiraboschi

Is your feature request related to a problem? Please describe.
The descheduler LowNodeUtilization plugin evicts pods from overutilized nodes. Ideally they should land on underutilized nodes, but in reality the choice is entirely up to the scheduler.
The plugin is only responsible for identifying overutilized and underutilized nodes and for evicting pods; it is the scheduler that decides where the newly recreated pods are placed.
The assumption was that the descheduler and the scheduler act upon similar criteria and therefore converge on a consistent result.
In the past the descheduler classified node utilization only according to CPU and memory requests, not actual usage.
With #1555 the LowNodeUtilization plugin gained the capability to classify nodes according to real utilization metrics, which is an interesting feature when the actual resource usage of pods is significantly larger than their requests.
On the other hand, this creates an asymmetry with the scheduler which, at least with the default kube-scheduler, schedules only according to resource requests.
This can break the assumption that a pod descheduled from an overutilized node is likely to land on an underutilized node, since the scheduler is not aware of which nodes the descheduler considers over- and underutilized.

Describe the solution you'd like
A simple and elegant solution is to have the descheduler provide a hint to the scheduler by dynamically setting/removing a soft taint (effect: PreferNoSchedule) on the nodes it considers overutilized according to the metrics and thresholds configured on the descheduler side.
A PreferNoSchedule taint is just a "preference", a "soft" version of a NoSchedule taint: the scheduler will try to avoid placing pods on nodes that the descheduler considers overutilized, but this is not guaranteed.
On the other hand, being just a hint and not a predicate, it introduces no risk of limiting cluster capacity or making the cluster unschedulable, so there is no need for complex logic to keep the number of tainted nodes below a certain ratio.
On each round, the LowNodeUtilization plugin can simply apply the soft taint to all the nodes that are currently classified as overutilized and remove it from nodes that are no longer considered overutilized.
This should help the scheduler take decisions that are consistent with the descheduler's expectation that pods evicted from overutilized nodes land on appropriately utilized or underutilized ones.

We can have this as an optional (disabled by default) sub-feature of the LowNodeUtilization plugin, using a user-configurable key for the soft taint (nodeutilization.descheduler.kubernetes.io/overutilized sounds like a valid default).
Users can still set a toleration for that taint on their critical workloads; see the sketch below.
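
For illustration, here is roughly what the resulting node taint and an opt-out toleration could look like (a minimal sketch assuming the default key proposed above; node and pod names are placeholders):

apiVersion: v1
kind: Node
metadata:
  name: worker-1
spec:
  taints:
    # soft hint set by the descheduler: the scheduler prefers other nodes but may still use this one
    - key: nodeutilization.descheduler.kubernetes.io/overutilized
      value: "true"
      effect: PreferNoSchedule
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-workload
spec:
  # critical workloads can opt out of the hint by tolerating the soft taint
  tolerations:
    - key: nodeutilization.descheduler.kubernetes.io/overutilized
      operator: Exists
      effect: PreferNoSchedule
  containers:
    - name: app
      image: registry.example.com/app:latest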

Describe alternatives you've considered
The node controller already sets some condition-based hard taints (effect: NoSchedule) for certain node conditions (MemoryPressure, DiskPressure, PIDPressure...), but those are hard taints applied when the node is already in a serious/critical condition.
The idea here is to use a similar approach, but with a soft taint, when a node is overutilized yet not in a critical condition.
thresholds, targetThresholds and metricsUtilization can already be configured on the LowNodeUtilization plugin, so the plugin is the only component with an accurate view of which nodes are considered under- and overutilized according to its configuration; see the configuration sketch below.
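
For reference, a LowNodeUtilization configuration along these lines might look as follows (a sketch based on the v1alpha2 policy API; the exact shape of the metricsUtilization field may differ between descheduler versions, and the soft-taint toggle at the end is purely hypothetical, it does not exist today):

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: "LowNodeUtilization"
        args:
          # nodes below all of these thresholds are classified as underutilized
          thresholds:
            "cpu": 20
            "memory": 20
          # nodes above any of these thresholds are classified as overutilized
          targetThresholds:
            "cpu": 70
            "memory": 70
          # classify nodes by actual usage (from #1555) instead of requests only
          metricsUtilization:
            source: KubernetesMetrics
          # hypothetical option proposed by this issue (not implemented):
          # softTaintOverutilizedNodes: true
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"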

What version of descheduler are you using?
descheduler version: v0.32.1

Additional context
Something like this has also been reported in the past in #994, but at that time the descheduler could only classify nodes as under/overutilized according to CPU/memory requests, so the assumption that we can rely on an implicit symmetry between descheduler and scheduler behaviour was solid.
Now that the descheduler can consume actual node resource utilization via Kubernetes metrics, the issue can become more frequent when static resource requests do not match the actual usage patterns.
For instance, to achieve higher workload density, in the KubeVirt project the pods that execute KubeVirt VMs are configured by default with a CPU request of 1/10 of what the user requested for the VM.
With such a high overcommit ratio, and a descheduler that can classify nodes as overutilized according to actual utilization metrics that can differ significantly from the static requests, the capability to provide a soft hint for the scheduler to avoid nodes classified as overutilized sounds like a more than reasonable feature.

@tiraboschi added the kind/feature label on Feb 7, 2025
@tiraboschi
Author

I was a bit concerned about extending the RBAC role for the descheduler, but with a ValidatingAdmissionPolicy (GA since Kubernetes 1.30) I was able to restrict node updates to taints whose key contains a specific prefix (eventually parameterizable with an external object) and also to enforce PreferNoSchedule as the taint effect.

I used something like:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: descheduler-cluster-role-node-update
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - patch
      - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: descheduler-cluster-role-binding-node-update
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: descheduler-cluster-role-node-update
subjects:
  - kind: ServiceAccount
    name: descheduler-sa
    namespace: kube-system
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "descheduler-sa-update-nodes"
spec:
  failurePolicy: Fail
  matchConstraints:
    matchPolicy: Equivalent
    namespaceSelector: {}
    objectSelector: {}
    resourceRules:
      - apiGroups:   [""]
        apiVersions: ["*"]
        operations:  ["UPDATE"]
        resources:   ["nodes"]
        scope: "*"
  matchConditions:
    - name: 'descheduler-sa'
      expression: "request.userInfo.username=='system:serviceaccount:kube-system:descheduler-sa'"
  variables:
    - name: "oldNonDeschedulerTaints"
      expression: "has(oldObject.spec.taints) ? oldObject.spec.taints.filter(t, !t.key.contains('descheduler.kubernetes.io')) : []"
    - name: "oldTaints"
      expression: "has(oldObject.spec.taints) ? oldObject.spec.taints : []"
    - name: "newNonDeschedulerTaints"
      expression: "has(object.spec.taints) ? object.spec.taints.filter(t, !t.key.contains('descheduler.kubernetes.io')) : []"
    - name: "newTaints"
      expression: "has(object.spec.taints) ? object.spec.taints : []"
    - name: "newDeschedulerTaints"
      expression: "has(object.spec.taints) ? object.spec.taints.filter(t, t.key.contains('descheduler.kubernetes.io')) : []"
  validations:
    - expression: |
        oldObject.metadata.filter(k, k != "resourceVersion" && k != "generation" && k != "managedFields").all(k, k in object.metadata) &&
        object.metadata.filter(k, k != "resourceVersion" && k != "generation" && k != "managedFields").all(k, k in oldObject.metadata && oldObject.metadata[k] == object.metadata[k])
      messageExpression: "'User ' + string(request.userInfo.username) + ' is only allowed to update taints'"
      reason: Forbidden
    - expression: |
        oldObject.spec.filter(k, k != "taints").all(k, k in object.spec) &&
        object.spec.filter(k, k != "taints").all(k, k in oldObject.spec && oldObject.spec[k] == object.spec[k])
      messageExpression: "'User ' + string(request.userInfo.username) + ' is only allowed to update taints'"
      reason: Forbidden
    - expression: "size(variables.newNonDeschedulerTaints) == size(variables.oldNonDeschedulerTaints)"
      messageExpression: "'User ' + string(request.userInfo.username) + ' is not allowed to create/delete non descheduler taints'"
      reason: Forbidden
    - expression: "variables.newNonDeschedulerTaints.all(nt, size(variables.oldNonDeschedulerTaints.filter(ot, nt.key==ot.key)) > 0 ? variables.oldNonDeschedulerTaints.filter(ot, nt.key==ot.key)[0].value == nt.value && variables.oldNonDeschedulerTaints.filter(ot, nt.key==ot.key)[0].effect == nt.effect : true)"
      messageExpression: "'User ' + string(request.userInfo.username) + ' is not allowed to update non descheduler taints'"
      reason: Forbidden
    - expression: "variables.newDeschedulerTaints.all(t, t.effect == 'PreferNoSchedule')"
      messageExpression: "'User ' + string(request.userInfo.username) + ' is only allowed to set taints with PreferNoSchedule effect'"
      reason: Forbidden
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "descheduler-sa-update-nodes"
spec:
  policyName: "descheduler-sa-update-nodes"
  validationActions: [Deny]

This is enough to prevent unwanted updates (anything other than taints) from the descheduler-sa:

$ kubectl --as=system:serviceaccount:kube-system:descheduler-sa patch node kind-worker -p '{"spec":{"unschedulable":true}}'
Error from server (Forbidden): nodes "kind-worker" is forbidden: ValidatingAdmissionPolicy 'descheduler-sa-update-nodes' with binding 'descheduler-sa-update-nodes' denied request: User system:serviceaccount:kube-system:descheduler-sa is only allowed to update taints
$ kubectl --as=system:serviceaccount:kube-system:descheduler-sa label node kind-worker key=value
Error from server (Forbidden): nodes "kind-worker" is forbidden: ValidatingAdmissionPolicy 'descheduler-sa-update-nodes' with binding 'descheduler-sa-update-nodes' denied request: User system:serviceaccount:kube-system:descheduler-sa is only allowed to update taints
$ kubectl --as=system:serviceaccount:kube-system:descheduler-sa annotate node kind-worker key=value
Error from server (Forbidden): nodes "kind-worker" is forbidden: ValidatingAdmissionPolicy 'descheduler-sa-update-nodes' with binding 'descheduler-sa-update-nodes' denied request: User system:serviceaccount:kube-system:descheduler-sa is only allowed to update taints

It also blocks the descheduler-sa from creating/updating/deleting non-descheduler taints:

$ kubectl taint node kind-worker key1=value1:NoSchedule
node/kind-worker tainted
$ kubectl --as=system:serviceaccount:kube-system:descheduler-sa taint node kind-worker key2=value2:NoSchedule
Error from server (Forbidden): nodes "kind-worker" is forbidden: ValidatingAdmissionPolicy 'descheduler-sa-update-nodes' with binding 'descheduler-sa-update-nodes' denied request: User system:serviceaccount:kube-system:descheduler-sa is not allowed to create/delete non descheduler taints
$ kubectl --as=system:serviceaccount:kube-system:descheduler-sa taint --overwrite node kind-worker key1=value3:NoSchedule
Error from server (Forbidden): nodes "kind-worker" is forbidden: ValidatingAdmissionPolicy 'descheduler-sa-update-nodes' with binding 'descheduler-sa-update-nodes' denied request: User system:serviceaccount:kube-system:descheduler-sa is not allowed to update non descheduler taints
$ kubectl --as=system:serviceaccount:kube-system:descheduler-sa taint  node kind-worker key1=value3:NoSchedule-
Error from server (Forbidden): nodes "kind-worker" is forbidden: ValidatingAdmissionPolicy 'descheduler-sa-update-nodes' with binding 'descheduler-sa-update-nodes' denied request: User system:serviceaccount:kube-system:descheduler-sa is not allowed to create/delete non descheduler taints

The descheduler-sa can create/update/delete its own taints as long as they use the PreferNoSchedule effect:

$ kubectl --as=system:serviceaccount:kube-system:descheduler-sa taint node kind-worker nodeutilization.descheduler.kubernetes.io/overutilized=level1:PreferNoSchedule
node/kind-worker tainted
$ kubectl --as=system:serviceaccount:kube-system:descheduler-sa taint --overwrite node kind-worker nodeutilization.descheduler.kubernetes.io/overutilized=level2:PreferNoSchedule
node/kind-worker modified
$ kubectl --as=system:serviceaccount:kube-system:descheduler-sa taint --overwrite node kind-worker nodeutilization.descheduler.kubernetes.io/overutilized=level2:PreferNoSchedule-
node/kind-worker modified

but it is not able to set a hard taint (effect != PreferNoSchedule):

$ kubectl --as=system:serviceaccount:kube-system:descheduler-sa taint --overwrite node kind-worker nodeutilization.descheduler.kubernetes.io/overutilized=level3:NoSchedule
Error from server (Forbidden): nodes "kind-worker" is forbidden: ValidatingAdmissionPolicy 'descheduler-sa-update-nodes' with binding 'descheduler-sa-update-nodes' denied request: User system:serviceaccount:kube-system:descheduler-sa is only allowed to set taints with PreferNoSchedule effect
