Stuck workload is not cleaned up/correctly handled #4224

Open

woehrl01 opened this issue Feb 11, 2025 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@woehrl01
Contributor

What happened:

In our cluster we occasionally find stuck workloads that are multiple days old.

What you expected to happen:

That the workload is scheduled or garbage collected

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

This is the workload which is "stuck":

apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  creationTimestamp: "2024-11-29T00:39:00Z"
  finalizers:
  - kueue.x-k8s.io/resource-in-use
  generation: 1
  labels:
    kueue.x-k8s.io/job-uid: 2eae97f8-63f4-4ba6-9930-4c526f0ec781
  name: job-action
  namespace: default
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: action
    uid: 2eae97f8-63f4-4ba6-9930-4c526f0ec781
  resourceVersion: "6456552770"
  uid: c69d69e1-7869-4bff-9282-9cae630cfdd7
spec:
  active: true
  podSets:
  - count: 1
    name: main
    template:
      metadata:
        labels:
          kueue.x-k8s.io/queue-name: queue
      spec:
        containers:
        - args:
          - -c
          - echo "hello world"
          command:
          - sh
          image: busybox
          imagePullPolicy: Always
          name: php-cli
          resources:
            limits:
              memory: 4Gi
            requests:
              cpu: 150m
              memory: "205785634"
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        restartPolicy: Never
        securityContext:
          runAsGroup: 0
          runAsUser: 0
        terminationGracePeriodSeconds: 30
  queueName: queue
status:
  admission:
    clusterQueue: queue
    podSetAssignments:
    - count: 1
      flavors:
        cpu: on-demand
        memory: on-demand
      name: main
      resourceUsage:
        cpu: 150m
        memory: "205785634"
  conditions:
  - lastTransitionTime: "2024-11-29T03:03:26Z"
    message: Quota reserved in ClusterQueue queue
    observedGeneration: 1
    reason: QuotaReserved
    status: "True"
    type: QuotaReserved
  - lastTransitionTime: "2024-11-29T03:03:26Z"
    message: The workload is admitted
    observedGeneration: 1
    reason: Admitted
    status: "True"
    type: Admitted

Environment:

  • Kubernetes version (use kubectl version): 1.31
  • Kueue version (use git describe --tags --dirty --always): v0.10.1
  • Cloud provider or hardware configuration: aws
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@woehrl01 added the kind/bug label Feb 11, 2025
@mimowo
Contributor

mimowo commented Feb 11, 2025

What does it mean that the workload is stuck? Do you mean the pods get created but cannot get scheduled?

If so, then one option is to use waitForPodsReady. You can configure it to deactivate the workload after a couple of requeue attempts, and then have a small script to GC the deactivated workloads.
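
For illustration, a minimal sketch of a Kueue manager Configuration with waitForPodsReady enabled; the timeout and backoff values below are placeholders, not recommendations:

apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
# Evict workloads whose pods do not become ready within the timeout;
# once backoffLimitCount requeues are exceeded, the workload is
# deactivated (spec.active is set to false).
waitForPodsReady:
  enable: true
  timeout: 10m            # placeholder value
  blockAdmission: true
  requeuingStrategy:
    timestamp: Eviction
    backoffLimitCount: 3  # placeholder; deactivation happens once this is exceeded

A cleanup script could then select workloads with spec.active set to false and delete them.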

@woehrl01
Contributor Author

Stuck means that the Job is never unsuspended. The Workload is marked as admitted, but the Job is never scheduled.
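
For context, with Kueue's batch/Job integration the Job stays suspended until its Workload is admitted, at which point Kueue sets spec.suspend to false. A rough sketch of what the stuck state described here would look like on the owning Job (abbreviated; values mirror the Workload above where possible, otherwise assumed):

apiVersion: batch/v1
kind: Job
metadata:
  name: action              # owner of the stuck Workload shown above
  namespace: default
spec:
  suspend: true             # expected to flip to false on admission; here it apparently never does
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: php-cli
        image: busybox
        command: ["sh"]
        args: ["-c", 'echo "hello world"']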

@mimowo
Contributor

mimowo commented Feb 11, 2025

Weird, can you check the events for the Workload and the corresponding Job?

@woehrl01
Contributor Author

Unfortunately not; in our case it is only detected once the workload has not been executed for a day, and by then all the events are already gone. My assumption is that the Workload gets updated, but unsuspending the Job fails because of an API server error. Is that possible?

@woehrl01 changed the title from "Stuck workload is not cleanedup/corretly handled" to "Stuck workload is not cleaned up/correctly handled" Feb 12, 2025