Stuck workload is not cleaned up/correctly handled #4224

Open

woehrl01 opened this issue Feb 11, 2025 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@woehrl01
Contributor

What happened:

In our cluster we occasionally find stuck workloads that are multiple days old.

What you expected to happen:

That the workload is scheduled or garbage collected

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

This is the workload which is "stuck":

apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  creationTimestamp: "2024-11-29T00:39:00Z"
  finalizers:
  - kueue.x-k8s.io/resource-in-use
  generation: 1
  labels:
    kueue.x-k8s.io/job-uid: 2eae97f8-63f4-4ba6-9930-4c526f0ec781
  name: job-action
  namespace: default
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: action
    uid: 2eae97f8-63f4-4ba6-9930-4c526f0ec781
  resourceVersion: "6456552770"
  uid: c69d69e1-7869-4bff-9282-9cae630cfdd7
spec:
  active: true
  podSets:
  - count: 1
    name: main
    template:
      metadata:
        labels:
          kueue.x-k8s.io/queue-name: queue
      spec:
        containers:
        - args:
          - -c
          - echo "hello world"
          command:
          - sh
          image: busybox
          imagePullPolicy: Always
          name: php-cli
          resources:
            limits:
              memory: 4Gi
            requests:
              cpu: 150m
              memory: "205785634"
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        restartPolicy: Never
        securityContext:
          runAsGroup: 0
          runAsUser: 0
        terminationGracePeriodSeconds: 30
  queueName: queue
status:
  admission:
    clusterQueue: queue
    podSetAssignments:
    - count: 1
      flavors:
        cpu: on-demand
        memory: on-demand
      name: main
      resourceUsage:
        cpu: 150m
        memory: "205785634"
  conditions:
  - lastTransitionTime: "2024-11-29T03:03:26Z"
    message: Quota reserved in ClusterQueue queue
    observedGeneration: 1
    reason: QuotaReserved
    status: "True"
    type: QuotaReserved
  - lastTransitionTime: "2024-11-29T03:03:26Z"
    message: The workload is admitted
    observedGeneration: 1
    reason: Admitted
    status: "True"
    type: Admitted

Environment:

  • Kubernetes version (use kubectl version): 1.31
  • Kueue version (use git describe --tags --dirty --always): v0.10.1
  • Cloud provider or hardware configuration: aws
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@woehrl01 added the kind/bug label Feb 11, 2025
@mimowo
Contributor

mimowo commented Feb 11, 2025

What does it mean that the workload is stuck? Do you mean the pods get created but cannot get scheduled?

If so, then one option is to use waitForPodsReady. You can configure it to deactivate the workload after a couple of requeue attempts, and then have a small script to GC the deactivated workloads.
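
For illustration, a minimal sketch of a Kueue manager Configuration with waitForPodsReady enabled; the timeout and backoff values below are placeholders, not recommendations:

apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
# Evict workloads whose pods do not become ready within the timeout;
# once backoffLimitCount requeues are exceeded, the workload is
# deactivated (spec.active is set to false).
waitForPodsReady:
  enable: true
  timeout: 10m            # placeholder value
  blockAdmission: true
  requeuingStrategy:
    timestamp: Eviction
    backoffLimitCount: 3  # placeholder; deactivation happens once this is exceeded

A cleanup script could then select workloads with spec.active set to false and delete them.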

@woehrl01
Contributor Author

Stuck means that the Job is never unsuspended. The Workload is marked as admitted, but the Job is never scheduled.
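
For context, with Kueue's batch/Job integration the Job stays suspended until its Workload is admitted, at which point Kueue sets spec.suspend to false. A rough sketch of what the stuck state described here would look like on the owning Job (abbreviated; values mirror the Workload above where possible, otherwise assumed):

apiVersion: batch/v1
kind: Job
metadata:
  name: action              # owner of the stuck Workload shown above
  namespace: default
spec:
  suspend: true             # expected to flip to false on admission; here it apparently never does
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: php-cli
        image: busybox
        command: ["sh"]
        args: ["-c", 'echo "hello world"']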

@mimowo
Contributor

mimowo commented Feb 11, 2025

Weird, can you check the events for the Workload and the corresponding Job?

@woehrl01
Contributor Author

Unfortunately not; in our case it is only detected once the workload has not been executed for a day, and by then all the events are already gone. My assumption is that the Workload gets updated, but unsuspending the Job fails because of an API server error. Is that possible?

@woehrl01 changed the title from "Stuck workload is not cleanedup/corretly handled" to "Stuck workload is not cleaned up/correctly handled" Feb 12, 2025