Jobs are waiting too long for a runner to come online. #3704

Open

julien-michaud opened this issue Aug 12, 2024 · 5 comments
Labels
bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)

Comments

@julien-michaud

Controller Version

0.9.3

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

- install the controller
- start a job

Describe the bug

Some jobs are waiting from 30 seconds to more than 90 seconds to be scheduled on a runner.

Describe the expected behavior

Jobs should not have to wait that long, in my opinion.

Additional Context

---
podLabels:
  finops.company.net/stage: prod
  finops.company.net/service_class: live
  finops.company.net/cluster: gke-live-labs-europe-west1

bufferReserveResourcesCronJob:
  create: true

gha-runner-scale-set-controller:
  resources:
    limits:
      memory: 300Mi
    requests:
      cpu: 100m
      memory: 300Mi
  flags:
    logFormat: "json"
  podLabels:
    finops.company.net/stage: prod
    finops.company.net/service_class: live
    finops.company.net/cluster: gke-live-labs-europe-west1
  podAnnotations:
    logs.company.com/datadog_source: "gha-runner-scale-set"

gha-runner-scale-set:
  runnerScaleSetName: "company-hosted"
  maxRunners: 200
  listenerTemplate:
    metadata:
      labels:
        finops.company.net/stage: prod
        finops.company.net/service_class: live
        finops.company.net/cluster: gke-live-labs-europe-west1
      annotations:
        logs.company.com/datadog_source: "gha-runner-scale-set"
        ad.datadoghq.com/listener.checks: |
          {
            "openmetrics": {
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:8080/metrics",
                  "histogram_buckets_as_distributions": true,
                  "namespace": "actions-runner-system",
                  "metrics": [".*"],
                  "max_returned_metrics": 12000
                }
              ]
            }
          }
  template:
    metadata:
      labels:
        finops.company.net/stage: prod
        finops.company.net/service_class: live
        finops.company.net/cluster: gke-live-labs-europe-west1
      annotations:
        logs.company.com/datadog_source: "gha-runner-scale-set"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node_pool
                operator: In
                values:
                - github-actions
      tolerations:
        - key: "github-actions"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: runner
      containers:
        - name: runner
          image: europe-docker.pkg.dev/platform-replace/company-prod/devex/gha-runners:v0.1.13
          command: ["/home/runner/run.sh"]
          resources:
            requests:
              cpu: 4
  controllerServiceAccount:
    namespace: actions-runner-system
    name: actions-runner-controller-gha-rs-controller

Controller Logs

https://gist.github.com/julien-michaud/585574678b5804eafdf30c913030543e

listener logs:
https://gist.github.com/julien-michaud/27c8025ea0117243f0a85dde1e31bf9f

Runner Pod Logs

https://gist.github.com/julien-michaud/bd3a618f5e8e1d1de1dbb688619563a6
@gwynforthewyn
Contributor

I was having a similar issue, though not quite as bad. The helm chart accepts a parameter "minRunners" you can set to have warm runners available to service jobs immediately. For my architecture, setting it to 1 for a small team and 2 for a big team has proven sufficient.

If you're not using helm, it looks like minRunners gets set in the AutoscalingRunnerSet spec.
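
A minimal sketch of what that looks like in the scale set's Helm values (minRunners and maxRunners are standard chart values; the scale set name is simply the one from the config in this issue):

gha-runner-scale-set:
  runnerScaleSetName: "company-hosted"
  minRunners: 2     # keep two warm, idle runners so new jobs can start immediately
  maxRunners: 200   # upper bound unchanged

The trade-off is that the warm pods sit idle between jobs and keep their node occupied.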

@julien-michaud
Author

julien-michaud commented Aug 22, 2024

I was having a similar issue, though not quite as bad. The helm chart accepts a parameter "minRunners" you can set to have warm runners available to service jobs immediately. For my architecture, setting it to 1 for a small team and 2 for a big team has proven sufficient.

If you're not using helm, it looks like minRunners gets set in the AutoscalingRunnerSet spec.

I agree that setting warm runners could resolve the problem.
What I don't understand is why the controller/listener (on my setup) takes ~20 to 30 seconds to spin up a runner when none is available:

1 - job is created on GitHub
2 - after ~15 seconds, the pod is created
3 - the pod needs 5-10 seconds to become ready (probably because of dind initialization)
4 - after another ~5-10 seconds, the runner starts the job

@shapirus

shapirus commented Sep 2, 2024

Warm runners have one issue: their pods keep the node they run on alive by preventing cluster-autoscaler from terminating the instance. That defeats the point of minRunners=0, which saves costs by running the builder instances only when they are needed.

Ideally, we should have an option to start a certain number of extra runners when a job arrives, e.g.:

minRunners: 0
overScaleRunners: 5

This way, when a new job arrives, there is an initial wait for a build node and the first runners to start, but a certain number of extra runners will then be kept ready to take the next jobs as they arrive. Eventually all of them would scale back to zero after a certain period without new jobs.

Besides, the following is true; I am observing it too:

1 - job is created on GitHub
2 - after ~15 seconds, the pod is created
3 - the pod needs 5-10 seconds to become ready (probably because of dind initialization)
4 - after another ~5-10 seconds, the runner starts the job

There are significant delays which are not related to the normal pod startup routine.

@shapirus

shapirus commented Sep 2, 2024

It would also be nice to make it possible to "preheat" a pool of runners.

For example:

  • we have a "heavy" runner set to run CPU and memory-intensive jobs that have respective resources allocation
  • we also have a "generic" runner set for light tasks with a lower resources allocation
  • the light tasks are executed after the heavy task is complete (and if it was successful)
  • it would be nice to start all the (potentially) required runners right away, when the workflow is started, so that the generic runners will be ready to accept their jobs once the heavy job finishes without having to wait for them to start up.
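
Concretely, that means something like two scale sets today. A rough sketch of their values (the top-level keys, names and resource figures are made up for illustration; what is missing is a way to pre-start the "generic" runners as soon as the workflow, rather than the individual job, is queued):

gha-runner-scale-set-heavy:
  runnerScaleSetName: "company-hosted-heavy"      # hypothetical name
  template:
    spec:
      containers:
        - name: runner
          resources:
            requests:
              cpu: 4          # heavy jobs get a large allocation
              memory: 8Gi

gha-runner-scale-set-generic:
  runnerScaleSetName: "company-hosted-generic"    # hypothetical name
  template:
    spec:
      containers:
        - name: runner
          resources:
            requests:
              cpu: 500m       # light follow-up tasks need much less
              memory: 1Gi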

@devonhk

devonhk commented Sep 10, 2024

The helm chart accepts a parameter "minRunners" you can set to have warm runners available to service jobs immediately. For my architecture, setting it to 1 for a small team and 2 for a big team has proven sufficient

I tried something similar to work around this issue, but one thing I noticed is that runners cannot be reused across multiple jobs.

Here's an example:

  1. I have a workflow that triggers 10 jobs sequentially (sketched below).
  2. I set minRunners to 2.

What I expect
I should have a sufficient number of runners to execute the entire workflow

What actually happens
Runner pods are terminated after every job, so new ones have to be created.

I would prefer not to have to "preheat" 10 runners at the start of the workflow.
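
A trimmed sketch of the kind of workflow I mean (3 jobs instead of 10; the runs-on label is assumed to match the runnerScaleSetName from the config above):

name: sequential-jobs
on: push
jobs:
  job1:
    runs-on: company-hosted
    steps:
      - run: echo "job 1"
  job2:
    needs: job1          # each job waits for the previous one
    runs-on: company-hosted
    steps:
      - run: echo "job 2"
  job3:
    needs: job2
    runs-on: company-hosted
    steps:
      - run: echo "job 3"

Even with minRunners: 2, each of these jobs lands on a freshly created pod, because the ephemeral runner pod is removed after it finishes a single job.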

4 participants