Jobs are waiting too long for a runner to come online. #3704

Open

julien-michaud opened this issue Aug 12, 2024 · 5 comments
Labels
bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)

Comments

@julien-michaud

Controller Version

0.9.3

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

- install the controller
- start a job

Describe the bug

Some jobs are waiting from 30 seconds to more than 90 seconds to be scheduled on a runner.

Describe the expected behavior

Jobs should not have to wait that long, in my opinion.

Additional Context

---
podLabels:
  finops.company.net/stage: prod
  finops.company.net/service_class: live
  finops.company.net/cluster: gke-live-labs-europe-west1

bufferReserveResourcesCronJob:
  create: true

gha-runner-scale-set-controller:
  resources:
    limits:
      memory: 300Mi
    requests:
      cpu: 100m
      memory: 300Mi
  flags:
    logFormat: "json"
  podLabels:
    finops.company.net/stage: prod
    finops.company.net/service_class: live
    finops.company.net/cluster: gke-live-labs-europe-west1
  podAnnotations:
    logs.company.com/datadog_source: "gha-runner-scale-set"

gha-runner-scale-set:
  runnerScaleSetName: "company-hosted"
  maxRunners: 200
  listenerTemplate:
    metadata:
      labels:
        finops.company.net/stage: prod
        finops.company.net/service_class: live
        finops.company.net/cluster: gke-live-labs-europe-west1
      annotations:
        logs.company.com/datadog_source: "gha-runner-scale-set"
        ad.datadoghq.com/listener.checks: |
          {
            "openmetrics": {
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:8080/metrics",
                  "histogram_buckets_as_distributions": true,
                  "namespace": "actions-runner-system",
                  "metrics": [".*"],
                  "max_returned_metrics": 12000
                }
              ]
            }
          }
  template:
    metadata:
      labels:
        finops.company.net/stage: prod
        finops.company.net/service_class: live
        finops.company.net/cluster: gke-live-labs-europe-west1
      annotations:
        logs.company.com/datadog_source: "gha-runner-scale-set"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node_pool
                operator: In
                values:
                - github-actions
      tolerations:
        - key: "github-actions"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: runner
      containers:
        - name: runner
          image: europe-docker.pkg.dev/platform-replace/company-prod/devex/gha-runners:v0.1.13
          command: ["/home/runner/run.sh"]
          resources:
            requests:
              cpu: 4
  controllerServiceAccount:
    namespace: actions-runner-system
    name: actions-runner-controller-gha-rs-controller

Controller Logs

https://gist.github.com/julien-michaud/585574678b5804eafdf30c913030543e

listener logs:
https://gist.github.com/julien-michaud/27c8025ea0117243f0a85dde1e31bf9f

Runner Pod Logs

https://gist.github.com/julien-michaud/bd3a618f5e8e1d1de1dbb688619563a6
@gwynforthewyn
Contributor

I was having a similar issue, though not quite as bad. The helm chart accepts a parameter "minRunners" you can set to have warm runners available to service jobs immediately. For my architecture, setting it to 1 for a small team and 2 for a big team has proven sufficient.

If you're not using helm, it looks like minRunners gets set in the AutoscalingRunnerSet spec.
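
A minimal sketch of what that looks like in the scale set's Helm values (minRunners and maxRunners are standard chart values; the scale set name is simply the one from the config in this issue):

gha-runner-scale-set:
  runnerScaleSetName: "company-hosted"
  minRunners: 2     # keep two warm, idle runners so new jobs can start immediately
  maxRunners: 200   # upper bound unchanged

The trade-off is that the warm pods sit idle between jobs and keep their node occupied.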

@julien-michaud
Author

julien-michaud commented Aug 22, 2024

I was having a similar issue, though not quite as bad. The helm chart accepts a parameter "minRunners" you can set to have warm runners available to service jobs immediately. For my architecture, setting it to 1 for a small team and 2 for a big team has proven sufficient.

If you're not using helm, it looks like minRunners gets set in the AutoscalingRunnerSet spec.

I agree that setting warm runners could resolve the problem.
What I don't understand is why the controller/listener (on my setup) takes ~20 to 30 seconds to spin up a runner when none is available:

1 - job is created on GitHub
2 - after ~15 seconds, the pod is created
3 - the pod needs 5-10 seconds to become ready (probably because of dind initialization)
4 - after another ~5-10 seconds, the runner starts the job

@shapirus

shapirus commented Sep 2, 2024

Warm runners have one issue: their pods keep the node they run on alive by preventing cluster-autoscaler from terminating the instance. That defeats the point of minRunners=0, which saves costs by running the builder instances only when they are needed.

Ideally, we should have an option to start a certain number of extra runners when a job arrives, e.g.:

minRunners: 0
overScaleRunners: 5

This way, when a new job arrives, there is an initial wait for a build node and the first runners to start, but a certain number of extra runners will then be kept ready to take the next jobs as they arrive. Eventually all of them would scale back to zero after a certain period without new jobs.

Besides, the following is true; I am observing it too:

1 - job is created on GitHub
2 - after ~15 seconds, the pod is created
3 - the pod needs 5-10 seconds to become ready (probably because of dind initialization)
4 - after another ~5-10 seconds, the runner starts the job

There are significant delays which are not related to the normal pod startup routine.

@shapirus

shapirus commented Sep 2, 2024

It would also be nice to make it possible to "preheat" a pool of runners.

For example:

  • we have a "heavy" runner set to run CPU and memory-intensive jobs that have respective resources allocation
  • we also have a "generic" runner set for light tasks with a lower resources allocation
  • the light tasks are executed after the heavy task is complete (and if it was successful)
  • it would be nice to start all the (potentially) required runners right away, when the workflow is started, so that the generic runners will be ready to accept their jobs once the heavy job finishes without having to wait for them to start up.
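
Concretely, that means something like two scale sets today. A rough sketch of their values (the top-level keys, names and resource figures are made up for illustration; what is missing is a way to pre-start the "generic" runners as soon as the workflow, rather than the individual job, is queued):

gha-runner-scale-set-heavy:
  runnerScaleSetName: "company-hosted-heavy"      # hypothetical name
  template:
    spec:
      containers:
        - name: runner
          resources:
            requests:
              cpu: 4          # heavy jobs get a large allocation
              memory: 8Gi

gha-runner-scale-set-generic:
  runnerScaleSetName: "company-hosted-generic"    # hypothetical name
  template:
    spec:
      containers:
        - name: runner
          resources:
            requests:
              cpu: 500m       # light follow-up tasks need much less
              memory: 1Gi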

@devonhk

devonhk commented Sep 10, 2024

The helm chart accepts a parameter "minRunners" you can set to have warm runners available to service jobs immediately. For my architecture, setting it to 1 for a small team and 2 for a big team has proven sufficient

I tried something similar to work around this issue, but one thing I noticed is that runners cannot be reused across multiple jobs.

Here's an example:

  1. I have a workflow that triggers 10 jobs sequentially (sketched below).
  2. I set minRunners to 2.

What I expect
I should have a sufficient number of runners to execute the entire workflow

What actually happens
Runner pods are terminated after every job, so new ones have to be created.

I would prefer not to have to "preheat" 10 runners at the start of the workflow.
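
A trimmed sketch of the kind of workflow I mean (3 jobs instead of 10; the runs-on label is assumed to match the runnerScaleSetName from the config above):

name: sequential-jobs
on: push
jobs:
  job1:
    runs-on: company-hosted
    steps:
      - run: echo "job 1"
  job2:
    needs: job1          # each job waits for the previous one
    runs-on: company-hosted
    steps:
      - run: echo "job 2"
  job3:
    needs: job2
    runs-on: company-hosted
    steps:
      - run: echo "job 3"

Even with minRunners: 2, each of these jobs lands on a freshly created pod, because the ephemeral runner pod is removed after it finishes a single job.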

4 participants