
gha_job_execution_duration_seconds_sum reports wrong value in some cases #3731

Open · 4 tasks done

hpedrorodrigues opened this issue Sep 5, 2024 · 3 comments

Labels: bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)


Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes.

To Reproduce

1. Install `gha-runner-scale-set-controller` using the Helm chart via FluxCD
2. Install a few `gha-runner-scale-set`s using the Helm chart via FluxCD
3. Run a few workflows that use these runner sets, including canceling a few of them (either manually or via `concurrency.group`; see the sketch below)
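
For step 3, here is a minimal sketch of a workflow that gets canceled via `concurrency.group` (the trigger, the group expression, and the runner label `cp-small` are illustrative, not taken from the affected repository):

```yaml
# Hypothetical workflow: a newer run in the same concurrency group
# cancels the in-progress one, matching the cancellation path above.
name: gh-deployment
on:
  repository_dispatch:
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  create-gh-deployment:
    runs-on: cp-small # matches runnerScaleSetName below
    steps:
      - run: echo "create deployment here"
```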

Describe the bug

In a few cases (I don't know the exact reason yet), the listener reports the metric `gha_job_execution_duration_seconds_sum` with a wrong value.

Example:

gha_job_execution_duration_seconds_sum{enterprise="",event_name="repository_dispatch",job_name="create-gh-deployment",job_result="canceled",job_workflow_ref="[redacted]/.github/workflows/gh-deployment.yml@refs/heads/master",organization="[redacted]",repository="[redacted]",runner_id="0",runner_name=""} 1.27722295721e+11

Looking at the repository, all runs finish in under 60 seconds; the others are canceled before they even start because the branch has a new commit. For scale, 1.27722295721e+11 seconds is over 4,000 years, which is clearly impossible for a job duration.

(Two screenshots attached: 2024-09-05 at 14:26:34 and 14:26:50.)

Describe the expected behavior

Not sure if this is caused only by canceled runs, but I'd expect the listener to report 0 for such runs.
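
Put differently, with illustrative label values (and assuming the usual Prometheus histogram `_count` companion series), a canceled run would contribute something like:

```
gha_job_execution_duration_seconds_sum{job_result="canceled",...} 0
gha_job_execution_duration_seconds_count{job_result="canceled",...} 1
```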

Additional Context

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: arc-controller
  namespace: arc
spec:
  chart:
    spec:
      chart: gha-runner-scale-set-controller
      sourceRef:
        name: arc
        kind: HelmRepository
        namespace: flux-system
      version: '>=0.9.3'
  interval: 1m
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace
  values:
    replicaCount: 1
    image:
      repository: [redacted]
    serviceAccount:
      create: true
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 200m
        memory: 200Mi
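    # Metrics exposure for both the controller-manager and the listener;
    # the listener is the component emitting gha_job_execution_duration_seconds_*.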
    metrics:
      controllerManagerAddr: ':8080'
      listenerAddr: ':8080'
      listenerEndpoint: '/metrics'
    flags:
      logFormat: 'json'
      watchSingleNamespace: 'arc'
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cp-small-runner-set
  namespace: arc
spec:
  chart:
    spec:
      chart: gha-runner-scale-set
      sourceRef:
        name: arc
        kind: HelmRepository
        namespace: flux-system
      version: '>=0.9.3'
  interval: 1m
  values:
    githubConfigUrl: [redacted]
    githubConfigSecret: gh-app-secret
    maxRunners: 10
    minRunners: 0
    runnerGroup: default
    runnerScaleSetName: cp-small
    containerMode:
      type: dind
    template:
      metadata:
        annotations:
          cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
      spec:
        nodeSelector:
          spot: 'false'
          dedicated-for: github-actions
        tolerations:
          - effect: NoSchedule
            key: dedicated-for
            value: github-actions-2x
        containers:
          - name: runner
            image: arc-default-runner
            command: ['/home/runner/run.sh']
            resources:
              requests:
                cpu: 2
                memory: 4Gi
              limits:
                cpu: 2
                memory: 4Gi
        terminationGracePeriodSeconds: 600

Controller Logs

N/A

Runner Pod Logs

N/A
hpedrorodrigues added the bug, gha-runner-scale-set, and needs triage labels on Sep 5, 2024

github-actions bot commented Sep 5, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@Lucas-Hughes

I get the same result from canceled runs or when the runner pods fail.

I implemented a bit of a hacky fix by adding parameters in Grafana to ignore values above a threshold, but I agree that it should be 0 for those runs.
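
For reference, the same threshold idea as a Prometheus recording rule (a sketch; the group name, rule name, and 1e6-second cutoff are arbitrary):

```yaml
# Hypothetical recording rule: keep only plausible samples; anything over
# 1e6 seconds (~11.5 days) is treated as one of the bogus values above.
groups:
  - name: arc-duration-workaround
    rules:
      - record: gha_job_execution_duration_seconds_sum:plausible
        expr: gha_job_execution_duration_seconds_sum < 1e6
```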

@laserpedro

I get the same result, and like @Lucas-Hughes said, it seems to happen when jobs are cancelled. That's too bad, since this metric is super valuable: we can use it to create alerts that detect slower-than-usual GitHub jobs.
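
For that alerting use case, a sketch of a Prometheus alert rule (assuming the histogram's `_count` companion series exists; the 5-minute threshold and the windows are arbitrary):

```yaml
# Hypothetical alert: fires when the average job duration over the last
# 30m stays above 5 minutes for 15 minutes.
groups:
  - name: arc-job-duration-alerts
    rules:
      - alert: GitHubJobsSlowerThanUsual
        expr: >
          rate(gha_job_execution_duration_seconds_sum[30m])
          / rate(gha_job_execution_duration_seconds_count[30m]) > 300
        for: 15m
        labels:
          severity: warning
```

Note that the bogus samples described above would skew this expression too, which is exactly why the wrong values hurt.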
