
Job Steps Incorrectly Marked as Successful #165

Open
israel-morales opened this issue May 21, 2024 · 7 comments
Labels: bug (Something isn't working), k8s

Comments


israel-morales commented May 21, 2024

Hello,

I apologize in advance that this error is inconsistent and I cannot reproduce it on demand.

With the Kubernetes runner hooks, we have seen some job steps incorrectly marked as successful.
This behavior is unexpected and has led to issues with our dev pipelines.

The two screenshots I have attached show the issue clearly: the output of the workflow pod is cut off and the step is immediately marked as successful.

[screenshots attached: jobfailsuccess, jobfailedsuccessfully]

Again, this occurs only occasionally and it's not clear what the underlying issue is; it is not limited to a specific job, nor does it appear to be load related.

Any guidance on how we can further troubleshoot or prevent this issue would be appreciated, thank you!

chart version: gha-runner-scale-set-0.9.0
values: values-gha.txt

nikola-jokic added the bug (Something isn't working) and k8s labels on Jun 13, 2024
@nikola-jokic (Contributor)

Hey @israel-morales,

This is a tough one... I'll try my best to figure out what is happening, and I'll update you on the progress. Sorry for the delay.

@nikola-jokic (Contributor)

Hey @israel-morales,

Can you please let me know if you are still seeing this issue on ARC 0.9.2? I'm wondering if the source of the issue was a controller bug that caused the runner container to shut down before it executed the job.
If you are still seeing the issue, can you please provide the runner log? Unfortunately, I failed to reproduce it: I tried killing the workflow container, killing the command within the workflow container, and killing the child command. I couldn't find a repro, but I'm wondering if ARC issues in the 0.9.0 release caused this behavior.


israel-morales commented Jul 10, 2024

@nikola-jokic We have seen the issue occur on ARC 0.9.2.
We noticed that killing the pods or processes, or even inducing an OOM, elicits a proper response from ARC and the runners, so the issue in question is due to something else.

We did manage to capture logs during one of these events, which I'll attach for your review.

[screenshot attached]
runner.log

The step ends with:
Finished process 100 with exit code 0

Let me know if there is anything else I can do to help determine the cause.


genesis-jamin commented Jul 10, 2024

Another example on 0.9.3 (copied from actions/actions-runner-controller#3578):

[screenshot attached]

Runner logs: https://gist.github.com/genesis-jamin/774d115df441c3afdd755f73a3c499dc

Grep the logs for "Finished process 170 with exit code 0" to see where the sleep 6000 step ends.


genesis-jamin commented Jul 11, 2024

@nikola-jokic Which version of k8s have you been testing with? We've seen this error on 1.28 and 1.29.

EDIT: We see this on 1.30 as well.

@genesis-jamin

Someone on the pytest-xdist repo mentioned that this could be related to the k8s exec logic: pytest-dev/pytest-xdist#1106 (comment)

@genesis-jamin

We were able to root-cause this: it turns out it's related to our k8s cluster setup. Our cluster is hosted on GKE, and we noticed that every time a GitHub step terminated early, it happened right after the cluster scaled down and evicted a konnectivity-agent pod. I am not a k8s networking expert myself, but it seems that this pod is responsible for maintaining the websocket connection used when exec'ing into a pod (which is what the runner does to run commands inside the workflow pod).
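For context, here is a minimal sketch (not the actual hook implementation) of what an exec into the workflow pod looks like with @kubernetes/client-node. The namespace, pod and container names and the exit-code handling are placeholders; the point is that if the exec websocket is torn down without a Failure status, a caller that treats "stream closed" as success would show exactly the symptom above.

```ts
import * as k8s from '@kubernetes/client-node'

const kc = new k8s.KubeConfig()
kc.loadFromDefault()
const exec = new k8s.Exec(kc)

// Run one step command inside the workflow pod and resolve with an exit status.
// Namespace, pod and container names are placeholders.
function runStep(command: string[]): Promise<number> {
  return new Promise<number>((resolve, reject) => {
    exec
      .exec(
        'arc-runners',    // namespace (placeholder)
        'workflow-pod',   // workflow pod name (placeholder)
        'job',            // container name (placeholder)
        command,
        process.stdout,   // step output streamed back over the exec websocket
        process.stderr,
        null,             // no stdin
        false,            // no tty
        (status: k8s.V1Status) => {
          // Fires when the exec stream closes. If the websocket is torn down
          // (e.g. the konnectivity-agent proxying it is evicted) without a
          // 'Failure' status, resolving 0 here would mark the step successful
          // even though the command never actually finished.
          resolve(status.status === 'Success' ? 0 : 1)
        }
      )
      .catch(reject) // connection/setup errors
  })
}

// Example: the long-running step from the logs above.
runStep(['sh', '-c', 'sleep 6000']).then((code) =>
  console.log(`Finished process with exit code ${code}`)
)
```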

We were able to somewhat mitigate this issue by adding taints/tolerations so that konnectivity-agent pods don't run on our pytest node pool (which scales up and down frequently). This helps us avoid the case where a konnectivity-agent pod is evicted because its node scales down, but does not solve the case where the konnectivity autoscaler scales down the number of replicas.
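For anyone trying the same mitigation, this is roughly the taint/toleration pairing we mean, sketched with @kubernetes/client-node types rather than YAML. The key/value names are made up for illustration, and the taint itself would normally be set on the node pool when it is created.

```ts
import * as k8s from '@kubernetes/client-node'

// Illustrative taint for the frequently-autoscaled pytest node pool. Without a
// matching toleration, konnectivity-agent pods are not scheduled there, so
// scaling that pool down no longer evicts them. Key/value are made up.
const pytestPoolTaint: k8s.V1Taint = {
  key: 'dedicated',
  value: 'pytest',
  effect: 'NoSchedule',
}

// Matching toleration for the runner/workflow pod template, so the CI pods are
// the only ones that land on the tainted pool.
const runnerToleration: k8s.V1Toleration = {
  key: 'dedicated',
  operator: 'Equal',
  value: 'pytest',
  effect: 'NoSchedule',
}

export { pytestPoolTaint, runnerToleration }
```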

Another option for us is to disable autoscaling, but that defeats the purpose of using ARC in the first place 😆
