
Job Steps Incorrectly Marked as Successful #165

Open
israel-morales opened this issue May 21, 2024 · 7 comments
Labels: bug (Something isn't working), k8s

Comments


israel-morales commented May 21, 2024

Hello,

I apologize in advance that this error is inconsistent and I cannot reproduce it on demand.

With the Kubernetes runner hooks, we have seen some job steps incorrectly marked as successful.
This behavior is unexpected and has led to issues with our dev pipelines.

The two screenshots I have attached show the issue clearly: the output of the workflow pod is cut off and the step is immediately marked as successful.

[screenshots attached: jobfailsuccess, jobfailedsuccessfully]

Again, this occurs only occasionally and it's not clear what the underlying issue is; it is not limited to a specific job, nor does it appear to be load related.

Any guidance on how we can further troubleshoot or prevent this issue would be appreciated, thank you!

chart version: gha-runner-scale-set-0.9.0
values: values-gha.txt

nikola-jokic added the bug (Something isn't working) and k8s labels on Jun 13, 2024
@nikola-jokic (Contributor)

Hey @israel-morales,

This is a tough one... I'll try my best to figure out what is happening, and I'll update you on the progress. Sorry for the delay.

@nikola-jokic (Contributor)

Hey @israel-morales,

Can you please let me know if you are still seeing this issue on ARC 0.9.2? I'm wondering if the source of the issue was a controller bug that caused the runner container to shut down before it executed the job.
If you are still seeing the issue, can you please provide the runner log? Unfortunately, I failed to reproduce it: I tried killing the workflow container, killing the command within the workflow container, and killing the child command. I couldn't find a repro, but I'm wondering if ARC issues in the 0.9.0 release caused this behavior.


israel-morales commented Jul 10, 2024

@nikola-jokic We have seen the issue occur on ARC 0.9.2.
We noticed that killing the pods or processes, or even inducing an OOM, elicits a proper response from ARC and the runners, so the issue in question is due to something else.

We did manage to capture logs during one of these events, which I'll attach for your review.

[screenshot attached]
runner.log

The step ends with:
Finished process 100 with exit code 0

Let me know if there is anything else I can do to help determine the cause.


genesis-jamin commented Jul 10, 2024

Another example on 0.9.3 (copied from actions/actions-runner-controller#3578):

[screenshot attached]

Runner logs: https://gist.github.com/genesis-jamin/774d115df441c3afdd755f73a3c499dc

Grep the logs for "Finished process 170 with exit code 0" to see where the sleep 6000 step ends.


genesis-jamin commented Jul 11, 2024

@nikola-jokic Which version of k8s have you been testing with? We've seen this error on 1.28 and 1.29.

EDIT: We see this on 1.30 as well.

@genesis-jamin

Someone on the pytest-xdist repo mentioned that this could be related to the k8s exec logic: pytest-dev/pytest-xdist#1106 (comment)

@genesis-jamin

We were able to root-cause this: it turns out it's related to our k8s cluster setup. Our cluster is hosted on GKE, and we noticed that every time a GitHub step terminated early, it happened right after the cluster scaled down and evicted a konnectivity-agent pod. I am not a k8s networking expert myself, but it seems that this pod is responsible for maintaining the websocket connection used when exec'ing into a pod (which is what the runner does to run commands inside the workflow pod).
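For context, here is a minimal sketch (not the actual hook implementation) of what an exec into the workflow pod looks like with @kubernetes/client-node. The namespace, pod and container names and the exit-code handling are placeholders; the point is that if the exec websocket is torn down without a Failure status, a caller that treats "stream closed" as success would show exactly the symptom above.

```ts
import * as k8s from '@kubernetes/client-node'

const kc = new k8s.KubeConfig()
kc.loadFromDefault()
const exec = new k8s.Exec(kc)

// Run one step command inside the workflow pod and resolve with an exit status.
// Namespace, pod and container names are placeholders.
function runStep(command: string[]): Promise<number> {
  return new Promise<number>((resolve, reject) => {
    exec
      .exec(
        'arc-runners',    // namespace (placeholder)
        'workflow-pod',   // workflow pod name (placeholder)
        'job',            // container name (placeholder)
        command,
        process.stdout,   // step output streamed back over the exec websocket
        process.stderr,
        null,             // no stdin
        false,            // no tty
        (status: k8s.V1Status) => {
          // Fires when the exec stream closes. If the websocket is torn down
          // (e.g. the konnectivity-agent proxying it is evicted) without a
          // 'Failure' status, resolving 0 here would mark the step successful
          // even though the command never actually finished.
          resolve(status.status === 'Success' ? 0 : 1)
        }
      )
      .catch(reject) // connection/setup errors
  })
}

// Example: the long-running step from the logs above.
runStep(['sh', '-c', 'sleep 6000']).then((code) =>
  console.log(`Finished process with exit code ${code}`)
)
```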

We were able to somewhat mitigate this issue by adding taints/tolerations so that konnectivity-agent pods don't run on our pytest node pool (which scales up and down frequently). This helps us avoid the case where a konnectivity-agent pod is evicted because its node scales down, but does not solve the case where the konnectivity autoscaler scales down the number of replicas.
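For anyone trying the same mitigation, this is roughly the taint/toleration pairing we mean, sketched with @kubernetes/client-node types rather than YAML. The key/value names are made up for illustration, and the taint itself would normally be set on the node pool when it is created.

```ts
import * as k8s from '@kubernetes/client-node'

// Illustrative taint for the frequently-autoscaled pytest node pool. Without a
// matching toleration, konnectivity-agent pods are not scheduled there, so
// scaling that pool down no longer evicts them. Key/value are made up.
const pytestPoolTaint: k8s.V1Taint = {
  key: 'dedicated',
  value: 'pytest',
  effect: 'NoSchedule',
}

// Matching toleration for the runner/workflow pod template, so the CI pods are
// the only ones that land on the tainted pool.
const runnerToleration: k8s.V1Toleration = {
  key: 'dedicated',
  operator: 'Equal',
  value: 'pytest',
  effect: 'NoSchedule',
}

export { pytestPoolTaint, runnerToleration }
```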

Another option for us is to disable autoscaling, but that defeats the purpose of using ARC in the first place 😆
