Job Steps Incorrectly Marked as Successful #165
Comments
Hey @israel-morales, This one is a tough one... I'll try my best to figure out what is happening, and I'll update you on the progress. Sorry for the delay.
Hey @israel-morales, Can you please let me know if you are still seeing this issue on ARC?
@nikola-jokic We have seen the issue occur on ARC. We did manage to capture logs during one of these events, which I'll attach for your review. The step ends with:

Let me know if there is anything else I can do to help determine the cause.
Another example. Runner logs: https://gist.github.com/genesis-jamin/774d115df441c3afdd755f73a3c499dc
Grep the logs for "Finished process 170 with exit code 0" to see where the step output gets cut off.
@nikola-jokic which version of k8s have you been testing with? We've seen this error on 1.28 and 1.29. EDIT: We see this on 1.30 as well.
Someone on the pytest-xdist repo mentioned that this could be related to the k8s exec logic: pytest-dev/pytest-xdist#1106 (comment)
We were able to root cause this -- turns out it's related to our k8s cluster setup. Our k8s cluster is hosted on GKE, and we noticed that every time a GitHub step would terminate early, it happened right after the cluster scaled down and evicted some pods.

We were able to somewhat mitigate this issue by adding taints / tolerations so that the runner pods are scheduled onto nodes that are not subject to scale-down.

Another option for us is to disable autoscaling, but that defeats the purpose of using ARC in the first place 😆
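A minimal sketch of what such a values override could look like, assuming a dedicated, non-autoscaled GKE node pool labeled `pool: arc-runners` and tainted with a hypothetical `arc-runners=true:NoSchedule` taint; the pool label, taint key, and image are illustrative only and not taken from this thread:

```yaml
# Hypothetical gha-runner-scale-set values override: keep runner pods on a
# dedicated node pool and ask the cluster autoscaler not to evict them during
# scale-down. Label and taint names are examples only.
template:
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  spec:
    nodeSelector:
      pool: arc-runners            # example node pool label
    tolerations:
      - key: arc-runners           # example taint key on that pool
        operator: Equal
        value: "true"
        effect: NoSchedule
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
```

Note that with the kubernetes container hooks the workflow job runs in a separate pod created by the hook, so the same scheduling constraints may also need to be applied to those pods (for example via a hook pod template referenced through ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE); otherwise only the runner pod itself is protected from eviction.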
Hello,
I'll apologize in advance: the error is inconsistent and I cannot reproduce it on demand.
With the kubernetes runner hooks, we have experienced some job steps incorrectly being marked as successful.
This behavior is unexpected and has led to issues with our dev pipelines.
The two screenshots I have attached show the issue clearly. You can see that the output of the workflow pod is cut off and the step is immediately marked as successful.
Again, this only happens sometimes, and it's not clear what the underlying issue is; it isn't limited to a specific job, nor does it appear to be load-related.
Any guidance on how we can further troubleshoot or prevent this issue would be appreciated, thank you!
chart version: gha-runner-scale-set-0.9.0
values: values-gha.txt
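The attached values file is not reproduced here. For context, a gha-runner-scale-set deployment using the kubernetes container hooks is typically configured along these lines; the config URL, secret name, and storage settings below are placeholders rather than values from the attachment:

```yaml
# Illustrative excerpt only; the actual values-gha.txt attachment is not shown above.
githubConfigUrl: "https://github.com/my-org/my-repo"   # placeholder
githubConfigSecret: "pre-defined-secret"               # placeholder

# Kubernetes container mode: job steps run in separate pods created by the
# runner container hooks, which is the code path this issue is about.
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "standard"   # placeholder
    resources:
      requests:
        storage: 1Gi
```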