VLAB random setup-vpcs errors when configuring server VMs #292
Comments
Another hit:
So the error happens while this line executes:
It would be helpful to capture the state (logs, outputs) of the VLAB just after the failure so we can analyse the root cause. @Frostman, to progress on this one, what's your preferred approach? Expand the hhfab helpers, or allow hhfab to (also) run "detached" (e.g. behind a Service interface) and gather artifacts externally?
I could reproduce this in env-3
I could see this in the server logs:
Around that time I see an error in the Agent on s5248-05, and it looks like it restarts:
Does this look like any known issue, @Frostman?
I just saw this exact same issue in env-1 while testing the VRF scaling, for what it's worth:
First hit with show-tech captured:
I hit this on env-3:
I notice the link going down and no IP after recovering:
Investigating the upstream switch:
I see a series of logs when
And up:
I see other ports going down in the switch log:
@Frostman, are you aware of any SONiC issue like this, or should we inspect the lab cabling/NICs? A local fault usually indicates that loss of signal was detected on the receive data path of a local port:
Thanks, @edipascale. I'm getting this consistently in env-3, so I'm trying to improve the hhnet script while working on another PR.
OK. I took a look at the hhnet script and it does a
I refactored hhnet to use networkctl, and my first impression is that it's a lot more stable. I'm currently facing some other issues in env-3 that are getting in the way of testing this.
I chose networkctl to try to persist the network configuration across link flaps, but none of my attempts improved the pipeline success rate. So for now I've reverted to the old hhnet.sh to continue investigating. Findings on server-7:
This suggests a temporary disconnect or switch-side issue.
However, the IP address is missing for enp2s1.1007. On the switch side:
I'm adding logging to hhnet.sh and improving it to make it more resilient.
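For illustration only, here is a minimal sketch of the kind of logging-plus-retry wrapper this could mean. It is not taken from hhnet.sh: the `log` and `retry` helper names, the log prefix, and the attempt/delay parameters are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, NOT part of the real hhnet.sh: retry a flaky
# command a bounded number of times and log every attempt.

log() {
    # Timestamped message on stderr so stdout stays clean for callers.
    printf '%s hhnet-sketch: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2
}

retry() {
    # Usage: retry <max_attempts> <delay_seconds> <command> [args...]
    local max_attempts="$1" delay="$2" attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            log "giving up after ${attempt} attempts: $*"
            return 1
        fi
        log "attempt ${attempt} failed, retrying in ${delay}s: $*"
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    log "succeeded on attempt ${attempt}: $*"
}
```

A call such as `retry 5 2 ip link set dev enp2s1 up` would then leave a timestamped trail of each failed attempt, which is the kind of evidence that is missing when the job fails in CI. Whether retrying is enough depends on the root cause: if the link flap is on the switch side, this only makes the failure easier to diagnose.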
There are known issues with the hhnet script: https://github.com/githedgehog/fabricator/blob/master/pkg/hhfab/hhnet.sh
But as it's hitting the CI from time to time, it needs to be addressed to make the CI more reliable and avoid having to retry the job:
https://github.com/githedgehog/fabricator/actions/runs/12530586269/job/34947281228