
Add retries and confirmations to ensure CNCF runners and machines are removed. #58

Open
gyohuangxin opened this issue Jul 15, 2022 · 15 comments
Labels
area/performance Performance management issue/willfix This issue will be worked on kind/bug Something isn't working priority/high High priority issue

Comments

@gyohuangxin
Member

Description

Some CNCF runners are not being removed after tests complete, and the number of them gradually increases over time.
We can delete them manually, but it would be better to ensure they are removed properly.
[screenshot]

The same thing has happened with Equinix server deletion:
[screenshot]

Expected Behavior

We should add retries and confirmations to ensure CNCF runners and machines are removed.
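
A minimal sketch of what that could look like, assuming the Equinix Metal API and a `METAL_AUTH_TOKEN` environment variable (the function name, retry counts, and wait time are illustrative, not taken from the existing script):

```bash
#!/usr/bin/env bash
# Sketch: delete an Equinix Metal device and confirm it is really gone,
# retrying a few times before giving up.
set -euo pipefail

delete_device_with_confirmation() {
  local device_id="$1"
  local attempt status
  for attempt in 1 2 3; do
    # Issue the delete; failures here are tolerated because the
    # confirmation check below decides whether to retry.
    curl -s -X DELETE \
      -H "X-Auth-Token: ${METAL_AUTH_TOKEN}" \
      "https://api.equinix.com/metal/v1/devices/${device_id}" || true

    sleep 30  # give the API time to process the deprovision

    # Confirmation: a removed device should return HTTP 404.
    status="$(curl -s -o /dev/null -w '%{http_code}' \
      -H "X-Auth-Token: ${METAL_AUTH_TOKEN}" \
      "https://api.equinix.com/metal/v1/devices/${device_id}" || echo "000")"
    if [ "${status}" = "404" ]; then
      echo "Confirmed removal of device ${device_id}."
      return 0
    fi
    echo "Attempt ${attempt}: device ${device_id} still present (HTTP ${status}); retrying..."
  done

  echo "ERROR: device ${device_id} was not removed after 3 attempts." >&2
  return 1
}
```

The same delete-wait-confirm-retry pattern could also be applied to deregistering the CNCF runners on the GitHub side.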

Screenshots/Logs

Environment:

  • Meshery Version:
  • Kubernetes Version:
  • Host OS:
  • Browser:
@gyohuangxin gyohuangxin added the kind/bug Something isn't working label Jul 15, 2022

stale bot commented Sep 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the issue/stale Issue has not had any activity for an extended period of time label Sep 9, 2022

stale bot commented Sep 21, 2022

This issue is being automatically closed due to inactivity. However, you may choose to reopen this issue.

@stale stale bot closed this as completed Sep 21, 2022
@gyohuangxin gyohuangxin reopened this Sep 21, 2022
@stale stale bot removed the issue/stale Issue has not had any activity for an extended period of time label Sep 21, 2022

stale bot commented Nov 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the issue/stale Issue has not had any activity for an extended period of time label Nov 12, 2022

stale bot commented Nov 22, 2022

This issue is being automatically closed due to inactivity. However, you may choose to reopen this issue.

@stale stale bot closed this as completed Nov 22, 2022
@leecalcote leecalcote reopened this Nov 22, 2022
@stale stale bot removed the issue/stale Issue has not had any activity for an extended period of time label Nov 22, 2022

stale bot commented Jan 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the issue/stale Issue has not had any activity for an extended period of time label Jan 7, 2023
@leecalcote
Member

Uh-oh. We do need to complete this item.

@stale stale bot removed the issue/stale Issue has not had any activity for an extended period of time label Jan 8, 2023
@leecalcote leecalcote added issue/willfix This issue will be worked on area/performance Performance management labels Feb 11, 2023
@leecalcote leecalcote added the priority/high High priority issue label Feb 28, 2023
@vielmetti

It's possible to create machines on Equinix Metal in such a way that there's a termination time associated with them. See the "termination_time" field at

https://deploy.equinix.com/developers/api/metal/#tag/Devices/operation/createDevice

in the Equinix Metal API reference.

(That's not a substitute for cleanup, but it could backstop any other efforts if there's a bug somewhere else).
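
For illustration, a hedged sketch of such a create call with a termination_time about two hours out (the project ID, hostname, plan, metro, and OS values below are placeholders, not the values the workflow actually uses):

```bash
# Sketch: create a device that Equinix Metal will reclaim automatically if the
# workflow never gets around to deprovisioning it.
TERMINATION_TIME="$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%SZ)"

curl -s -X POST \
  -H "X-Auth-Token: ${METAL_AUTH_TOKEN}" \
  -H "Content-Type: application/json" \
  "https://api.equinix.com/metal/v1/projects/${METAL_PROJECT_ID}/devices" \
  -d '{
        "hostname": "cncf-cil-runner",
        "plan": "c3.small.x86",
        "metro": "da",
        "operating_system": "ubuntu_22_04",
        "termination_time": "'"${TERMINATION_TIME}"'"
      }'
```

As noted, this is a backstop rather than a replacement for explicit cleanup: a machine that finishes its work early still runs until the termination time passes unless it is deleted.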

@vielmetti

There was a short-lived API outage yesterday, described at

https://status.equinixmetal.com/incidents/h30n2jlr5d3p

which may have impacted manual deletion of these systems. Please retry if you were affected by this. As of this writing, there are 48 systems deployed.

@gyohuangxin
Member Author

@vielmetti I'm still having trouble accessing the management UI:
[screenshot]

@vielmetti

@gyohuangxin can you open up a ticket with our support team? I'll share your UI issue with the team, but it may be something specific to your account.

@vielmetti

@gyohuangxin Can you please task someone else on the project to assist you with cleaning up the idle and stranded resources while we sort out your access problems?

@vielmetti

The code that notices that a deprovision failed is here

https://github.com/layer5io/meshery-smp-action/blob/862c5283953f1b5a3a607c9e1f00461f98a4b4d5/.github/workflows/scripts/stop-cil-runner.sh#L19

It logs an error:

echo "ERROR: Failed to remove CNCF CIL machine: $hostname, device id: $device_id."

and then exits without retrying. If the deletion fails for any transient reason, the machines will live forever until someone intervenes manually.

Where does this error log go? If it's published somewhere, we could look for patterns.

@leecalcote
Member

@Revolyssup, will you please add this to tomorrow’s CI meeting? @edwvilla’s help here is much appreciated. Let’s ensure that we have a quick review and resolution. // @gyohuangxin

@leecalcote
Member

All existing servers were manually deprovisioned today. A fresh batch of newly provisioned servers is running now from the scheduled workflow. Let's see whether those servers are automatically deprovisioned when their tasks complete.

@leecalcote
Member

Yes, it seems that the test servers are successfully deprovisioned at the end of the test. 👍
