WIP: Node mark reboot helper #65
base: master
This can already be done with
I closed it too soon.
    SLURM_SC_OFFLINE_ARGS="update State=DRAIN"
    exec $SLURM_SCONTROL $SLURM_SC_OFFLINE_ARGS NodeName=$HOSTNAME Reason="$LEADER $NOTE"
Versus:
    SLURM_SC_OFFLINE_ARGS="reboot ASAP"
    exec $SLURM_SCONTROL $SLURM_SC_OFFLINE_ARGS $HOSTNAME
I'm working on an improved version with Slurm 18.08 support (
I've got an internal version that we use. I'm going to push it to this branch.
57329fa to 7486112
OK, the difference between node-mark-offline and node-mark-reboot is only a few lines, so they can easily be merged into one helper. The main issue is that The mark-node-online helper can cancel pending reboots (if the node is healthy again) or mark nodes online after a reboot. We've opted to reboot them with
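To illustrate how small the difference is, here is a minimal sketch of a merged helper that picks the scontrol arguments by mode. It only prints the command line instead of running it. The function name offline_cmd and the mode names drain and reboot are hypothetical, chosen for this example; the scontrol arguments are the ones quoted earlier in this conversation. This is not the code pushed to this branch.

```shell
# Sketch only: offline_cmd and its mode names are illustrative, not part of NHC.
SLURM_SCONTROL="${SLURM_SCONTROL:-scontrol}"

# Print the scontrol command line for the given mode, node, and reason.
offline_cmd() {
    mode="$1"
    node="$2"
    reason="$3"
    case "$mode" in
        drain)
            # Same arguments as the existing node-mark-offline snippet above
            echo "$SLURM_SCONTROL update State=DRAIN NodeName=$node Reason=\"$reason\""
            ;;
        reboot)
            # Same arguments as the node-mark-reboot snippet above
            echo "$SLURM_SCONTROL reboot ASAP $node"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
}

offline_cmd drain node01 "NHC: check failed"
offline_cmd reboot node01
```

A real helper would exec the command rather than echo it; printing keeps the sketch safe to run on a machine without Slurm.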
3df3e9c to eebeed7
For anyone looking to use this helper: this would work perfectly with something like this (I'm referring to the service file). That's because we've opted to let the node return in a drained state, so it will only be resumed if NHC is run during the boot process (or manually). That's by design, because we don't want to trigger a boot loop during the prologue, and we also like to avoid scheduling a job on a node before we know for sure that it's in a good state after the reboot. We use it like this in
Alternatively, there is this pull request that tries to handle it differently.
I added a helper script to mark nodes for reboot. It's based on node-mark-offline, but executes scontrol reboot ASAP <node> instead. This helper script can be used by setting OFFLINE_NODE to $HELPERDIR/node-mark-reboot. This is useful for checks that need a reboot when they fail. It's compatible with Slurm only.
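As a rough sketch, the setting described above could look like this in a shell-style settings file that NHC sources at startup. The fallback path /usr/libexec/nhc is an assumption for the example (your HELPERDIR may differ), and the executable check is only an illustrative sanity test, not something NHC requires.

```shell
# Assumption: HELPERDIR is normally set by NHC; the fallback below is a guess
# used only so this sketch is self-contained.
HELPERDIR="${HELPERDIR:-/usr/libexec/nhc}"

# Point NHC's offline action at the reboot helper from this PR.
OFFLINE_NODE="$HELPERDIR/node-mark-reboot"

# Optional sanity check: warn if the helper is missing or not executable.
if [ ! -x "$OFFLINE_NODE" ]; then
    echo "warning: $OFFLINE_NODE not found or not executable" >&2
fi
```

With this in place, a failing check would mark the node for reboot rather than just draining it.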