catch sleep notifications and pause all GPU slots #1006
Comments
Comment by @jcoffland |
Comment by @kbernhagen The client or cores need to recover from power loss to GPU. All the client currently catches on Windows are Away Mode and Monitor on/off for idle detection. Example problem: |
Comment by @kbernhagen |
Comment by @jcoffland |
Comment by @kbernhagen Sleep can be delayed up to 2 seconds on Windows, and about 30 seconds on OSX. That's plenty of time to stop cores, or for the cores to stop their processing. |
Comment by @bb30994 The sleep process checkpoints processes that are active in RAM and can resume them easily. It does not checkpoint work-in-process data from a GPU, but Windows can regenerate data associated with the console by refreshing the screen. Adding checkpointing and recovery to each GPU Fahcore is certainly possible, but fixing each of them is a lot more expensive than fixing it in one place. A Pause/Resume can successfully back up to the most recent checkpoint that was written to disk with VERY minimal changes to FAHClient. Fahcores do not register for power-change notifications, nor can they really do anything about the loss of power (and data) in the GPU. As you suggest, they probably can't do anything quickly enough to capture the necessary data, whereas the Pause/Resume sequence need not even happen before the system goes to sleep. Currently, the GPU Fahcore is unaware that anything has happened and it just continues to wait for data that has been lost. |
Comment by @jcoffland
Most cores take 30 seconds or more to stop (aka pause). Are you sure FahStatusTrayIcon actually works? I would bet that it works some of the time but not always. If the core itself tracked power notifications, it would always work. Better yet, the core should time out when it does not get a response from the GPU. This still seems like an ugly hack to me. Note, there are really only one or two active GPU cores, so it's not as big a problem to fix as claimed. |
Comment by @jcoffland |
Comment by @PantherX FahCore_18 has been discussed but nothing has surfaced. |
Comment by @jcoffland |
Comment by @PantherX 0x11 is too old and 0x16 is virtually not assigned to any AMD GPUs unless there is a lack of 0x17 WUs. |
Comment by @bb30994 Having the FahCore keep track of all work packets assigned to the GPU and re-send lost packets would be a valid solution but, just like replacing UDP with TCP, it would increase overhead. If the FahCore is notified of the two changes of state, it can most likely restore the previous checkpoint and resume processing from that point. This was one of the identified limitations of the old screensaver client. If the client pauses / resumes / sleeps / awakens frequently without relatively long periods of activity between interruptions, none of the work being processed may be checkpointed for extended periods of time. I vote for a function that can re-transmit all uncompleted work packets like TCP would do, which will resume processing with a minimum of repeated processing, provided it can be guaranteed to perform that function without introducing new bugs. |
Comment by @bb30994 |
Has the problem been solved either in Core_17/_18/_21/_22 or in FAHClient 7.4.xx? The FAHClient setting disable-sleep-when-active was one way to avoid this issue, but donors complain that FAH prevents their system from sleeping. If this issue was closed because a better solution was found, then it makes sense to change the default of disable-sleep-when-active to false. I have not seen any recent reports of WUs being corrupted by a system sleep, but that doesn't prove whether loss of GPU power causes the loss of data from active kernels. Maybe the FAHCore could suspend issuing new kernels and wait long enough for them all to be returned to main RAM (which will be saved by the OS), but "long enough" is difficult to define. It's not really necessary to go back to the previous checkpoint as long as all of the data required to resume work is now in main RAM. |
On Windows, I think the messages are
WM_POWERBROADCAST / PBT_APMSUSPEND
WM_POWERBROADCAST / PBT_APMRESUMEAUTOMATIC
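As a rough illustration of how those two messages could be told apart, here is a minimal, portable sketch. The numeric values match the Windows SDK definitions of WM_POWERBROADCAST, PBT_APMSUSPEND and PBT_APMRESUMEAUTOMATIC, but they are redefined under hypothetical names (kWmPowerBroadcast and so on) so the snippet compiles without <windows.h>. The PowerEvent enum and classifyPowerMessage() are assumptions made for this example, not existing client code.

```cpp
#include <cstdint>

// Values as defined in the Windows SDK headers, redefined here under
// hypothetical names so the sketch builds on any platform.
constexpr std::uint32_t  kWmPowerBroadcast      = 0x0218; // WM_POWERBROADCAST
constexpr std::uintptr_t kPbtApmSuspend         = 0x0004; // PBT_APMSUSPEND
constexpr std::uintptr_t kPbtApmResumeAutomatic = 0x0012; // PBT_APMRESUMEAUTOMATIC

// Hypothetical platform-neutral event type.
enum class PowerEvent { None, WillSleep, DidWake };

// Map a window message and its wParam to a power event.
// Anything other than the two messages of interest yields None.
PowerEvent classifyPowerMessage(std::uint32_t msg, std::uintptr_t wParam) {
  if (msg != kWmPowerBroadcast) return PowerEvent::None;

  switch (wParam) {
    case kPbtApmSuspend:         return PowerEvent::WillSleep;
    case kPbtApmResumeAutomatic: return PowerEvent::DidWake;
    default:                     return PowerEvent::None;
  }
}
```

In a real window procedure the msg and wParam arguments would come straight from the WndProc parameters, and the returned event would decide whether to call the sleep or wake handler.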
On receiving suspend msg, I propose simply calling
app.systemWillSleep()
which would
lock if necessary
write log message
slotMgr.systemWillSleep() // set pauseForSleep flag on GPU slots
slotMgr.update() // or whatever causes an immediate pause
unlock
For resume, there would be a parallel method
app.systemDidWake()
For OSX, catching sleep notifications is part of ticket #944.
The systemWillSleep, systemDidWake mechanism would be shared.
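The shared mechanism could look roughly like the sketch below. Only the names systemWillSleep, systemDidWake, slotMgr, update and pauseForSleep come from the proposal. The Slot, SlotManager and App classes here are simplified stand-ins written for this example, not the real FAHClient classes, and update() merely flips a running flag where the real client would stop or restart a core.

```cpp
#include <iostream>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a client slot.
struct Slot {
  std::string name;
  bool isGPU = false;
  bool pauseForSleep = false; // set while the system is going to sleep
  bool running = true;
};

// Hypothetical stand-in for the slot manager.
class SlotManager {
public:
  explicit SlotManager(std::vector<Slot> initial) : slots(std::move(initial)) {}

  void systemWillSleep() { setPauseForSleep(true); }
  void systemDidWake()   { setPauseForSleep(false); }

  // Apply the flags immediately: flagged slots stop, all others run.
  void update() {
    for (auto &slot : slots) slot.running = !slot.pauseForSleep;
  }

  const std::vector<Slot> &getSlots() const { return slots; }

private:
  // Only GPU slots are affected; CPU slots survive sleep unaided.
  void setPauseForSleep(bool pause) {
    for (auto &slot : slots)
      if (slot.isGPU) slot.pauseForSleep = pause;
  }

  std::vector<Slot> slots;
};

class App {
public:
  explicit App(SlotManager mgr) : slotMgr(std::move(mgr)) {}

  void systemWillSleep() {
    std::lock_guard<std::mutex> lock(mtx);              // lock if necessary
    std::clog << "System will sleep, pausing GPU slots\n"; // write log message
    slotMgr.systemWillSleep();                          // set pauseForSleep flags
    slotMgr.update();                                   // pause immediately
  }                                                     // unlock on scope exit

  void systemDidWake() {
    std::lock_guard<std::mutex> lock(mtx);
    std::clog << "System did wake, resuming GPU slots\n";
    slotMgr.systemDidWake();
    slotMgr.update();
  }

  const SlotManager &slotManager() const { return slotMgr; }

private:
  std::mutex mtx;
  SlotManager slotMgr;
};
```

Because both platform back ends would only ever call App::systemWillSleep() and App::systemDidWake(), the Windows message handler and the OSX notification callback can stay thin and share everything below that point.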