This repository has been archived by the owner on Jul 13, 2024. It is now read-only.

catch sleep notifications and pause all GPU slots #1006

Closed
jcoffland opened this issue Mar 17, 2013 · 14 comments
Assignees
Labels
0.Status - More Information: Reported issue needs more information before a decision is made.
1.Type - Enhancement: Reported issue is an enhancement.

Comments

@jcoffland
Member

Trac Data
Ticket 1006
Reported by @kbernhagen
Status closed
Component FAHClient
Priority 5

On Windows, I think the messages are
WM_POWERBROADCAST / PBT_APMSUSPEND
WM_POWERBROADCAST / PBT_APMRESUMEAUTOMATIC

On receiving suspend msg, I propose simply calling
app.systemWillSleep()

which would
lock if necessary
write log message
slotMgr.systemWillSleep() // set pauseForSleep flag on GPU slots
slotMgr.update() // or whatever causes an immediate pause
unlock

For resume, there would be a parallel method
app.systemDidWake()

For OSX, catching sleep notifications is part of ticket #944.
The systemWillSleep, systemDidWake mechanism would be shared.
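
The proposal above could be sketched in C++ roughly as follows. This is only an illustration, not FAHClient code: the `Slot`, `SlotManager`, and `App` classes and the `onWindowMessage` helper are hypothetical, and the three Windows constants are redefined locally (with their Windows SDK values) so the sketch compiles on any platform. The `slotMgr.update()` step is left out because the ticket does not say what actually triggers an immediate pause.

```cpp
#include <cstdint>
#include <iostream>
#include <mutex>
#include <vector>

// Values match the Windows SDK headers; redefined here for portability.
constexpr std::uint32_t kWmPowerBroadcast = 0x0218;       // WM_POWERBROADCAST
constexpr std::uint32_t kPbtApmSuspend = 0x0004;          // PBT_APMSUSPEND
constexpr std::uint32_t kPbtApmResumeAutomatic = 0x0012;  // PBT_APMRESUMEAUTOMATIC

// Hypothetical slot model; the real FAHClient classes are not shown in this ticket.
struct Slot {
  bool isGpu = false;
  bool pauseForSleep = false;
};

class SlotManager {
public:
  std::vector<Slot> slots;

  void systemWillSleep() { setPauseForSleep(true); }
  void systemDidWake() { setPauseForSleep(false); }

  // Number of slots currently held paused by the sleep flag.
  int pausedForSleepCount() const {
    int n = 0;
    for (const Slot &s : slots)
      if (s.pauseForSleep) ++n;
    return n;
  }

private:
  // Only GPU slots are touched; CPU slots survive sleep without help.
  void setPauseForSleep(bool value) {
    for (Slot &s : slots)
      if (s.isGpu) s.pauseForSleep = value;
  }
};

class App {
public:
  SlotManager slotMgr;

  void systemWillSleep() {
    std::lock_guard<std::mutex> lock(mutex_);  // "lock if necessary"
    std::clog << "System is going to sleep; pausing GPU slots\n";
    slotMgr.systemWillSleep();
  }

  void systemDidWake() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::clog << "System woke up; resuming GPU slots\n";
    slotMgr.systemDidWake();
  }

  // Intended to be called from a window procedure with (msg, wParam).
  // Returns true if the message was a power event this sketch handles.
  bool onWindowMessage(std::uint32_t msg, std::uint32_t wParam) {
    if (msg != kWmPowerBroadcast) return false;
    if (wParam == kPbtApmSuspend) { systemWillSleep(); return true; }
    if (wParam == kPbtApmResumeAutomatic) { systemDidWake(); return true; }
    return false;
  }

private:
  std::mutex mutex_;
};
```

An OSX sleep notification handler would call the same `systemWillSleep()` and `systemDidWake()` methods, which is the shared mechanism described above.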

@jcoffland
Member Author

Comment by @jcoffland
Why is this necessary or desirable? The current power management notification in Windows seems to work fine.

@jcoffland
Member Author

Comment by @kbernhagen
Because when the system goes to sleep, which can always be forced by user or system, GPUs will power down and the GPU slots will malfunction and either get stuck or abort and dump.

The client or cores need to recover from power loss to GPU.
They currently don't even know when the system goes to sleep.

All the client currently catches on Windows are Away Mode and Monitor on/off for idle detection.

Example problem:
http://foldingforum.org/viewtopic.php?f=85&p=239356#p239356

@jcoffland
Member Author

Comment by @kbernhagen
Partially a repost, but see also
http://foldingforum.org/viewtopic.php?f=88&t=23926

@jcoffland
Member Author

Comment by @jcoffland
This sounds more like a GPU core issue to me. The cores are supposed to recover from this sort of thing. I doubt there is time to shut down the core cleanly if, for example, the laptop lid is closed.

@jcoffland
Member Author

Comment by @kbernhagen
Fine if you want to fix it in the cores.

Sleep can be delayed up to 2 seconds on Windows, and about 30 seconds on OSX.

That's plenty of time to stop cores, or for the cores to stop their processing.

@jcoffland
Member Author

Comment by @bb30994
I'm attaching some user-generated code that pauses the GPUs when there's a power change notification -- as an external process. Whether you choose to use something akin to this method or develop your own code is up to you, but it's basically MUCH simpler than doing it in all of the FahCores.

The sleep process checkpoints processes that are active in RAM and can resume them easily. It does not checkpoint work-in-process data from a GPU, but Windows can regenerate data associated with the console by refreshing the screen. Adding checkpointing and recovery to each GPU FahCore is certainly possible, but fixing each of them is a lot more expensive than fixing it in one place. A Pause/Resume can successfully back up to the most recent checkpoint that was written to disk with VERY minimal changes to FAHClient. FahCores do not register for power-change notifications, nor can they really do anything about the loss of power (and data) in the GPU. As you suggest, they probably can't do anything quickly enough to capture the necessary data, whereas the Pause/Resume sequence need not even happen before the system goes to sleep.

Currently, the GPU Fahcore is unaware that anything has happened and it just continues to wait for data that has been lost.

@jcoffland
Member Author

Comment by @jcoffland
Replying to [comment:5 calxalot]:

> Fine if you want to fix it in the cores.
>
> Sleep can be delayed up to 2 seconds on Windows, and about 30 seconds on OSX.
>
> That's plenty of time to stop cores, or for the cores to stop their processing.

Most cores take 30 seconds or more to stop (aka pause).

Are you sure FahStatusTrayIcon actually works? I would bet that it works some of the time but not always. If the core itself tracked power notifications, it would always work. Better yet, the core should time out when it does not get a response from the GPU. This still seems like an ugly hack to me. Note that there are really only one or two active GPU cores, so it's not as big a problem to fix as claimed.
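
A timeout of the kind suggested here might look like the following minimal sketch. The `GpuWatchdog` class is hypothetical and not taken from any FahCore; it only records when the GPU last responded and reports when that was too long ago. The time points are passed in explicitly so the logic can be checked without real waiting.

```cpp
#include <chrono>

// Hypothetical watchdog: the core calls noteResponse() whenever the GPU
// returns a result and polls timedOut() while it is waiting.
class GpuWatchdog {
public:
  using Clock = std::chrono::steady_clock;

  explicit GpuWatchdog(Clock::duration timeout,
                       Clock::time_point now = Clock::now())
      : timeout_(timeout), lastResponse_(now) {}

  // Record that the GPU answered at time `now`.
  void noteResponse(Clock::time_point now = Clock::now()) {
    lastResponse_ = now;
  }

  // True once more than `timeout` has passed since the last response.
  bool timedOut(Clock::time_point now = Clock::now()) const {
    return now - lastResponse_ > timeout_;
  }

private:
  Clock::duration timeout_;
  Clock::time_point lastResponse_;
};
```

When `timedOut()` returns true, the core could abandon the outstanding GPU work and restart from its last checkpoint instead of waiting indefinitely. Choosing the timeout value is left open here, since a legitimately slow kernel must not be mistaken for a lost one.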

@jcoffland
Member Author

Comment by @jcoffland
Which cores is this a problem with?

@jcoffland
Member Author

Comment by @PantherX
The currently running GPU FahCores are:
FahCores_11 -> Going EOL
FahCores_15 -> Going EOL
FahCores_16 -> Going EOL
FahCores_17

FahCore_18 has been discussed but nothing has surfaced.

@jcoffland
Member Author

Comment by @jcoffland
So really if we fixed 0x17 then the problem would be mostly solved.

@jcoffland
Member Author

Comment by @PantherX
Almost. 0x17 is certain, but 0x15 is still going to be around for some time since there are Projects which are still using it.

0x11 is too old and 0x16 is virtually not assigned to any AMD GPUs unless there is a lack of 0x17 WUs.

@jcoffland
Member Author

Comment by @bb30994
When sleep successfully powers off a GPU the data being calculated is not backed up and consequently is lost. When un-sleep powers on both the GPU and the CPU and the FahCore resumes operation, it is normally unaware of either change of state so it's still waiting for the results of the processing packets that are no longer active in the GPU.

Having the FahCore keep track of all work packets assigned to the GPU and re-send lost packets would be a valid solution, but, just like replacing UDP with TCP, it would increase overhead.

If the FahCore is notified of the two changes of state, it can most likely restore the previous checkpoint and resume processing from that point. This was one of the identified limitations of the old screensaver client. If the client pauses / resumes / sleeps / awakens frequently without relatively long periods of activity between interruptions, none of the work being processed may be checkpointed for extended periods of time.

I vote for a function that can re-transmit all uncompleted work packets like TCP would do, which will resume processing with a minimum of repeated processing provided it can be guaranteed to perform that function without introducing new bugs.
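
As a rough illustration of the re-transmit idea, the sketch below tracks every work packet that has been sent but not yet completed and re-sends the remainder on request. Everything in it is hypothetical: the `InFlightTracker` class, the string payloads, and the `SendFn` callback standing in for whatever actually submits work to the GPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical tracker for work packets that were sent to the GPU
// but whose results have not come back yet.
class InFlightTracker {
public:
  using SendFn =
      std::function<void(std::uint64_t id, const std::string &payload)>;

  explicit InFlightTracker(SendFn send) : send_(std::move(send)) {}

  // Remember the packet, send it, and return its id.
  std::uint64_t submit(const std::string &payload) {
    const std::uint64_t id = nextId_++;
    pending_[id] = payload;
    send_(id, payload);
    return id;
  }

  // Called when the GPU returns the result for packet `id`.
  void complete(std::uint64_t id) { pending_.erase(id); }

  // Called after wake-up: re-send everything still outstanding.
  // Returns the number of packets that were re-sent.
  std::size_t resendPending() {
    for (const auto &entry : pending_) send_(entry.first, entry.second);
    return pending_.size();
  }

  std::size_t pendingCount() const { return pending_.size(); }

private:
  SendFn send_;
  std::map<std::uint64_t, std::string> pending_;
  std::uint64_t nextId_ = 1;
};
```

Keeping a copy of every outstanding payload is the overhead mentioned above: memory and bookkeeping are spent on every packet so that only the lost ones need to be repeated.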

@jcoffland
Member Author

Comment by @bb30994
Note that data displayed in the GUI will also be lost when the GPU's memory enters sleep (power off) but all graphic data can be recovered by refreshing the screen after an un-sleep. Presumably the un-sleep notification can be trapped by the FahCore(s) and any non-graphical data that have been sent to the GPU can be refreshed, too (if that's the way the OS handles it).

@jcoffland added the labels "1.Type - Enhancement" and "0.Status - More Information" on Apr 3, 2015
@jcoffland self-assigned this on Apr 3, 2015
@bb30994

bb30994 commented Apr 8, 2017

Has the problem been solved either in Core_17/_18/_21/_22 or in FAHClient 7.4.xx?

The FAHClient setting disable-sleep-when-active was one way to avoid this issue but donors complain that FAH prevents their system from sleeping. If this issue was closed because a better solution was found, then it makes sense to change the default of disable-sleep-when-active to false.

I have not seen any recent reports of WUs being corrupted by a system sleep, but that doesn't prove whether loss of GPU power causes the loss of data from active kernels.

Maybe the FahCore could suspend issuing new kernels and wait long enough for them all to be returned to main RAM (which will be saved by the OS), but "long enough" is difficult to define. It's not really necessary to go back to the previous checkpoint as long as all of the data required to resume work is now in main RAM.
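
One way to avoid guessing at "long enough" would be to count outstanding kernels instead of waiting for a fixed time. The sketch below is an assumption, not how any FahCore works: the `KernelGate` class is hypothetical, it is single-threaded for brevity, and a real core would need atomic counters or a lock.

```cpp
#include <cstddef>

// Hypothetical gate that stops new kernel launches before sleep and
// reports when all previously launched kernels have returned.
class KernelGate {
public:
  // Returns false (and launches nothing) while a drain is in progress.
  bool tryLaunch() {
    if (draining_) return false;
    ++inFlight_;
    return true;
  }

  // Called when a kernel's results are back in main RAM.
  void onKernelDone() {
    if (inFlight_ > 0) --inFlight_;
  }

  void beginDrain() { draining_ = true; }  // on the sleep notification
  void endDrain() { draining_ = false; }   // on the wake notification

  // True once a drain was requested and nothing is left on the GPU.
  bool drained() const { return draining_ && inFlight_ == 0; }

  std::size_t inFlight() const { return inFlight_; }

private:
  bool draining_ = false;
  std::size_t inFlight_ = 0;
};
```

Whether the drain can finish inside the delay the OS allows before sleeping is exactly the open question raised earlier in this thread, so a fallback to the last checkpoint would still be needed when `drained()` does not become true in time.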
