CompilationError when using spatial parallel #214
This is a really annoying bug that I have also seen, but I have struggled to create a reliable MFE for it. It's been the bane of our CI since the new year.

If you run the script a second time, does it a) run and b) get the right answer? If so, then until this is fixed you can hopefully just rerun the script and it should work. If you're running on HPC, I'd recommend that for any script with a lot of new forms you first run it on a very small number of cores (ideally 1 core if possible), and only then run on the number of cores you actually want to use. Be aware that, because of the way we build the forms for the all-at-once system internally, changing the number of timesteps on each ensemble rank will cause new forms to be generated, even if you're solving the same equation.

I am really hoping that firedrakeproject/firedrake#3989 will fix this for us! If it doesn't, I'll come back and have a go at creating a simple and reliable MFE.
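To illustrate the last point, here is a minimal sketch (not the actual asQ internals; the function space and residual terms are made up for illustration) of why the number of timesteps owned by an ensemble rank changes the UFL form, and therefore the generated kernels:

```python
# Hedged sketch: each ensemble rank assembles a residual contribution for every
# timestep in its local slice of the all-at-once system, so the UFL form (and
# hence the code generated from it) depends on how many timesteps the rank owns.
from firedrake import *

mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, "CG", 1)

def local_all_at_once_form(nlocal_steps):
    v = TestFunction(V)
    us = [Function(V) for _ in range(nlocal_steps)]
    F = inner(us[0], v) * dx
    for i in range(1, nlocal_steps):
        F = F + inner(us[i], v) * dx  # one term per locally owned timestep
    return F

F2 = local_all_at_once_form(2)  # 2 timesteps on this rank
F4 = local_all_at_once_form(4)  # a different time partition gives a different form
```

So even "the same equation" can trigger fresh code generation if the time partition changes.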
Yes, it seems to be running with spatial parallel the second time around! |
Actually, I take it back. |

To improve performance, my PR now generates code on only one rank and broadcasts the resulting string. This means we shouldn't hit this crash any longer (though it is only masking the underlying non-deterministic issue). |
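For context, here is a minimal sketch of the "generate on one rank, broadcast the string" approach described above (this is not the actual Firedrake/PyOP2 code, and `generate_kernel_source` is a made-up stand-in):

```python
# Hedged sketch: only rank 0 runs the (expensive, possibly non-deterministic)
# code generation; every other rank receives the identical string, so the
# ranks can no longer disagree about the kernel source they compile.
from mpi4py import MPI

comm = MPI.COMM_WORLD

def generate_kernel_source():
    # stand-in for the real code generation step
    return "void kernel(double *x) { x[0] += 1.0; }"

code = generate_kernel_source() if comm.rank == 0 else None
code = comm.bcast(code, root=0)  # all ranks now hold exactly the same string
# ... every rank then compiles and loads the identical source ...
```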
Thanks for having a look at this @connorjward. Glad you agree it should fix this issue, at least for asQ (even if the underlying cause is still there). Do you think that firedrakeproject/firedrake#4002 might be the cause of the non-determinism if the non-hashability is producing unstable cache keys? |
I don't think so. The changes there won't have any impact on the generated code. Looking at the example in the original report, I see that it uses basically all of the wacky preconditioners we have. I can easily imagine that somewhere in those we do things that have poor test coverage in parallel.
Shame, but I thought it would be a long shot.
I think it is because our forms get quite complicated, since we wrap everything in VFSs to mimic complex numbers.
Plus, you use ensemble a lot, which breaks certain assumptions about per-process numbering of coefficients, etc. |
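For readers unfamiliar with the VFS trick mentioned above, here is a minimal sketch (not asQ's actual implementation; the coefficient values and spaces are placeholders) of how a complex-valued unknown can be mimicked with a 2-component real VectorFunctionSpace:

```python
# Hedged sketch: store the (real, imag) parts in a dim=2 VectorFunctionSpace and
# write complex multiplication out in real arithmetic, which makes the forms
# noticeably more complicated than their scalar counterparts.
from firedrake import *

mesh = UnitSquareMesh(8, 8)
V = VectorFunctionSpace(mesh, "CG", 1, dim=2)  # components: (real, imag)
u = TrialFunction(V)
v = TestFunction(V)

a_re, a_im = 2.0, 0.5  # the complex coefficient a_re + i*a_im
# (a_re + i*a_im)*(u_re + i*u_im), written out component-wise:
a = (a_re * u[0] - a_im * u[1]) * v[0] * dx \
  + (a_im * u[0] + a_re * u[1]) * v[1] * dx
```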
@JHopeCollins Could you post that simple wave equation example? |
Running this code:
https://github.com/colinjcotter/sketchpad/blob/rk4/averaging/disc_aver_sw_paradiag.py
using parallelism in both space and averaging gives
CompilationError(f"Generated code differs across ranks (see output in {output})")
Mismatching kernels are found in mismatching-kernels.tgz.
The full error output: lev3-dt1-alpha05-advection-8570494.txt
The shell script used: asQ-lev3-dt1-alpha05-advection.txt
The error disappears when spatial parallel is turned off.
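For reference, a minimal sketch of the kind of space-plus-averaging parallel layout described above (this is not the actual script; `spatial_size` and the mesh are placeholder values):

```python
# Hedged sketch: a Firedrake Ensemble splits COMM_WORLD into a spatial
# communicator (domain decomposition within each ensemble member) and an
# ensemble communicator (here, the averaging direction). The crash only
# appears when the spatial communicator has more than one rank.
from firedrake import *

spatial_size = 2  # ranks per spatial communicator (placeholder)
ensemble = Ensemble(COMM_WORLD, spatial_size)

mesh = UnitSquareMesh(16, 16, comm=ensemble.comm)  # spatial parallelism
# ensemble.ensemble_comm handles communication in the averaging direction
```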