
CompilationError when using spatial parallel #214

Open
hiroe- opened this issue on Jan 29, 2025 · 9 comments

Labels: bug (Something isn't working), Core functionality (Adding to the main paradiag functionality), upstream (Issue related to upstream dependencies)

Comments

hiroe- commented Jan 29, 2025

Running this code:

https://github.com/colinjcotter/sketchpad/blob/rk4/averaging/disc_aver_sw_paradiag.py

using parallelism in both space and averaging raises CompilationError(f"Generated code differs across ranks (see output in {output})").

Mismatching kernels are found in
mismatching-kernels.tgz

The full error output:
lev3-dt1-alpha05-advection-8570494.txt

The shell script used:
asQ-lev3-dt1-alpha05-advection.txt

The error disappears when spatial parallelism is turned off.

hiroe- (Author) commented Jan 29, 2025

@colinjcotter

@JHopeCollins added the bug, Core functionality, and upstream labels on Jan 29, 2025
JHopeCollins (Member) commented

This is a really annoying bug that I have also seen, but I have struggled to create a reliable MFE (minimal failing example) for it. It's been the bane of our CI since the new year.

If you run the script a second time, does it a) run and b) get the right answer?
When I've looked at the mismatching code from my cases, the different kernels do exactly the same thing, but some variable names are swapped. Presumably, somewhere in the code-generation pipeline something that is correct but non-deterministic gets used, which mixes up the naming convention (e.g. the ordering of a Python set is not guaranteed to be identical between ranks, or across different invocations).
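As an illustration of that kind of non-determinism (a sketch only, not asQ or Firedrake code): any naming scheme built by iterating a set of strings can differ between interpreter invocations, because string hashing is randomised per process.

```python
# Illustrative sketch only -- not asQ/Firedrake code.
# Iterating a set of strings can give a different order in each interpreter
# invocation (hash randomisation), so any generated names derived from that
# order can differ between ranks or between runs.
coefficients = {"w0", "w1", "uh", "hh"}

# hypothetical "code generator" that names temporaries by iteration order
names = {coeff: f"t{i}" for i, coeff in enumerate(coefficients)}
print(names)  # compare two separate runs (or PYTHONHASHSEED=1 vs PYTHONHASHSEED=2)
```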

This hopefully means that, until this is fixed, you can just rerun the script and it should work. If you're running on HPC and the script generates a lot of new forms, I'd recommend running it first on a very small number of cores (ideally 1 core, if possible), then on the number of cores you actually want to use.
Even without this bug, running first on a small number of cores is best on ARCHER2 anyway because the file system is so slow, and currently all ranks try to write the generated code to disk.

Be aware that, because of the way we build the forms for the all-at-once system internally, if you change the number of timesteps on each ensemble rank then that will cause new forms to be generated, even if you're solving the same equation.

I am really hoping that firedrakeproject/firedrake#3989 will fix this for us! If it doesn't then I'll come back and have a go at creating a simple and reliable MFE.

hiroe- (Author) commented Jan 31, 2025

Yes, it seems to run with spatial parallelism the second time around!

connorjward commented Jan 31, 2025

> I am really hoping that firedrakeproject/firedrake#3989 will fix this for us! If it doesn't then I'll come back and have a go at creating a simple and reliable MFE.

Looking at the diffs between the different ranks, it is clear that the issues are with TSFC/loopy, so they won't be fixed in my PR. An MFE is certainly welcome.

That said, it may still be useful to set PYOP2_SPMD_STRICT=1 while debugging and to use my branch (connorjward/more-cache-fixes).
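For reference, one way to do that from inside a driver script (exporting the variable in the shell or job script is equivalent; the only assumption here is that it must be in the environment before firedrake/pyop2 is imported):

```python
# Sketch: enable PyOP2's strict SPMD checking for a debugging run.
# Assumes the variable only needs to be set before the firedrake/pyop2 import;
# exporting PYOP2_SPMD_STRICT=1 in the shell or job script is equivalent.
import os
os.environ["PYOP2_SPMD_STRICT"] = "1"

from firedrake import *  # noqa: F401,F403  -- import after setting the environment
```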

Actually, I take that back. To improve performance, my PR now generates code on only one rank and broadcasts the resulting string. This means we shouldn't see this crash any longer (though it only masks the non-deterministic issue).
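For anyone unfamiliar with the pattern, here is a rough sketch of the "generate once, broadcast" idea (illustrative only, not the actual Firedrake/PyOP2 implementation):

```python
# Rough sketch of the "generate on one rank and broadcast" pattern
# (illustrative only -- not the actual Firedrake/PyOP2 implementation).
from mpi4py import MPI

comm = MPI.COMM_WORLD


def generate_kernel_code():
    # hypothetical stand-in for the (possibly non-deterministic) code generator
    return "void kernel(double *A) { /* ... */ }"


# Only rank 0 generates the source; every rank then compiles the same string,
# so rank-to-rank differences in code generation can no longer trigger the
# "Generated code differs across ranks" check.
code = generate_kernel_code() if comm.rank == 0 else None
code = comm.bcast(code, root=0)
```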

JHopeCollins (Member) commented

Thanks for having a look at this @connorjward. Glad you agree it should fix this issue, at least for asQ (even if the underlying cause is still there).

Do you think that firedrakeproject/firedrake#4002 might be the cause of the non-determinism if the non-hashability is producing unstable cache keys?

connorjward commented

> Do you think that firedrakeproject/firedrake#4002 might be the cause of the non-determinism if the non-hashability is producing unstable cache keys?

I don't think so. The changes there won't have any impact on the generated code.

Looking at the example above I see that it uses basically all of the wacky preconditioners we have. I can easily imagine that somewhere in those we do things that have poor test coverage in parallel.

JHopeCollins (Member) commented

> I don't think so. The changes there won't have any impact on the generated code.

Shame, but I thought it was a long shot.

> Looking at the example above I see that it uses basically all of the wacky preconditioners we have. I can easily imagine that somewhere in those we do things that have poor test coverage in parallel.

I think it is because our forms get quite complicated: we wrap everything in VectorFunctionSpaces (VFS) to mimic complex numbers.
Hiroe's example has a lot going on, but I have also seen this bug just with LU for the wave equation. The function space for that is (VFS(BDM, dim=2) x VFS(DG, dim=2)), so even the basic equations end up with complicated forms.
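For context, a sketch of the kind of space meant here (assuming the standard Firedrake API), with the real and imaginary parts stored as the two components of each vector element:

```python
# Sketch (standard Firedrake API assumed): real/imaginary parts are stored as
# the two components of a vector element, so even a plain BDM x DG
# wave-equation space becomes vector-valued and the forms grow accordingly.
from firedrake import UnitSquareMesh, VectorFunctionSpace

mesh = UnitSquareMesh(16, 16)
V_u = VectorFunctionSpace(mesh, "BDM", 1, dim=2)  # two copies of BDM1: Re/Im of velocity
V_h = VectorFunctionSpace(mesh, "DG", 0, dim=2)   # two copies of DG0:  Re/Im of depth
W = V_u * V_h                                     # mixed space VFS(BDM) x VFS(DG)
```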

connorjward commented

Plus you use Ensemble a lot, which breaks certain assumptions about per-process numbering of coefficients etc.
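For readers following along, a rough sketch of the communicator splitting involved (assuming Firedrake's Ensemble API; the sizes are just example values):

```python
# Rough sketch (assuming Firedrake's Ensemble API): COMM_WORLD is split into a
# spatial communicator within each ensemble member and an ensemble communicator
# across members, so meshes and coefficients are numbered per spatial
# communicator rather than over COMM_WORLD.
from firedrake import COMM_WORLD, Ensemble, UnitSquareMesh

spatial_size = 2  # example: ranks per spatial communicator
ensemble = Ensemble(COMM_WORLD, spatial_size)

# the mesh (and every function space built on it) lives on the spatial communicator
mesh = UnitSquareMesh(16, 16, comm=ensemble.comm)
print(COMM_WORLD.rank, ensemble.comm.rank, ensemble.ensemble_comm.rank)
```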

ksagiyam commented

@JHopeCollins Could you post that simple wave equation example?
