
CompilationError when using spatial parallel #214

Open
hiroe- opened this issue on Jan 29, 2025 · 9 comments

Labels: bug (Something isn't working), Core functionality (Adding to the main paradiag functionality), upstream (Issue related to upstream dependencies)

Comments

hiroe- commented Jan 29, 2025

Running this code:

https://github.com/colinjcotter/sketchpad/blob/rk4/averaging/disc_aver_sw_paradiag.py

using parallelism in both space and averaging raises CompilationError(f"Generated code differs across ranks (see output in {output})").

Mismatching kernels are found in
mismatching-kernels.tgz

The full error output:
lev3-dt1-alpha05-advection-8570494.txt

The shell script used:
asQ-lev3-dt1-alpha05-advection.txt

The error disappears when spatial parallelism is turned off.

hiroe- (Author) commented Jan 29, 2025

@colinjcotter

@JHopeCollins added the bug, Core functionality, and upstream labels on Jan 29, 2025
JHopeCollins (Member) commented

This is a really annoying bug that I have also seen, but I have struggled to create a reliable MFE (minimal failing example) for it. It's been the bane of our CI since the new year.

If you run the script a second time, does it a) run and b) get the right answer?
When I've looked at the mismatching code from my cases, the different kernels do exactly the same thing, but some variable names are swapped. Presumably, somewhere in the code-generation pipeline something that is correct but non-deterministic gets used, which mixes up the naming convention (e.g. the ordering of a Python set is not guaranteed to be identical between ranks, or across different invocations).
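As an illustration of that kind of non-determinism (a sketch only, not asQ or Firedrake code): any naming scheme built by iterating a set of strings can differ between interpreter invocations, because string hashing is randomised per process.

```python
# Illustrative sketch only -- not asQ/Firedrake code.
# Iterating a set of strings can give a different order in each interpreter
# invocation (hash randomisation), so any generated names derived from that
# order can differ between ranks or between runs.
coefficients = {"w0", "w1", "uh", "hh"}

# hypothetical "code generator" that names temporaries by iteration order
names = {coeff: f"t{i}" for i, coeff in enumerate(coefficients)}
print(names)  # compare two separate runs (or PYTHONHASHSEED=1 vs PYTHONHASHSEED=2)
```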

This hopefully means that, until this is fixed, you can just rerun the script and it should work. If you're running on HPC and the script generates a lot of new forms, I'd recommend running it first on a very small number of cores (ideally 1 core, if possible), then on the number of cores you actually want to use.
Even without this bug, running first on a small number of cores is best on ARCHER2 anyway because the file system is so slow, and currently all ranks try to write the generated code to disk.

Be aware that, because of the way we build the forms for the all-at-once system internally, if you change the number of timesteps on each ensemble rank then that will cause new forms to be generated, even if you're solving the same equation.

I am really hoping that firedrakeproject/firedrake#3989 will fix this for us! If it doesn't then I'll come back and have a go at creating a simple and reliable MFE.

hiroe- (Author) commented Jan 31, 2025

Yes, it seems to run with spatial parallelism the second time around!

connorjward commented Jan 31, 2025

> I am really hoping that firedrakeproject/firedrake#3989 will fix this for us! If it doesn't then I'll come back and have a go at creating a simple and reliable MFE.

Looking at the diffs between the different ranks, it is clear that the issues are with TSFC/loopy, so they won't be fixed in my PR. An MFE is certainly welcome.

That said, it may still be useful to set PYOP2_SPMD_STRICT=1 while debugging and to use my branch (connorjward/more-cache-fixes).
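For reference, one way to do that from inside a driver script (exporting the variable in the shell or job script is equivalent; the only assumption here is that it must be in the environment before firedrake/pyop2 is imported):

```python
# Sketch: enable PyOP2's strict SPMD checking for a debugging run.
# Assumes the variable only needs to be set before the firedrake/pyop2 import;
# exporting PYOP2_SPMD_STRICT=1 in the shell or job script is equivalent.
import os
os.environ["PYOP2_SPMD_STRICT"] = "1"

from firedrake import *  # noqa: F401,F403  -- import after setting the environment
```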

Actually, I take that back. To improve performance, my PR now generates code on only one rank and broadcasts the resulting string. This means we shouldn't see this crash any longer (though it only masks the non-deterministic issue).
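For anyone unfamiliar with the pattern, here is a rough sketch of the "generate once, broadcast" idea (illustrative only, not the actual Firedrake/PyOP2 implementation):

```python
# Rough sketch of the "generate on one rank and broadcast" pattern
# (illustrative only -- not the actual Firedrake/PyOP2 implementation).
from mpi4py import MPI

comm = MPI.COMM_WORLD


def generate_kernel_code():
    # hypothetical stand-in for the (possibly non-deterministic) code generator
    return "void kernel(double *A) { /* ... */ }"


# Only rank 0 generates the source; every rank then compiles the same string,
# so rank-to-rank differences in code generation can no longer trigger the
# "Generated code differs across ranks" check.
code = generate_kernel_code() if comm.rank == 0 else None
code = comm.bcast(code, root=0)
```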

JHopeCollins (Member) commented

Thanks for having a look at this @connorjward. Glad you agree it should fix this issue, at least for asQ (even if the underlying cause is still there).

Do you think that firedrakeproject/firedrake#4002 might be the cause of the non-determinism if the non-hashability is producing unstable cache keys?

connorjward commented

> Do you think that firedrakeproject/firedrake#4002 might be the cause of the non-determinism if the non-hashability is producing unstable cache keys?

I don't think so. The changes there won't have any impact on the generated code.

Looking at the example above I see that it uses basically all of the wacky preconditioners we have. I can easily imagine that somewhere in those we do things that have poor test coverage in parallel.

JHopeCollins (Member) commented

> I don't think so. The changes there won't have any impact on the generated code.

Shame, but I thought it was a long shot.

> Looking at the example above I see that it uses basically all of the wacky preconditioners we have. I can easily imagine that somewhere in those we do things that have poor test coverage in parallel.

I think it is because our forms get quite complicated: we wrap everything in VectorFunctionSpaces (VFS) to mimic complex numbers.
Hiroe's example has a lot going on, but I have also seen this bug just with LU for the wave equation. The function space for that is (VFS(BDM, dim=2) x VFS(DG, dim=2)), so even the basic equations end up with complicated forms.
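For context, a sketch of the kind of space meant here (assuming the standard Firedrake API), with the real and imaginary parts stored as the two components of each vector element:

```python
# Sketch (standard Firedrake API assumed): real/imaginary parts are stored as
# the two components of a vector element, so even a plain BDM x DG
# wave-equation space becomes vector-valued and the forms grow accordingly.
from firedrake import UnitSquareMesh, VectorFunctionSpace

mesh = UnitSquareMesh(16, 16)
V_u = VectorFunctionSpace(mesh, "BDM", 1, dim=2)  # two copies of BDM1: Re/Im of velocity
V_h = VectorFunctionSpace(mesh, "DG", 0, dim=2)   # two copies of DG0:  Re/Im of depth
W = V_u * V_h                                     # mixed space VFS(BDM) x VFS(DG)
```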

connorjward commented

Plus you use Ensemble a lot, which breaks certain assumptions about per-process numbering of coefficients etc.
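For readers following along, a rough sketch of the communicator splitting involved (assuming Firedrake's Ensemble API; the sizes are just example values):

```python
# Rough sketch (assuming Firedrake's Ensemble API): COMM_WORLD is split into a
# spatial communicator within each ensemble member and an ensemble communicator
# across members, so meshes and coefficients are numbered per spatial
# communicator rather than over COMM_WORLD.
from firedrake import COMM_WORLD, Ensemble, UnitSquareMesh

spatial_size = 2  # example: ranks per spatial communicator
ensemble = Ensemble(COMM_WORLD, spatial_size)

# the mesh (and every function space built on it) lives on the spatial communicator
mesh = UnitSquareMesh(16, 16, comm=ensemble.comm)
print(COMM_WORLD.rank, ensemble.comm.rank, ensemble.ensemble_comm.rank)
```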

ksagiyam commented

@JHopeCollins Could you post that simple wave equation example?
