Performance optimisation

Load balancing

GRChombo’s efficiency when running on a large number of distributed-memory nodes is highly dependent on good load balancing of the available computational work across those nodes. Load balancing seeks to avoid the situation where most of the nodes are waiting for some small subset of nodes to finish their computational work, and it does this by seeking to distribute the amount of work to be done per time step evenly among all of the nodes.

Some of the load balancing work is done by Chombo automatically. However, adjusting the input parameters will change how the main grids are split into boxes to be shared between processors, and understanding this and adapting the setup is crucial for the load balancing process.

For example, if the coarsest level consists of 64^3 grid cells and you set the maximum box size (max_grid_size) to 64, then GRChombo will not subdivide the coarsest grid; a single box will cover the entire domain. Only one process can then work on this level, even if you run the code on 64 cores. If instead you set the maximum box size to 16, the grid will be divided into (64/16)^3 = 64 boxes. Running on 64 cores, every process should then get one box, and the problem will use the resources efficiently.
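The box count in this example can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming a cubic level that is cut evenly along each side; the helper name `boxes_on_level` is hypothetical and is not part of GRChombo or Chombo, whose actual box generation also depends on other parameters.

```python
import math

def boxes_on_level(cells_per_side, max_grid_size, dims=3):
    """Number of boxes if each side is cut into pieces no larger than max_grid_size.

    Simplified model: assumes a cubic level split evenly along every direction.
    """
    pieces_per_side = math.ceil(cells_per_side / max_grid_size)
    return pieces_per_side ** dims

# Coarsest level of 64^3 cells, as in the example above.
for max_grid_size in (64, 32, 16):
    print(max_grid_size, boxes_on_level(64, max_grid_size))
```

With max_grid_size set to 64, 32 and 16, this gives 1, 8 and 64 boxes respectively.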

Consider also that running on 63 cores means that one process must handle two boxes, possibly taking twice as long to complete as the others. Since the processes must synchronise after each step, most will wait idly for this one to finish. So there is in principle no gain from using 63 cores rather than just 32, where every process handles two boxes.

(Note that you might run on more cores because you need the additional memory, but you should still adjust the number of processes to match the load.)
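The 63-versus-32 core argument can be made concrete with a toy cost model. The sketch below assumes every box costs the same, that the time per step is set by the busiest process, and that communication costs are ignored; `step_time` and `efficiency` are hypothetical helper names used only for illustration.

```python
import math

def step_time(num_boxes, num_procs, time_per_box=1.0):
    """Time per step, set by the process holding the most boxes."""
    busiest = math.ceil(num_boxes / num_procs)
    return busiest * time_per_box

def efficiency(num_boxes, num_procs):
    """Fraction of the allocated core time spent doing useful work."""
    return num_boxes / (num_procs * math.ceil(num_boxes / num_procs))

# 64 equal boxes, as in the example above.
for procs in (64, 63, 32):
    print(procs, step_time(64, procs), round(efficiency(64, procs), 3))
```

In this model 63 and 32 processes take the same time per step, but with 63 processes roughly half of the allocated core time is spent waiting.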

Of course this calculation is much more complicated on the more refined levels, where the number of boxes cannot necessarily be predicted ahead of time, but a bit of trial and error can still result in a big improvement. Note that the number of boxes on each process at each level is output in the pout.x files, so it is relatively easy to check how well load balanced you are by running just a few steps.

Note that load balancing the finest levels is much more important than balancing the coarsest ones, since each finer level runs twice as often as the next coarser one.
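To see why the finest levels dominate, the sketch below weights each level by how often it is advanced. It assumes that each level runs twice as often as the next coarser one, as stated above, and uses made-up box counts (64 boxes on each of four levels) purely for illustration; `work_share` is a hypothetical helper.

```python
def work_share(boxes_per_level, ref_ratio=2):
    """Share of box updates on each level per step of the coarsest level."""
    # Level l is advanced ref_ratio**l times for every coarsest-level step.
    work = [n * ref_ratio ** level for level, n in enumerate(boxes_per_level)]
    total = sum(work)
    return [w / total for w in work]

# Hypothetical hierarchy: the same number of boxes on every level.
shares = work_share([64, 64, 64, 64])
for level, share in enumerate(shares):
    print(f"level {level}: {share:.1%} of box updates")
```

Even with equal box counts, more than half of the box updates in this example happen on the finest level, so an imbalance there costs the most.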

There is also a minimum box size (block_factor), which we usually set to be equal (at least roughly) to the maximum box size, since this means that all the boxes are roughly the same size. Having one box per process should then mean roughly equal amounts of work. Below about block_factor=8, the costs of subdividing the grid start to outweigh the benefits of sharing the work (each box has 3 additional ghost cells on each side, so the ghost cell load becomes comparable to the main calculation load).
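The ghost cell overhead can be estimated by counting cells. The sketch below assumes cubic boxes with 3 ghost cells on each side, as described above, and compares the number of ghost cells to the number of interior cells. It counts cells only and says nothing about the relative cost of filling a ghost cell versus evolving an interior one; `ghost_overhead` is a hypothetical helper.

```python
def ghost_overhead(box_size, num_ghosts=3, dims=3):
    """Ratio of ghost cells to interior cells for a cubic box."""
    interior = box_size ** dims
    total = (box_size + 2 * num_ghosts) ** dims
    return (total - interior) / interior

for box_size in (8, 16, 32, 64):
    print(box_size, round(ghost_overhead(box_size), 2))
```

The ratio falls steadily as the box size grows: an 8^3 box carries several times more ghost cells than interior cells, while a 64^3 box carries well under half as many.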

Other tips for optimisation

Compile with DEBUG=FALSE and OPT=TRUE.
Sometimes running with a few spare processes can give a big speed-up, rather than running with exactly 1 box per process.
Look at the jobscripts in Jobscript tips and examples for optimisation ideas when running.
One can also compile with OPT=HIGH, which offers a significant speed-up as it turns off the CH_assert() functions that check conditions are met, and initialises values to zero rather than to a large number. Note that this is not a good idea when debugging code, as it can lead to memory errors (for example, if one accesses data outside the current box) rather than ordered exits with error messages. But it should be used during production runs once the code is stable.
Adjust fill_ratio to make regridding more or less aggressive: a high fill ratio will try to fit the tagged regions exactly, while a low fill ratio will allow boxes to be larger and contain untagged cells.
Minimise input/output: only write checkpoint files as often as you need them, and consider using plot files if you only need to view a few quantities.
Get advice from your cluster administrator on compile flags for this type of program.
There is a basic timing function in Chombo which can be turned on by adding CH_TIMER_REPORT(); after gr_amr.conclude(); (found in the Main_XXX.cpp file). Before you run, set export CH_TIMER=TRUE (you may need to do this in the jobscript in order to pass the environment variable to all processes). Files called time.table.* should then appear alongside the pout.* files, giving details of the time spent in different functions. Note that these are only written if the simulation runs to completion, so you may need to reduce the number of timesteps to see the output.
A useful pdf guide on this topic from the latest GRChombo training day can be found in Useful resources.
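As an illustration of the fill_ratio tip above, the sketch below computes the fraction of tagged cells inside a candidate box for a small 2D example. It assumes that the fill ratio of a box is the number of tagged cells divided by the total number of cells in the box. It does not reproduce Chombo's box generation algorithm, and the `fill_ratio` function here is a hypothetical helper, not the Chombo parameter itself.

```python
def fill_ratio(tagged, lo, hi):
    """Fraction of cells in the 2D box with inclusive corners lo and hi that are tagged."""
    (x0, y0), (x1, y1) = lo, hi
    total = (x1 - x0 + 1) * (y1 - y0 + 1)
    inside = sum(1 for (x, y) in tagged if x0 <= x <= x1 and y0 <= y <= y1)
    return inside / total

# L-shaped tagged region: a 4x8 bar plus a 4x4 block next to its lower half.
tagged = {(x, y) for x in range(4) for y in range(8)}
tagged |= {(x, y) for x in range(4, 8) for y in range(4)}

# One box around everything includes untagged cells.
print(fill_ratio(tagged, (0, 0), (7, 7)))

# Two tighter boxes contain only tagged cells.
print(fill_ratio(tagged, (0, 0), (3, 7)), fill_ratio(tagged, (4, 0), (7, 3)))
```

In this example a single box would satisfy a requested fill ratio of 0.7, whereas a requested value of 0.9 would require the region to be split into the two tighter boxes.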
