
Realm performance worse when threads are allocated correctly #1267

Open
TimothyGu opened this issue May 23, 2022 · 1 comment

TimothyGu commented May 23, 2022

Hi @streichler,

The overall setting is the same as in #1266.

With -ll:ocpu 1 -ll:othr 9 -ll:util 1 on Sapling, the Realm runtime prints many warnings:

[0 - 7f3af9792d00]    0.000178 {4}{threads}: reservation ('dedicated worker (generic) #2') cannot be satisfied
[1 - 7f982b32dd00]    0.000174 {4}{threads}: reservation ('utility proc 1d00010000000000') cannot be satisfied
[2 - 7f5e4563bd00]    0.000217 {4}{threads}: reservation ('dedicated worker (generic) #2') cannot be satisfied
...

@rohany and I tried a few different ways to allow Realm to allocate threads correctly. Compare:

# othr=9 util=1 (implicitly, cpu=1) -- many "reservation cannot be satisfied" warnings
$ mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
    -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0 \
    -lg:prof 8 -lg:prof_logfile prof99-socket-%.log.gz

# othr=8 util=1 (implicitly, cpu=1) -- no warnings
$ mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
    -ll:ocpu 1 -ll:othr 8 -ll:util 1 -ll:nsize 10G -ll:ncsize 0 \
    -lg:prof 8 -lg:prof_logfile prof99-socket-othr8-%.log.gz

# othr=9 util=1 cpu=0 -- no warnings
$ mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
    -ll:ocpu 1 -ll:othr 9 -ll:cpu 0 -ll:util 1 -ll:nsize 10G -ll:ncsize 0 \
    -lg:prof 8 -lg:prof_logfile prof99-socket-cpu0-%.log.gz

We confirmed through -ll:show_rsrv that the latter two commands reserve threads correctly. However, both of them perform worse than the original configuration. Profiles are available:

It makes sense at some level that -ll:othr 8 performs worse than -ll:othr 9: we are taking away a whole core from the computation. (Indeed, the average leaf computation time increases from about 230 ms to 260 ms.) But it's not clear to us why -ll:cpu 0 doesn't help, and in fact hinders, the performance. From the original profile, the CPU processors don't seem to be doing much work anyway, and with -ll:cpu 0 the leaf computation time is about the same as the original (230 ms), but the compute graph is a lot more ragged than without -ll:cpu 0.
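The pattern of which configurations warn can be sketched with a simple thread count. This is a hedged sketch: the figure of 10 usable cores per socket is an assumption inferred from which runs print "reservation cannot be satisfied", not a confirmed fact about Sapling, and it ignores any internal Realm worker threads.

```shell
# Hypothetical core accounting per socket (with -ll:ocpu 1 throughout).
# Assumption: 10 usable cores per socket, inferred from which runs warn.
cores_per_socket=10
for cfg in "9 1 1" "8 1 1" "9 1 0"; do
  set -- $cfg   # $1=othr, $2=util, $3=cpu
  total=$(( 1 * $1 + $2 + $3 ))
  if [ "$total" -le "$cores_per_socket" ]; then
    status=fits
  else
    status=over-subscribed
  fi
  echo "othr=$1 util=$2 cpu=$3 -> $total threads ($status)"
done
```

Under that assumption, othr=9 util=1 cpu=1 asks for 11 pinned threads on 10 cores, while the other two configurations ask for exactly 10, matching the observed warnings.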

@manopapad (Contributor) commented:

My hunch here was that you're only leaving 1 core for all Legion/Realm meta-work, as explained in #1266 (comment).

However, @TimothyGu's results from #1266 (comment) do not support this. It might still be useful to look at profiles for the runs in this comment.

Also, since your experience has been that other OpenMP libraries are better behaved than TBLIS, it would be a good idea to verify whether this behavior occurs with those other libraries as well.

Finally, it would be interesting to see what happens when you repeat these experiments:

  • othr=9 util=1 (implicitly, cpu=1)
  • othr=8 util=1 (implicitly, cpu=1)
  • othr=9 util=1 cpu=0

with REALM_SYNTHETIC_CORE_MAP="", which disables Realm's thread pinning.
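For concreteness, a re-run of the first configuration with pinning disabled might look like the following. This is a sketch based on the commands earlier in the issue; the `-x` flag (which exports an environment variable to all ranks under Open MPI) is an assumption about the MPI launcher in use, and the profiling flags are dropped here for brevity.

```shell
# Sketch: repeat the othr=9 util=1 run with Realm thread pinning disabled.
# -x REALM_SYNTHETIC_CORE_MAP propagates the (empty) variable to all ranks
# under Open MPI; adjust for other launchers.
REALM_SYNTHETIC_CORE_MAP="" mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket \
    -x REALM_SYNTHETIC_CORE_MAP \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
    -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
```

The same prefix applies unchanged to the othr=8 and cpu=0 variants.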
