My hunch here was that you're only leaving 1 core for all Legion/Realm meta-work, as explained in #1266 (comment).
However, @TimothyGu's results from #1266 (comment) do not support this. It might still be useful to look at profiles for the runs in this comment.
Also, since your experience has been that other OpenMP libraries are better behaved than TBLIS, it would be a good idea to verify that this behavior occurs with those other libraries as well.
Finally, it would be interesting to see what happens when you repeat these experiments:
othr=9 util=1 (implicitly, cpu=1)
othr=8 util=1 (implicitly, cpu=1)
othr=9 util=1 cpu=0
with REALM_SYNTHETIC_CORE_MAP="", which disables Realm's thread pinning.
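One quick way to confirm that pinning really is disabled is to inspect the affinity mask of the launched process. This is a generic Linux check using Python's `os.sched_getaffinity`, not a Realm API; it's a sketch for sanity-checking the `REALM_SYNTHETIC_CORE_MAP=""` runs:

```python
import os

# On Linux, report which CPUs the current process is allowed to run on.
# With Realm's synthetic core map disabled, threads should inherit the
# full affinity mask from the launching shell instead of a pinned subset.
allowed = os.sched_getaffinity(0)
print(f"process may run on {len(allowed)} CPUs: {sorted(allowed)}")
```

Running the same check from inside a pinned worker (e.g. via a task that shells out) should show a much smaller mask when pinning is active.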
Hi @streichler,

The overall setting is the same as in #1266.

With `-ll:ocpu 1 -ll:othr 9 -ll:util 1` on Sapling, the Realm runtime prints many warnings. @rohany and I tried a few different ways to allow Realm to allocate threads correctly. Compare:

We confirmed through `-ll:show_rsrv` that the latter two commands reserve threads correctly. However, both of the latter two perform worse. Profiles are available:

- `/scratch2/tigu/taco/distal/build/prof99-socket-*.log.gz` – http://sapling.stanford.edu/~tigu/prof99-socket
- `/scratch2/tigu/taco/distal/build/prof99-socket-othr8-*.log.gz` – http://sapling.stanford.edu/~tigu/prof99-socket-othr8
- `/scratch2/tigu/taco/distal/build/prof99-socket-cpu0-*.log.gz` – http://sapling.stanford.edu/~tigu/prof99-socket-cpu0

It makes sense at some level that `-ll:othr 8` performs worse than `-ll:othr 9`: we are taking away a whole core from the computation. (Indeed, the leaf computation takes a bit longer on average, 230 ms vs. 260 ms.) But it's not clear to us why `-ll:cpu 0` doesn't help (in fact, it hinders) performance. From the original profile, the CPUs don't seem to be doing much work anyway, and the leaf computation time is around the same as the original (230 ms), but the compute graph is a lot more ragged than without `-ll:cpu 0`.
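The hunch about leaving only one core for Legion/Realm meta-work can be sketched as simple thread accounting. The figure of 10 cores per socket below is an assumption for illustration, not a number from this thread (check `lscpu` on Sapling for the real topology):

```python
# Back-of-the-envelope reserved-thread accounting per socket.
# ASSUMPTION: 10 physical cores per socket -- verify on the actual machine.
cores_per_socket = 10

# Reserved threads = OpenMP workers (-ll:othr) + utility (-ll:util)
# + CPU processors (-ll:cpu, which defaults to 1 when not given).
configs = {
    "-ll:othr 9 -ll:util 1 (implicit -ll:cpu 1)": 9 + 1 + 1,
    "-ll:othr 8 -ll:util 1 (implicit -ll:cpu 1)": 8 + 1 + 1,
    "-ll:othr 9 -ll:util 1 -ll:cpu 0":            9 + 1 + 0,
}

for flags, threads in configs.items():
    status = "over-subscribed" if threads > cores_per_socket else "fits"
    print(f"{flags}: {threads} threads -> {status}")
```

Under that assumption, only the first configuration over-subscribes the socket, which would be consistent with the warnings it produces while the other two reserve threads cleanly.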