Rank-per-node slower than rank-per-socket with OpenMP workload #1266
In particular, we notice that different invocations of the same task have non-trivial variance in runtime on the different sockets per node.
The common thread I see across this issue and #1267 is that you're seeing worse behavior in cases where you're only leaving 1 core for all Legion/Realm meta-work, i.e. Legion and Realm only have a single core to use for both the runtime analysis and for handling network transfers (note that sapling has 10 physical cores per socket, and by default Realm pins each of its threads to a dedicated core).
In the cases where you're seeing the "reservation cannot be satisfied" warnings, Realm proceeds without pinning threads. The OS is then free to time-slice between the computation threads and the meta-work threads, which lets network transfers make progress, so task inputs arrive sooner at remote nodes and the next iteration starts faster (although the individual leaf tasks might be slower, because they can now get context-switched out). Do you see a difference in performance if you do:
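(The exact command suggested above isn't shown here. As a hedged illustration of the kind of adjustment in question, Realm's core-allocation flags control how many cores go to OpenMP workers versus utility/meta-work; the flag names below are real Realm options, but the binary name and specific values are assumptions for illustration, not the original suggestion.)

```
# Illustrative only: one OpenMP processor with 8 worker threads per rank, plus
# 2 utility threads left free for Legion/Realm meta-work and network handling.
./my_app -ll:ocpu 1 -ll:othr 8 -ll:util 2
```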
@streichler is there a more direct metric that @TimothyGu and @rohany could collect that would confirm/deny my hypothesis that transfers are being delayed because the network threads are fighting over a single core with the util threads?
@TimothyGu correct me if I'm wrong, but I don't think the problem we are seeing is related to transfers -- the actual tasks themselves have widely varying runtimes when Realm is managing the threads. We expect there to be some jitter in starting computation due to incoming transfers, but after that we expect the task execution time itself to be deterministic. We're seeing situations where, at random, some task executions take 100 ms or more longer than others when Legion is managing multiple OpenMP processors.
Have you actually confirmed that TBLIS supports multiple OpenMP runtimes in the same process? I know for certain that OpenBLAS does not, and if you try to run it that way you'll see all sorts of weird and undefined behaviors, some of which looked exactly like this. Also, don't underestimate the stupidity of "thread-safe" libraries that take global locks and prevent all sorts of things from happening concurrently.
I have not -- I'll open an issue there and report back. I sort of assumed that it would, because cuNumeric uses it, and it seems like cuNumeric by default runs with a rank per node. TBLIS having some sort of global variables could explain this performance, but not #1267.
@manopapad will need to say for sure, but I think cuNumeric is quite careful about how we use external libraries that claim to support OpenMP. A lot of them seem to use global variables to manage OpenMP execution, which is not safe when there are multiple OpenMP runtimes in the same process.
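(As an illustration of the failure mode described above, not TBLIS's or any specific library's actual code, here is a minimal sketch of how process-wide global state breaks when two OpenMP groups of different sizes share one process.)

```cpp
// Hypothetical pattern, for illustration only.
#include <omp.h>
#include <cstdio>

static int g_num_threads = 0;  // one process-wide setting shared by every caller

void library_configure() {
    // Whichever OpenMP group calls this last wins; the other group's
    // configuration is silently overwritten.
    g_num_threads = omp_get_max_threads();
}

void library_compute() {
    // Work is partitioned assuming g_num_threads workers, which may not match
    // the calling group's actual core count.
    #pragma omp parallel num_threads(g_num_threads)
    {
        std::printf("thread %d of %d\n", omp_get_thread_num(), g_num_threads);
    }
}
```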
I had verified that TBLIS plays nicely with Realm's OpenMP implementation, and works correctly even with multiple OpenMP groups under the same process. I hadn't verified that the performance doesn't degrade with multiple OpenMP groups. However, judging from the above profiles, this doesn't seem to be the issue; the leaf tasks seem to be taking the same amount of time under both configurations.
@TimothyGu can you link the profiles for the case where we perform all of the communication up front, rather than in batches? I think that one had the variance between different task execution lengths showing up more prominently. Manolis is right that there doesn't seem to be much variance in the task lengths in these profiles.
Are you also mapping the instances that the tasks need to the NUMA domains associated with the OpenMP processors?
Yep |
I just tried
There's still a difference: the bind-to-socket version runs faster (1500 vs. 1900 ms). Here are the profiles @rohany was referring to, where we do all of the communication up front.
Here are my conclusions so far:
Thanks for your analysis Manolis, everything seems to point to TBLIS perhaps not supporting multiple OpenMP groups in the same process in an isolated manner. I've opened an issue asking about this (devinamatthews/tblis#47), and we'll see what the developers have to say. The reason I'm not inclined to believe this is a Realm issue is that other OpenMP libraries run fine without performance problems on a variety of different machines that I have access to with DISTAL, so the only variable here is TBLIS. I'll keep this open until I hear back from the TBLIS folks. Let me know what you think about #1267, as that now seems unrelated, given the analysis here.
TBLIS developer here. If I am understanding correctly, a) the comparison is between using one MPI rank + OpenMP per socket with thread binding to that socket, versus using two sort-of-"virtual" ranks per node with separate OpenMP runtimes and no thread binding, and b) the former approach is showing better performance? TBLIS uses a pool of dynamically-allocated memory blocks for packing tensor data during computation. New blocks are allocated when not enough are available (typically on the first contraction call), and then reused in later computations. Blocks are checked out of a linked-list structure which uses a global lock (could be an OpenMP lock or a spin-lock). In the latter case above, if threads are free to migrate across the sockets, then a thread running on socket 0 could check out a memory block created by a thread on socket 1, or vice versa. This will cause excessive memory traffic between sockets in a non-deterministic fashion. Perhaps this is what you are seeing?
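(To make the described pattern concrete, here is a minimal sketch with hypothetical names, not TBLIS's actual classes: a single process-wide pool guarded by one lock, where a thread on either socket can check out a block whose pages were first touched, and therefore physically allocated, on the other socket.)

```cpp
// Illustrative sketch only; names and structure are assumptions, not TBLIS code.
#include <mutex>
#include <vector>
#include <cstdlib>

struct Block { void* data; std::size_t size; };

class PackingPool {
    std::mutex lock_;                 // single global lock: every checkout serializes here
    std::vector<Block> free_blocks_;  // blocks may have been first-touched on either socket
public:
    Block acquire(std::size_t size) {
        std::lock_guard<std::mutex> g(lock_);
        if (!free_blocks_.empty()) {
            // Reuse: the block's pages may live in the other socket's memory,
            // causing cross-socket traffic during packing.
            Block b = free_blocks_.back();
            free_blocks_.pop_back();
            return b;
        }
        // Typically only the first contraction call ends up allocating.
        return Block{std::malloc(size), size};
    }
    void release(Block b) {
        std::lock_guard<std::mutex> g(lock_);
        free_blocks_.push_back(b);
    }
};

PackingPool g_pool;  // one pool shared by every OpenMP group in the process
```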
Thanks @devinamatthews! Your understanding is correct. I don't believe that threads can migrate between sockets in our runtime, but it definitely seems reasonable for there to be contention or NUMA effects if the free-list is shared between sockets. Is there some way we could create the free-list explicitly and point the tensordot calls at a specific free-list, rather than using an existing shared data structure?
Unfortunately no. Something like this will be available in a future version. You can possibly check if this is the problem by editing the TBLIS code: in |
Awesome, @TimothyGu can you try this when you get a chance? I looked at some of the code, and it does look somewhat difficult to tease out the memory pool, since the packing routines are templated over a particular pool. One thing that could be done for a Legion fork of TBLIS is to change line 163 of

If this is the problem, then I don't think it's too important for us to fix right now @TimothyGu; we can continue to run with a rank-per-socket. It might be more of a concern to cuNumeric though.
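(Building on the hypothetical PackingPool sketch above, and not the real TBLIS source, the kind of change being discussed would replace the single process-wide pool with one pool per NUMA domain or OpenMP group, so packing blocks are never reused across sockets.)

```cpp
// Sketch under the same assumptions as the PackingPool example above.
#include <array>

constexpr int kMaxNumaDomains = 8;            // illustrative upper bound
std::array<PackingPool, kMaxNumaDomains> g_pools;

// The caller (e.g. the task body, which knows its socket/NUMA binding) selects
// the pool, so blocks allocated on socket 0 never flow to socket 1.
PackingPool& pool_for(int numa_domain) {
    return g_pools[numa_domain % kMaxNumaDomains];
}
```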
Thanks @devinamatthews, I think that solved the performance issues I was seeing! Since this appears to be a TBLIS-specific issue, I'm closing this ticket.
Hi @streichler, I've been working with @rohany on running some experiments with DISTAL on Sapling. We've been using the TBLIS tensor contraction library, which internally uses OpenMP, but have been seeing some strange performance behavior.
Compare:
The profiles are available at
/scratch2/tigu/taco/distal/build/prof99-socket-*.log.gz
– http://sapling.stanford.edu/~tigu/prof99-socket
/scratch2/tigu/taco/distal/build/prof99-node-*.log.gz
– http://sapling.stanford.edu/~tigu/prof99-node

It seems clear that the per-socket version is using processors much more efficiently, but it's not clear to us why. On a single node, TBLIS/OpenMP is much (2×) faster than using mpirun to bind computation to each socket (using n=70 for comparable problem size).

Thanks in advance for your help!
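(For reference, a sketch of the two launch styles being compared, assuming Open MPI; the binary name and arguments are placeholders, not the actual DISTAL command used here.)

```
# Rank-per-socket: one rank per socket, threads bound to that socket.
mpirun -n 2 --map-by socket --bind-to socket ./app ...
# Rank-per-node: a single rank per node, no socket binding, with Realm managing
# both OpenMP processor groups inside the one process.
mpirun -n 1 --map-by node --bind-to none ./app ...
```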