Rank-per-node slower than rank-per-socket with OpenMP workload #1266

Closed
TimothyGu opened this issue May 23, 2022 · 18 comments

TimothyGu commented May 23, 2022

Hi @streichler, I've been working with @rohany on running some experiments with DISTAL on Sapling. We've been using the TBLIS tensor contraction library, which internally uses OpenMP, but have been seeing some strange performance behavior.

Compare:

$ mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
    -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0 \
    -lg:prof 8 -lg:prof_logfile prof99-socket-%.log.gz

  # (Side note: the above command prints a few "reservation cannot be satisfied" warnings;
  # that is the subject of a different issue that will be filed separately.)
  # Update: filed as https://github.com/StanfordLegion/legion/issues/1267

$ mpirun -H c0001,c0002,c0003,c0004 --bind-to none \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
    -ll:ocpu 2 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0 \
    -lg:prof 4 -lg:prof_logfile prof99-node-%.log.gz

The profiles are available at

It seems clear that the per-socket version is using processors much more efficiently, but it's not clear to us why. On a single node, the single-process TBLIS/OpenMP run is much (roughly 2×) faster than using mpirun to bind one rank to each socket (using n=70 for a comparable problem size):

$ mpirun -H c0001:2 --bind-to socket \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 70 -tblis -gx 2 -gy 1 \
    -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0

$ /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 70 -tblis -gx 2 -gy 1 \
    -ll:ocpu 2 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0

Thanks in advance for your help!

rohany commented May 23, 2022

In particular, we notice that different invocations of the same task have non-trivial variance in runtime across the different sockets on a node. -ll:show_rsrv says that all of the threads are bound correctly, so we're not sure how to proceed, or whether this is a Realm OpenMP issue or incorrect use of the TBLIS library on our part. cc'ing @manopapad since Legate uses TBLIS as well, in case you've seen something like this before.

@manopapad
Contributor

The common thread I see across this issue and #1267 is that you're seeing worse behavior in cases where you're only leaving 1 core for all Legion/Realm meta-work, i.e. Legion and Realm only have a single core to share between the runtime analysis and handling network transfers (note that Sapling has 10 physical cores per socket, and by default Realm pins each -ll:othr and -ll:cpu processor to a full physical core, ignoring hyperthreading):

# othr=8 util=1 (implicitly, cpu=1) -- no warnings
# Realm reserves 8 cores for othr, 1 core for cpu, leaving 1 core for util + bgwork
$ mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
    -ll:ocpu 1 -ll:othr 8 -ll:util 1 -ll:nsize 10G -ll:ncsize 0 \
    -lg:prof 8 -lg:prof_logfile prof99-socket-othr8-%.log.gz

# othr=9 util=1 cpu=0 -- no warnings
# Realm reserves 9 cores for othr, leaving 1 core for util + bgwork
$ mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
    -ll:ocpu 1 -ll:othr 9 -ll:cpu 0 -ll:util 1 -ll:nsize 10G -ll:ncsize 0 \
    -lg:prof 8 -lg:prof_logfile prof99-socket-cpu0-%.log.gz

# Realm reserves 18 cores for othr, 1 core for cpu, leaving 1 core for util + bgwork
$ mpirun -H c0001,c0002,c0003,c0004 --bind-to none \
    /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
    -ll:ocpu 2 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0 \
    -lg:prof 4 -lg:prof_logfile prof99-node-%.log.gz

In the cases where you're seeing the "reservation cannot be satisfied" warnings, Realm proceeds without pinning threads. Therefore the OS is free to time-slice between the computation threads and the meta-work threads, allowing for network transfers to make progress, meaning that task inputs arrive sooner at remote nodes, so the next iteration starts faster (although the individual leaf tasks might be slower, because they can now get context-switched out).

Do you see a difference in performance if you do:

  • multi-node, bind-to-none, ocpu 2, othr 6, cpu 1, util 1
    vs
  • multi-node, bind-to-socket, ocpu 1, othr 6, cpu 1, util 1

@streichler is there a more direct metric that @TimothyGu and @rohany could collect that will confirm/deny my hypothesis that transfers are being delayed because the network threads are fighting over a single core with the util threads?

rohany commented May 24, 2022

@TimothyGu correct me if I'm wrong, but I don't think the problem we are seeing is related to transfers -- the actual tasks themselves have widely varying runtimes when Realm is managing the threads. We expect some jitter in starting the computation due to incoming transfers, but after that we expect the task execution time itself to be deterministic. We're seeing situations where some task executions randomly take 100 ms or more longer than others when Legion is managing multiple OpenMP processors.

@lightsighter
Contributor

Have you actually confirmed that TBLIS supports multiple OpenMP runtimes in the same process? I know for certain that OpenBLAS does not, and if you try to run it that way you'll see all sorts of weird and undefined behaviors, some of which looked exactly like this. Also, don't underestimate the stupidity of "thread-safe" libraries that take global locks and prevent all sorts of things from happening concurrently.

rohany commented May 24, 2022

Have you actually confirmed that TBLIS supports multiple OpenMP runtimes in the same process?

I have not -- I'll open an issue there and report back. I sort of assumed that it would, because cuNumeric uses it, and cuNumeric seems to run with a rank per node by default.

TBLIS having some sort of global state could explain this performance behavior, but not #1267.

@lightsighter
Contributor

@manopapad will need to say for sure, but I think cuNumeric is quite careful about how we use external libraries that proclaim to support OpenMP. A lot of them seem to use global variables to manage OpenMP execution which is not safe in the case where there are multiple OpenMP runtimes in the same process.

@manopapad
Contributor

I had verified that TBLIS plays nicely with Realm's OpenMP implementation, and works correctly even with multiple OpenMP groups under the same process. I hadn't verified that the performance doesn't degrade with multiple OpenMP groups.

However, judging from the above profiles, this doesn't seem to be the issue; the leaf tasks seem to be taking the same amount of time under both -ll:ocpu 1 -ll:othr 9 and -ll:ocpu 2 -ll:othr 9.

rohany commented May 25, 2022

However, judging from the above profiles, this doesn't seem to be the issue; the leaf tasks seem to be taking the same amount of time under both -ll:ocpu 1 -ll:othr 9 and -ll:ocpu 2 -ll:othr 9.

@TimothyGu can you link the profiles for the case where we perform all of the communication up front, rather than in batches? I think that one had the variance between different task execution lengths showing up more prominently. Manolis is right that there doesn't seem to be much variance in the task lengths in these profiles.

@lightsighter
Contributor

Are you also mapping the instances that the tasks need to the NUMA domains associated with the OpenMP processors?

rohany commented May 25, 2022

Yep

@TimothyGu
Author

@manopapad

Do you see a difference in performance if you do:

  • multi-node, bind-to-none, ocpu 2, othr 6, cpu 1, util 1
    vs
  • multi-node, bind-to-socket, ocpu 1, othr 6, cpu 1, util 1

I just tried

$ mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket \
  /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
  -ll:ocpu 1 -ll:othr 6 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
$ mpirun -H c0001,c0002,c0003,c0004 --bind-to none \
  /scratch2/tigu/taco/distal/build/bin/chemTest-05-20 -n 99 -tblis -gx 4 -gy 2 \
  -ll:ocpu 2 -ll:othr 6 -ll:util 1 -ll:nsize 10G -ll:ncsize 0

There's still a difference: the bind-to-socket version runs faster (1500 vs 1900 ms).


Here are the profiles @rohany was referring to, where we do all of the communication up front:

  • bind-to-none routinely has large leaf variance (anywhere between 800–1200 ms)

    [profile screenshot]

  • bind-to-socket still has large variance sometimes, but generally looks much more uniform (most tasks are in the 850–1000 ms range)

    [profile screenshot]

@manopapad
Contributor

Here are my conclusions so far:

  • Bgwork thread starvation is not a major issue here, but it would still be advisable to leave a couple of physical cores per rank for the runtime analysis and Realm background work.

  • TBLIS is generally able to use two OpenMP groups concurrently without blocking: we see cases where two concurrent invocations of TBLIS on 2 OpenMP groups under the same process finish fast, e.g. task_5 <1448> on the bind-to-none profile.

  • I note that the delayed start of some tasks, e.g. task_5 <888>, is explained by the fact that one OpenMP processor has received all of its required input data and can start its task, while the other one is still waiting on a transfer. A similar "staggering" seems to be happening in the bind-to-socket case as well.

  • The large leaf task running time variance is pointing to a problem inside the body of the OpenMP tasks themselves, but I can't tell if the culprit is TBLIS or Realm. At this point I can think of the following things to try:

    • run with -level openmp=1,threadpool=1 and see if the Realm logging can tell us where time is being spent (it might be something to do with the sequential part of the task, and not the actual parallel execution)
    • add instrumentation around the TBLIS OpenMP loops
    • add instrumentation in the Realm OpenMP code
    • use an OpenMP-aware profiler (may not play nicely with Realm)
    • replace the body of the tasks with a different, hand-written OpenMP kernel (doesn't need to do the same thing, but should take approximately the same time), to confirm whether the issue is unique to TBLIS
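
As a rough illustration of the last suggestion, here is a minimal sketch of a stand-in OpenMP kernel. This is a hypothetical helper (omp_stress_kernel and its sizing are made up, not part of DISTAL or TBLIS); the idea is only to burn a comparable, deterministic amount of compute in the leaf task body so that any remaining run-to-run variance can be attributed to the runtime rather than to TBLIS internals:

// Hypothetical stand-in for the TBLIS contraction in the leaf task body.
// It does no useful work; it just runs a plain OpenMP parallel loop whose
// duration is deterministic, so variance here would point at the runtime or
// thread setup rather than at TBLIS.
#include <cstddef>
#include <vector>

void omp_stress_kernel(std::size_t n) {   // choose n so the runtime matches the real task
  std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
  #pragma omp parallel for
  for (long long i = 0; i < static_cast<long long>(n); ++i) {
    c[i] = a[i] * b[i] + c[i] * a[i];     // a few flops per element, compute-bound
  }
}

If this kernel shows uniform timings under -ll:ocpu 2 while the TBLIS version does not, that would point fairly strongly at TBLIS rather than Realm.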

rohany commented May 28, 2022

Thanks for your analysis, Manolis; everything seems to point to TBLIS perhaps not supporting multiple OpenMP processors in the same process in an isolated manner. I've opened an issue asking about this (devinamatthews/tblis#47), and we'll see what the developers have to say.

The reason I'm not inclined to believe this is a Realm issue is that other OpenMP libraries run without performance problems on a variety of machines that I have access to with DISTAL, so the only variable here is TBLIS.

I'll keep this open until I hear back from the TBLIS folks.

Let me know what you think about #1267, as that seems unrelated now, given the analysis here.

@devinamatthews

TBLIS developer here. If I am understanding correctly, a) the comparison is between using one MPI rank + OpenMP per socket with thread binding to that socket vs. using two sort of "virtual" ranks per node with separate OpenMP runtimes and no thread binding, and b) the former approach is showing better performance?

TBLIS uses a pool of dynamically-allocated memory blocks for packing tensor data during computation. New blocks are allocated when not enough are available (i.e. on the first contraction call typically), and then reused in later computations. Blocks are checked out of a linked-list structure which uses a global lock (could be an OpenMP lock or a spin-lock). In the latter case above, if threads are free to migrate across the sockets then it could be the case that a thread running on socket 0 checks out a memory block created by a thread on socket 1 or vice versa. This will cause excessive memory traffic between sockets in a non-deterministic fashion.

Perhaps this is what you are seeing?
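
For readers following along, here is a much-simplified sketch of the shared-pool pattern being described (illustrative only; SharedBlockPool and its layout are invented for this sketch and are not the actual TBLIS MemoryPool code):

// Illustrative only -- not TBLIS source. One pool of fixed-size blocks shared by
// every OpenMP group in the process, guarded by a single lock.
#include <cstdlib>
#include <mutex>
#include <vector>

class SharedBlockPool {
  std::size_t block_size_;
  std::mutex lock_;                 // global lock: threads from both sockets contend here
  std::vector<void*> free_blocks_;  // blocks may have been first-touched on either socket

public:
  explicit SharedBlockPool(std::size_t block_size) : block_size_(block_size) {}

  void* acquire() {
    std::lock_guard<std::mutex> g(lock_);
    if (!free_blocks_.empty()) {
      void* p = free_blocks_.back();  // may be memory that lives on the *other* NUMA node
      free_blocks_.pop_back();
      return p;
    }
    return std::malloc(block_size_);  // first touch lands on whichever socket runs this thread
  }

  void release(void* p) {
    std::lock_guard<std::mutex> g(lock_);
    free_blocks_.push_back(p);        // block becomes reusable by threads on any socket
  }
};

With one rank (and hence one pool) per socket this is harmless; with two OpenMP groups sharing one process and one pool, a block checked out by threads on socket 0 can easily have been allocated and touched by threads on socket 1, which is the cross-socket traffic described above.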

rohany commented May 28, 2022

Thanks @devinamatthews! Your understanding is correct. I don't believe that threads can migrate between sockets in our runtime, but it definitely seems reasonable for there to be contention or NUMA effects if the free-list is shared between sockets. Is there some way we could create the free-list explicitly and point tensordot calls at a specific free-list, rather than using an existing shared data structure?

@devinamatthews

Unfortunately no. Something like this will be available in a future version. You can possibly check if this is the problem by editing the TBLIS code: in src/memory/memory_pool.hpp, simply comment out lines 107--136 (MemoryPool::acquire) and replace lines 157--160 (MemoryPool::release) with free(ptr). This will reallocate a new data block every time, but that should be a much smaller performance hit than going across NUMA nodes.
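
In terms of the sketch above, the suggested experiment boils down to bypassing the free list entirely, something like the following (again illustrative of the shape of the change, not the actual TBLIS edit):

// Illustrative shape of the experiment: every acquire is a fresh allocation and
// every release goes straight back to the allocator, so no block is ever handed
// from the threads of one socket to the threads of the other.
#include <cstdlib>

struct NoReusePool {
  static void* acquire(std::size_t bytes) { return std::malloc(bytes); }  // new block every time
  static void  release(void* p)           { std::free(p); }               // never recycled
};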

rohany commented May 28, 2022

Awesome, @TimothyGu can you try this when you get a chance?

I looked at some of the code; it does look somewhat difficult to tease out the memory pool, since the packing routines are templated over a particular pool. One thing that could be done for a Legion fork of TBLIS is to change line 163 of memory_pool.hpp to maintain a free-list for each OpenMP processor. The memory pool could index into it with Realm::get_executing_processor() or something.
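
A rough sketch of what that per-processor split could look like (illustrative only: PerProcessorPool is not part of TBLIS or Realm, and the use of Realm::Processor::get_executing_processor() assumes that call is available from inside the leaf task, as suggested above):

// Illustrative only. One free list per Realm processor, keyed by the id of the
// processor executing the calling thread, so blocks are only ever reused by the
// same OpenMP processor (and hence stay on the same socket). A single lock still
// guards the map; the point here is NUMA locality of the blocks, not lock-freedom.
#include <cstdlib>
#include <mutex>
#include <unordered_map>
#include <vector>

#include "realm.h"  // assumed include for Realm::Processor

class PerProcessorPool {
  std::size_t block_size_;
  std::mutex lock_;
  std::unordered_map<unsigned long long, std::vector<void*>> free_lists_;

public:
  explicit PerProcessorPool(std::size_t block_size) : block_size_(block_size) {}

  void* acquire() {
    unsigned long long me = Realm::Processor::get_executing_processor().id;
    std::lock_guard<std::mutex> g(lock_);
    auto& list = free_lists_[me];
    if (!list.empty()) { void* p = list.back(); list.pop_back(); return p; }
    return std::malloc(block_size_);  // first touch stays on this processor's NUMA node
  }

  void release(void* p) {
    unsigned long long me = Realm::Processor::get_executing_processor().id;
    std::lock_guard<std::mutex> g(lock_);
    free_lists_[me].push_back(p);     // returned to the list of the processor that used it
  }
};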

If this is the problem, then I don't think it's too important for us to fix right now @TimothyGu, we can continue to run with a rank-per-socket. It might be more of a concern to cuNumeric though.

@TimothyGu
Author

Thanks @devinamatthews, I think that solved the performance issues I was seeing! Since this appears to be a TBLIS-specific issue, I'm closing this ticket.
