
Hopper config #80
Open
christindbose wants to merge 9 commits into dev
Conversation

@christindbose commented Oct 24, 2024

This is an attempt to update the configs with the most relevant features from Hopper (SXM5, to be precise). The key config parameters modified are listed below, with a rough sketch of the corresponding entries after the list:

  • Number of SMs
  • Number of memory channels and data width per channel (HBM2 -> HBM3 doubles the number of channels per stack but halves the data width)
  • L1D cache size
  • L2 cache size
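For reference, a rough sketch of the kinds of entries involved, using the values quoted later in this discussion (the exact L1D/L2 cache lines are in the PR diff and are not reproduced here):

# SM count: one core per cluster; the sub-core model covers the 4 schedulers per SM
-gpgpu_n_clusters 132
-gpgpu_n_cores_per_cluster 1
# HBM3: more, narrower channels per stack than HBM2
-gpgpu_n_mem 80
-gpgpu_n_sub_partition_per_mchannel 2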

christindbose marked this pull request as ready for review on October 24, 2024, 20:19

@kunal-mansukhani commented Jan 3, 2025

@christindbose Tried this out and got

hashing.cc:88: unsigned int ipoly_hash_function(new_addr_type, unsigned int, unsigned int): Assertion failed: "memory_partition_indexing error: The number of channels should be 16, 32 or 64 for the hashing IPOLY index function. Other bank numbers are not supported. Generate it by yourself!"

The number of memory channels is 80, so it can't be IPOLY-hashed. What's the workaround?

@christindbose (Author)

@kunal-mansukhani So this is because of the L2 cache configuration setting. I have pushed a simple fix. Please try it out and see if it works for your case.

@kunal-mansukhani

@christindbose
Thanks for the help! I tried out your change and it fixed the error I was getting. The program runs correctly now. But tell me if this makes sense:

I'm running a 32 x 32 shared-memory matmul with the TITAN V config vs. the H100 config.
Titan V Results:

gpgpu_simulation_time = 0 days, 0 hrs, 0 min, 1 sec (1 sec)
gpgpu_simulation_rate = 44032 (inst/sec)
gpgpu_simulation_rate = 903 (cycle/sec)
gpgpu_silicon_slowdown = 1328903x

H100 Results:

gpgpu_simulation_time = 0 days, 0 hrs, 0 min, 3 sec (3 sec)
gpgpu_simulation_rate = 14677 (inst/sec)
gpgpu_simulation_rate = 1971 (cycle/sec)
gpgpu_silicon_slowdown = 574327x

Shouldn't the H100 be much faster when factoring in the gpgpu_silicon_slowdown?

@christindbose (Author)

What you're seeing is the simulation time, not the program runtime. Hopper is a larger GPU in terms of resources (#SMs, #channels, etc.), so it's expected that simulating it takes longer. The kernel runtime is given by the cycle count, so that is what you should really be looking at.

@kunal-mansukhani

@christindbose

Got it. So if I do total cycles / core clock speed, that should give me the actual program execution time on that device, correct? I'm doing that for this comparison and the Titan V is still coming out ahead.

Is it that the overhead is too large relative to the actual compute? Should I be trying larger matmuls?

@christindbose (Author)

That is correct. How much of a difference are we talking about?

You should be looking at larger matrix sizes. It's possible that Hopper is highly underutilized at small sizes.
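As a sanity check on that conversion (the cycle count below is made up purely to illustrate the formula; use the gpu_tot_sim_cycle reported by the simulator and the core clock from the config):

exec_time = total_cycles / core_clock_freq
e.g. 1,200,000 cycles / 1.2 GHz = 1,200,000 / 1.2e9 s = 1 ms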

@kunal-mansukhani

@christindbose The total number of cycles for the H100 is consistently higher than the total number of cycles for the Titan V, and they both have a similar clock speed, so it looks like the Titan V has a shorter execution time for the same program. I tried larger matrices but the result is the same.

The program I'm using doesn't leverage Tensor Cores; is that the issue?

@christindbose (Author)

@kunal-mansukhani It's fine to not use tensor cores.

I'd like to know more about your setup. Are you running the simulations in PTX or trace mode? The current Hopper config doesn't reflect the clock frequency of the actual Hopper GPU (we mostly only scaled up the relevant hardware resources), so that will need to be fixed before comparing against a real Hopper.
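For context, the clock-domain line in the current config (also quoted later in this thread) is

-gpgpu_clock_domains 1200.0:1200.0:1200.0:2619.0

i.e. a 1.2 GHz core clock, while the real H100 SXM5 boosts to roughly 1.98 GHz (a figure from NVIDIA's public specs, not from this PR). Cycle counts are still comparable across configs, but wall-clock projections will be off until the clocks are retuned.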

@kunal-mansukhani

@christindbose
I'm running the simulation in PTX mode, using CUDA 11.7. So when I calculate the GPU execution time, should I use the clock speed in the config file or the real Hopper core clock speed?

@kunal-mansukhani commented Jan 23, 2025

It's still significantly slower than weaker GPUs on normal kernels.
Edit: referring to GPU execution time, not simulation time

@JRPan commented Jan 23, 2025

What are the numbers you are referring to? What are your simulation cycles and what are your hardware cycles? How far apart are they?

@JRPan commented Jan 23, 2025

Actually, I forgot you were comparing with the Titan V. What are the cycle counts for the Titan V?

@beneslami

Sorry @christindbose
I have a few questions:

In gpgpusim.config you defined:
-gpgpu_n_clusters 132
-gpgpu_n_cores_per_cluster 1
-gpgpu_n_mem 80
-gpgpu_n_sub_partition_per_mchannel 2

So this means we have 132 SMs with 1 core each. According to NVIDIA, each SM in the H100 contains 128 cores. So shouldn't gpgpu_n_cores_per_cluster be 128?

  1. In this line, the value of k is 144. Why? Isn't it supposed to represent the total number of cores + memory sub-partitions? Why isn't it 292 (132*1 + 80*2)?

  2. The clock domains: -gpgpu_clock_domains 1200.0:1200.0:1200.0:2619.0

  • GPU clock: 1.2 GHz
  • NoC: 1.2 GHz -> Since the flit size is 40 B and the topology is a fly, this means the total bandwidth is around 15 TB/s. Is that correct?
  • L2: 1.2 GHz -> similar to the NoC, around 15 TB/s, right?
  • Memory: 2.619 GHz -> I cannot reproduce the correct memory bandwidth from this frequency. The memory is HBM3 with 5 stacks and 80 memory channels, so each stack has 16 bidirectional channels. Since the bus width is 8 bytes, the unidirectional channel for each stack is 64 bits. 64 bits * 5 stacks * 2.619 GHz * 2 (DDR) = 3.2 TB/s. Correct?

Thanks

…2_rop_latency, dram_latency, gpgpu_dram_timing_opt, Gpgpu_num_reg_banks
@christindbose (Author)

hi @beneslami

Here are the answers to your questions:

  1. Your understanding is correct. Each SM is divided into 4 sub-cores (and hence has 4x32 = 128 cores). We have modeled this by means of gpgpu_sub_core_model and gpgpu_num_sched_per_core. The gpgpu_n_clusters is just an implementation detail that dictates the number of interconnect ports; all 'cores' within a 'cluster' share an interconnect port. More details here
  2. That should be correct. Note that by default, gpgpusim.config uses a custom interconnect and not booksim (i.e. config_hopper_islip.icnt is not used by default). You can toggle the choice of interconnect using the network_mode config option.
  3. Can you explain your reasoning regarding the icnt BW? Is it based on a k-ary n-fly topology?
    For L2: BW = 32 B x 1.2 GHz x 80 ≈ 3 TB/s (unidirectional)
    For Mem: I think the end BW of ~3.3 TB/s is correct. What's the issue here? I calculated it as 8 B x 80 x 2 x 2.619 GHz (expanded below).
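Written out, the arithmetic is as follows (using the 32 B L2 interface width and 8 B channel width quoted above):

L2:   32 B x 1.2 GHz x 80 channels             ≈ 3.07 TB/s (unidirectional)
DRAM: 8 B x 80 channels x 2 (DDR) x 2.619 GHz  ≈ 3.35 TB/s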

@christindbose (Author)

It's still significantly slower than weaker GPUs on normal kernels. Edit: referring to GPU execution time, not simulation time

hi @kunal-mansukhani

Since you're running in PTX mode, the config relies on the PTX latencies to drive the simulation. We have not correlated these PTX latencies (our priority is enabling trace-based simulation). Hence, the numbers you get are not sound, as indicated by the mismatch between the Titan V and Hopper.

If you don't have access to a Hopper GPU to generate traces for trace-based simulation, you can download traces collected on a Volta here and compare the simulation results using the Titan V and Hopper configs. This will be the closest proxy to simulating an actual Hopper.

@kunal-mansukhani

@christindbose
Got it. Thanks!

@beneslami

Hi @christindbose

Thank you very much for your reply. Regarding the icnt BW, it seems I made a mistake. Here is my reasoning:

Since k = 292 and n = 1, I think it's an all-to-all connection (correct me if I'm wrong). Based on the parameters below:
-gpgpu_n_clusters 132
-gpgpu_n_cores_per_cluster 1
-gpgpu_n_mem 80
-gpgpu_n_sub_partition_per_mchannel 2

we have 132 SMs, which generate requests, and 160 memory slices, which receive requests and generate replies. So every SM has 160 bidirectional connections, which is 80 unidirectional connections. Since the flit size is 40 bytes:
80 * 40 B * 1.2 GHz = 3.8 TB/s unidirectional - this means that every cluster has around 8 TB/s of bandwidth.

Right?
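(Checking the arithmetic as written: 80 links x 40 B/flit x 1.2 GHz ≈ 3.84 TB/s per direction, or about 7.7 TB/s counting both directions, which is where the ~8 TB/s figure comes from. Whether each SM really sees 80 independent links depends on how the custom crossbar mentioned above is modeled.)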
