Hopper config #80
base: dev
Conversation
@christindbose Tried this out and got: `hashing.cc:88: unsigned int ipoly_hash_function(new_addr_type, unsigned int, unsigned int): Assertion failed: "memory_partition_indexing error: The number of channels should be 16, 32 or 64 for the hashing IPOLY index function. Other bank numbers are not supported. Generate it by yourself!"` Because the number of memory channels is 80, it can't be IPOLY hashed. What's the workaround?
@kunal-mansukhani So this is because of the L2 cache configuration setting. I have pushed a simple fix. Please try it out and see if it works for your case.
@christindbose I'm running a 32×32 shared-memory matmul using the TITAN V vs. H100 configs.
H100 Results:
Shouldn't the H100 be much faster when factoring in the gpgpu_silicon_slowdown?
What you're seeing is the simulation time, not the program runtime. Hopper is a larger GPU in terms of resources (#SMs, #channels, etc.), so it's possible that the simulation takes longer. The kernel runtime is given by the cycle count, so that is what you should really be looking at.
Got it. So if I do total cycles / core clock speed, that should give me the actual program execution time if it were executed on that device, correct? I'm doing that for this comparison and the Titan V is still coming out ahead. Is it that the overhead is too large relative to the actual compute? Should I be trying larger matmuls?
That is correct. How much of a difference are we talking about? You should be looking at larger matrix sizes; it's possible that Hopper is highly underutilized at small sizes.
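The cycles-to-runtime conversion discussed above can be sketched as follows. Note that the cycle counts and clock frequencies used here are made-up placeholders, not measured values; substitute the totals and core clocks from your own simulation output.

```python
# Convert simulated cycle counts to an estimated kernel runtime.
# The cycle counts and core clocks below are hypothetical placeholders;
# substitute the total cycle count and core clock from your own runs.

def runtime_seconds(total_cycles: int, core_clock_hz: float) -> float:
    """Estimated execution time = cycles / clock frequency."""
    return total_cycles / core_clock_hz

# Hypothetical example values, for illustration only.
titan_v_runtime = runtime_seconds(total_cycles=1_200_000, core_clock_hz=1.2e9)
h100_runtime = runtime_seconds(total_cycles=1_500_000, core_clock_hz=1.41e9)

print(f"Titan V: {titan_v_runtime * 1e6:.1f} us")
print(f"H100:    {h100_runtime * 1e6:.1f} us")
```

Comparing these per-kernel runtimes (rather than wall-clock simulation time) is what makes the two configs comparable.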
@christindbose The total number of cycles for the H100 is consistently higher than the total number of cycles for the Titan V, and they both have a similar clock speed, so for the same program the Titan V appears to have a shorter execution time. I tried larger matrices but it still seems the same. The program I'm using doesn't leverage Tensor Cores; is that the issue?
@kunal-mansukhani It's fine not to use tensor cores. I'd like to know more about your setup. Are you running the simulations in PTX or trace mode? The current Hopper config doesn't reflect the clock frequency of the actual Hopper GPU (we mostly only scaled up the relevant hardware resources), so that will need to be fixed in order to compare with the actual Hopper.
@christindbose |
It's still significantly slower than weaker GPUs on normal kernels.
What are the numbers you are referring to? What are your simulation cycles and your hardware cycles? How far apart are they?
Actually, I forgot you were comparing with the Titan V. What is the cycle count for the Titan V?
Sorry @christindbose, in gpgpusim.config you defined: So this means that here we have 132 SMs with 1 core inside each. According to NVIDIA, each SM in the H100 contains 128 cores. So, shouldn't
Thanks
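For context, stock GPGPU-Sim configs typically express the SM hierarchy with flags like the following. The parameter names are standard GPGPU-Sim options; the exact values used in this PR are not quoted above, so the ones below are illustrative only:

```
# SM hierarchy (illustrative values; check your gpgpusim.config)
-gpgpu_n_clusters 132         # number of SM clusters
-gpgpu_n_cores_per_cluster 1  # SIMT cores per cluster
```

In GPGPU-Sim terminology a "core" here is a whole SIMT core (an SM), not an individual CUDA core; the 128 FP32 lanes per SM are modeled inside the core, not as separate cores.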
…2_rop_latency, dram_latency, gpgpu_dram_timing_opt, Gpgpu_num_reg_banks
hi @beneslami Here are the answers to your questions:
Since you're running in PTX mode, the config relies on the PTX latencies to run the simulation. We have not correlated these PTX latencies (our priority is enabling trace-based simulation), so the numbers you get are not sound (as clearly indicated by the mismatch between the Titan V and Hopper). If you don't have access to a Hopper to generate traces for trace-based simulation, you can download traces collected for a Volta here and compare the simulation results using the Titan V and Hopper configs. This will be the closest proxy to simulating an actual Hopper.
@christindbose |
Thank you very much for your reply. Regarding the icnt BW: it seems that I made a mistake. Here is my reasoning: since K = 292 and n = 1, I think it's an all-to-all connection (correct me if I'm wrong). Based on the parameters below, we have 132 SMs which generate requests, and 160 memory slices which receive requests and generate replies. So every SM has 160 bidirectional connections, i.e. 80 uni-directional connections. Since the flit size is 40 bytes: right?
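A rough back-of-the-envelope sketch of the kind of icnt bandwidth arithmetic being discussed. The node counts and flit size come from the comment above; the icnt clock frequency is a hypothetical placeholder, and the topology interpretation is the commenter's own, which may be wrong:

```python
# Back-of-the-envelope interconnect bandwidth estimate.
# Node counts and flit size are from the discussion above; the icnt
# clock frequency is a hypothetical placeholder.

FLIT_SIZE_BYTES = 40     # flit size quoted in the comment above
NUM_SMS = 132            # request generators
NUM_MEM_SLICES = 160     # memory slices (reply generators)
ICNT_CLOCK_HZ = 1.0e9    # hypothetical; use your config's icnt clock

# Total icnt endpoints, matching K = 292 in the comment above.
total_nodes = NUM_SMS + NUM_MEM_SLICES

# Peak injection at one flit per cycle per node.
per_node_bw_gbs = FLIT_SIZE_BYTES * ICNT_CLOCK_HZ / 1e9

print(f"total icnt nodes: {total_nodes}")
print(f"per-node peak injection BW: {per_node_bw_gbs:.1f} GB/s")
```

This only bounds per-node injection bandwidth; the effective bisection bandwidth depends on the actual topology and routing, which the config's K and n parameters determine.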
This is an attempt to update the configs with the most relevant features of Hopper (SXM5, to be precise). The key config parameters modified are: