Congratulations on the amazing acceleration project, but I have some questions and hope for replies. Regarding your best-practice hardware:
I think the memory configuration may be 8 + 8 = 16 channels, which can provide 600+ GB/s of bandwidth. With the 4090's 1000+ GB/s VRAM bandwidth, 600 GB/s RAM bandwidth, ~330 TOPS of GPU compute, and 3 GHz * 64 cores * 2 units ≈ 384 GOPS (AVX) of CPU compute, your doc shows it can provide 13.69 tokens/s on DeepSeek R1 671B Q4.
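For a quick sanity check of that figure (assuming DDR5-4800, which is my guess rather than something stated in the docs), 16 channels would indeed land in the 600+ GB/s range:

```python
# Theoretical peak bandwidth for 16 channels of DDR5-4800 (speed grade is an assumption).
channels = 16
transfer_rate = 4800e6        # transfers per second per channel
bytes_per_transfer = 8        # 64-bit channel width
peak_gb_s = channels * transfer_rate * bytes_per_transfer / 1e9
print(f"Peak RAM bandwidth: {peak_gb_s:.0f} GB/s")   # ~614 GB/s
```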
Replies: 1 comment
I found some documents today. According to the official doc and the R1 model structure, ktransformers keeps some heavily compute-bound layers on the GPU and stores the MoE experts in RAM; during actual inference only 8 of the 256 routed experts are activated (about 37B active parameters, lightly compute-bound layers). The relevant layer types are:
- shared expert & gate (heavily compute-bound, kept on the GPU)
- normal (routed) experts (stored in RAM, only 8 of 256 active per token)
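For illustration, here is a minimal, generic sketch of top-k MoE gating (not the actual ktransformers implementation): the gate scores all 256 routed experts for each token, but only the top 8 are evaluated, so only their weights have to be streamed from RAM.

```python
import numpy as np

def route_token(hidden, gate_weights, top_k=8):
    """Pick the top_k routed experts for one token (generic MoE gating sketch)."""
    scores = gate_weights @ hidden                  # one score per expert, shape (256,)
    top_experts = np.argsort(scores)[-top_k:]       # indices of the selected experts
    probs = np.exp(scores[top_experts])
    probs /= probs.sum()                            # normalized routing weights
    return top_experts, probs

# Toy example: 256 experts, hidden size 16.
rng = np.random.default_rng(0)
experts, weights = route_token(rng.normal(size=16), rng.normal(size=(256, 16)))
print(experts, weights.round(3))
```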
And according to the R1 deployment performance and the hardware analysis in the question description, the bottleneck matches the RAM bandwidth.
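To make that concrete, here is a rough back-of-envelope estimate (all numbers below are my assumptions, not official figures): streaming the RAM-resident share of the ~37B activated Q4 parameters per token caps throughput at a few tens of tokens/s, and with realistic effective bandwidth utilization that lands near the reported 13.69 tokens/s.

```python
# Rough bandwidth-bound estimate per generated token (assumed numbers, not official figures).
ram_bandwidth_gb_s = 600          # assumed 16-channel RAM bandwidth
active_params = 37e9              # ~37B activated parameters per token (MoE)
gpu_resident_fraction = 0.15      # assumption: share of active weights kept on the GPU
bytes_per_param_q4 = 0.57         # ~4.5 bits/param incl. quantization overhead (assumed)

ram_bytes_per_token = active_params * (1 - gpu_resident_fraction) * bytes_per_param_q4
upper_bound_tps = ram_bandwidth_gb_s * 1e9 / ram_bytes_per_token

print(f"RAM traffic per token: {ram_bytes_per_token / 1e9:.1f} GB")
print(f"Bandwidth-bound upper limit: {upper_bound_tps:.1f} tokens/s")
# At ~40-60% effective bandwidth utilization this lands near the reported 13.69 tokens/s.
```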
T…