Some questions about best practice #176

Answered by wencynyu
wencynyu asked this question in Q&A

I went through some documents today. According to the official doc and the R1 model structure, ktransformers keeps the heavily compute-bound layers on the GPU and stores the MoE experts in RAM; during actual inference only 8 of the 256 routed experts are activated per token (about 37B active parameters, lightly compute-bound). The MoE weights split into the parts below (a minimal routing sketch follows the list):

  • shared expert & gate

  • normal (routed) experts
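
Roughly how that 8-of-256 activation works, as I understand it. This is a minimal sketch of top-k MoE gating with toy dimensions and an assumed softmax over router logits; it is not ktransformers' or DeepSeek's actual gating code:

```python
# Minimal top-k MoE routing sketch (illustrative only, not ktransformers code).
# Shows why only 8 of the 256 routed experts' weights are touched per token,
# so only a small slice of the 671B parameters has to be read for each step.
import numpy as np

NUM_EXPERTS = 256   # routed experts per MoE layer
TOP_K = 8           # experts activated per token

def route_token(hidden, gate_weight):
    """Pick the top-k experts and their mixing weights for one token."""
    logits = gate_weight @ hidden                # (NUM_EXPERTS,) router scores
    top_idx = np.argsort(logits)[-TOP_K:]        # indices of the 8 highest-scoring experts
    weights = np.exp(logits[top_idx] - logits[top_idx].max())
    weights /= weights.sum()                     # normalized mixing weights (assumed softmax)
    return top_idx, weights

# Toy dimensions, just to make the sketch runnable.
hidden = np.random.randn(1024).astype(np.float32)
gate_weight = np.random.randn(NUM_EXPERTS, 1024).astype(np.float32)
idx, w = route_token(hidden, gate_weight)
print(idx)  # only these experts' weights need to be streamed from RAM
```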

And according to the R1 deployment performance and the hardware analysis in the question description, the bottleneck matches the RAM bandwidth.

With a 4090 (1000+ GB/s VRAM bandwidth, ~330 TOPS of GPU compute), 600 GB/s RAM bandwidth, and roughly 3 GHz * 64 cores * 2 units ≈ 384 GOPS of CPU AVX throughput, your doc shows it can reach 13.69 tokens/s on DeepSeek R1 671B Q4.
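
To sanity-check that, here is a back-of-the-envelope estimate of the RAM-bandwidth ceiling. The active-parameter count and RAM bandwidth are the numbers above; the bytes-per-weight and the fraction of active weights kept in VRAM are my own assumptions for illustration:

```python
# Rough RAM-bandwidth ceiling for decoding (assumed numbers, not measured).
ACTIVE_PARAMS = 37e9           # params touched per token (8/256 experts + dense layers)
BYTES_PER_PARAM = 0.5          # Q4 quantization ≈ 4 bits per weight (assumption)
RAM_BW = 600e9                 # bytes/s, from the numbers above
GPU_RESIDENT_FRACTION = 0.2    # assumed share of active weights already in VRAM

bytes_from_ram = ACTIVE_PARAMS * BYTES_PER_PARAM * (1 - GPU_RESIDENT_FRACTION)
ceiling = RAM_BW / bytes_from_ram
print(f"bandwidth-bound ceiling ~ {ceiling:.1f} tok/s")  # ~40 tok/s with these assumptions
```

Sustained DDR bandwidth is usually well below the theoretical peak, and the CPU expert GEMMs add overhead, so the measured 13.69 tok/s sitting under this ceiling still looks consistent with a memory-bound workload.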

T…
