Slow Speed Issue #1029

Open
chigkim opened this issue Oct 10, 2024 · 6 comments

Comments

@chigkim

chigkim commented Oct 10, 2024

I ran the tests below on a MacBook Pro with an M3 Max and 64 GB of RAM. MLX seems to run much slower than llama.cpp with flash attention enabled, especially for token generation.

Is this gap just a result of flash attention not being available in MLX? If so, it would be amazing to have flash attention!

I'm including the full commands and relevant logs below. Also, here is my full prompt (from a Wikipedia article).

lcpp-fa = llama.cpp with flash attention; lcpp = llama.cpp without it. PP = prompt processing and TG = token generation, both in tokens per second.

Thanks!

| Quant  | Engine  | PP (tok/s) | TG (tok/s) |
|--------|---------|------------|------------|
| Q4_K_M | lcpp-fa | 385.56     | 32.00      |
| Q4_K_M | lcpp    | 301.82     | 9.41       |
| 4bit   | mlx     | 421.59     | 23.43      |
| Q8_0   | lcpp-fa | 393.39     | 25.90      |
| Q8_0   | lcpp    | 301.18     | 8.66       |
| 8bit   | mlx     | 401.616    | 19.02      |
./llama-cli -m ../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -c 33000 -n 1000 --temp 0.0 --top_p 0.9 --seed 1000 -fa -f ../text/llama-portugal.txt;say done
llama_perf_context_print: prompt eval time =   83343.35 ms / 32134 tokens (    2.59 ms per token,   385.56 tokens per second)
llama_perf_context_print:        eval time =   22626.14 ms /   724 runs   (   31.25 ms per token,    32.00 tokens per second)

./llama-cli -m ../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -c 33000 -n 1000 --temp 0.0 --top_p 0.9 --seed 1000 -f ../text/llama-portugal.txt;say done
llama_perf_context_print: prompt eval time =  106466.04 ms / 32134 tokens (    3.31 ms per token,   301.82 tokens per second)
llama_perf_context_print:        eval time =   88601.05 ms /   834 runs   (  106.24 ms per token,     9.41 tokens per second)

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt  -<../text/portugal.txt;say done
Prompt: 32134 tokens, 421.590 tokens-per-sec
Generation: 554 tokens, 23.434 tokens-per-sec

./llama-cli -m ../models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 33000 -n 1000 --temp 0.0 --top_p 0.9 --seed 1000 -fa -f ../text/llama-portugal.txt;say done
llama_perf_context_print: prompt eval time =   81683.80 ms / 32134 tokens (    2.54 ms per token,   393.39 tokens per second)
llama_perf_context_print:        eval time =   27257.10 ms /   706 runs   (   38.61 ms per token,    25.90 tokens per second)

./llama-cli -m ../models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 33000 -n 1000 --temp 0.0 --top_p 0.9 --seed 1000 -f ../text/llama-portugal.txt;say done
llama_perf_context_print: prompt eval time =  106694.04 ms / 32134 tokens (    3.32 ms per token,   301.18 tokens per second)
llama_perf_context_print:        eval time =   90074.36 ms /   780 runs   (  115.48 ms per token,     8.66 tokens per second)

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-8bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt  -<../text/portugal.txt;say done
Prompt: 32134 tokens, 401.616 tokens-per-sec
Generation: 788 tokens, 19.020 tokens-per-sec
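
For completeness, the MLX runs can also be reproduced from Python instead of the CLI. Here is a minimal sketch, assuming the mlx_lm Python API (`load`/`generate`) and the same prompt file as above:

```python
# Minimal sketch of the 4-bit MLX run via the Python API (same model and
# prompt file as the CLI command above). verbose=True prints the prompt and
# generation tokens-per-sec figures quoted in the table.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

with open("../text/portugal.txt") as f:
    prompt = f.read()

text = generate(model, tokenizer, prompt=prompt, max_tokens=1000, verbose=True)
```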
@awni
Member

awni commented Oct 10, 2024

There are likely two factors here:

  • There is an optimization in the latest MLX LM (in the main branch but not yet released) that improves memory use and might help in your case for very long prompts if the system is under memory pressure.

  • Yes, MLX does not have a fast fused attention yet. This is WIP, and when it lands we should see a speedup, especially for the long-context cases (see the sketch below for the op in question).
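
For context, the fused attention op MLX exposes is `mx.fast.scaled_dot_product_attention`; the sketch below just shows how it is called (illustrative shapes, not tied to the model above), since that is the kernel the planned speedup targets:

```python
# Minimal sketch of MLX's fused attention op with illustrative shapes.
import mlx.core as mx

B, n_heads, L, D = 1, 32, 128, 128   # batch, heads, sequence length, head dim
q = mx.random.normal((B, n_heads, L, D))
k = mx.random.normal((B, n_heads, L, D))
v = mx.random.normal((B, n_heads, L, D))

# One fused kernel instead of separate matmul / softmax / matmul steps.
out = mx.fast.scaled_dot_product_attention(q, k, v, scale=1.0 / D**0.5)
print(out.shape)  # (1, 32, 128, 128)
```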

@chigkim
Author

chigkim commented Oct 10, 2024

Thanks for the response!
Yeah, I actually pulled from git this morning and installed from source, so the fix from #1027 should be included in my test.
Looking forward to fast fused attention! :)

@awni
Member

awni commented Oct 21, 2024

FYI we sped up the fused attention in MLX 0.19.0. It should be noticeably faster, though still a bit slower than llama.cpp at very long sequence lengths; there are still some optimizations to do there.
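
Before rerunning the numbers, it's worth confirming which versions are actually in use; a small sketch, assuming both packages were installed with pip:

```python
# Confirm the installed mlx / mlx-lm versions before re-benchmarking;
# the fused-attention speedup mentioned above requires mlx >= 0.19.0.
from importlib.metadata import version

print("mlx:", version("mlx"))
print("mlx-lm:", version("mlx-lm"))
```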

@chigkim
Author

chigkim commented Oct 21, 2024

That's awesome.
I just pulled, and ran the same test.
For 8-bit, prompt processing increased from 401.616 to 412.589 tokens-per-sec, and generation increased from 19.020 to 24.498 tokens-per-sec.

Interestingly, running the same test with 4-bit produced a bug where it generated the full 1000 max-tokens and just repeated the last two paragraphs over and over.

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt -<../text/portugal.txt;say done

Here is my full prompt (from a Wikipedia article), which is the same as in my original test.

Just to see what happens, I increased --max-kv-size to 34k and --max-tokens to 2000, and it generated the full 2000 tokens, still looping.

Running the exact same command with 8bit in place of 4bit generated the correct text and stopped at the end without looping. In my previous test, the 4-bit model did not do that.
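
In case it helps with triage, here is a quick plain-Python check (a sketch with a made-up example string) for the kind of tail repetition described above:

```python
# Rough check for a repetition loop: do the last few paragraphs of the
# generated text all match? (plain-Python sketch, made-up example below)
def tail_repeats(text: str, min_repeats: int = 3) -> bool:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return len(paragraphs) >= min_repeats and len(set(paragraphs[-min_repeats:])) == 1

looping = "Intro paragraph.\n\n" + "Same paragraph again.\n\n" * 4
print(tail_repeats(looping))  # True for this made-up looping text
```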

Should I create a separate issue?

Thanks!

@awni
Member

awni commented Oct 21, 2024

> Should I create a separate issue?

If the behavior changed from 0.18.1 to 0.19, then yes, it would be good to file another issue for that. Small changes due to numerics might make sense, but if it went from working to not working, that doesn't sound good.

@awni
Member

awni commented Nov 1, 2024

@chigkim are you still having this issue?
