Slow Speed Issue #1029

Open
chigkim opened this issue Oct 10, 2024 · 6 comments

Comments

@chigkim

chigkim commented Oct 10, 2024

I ran the tests below on a MacBook Pro with an M3 Max and 64 GB of RAM. MLX seems to run much slower than llama.cpp with flash attention enabled, especially for token generation.

Is this gap just a result of flash attention not being available in MLX? If so, it would be amazing to have flash attention!

I'm including the full commands and relevant logs below. Also, here is my full prompt (from a Wikipedia article).

lcpp-fa = llama.cpp with flash attention; lcpp = llama.cpp without it. PP = prompt processing and TG = token generation, both in tokens per second.

Thanks!

| Quant  | Engine  | PP (tok/s) | TG (tok/s) |
|--------|---------|------------|------------|
| Q4_K_M | lcpp-fa | 385.56     | 32.00      |
| Q4_K_M | lcpp    | 301.82     | 9.41       |
| 4bit   | mlx     | 421.59     | 23.43      |
| Q8_0   | lcpp-fa | 393.39     | 25.90      |
| Q8_0   | lcpp    | 301.18     | 8.66       |
| 8bit   | mlx     | 401.616    | 19.02      |
./llama-cli -m ../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -c 33000 -n 1000 --temp 0.0 --top_p 0.9 --seed 1000 -fa -f ../text/llama-portugal.txt;say done
llama_perf_context_print: prompt eval time =   83343.35 ms / 32134 tokens (    2.59 ms per token,   385.56 tokens per second)
llama_perf_context_print:        eval time =   22626.14 ms /   724 runs   (   31.25 ms per token,    32.00 tokens per second)

./llama-cli -m ../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -c 33000 -n 1000 --temp 0.0 --top_p 0.9 --seed 1000 -f ../text/llama-portugal.txt;say done
llama_perf_context_print: prompt eval time =  106466.04 ms / 32134 tokens (    3.31 ms per token,   301.82 tokens per second)
llama_perf_context_print:        eval time =   88601.05 ms /   834 runs   (  106.24 ms per token,     9.41 tokens per second)

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt  -<../text/portugal.txt;say done
Prompt: 32134 tokens, 421.590 tokens-per-sec
Generation: 554 tokens, 23.434 tokens-per-sec

./llama-cli -m ../models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 33000 -n 1000 --temp 0.0 --top_p 0.9 --seed 1000 -fa -f ../text/llama-portugal.txt;say done
llama_perf_context_print: prompt eval time =   81683.80 ms / 32134 tokens (    2.54 ms per token,   393.39 tokens per second)
llama_perf_context_print:        eval time =   27257.10 ms /   706 runs   (   38.61 ms per token,    25.90 tokens per second)

./llama-cli -m ../models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 33000 -n 1000 --temp 0.0 --top_p 0.9 --seed 1000 -f ../text/llama-portugal.txt;say done
llama_perf_context_print: prompt eval time =  106694.04 ms / 32134 tokens (    3.32 ms per token,   301.18 tokens per second)
llama_perf_context_print:        eval time =   90074.36 ms /   780 runs   (  115.48 ms per token,     8.66 tokens per second)

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-8bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt  -<../text/portugal.txt;say done
Prompt: 32134 tokens, 401.616 tokens-per-sec
Generation: 788 tokens, 19.020 tokens-per-sec
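
For completeness, the MLX runs can also be reproduced from Python instead of the CLI. Here is a minimal sketch, assuming the mlx_lm Python API (`load`/`generate`) and the same prompt file as above:

```python
# Minimal sketch of the 4-bit MLX run via the Python API (same model and
# prompt file as the CLI command above). verbose=True prints the prompt and
# generation tokens-per-sec figures quoted in the table.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

with open("../text/portugal.txt") as f:
    prompt = f.read()

text = generate(model, tokenizer, prompt=prompt, max_tokens=1000, verbose=True)
```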
@awni
Member

awni commented Oct 10, 2024

There are likely two factors here:

  • There is an optimization in the latest MLX LM (in the main branch but not yet released) that improves memory use and might help in your case for very long prompts if the system is under memory pressure.

  • Yes, MLX does not have a fast fused attention yet. This is WIP, and when it lands we should see a speedup, especially for the long-context cases (see the sketch below for the op in question).
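
For context, the fused attention op MLX exposes is `mx.fast.scaled_dot_product_attention`; the sketch below just shows how it is called (illustrative shapes, not tied to the model above), since that is the kernel the planned speedup targets:

```python
# Minimal sketch of MLX's fused attention op with illustrative shapes.
import mlx.core as mx

B, n_heads, L, D = 1, 32, 128, 128   # batch, heads, sequence length, head dim
q = mx.random.normal((B, n_heads, L, D))
k = mx.random.normal((B, n_heads, L, D))
v = mx.random.normal((B, n_heads, L, D))

# One fused kernel instead of separate matmul / softmax / matmul steps.
out = mx.fast.scaled_dot_product_attention(q, k, v, scale=1.0 / D**0.5)
print(out.shape)  # (1, 32, 128, 128)
```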

@chigkim
Author

chigkim commented Oct 10, 2024

Thanks for the response!
Yeah, I actually pulled from git this morning and installed from source, so the fix from #1027 should be included in my test.
Looking forward to fast fused attention! :)

@awni
Member

awni commented Oct 21, 2024

FYI we sped up the fused attention in MLX 0.19.0. It should be noticeably faster, though still a bit slower than llama.cpp at very long sequence lengths; there are still some optimizations to do there.
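
Before rerunning the numbers, it's worth confirming which versions are actually in use; a small sketch, assuming both packages were installed with pip:

```python
# Confirm the installed mlx / mlx-lm versions before re-benchmarking;
# the fused-attention speedup mentioned above requires mlx >= 0.19.0.
from importlib.metadata import version

print("mlx:", version("mlx"))
print("mlx-lm:", version("mlx-lm"))
```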

@chigkim
Author

chigkim commented Oct 21, 2024

That's awesome.
I just pulled, and ran the same test.
For 8-bit, prompt processing increased from 401.616 to 412.589 tokens-per-sec, and generation increased from 19.020 to 24.498 tokens-per-sec.

Interestingly, running the same test with 4-bit produced a bug where it generated the full 1000 max-tokens and just repeated the last two paragraphs over and over.

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt -<../text/portugal.txt;say done

Here is my full prompt (from a Wikipedia article), which is the same as in my original test.

Just to see what happens, I increased --max-kv-size to 34k and --max-tokens to 2000, and it generated the full 2000 tokens, still looping.

Running the exact same command with 8bit in place of 4bit generated the correct text and stopped at the end without looping. In my previous test, the 4-bit model did not do that.
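
In case it helps with triage, here is a quick plain-Python check (a sketch with a made-up example string) for the kind of tail repetition described above:

```python
# Rough check for a repetition loop: do the last few paragraphs of the
# generated text all match? (plain-Python sketch, made-up example below)
def tail_repeats(text: str, min_repeats: int = 3) -> bool:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return len(paragraphs) >= min_repeats and len(set(paragraphs[-min_repeats:])) == 1

looping = "Intro paragraph.\n\n" + "Same paragraph again.\n\n" * 4
print(tail_repeats(looping))  # True for this made-up looping text
```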

Should I create a separate issue?

Thanks!

@awni
Member

awni commented Oct 21, 2024

> Should I create a separate issue?

If the behavior changed from 0.18.1 to 0.19, then yes, it would be good to file another issue for that. Small changes due to numerics might make sense, but if it went from working to not working, that doesn't sound good.

@awni
Member

awni commented Nov 1, 2024

@chigkim are you still having this issue?
