Multi node vLLM #530
base: main
Conversation
Force-pushed (…d an example for the lost souls) from f017ca0 to a5ad8b5
self._tokenizer = AutoTokenizer.from_pretrained(self.model)
else:
    raise
Please add a more explicit error when raising
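For example, something along these lines; the surrounding `if` condition is outside the visible diff, so this is only an illustration of raising an explicit error instead of a bare `raise`:

```python
# Illustrative only: the real condition guarding this branch is not shown in the diff.
if self.model is not None:
    self._tokenizer = AutoTokenizer.from_pretrained(self.model)
else:
    raise ValueError(
        "Could not resolve a tokenizer for the OpenAI-compatible endpoint: "
        "provide a model name or local path that AutoTokenizer.from_pretrained can load."
    )
```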
It's looking good, thanks a lot! I'll let @NathanHB take a look too but if tests pass we can merge
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Test suite is failing, can you check why?
Hi,
Using vLLM in a multi-node setup can be tricky, and I encountered some issues on my personal supercomputer with the latest version of lighteval. This PR makes modifications to improve the multi-node experience with vLLM, as well as adds examples and updates the documentation for easier discoverability.
There are 4 commits, all of which can be adjusted or reverted if they don't align with lighteval’s intended usage.
First commit: 64bef7d
I encountered an infinite hang in multi-node setups, occurring in the cleanup function while the process group is being destroyed. Calling these teardown functions via Ray seems to resolve the issue without side effects. I also added an enforce_eager parameter to avoid CUDA Graph-related crashes in multi-node setups; eager mode may be slightly slower, but it prevents the crashes, so it is exposed as an option that defaults to False.
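Not the exact code from the commit, just a rough illustration of the idea, assuming the teardown is dispatched as a Ray task rather than run directly in the driver (the real wiring may instead go through vLLM's own Ray workers):

```python
import ray
import torch.distributed as dist

@ray.remote(num_cpus=0)
def _teardown_distributed():
    # Teardown of vLLM's model-parallel state and the torch.distributed process
    # group. The import path below matches recent vLLM versions and may differ.
    from vllm.distributed.parallel_state import destroy_model_parallel

    destroy_model_parallel()
    if dist.is_initialized():
        dist.destroy_process_group()

def cleanup():
    # Schedule the destroy calls through Ray instead of invoking them directly
    # in the main process, which is where the multi-node hang was observed.
    ray.get(_teardown_distributed.remote())
```

On the enforce_eager side, the flag maps onto vLLM's existing `LLM(..., enforce_eager=True)` argument, which disables CUDA graph capture.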
Second commit: df9ff05
This commit addresses using vllm serve with lighteval endpoint openai, which allows multiple lighteval calls to a single vllm server and avoids peak memory usage on rank 0. To support this, I added the option to define an OPENAI_BASE_URL. If undefined, it falls back to the default OpenAI API. I also added logic to handle tokenizers for custom vllm servers (currently using AutoTokenizer). I'll admit that my approach might be questionable, but I wanted to highlight the issue and offer a proposal.
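As a minimal sketch of the behaviour described above (names are illustrative, not the actual lighteval attributes):

```python
import os

from openai import OpenAI
from transformers import AutoTokenizer

# If OPENAI_BASE_URL is set, requests go to the custom vLLM server; otherwise
# the client falls back to the default OpenAI API.
base_url = os.environ.get("OPENAI_BASE_URL")  # e.g. "http://head-node:8000/v1"
client = OpenAI(base_url=base_url) if base_url else OpenAI()

# For a custom server the tokenizer cannot be fetched from OpenAI, so it is
# resolved locally via AutoTokenizer from the model name (example model below).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
```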
Third commit: 22bc2d1
Documentation update for lighteval vllm in multi-node settings. I reused content from an existing vllm page and added a new multi-node example; placement is subjective.
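For reference (this is background, not the documentation text itself), the multi-node knobs on the vLLM side look roughly like this, assuming a Ray cluster already spans the nodes; the model and parallelism values are placeholders:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=8,                     # e.g. GPUs per node
    pipeline_parallel_size=2,                   # e.g. number of nodes
    distributed_executor_backend="ray",         # reuse the existing Ray cluster
    enforce_eager=True,                         # see the first commit
)
outputs = llm.generate(["Hello from a multi-node setup!"])
```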
Fourth commit: f017ca0
Documentation update for lighteval endpoint openai in multi-node setups, with a new example for vllm_serve. Placement is subjective.
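The pattern the example targets is vLLM's OpenAI-compatible server (started with something like `vllm serve <model> --port 8000` on the head node) queried through the standard openai client; the URL and model below are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the vLLM OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM does not check the key unless configured to
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Say hello from the vLLM server."}],
)
print(response.choices[0].message.content)
```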