Multi-node vLLM #530

Open · wants to merge 5 commits into base: main
Conversation

ncassereau

Hi,

Using vLLM in a multi-node setup can be tricky, and I ran into several issues on the supercomputer I use with the latest version of lighteval. This PR improves the multi-node experience with vLLM, adds examples, and updates the documentation so these workflows are easier to discover.

There are 4 commits, each of which can be adjusted or reverted if it doesn't align with lighteval's intended usage.

First commit: 64bef7d

I encountered an infinite hang in multi-node setups, occurring in the cleanup function during process group destruction. Calling these teardown functions through Ray seems to resolve the issue without side effects. I also added an enforce_eager parameter to avoid CUDA Graph-related crashes in multi-node setups; eager mode may be slightly slower, but it prevents these crashes, and the parameter defaults to False.
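To illustrate the shape of the change, here is a minimal sketch; the exact teardown helpers and their import paths depend on the vLLM version, so treat the names below as assumptions rather than the literal diff:

```python
import gc

import ray
import torch


def cleanup():
    # vLLM's distributed teardown helpers; exact import paths vary
    # across vLLM versions (an assumption, not the PR's literal code).
    from vllm.distributed.parallel_state import (
        destroy_distributed_environment,
        destroy_model_parallel,
    )

    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.cuda.empty_cache()


# Calling cleanup() directly on the driver hung across nodes; dispatching
# it as a Ray task avoided the hang in my setup. Ray is assumed to be
# initialized already (vLLM sets it up itself in multi-node mode).
ray.get(ray.remote(cleanup).remote())
```

The enforce_eager parameter is simply passed through to vLLM's LLM constructor (enforce_eager=True), which skips CUDA Graph capture.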

Second commit: df9ff05

This commit addresses using vllm serve with lighteval endpoint openai, which allows multiple lighteval runs to share a single vLLM server and avoids the peak memory usage on rank 0. To support this, I added the option to define an OPENAI_BASE_URL; if it is undefined, the client falls back to the default OpenAI API. I also added logic to handle tokenizers for custom vLLM servers (currently using AutoTokenizer). I'll admit my approach might be questionable, but I wanted to highlight the issue and offer a proposal.
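Roughly, the fallback looks like this (variable names and the model id are illustrative, not the exact PR code):

```python
import os

from openai import OpenAI
from transformers import AutoTokenizer

# If OPENAI_BASE_URL is set, target a custom OpenAI-compatible server
# (e.g. one started with `vllm serve`); if unset, base_url=None makes
# the client fall back to the default OpenAI API.
base_url = os.environ.get("OPENAI_BASE_URL")  # e.g. "http://head-node:8000/v1"
client = OpenAI(
    base_url=base_url,
    # vLLM servers usually accept any key; "EMPTY" is a common placeholder.
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
)

# A custom vLLM server serves a Hugging Face model, so its tokenizer can
# be loaded locally with AutoTokenizer instead of OpenAI's tiktoken.
model_name = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical example
tokenizer = AutoTokenizer.from_pretrained(model_name)

response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Hello!"}],
)
```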

Third commit: 22bc2d1

Documentation update for lighteval vllm in multi-node settings. I reused content from an existing vllm page and added a new multi-node example; the placement is open to discussion.

Fourth commit: f017ca0

Documentation update for lighteval endpoint openai in multi-node setups, with a new example for vllm_serve. Again, the placement is open to discussion.


```python
    self._tokenizer = AutoTokenizer.from_pretrained(self.model)
else:
    raise
```
Member:
Please add a more explicit error when raising
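For illustration, a more explicit re-raise could look like the following; the surrounding try/except context is reconstructed from the fragment above and is an assumption, not the actual diff:

```python
try:
    # Assumed context: first try OpenAI's own tokenizer registry, which
    # raises KeyError for models it does not know about.
    self._tokenizer = tiktoken.encoding_for_model(self.model)
except KeyError as e:
    if self.base_url is not None:  # custom OpenAI-compatible server (assumed)
        self._tokenizer = AutoTokenizer.from_pretrained(self.model)
    else:
        raise ValueError(
            f"No tokenizer found for model '{self.model}': it is neither a "
            "known OpenAI model nor served from a custom endpoint with a "
            "Hugging Face tokenizer."
        ) from e
```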

@clefourrier (Member) left a comment:

It's looking good, thanks a lot! I'll let @NathanHB take a look too, but if tests pass we can merge.

@HuggingFaceDocBuilderDev (Collaborator):

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@clefourrier (Member):

The test suite is failing; can you check why?
