Multi node vLLM #530
base: main
Conversation
Force-pushed (…d an example for the lost souls) from f017ca0 to a5ad8b5
self._tokenizer = AutoTokenizer.from_pretrained(self.model)
else:
    raise
Please add a more explicit error when raising
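For example, something along these lines; the surrounding `if` condition is outside the visible diff, so this is only an illustration of raising an explicit error instead of a bare `raise`:

```python
# Illustrative only: the real condition guarding this branch is not shown in the diff.
if self.model is not None:
    self._tokenizer = AutoTokenizer.from_pretrained(self.model)
else:
    raise ValueError(
        "Could not resolve a tokenizer for the OpenAI-compatible endpoint: "
        "provide a model name or local path that AutoTokenizer.from_pretrained can load."
    )
```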
It's looking good, thanks a lot! I'll let @NathanHB take a look too but if tests pass we can merge
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Test suite is failing, can you check why?
Hi,
Using vLLM in a multi-node setup can be tricky, and I encountered some issues on my personal supercomputer with the latest version of lighteval. This PR makes modifications to improve the multi-node experience with vLLM, as well as adds examples and updates the documentation for easier discoverability.
There are 4 commits, all of which can be adjusted or reverted if they don't align with lighteval’s intended usage.
First commit: 64bef7d
I encountered an infinite hang in multi-node setups, occurring in the cleanup function while the process group is being destroyed. Calling these teardown functions via Ray seems to resolve the issue without side effects. I also added an enforce_eager parameter to avoid CUDA Graph-related crashes in multi-node setups; eager mode may be slightly slower, but it prevents the crashes, so it is exposed as an option that defaults to False.
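Not the exact code from the commit, just a rough illustration of the idea, assuming the teardown is dispatched as a Ray task rather than run directly in the driver (the real wiring may instead go through vLLM's own Ray workers):

```python
import ray
import torch.distributed as dist

@ray.remote(num_cpus=0)
def _teardown_distributed():
    # Teardown of vLLM's model-parallel state and the torch.distributed process
    # group. The import path below matches recent vLLM versions and may differ.
    from vllm.distributed.parallel_state import destroy_model_parallel

    destroy_model_parallel()
    if dist.is_initialized():
        dist.destroy_process_group()

def cleanup():
    # Schedule the destroy calls through Ray instead of invoking them directly
    # in the main process, which is where the multi-node hang was observed.
    ray.get(_teardown_distributed.remote())
```

On the enforce_eager side, the flag maps onto vLLM's existing `LLM(..., enforce_eager=True)` argument, which disables CUDA graph capture.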
Second commit: df9ff05
This commit addresses using vllm serve with lighteval endpoint openai, which allows multiple lighteval calls to a single vllm server and avoids peak memory usage on rank 0. To support this, I added the option to define an OPENAI_BASE_URL. If undefined, it falls back to the default OpenAI API. I also added logic to handle tokenizers for custom vllm servers (currently using AutoTokenizer). I'll admit that my approach might be questionable, but I wanted to highlight the issue and offer a proposal.
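As a minimal sketch of the behaviour described above (names are illustrative, not the actual lighteval attributes):

```python
import os

from openai import OpenAI
from transformers import AutoTokenizer

# If OPENAI_BASE_URL is set, requests go to the custom vLLM server; otherwise
# the client falls back to the default OpenAI API.
base_url = os.environ.get("OPENAI_BASE_URL")  # e.g. "http://head-node:8000/v1"
client = OpenAI(base_url=base_url) if base_url else OpenAI()

# For a custom server the tokenizer cannot be fetched from OpenAI, so it is
# resolved locally via AutoTokenizer from the model name (example model below).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
```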
Third commit: 22bc2d1
Documentation update for lighteval vllm in multi-node settings. I reused content from an existing vllm page and added a new multi-node example; placement is subjective.
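For reference (this is background, not the documentation text itself), the multi-node knobs on the vLLM side look roughly like this, assuming a Ray cluster already spans the nodes; the model and parallelism values are placeholders:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=8,                     # e.g. GPUs per node
    pipeline_parallel_size=2,                   # e.g. number of nodes
    distributed_executor_backend="ray",         # reuse the existing Ray cluster
    enforce_eager=True,                         # see the first commit
)
outputs = llm.generate(["Hello from a multi-node setup!"])
```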
Fourth commit: f017ca0
Documentation update for lighteval endpoint openai in multi-node setups, with a new example for vllm_serve. Placement is subjective.
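The pattern the example targets is vLLM's OpenAI-compatible server (started with something like `vllm serve <model> --port 8000` on the head node) queried through the standard openai client; the URL and model below are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the vLLM OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM does not check the key unless configured to
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Say hello from the vLLM server."}],
)
print(response.choices[0].message.content)
```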