Let lighteval support sglang #552
Conversation
- commit 132290b (Jayon02 <[email protected]>, Sat Feb 15 11:08:24 2025 +0000): modify document
- commit 601a755 (Jayon02 <[email protected]>, Sat Feb 15 10:22:43 2025 +0000): pass pre commit check and modify document
- commit 3e1fb88 (qiujiang chen <[email protected]>, Sat Feb 15 06:59:12 2025 +0000): optimize input, adjust precision
- commit 1a59076 (qiujiang chen <[email protected]>, Thu Feb 13 19:51:22 2025 +0000): text files
- commit 9dc62b7 (qiujiang chen <[email protected]>, Wed Feb 12 14:08:21 2025 +0000): modify format
Please let me know if there is anything that could be improved. Thanks to all SGLang team members for their help.
I conducted some experiments to compare the metrics of lighteval using the sglang and vllm backends. The experimental setup is as follows:

However, I found that the temperature parameter is set to 1.0 in some vllm tests. This parameter increases the randomness of the model's generated results. Here are some results (t below means temperature):

I found that when temperature = 0.0, the difference between sglang and vllm is very small. I also launched the vllm and sglang backends independently, without lighteval. When temperature = 1.0, the results generated by each backend differ on every run for the same input. I observed that each backend selects one output from multiple candidates, and both backends share the same set of possible outputs. When evaluating, I think we should consider all of these options, not just one, so the results in the table for temperature = 1.0 are inaccurate. I believe that for model evaluation, a more deterministic parameter setting should be chosen to better assess the model, so temperature = 0.0 is the more reasonable setting. I am very eager to hear your thoughts on this parameter choice.
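To make the determinism point concrete, here is a minimal sketch using the standalone vllm Python API (the model name is borrowed from the usage example at the end of this thread; this illustrates the sampling behavior, not lighteval's test harness):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="HuggingFaceH4/zephyr-7b-beta", dtype="float16")
prompt = "The capital of France is"

# temperature=0.0 -> greedy decoding: the argmax token is chosen at every
# step, so repeated runs on the same input produce identical outputs.
greedy = SamplingParams(temperature=0.0, max_tokens=32)

# temperature=1.0 -> sampling from the unscaled distribution: repeated runs
# can pick different completions from the same set of likely candidates,
# which is what makes the temperature = 1.0 scores fluctuate.
sampled = SamplingParams(temperature=1.0, max_tokens=32)

for params in (greedy, sampled):
    texts = [llm.generate([prompt], params)[0].outputs[0].text for _ in range(3)]
    print(f"t={params.temperature}: {texts}")
```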
@Jayon02 great work!
@@ -31,6 +33,7 @@ appropriate extras group.
| adapters | To evaluate adapters models (delta and peft) |
| tensorboardX | To upload your results to tensorboard |
| vllm | To use vllm as backend for inference |
| sglang | To use sglang as backend for inference |
You did not modify the pyproject file to reflect this extra.
I linked the sglang documentation here for users to install it. SGLang has some limitations that prevent it from being added via pyproject. If we add `sglang = [sglang>=0.4.2.post]`, then `pip install lighteval[sglang]` can't install the dependencies correctly, so I hope users will install sglang by following the sglang documentation.
Great! Well, in that case it's not really an extra, and I would not add it to the table.
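For context, a common pattern when a dependency cannot be declared as a pip extra (a generic sketch, not necessarily what this PR implements) is to probe for the package at runtime and point users to the upstream installation docs:

```python
import importlib.util


def is_sglang_available() -> bool:
    # sglang is installed out-of-band (by following the SGLang docs),
    # so we detect it at runtime rather than declaring it as an extra.
    return importlib.util.find_spec("sglang") is not None


if not is_sglang_available():
    raise ImportError(
        "sglang is not installed. Please follow the SGLang installation "
        "documentation before using the sglang backend."
    )
```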
# Would we rather truncate the prompt to allow generation to go to max_new_tokens, at the risk
# of losing some meaning, or have some generations that are exceedingly short?
# The choice we go for here is to avoid truncating the prompt if we can, since it
# should have been managed by the prompt creator/few shot manager if requested by the user.
Sometimes, even with 0 shots, the prompt is too long; in that case I think we should truncate the prompt and allow generation, since we also evaluate models on context length.
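A minimal sketch of the truncation policy suggested here, assuming a transformers tokenizer and a hypothetical fit_prompt helper (lighteval's actual handling lives in its model backends): reserve the generation budget first, then trim the prompt from the left only if it would overflow the context window.

```python
from transformers import AutoTokenizer


def fit_prompt(prompt: str, tokenizer, max_model_len: int, max_new_tokens: int) -> str:
    """Left-truncate the prompt so that prompt + generation fits the context window."""
    budget = max_model_len - max_new_tokens  # tokens available for the prompt
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    if len(ids) <= budget:
        return prompt  # the prompt already fits: no truncation needed
    # Drop the oldest tokens first; the end of the prompt usually carries
    # the actual question or the final few-shot example.
    return tokenizer.decode(ids[-budget:])


tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
long_prompt = "Some repeated context sentence. " * 1000
short_prompt = fit_prompt(long_prompt, tokenizer, max_model_len=4096, max_new_tokens=256)
```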
@Jayon02 Great work on this!! Thanks a lot for the contribution. I think the PR is good to go; you only need to address a few nits I mentioned above :)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@Jayon02 This PR is great. Could you resolve / close the conversation if you solved it?
- commit 1035aa5 (Jayon02 <[email protected]>, Tue Feb 18 01:31:21 2025 +0000): modify document and fix bug
- commit be58c7c (Jayon02 <[email protected]>, Mon Feb 17 14:35:03 2025 +0000): modify toml
- commit 86e41c9 (merge of 132290b and 50f3695; Jayon02 <[email protected]>, Sun Feb 16 01:30:17 2025 +0000): Merge branch 'main' into sglang
- commit 132290b (Jayon02 <[email protected]>, Sat Feb 15 11:08:24 2025 +0000): modify document
- commit 601a755 (Jayon02 <[email protected]>, Sat Feb 15 10:22:43 2025 +0000): pass pre commit check and modify document
- commit 3e1fb88 (qiujiang chen <[email protected]>, Sat Feb 15 06:59:12 2025 +0000): optimize input, adjust precision
- commit 1a59076 (qiujiang chen <[email protected]>, Thu Feb 13 19:51:22 2025 +0000): text files
- commit 9dc62b7 (qiujiang chen <[email protected]>, Wed Feb 12 14:08:21 2025 +0000): modify format
Waiting for tests, and it should be ready!
@Jayon02 Great work!
You can use sglang in lighteval tasks:

```bash
lighteval sglang \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "helm|bigbench:bbq_lite_json:age_disambig|0|0"
```