Let lighteval support sglang #552
Conversation
- commit 132290b (Jayon02 <[email protected]>, Sat Feb 15 11:08:24 2025 +0000): modify document
- commit 601a755 (Jayon02 <[email protected]>, Sat Feb 15 10:22:43 2025 +0000): pass pre commit check and modify document
- commit 3e1fb88 (qiujiang chen <[email protected]>, Sat Feb 15 06:59:12 2025 +0000): optimize input, adjust precision
- commit 1a59076 (qiujiang chen <[email protected]>, Thu Feb 13 19:51:22 2025 +0000): text files
- commit 9dc62b7 (qiujiang chen <[email protected]>, Wed Feb 12 14:08:21 2025 +0000): modify format
Please let me know if there is anything that could be improved. Thanks to all SGLang team members for their help.
I conducted some experiments to compare the metrics of lighteval using the sglang and vllm backends. The experimental setup is as follows:

However, I found that the temperature parameter is set to 1.0 in some vllm tests. This parameter increases the randomness of the model's generated results. Here are some results (t below means temperature):

I found that when temperature = 0.0, the difference between sglang and vllm is very small. I also launched the vllm and sglang backends independently, without lighteval. When temperature = 1.0, the results generated by each backend differ on every run for the same input. I observed that each backend selects one output from multiple candidates, and both backends share the same set of possible outputs. When evaluating, I think we should consider all of these options, not just one, so the results in the table for temperature = 1.0 are inaccurate. I believe that for model evaluation, a more deterministic parameter setting should be chosen to better assess the model, so temperature = 0.0 is the more reasonable setting. I am very eager to hear your thoughts on this parameter choice.
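To make the determinism point concrete, here is a minimal sketch using the standalone vllm Python API (the model name is borrowed from the usage example at the end of this thread; this illustrates the sampling behavior, not lighteval's test harness):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="HuggingFaceH4/zephyr-7b-beta", dtype="float16")
prompt = "The capital of France is"

# temperature=0.0 -> greedy decoding: the argmax token is chosen at every
# step, so repeated runs on the same input produce identical outputs.
greedy = SamplingParams(temperature=0.0, max_tokens=32)

# temperature=1.0 -> sampling from the unscaled distribution: repeated runs
# can pick different completions from the same set of likely candidates,
# which is what makes the temperature = 1.0 scores fluctuate.
sampled = SamplingParams(temperature=1.0, max_tokens=32)

for params in (greedy, sampled):
    texts = [llm.generate([prompt], params)[0].outputs[0].text for _ in range(3)]
    print(f"t={params.temperature}: {texts}")
```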
@Jayon02 great work!
@@ -31,6 +33,7 @@ appropriate extras group.
| adapters | To evaluate adapters models (delta and peft) |
| tensorboardX | To upload your results to tensorboard |
| vllm | To use vllm as backend for inference |
| sglang | To use sglang as backend for inference |
You did not modify the pyproject file to reflect this extra.
I linked the sglang documentation here for users to install it. SGLang has some limitations that prevent it from being added via pyproject. If we add `sglang = [sglang>=0.4.2.post]`, then `pip install lighteval[sglang]` can't install the dependencies correctly, so I hope users will install sglang by following the sglang documentation.
Great! Well, in that case it's not really an extra, and I would not add it to the table.
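For context, a common pattern when a dependency cannot be declared as a pip extra (a generic sketch, not necessarily what this PR implements) is to probe for the package at runtime and point users to the upstream installation docs:

```python
import importlib.util


def is_sglang_available() -> bool:
    # sglang is installed out-of-band (by following the SGLang docs),
    # so we detect it at runtime rather than declaring it as an extra.
    return importlib.util.find_spec("sglang") is not None


if not is_sglang_available():
    raise ImportError(
        "sglang is not installed. Please follow the SGLang installation "
        "documentation before using the sglang backend."
    )
```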
# Would we rather truncate the prompt to allow generation to go to max_new_tokens, at the risk
# of losing some meaning, or have some generations that are exceedingly short?
# The choice we go for here is to avoid truncating the prompt if we can, since it
# should have been managed by the prompt creator/few shot manager if requested by the user.
Sometimes, even with 0 shots, the prompt is too long; in that case I think we should truncate the prompt and allow generation, since we also evaluate models on context length.
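A minimal sketch of the truncation policy suggested here, assuming a transformers tokenizer and a hypothetical fit_prompt helper (lighteval's actual handling lives in its model backends): reserve the generation budget first, then trim the prompt from the left only if it would overflow the context window.

```python
from transformers import AutoTokenizer


def fit_prompt(prompt: str, tokenizer, max_model_len: int, max_new_tokens: int) -> str:
    """Left-truncate the prompt so that prompt + generation fits the context window."""
    budget = max_model_len - max_new_tokens  # tokens available for the prompt
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    if len(ids) <= budget:
        return prompt  # the prompt already fits: no truncation needed
    # Drop the oldest tokens first; the end of the prompt usually carries
    # the actual question or the final few-shot example.
    return tokenizer.decode(ids[-budget:])


tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
long_prompt = "Some repeated context sentence. " * 1000
short_prompt = fit_prompt(long_prompt, tokenizer, max_model_len=4096, max_new_tokens=256)
```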
@Jayon02 Great work on this!! Thanks a lot for the contribution. I think the PR is good to go; you only need to address a few nits I mentioned above :)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@Jayon02 This PR is great. Could you resolve / close the conversation if you solved it?
- commit 1035aa5 (Jayon02 <[email protected]>, Tue Feb 18 01:31:21 2025 +0000): modify document and fix bug
- commit be58c7c (Jayon02 <[email protected]>, Mon Feb 17 14:35:03 2025 +0000): modify toml
- commit 86e41c9 (merge of 132290b and 50f3695; Jayon02 <[email protected]>, Sun Feb 16 01:30:17 2025 +0000): Merge branch 'main' into sglang
- commit 132290b (Jayon02 <[email protected]>, Sat Feb 15 11:08:24 2025 +0000): modify document
- commit 601a755 (Jayon02 <[email protected]>, Sat Feb 15 10:22:43 2025 +0000): pass pre commit check and modify document
- commit 3e1fb88 (qiujiang chen <[email protected]>, Sat Feb 15 06:59:12 2025 +0000): optimize input, adjust precision
- commit 1a59076 (qiujiang chen <[email protected]>, Thu Feb 13 19:51:22 2025 +0000): text files
- commit 9dc62b7 (qiujiang chen <[email protected]>, Wed Feb 12 14:08:21 2025 +0000): modify format
Waiting for tests, and it should be ready!
@Jayon02 Great work!
You can use sglang in lighteval tasks:

```bash
lighteval sglang \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "helm|bigbench:bbq_lite_json:age_disambig|0|0"
```