Add extended task for LiveCodeBench codegeneration #548

Merged: 23 commits into huggingface:main, Feb 18, 2025

Conversation

@plaguss (Contributor) commented Feb 10, 2025

Adds a new extended task to run LiveCodeBench's codegeneration subset.

The results for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B:

lighteval vllm \
    "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=float16,data_parallel_size=4,max_model_length=32768,gpu_memory_utilisation=0.8,generation_parameters={temperature: 0.7}" \
    "extended|lcb:codegeneration|0|0" \
    --use-chat-template

Or with a YAML file like so:

model:
  base_params:
    model_args: "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=float16,data_parallel_size=4,max_model_length=32768,gpu_memory_utilisation=0.8"
  generation:
    temperature: 0.6
    top_p: 0.95

lighteval vllm \
    "lcb.yaml" \
    "extended|lcb:codegeneration|0|0" \
    --use-chat-template
...
|            Task             |Version|Metric|Value|   |Stderr|
|-----------------------------|------:|------|----:|---|-----:|
|all                          |       |maj@16|0.163|±  |0.0188|
|extended:lcb:codegeneration:0|      0|maj@16|0.163|±  |0.0188|

Note: This is just an idea; I'm not sure it's the best approach.

Additionally, it adds a way of updating the number of samples required to run a metric via the YAML file:

model:
  base_params:
    model_args: "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=bfloat16,data_parallel_size=4,max_model_length=32768,gpu_memory_utilisation=0.8"
  generation:
    temperature: 0.6
    top_p: 0.95
  metric_options:
    codegen_pass@1:16:
      num_samples: 16

Under metric_options, an entry keyed by the metric_name to be updated can be added. For now it only handles num_samples, but defined like this it shouldn't need further changes to support other options. Otherwise, the number of samples can still be set through the metric name itself (as in codegen_pass@1:16).
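
For illustration, here is a minimal sketch of how such a metric_options entry could be applied to a task, using simplified stand-in Task/Metric objects; the names mirror the snippets discussed later in this PR, but this is not the merged implementation:

from dataclasses import dataclass, field


# Hypothetical, simplified stand-ins for the real task/metric objects,
# just to illustrate how a metric_options entry can override num_samples.
@dataclass
class Metric:
    metric_name: str


@dataclass
class Task:
    metrics: list[Metric]
    num_samples: list[int] = field(default_factory=list)


def apply_metric_options(task: Task, metric_options: dict[str, dict]) -> None:
    """Override per-metric sampling counts using the YAML's metric_options block."""
    for metric in task.metrics:
        if metric_data := metric_options.get(metric.metric_name):
            if num_samples := metric_data.get("num_samples"):
                task.num_samples = [num_samples]


# With the YAML above, metric_options would be {"codegen_pass@1:16": {"num_samples": 16}}.
task = Task(metrics=[Metric("codegen_pass@1:16")])
apply_metric_options(task, {"codegen_pass@1:16": {"num_samples": 16}})
print(task.num_samples)  # [16]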

@HuggingFaceDocBuilderDev (Collaborator) commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NathanHB (Member) commented:

Hi! Thanks for the PR.
To select dates, I think the only way would be to select the right dataset splits in the task config; there is no way of doing it from the CLI.
For different prompts, it's not possible to do it at runtime; you need to define it at the task level.
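
Purely as an illustrative sketch of doing the date selection at the task/dataset level rather than from the CLI (shown here via filtering rather than split selection), assuming a LiveCodeBench-style dataset with a contest_date field; the repo id and field name below are assumptions, not something defined in this PR:

from datasets import load_dataset

# Hedged sketch: filter a LiveCodeBench-style dataset by contest date inside the
# task's dataset preparation step. The repo id and the "contest_date" field are
# assumptions about the dataset schema, not taken from this PR.
dataset = load_dataset("livecodebench/code_generation_lite", split="test")
recent = dataset.filter(lambda row: row["contest_date"] >= "2024-08-01")
print(f"{len(recent)} problems on or after 2024-08-01")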

@plaguss (Contributor, Author) commented Feb 14, 2025

There's a job currently running with the following command:

lighteval vllm \
    "lcb.yaml" \
    "extended|lcb:codegeneration|0|0" \
    --custom-tasks src/lighteval/tasks/extended/lcb/main.py \
    --system-prompt "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\n\n" \
    --output-dir $OUTPUT_DIR \
    --save-details

and yaml file:

model:
  base_params:
    model_args: "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=bfloat16,tensor_parallel_size=4,max_model_length=32768,gpu_memory_utilisation=0.8"
  generation:
    temperature: 0.6
    top_p: 0.95

I'll mark the PR ready for review once a full run completes.
It's been running for 4.5 hours and seems to be about halfway through the generations; GPU utilization looks like the attached screenshot.

@lewtun (Member) commented Feb 15, 2025

Hi @plaguss @NathanHB will it be possible to run this eval without needing a YAML file?

The reason I ask is that all of our codebases assume one can run lighteval vllm {ARGS} where we just populate {ARGS} at runtime. Having a YAML adds another layer of complexity, where we would need to grep / regex the model_args params and update them.

Also, perhaps we can speed this up dramatically by using data_parallel_size instead of tensor_parallel_size for models that fit on a single H100 (i.e. use a full node with 8 copies)?

@plaguss (Contributor, Author) commented Feb 16, 2025

> Hi @plaguss @NathanHB will it be possible to run this eval without needing a YAML file?
>
> The reason I ask is that all of our codebases assume one can run lighteval vllm {ARGS} where we just populate {ARGS} at runtime. Having a YAML adds another layer of complexity, where we would need to grep / regex the model_args params and update them.

Hi Lewis, I couldn't find a way of passing the generation parameters via the CLI, which seem relevant for this model. I can update the code to pass them through ARGS (it should be here unless there's already a better way, @NathanHB?).

NEW:
I added the following logic to allow reading the arguments from the CLI to simplify things:

lighteval vllm \
    "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=float16,data_parallel_size=4,max_model_length=32768,gpu_memory_utilisation=0.8,generation_parameters={temperature:0.7,top_p:5}" \
    "extended|lcb:codegeneration|0|0" \
    --custom-tasks src/lighteval/tasks/extended/lcb/main.py \
    --output-dir $OUTPUT_DIR \
    --save-details

Now we could read the generation parameters from the model args following this pattern; let me know what you both think.
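
For context, here is a rough sketch of how a generation_parameters={...} block could be split out of the comma-separated model args string; split_generation_parameters is a hypothetical helper for illustration, not necessarily the code added in this PR:

import re
from typing import Any


def split_generation_parameters(model_args: str) -> tuple[str, dict[str, Any]]:
    """Split a generation_parameters={...} block out of a comma-separated
    key=value model args string and parse it into a dict of floats."""
    match = re.search(r"generation_parameters=\{([^}]*)\}", model_args)
    if match is None:
        return model_args, {}

    # Parse "temperature:0.7,top_p:0.95" into {"temperature": 0.7, "top_p": 0.95}.
    params: dict[str, Any] = {}
    for pair in match.group(1).split(","):
        key, value = pair.split(":")
        params[key.strip()] = float(value)

    # Drop the block (and any stray comma) so only plain key=value pairs remain.
    cleaned = re.sub(r",?generation_parameters=\{[^}]*\}", "", model_args).strip(",")
    return cleaned, params


remaining_args, gen_params = split_generation_parameters(
    "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=float16,"
    "generation_parameters={temperature:0.7,top_p:0.95}"
)
print(remaining_args)  # pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=float16
print(gen_params)      # {'temperature': 0.7, 'top_p': 0.95}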

> Also, perhaps we can speed this up dramatically by using data_parallel_size instead of tensor_parallel_size for models that fit on a single H100 (i.e. use a full node with 8 copies)?

Sure, I ran it with data_parallel_size=4 for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B; it took approximately 4 hours.

@plaguss marked this pull request as ready for review on February 16, 2025 08:08
@plaguss requested a review from @NathanHB on February 17, 2025 08:22
@plaguss (Contributor, Author) commented Feb 17, 2025

The 32B is still running due to an error, but the other values can be found here:

| Model | Lighteval (Replica) | LiveCodeBench (DeepSeek Reported) |
|-------|--------------------:|----------------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B | 0.163 | 0.169 |
| DeepSeek-R1-Distill-Qwen-7B | 0.366 | 0.376 |
| DeepSeek-R1-Distill-Qwen-8B | 0.370 | 0.396 |
| DeepSeek-R1-Distill-Qwen-14B | 0.515 | 0.531 |
| DeepSeek-R1-Distill-Qwen-32B | 0.566 | 0.572 |
| DeepSeek-R1-Distill-Qwen-70B | 0.545 | 0.575 |

@NathanHB (Member) commented:

> The 32B is still running due to an error, but the other values can be found here:
>
> | Model | Lighteval (Replica) | LiveCodeBench (DeepSeek Reported) |
> |-------|--------------------:|----------------------------------:|
> | DeepSeek-R1-Distill-Qwen-1.5B | 0.163 | 0.169 |
> | DeepSeek-R1-Distill-Qwen-7B | 0.366 | 0.376 |
> | DeepSeek-R1-Distill-Qwen-8B | 0.370 | 0.396 |
> | DeepSeek-R1-Distill-Qwen-14B | 0.515 | 0.531 |
> | DeepSeek-R1-Distill-Qwen-32B | - | 0.572 |
> | DeepSeek-R1-Distill-Qwen-70B | 0.545 | 0.575 |

Great! Thanks for adding a way to pass generation params as args.

@NathanHB (Member) left a comment:

Great work on this! The results look great! I was only wondering about dynamically changing the metric config at runtime, and whether you could add some docs.
Otherwise ready to merge :)

@@ -134,9 +134,11 @@ def vllm(
    with open(model_args, "r") as f:
        config = yaml.safe_load(f)["model"]
    model_args = config["base_params"]["model_args"]
    metric_options = config.get("metric_options", {})
Review comment (Member):

can you add some docs for this?

if metric_data := self._metric_options.get(metric.metric_name, None):
    num_samples = metric_data.get("num_samples", None)
    if num_samples:
        task.num_samples.append(num_samples)
Review comment (Member):

has this been tested?

@plaguss (Contributor, Author) replied:

Done, it actually had 2 bugs, thanks! Now it works as expected:

            for metric in task.metrics:
                if metric_data := self._metric_options.get(metric.metric_name, None):
                    num_samples = metric_data.get("num_samples", None)
                    if num_samples:
                        task.num_samples = [num_samples]

@NathanHB merged commit fd479ee into huggingface:main on Feb 18, 2025
3 checks passed