Add LiveCodeBench's codegeneration task from lighteval (#346)

* Add lcb:codegeneration task from lighteval
* Add results from R1 Qwen 32B
huggingface · Feb 19, 2025 · 740a7a4
1 parent 9cf270d
commit 740a7a4

Showing 2 changed files with 31 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -344,6 +344,36 @@ lighteval vllm $MODEL_ARGS "custom|gpqa:diamond|0|0" \
 python scripts/run_benchmarks.py --model-id={model_id}  --benchmarks gpqa
 ```
 
+### LiveCodeBench
+
+We are able to reproduce DeepSeek's reported results on the LiveCodeBench code generation benchmark within ~1-3 standard deviations:
+
+| Model                         | LiveCodeBench (🤗 LightEval) | LiveCodeBench (DeepSeek Reported) |
+|:------------------------------|:---------------------------:|:---------------------------------:|
+| DeepSeek-R1-Distill-Qwen-1.5B |            16.3             |               16.9                |
+| DeepSeek-R1-Distill-Qwen-7B   |            36.6             |               37.6                |
+| DeepSeek-R1-Distill-Qwen-14B  |            51.5             |               53.1                |
+| DeepSeek-R1-Distill-Qwen-32B  |            56.6             |               57.2                |
+| DeepSeek-R1-Distill-Llama-8B  |            37.0             |               39.6                |
+| DeepSeek-R1-Distill-Llama-70B |            54.5             |               57.5                |
+
+To reproduce these results, use the following command:
+
+```shell
+NUM_GPUS=1 # Set to 8 for the 32B and 70B models; for the smaller models, data_parallel_size=8 can be used instead for speed
+MODEL=deepseek-ai/{model_name}
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS,generation_parameters={temperature:0.6,top_p:0.95}"
+OUTPUT_DIR=data/evals/$MODEL
+
+lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
+    --use-chat-template \
+    --output-dir $OUTPUT_DIR
+```
+
+```shell
+python scripts/run_benchmarks.py --model-id={model_id}  --benchmarks lcb
+```
+
 ## Data generation
 
 ### Generate data from a smol distilled R1 model

diff --git a/src/open_r1/utils/evaluation.py b/src/open_r1/utils/evaluation.py
@@ -50,6 +50,7 @@ def register_lighteval_task(
 register_lighteval_task(LIGHTEVAL_TASKS, "custom", "aime24", "aime24", 0)
 register_lighteval_task(LIGHTEVAL_TASKS, "custom", "aime25", "aime25", 0)
 register_lighteval_task(LIGHTEVAL_TASKS, "custom", "gpqa", "gpqa:diamond", 0)
+register_lighteval_task(LIGHTEVAL_TASKS, "extended", "lcb", "lcb:codegeneration", 0)
 
 
 def get_lighteval_tasks():
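
The new registration line maps the short benchmark name `lcb` to the `extended|lcb:codegeneration|0|0` task string used in the README command. The body of `register_lighteval_task` is not shown in this diff, so the following is only a minimal sketch of how such a mapping could work. The names `format_lighteval_task`, `register_task_sketch`, and `TASKS_SKETCH` are hypothetical, and the assumption that the last field is always `0` is taken from the two command strings visible above.

```python
from typing import Dict


def format_lighteval_task(suite: str, task: str, num_fewshot: int = 0) -> str:
    """Build a 'suite|task|num_fewshot|0' string (hypothetical helper).

    The trailing "0" mirrors the final field of the task strings in the
    README commands; its meaning is not stated in this diff.
    """
    return f"{suite}|{task}|{num_fewshot}|0"


def register_task_sketch(
    tasks: Dict[str, str], suite: str, name: str, task: str, num_fewshot: int = 0
) -> None:
    """Store the formatted task string under its short benchmark name."""
    tasks[name] = format_lighteval_task(suite, task, num_fewshot)


# Hypothetical registry standing in for LIGHTEVAL_TASKS
TASKS_SKETCH: Dict[str, str] = {}
register_task_sketch(TASKS_SKETCH, "custom", "gpqa", "gpqa:diamond", 0)
register_task_sketch(TASKS_SKETCH, "extended", "lcb", "lcb:codegeneration", 0)
```

Under these assumptions, passing `--benchmarks lcb` to `scripts/run_benchmarks.py` would look up the `lcb` key and hand the stored string to lighteval, which is why the suite changes from `custom` to `extended` for this task.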