
Replication of Still 1.5 Preview AIME: 33% AIME not 39% #22

Open
michaelzhiluo opened this issue Jan 31, 2025 · 9 comments

Comments

michaelzhiluo commented Jan 31, 2025

I ran STILL 1.5 with 16 samples per problem and got 31% Pass@1 averaged over the 16 passes of AIME, with temperature 0.6 and top_p = 0.95 (both as in DeepSeek).

Are there other parameters that need to be tuned to replicate these results?

michaelzhiluo (Author) commented:

Update: We tried a 32K context length (the maximum for Qwen) and ran 16 passes:

Total problems: 30
Pass@1 average over 16 runs: 32.71%
Pass@16: 73.33%
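
For reference, here is a minimal sketch of how these two numbers can be computed from per-run correctness results; the 30×16 boolean matrix (filled with dummy data here) is an assumption about bookkeeping, not the actual evaluation code:

import numpy as np

# correct[i, j] is True if run j solved AIME problem i; shape (30, 16).
# Dummy data is used here purely to illustrate the metrics.
rng = np.random.default_rng(0)
correct = rng.random((30, 16)) < 0.33

# Pass@1 averaged over 16 runs: accuracy of each run, then the mean across runs.
pass_at_1_avg = correct.mean(axis=0).mean()

# Pass@16: fraction of problems solved by at least one of the 16 runs.
pass_at_16 = correct.any(axis=1).mean()

print(f"Pass@1 average over 16 runs: {pass_at_1_avg:.2%}")
print(f"Pass@16: {pass_at_16:.2%}")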

@michaelzhiluo michaelzhiluo changed the title Replication of Still 1.5 Preview AIME Results Replication of Still 1.5 Preview AIME: 32% AIME not 39% Feb 3, 2025
HazekiahWon commented:

Update: We tried a 32K context length (the maximum for Qwen) and ran 16 passes:

Total problems: 30; Pass@1 average over 16 runs: 32.71%; Pass@16: 73.33%

Hi, I ran the default script. The default behavior is to draw 5 samples for each query (top_p 0.95, temperature 0.6), and it gets 55/150 = 36.67%.

Is the result mismatch due to randomness?

Therefore, I want to ask: is it common practice in the community to evaluate Pass@1 using sampling? If so, how should the number of samples (and the other sampling parameters) be chosen? All of them seem to affect the final result and introduce randomness compared with greedy decoding.

michaelzhiluo (Author) commented:

I recommend using at least 16 passes. A small number of passes gives high-variance results that are not representative. DeepSeek averaged over 64 runs, for example.

I have also tried Pass@1 averaged over 64 runs and couldn't get past 32% on AIME. They must have also fixed a seed; I use a random seed every time.

michaelzhiluo (Author) commented Feb 6, 2025

Some updates...
I explicitly converted the weights to bfloat16 for the vLLM dtype and passed the exact same SamplingParams to their engine. I got a best score of 33.1% for Pass@1 averaged over 16 runs, but that was probably a lucky seed. Still a long way from 39%, though!

More updates! I have shifted evaluation from my own implementation to the verl repository, and I am getting 32.5% consistently.
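
For concreteness, a minimal sketch of the generation setup described above, assuming vLLM; the checkpoint path and prompt list are placeholders, not the actual verl evaluation code:

from vllm import LLM, SamplingParams

# Serve the weights in bfloat16, matching the explicit dtype conversion above.
# The checkpoint path is a placeholder.
llm = LLM(model="path/to/STILL-3-1.5b-preview", dtype="bfloat16")

# DeepSeek-style sampling: temperature 0.6, top_p 0.95, 16 samples per prompt.
# max_tokens reflects the 32K budget mentioned earlier; seed is optional and
# only fixes the sampling randomness for this request.
sampling_params = SamplingParams(
    n=16,
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
    seed=0,
)

# Placeholder prompts; in practice these are the chat-templated AIME problems.
prompts = ["<chat-templated AIME problem>"]
outputs = llm.generate(prompts, sampling_params)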

@michaelzhiluo michaelzhiluo changed the title Replication of Still 1.5 Preview AIME: 32% AIME not 39% Replication of Still 1.5 Preview AIME: 33% AIME not 39% Feb 6, 2025
michaelzhiluo (Author) commented:

Hi All, we've released Deepscaler: https://github.com/agentica-project/deepscaler

The results for STILL are in the GitHub README.md; if you have a different method that better evaluates your model, let me know!

Timothy023 (Collaborator) commented:

Hi,

For STILL-3-1.5b-preview, we do not add any additional prompt to the user message. The following is an example:

from transformers import AutoTokenizer

# Load the tokenizer for STILL-3-1.5b-preview (the checkpoint path is a placeholder).
tokenizer = AutoTokenizer.from_pretrained("path/to/STILL-3-1.5b-preview")

chat_prob = tokenizer.apply_chat_template(
    [
        {
            "role": "system",
            "content": "You are a helpful and harmless assistant. You should think step-by-step.",
        },
        {"role": "user", "content": "1 + 1 = ?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
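
As a usage sketch (not from the official evaluation code), the same template can be applied over all of the AIME problems before generation; `aime_problems` below is a hypothetical list of problem statements:

# Build one chat-templated prompt per AIME problem with the same system message.
aime_problems = ["<AIME problem statement>"]  # placeholder; the real set has 30 problems
prompts = [
    tokenizer.apply_chat_template(
        [
            {
                "role": "system",
                "content": "You are a helpful and harmless assistant. You should think step-by-step.",
            },
            {"role": "user", "content": problem},
        ],
        tokenize=False,
        add_generation_prompt=True,
    )
    for problem in aime_problems
]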

You can reproduce our results by using our evaluation code in Slow_Thinking_with_LLMs/OpenRLHF-STILL/evaluation.

Hope this can help you.

EliverQ (Contributor) commented Feb 11, 2025

Hi All, we've released Deepscaler: https://github.com/agentica-project/deepscaler

The results for STILL are in the GitHub README.md; if you have a different method that better evaluates your model, let me know!

Additionally, the evaluation settings are as follows: "For MATH and AIME, we employed a sampling decoding setup with a sampling temperature of 0.6 and a top-p sampling probability of 0.95. Each question was sampled 64 times, and the average score was calculated."

We would appreciate it if you could update the results in your README after reproducing the outcomes using our settings and code.

EliverQ (Contributor) commented Feb 11, 2025

Update: We tried a 32K context length (the maximum for Qwen) and ran 16 passes:
Total problems: 30; Pass@1 average over 16 runs: 32.71%; Pass@16: 73.33%

Hi, I ran the default script. The default behavior is to draw 5 samples for each query (top_p 0.95, temperature 0.6), and it gets 55/150 = 36.67%.

Is the result mismatch due to randomness?

Therefore, I want to ask: is it common practice in the community to evaluate Pass@1 using sampling? If so, how should the number of samples (and the other sampling parameters) be chosen? All of them seem to affect the final result and introduce randomness compared with greedy decoding.

Hi,

Thank you for your feedback! You're right that increasing the number of samples generally makes the evaluation more robust, especially with small test sets like AIME, which has only 30 questions. We also observed that models such as DeepSeek-R1 are often evaluated with 64 samples per question to improve result stability. Additionally, the temperature parameter (e.g., 0.6) also influences the diversity and quality of the outputs.

I hope this information helps! If you have any further questions, feel free to reach out.

Best regards,

michaelzhiluo (Author) commented:

Hi all, thanks for replying.

We use the exact same sampling parameters in our evaluation script (based on verl). Can you try our codebase?

You can see agentica-project/deepscaler#3: people were able to replicate Deepscaler, indicating that our evaluation is reliable.
