
Replication of Still 1.5 Preview AIME: 33% AIME not 39% #22

Open
michaelzhiluo opened this issue Jan 31, 2025 · 9 comments

Comments

michaelzhiluo commented Jan 31, 2025

I ran STILL 1.5 with 16 samples per problem and got 31% Pass@1 averaged over the 16 passes of AIME, with temperature 0.6 and top_p = 0.95 (both as in DeepSeek).

Are there other parameters that need to be tuned to replicate these results?

michaelzhiluo (Author) commented:

Update: We tried a 32K context length (the maximum for Qwen) and ran 16 passes:

Total problems: 30
Pass@1 average over 16 runs: 32.71%
Pass@16: 73.33%
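
For reference, here is a minimal sketch of how these two numbers can be computed from per-run correctness results; the 30×16 boolean matrix (filled with dummy data here) is an assumption about bookkeeping, not the actual evaluation code:

import numpy as np

# correct[i, j] is True if run j solved AIME problem i; shape (30, 16).
# Dummy data is used here purely to illustrate the metrics.
rng = np.random.default_rng(0)
correct = rng.random((30, 16)) < 0.33

# Pass@1 averaged over 16 runs: accuracy of each run, then the mean across runs.
pass_at_1_avg = correct.mean(axis=0).mean()

# Pass@16: fraction of problems solved by at least one of the 16 runs.
pass_at_16 = correct.any(axis=1).mean()

print(f"Pass@1 average over 16 runs: {pass_at_1_avg:.2%}")
print(f"Pass@16: {pass_at_16:.2%}")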

@michaelzhiluo michaelzhiluo changed the title Replication of Still 1.5 Preview AIME Results Replication of Still 1.5 Preview AIME: 32% AIME not 39% Feb 3, 2025
HazekiahWon commented:

Update: We tried a 32K context length (the maximum for Qwen) and ran 16 passes:

Total problems: 30; Pass@1 average over 16 runs: 32.71%; Pass@16: 73.33%

Hi, I ran the default script. The default behavior is to draw 5 samples for each query (top_p 0.95, temperature 0.6), and it gets 55/150 = 36.67%.

Is the result mismatch due to randomness?

Therefore, I want to ask: is it common practice in the community to evaluate Pass@1 using sampling? If so, how should the number of samples (and the other sampling parameters) be chosen? All of them seem to affect the final result and introduce randomness compared with greedy decoding.

michaelzhiluo (Author) commented:

I recommend using at least 16 passes. A small number of passes gives high-variance results that are not representative. DeepSeek averaged over 64 runs, for example.

I have also tried Pass@1 averaged over 64 runs and couldn't get past 32% on AIME. They must have also fixed a seed; I use a random seed every time.

michaelzhiluo (Author) commented Feb 6, 2025

Some updates...
I explicitly converted the weights to bfloat16 for the vLLM dtype and passed the exact same SamplingParams to their engine. I got a best score of 33.1% for Pass@1 averaged over 16 runs, but that was probably a lucky seed. Still a long way from 39%, though!

More updates! I have shifted evaluation from my own implementation to the verl repository, and I am getting 32.5% consistently.
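
For concreteness, a minimal sketch of the generation setup described above, assuming vLLM; the checkpoint path and prompt list are placeholders, not the actual verl evaluation code:

from vllm import LLM, SamplingParams

# Serve the weights in bfloat16, matching the explicit dtype conversion above.
# The checkpoint path is a placeholder.
llm = LLM(model="path/to/STILL-3-1.5b-preview", dtype="bfloat16")

# DeepSeek-style sampling: temperature 0.6, top_p 0.95, 16 samples per prompt.
# max_tokens reflects the 32K budget mentioned earlier; seed is optional and
# only fixes the sampling randomness for this request.
sampling_params = SamplingParams(
    n=16,
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
    seed=0,
)

# Placeholder prompts; in practice these are the chat-templated AIME problems.
prompts = ["<chat-templated AIME problem>"]
outputs = llm.generate(prompts, sampling_params)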

@michaelzhiluo michaelzhiluo changed the title Replication of Still 1.5 Preview AIME: 32% AIME not 39% Replication of Still 1.5 Preview AIME: 33% AIME not 39% Feb 6, 2025
michaelzhiluo (Author) commented:

Hi All, we've released Deepscaler: https://github.com/agentica-project/deepscaler

The results for STILL are in the GitHub README.md; if you have a different method that better evaluates your model, let me know!

Timothy023 (Collaborator) commented:

Hi,

For STILL-3-1.5b-preview, we do not add any additional prompt to the user message. The following is an example:

from transformers import AutoTokenizer

# Load the tokenizer for STILL-3-1.5b-preview (the checkpoint path is a placeholder).
tokenizer = AutoTokenizer.from_pretrained("path/to/STILL-3-1.5b-preview")

chat_prob = tokenizer.apply_chat_template(
    [
        {
            "role": "system",
            "content": "You are a helpful and harmless assistant. You should think step-by-step.",
        },
        {"role": "user", "content": "1 + 1 = ?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
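
As a usage sketch (not from the official evaluation code), the same template can be applied over all of the AIME problems before generation; `aime_problems` below is a hypothetical list of problem statements:

# Build one chat-templated prompt per AIME problem with the same system message.
aime_problems = ["<AIME problem statement>"]  # placeholder; the real set has 30 problems
prompts = [
    tokenizer.apply_chat_template(
        [
            {
                "role": "system",
                "content": "You are a helpful and harmless assistant. You should think step-by-step.",
            },
            {"role": "user", "content": problem},
        ],
        tokenize=False,
        add_generation_prompt=True,
    )
    for problem in aime_problems
]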

You can reproduce our results by using our evaluation code in Slow_Thinking_with_LLMs/OpenRLHF-STILL/evaluation.

Hope this can help you.

EliverQ (Contributor) commented Feb 11, 2025

Hi All, we've released Deepscaler: https://github.com/agentica-project/deepscaler

The results for STILL are in the GitHub README.md; if you have a different method that better evaluates your model, let me know!

Additionally, the evaluation settings are as follows: "For MATH and AIME, we employed a sampling decoding setup with a sampling temperature of 0.6 and a top-p sampling probability of 0.95. Each question was sampled 64 times, and the average score was calculated."

We would appreciate it if you could update the results in your README after reproducing the outcomes using our settings and code.

EliverQ (Contributor) commented Feb 11, 2025

Update: We tried a 32K context length (the maximum for Qwen) and ran 16 passes:
Total problems: 30; Pass@1 average over 16 runs: 32.71%; Pass@16: 73.33%

Hi, I ran the default script. The default behavior is to draw 5 samples for each query (top_p 0.95, temperature 0.6), and it gets 55/150 = 36.67%.

Is the result mismatch due to randomness?

Therefore, I want to ask: is it common practice in the community to evaluate Pass@1 using sampling? If so, how should the number of samples (and the other sampling parameters) be chosen? All of them seem to affect the final result and introduce randomness compared with greedy decoding.

Hi,

Thank you for your feedback! You're right that increasing the number of samples generally makes the evaluation more robust, especially with small test sets like AIME, which has only 30 questions. We also observed that models such as DeepSeek-R1 are often evaluated with 64 samples per question to improve result stability. Additionally, the temperature parameter (e.g., 0.6) also influences the diversity and quality of the outputs.

I hope this information helps! If you have any further questions, feel free to reach out.

Best regards,

michaelzhiluo (Author) commented:

Hi all, thanks for replying.

We use the exact same sampling parameters in our evaluation script (based on verl). Can you try our codebase?

You can see agentica-project/deepscaler#3: people were able to replicate Deepscaler, indicating that our evaluation is reliable.
