Replication of Still 1.5 Preview AIME: 33% AIME not 39% #22
Update: We tried 32K (the max context length for Qwen) and ran it at pass@16: Total problems: 30
Hi, I ran the default script. The default behavior is to do 5 samplings for each query (top-p 0.95, temperature 0.6), and it gets 55/150 = 36.67%. Is the mismatch in results due to randomness? So I want to ask: is it common practice in the community to evaluate pass@1 using sampling? If so, how should one choose the number of samplings (and the other sampling parameters)? All of them seem to affect the final result and introduce randomness compared with greedy decoding.
I recommend trying at least pass@16. A low number of passes gives high variance and does not reflect the true result. DeepSeek, for example, averaged over 64 runs. I have also tried pass@1 averaged over 64 runs and couldn't get it past 32% on AIME. They must also have fixed a seed; I use a random seed every time.
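As a rough sketch of the two quantities being discussed (assuming the usual average-accuracy definition of "pass@1 averaged over n" and the unbiased pass@k estimator from the Codex paper; this is not code from any of the repos mentioned here):

```python
# Sketch of the two estimators discussed above (assumed definitions,
# not code from the linked repositories).
from math import comb

def avg_at_1(n_samples: int, n_correct: int) -> float:
    """'pass@1 averaged over n': mean accuracy across n sampled completions.
    Its standard error shrinks roughly as 1/sqrt(n), which is why 64 samples
    give a much more stable number than 1 or 5."""
    return n_correct / n_samples

def pass_at_k(n_samples: int, n_correct: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n_samples - n_correct < k:
        return 1.0
    return 1.0 - comb(n_samples - n_correct, k) / comb(n_samples, k)

# Hypothetical example: 20 correct completions out of 64 samples on one problem.
print(avg_at_1(64, 20))       # 0.3125
print(pass_at_k(64, 20, 16))  # probability that at least one of 16 samples is correct
```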
Some updates... More updates! I have switched evaluation from my own implementation to the verl repository. I am consistently getting 32.5%.
Hi all, we've released Deepscaler: https://github.com/agentica-project/deepscaler. The results for STILL are in the GitHub README.md; if you have a better method to evaluate your model, let me know!
Hi, for STILL-3-1.5b-preview, we do not add additional prompts in user messages. The following is an example:
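A minimal sketch, assuming the question is passed as the only user message through the tokenizer's chat template with no extra instructions; the model id and problem text below are placeholders:

```python
# Sketch: feed the raw AIME question as the sole user message, with no
# added system prompt or instruction text. Model id and question are
# placeholder assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("RUC-AIBOX/STILL-3-1.5B-preview")  # assumed id
messages = [{"role": "user", "content": "<AIME problem statement>"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```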
You can reproduce our results by using our evaluation code. Hope this can help you.
Additionally, the evaluation settings are as follows: "For MATH and AIME, we employed a sampling decoding setup with a sampling temperature of 0.6 and a top-p sampling probability of 0.95. Each question was sampled 64 times, and the average score was calculated." We would appreciate it if you could update the results in your README after reproducing the outcomes using our settings and code.
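For reference, a minimal sketch of these settings with vLLM (the model id, context length, and answer-checking function are assumptions, not the authors' evaluation code):

```python
# Sketch of the stated settings (temperature 0.6, top-p 0.95, 64 samples
# per question, averaged score) using vLLM. The model id, max length, and
# is_correct() below are placeholder assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="RUC-AIBOX/STILL-3-1.5B-preview", max_model_len=32768)  # assumed id/length
params = SamplingParams(temperature=0.6, top_p=0.95, n=64, max_tokens=30000)  # headroom for the prompt

def is_correct(completion: str, answer: str) -> bool:
    # Placeholder: a real checker extracts the final boxed answer and compares it.
    return answer in completion

def score_question(prompt: str, answer: str) -> float:
    """Average accuracy over the 64 sampled completions for one question."""
    outputs = llm.generate([prompt], params)[0].outputs
    return sum(is_correct(o.text, answer) for o in outputs) / len(outputs)
```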
Hi, thank you for your feedback! You're right that increasing the number of samplings generally enhances the robustness of the evaluation, especially with small test sets like AIME, which has only 30 questions. We also observed that models like DeepSeek-R1 often perform 64 samplings to improve result stability. Additionally, adjusting the temperature parameter (e.g., 0.6) can also influence the diversity and quality of the outputs. I hope this information helps! If you have any further questions, feel free to reach out. Best regards,
Hi all, thanks for replying. We use the exact same sampling parameters in our evaluation script (based on verl). Can you try our codebase? You can see here: agentica-project/deepscaler#3. People were able to replicate Deepscaler, indicating that our evaluation is reliable.
I ran Still 1.5 and got 31% pass@1 averaged over 16 passes of AIME, with temperature 0.6 (as in DeepSeek) and top_p=0.95 (as in DeepSeek).
Are there other parameters that need to be tuned to replicate these results?