How to select DPO subset? #36
Comments
Hi Qiyuan, Thank you for your question. This 100K was filtered empirically lol. We noted that the original Magpie dataset had too many information-seeking and advice-seeking entries, so we manually decreased their proportion in the DPO phase and made the task categories more diverse and balanced. For example, for Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1, we first applied the following filter to the raw dataset:
We then randomly sampled these amounts:
... and got 100K instructions with more diverse and balanced task categories. Please let me know if you need more information and I am happy to discuss! We will add these details to the appendix in our next arXiv update!
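A rough sketch of what this category-balanced subsampling could look like, assuming a Hugging Face dataset with a `task_category` column; the dataset path and the per-category quotas below are placeholders for illustration only, not the filter or counts actually used by the authors:

```python
# Minimal sketch of category-based resampling, assuming the raw Magpie
# dataset exposes a `task_category` column. The quotas below are purely
# illustrative and are NOT the numbers used for
# Magpie-Llama-3.1-Pro-DPO-100K-v0.1.
import random
from collections import defaultdict

from datasets import load_dataset

RAW_DATASET = "path/to/raw-magpie-dataset"  # placeholder: substitute the actual raw dataset
ds = load_dataset(RAW_DATASET, split="train")

# Group row indices by task category.
by_category = defaultdict(list)
for idx, category in enumerate(ds["task_category"]):
    by_category[category].append(idx)

# Illustrative quotas: shrink the over-represented categories
# (information seeking, advice seeking) and keep the rest as-is.
quotas = {"Information seeking": 20_000, "Advice seeking": 5_000}

random.seed(42)
selected = []
for category, indices in by_category.items():
    k = min(quotas.get(category, len(indices)), len(indices))
    selected.extend(random.sample(indices, k))

subset = ds.select(sorted(selected))
print(f"Kept {len(subset)} examples across {len(by_category)} task categories")
```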
Thank you for your reply!
Because the response distributions of UltraChat and Magpie-Air are too different. Therefore, when applying Magpie DPO to an UltraChat SFT checkpoint, the model may find it really awkward to learn lol. Usually, when doing SFT+DPO, the distribution shift between the two phases should not be too large... For a successful DPO, UltraChat + UltraFeedback are fine; Magpie SFT + Magpie DPO are also fine.
Thank you very much for your reply!
lol you can also use Magpie for scientific research! It's bigger and has higher quality~ Feel free to ask more questions and I am happy to answer~
Yes, I plan to use Magpie as the SFT dataset, hoping to achieve better results.
I haven't trained too many models on Qwen, but I will definitely do it when I have more GPUs available!
Thank you very much! I hope it can be provided officially, since that would guarantee the performance and allow a fair comparison. I will train a Qwen version, and if I can roughly reproduce the performance, I will submit a PR.
Wow that's great! Thank you so much!
I tried the methods you guys mentioned and it worked great! |
beta = 0.01 is a nice option. Here is an example of training configs we used to train our DPO model:
You can find all configs here and here within the model cards. Just a reminder that DPO performance is very sensitive to the learning rate... You may need to spend some time on a grid search for the optimal learning rate. Empirically, Magpie datasets work well with a learning rate between 0.5e-6 and 1.0e-6.
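For reference, a minimal sketch of how the suggested beta = 0.01 and a learning rate in the 0.5e-6 to 1.0e-6 range could be plugged into a DPO run with TRL's DPOTrainer. This is not the authors' actual config (see their model cards): the base checkpoint is a placeholder, the dataset is assumed to expose (or be mapped to) prompt/chosen/rejected columns, and argument names may vary slightly across TRL versions.

```python
# Hedged sketch of a DPO run with beta = 0.01 and a learning rate inside the
# 0.5e-6 to 1.0e-6 range mentioned above. NOT the authors' actual config.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "path/to/sft-checkpoint"  # placeholder: your SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assumes a preference dataset with "prompt", "chosen", and "rejected" columns
# (map/rename fields first if the schema differs).
train_dataset = load_dataset(
    "Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1", split="train"
)

args = DPOConfig(
    output_dir="dpo-magpie",
    beta=0.01,            # beta value suggested in the thread
    learning_rate=7e-7,   # within 0.5e-6 to 1.0e-6; grid-search this
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # use processing_class=tokenizer on newer TRL versions
)
trainer.train()
```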
Wonderful work!
After filtering 300k examples, I want to know how to obtain the 100k subset used to synthesize the DPO data.
If you can provide the data filtering code for this part, I believe it would be very helpful.