
How to select DPO subset? #36

Open
qychen2001 opened this issue Dec 22, 2024 · 11 comments

@qychen2001

To create the dataset, we first selected 100K high-quality Magpie instructions with diverse task categories, then generated responses using Llama 3 8B Instruct 5 times for each instruction, using a temperature of 0.8. We then annotated RM scores using RLHFlow/ArmoRM-Llama3-8B-v0.1, labeling the response with the highest RM score as the chosen response, and the one with the lowest RM score as the rejected response.
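
A minimal sketch of that generation-and-scoring step might look like the following (vLLM is used here purely for illustration, and rm_score is a placeholder for the RLHFlow/ArmoRM-Llama3-8B-v0.1 scoring call described in its model card):

# Sketch: sample 5 responses per instruction at temperature 0.8, then keep the
# best/worst response by reward-model score as chosen/rejected.
# `rm_score` is a placeholder for the ArmoRM-Llama3-8B-v0.1 scoring call.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
llm = LLM(model=MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
params = SamplingParams(n=5, temperature=0.8, max_tokens=2048)

def rm_score(instruction: str, response: str) -> float:
    """Placeholder: score (instruction, response) with RLHFlow/ArmoRM-Llama3-8B-v0.1."""
    raise NotImplementedError

def build_pair(instruction: str) -> dict:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": instruction}],
        tokenize=False, add_generation_prompt=True,
    )
    outputs = llm.generate([prompt], params)[0].outputs  # 5 candidate completions
    scored = sorted(outputs, key=lambda o: rm_score(instruction, o.text))
    return {
        "prompt": instruction,
        "chosen": scored[-1].text,    # highest RM score
        "rejected": scored[0].text,   # lowest RM score
    }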

Very wonderful work!
I have filtered the 300K data, and I would like to know how to obtain this 100K subset for synthesizing the DPO data.
If you could provide the data-filtering code for this part, I believe it would be very helpful.

@zhangchen-xu
Member

Hi Qiyuan,

Thank you for your question. This 100K was filtered empirically lol. We noted that the original Magpie dataset had too many information-seeking and advice-seeking entries, so we manually decreased their proportion in the DPO phase and made the task categories more diverse and balanced.

For example, for Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1, we first applied the following filters to the raw dataset:

  • Difficulty >= medium
  • Input Quality >= good
  • Reward >= -5

We then randomly sampled the following amounts:

  • 30K Information Seeking & Advice Seeking
  • 15K Coding & Debugging
  • 25K Math
  • 30K from all other task categories

... and got 100K instructions with more diverse and balanced task categories (a rough code sketch follows below).
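
In code, the procedure above looks roughly like the following sketch with the datasets library; the dataset name, column names, and category labels here are illustrative and may not exactly match the released schema.

# Rough sketch of the filtering + per-category sampling described above.
# Dataset name, column names, and category labels are illustrative only.
from datasets import load_dataset, concatenate_datasets

raw = load_dataset("Magpie-Align/Magpie-Llama-3.1-Pro-1M-v0.1", split="train")

DIFFICULTY_OK = {"medium", "hard", "very hard"}   # Difficulty >= medium
QUALITY_OK = {"good", "excellent"}                # Input Quality >= good

filtered = raw.filter(
    lambda x: x["difficulty"] in DIFFICULTY_OK
    and x["input_quality"] in QUALITY_OK
    and x["instruct_reward"] >= -5                # Reward >= -5
)

def sample_category(ds, categories, n, seed=42):
    """Randomly sample up to n rows whose task_category is in `categories`."""
    subset = ds.filter(lambda x: x["task_category"] in categories)
    return subset.shuffle(seed=seed).select(range(min(n, len(subset))))

info_advice = sample_category(filtered, {"Information seeking", "Advice seeking"}, 30_000)
coding = sample_category(filtered, {"Coding & Debugging"}, 15_000)
math = sample_category(filtered, {"Math"}, 25_000)
other_categories = (
    set(filtered.unique("task_category"))
    - {"Information seeking", "Advice seeking", "Coding & Debugging", "Math"}
)
others = sample_category(filtered, other_categories, 30_000)

dpo_instructions = concatenate_datasets([info_advice, coding, math, others]).shuffle(seed=42)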

Please let me know if you need more information; I am happy to discuss! We will add these details to the appendix in our next arXiv update!

@qychen2001
Author

Thank you for your reply!
I found an interesting phenomenon in my experiments: when I use UltraChat for SFT and then apply Magpie Air DPO on top of it for preference optimization, the result is not very good, but with UltraFeedback the result is good. I wonder what you think about this?

@zhangchen-xu
Member

Because the response distributions of UltraChat and Magpie-Air are too different. When applying Magpie DPO to an UltraChat SFT checkpoint, the model may find it really awkward to learn lol. Usually, when doing SFT+DPO, the distribution shift between the two phases should not be too large...

For a successful DPO, UltraChat + UltraFeedback are fine; Magpie SFT + Magpie DPO are also fine.

@qychen2001
Author

Thank you very much for your reply!
In fact, I had considered this issue, but since UltraChat is usually used as the SFT data in research, I did not try Magpie's SFT data directly.
Interestingly, when I built Qwen DPO data myself, it achieved very good results on top of UltraChat.
Perhaps I should try replacing the SFT model with a version trained on Magpie.
Thank you again for your enthusiastic reply!

@zhangchen-xu
Member

lol you can also use Magpie for scientific research! It's bigger and has higher quality~

Feel free to ask more questions and I am happy to answer~

@qychen2001
Author

Yes, I plan to use Magpie as the SFT dataset, hoping to achieve better results.
If possible, I wonder if you could share other models trained with Magpie, such as Qwen. I think it would be very helpful to the community.

@zhangchen-xu
Member

I haven't trained many Qwen models, but I will definitely do it when I have more GPUs available!

@qychen2001
Author

Thank you very much! I hope an official release is possible, since that would guarantee the performance and allow a fair comparison. I will train a Qwen version myself, and if I can roughly reproduce the performance, I will submit a PR.

@zhangchen-xu
Member

Wow that's great! Thank you so much!

@qychen2001
Author

I tried the methods you mentioned and they worked great!
Also, I'm curious about the value of beta in DPO. I've found that Magpie, which builds pairs by sampling multiple responses, has smaller reward differences than UltraFeedback, and it seems that a larger beta works better.

@zhangchen-xu
Member

beta = 0.01 is a nice option. Here is an example of the training config we used to train our DPO model:

# Customized Configs
model_name_or_path: Magpie-Align/Llama-3-8B-Magpie-Align-SFT-v0.3
hub_model_id: Magpie-Align/Llama-3-8B-Magpie-Align-v0.3-RC
output_dir: alignment_handbook_out/Llama-3-8B-Magpie-Align-v0.3-RC
run_name: Llama-3-8B-Magpie-Align-v0.3-RC

dataset_mixer:
  princeton-nlp/llama3-ultrafeedback-armorm: 1.0
dataset_splits:
- train
- test
preprocessing_num_workers: 24

# DPOTrainer arguments
bf16: true
beta: 0.01
learning_rate: 0.7e-6
gradient_accumulation_steps: 8
per_device_train_batch_size: 2
per_device_eval_batch_size: 4
num_train_epochs: 1
max_length: 2048
max_prompt_length: 1800
warmup_ratio: 0.1
logging_steps: 1
lr_scheduler_type: cosine
optim: adamw_torch

torch_dtype: null
use_flash_attention_2: true
do_eval: true
evaluation_strategy: steps
eval_steps: 100
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
log_level: info
push_to_hub: true
save_strategy: "steps"
save_steps: 100
save_total_limit: 1
seed: 42
report_to:
- wandb

You can find all the configs in the corresponding model cards.

Just a reminder that DPO performance is very sensitive to the learning rate... You may need to spend some time doing a grid search for the optimal learning rate. Empirically, Magpie datasets work well with a learning rate between 0.5e-6 and 1.0e-6.
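
As background on the beta discussion above: beta scales the implicit reward margin in the DPO objective, which is why datasets with smaller chosen/rejected reward gaps can interact differently with it. A minimal PyTorch sketch of the loss (illustrative, not the alignment-handbook/TRL implementation itself) is:

# Minimal DPO loss sketch showing where beta enters. Inputs are summed
# per-sequence log-probabilities under the policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.01):
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # A larger beta amplifies the chosen/rejected margin inside the sigmoid,
    # which matters more when the raw reward gap between the pair is small.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()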
