
How to select DPO subset? #36

Open
qychen2001 opened this issue Dec 22, 2024 · 11 comments

@qychen2001

To create the dataset, we first selected 100K high-quality Magpie instructions with diverse task categories, then generated responses using Llama 3 8B Instruct 5 times for each instruction, using a temperature of 0.8. We then annotated RM scores using RLHFlow/ArmoRM-Llama3-8B-v0.1, labeling the response with the highest RM score as the chosen response, and the one with the lowest RM score as the rejected response.
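
A minimal sketch of that generation-and-scoring step might look like the following (vLLM is used here purely for illustration, and rm_score is a placeholder for the RLHFlow/ArmoRM-Llama3-8B-v0.1 scoring call described in its model card):

# Sketch: sample 5 responses per instruction at temperature 0.8, then keep the
# best/worst response by reward-model score as chosen/rejected.
# `rm_score` is a placeholder for the ArmoRM-Llama3-8B-v0.1 scoring call.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
llm = LLM(model=MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
params = SamplingParams(n=5, temperature=0.8, max_tokens=2048)

def rm_score(instruction: str, response: str) -> float:
    """Placeholder: score (instruction, response) with RLHFlow/ArmoRM-Llama3-8B-v0.1."""
    raise NotImplementedError

def build_pair(instruction: str) -> dict:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": instruction}],
        tokenize=False, add_generation_prompt=True,
    )
    outputs = llm.generate([prompt], params)[0].outputs  # 5 candidate completions
    scored = sorted(outputs, key=lambda o: rm_score(instruction, o.text))
    return {
        "prompt": instruction,
        "chosen": scored[-1].text,    # highest RM score
        "rejected": scored[0].text,   # lowest RM score
    }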

Very wonderful work!
I have filtered the 300K data, and I would like to know how to obtain this 100K subset for synthesizing the DPO data.
If you could provide the data-filtering code for this part, I believe it would be very helpful.

@zhangchen-xu
Member

Hi Qiyuan,

Thank you for your question. This 100K was filtered empirically lol. We noted that the original Magpie dataset had too many information-seeking and advice-seeking entries, so we manually decreased their proportion in the DPO phase and made the task categories more diverse and balanced.

For example, for Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1, we first applied the following filters to the raw dataset:

  • Difficulty >= medium
  • Input Quality >= good
  • Reward >= -5

We then randomly sampled the following amounts:

  • 30K Information Seeking & Advice Seeking
  • 15K Coding & Debugging
  • 25K Math
  • 30K from all other task categories

... and got 100K instructions with more diverse and balanced task categories (a rough code sketch follows below).
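
In code, the procedure above looks roughly like the following sketch with the datasets library; the dataset name, column names, and category labels here are illustrative and may not exactly match the released schema.

# Rough sketch of the filtering + per-category sampling described above.
# Dataset name, column names, and category labels are illustrative only.
from datasets import load_dataset, concatenate_datasets

raw = load_dataset("Magpie-Align/Magpie-Llama-3.1-Pro-1M-v0.1", split="train")

DIFFICULTY_OK = {"medium", "hard", "very hard"}   # Difficulty >= medium
QUALITY_OK = {"good", "excellent"}                # Input Quality >= good

filtered = raw.filter(
    lambda x: x["difficulty"] in DIFFICULTY_OK
    and x["input_quality"] in QUALITY_OK
    and x["instruct_reward"] >= -5                # Reward >= -5
)

def sample_category(ds, categories, n, seed=42):
    """Randomly sample up to n rows whose task_category is in `categories`."""
    subset = ds.filter(lambda x: x["task_category"] in categories)
    return subset.shuffle(seed=seed).select(range(min(n, len(subset))))

info_advice = sample_category(filtered, {"Information seeking", "Advice seeking"}, 30_000)
coding = sample_category(filtered, {"Coding & Debugging"}, 15_000)
math = sample_category(filtered, {"Math"}, 25_000)
other_categories = (
    set(filtered.unique("task_category"))
    - {"Information seeking", "Advice seeking", "Coding & Debugging", "Math"}
)
others = sample_category(filtered, other_categories, 30_000)

dpo_instructions = concatenate_datasets([info_advice, coding, math, others]).shuffle(seed=42)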

Please let me know if you need more information; I am happy to discuss! We will add these details to the appendix in our next arXiv update!

@qychen2001
Author

Thank you for your reply!
I found an interesting phenomenon in my experiments: when I use UltraChat for SFT and then apply Magpie Air DPO on top of it for preference optimization, the result is not very good, but with UltraFeedback the result is good. I wonder what you think about this?

@zhangchen-xu
Member

Because the response distributions of UltraChat and Magpie-Air are too different. When applying Magpie DPO to an UltraChat SFT checkpoint, the model may find it really awkward to learn lol. Usually, when doing SFT+DPO, the distribution shift between the two phases should not be too large...

For a successful DPO, UltraChat + UltraFeedback are fine; Magpie SFT + Magpie DPO are also fine.

@qychen2001
Author

Thank you very much for your reply!
In fact, I had considered this issue, but since UltraChat is usually used as the SFT data in research, I did not try Magpie's SFT data directly.
Interestingly, when I built Qwen DPO data myself, it achieved very good results on top of UltraChat.
Perhaps I should try replacing the SFT model with a version trained on Magpie.
Thank you again for your enthusiastic reply!

@zhangchen-xu
Member

lol you can also use Magpie for scientific research! It's bigger and has higher quality~

Feel free to ask more questions and I am happy to answer~

@qychen2001
Author

Yes, I plan to use Magpie as the SFT dataset, hoping to achieve better results.
If possible, I wonder if you could share other models trained with Magpie, such as Qwen. I think it would be very helpful to the community.

@zhangchen-xu
Member

I haven't trained many Qwen models, but I will definitely do it when I have more GPUs available!

@qychen2001
Author

Thank you very much! I hope an official release is possible, since that would guarantee the performance and allow a fair comparison. I will train a Qwen version myself, and if I can roughly reproduce the performance, I will submit a PR.

@zhangchen-xu
Member

Wow that's great! Thank you so much!

@qychen2001
Author

I tried the methods you mentioned and they worked great!
Also, I'm curious about the value of beta in DPO. I've found that Magpie, which builds pairs by sampling multiple responses, has smaller reward differences than UltraFeedback, and it seems that a larger beta works better.

@zhangchen-xu
Member

beta = 0.01 is a nice option. Here is an example of the training config we used to train our DPO model:

# Customized Configs
model_name_or_path: Magpie-Align/Llama-3-8B-Magpie-Align-SFT-v0.3
hub_model_id: Magpie-Align/Llama-3-8B-Magpie-Align-v0.3-RC
output_dir: alignment_handbook_out/Llama-3-8B-Magpie-Align-v0.3-RC
run_name: Llama-3-8B-Magpie-Align-v0.3-RC

dataset_mixer:
  princeton-nlp/llama3-ultrafeedback-armorm: 1.0
dataset_splits:
- train
- test
preprocessing_num_workers: 24

# DPOTrainer arguments
bf16: true
beta: 0.01
learning_rate: 0.7e-6
gradient_accumulation_steps: 8
per_device_train_batch_size: 2
per_device_eval_batch_size: 4
num_train_epochs: 1
max_length: 2048
max_prompt_length: 1800
warmup_ratio: 0.1
logging_steps: 1
lr_scheduler_type: cosine
optim: adamw_torch

torch_dtype: null
use_flash_attention_2: true
do_eval: true
evaluation_strategy: steps
eval_steps: 100
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
log_level: info
push_to_hub: true
save_strategy: "steps"
save_steps: 100
save_total_limit: 1
seed: 42
report_to:
- wandb

You can find all the configs in the corresponding model cards.

Just a reminder that DPO performance is very sensitive to the learning rate... You may need to spend some time doing a grid search for the optimal learning rate. Empirically, Magpie datasets work well with a learning rate between 0.5e-6 and 1.0e-6.
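
As background on the beta discussion above: beta scales the implicit reward margin in the DPO objective, which is why datasets with smaller chosen/rejected reward gaps can interact differently with it. A minimal PyTorch sketch of the loss (illustrative, not the alignment-handbook/TRL implementation itself) is:

# Minimal DPO loss sketch showing where beta enters. Inputs are summed
# per-sequence log-probabilities under the policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.01):
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # A larger beta amplifies the chosen/rejected margin inside the sigmoid,
    # which matters more when the raw reward gap between the pair is small.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()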
