
Potential Optimization for Preference Training with Prefix Sharing #476

Open
austin362667 opened this issue Dec 13, 2024 · 0 comments
austin362667 commented Dec 13, 2024

🚀 The feature, motivation and pitch

In *Accelerating Direct Preference Optimization with Prefix Sharing*, the authors propose an efficient way to reduce the total number of training tokens in paired preference optimization by packing the shared prompt together with both the chosen and rejected responses into a single sequence. As a result, the computation for the shared prompt is performed only once per training sample, eliminating redundant processing.

To do so, it leverages a custom attention mask that blocks the region where the rejected response would attend to the chosen response, ensuring that the two responses are computed independently of each other while both still attend to the shared prompt.

To be more specific, please check the diagram from the paper below:

[Diagram from the paper: prefix-sharing attention mask over the packed prompt/chosen/rejected sequence]

This method extends beyond DPO (demonstrated in the paper) and is compatible with all offline paired preference optimization algorithms, including ORPO and SimPO.
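
A minimal sketch of what such a mask could look like with PyTorch FlexAttention is given below. The `[prompt | chosen | rejected]` packing, the lengths, and the function names are illustrative assumptions, not a proposed implementation:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

# Assumed per-sample layout: [shared prompt | chosen response | rejected response]
PROMPT_LEN, CHOSEN_LEN, REJECTED_LEN = 512, 128, 128
SEQ_LEN = PROMPT_LEN + CHOSEN_LEN + REJECTED_LEN


def shared_prefix_mask(b, h, q_idx, kv_idx):
    # Standard causal constraint: attend only to the current and earlier positions.
    causal = q_idx >= kv_idx
    # Block the rejected response (queries) from attending to the chosen response (keys),
    # so each response conditions only on the shared prompt plus itself.
    rejected_to_chosen = (
        (q_idx >= PROMPT_LEN + CHOSEN_LEN)
        & (kv_idx >= PROMPT_LEN)
        & (kv_idx < PROMPT_LEN + CHOSEN_LEN)
    )
    return causal & ~rejected_to_chosen


block_mask = create_block_mask(
    shared_prefix_mask, B=None, H=None, Q_LEN=SEQ_LEN, KV_LEN=SEQ_LEN
)

# q, k, v: (batch, num_heads, seq_len, head_dim); one packed sequence per preference pair.
q = torch.randn(2, 8, SEQ_LEN, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = flex_attention(q, k, v, block_mask=block_mask)
```

Because the causal term still lets both responses see the shared prompt, the prompt's attention computation is shared across the pair while the chosen/rejected logits stay independent.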

Alternatives

No response

Additional context

https://github.com/frankxwang/dpo-prefix-sharing

austin362667 self-assigned this Jan 8, 2025
lancerts added a commit that referenced this issue Feb 21, 2025
…ention` (#504)

## Summary

> TLDR of #476: The shared prefix attention mask is an optimization for paired-preference alignment training.

To pave the way for #476, this PR sets up basic unit tests of flex attention with causal and shared prefix masks.
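
For illustration, a parity test of this shape might look like the sketch below; the test name, shapes, and tolerances are assumptions, not the actual tests added in #504:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention.flex_attention import create_block_mask, flex_attention


def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx


def test_flex_attention_matches_sdpa_causal():
    torch.manual_seed(0)
    q, k, v = (torch.randn(1, 4, 256, 64, device="cuda") for _ in range(3))

    # FlexAttention with an explicit causal block mask ...
    block_mask = create_block_mask(causal_mask, B=None, H=None, Q_LEN=256, KV_LEN=256)
    flex_out = flex_attention(q, k, v, block_mask=block_mask)

    # ... should match PyTorch SDPA's built-in causal masking.
    sdpa_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.testing.assert_close(flex_out, sdpa_out, rtol=1e-4, atol=1e-4)
```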

## Testing Done

## Benchmarks

1. Causal Attention Mask (Flash Attention 2 vs. Torch Scaled Dot Product Attention vs. FlexAttention)


![image](https://github.com/user-attachments/assets/bf5d479e-7157-4b17-aaea-0557f160e7b5)

2. Shared Prefix Attention Mask (Flash Attention 2 vs. Torch Scaled Dot Product Attention vs. FlexAttention)


![image](https://github.com/user-attachments/assets/4c23601a-477c-404b-9097-03bda513e82a)



- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Signed-off-by: Austin Liu <[email protected]>
Co-authored-by: Shao Tang <[email protected]>