
Completion only fine-tuning of instruction models with collections of HF datasets #1103

Merged
awni merged 29 commits into ml-explore:main from chimezie:completion_only_fix_bos_dupe on Feb 10, 2025

Conversation

@chimezie (Contributor) commented Nov 10, 2024

This PR merges #825 and #1090 and includes the fix for #1095, since together they support fine-tuning instruction models with HF completion datasets.

@awni (Member) commented Nov 11, 2024

Should we close the other three PRs in favor of this one?

@chimezie (Contributor, Author)

> Should we close the other three PRs in favor of this one?

Yes, I'll do that. I just wasn't sure whether they made more sense done all at once or piecemeal.

@ivanfioravanti (Contributor)

Yesterday I was hitting errors during fine-tuning: apply_chat_template raised "ValueError: No chat template is set for this processor. Please either set the chat_template attribute, or provide a chat template as an argument."
Is this PR solving them? I will try it now.

@ivanfioravanti (Contributor)

The chat_template was missing from tokenizer_config.json (an older model). Solved!
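
For anyone hitting the same error, a minimal illustration of setting the attribute the error message asks for; the model path and template string below are placeholders, not anything from this PR:

```python
# Illustrative only: set a chat template on an older tokenizer whose
# tokenizer_config.json does not define one. The path and template string
# are placeholders, not the model's real template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/old-model")
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
)

# apply_chat_template no longer raises "No chat template is set ..."
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}], tokenize=False
)
```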

@ivanfioravanti (Contributor)

Is there an ETA for this PR? It's really useful to simplify training on existing HF datasets.

@chimezie (Contributor, Author) commented Dec 8, 2024

@ivanfioravanti , the last thing I needed to do (which is now complete) was to update how the completion mask is identified and calculated from either the string or the corresponding token sequence. I took a look at DataCollatorForCompletionOnlyLM and axolotl for guidance, and the latter had the most straightforward solution (see: #28950).
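
For readers following along, here is a minimal sketch of that response-template idea; the model, the template string, and the helper names are illustrative, not this PR's code:

```python
# Sketch of response-template masking in the spirit of trl's
# DataCollatorForCompletionOnlyLM: locate the tokenized response template in
# the full sequence and ignore everything before the completion in the loss.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # example model
response_template = "[/INST]"  # marks where the assistant completion begins for this template

def completion_start(input_ids):
    """Index of the first completion token, or 0 if the template is not found."""
    template_ids = tokenizer.encode(response_template, add_special_tokens=False)
    n = len(template_ids)
    for i in range(len(input_ids) - n + 1):
        if input_ids[i : i + n] == template_ids:
            return i + n
    return 0

def loss_mask(input_ids):
    """1 where a token should contribute to the loss, 0 where it is masked out."""
    start = completion_start(input_ids)
    return [0] * start + [1] * (len(input_ids) - start)
```

One caveat, which is part of what the approach above has to handle, is that a template tokenized in isolation does not always match its in-context token ids, which is why both the string and the corresponding token sequence come into play.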

I was hoping to rely on a more standard approach via the return_assistant_tokens_mask keyword argument of apply_chat_template, which only seems to be available for chat templates that support it via the {% generation %} keyword, but it doesn't appear to be widely supported yet.
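
For reference, the more standard route mentioned here looks roughly like this when a model's chat template does support the {% generation %} tags; the model name below is a placeholder:

```python
# Only works for chat templates that wrap the assistant turn in
# {% generation %} ... {% endgeneration %} tags, which few templates do yet.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some/model-with-generation-tags")  # placeholder

messages = [
    {"role": "user", "content": "What is MLX?"},
    {"role": "assistant", "content": "An array framework for Apple silicon."},
]

out = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
# out["assistant_masks"] is 1 on assistant (completion) tokens and 0 elsewhere,
# which is exactly the loss mask needed for completion-only fine-tuning.
```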

In any case, it is ready for review from @awni, etc.

@ivanfioravanti (Contributor)

Amazing job 🤩

@chimezie (Contributor, Author) commented Dec 9, 2024

> Amazing job 🤩

Thank you.

chimezie added a commit to chimezie/mlx-tuning-fork that referenced this pull request on Dec 23, 2024
@ivanfioravanti (Contributor)

Any update on merging this PR into main?

Ensure completion batching doesn't allow BOS dupes for instruction/chat models whose tokenizer configurations have `add_bos_token = True` (see #1095; a sketch follows below the commit notes)
For use in calculating a mask for everything up to and including the response prompt (i.e., everything before the continuation/completion)
Follow the example of trl's DataCollatorForCompletionOnlyLM and use a response template to identify where the completion/continuation tokens begin, so that the other tokens can be masked out during loss calculation
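
A minimal sketch of the BOS de-duplication described in the first commit note above; the function and its arguments are assumptions, not the PR's actual code:

```python
# When add_bos_token = True, encoding the prompt and the completion separately
# gives each piece its own leading BOS token; drop the duplicate before
# concatenating so the training sequence contains a single BOS. Illustrative only.
def join_prompt_and_completion(tokenizer, prompt_ids, completion_ids):
    bos = tokenizer.bos_token_id
    if getattr(tokenizer, "add_bos_token", False) and completion_ids and completion_ids[0] == bos:
        completion_ids = completion_ids[1:]  # avoid a duplicated BOS mid-sequence
    return prompt_ids + completion_ids
```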
@awni force-pushed the completion_only_fix_bos_dupe branch from 7fed239 to ac3d9f5 on February 9, 2025 16:32
@awni force-pushed the completion_only_fix_bos_dupe branch from ac3d9f5 to 6ace6dc on February 9, 2025 16:33
@awni force-pushed the completion_only_fix_bos_dupe branch from 2bfec57 to bb2c8bc on February 10, 2025 02:00
@awni (Member) commented Feb 10, 2025

I did a bit of work on this PR, mostly cosmetic/simplifying stuff, but a couple of notable changes:

  • removed the response_template used to determine where the completion starts; this is instead auto-determined in the chat/completion dataset
  • removed the extra loss function and iterate_batches; the datasets simply return the position we need to start the mask at, and this gets wired through to the existing loss
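
A rough sketch of what that wiring can look like, assuming an MLX-style loss; this is an illustration of the approach described above, not the merged mlx-lm code, and the function and variable names are assumptions:

```python
# Illustrative only: the dataset item carries the offset where the completion
# begins, and the loss masks every target position before that offset.
import mlx.core as mx
import mlx.nn as nn

def completion_item(tokenizer, messages):
    """Tokens for the full conversation plus the index where the completion starts.

    Assumes the tokenized prompt is a prefix of the full tokenized conversation,
    which holds for many (though not all) chat templates.
    """
    prompt = tokenizer.apply_chat_template(messages[:-1], add_generation_prompt=True)
    full = tokenizer.apply_chat_template(messages)
    return full, len(prompt)

def completion_only_loss(model, tokens, offsets, lengths):
    # tokens: (batch, seq) padded token ids; offsets, lengths: (batch,) int arrays
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)
    steps = mx.arange(targets.shape[1])[None, :]
    # Keep only target positions that are inside the sequence and at or after
    # the completion offset; everything before the completion is ignored.
    mask = (steps >= offsets[:, None] - 1) & (steps < lengths[:, None] - 1)
    mask = mask.astype(mx.float32)
    ce = nn.losses.cross_entropy(logits, targets) * mask
    return ce.sum() / mask.sum()
```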

@awni (Member) left a comment


Thanks a lot for the addition. It will be really nice to test completion-only fine tuning!

@awni merged commit 5865899 into ml-explore:main on Feb 10, 2025
4 checks passed
@chimezie (Contributor, Author)

> Thanks a lot for the addition. It will be really nice to test completion-only fine tuning!

My pleasure!

@chimezie deleted the completion_only_fix_bos_dupe branch on February 11, 2025 15:47