Completion only fine-tuning of instruction models with collections of HF datasets #1103
Conversation
Should we close the other three PRs in favor of this one?
Yes, I'll do that. I just wasn't sure whether they made more sense done all at once or piecemeal.
Yesterday I was having many issues during fine-tuning with the error: `ValueError: No chat template is set for this processor. Please either set the chat_template attribute, or provide a chat template as an argument.`
chat_template was missing from tokenizer_config.json (an older model). Solved!
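For context, the usual fix is to assign a template string to the tokenizer's `chat_template` attribute before calling `apply_chat_template`. The toy stand-in below is not from this PR (the ChatML-style markers and hard-coded rendering are purely illustrative); it reproduces the error above and shows why setting a template resolves it:

```python
def apply_chat_template(messages, chat_template=None, add_generation_prompt=False):
    """Toy stand-in for a tokenizer's apply_chat_template.

    Mirrors the failure quoted above: without a chat_template, raise the
    same ValueError; with one, render a ChatML-style prompt.  The template
    handling is hard-coded (no Jinja) purely for illustration.
    """
    if chat_template is None:
        raise ValueError(
            "No chat template is set for this processor. Please either set "
            "the chat_template attribute, or provide a chat template as an "
            "argument."
        )
    # Render each message as <|im_start|>role\ncontent<|im_end|>\n
    out = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so generation continues from here.
        out += "<|im_start|>assistant\n"
    return out
```

In practice the fix amounts to `tokenizer.chat_template = "..."` (or regenerating the tokenizer config from a newer model revision) so the real `apply_chat_template` has a template to render.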
Is there an ETA for this PR? It would be really useful for simplifying training on existing HF datasets.
@ivanfioravanti, the last thing I needed to do (now complete) was update how the completion mask is identified and calculated from either the string or the corresponding token sequence. I looked at DataCollatorForCompletionOnlyLM and axolotl for guidance, and the latter had the most straightforward solution (see: #28950). I had hoped to rely on a more standard approach via the return_assistant_tokens_mask keyword argument of apply_chat_template, but that only works for chat templates that support it via the {% generation %} keyword, which doesn't appear to be widely supported yet. In any case, it is ready for a review from @awni, etc.
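The response-template approach described above can be sketched in plain Python. This is an illustrative reimplementation, not the PR's actual code; the helper name and token ids are assumptions:

```python
def completion_mask(tokens, response_template_ids):
    """Return a 0/1 mask over ``tokens`` where 1 marks completion tokens.

    Follows the response-template idea used by trl's
    DataCollatorForCompletionOnlyLM and axolotl: scan the token sequence
    for the subsequence that marks the start of the assistant response,
    and unmask only what comes after it, so the loss is computed on the
    completion alone.
    """
    n = len(response_template_ids)
    start = None
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == response_template_ids:
            # The completion begins right after the response template.
            start = i + n
            break
    if start is None:
        # No response template found: mask everything (contributes no loss).
        return [0] * len(tokens)
    return [0] * start + [1] * (len(tokens) - start)
```

Matching on token ids rather than strings avoids the pitfall where the response template tokenizes differently depending on surrounding context.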
Amazing job 🤩 |
Thank you.
…ing training Input masking training added in lieu of ml-explore/mlx-examples#1103
Any update on merging this PR in main? |
… an updated attempt to better sync with iterate_batches logic
…iterate_batches) by default.
Renamed the batch iteration function (iterate_delineated_batches -> iterate_completion_batches).
…, adds support for custom chat HF datasets (ml-explore#1088), and fixes (ml-explore#1087)
Ensure completion batching doesn't allow duplicate BOS tokens for instruction models whose tokenizer configurations have `add_bos_token = True` (see: #1095)
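The BOS issue above can be illustrated with a small helper. This is a sketch of the general idea, not the PR's code; the function name is hypothetical:

```python
def dedupe_bos(token_ids, bos_token_id):
    """Drop a duplicated leading BOS token.

    Tokenizers configured with ``add_bos_token = True`` can prepend a BOS
    to text whose chat template already begins with one, leaving two BOS
    tokens at the start of the sequence.  This illustrative helper keeps
    only the first.
    """
    if (
        len(token_ids) >= 2
        and token_ids[0] == bos_token_id
        and token_ids[1] == bos_token_id
    ):
        return token_ids[1:]
    return token_ids
```

A sequence that already starts with a single BOS, or with none, passes through unchanged.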
For use in calculating the mask for everything after the response prompt (i.e., the continuation/completion)
Follow the example of trl's DataCollatorForCompletionOnlyLM: use a response template to identify the beginning of the completion/continuation tokens, for the purpose of masking out the other tokens during loss calculation
I did a bit of work on this PR. Mostly cosmetic / simplifying stuff, but a couple of notable changes:
Thanks a lot for the addition. It will be really nice to test completion-only fine-tuning!
My pleasure!
This PR merges #825 and #1090, and fixes #1095, due to their combined benefit for fine-tuning instruction models with HF completion datasets.