
Support Huggingface tokenizer #1269

Merged
merged 1 commit into from
Feb 19, 2025
Conversation

khatwanimohit
Collaborator

@khatwanimohit khatwanimohit commented Feb 13, 2025

Description

Support the Hugging Face tokenizer and decouple it from the input pipeline.

If the change fixes a bug or a Github issue, please include a link, e.g.,:
b/394635939

Tests

Tested using:

python MaxText/train.py MaxText/configs/base.yml tokenizer_path='google/gemma-2-2b-it' tokenizer_type=huggingface run_name=${USER}-$RANDOM model_name=gemma2-2b base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset per_device_batch_size=1.0 enable_checkpointing=false steps=2

Also added a test comparing the Gemma2-2b HF tokenizer against the SentencePiece tokenizer.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@khatwanimohit khatwanimohit force-pushed the mohit/hf_tokenizer branch 4 times, most recently from a618206 to 52c8ade Compare February 14, 2025 07:14
Collaborator

@RissyRan RissyRan left a comment


LGTM!

    return self.tokenizer.decode(t)


def build_tokenizer(tokenizer_path, tokenizer_type, add_bos, add_eos, hf_access_token):
  """Loads the tokenizer at `tokenizer_path`."""
  max_logging.log(f"Tokenizer path: {tokenizer_path}")
  if "tiktoken" in tokenizer_path:
Wondering why the tiktoken check is slightly different here, i.e., why does `if tokenizer_type == "tiktoken"` not work?

Fixed.
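The fix discussed above amounts to dispatching on the explicit tokenizer_type config value instead of substring-matching the path. A minimal sketch of that dispatch pattern, using a stub class in place of the real tokenizer implementations (the stub and its fields are illustrative assumptions, not the PR's actual classes):

```python
class _StubTokenizer:
  """Placeholder standing in for the real tiktoken/HF/SentencePiece wrappers."""

  def __init__(self, path, kind):
    self.path = path
    self.kind = kind


def build_tokenizer(tokenizer_path, tokenizer_type):
  """Select the tokenizer backend by the explicit tokenizer_type value."""
  if tokenizer_type == "tiktoken":
    return _StubTokenizer(tokenizer_path, "tiktoken")
  if tokenizer_type == "huggingface":
    return _StubTokenizer(tokenizer_path, "huggingface")
  if tokenizer_type == "sentencepiece":
    return _StubTokenizer(tokenizer_path, "sentencepiece")
  raise ValueError(f"unknown tokenizer_type: {tokenizer_type!r}")
```

Keying on tokenizer_type also surfaces misconfiguration early, since an unknown value fails loudly instead of silently falling through to a default branch.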

@@ -99,5 +107,23 @@ def test_detokenize(self):
self.assertEqual(np.asarray(self.source_tokenizer.decode(tokens)), np.asarray(text))


class HFTokenizerTest(unittest.TestCase):
Thanks for the test!

@khatwanimohit khatwanimohit force-pushed the mohit/hf_tokenizer branch 3 times, most recently from 412b0eb to aa15d33 Compare February 19, 2025 18:41
@copybara-service copybara-service bot merged commit bea1cef into main Feb 19, 2025
17 of 18 checks passed
@copybara-service copybara-service bot deleted the mohit/hf_tokenizer branch February 19, 2025 21:10