Support Huggingface tokenizer #1269
Conversation
Force-pushed from 6e95d6b to 265f09b
Force-pushed from a618206 to 52c8ade
LGTM!
MaxText/tokenizer.py
Outdated
  return self.tokenizer.decode(t)

def build_tokenizer(tokenizer_path, tokenizer_type, add_bos, add_eos, hf_access_token):
  """Loads the tokenizer at `tokenizer_path`"""
  max_logging.log(f"Tokenizer path: {tokenizer_path}")
  if "tiktoken" in tokenizer_path:
Wondering why the tiktoken check is slightly different here, i.e. why if tokenizer_type == "tiktoken" does not work?
Fixed.
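For reference, here is a minimal sketch of what dispatching on tokenizer_type (rather than substring-matching tokenizer_path) could look like after the fix. The three wrapper classes are empty placeholders standing in for the real MaxText tokenizer wrappers, and plain print replaces MaxText's max_logging; this is an illustration, not the actual implementation in this PR.

# Sketch only: choose the tokenizer wrapper by tokenizer_type instead of
# matching substrings in tokenizer_path. The classes below are placeholders.
class SentencePieceTokenizer:  # placeholder for the real SentencePiece wrapper
  def __init__(self, tokenizer_path, add_bos, add_eos):
    ...

class TikTokenTokenizer:  # placeholder for the real tiktoken wrapper
  def __init__(self, tokenizer_path, add_bos, add_eos):
    ...

class HFTokenizer:  # placeholder for the real Huggingface wrapper
  def __init__(self, tokenizer_path, add_bos, add_eos, hf_access_token):
    ...

def build_tokenizer(tokenizer_path, tokenizer_type, add_bos, add_eos, hf_access_token):
  """Loads the tokenizer at `tokenizer_path`, selecting the wrapper by tokenizer_type."""
  print(f"Tokenizer path: {tokenizer_path}")  # max_logging.log in MaxText
  if tokenizer_type == "tiktoken":
    return TikTokenTokenizer(tokenizer_path, add_bos, add_eos)
  if tokenizer_type == "huggingface":
    return HFTokenizer(tokenizer_path, add_bos, add_eos, hf_access_token)
  return SentencePieceTokenizer(tokenizer_path, add_bos, add_eos)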
@@ -99,5 +107,23 @@ def test_detokenize(self):
    self.assertEqual(np.asarray(self.source_tokenizer.decode(tokens)), np.asarray(text))


class HFTokenizerTest(unittest.TestCase):
Thanks for the test!
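For context, a minimal sketch of a test along these lines, comparing the Gemma2-2b Huggingface tokenizer against a SentencePiece model on the same text. The model id, the .model asset path, the class name, and the exact assertion are illustrative assumptions, not the actual test added in this PR.

# Sketch only: encode the same text with the Huggingface tokenizer and a
# SentencePiece model and compare the results. Paths and ids are assumptions.
import unittest

import sentencepiece as spm
from transformers import AutoTokenizer


class HFTokenizerSketchTest(unittest.TestCase):

  def setUp(self):
    # Gated repos may require an HF access token via huggingface-cli login.
    self.hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
    # Assumed local SentencePiece model path for Gemma.
    self.sp_tokenizer = spm.SentencePieceProcessor(model_file="assets/tokenizer.gemma")

  def test_tokenizations_match(self):
    text = "The quick brown fox jumps over the lazy dog."
    hf_ids = self.hf_tokenizer.encode(text, add_special_tokens=False)
    sp_ids = self.sp_tokenizer.encode(text)
    self.assertEqual(hf_ids, sp_ids)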
Force-pushed from 412b0eb to aa15d33
Force-pushed from aa15d33 to 0aa574f
Force-pushed from 0aa574f to 3fdd18f
Description
Support the Huggingface tokenizer and decouple it from the input pipeline.
Fixes b/394635939
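For illustration, a minimal sketch of how a Huggingface tokenizer could be wrapped behind a simple encode/decode interface using transformers.AutoTokenizer, decoupled from the input pipeline. The class and method names are assumptions rather than the actual MaxText implementation; add_bos_token/add_eos_token are forwarded as tokenizer kwargs, which Gemma/Llama-style tokenizers accept.

# Sketch only: a thin wrapper around transformers.AutoTokenizer. Names are
# illustrative assumptions, not the MaxText classes added in this PR.
from transformers import AutoTokenizer


class HFTokenizerSketch:
  """Wraps a Huggingface tokenizer behind a plain encode/decode interface."""

  def __init__(self, tokenizer_path, add_bos, add_eos, hf_access_token=None):
    self.tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_path,
        add_bos_token=add_bos,
        add_eos_token=add_eos,
        token=hf_access_token,  # needed for gated repos such as Gemma
    )

  def encode(self, text):
    # Return plain Python token ids, independent of any framework tensors.
    return self.tokenizer.encode(text)

  def decode(self, token_ids):
    return self.tokenizer.decode(token_ids)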
Tests
Tested using
python MaxText/train.py MaxText/configs/base.yml tokenizer_path='google/gemma-2-2b-it' tokenizer_type=huggingface run_name=${USER}-$RANDOM model_name=gemma2-2b base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset per_device_batch_size=1.0 enable_checkpointing=false steps=2
and added a test comparing the Gemma2-2b HF tokenizer with SentencePiece.
Checklist
Before submitting this PR, please make sure (put X in square brackets):