
Support Huggingface tokenizer #1269

Merged
merged 1 commit into from
Feb 19, 2025
Conversation

khatwanimohit
Collaborator

@khatwanimohit khatwanimohit commented Feb 13, 2025

Description

Support the Hugging Face tokenizer and decouple it from the input pipeline.

If the change fixes a bug or a Github issue, please include a link, e.g.,:
b/394635939

Tests

Tested using:

python MaxText/train.py MaxText/configs/base.yml tokenizer_path='google/gemma-2-2b-it' tokenizer_type=huggingface run_name=${USER}-$RANDOM model_name=gemma2-2b base_output_directory=gs://runner-maxtext-logs dataset_path=gs://maxtext-dataset per_device_batch_size=1.0 enable_checkpointing=false steps=2

Also added a test comparing the Gemma2-2b HF tokenizer against the SentencePiece tokenizer.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@khatwanimohit khatwanimohit force-pushed the mohit/hf_tokenizer branch 4 times, most recently from a618206 to 52c8ade Compare February 14, 2025 07:14
Collaborator

@RissyRan RissyRan left a comment


LGTM!

    return self.tokenizer.decode(t)


def build_tokenizer(tokenizer_path, tokenizer_type, add_bos, add_eos, hf_access_token):
  """Loads the tokenizer at `tokenizer_path`."""
  max_logging.log(f"Tokenizer path: {tokenizer_path}")
  if "tiktoken" in tokenizer_path:
Wondering why the tiktoken check is slightly different here, i.e., why does `if tokenizer_type == "tiktoken"` not work?

Fixed.
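The fix discussed above amounts to dispatching on the explicit tokenizer_type config value instead of substring-matching the path. A minimal sketch of that dispatch pattern, using a stub class in place of the real tokenizer implementations (the stub and its fields are illustrative assumptions, not the PR's actual classes):

```python
class _StubTokenizer:
  """Placeholder standing in for the real tiktoken/HF/SentencePiece wrappers."""

  def __init__(self, path, kind):
    self.path = path
    self.kind = kind


def build_tokenizer(tokenizer_path, tokenizer_type):
  """Select the tokenizer backend by the explicit tokenizer_type value."""
  if tokenizer_type == "tiktoken":
    return _StubTokenizer(tokenizer_path, "tiktoken")
  if tokenizer_type == "huggingface":
    return _StubTokenizer(tokenizer_path, "huggingface")
  if tokenizer_type == "sentencepiece":
    return _StubTokenizer(tokenizer_path, "sentencepiece")
  raise ValueError(f"unknown tokenizer_type: {tokenizer_type!r}")
```

Keying on tokenizer_type also surfaces misconfiguration early, since an unknown value fails loudly instead of silently falling through to a default branch.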

@@ -99,5 +107,23 @@ def test_detokenize(self):
self.assertEqual(np.asarray(self.source_tokenizer.decode(tokens)), np.asarray(text))


class HFTokenizerTest(unittest.TestCase):
Thanks for the test!

@khatwanimohit khatwanimohit force-pushed the mohit/hf_tokenizer branch 3 times, most recently from 412b0eb to aa15d33 Compare February 19, 2025 18:41
@copybara-service copybara-service bot merged commit bea1cef into main Feb 19, 2025
17 of 18 checks passed
@copybara-service copybara-service bot deleted the mohit/hf_tokenizer branch February 19, 2025 21:10