
Enables the huggingface checkpoint conversion to MaxText orbax. #1291

Merged
merged 19 commits into main from lance-deepseek on Feb 21, 2025

Conversation

wang2yn84
Collaborator

Description

This PR converts Hugging Face Llama checkpoints to the MaxText Orbax format. The goal is to convert the DeepSeek distilled checkpoint to MaxText. The original llama_or_mistral_ckpt.py only works on PyTorch checkpoints, not Hugging Face checkpoints. This PR fixes the workflow and documents the whole process.

Right now the accuracy still has some issues and needs further debugging.

Tests

Ran through the workflow multiple times; the generated checkpoint works.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

Collaborator

@singh-mitali left a comment

Many lint-only changes in this PR. Can those be skipped? Perhaps run "bash code_style.sh" under the maxtext dir.

@wang2yn84
Collaborator Author

Many lint-only changes in this PR. Can those be skipped? Perhaps run "bash code_style.sh" under the maxtext dir.

True! Updated and removed those changes.

@anfals
Collaborator

anfals commented Feb 20, 2025

@richjames0 had this PR: #1028 to support safetensors, but it looks like this is doing more?

@wang2yn84
Collaborator Author

@richjames0 had this PR: #1028 to support safetensors, but it looks like this is doing more?

Yes, I'm aware of that PR, but it doesn't work when I try its conversion function. More logic is required to handle the discrepancies between the Hugging Face model structure and MaxText's. That's why I have this PR.

@anfals
Collaborator

anfals commented Feb 20, 2025

@richjames0 had this PR: #1028 to support safetensors, but it looks like this is doing more?

Yes, I'm aware of that PR, but it doesn't work when I try its conversion function. More logic is required to handle the discrepancies between the Hugging Face model structure and MaxText's. That's why I have this PR.

Gotcha! Yeah, I was pointed to that older PR for converting an HF ckpt, but we are seeing major issues with loss when I run with it. Your comment more or less confirms that the checkpoint conversion was the problem. It'll be good once this PR lands and merges.

@@ -168,6 +168,29 @@ def _hf_mapping(layer_idx: int = -1, expert_idx: int = -1) -> dict:
}


def _hf_to_maxtext_mapping(layer_idx: int = -1, expert_idx: int = -1) -> dict:
Collaborator

It might be less error-prone to have a function here that reverses the keys/values of the previous dict.
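For illustration, a minimal sketch of the reversal suggested here, assuming the forward mapping is one-to-one (names are illustrative, not from the PR):

def _reverse_mapping(forward: dict) -> dict:
  """Invert a MaxText->HF name mapping into an HF->MaxText mapping."""
  reversed_mapping = {v: k for k, v in forward.items()}
  # A dict comprehension silently drops colliding keys, so check the inversion is lossless.
  assert len(reversed_mapping) == len(forward), "forward mapping is not one-to-one"
  return reversed_mapping

# Hypothetical usage, deriving the reverse table from the existing forward table:
# hf_to_maxtext = _reverse_mapping(_hf_mapping(layer_idx, expert_idx))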

Collaborator Author

I'm actually considering deleting that mapping because it's not working. Will refactor in follow-up PRs.

return x


def convert_huggingface_to_jax_weights(base_model_path, model_size, huggingface_ckpt, model_params, mem_info):
Collaborator

This function and the following old function look very similar except for the loading function. Could they be combined?
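One way to combine them, assuming the two functions differ only in how the raw checkpoint is loaded, is to inject the loader as a parameter; a sketch with illustrative names, not the PR's actual refactor:

def _convert_checkpoint_to_jax_weights(base_model_path, model_size, model_params, mem_info, load_fn):
  """Shared conversion body; only the checkpoint-loading step differs per format."""
  chkpt_vars = load_fn(base_model_path)  # e.g. a .pth loader or a safetensors loader
  # ... the common weight mapping / reshaping logic shared by both converters would follow here ...
  return chkpt_vars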

Collaborator

@singh-mitali left a comment

Left some comments, but those could be addressed in a follow-up CL.

@RissyRan
Collaborator

Thanks for the change! The old PR worked fine on my side a while ago; I'd like to take a look at this PR.

@wang2yn84
Collaborator Author

Thanks for the change! The old PR worked fine on my side a while ago; I'd like to take a look at this PR.

Does it convert the Hugging Face checkpoint? The name mapping has the wrong direction; it's from MaxText to Hugging Face. I would be surprised if it worked before.

@RissyRan
Collaborator

Thanks for the change! The old PR worked fine on my side a while ago; I'd like to take a look at this PR.

Does it convert the Hugging Face checkpoint? The name mapping has the wrong direction; it's from MaxText to Hugging Face. I would be surprised if it worked before.

Yes, it works. @richjames0 and I were working on those checkpoints from Hugging Face. To be specific, we tested:

  1. download safetensors from https://huggingface.co/mistralai/Mixtral-8x22B-v0.1
  2. run the script with that PR

What issue are you seeing? Keys not found?

@@ -0,0 +1,15 @@
"""
Copyright 2023 Google LLC
Collaborator

Nit: 2025

@@ -77,7 +77,7 @@ def get_data(golden_data, golden_data_index, config):
return ids, decoder_segment_ids, decoder_positions, logits


def main(config, test_args):
def main(config, test_args): # pylint: disable=W0621
Collaborator

Will bash code_style.sh work for those?

Collaborator Author

Yeah, I used bash code_style.sh and it reports that test_args is redefined within the main function. That's weird, and functionally everything works fine, so I had to disable it for now.

Collaborator

Oh... test_args shouldn't be redefined and overwritten. Probably we shouldn't disable the check but find the cause?

Collaborator Author

It's not redefined anywhere; that's the mystery.
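For context, pylint W0621 (redefined-outer-name) typically fires when a module-level name shadows a function parameter, even if the function never touches the outer variable; a minimal, hypothetical reproduction:

def main(config, test_args):  # W0621: Redefining name 'test_args' from outer scope
  print(config, test_args)

if __name__ == "__main__":
  # The module-level `test_args` below is what pylint treats as the "outer scope" name.
  test_args = []  # placeholder for the parsed CLI args in the real script
  main("config", test_args)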

Collaborator

@vipannalla left a comment

Looks great, thanks Lance for adding this script. I had a few minor comments.

wk = np.reshape(wk, [base_num_query_heads * head_dim, base_num_kv_heads, head_dim])
wv = np.reshape(wv, [base_num_query_heads * head_dim, base_num_kv_heads, head_dim])

if model_size[:8] == "llama3.1":
Collaborator

Is this logic applicable only to version 3.1, or to all versions after 3.1 as well?

Collaborator Author

We only have 3.1 for now. Yes, it should work for, say, 3.3.

Collaborator

Is this change needed later? Update it to 3.3?

Collaborator Author

Basically we are still using 3.1 to represent all versions after 3.1. We can refactor that later to be more accurate, but right now it's not only this place; other code in the codebase also depends on 3.1 to recognize the pattern.
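Concretely, the branch keys off a string prefix of the configured model_size, so a later version would need to be registered under a 3.1-style name to take this path (the model name below is illustrative):

model_size = "llama3.1-70b"  # a 3.3 checkpoint would also be configured under a 3.1-style name
if model_size[:8] == "llama3.1":
  pass  # the 3.1-specific handling applies here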

self_attention["value"]["kernel"][layer_idx, ...] = wv # pylint: disable=E1137
self_attention["out"]["kernel"][layer_idx, ...] = w_post # pylint: disable=E1137

self_attention["query"]["kernel"] = np.transpose(self_attention["query"]["kernel"], axes=(1, 0, 2, 3))
Collaborator

Nit: can you add a comment about the hardcoded axes (1, 0, 2, 3) and what they refer to in MaxText/JAX?

Collaborator Author

Done.
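For reference, a small illustration of what the hardcoded axes do; the dimension names are assumptions about the stacked layout, not taken from the script:

import numpy as np

# Assume the stacked query kernel is laid out as [layer, embed, num_heads, head_dim].
kernel = np.zeros((4, 8, 2, 16))

# axes=(1, 0, 2, 3) swaps the first two dimensions, producing
# [embed, layer, num_heads, head_dim]; the 2-D scale arrays use axes=(1, 0) analogously.
kernel = np.transpose(kernel, axes=(1, 0, 2, 3))
print(kernel.shape)  # (8, 4, 2, 16)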

layer_weight["post_self_attention_layer_norm"]["scale"][layer_idx, ...] = post_self_attention_layernorm # pylint: disable=E1137

layer_weight["pre_self_attention_layer_norm"]["scale"] = np.transpose(
layer_weight["pre_self_attention_layer_norm"]["scale"], axes=(1, 0)
Collaborator

Please add a comment about what the (1, 0) axes refer to here.


if num_experts is None:
# swap the layer index
layer_weight["mlp"]["wi_0"]["kernel"] = np.transpose(layer_weight["mlp"]["wi_0"]["kernel"], axes=(1, 0, 2))
Collaborator

Please add documentation about the hardcoded axes for posterity...

Comment on lines 475 to 476
if huggingface_ckpt:
return _convert_huggingface_to_jax_weights(base_model_path, model_size, model_params, mem_info)
Collaborator

This is probably for future PRs, but can we also consolidate the rest of the logic into _convert_pytorch_to_jax_weights() to make it cleaner?

Collaborator Author

Good point! Updated!
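A sketch of that consolidation, using the function names mentioned in this thread but with assumed signatures, not the merged code:

def convert_to_jax_weights(base_model_path, model_size, huggingface_ckpt, model_params, mem_info):
  """Thin dispatcher: pick the converter based on the source checkpoint format."""
  if huggingface_ckpt:
    return _convert_huggingface_to_jax_weights(base_model_path, model_size, model_params, mem_info)
  # The original .pth path lives in its own helper so this entry point stays small.
  return _convert_pytorch_to_jax_weights(base_model_path, model_size, model_params, mem_info)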

@wang2yn84
Collaborator Author

Yes, it works. @richjames0 and I were working on those checkpoints from Hugging Face. To be specific, we tested:

  1. download safetensors from https://huggingface.co/mistralai/Mixtral-8x22B-v0.1
  2. run the script with that PR

What issue are you seeing? Keys not found?

As far as I can see, llama_or_mistral_ckpt.py doesn't have a safetensors loader and can only load .pth files. How do you load a safetensors checkpoint?

@RissyRan
Collaborator

As far as I can see, llama_or_mistral_ckpt.py doesn't have a safetensors loader and can only load .pth files. How do you load a safetensors checkpoint?

You should be able to find it in https://github.com/AI-Hypercomputer/maxtext/pull/1028/files:

# From PR #1028; safe_open comes from the safetensors library, while
# max_logging and _HFNamespaceMapper are helpers defined in MaxText / that PR.
from safetensors import safe_open

def load_safetensors_checkpoint(ckpt_paths):
  chkpt_vars_raw = {}
  for i, ckpt_path in enumerate(ckpt_paths):
    max_logging.log(f"Loading checkpoint path {i+1} of {len(ckpt_paths)} ...")
    with safe_open(ckpt_path, framework="pt") as f:
      for k in f.keys():
        # Each safetensors shard should contribute unique keys.
        assert k not in chkpt_vars_raw
        chkpt_vars_raw[k] = f.get_tensor(k)
  # Wrap the raw dict in the namespace mapper defined alongside this function.
  chkpt_vars = [_HFNamespaceMapper(chkpt_vars_raw)]
  return chkpt_vars

Sorry that we haven't merged that PR in time due to a minor comment. Please don't merge until we are aligned. Due to the urgency, I am OK with saving this in a branch or a copy in a separate file.

@@ -518,6 +519,7 @@ inference_metadata_file: "" # path to a json file
inference_server: "MaxtextInterleavedServer" # inference server to start
inference_benchmark_test: False
enable_model_warmup: False
hf_model_path: "" # inference checkpoint correctness verification
Collaborator

We don't need to add this to base.yml (if it's only used in llama_or_mistral_ckpt.py)?

Collaborator Author

When I run the script, if I don't add it to base.yml, it complains that it's configured on the command line but not in the config.

Collaborator

Interesting! How come? max_kl_div, atol, etc. are not in base.yml either.

Collaborator Author

Discussed offline and agreed to remove it from here. It's excluded in the test; we should use "--" to pass in the config.



def test_huggingface_to_maxtext_back_to_huggingface_flow():
base_num_query_heads = 16
head_dim = 32
Collaborator

Why are those two configs defined/hardcoded in the test?

Collaborator Author

I moved this test to a separate file. Basically this is a unit test for the permutation function, so everything else is hardcoded.
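For illustration, a self-contained test in the same spirit, with hardcoded shapes and hypothetical permute/unpermute helpers standing in for the functions under test:

import numpy as np

def permute_to_maxtext(w, num_heads, head_dim):
  # Hypothetical stand-in: [num_heads * head_dim, embed] -> [embed, num_heads, head_dim]
  return np.transpose(w, (1, 0)).reshape(w.shape[1], num_heads, head_dim)

def unpermute_to_huggingface(w):
  # Inverse of the above: [embed, num_heads, head_dim] -> [num_heads * head_dim, embed]
  embed, num_heads, head_dim = w.shape
  return np.transpose(w.reshape(embed, num_heads * head_dim), (1, 0))

def test_huggingface_to_maxtext_back_to_huggingface_flow():
  base_num_query_heads, head_dim, embed = 16, 32, 64  # hardcoded, as in the unit test
  w = np.random.normal(size=(base_num_query_heads * head_dim, embed))
  roundtrip = unpermute_to_huggingface(permute_to_maxtext(w, base_num_query_heads, head_dim))
  np.testing.assert_allclose(w, roundtrip)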

@copybara-service bot merged commit e7038bc into main on Feb 21, 2025
12 of 20 checks passed
@copybara-service bot deleted the lance-deepseek branch on February 21, 2025 at 21:04