
revert_liger_kernel_to_xxx can't revert LigerCrossEntropyLoss for transformers>=4.46.1 #542

Open
Tcc0403 opened this issue Jan 27, 2025 · 7 comments
Labels: bug (Something isn't working)

Comments

Tcc0403 (Collaborator) commented Jan 27, 2025

🐛 Describe the bug

#369 found that LigerCrossEntropyLoss wasn't applied in post-grad-acc-fix versions of transformers. Although #375 fixed that issue, it didn't account for the revert functions used by the convergence tests.

Currently, the convergence test test_mini_models_with_logits compares two models that are both using LigerCrossEntropyLoss in every case except the first. In other words, the results of the second and later test cases might be false positives.

The current revert functions reload modules by calling importlib.reload(module). We can fix the issue by carefully checking the transformers version and reloading every patched module. We should also strengthen the monkey_patch unit tests by adding another revert-and-compare pass, to ensure the correctness of the convergence test results.
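For illustration, a minimal sketch of what such a version-aware revert could look like. This is not liger-kernel's actual revert code; the module names and the location of the shared loss module are assumptions.

```python
# Minimal sketch of a version-aware revert, assuming the patch touches both
# the modeling module and a shared loss module on transformers>=4.46.1.
import importlib

import transformers
from packaging import version


def revert_liger_kernel_to_llama():
    import transformers.models.llama.modeling_llama as modeling_llama

    importlib.reload(modeling_llama)

    if version.parse(transformers.__version__) >= version.parse("4.46.1"):
        # Post-grad-acc-fix releases compute the loss outside the modeling
        # module, so the patched loss module must be reloaded as well
        # (location assumed here).
        import transformers.loss.loss_utils as loss_utils

        importlib.reload(loss_utils)
```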

Reproduce

Add a print statement in LigerCrossEntropyLossFunction and run

python3 -m pytest test/convergence/test_mini_models_with_logits.py -v -rP

Versions

none

Tcc0403 mentioned this issue Jan 27, 2025
Tcc0403 added the bug label Jan 27, 2025
Tcc0403 changed the title from "revert_liger_kernel_to_xxx functions for convergence test are incompatible with some huggingface/transformers versions" to "revert_liger_kernel_to_xxx can't revert LigerCrossEntropyLoss for transformers>=4.46.1" on Feb 1, 2025
jp1924 (Contributor) commented Feb 4, 2025

Oh, this is a serious issue.
It means that the tests performed so far might be incorrect.
Are you currently working on this? If not, I will open a PR.

Tcc0403 (Collaborator, Author) commented Feb 4, 2025

Only LigerCrossEntropy doesn't get reverted, so as long as LigerCrossEntropy passes its unit test, the convergence test should still be reliable. The main issue is that even if we find a way to revert the current monkey patch implementation, it would probably be too hacky and too vulnerable to transformers changes in future versions.
The simplest, most approachable solution we've discussed in the Liger team is splitting the fp32 and bf16 test cases so the convergence test doesn't rely on any revert functions at all; see the sketch below. You can open a PR for it. @jp1924
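A rough sketch of what that split might look like, with hypothetical file and test names; the real convergence tests compare a Liger-patched model against an unpatched reference, which is elided here.

```python
# Hypothetical layout: test/convergence/fp32/test_mini_models_with_logits.py
# pins a single dtype, and a sibling bf16/ file pins torch.bfloat16, so one
# pytest run never has to revert the Liger patch to switch dtypes on a model.
import pytest
import torch

DTYPE = torch.float32  # the bf16 variant would use torch.bfloat16 here


@pytest.mark.parametrize("model_name", ["mini_llama3"])  # illustrative id
def test_mini_model_with_logits(model_name):
    # The real test builds a Liger-patched model and an unpatched reference,
    # trains both for a few steps, and compares losses/logits; elided here.
    assert DTYPE in (torch.float32, torch.bfloat16)
```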

jp1924 (Contributor) commented Feb 11, 2025

@Tcc0403 Ah, first, sorry for the late reply.

From what I understand, is it correct that the revert function isn't working properly?

Looking at the code with this understanding, I think there could be side effects later because the monkey patch replaces torch.nn's CrossEntropy in place.

If that's the case, I think separating the FP32 and BF16 tests would be better for future stability.

For now, I want to merge that LLaVA PR quickly and move on to other things, so I'll work on the convergence test part. I'll ask if I have any questions.

Tcc0403 (Collaborator, Author) commented Feb 11, 2025

From what I understand, is it correct that the revert function isn't working properly?

Correct.

If that's the case, I think separating the FP32 and BF16 tests would be better for future stability.

Yes. It would be great if you could create a PR for it. I'll review and merge it ASAP so we can circle back to the LLaVA PR.

jp1924 (Contributor) commented Feb 12, 2025

@Tcc0403
While working on this, I ran into something I don't understand, so I have a question.

Can you explain a bit more about the revert_liger_kernel_to_*** function and how it relates to bf16 and fp32?

I did see the problem you mentioned with the revert not working properly. The revert function only restores modeling_llama.LlamaForCausalLM, so it doesn't undo a patch applied directly to torch.nn's CrossEntropy, right?

When I run test_mini_models_with_logits.py, the first llama fp32 test fails, so shouldn't we be discussing how to fix the monkey_patch or revert code instead?

But I can't see how separating fp32 and bf16 would solve this problem: it seems like, with the current code structure, the llama test would still fail even if I split it into fp32 and bf16.

If I'm missing something, please let me know. Thanks.


Tcc0403 (Collaborator, Author) commented Feb 12, 2025

There are two issues in the with_logits convergence test:

One is due to an upcoming optimization in transformers; this is why the first llama test failed. It will be fixed by #546.

The other one is what you mentioned above:

I did see the problem you mentioned with the revert not working properly. The revert function only restores modeling_llama.LlamaForCausalLM, so it doesn't undo a patch applied directly to torch.nn's CrossEntropy, right?

Since we can't correctly revert nn.CrossEntropy, every test case except the first one (llama bf16) is actually comparing two models that are both patched with LigerCrossEntropy. That's why the other bf16 test cases don't fail the way llama bf16 does. To avoid this, we want to stop relying on the unreliable revert functions for testing the bf16 and fp32 scenarios on the same model, by splitting the two dtypes into two files. This is the change we want out of the discussion.
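To make the failure mode concrete, here is a small illustrative sketch, not liger-kernel's actual patching code: a patch applied at the torch.nn level survives an importlib.reload of the modeling module, so every later test case still sees the replacement function.

```python
import importlib

import torch.nn.functional as F
import transformers.models.llama.modeling_llama as modeling_llama

original_cross_entropy = F.cross_entropy


def fake_liger_cross_entropy(*args, **kwargs):
    # Stand-in for the Liger kernel; just forwards to the original here.
    return original_cross_entropy(*args, **kwargs)


F.cross_entropy = fake_liger_cross_entropy   # global monkey patch
importlib.reload(modeling_llama)              # the revert strategy under test
print(F.cross_entropy is fake_liger_cross_entropy)  # True: patch not reverted
```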

However, because patching nn.CrossEntropy for one model seems to affect every model afterwards, we need a different patching method for LigerCrossEntropy. We plan to handle that by adopting the implementation discussed in #543.
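For illustration only, and not necessarily the approach #543 adopts: one way to keep such a patch from leaking across test cases is to scope it explicitly so the original callable is always restored. A minimal sketch under that assumption:

```python
import contextlib

import torch.nn.functional as F

_original_cross_entropy = F.cross_entropy


def liger_like_cross_entropy(logits, target, **kwargs):
    # Stand-in for the Liger kernel's cross entropy implementation.
    return _original_cross_entropy(logits, target)


@contextlib.contextmanager
def scoped_liger_cross_entropy():
    original = F.cross_entropy
    F.cross_entropy = liger_like_cross_entropy
    try:
        yield
    finally:
        F.cross_entropy = original  # guaranteed revert, even if a test fails


# Usage in a test:
# with scoped_liger_cross_entropy():
#     run_patched_model()      # sees the Liger-style loss
# run_reference_model()        # sees the stock torch implementation again
```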

jp1924 mentioned this issue Feb 13, 2025
austin362667 pushed a commit that referenced this issue Feb 14, 2025