revert_liger_kernel_to_xxx can't revert LigerCrossEntropyLoss for transformers>=4.46.1 #542
Comments
Title changed from "revert_liger_kernel_to_xxx functions for convergence test are incompatible with some huggingface/transformers versions" to "revert_liger_kernel_to_xxx can't revert LigerCrossEntropyLoss for transformers>=4.46.1"
Oh, this is a serious issue.
Only LigerCrossEntropy doesn't get reverted. So as long as LigerCrossEntropy passes the unit test, the convergence test should be reliable. The main issue here is that even if we find a way to revert the current monkey patch implementation, it would probably be too hacky and vulnerable to transformers changes in future versions.
@Tcc0403 Ah, first, sorry for the late reply. From what I understand, is it correct that the revert function isn't working properly? Looking at the code with this understanding, if that's the case, I think separating the FP32 and BF16 tests would be better for future stability. For now, I want to merge that LLaVA PR quickly and move on to other things.
Correct.
Yes. It would be great if you could create a PR for it. I'll review and merge it ASAP, so we can circle back to the LLaVA PR.
@Tcc0403 Can you explain a bit more about the revert_liger_kernel_to_*** function and how it relates to bf16 and fp32? I did see the problem you mentioned with revert not working properly. When I run test_mini_models_with_logits.py, the first llama fp32 test fails, so I think we should be discussing how to fix the monkey_patch or revert code, right? But I can't see how separating fp32 and bf16 would solve this problem, because it seems the code structure would still fail the llama test even if I split it into fp32 and bf16. If I'm missing something, please let me know. Thanks.
There are two issues in the with_logits convergence test:
1. One is due to the future optimization in transformers. This is the reason why the first llama test failed. This issue will be fixed by #546.
2. The other one is what you mentioned above.
Since we can't correctly revert However, because patching
## Summary
#542
## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
@Tcc0403
🐛 Describe the bug
#369 found that CrossEntropyLoss wasn't applied in post-grad-acc-fix versions of transformers. Although #375 fixed that issue, it didn't account for the revert functions used by the convergence test.
Currently, the convergence test, `test_mini_models_with_logits`, compares two models that both use LigerCrossEntropyLoss, except in the first test case. In other words, the test results might be false positives in the second and later test cases.
The current revert functions are implemented by reloading modules with `importlib.reload(module_name)`. We can fix the issue by carefully checking the transformers version and adding all patched modules to the reloads. We should also enhance our monkey_patch unit test by adding another revert and compare, ensuring the correctness of convergence test results.
Reproduce
Add a print statement in LigerCrossEntropyLossFunction and run
Versions
none
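As a minimal sketch of the reload-based revert described under "Describe the bug", the snippet below shows why a revert that reloads only some of the patched modules leaves a patch in place. It is not Liger-Kernel's or transformers' actual code: the modules `toy_modeling` and `toy_loss_utils` and the helpers `apply_patch` and `revert` are hypothetical stand-ins created only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical throwaway modules: one stands in for a model file,
# the other for a shared loss module that the patch also touches.
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "toy_modeling.py").write_text(
    "def forward():\n    return 'original'\n"
)
Path(tmp_dir, "toy_loss_utils.py").write_text(
    "def cross_entropy():\n    return 'original'\n"
)
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

import toy_loss_utils
import toy_modeling


def apply_patch():
    """Monkey patch attributes in BOTH modules (hypothetical patch)."""
    toy_modeling.forward = lambda: "patched"
    toy_loss_utils.cross_entropy = lambda: "patched"


def revert(modules):
    """Revert by re-executing each module's source in place."""
    for module in modules:
        importlib.reload(module)


# Incomplete revert: only the model module is reloaded,
# so the loss patch survives into the next test case.
apply_patch()
revert([toy_modeling])
incomplete = (toy_modeling.forward(), toy_loss_utils.cross_entropy())

# Complete revert: every patched module is reloaded.
apply_patch()
revert([toy_modeling, toy_loss_utils])
complete = (toy_modeling.forward(), toy_loss_utils.cross_entropy())

print(incomplete)  # ('original', 'patched')
print(complete)    # ('original', 'original')
```

A unit test along these lines (patch, revert, then compare against the unpatched attributes) is one way to check that the list of reloaded modules matches the set of modules the patch actually touches.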