Error when running sh run_qwen.sh #487

Closed
CharlesJhonson opened this issue Dec 18, 2024 · 3 comments · Fixed by #539
Labels
good first issue Good for newcomers

Comments

@CharlesJhonson

I ran sh run_qwen.sh locally on a GPU machine. The errors are as follows; could someone help?

conda list |grep trl
trl                       0.13.0                   pypi_0    pypi
conda list |grep transformers
transformers              4.47.1                   pypi_0    pypi
sh run_qwen.sh
********************
It's effective
********************
Applied Liger kernels to Qwen2
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 10.01it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/Liger-Kernel-main/examples/huggingface/training.py", line 81, in <module>
[rank0]:     train()
[rank0]:   File "/home/Liger-Kernel-main/examples/huggingface/training.py", line 67, in train
[rank0]:     trainer = SFTTrainer(
[rank0]:   File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]: TypeError: SFTTrainer.__init__() got an unexpected keyword argument 'max_seq_length'
E1218 16:54:26.878000 140467821201216 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 102105) of binary: /home/miniforge3/envs/ligerkernel/bin/python
Traceback (most recent call last):
  File "/home/miniforge3/envs/ligerkernel/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
training.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-18_16:54:26
  host      : 23
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 102105)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@bboyleonp666
Contributor

@CharlesJhonson, I checked the documentation for trl. It seems there's a change in trl.SFTTrainer in v0.13.0. I haven't dug into the details yet, but I found that max_seq_length has been removed from trl.SFTTrainer and now lives in trl.SFTConfig.
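
For reference, a minimal sketch of the new placement, assuming trl 0.13; the model id, dataset, and output directory below are placeholders, not the values used by run_qwen.sh:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset; the Liger example uses its own data pipeline.
dataset = load_dataset("trl-lib/Capybara", split="train")

# max_seq_length is now a field on SFTConfig rather than an SFTTrainer kwarg.
training_args = SFTConfig(
    output_dir="qwen2-sft",
    max_seq_length=512,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2-7B-Instruct",  # placeholder model id
    args=training_args,
    train_dataset=dataset,
)
trainer.train()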

@Tcc0403
Collaborator

Tcc0403 commented Dec 21, 2024

huggingface/trl#2306
huggingface/trl@5e90682#diff-67e157adfcd37d677fba66f610e3dfb56238cc550f221e8683fcfa0556e0f7caL150
It seems max_seq_length, as a deprecated argument, was removed in this patch.

max_seq_length=custom_args.max_seq_length,

Deleting this line and checking which arguments should be added to the training_args dict should fix the issue; see the sketch below.

Some links that might be helpful: trl.SFTTrainer, trl.SFTConfig, transformers.HfArgumentParser, transformers.TrainingArguments
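
A hedged sketch of that direction: parse SFTConfig with HfArgumentParser so max_seq_length is picked up from the command line instead of being passed to SFTTrainer. The CustomArguments dataclass, its fields, and the dataset here are illustrative, not the exact code in training.py:

from dataclasses import dataclass, field
from datasets import load_dataset
from transformers import HfArgumentParser
from trl import SFTConfig, SFTTrainer

@dataclass
class CustomArguments:
    # Hypothetical custom field; max_seq_length no longer needs to live here.
    model_name: str = field(default="Qwen/Qwen2-7B-Instruct")

# SFTConfig subclasses transformers.TrainingArguments, so HfArgumentParser
# exposes its fields (including max_seq_length) as CLI flags.
parser = HfArgumentParser((CustomArguments, SFTConfig))
custom_args, training_args = parser.parse_args_into_dataclasses()

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

trainer = SFTTrainer(
    model=custom_args.model_name,
    args=training_args,      # max_seq_length arrives via SFTConfig now
    train_dataset=dataset,
)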

@Tcc0403 Tcc0403 added the good first issue Good for newcomers label Dec 21, 2024
@CharlesJhonson
Author

OK, thanks very much! @bboyleonp666 @Tcc0403
I will try it.

lancerts added a commit that referenced this issue Feb 21, 2025
Fixes #487 

I've chosen to remove the deprecated parameter as previously mentioned
in the issue. The sequence length for the training dataset can be
specified using the HF datasets library as mentioned
[here](https://huggingface.co/docs/trl/main/en/sft_trainer#datasets).
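
For illustration, a hedged sketch of that dataset-level approach, assuming a dataset with a plain text column and a placeholder tokenizer; this is not the exact change made in the pull request:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")  # placeholder model id
dataset = load_dataset("stanfordnlp/imdb", split="train")  # placeholder dataset with a "text" column

def truncate(example):
    # Tokenize and cap each example at a fixed maximum length up front,
    # instead of passing max_seq_length to the trainer.
    return tokenizer(example["text"], truncation=True, max_length=512)

dataset = dataset.map(truncate)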

Please let me know if any further rectification is required and I will
make the necessary changes.

cc: @Tcc0403

---------

Co-authored-by: Yun Dai <[email protected]>
Co-authored-by: Shao Tang <[email protected]>