Error when running sh run_qwen.sh #487

Closed
CharlesJhonson opened this issue Dec 18, 2024 · 3 comments · Fixed by #539
Labels
good first issue Good for newcomers

Comments

@CharlesJhonson

I ran sh run_qwen.sh locally on a GPU machine. The errors are as follows; could someone help?

conda list |grep trl
trl                       0.13.0                   pypi_0    pypi
conda list |grep transformers
transformers              4.47.1                   pypi_0    pypi
sh run_qwen.sh
********************
It's effective
********************
Applied Liger kernels to Qwen2
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 10.01it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/Liger-Kernel-main/examples/huggingface/training.py", line 81, in <module>
[rank0]:     train()
[rank0]:   File "/home/Liger-Kernel-main/examples/huggingface/training.py", line 67, in train
[rank0]:     trainer = SFTTrainer(
[rank0]:   File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]: TypeError: SFTTrainer.__init__() got an unexpected keyword argument 'max_seq_length'
E1218 16:54:26.878000 140467821201216 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 102105) of binary: /home/miniforge3/envs/ligerkernel/bin/python
Traceback (most recent call last):
  File "/home/miniforge3/envs/ligerkernel/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
training.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-18_16:54:26
  host      : 23
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 102105)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@bboyleonp666
Contributor

@CharlesJhonson, I checked the documentation for trl. It seems there's a change in trl.SFTTrainer in v0.13.0. I haven't dug into the details yet, but I found that max_seq_length has been removed from trl.SFTTrainer and now lives in trl.SFTConfig.
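
For reference, a minimal sketch of the new placement, assuming trl 0.13; the model id, dataset, and output directory below are placeholders, not the values used by run_qwen.sh:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset; the Liger example uses its own data pipeline.
dataset = load_dataset("trl-lib/Capybara", split="train")

# max_seq_length is now a field on SFTConfig rather than an SFTTrainer kwarg.
training_args = SFTConfig(
    output_dir="qwen2-sft",
    max_seq_length=512,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2-7B-Instruct",  # placeholder model id
    args=training_args,
    train_dataset=dataset,
)
trainer.train()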

@Tcc0403
Collaborator

Tcc0403 commented Dec 21, 2024

huggingface/trl#2306
huggingface/trl@5e90682#diff-67e157adfcd37d677fba66f610e3dfb56238cc550f221e8683fcfa0556e0f7caL150
It seems max_seq_length, as a deprecated argument, was removed in this patch.

max_seq_length=custom_args.max_seq_length,

Deleting this line and checking which arguments should be added to the training_args dict should fix the issue; see the sketch below.

Some links that might be helpful: trl.SFTTrainer, trl.SFTConfig, transformers.HfArgumentParser, transformers.TrainingArguments
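
A hedged sketch of that direction: parse SFTConfig with HfArgumentParser so max_seq_length is picked up from the command line instead of being passed to SFTTrainer. The CustomArguments dataclass, its fields, and the dataset here are illustrative, not the exact code in training.py:

from dataclasses import dataclass, field
from datasets import load_dataset
from transformers import HfArgumentParser
from trl import SFTConfig, SFTTrainer

@dataclass
class CustomArguments:
    # Hypothetical custom field; max_seq_length no longer needs to live here.
    model_name: str = field(default="Qwen/Qwen2-7B-Instruct")

# SFTConfig subclasses transformers.TrainingArguments, so HfArgumentParser
# exposes its fields (including max_seq_length) as CLI flags.
parser = HfArgumentParser((CustomArguments, SFTConfig))
custom_args, training_args = parser.parse_args_into_dataclasses()

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

trainer = SFTTrainer(
    model=custom_args.model_name,
    args=training_args,      # max_seq_length arrives via SFTConfig now
    train_dataset=dataset,
)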

@Tcc0403 Tcc0403 added the good first issue Good for newcomers label Dec 21, 2024
@CharlesJhonson
Author

OK, thanks very much! @bboyleonp666 @Tcc0403
I will try it.

lancerts added a commit that referenced this issue Feb 21, 2025
Fixes #487 

I've chosen to remove the deprecated parameter as previously mentioned
in the issue. The sequence length for the training dataset can be
specified using the HF datasets library as mentioned
[here](https://huggingface.co/docs/trl/main/en/sft_trainer#datasets).
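
For illustration, a hedged sketch of that dataset-level approach, assuming a dataset with a plain text column and a placeholder tokenizer; this is not the exact change made in the pull request:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")  # placeholder model id
dataset = load_dataset("stanfordnlp/imdb", split="train")  # placeholder dataset with a "text" column

def truncate(example):
    # Tokenize and cap each example at a fixed maximum length up front,
    # instead of passing max_seq_length to the trainer.
    return tokenizer(example["text"], truncation=True, max_length=512)

dataset = dataset.map(truncate)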

Please let me know if any further rectification is required and I will
make the necessary changes.

cc: @Tcc0403

---------

Co-authored-by: Yun Dai <[email protected]>
Co-authored-by: Shao Tang <[email protected]>