When running the AR experiment of VALLE_V2, I got the following error. I have tried many solutions found online, but none of them worked. Have you encountered this problem before?
Training Epoch 0: 17%|█▋ | 4002/23438 [19:18<1:21:05, 4.00batch/s]
Training Epoch 0: 17%|█▋ | 4003/23438 [19:18<1:21:47, 3.96batch/s][E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19022, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800994 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19022, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801003 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19022, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801005 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19022, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801005 milliseconds before timing out.
Saving state to /mnt/workspace/liuhw/Amphion/ckpt/VALLE_V2_wavtokenizer/wavtokenizer_large75_lr1e-4_8layer_3s-15s_libritts_180H_step80w/checkpoint/epoch-0000_step-0001000_loss-8.328300...
2025-02-04 18:25:30 | INFO | accelerate.accelerator | Saving current state to /mnt/workspace/liuhw/Amphion/ckpt/VALLE_V2_wavtokenizer/wavtokenizer_large75_lr1e-4_8layer_3s-15s_libritts_180H_step80w/checkpoint/epoch-0000_step-0001000_loss-8.328300
server17:3881907:3881998 [0] NCCL INFO comm 0x498bd450 rank 0 nranks 4 cudaDev 0 busId 1e000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
server17:3881910:3882004 [0] NCCL INFO comm 0x490369c0 rank 3 nranks 4 cudaDev 3 busId 3f000 - Abort COMPLETE
server17:3881908:3882007 [0] NCCL INFO comm 0x46d3ff60 rank 1 nranks 4 cudaDev 1 busId 3d000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
server17:3881909:3882001 [0] NCCL INFO comm 0x47fce560 rank 2 nranks 4 cudaDev 2 busId 3e000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
Traceback (most recent call last):
File "/mnt/workspace/liuhw/Amphion/.//bins/tts/train.py", line 156, in
main()
File "/mnt/workspace/liuhw/Amphion/.//bins/tts/train.py", line 152, in main
trainer.train_loop()
File "/mnt/workspace/liuhw/Amphion/models/tts/valle_v2_wavtokenizer/base_trainer.py", line 321, in train_loop
train_loss = self._train_epoch()
File "/mnt/workspace/liuhw/Amphion/models/tts/valle_v2_wavtokenizer/base_trainer.py", line 402, in _train_epoch
loss = self._train_step(batch)
File "/mnt/workspace/liuhw/Amphion/models/tts/valle_v2_wavtokenizer/valle_ar_trainer.py", line 199, in _train_step
out = self.model(
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1148, in forward
self._sync_buffers()
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1748, in _sync_buffers
self._sync_module_buffers(authoritative_rank)
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1752, in _sync_module_buffers
self._default_broadcast_coalesced(
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1775, in _default_broadcast_coalesced
self._distributed_broadcast_coalesced(
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1689, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19022, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801005 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 3881907) of binary: /mnt/workspace/liuhw/miniconda/envs/Amphion/bin/python
Traceback (most recent call last):
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/bin/accelerate", line 8, in
sys.exit(main())
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/accelerate/commands/launch.py", line 985, in launch_command
multi_gpu_launcher(args)
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
distrib_run.run(args)
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
.//bins/tts/train.py FAILED
Failures:
[1]:
time : 2025-02-04_18:25:32
host : server17
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 3881908)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3881908
[2]:
time : 2025-02-04_18:25:32
host : server17
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 3881909)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3881909
[3]:
time : 2025-02-04_18:25:32
host : server17
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 3881910)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3881910
Root Cause (first observed failure):
[0]:
time : 2025-02-04_18:25:32
host : server17
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 3881907)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3881907
Hi @CriDora, thanks for using our code! In my experience, the NCCL timeout problem is usually related to a problem in dataset loading, such as a specific data file taking too long to load. You could try adding a timeout to the data loading; another possibility is to update the accelerate package: `pip install -U accelerate`.
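For reference, here is a minimal sketch of one way to add such a timeout, assuming the trainer builds a standard PyTorch `DataLoader`; the dataset class and parameter values below are illustrative placeholders, not the actual Amphion code:

```python
from torch.utils.data import DataLoader, Dataset


class DummyDataset(Dataset):
    """Placeholder dataset; stands in for the VALL-E v2 training dataset."""

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return idx


loader = DataLoader(
    DummyDataset(),
    batch_size=8,    # illustrative value
    num_workers=4,   # `timeout` only applies when num_workers > 0
    timeout=120,     # seconds: raise an error instead of hanging indefinitely
                     # if a worker takes too long to return a batch
)

for batch in loader:
    pass  # training step would go here
```

With `timeout` set, a stuck data file surfaces as a `RuntimeError` from the data loader on that rank, instead of silently stalling the collective until the NCCL watchdog fires after 30 minutes.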