Describe the bug
When attempting to run the hifigan_finetune.py script in the provided tts examples, the following error is triggered:
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
I've tried the standard approach for fixing this issue, which is to include the following at the top of the hifigan_finetune.py script:
import torch.multiprocessing as mp
mp.set_start_method('spawn', force=True)
This just leads to a further cascading set of errors. I'm not sure if I'm doing something grossly wrong or if hifigan is grossly bugged. I've tried this in NeMo Docker containers of various versions and conda environments of various configurations, all of which lead to the same issue.
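For reference, my understanding is that the usual placement for that workaround is inside the script's __main__ guard rather than at import time. A minimal, self-contained sketch of that placement (main() here is only a stand-in for the script's Hydra-decorated entry point, not the exact NeMo code):

import torch.multiprocessing as mp

def main():
    # Stand-in for the script's Hydra-decorated entry point, which builds the
    # PyTorch Lightning Trainer and calls trainer.fit() on the HiFi-GAN model.
    pass

if __name__ == "__main__":
    # Set 'spawn' before any DataLoader workers are forked; force=True
    # overrides a start method that may already have been set elsewhere.
    mp.set_start_method('spawn', force=True)
    main()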
Steps/Code to reproduce bug
1. Clone the NeMo repository.
2. Add proper manifest.json files to examples/tts (for simplicity).
3. Update the hifigan.yaml config to point to the manifests, and update hifigan_finetune.py to use the "hifigan" config instead of hifigan_44100 (see the sketch below).
4. Run hifigan_finetune.py.
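For reference, switching hifigan_finetune.py to the "hifigan" config amounts to changing the Hydra decorator roughly as follows. The hydra_runner import is the one NeMo example scripts use, but the exact config_path and default config_name vary between NeMo versions, so treat the arguments as approximate:

from nemo.core.config import hydra_runner

# previously config_name="hifigan_44100"
@hydra_runner(config_path="conf/hifigan", config_name="hifigan")
def main(cfg):
    ...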
Expected behavior
I expect training to begin. Instead, the model is downloaded and sanity checking begins, followed shortly by the above error.
Environment overview (please complete the following information)
Environment location: Docker
I've used the following docker pull commands, ALL of which have reproduced the issue:
docker pull nvcr.io/nvidia/nemo:23.03
docker pull nvcr.io/nvidia/nemo:23.06
docker pull nvcr.io/nvidia/nemo:24.05
docker pull nvcr.io/nvidia/nemo:24.12.01
The issue does NOT seem to be present in the following version:
docker pull nvcr.io/nvidia/nemo:22.09
Environment details
If an NVIDIA Docker image is used, you don't need to specify these.
Additional context
I've tried this on two machines: a Windows 11 PC running WSL with an RTX 2070, and a Linux server running Ubuntu 22.04 with an RTX 2000 Ada. Both have experienced the same issue. The Windows machine had a conda environment which ran FastPitch training no problem. The Linux machine also ran FastPitch in its conda environment, but I switched to the Docker containers to make sure this issue was actually pervasive and not just due to my install process.
Update: I take back the note above about 22.09; the issue is present there as well. I was able to move past it by setting num_workers=0. Is this a bug, or am I mistaken about some part of this process?
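For anyone hitting the same thing: setting num_workers=0 keeps data loading in the main process, so no forked worker ever touches CUDA. A minimal sketch of the idea in plain PyTorch (the dataset here is a placeholder, not the NeMo dataloader; in the HiFi-GAN config the setting would live under the train dataloader params, with the exact key path being an assumption on my part):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the manifest-driven NeMo dataset.
dataset = TensorDataset(torch.randn(8, 1, 22050))

# num_workers=0 keeps batch loading in the main process, so no forked
# subprocess ever re-initializes CUDA and the RuntimeError never triggers.
loader = DataLoader(dataset, batch_size=2, num_workers=0)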