HiFiGAN Finetune "Cannot re-initialize CUDA in forked subprocess." #12178

Open

Fournogo opened this issue Feb 13, 2025 · 1 comment

Labels
bug Something isn't working

Comments

Describe the bug

When attempting to run the hifigan_finetune.py script from the provided TTS examples (examples/tts), the following error is raised:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
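For reference, the same class of failure can be reproduced outside NeMo with a minimal stand-alone sketch (toy code, not NeMo's): the parent process initializes CUDA, and a child created with the default 'fork' start method then tries to touch CUDA as well.

import torch
import torch.multiprocessing as mp

def use_cuda_in_child(_):
    # In a forked child this raises:
    # RuntimeError: Cannot re-initialize CUDA in forked subprocess.
    torch.zeros(1, device="cuda")

if __name__ == "__main__":
    torch.zeros(1, device="cuda")   # parent initializes CUDA
    ctx = mp.get_context("fork")    # the default start method on Linux
    with ctx.Pool(1) as pool:
        pool.map(use_cuda_in_child, [0])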

I've tried the standard approach for fixing this issue, which is adding the following at the top of the hifigan_finetune.py script:
import torch.multiprocessing as mp
mp.set_start_method('spawn', force=True)

This just leads to a further cascade of errors. I'm not sure whether I'm doing something grossly wrong or whether HiFiGAN finetuning is grossly bugged. I've tried this in NeMo Docker containers of various versions and in conda environments of various configurations, all of which lead to the same issue.

Steps/Code to reproduce bug

  1. Clone NeMo repository.
  2. Add proper manifest.json files to examples/tts (for simplicity; see the manifest sketch after this list).
  3. Update the hifigan.yaml config to point to those manifests, and update hifigan_finetune.py to use the "hifigan" config (instead of hifigan_44100).
  4. Run hifigan_finetune.py
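
For completeness, here is roughly what I mean by "proper manifest.json files" in step 2: a JSON-lines file per split, one record per utterance. The field names below are an assumption on my part; adjust them to whatever the dataset section of hifigan.yaml actually expects.

import json

# Hypothetical records - paths, durations and text are placeholders.
records = [
    {"audio_filepath": "/data/wavs/utt_0001.wav", "duration": 3.2, "text": "hello world"},
    {"audio_filepath": "/data/wavs/utt_0002.wav", "duration": 2.7, "text": "goodbye world"},
]

with open("examples/tts/train_manifest.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")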

Expected behavior

I expect training to begin. Instead, the model is downloaded and sanity checking begins, followed shortly by the above error.

Environment overview (please complete the following information)

Environment location: Docker

I've used the following docker pull commands, ALL of which have reproduced the issue:
docker pull nvcr.io/nvidia/nemo:23.03
docker pull nvcr.io/nvidia/nemo:23.06
docker pull nvcr.io/nvidia/nemo:24.05
docker pull nvcr.io/nvidia/nemo:24.12.01

The issue does NOT seem to be present in the following version:
docker pull nvcr.io/nvidia/nemo:22.09

Environment details

If an NVIDIA Docker image is used, these don't need to be specified.

Additional context

I've tried this on two machines: a Windows 11 PC running WSL with an RTX 2070, and a Linux server running Ubuntu 22.04 with an RTX 2000 Ada. Both hit the same issue. The Windows machine had a conda environment that ran FastPitch training with no problem. The Linux machine also ran FastPitch in its conda environment, but I switched to the Docker containers to make sure this issue was actually pervasive and not just due to my install process.

Fournogo added the bug label on Feb 13, 2025
Fournogo (Author) commented Feb 13, 2025

I take that back - the issue is present in 22.09 as well. I was able to move past it by setting num_workers=0. Is this a bug, or am I mistaken about some part of this process?
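
For anyone hitting the same thing, a toy illustration (not NeMo code, and I'm only assuming the data pipeline is what touches CUDA) of why num_workers=0 sidesteps the error: with no worker processes the DataLoader runs in the main process, so CUDA is never re-initialized in a forked child.

import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    # Hypothetical dataset whose __getitem__ touches CUDA, standing in
    # for whatever does so during NeMo's sanity-check loop.
    def __len__(self):
        return 4
    def __getitem__(self, i):
        return torch.zeros(1, device="cuda")

if __name__ == "__main__":
    torch.zeros(1, device="cuda")  # main process initializes CUDA first
    # num_workers=0: no fork, so this iterates cleanly.
    # num_workers>0 with the default 'fork' start method would raise
    # "Cannot re-initialize CUDA in forked subprocess." in the workers.
    for batch in DataLoader(ToyDataset(), batch_size=2, num_workers=0):
        pass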
