Parrot-TTS is a text-to-speech (TTS) system that uses a Transformer-based sequence-to-sequence model to map character tokens to HuBERT quantized units, followed by a modified HiFi-GAN vocoder for speech synthesis. This repository is the official implementation of our EACL 2024 paper, available at https://aclanthology.org/2024.findings-eacl.6/. It provides instructions for installation, running the demo, and training the TTS model on your own data. A few sample files generated with our model (trained without transliteration of non-English characters) are available at https://drive.google.com/file/d/1b4uoeRv106J-4NvzVnotfBiAuFz049_q/view?usp=sharing
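At a high level, synthesis runs in two stages: the TTE module maps character tokens to discrete HuBERT units, and the vocoder turns those units into a waveform. A minimal sketch of that data flow (function names here are illustrative, not the repository's actual API):

```python
# Illustrative sketch of the two-stage Parrot-TTS pipeline.
# The names below are hypothetical; the repository's real entry points
# are the scripts described in the steps of this README.

def text_to_units(text, symbol_map, tte_model):
    """Stage 1: characters -> integer tokens -> HuBERT unit IDs (TTE module)."""
    tokens = [symbol_map[ch] for ch in text]  # character-to-token dictionary
    return tte_model(tokens)                  # seq2seq Transformer predicts units

def units_to_speech(units, vocoder):
    """Stage 2: modified HiFi-GAN vocoder maps discrete units to a waveform."""
    return vocoder(units)
```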
- Create and activate a new Conda environment:

```bash
conda create --name parrottts python=3.8.19
conda activate parrottts
```
- Install the required libraries:

```bash
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu125
```
Run the demo using the provided Jupyter notebook, `demo.ipynb`. The checkpoints were trained on the training data available from https://sites.google.com/view/limmits25/home?authuser=0
- The notebook will automatically download the following files from Google Drive and store them at the following locations:
  - `runs/aligner/symbol.pkl`: a dictionary that maps characters to tokens.
  - `runs/TTE/ckpt`: a model that converts character text tokens to HuBERT units.
  - `runs/vocoder/checkpoints`: a model that predicts speech from HuBERT units.
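The symbol dictionary can be inspected directly; a sketch of doing so, assuming `symbol.pkl` is a plain pickled `dict` mapping characters to integer IDs (the actual serialized format may differ):

```python
import pickle

def load_symbol_map(path="runs/aligner/symbol.pkl"):
    """Load the character-to-token dictionary saved by the aligner step.
    Assumes a plain pickled dict, which may not match the repo's exact format."""
    with open(path, "rb") as f:
        return pickle.load(f)

def encode(text, symbol_map):
    """Convert a string to integer tokens, skipping characters not in the map."""
    return [symbol_map[ch] for ch in text if ch in symbol_map]
```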
To train Parrot-TTS on your dataset, follow these steps (1-10):
- Update the `dataset_dir` path in `utils/aligner/aligner_preprocessor_config.yaml`. The `dataset_dir` contains one folder per speaker, and each speaker folder contains their `wavs` and `txt` files. The code cleans the text files per speaker, stores them separately, and computes the set of unique characters across all speakers. For non-English speakers, make sure to check the `do_transliteration` flag in `utils/aligner/aligner_preprocessor_config.yaml`.

```bash
python utils/aligner/preprocessor.py utils/aligner/aligner_preprocessor_config.yaml
```
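Before running the preprocessor, it can help to verify that `dataset_dir` follows the expected layout (one folder per speaker, each with `wavs` and `txt` subfolders). A small check based on the description above (illustrative, not part of the repository):

```python
import os

def check_dataset_layout(dataset_dir):
    """Return speaker folders that are missing a 'wavs' or 'txt' subfolder.
    Layout assumed: dataset_dir/<speaker>/{wavs,txt}/ per this README."""
    bad = []
    for speaker in sorted(os.listdir(dataset_dir)):
        sdir = os.path.join(dataset_dir, speaker)
        if not os.path.isdir(sdir):
            continue  # skip stray files at the top level
        if not (os.path.isdir(os.path.join(sdir, "wavs"))
                and os.path.isdir(os.path.join(sdir, "txt"))):
            bad.append(speaker)
    return bad
```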
- Update `base_dataset_dir` in `train.sh`. `base_dataset_dir` is the same as the `dataset_dir` used in Step 1.

```bash
bash utils/aligner/train.sh
```
- Download the HuBERT checkpoint and quantizer from this link and store them in `utils/hubert_extraction`. Note: you may need to clone and install fairseq to run this step. Then run the following command to extract HuBERT units:

```bash
python utils/hubert_extraction/extractor.py utils/hubert_extraction/hubert_config.yaml
```

- Note: HuBERT units have already been extracted for the corpus and are available at this Google Drive link. Download and save them at `runs/hubert_extraction`.
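Conceptually, the quantizer assigns each HuBERT frame feature to its nearest k-means centroid, producing one discrete unit per frame. A nearest-centroid sketch of that idea (illustrative only; the actual extraction is done by the fairseq-based extractor script above):

```python
def quantize(features, centroids):
    """Assign each feature vector to the index of its nearest centroid
    (squared Euclidean distance), yielding one discrete unit per frame.
    Pure-Python sketch of k-means assignment, not the fairseq code."""
    units = []
    for frame in features:
        dists = [sum((f - c) ** 2 for f, c in zip(frame, centroid))
                 for centroid in centroids]
        units.append(dists.index(min(dists)))
    return units
```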
- Prepare the necessary files for training the TTE module:

```bash
python utils/TTE/preprocessor.py utils/TTE/TTE_config.yaml
```
- Train the TTE module:

```bash
python train.py --config utils/TTE/TTE_config.yaml --num_gpus 1
```
- Run inference to predict HuBERT units with the trained TTE module:

```bash
python inference.py --config utils/TTE/TTE_config.yaml --checkpoint_pth runs/TTE/ckpt/parrot_model-step=50000-val_total_loss_step=0.00.ckpt --device cuda:2
```
- Generate the training and validation files for the vocoder:

```bash
python utils/vocoder/preprocessor.py --input_file runs/hubert_extraction/hubert.txt --root_path runs/vocoder
```
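Conceptually, this step partitions the utterance list in `hubert.txt` into training and validation subsets. A sketch of such a split (the repository's actual split logic, ratio, and file format may differ):

```python
import random

def split_units_file(lines, val_fraction=0.05, seed=0):
    """Shuffle utterance lines and split them into (train, val) lists.
    Illustrative only; ratio and shuffling are assumptions, not the
    repository's documented behavior."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    lines = list(lines)
    rng.shuffle(lines)
    n_val = max(1, int(len(lines) * val_fraction))
    return lines[n_val:], lines[:n_val]
```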
- Set the number of GPUs in the `nproc_per_node` variable and run:

```bash
CUDA_VISIBLE_DEVICES=1,2,3 python -m torch.distributed.run --nproc_per_node=3 utils/vocoder/train.py --checkpoint_path runs/vocoder/checkpoints --config utils/vocoder/config.json
```
- Run vocoder inference on the validation file:

```bash
python utils/vocoder/inference.py --checkpoint_file runs/vocoder/checkpoints -n 100 --vc --input_code_file runs/vocoder/val.txt --output_dir runs/vocoder/generations_vocoder
```
- Run vocoder inference on the predictions from the TTE module:

```bash
python utils/vocoder/inference.py --checkpoint_file runs/vocoder/checkpoints -n 100 --vc --input_code_file runs/TTE/predictions.txt --output_dir runs/vocoder/generations_tte
```
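Steps 6 and 10 together give end-to-end synthesis: the TTE module predicts units, then the vocoder renders them to audio. A small wrapper that assembles those two invocations (a sketch; the checkpoint path and `--device` value are placeholders you should adjust to your setup):

```python
import subprocess

def build_commands(tte_ckpt, device="cuda:0", out_dir="runs/vocoder/generations_tte"):
    """Return the two CLI invocations (TTE inference, then vocoder inference)
    as argv lists, mirroring the commands shown in the steps above."""
    tte = ["python", "inference.py",
           "--config", "utils/TTE/TTE_config.yaml",
           "--checkpoint_pth", tte_ckpt,   # placeholder: your trained checkpoint
           "--device", device]             # placeholder: pick your GPU
    voc = ["python", "utils/vocoder/inference.py",
           "--checkpoint_file", "runs/vocoder/checkpoints",
           "-n", "100", "--vc",
           "--input_code_file", "runs/TTE/predictions.txt",
           "--output_dir", out_dir]
    return tte, voc

def synthesize(tte_ckpt):
    """Run TTE inference, then vocode its predictions (requires the trained models)."""
    for cmd in build_commands(tte_ckpt):
        subprocess.run(cmd, check=True)
```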
This repository is developed using insights from: