I see that the w2v-conformer pre-trained model is trained on a multilingual dataset, but I haven't found a corresponding multilingual training recipe or script.
One question I've run into is how to choose the text modeling unit: should it be BPE, char, or something else?
w2v-conformer doesn't use any text information to compute the pre-training loss. However, to avoid changing the wenet training pipeline, you can fill in an arbitrary text unit such as 'A' for the multilingual wavs. A minimal sketch of that labeling step is shown below.
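Here is a rough sketch of what that might look like, assuming a Kaldi-style `wav.scp`/`text` layout for the data directory (the paths and file names are illustrative, not from the original thread):

```python
# Write a Kaldi-style `text` file mapping every utterance to the
# placeholder unit 'A', since the pre-training loss ignores transcripts.

def write_dummy_text(wav_scp_path, text_path, placeholder="A"):
    with open(wav_scp_path) as scp, open(text_path, "w") as out:
        for line in scp:
            fields = line.strip().split(maxsplit=1)
            if not fields:  # skip blank lines
                continue
            out.write(f"{fields[0]} {placeholder}\n")  # first field is the utterance id

write_dummy_text("data/train_multilingual/wav.scp",
                 "data/train_multilingual/text")
```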
For multilingual training, you can merge all the wavs into one dataset and balance the data across languages, following Facebook's XLSR model.
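XLSR balances languages with temperature-based sampling: a language with data fraction `n_l / N` is drawn with probability proportional to `(n_l / N) ** alpha`, which up-weights low-resource languages when `alpha < 1`. A small sketch (the language names and hour counts are made up for illustration):

```python
import numpy as np

def sampling_weights(hours_per_lang, alpha=0.5):
    """XLSR-style sampling probabilities over languages."""
    counts = np.array(list(hours_per_lang.values()), dtype=float)
    probs = (counts / counts.sum()) ** alpha  # temperature-smoothed fractions
    return dict(zip(hours_per_lang, probs / probs.sum()))

weights = sampling_weights({"en": 1000.0, "zh": 500.0, "sw": 20.0})
print(weights)  # low-resource 'sw' gets a larger share than its raw fraction
```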
For a wav2vec training example, see #1003.