Did you come to any conclusion?
I ran into this problem too. Because the GST reference encoder receives zero-padded input, the network can infer the duration of the reference audio from the amount of padding. On my dataset this meant that short lines were pronounced slowly and long lines quickly.
I tried using a one-dimensional convolution and masking the zeros before the GRU layer, but that made the tokens work worse.
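For anyone who wants to try the masking idea, here is a minimal NumPy sketch of a length-masked recurrent pass. It is not this repository's code: `masked_gru_last_state`, the simplified GRU cell (one bias vector, no separate recurrent bias) and the parameter layout are all assumptions made for illustration. The hidden state is updated only on valid frames, so the returned state does not depend on how much padding the batch added. In PyTorch, `torch.nn.utils.rnn.pack_padded_sequence` should serve the same purpose.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def masked_gru_last_state(x, lengths, params):
    """Run a simplified GRU and return the state at each sequence's last valid frame.

    x:       (batch, time, feat) zero-padded features
    lengths: (batch,) number of valid frames per sequence
    params:  hypothetical layout (W, U, b) with shapes
             (feat, 3*hid), (hid, 3*hid), (3*hid,)
    """
    W, U, b = params
    batch, time, _ = x.shape
    hid = U.shape[0]
    h = np.zeros((batch, hid))
    for t in range(time):
        gx = x[:, t] @ W + b
        gh = h @ U
        z = sigmoid(gx[:, :hid] + gh[:, :hid])                 # update gate
        r = sigmoid(gx[:, hid:2 * hid] + gh[:, hid:2 * hid])   # reset gate
        n = np.tanh(gx[:, 2 * hid:] + r * gh[:, 2 * hid:])     # candidate state
        h_new = (1.0 - z) * n + z * h
        # Freeze the state on padded frames so padding cannot change it.
        valid = (t < lengths)[:, None]
        h = np.where(valid, h_new, h)
    return h


# Usage: the same 5-frame sequence, with and without 4 frames of zero padding.
rng = np.random.default_rng(0)
feat, hid = 4, 3
params = (
    rng.normal(size=(feat, 3 * hid)),
    rng.normal(size=(hid, 3 * hid)),
    rng.normal(size=(3 * hid,)),
)
short = rng.normal(size=(1, 5, feat))
padded = np.concatenate([short, np.zeros((1, 4, feat))], axis=1)
same = np.allclose(
    masked_gru_last_state(short, np.array([5]), params),
    masked_gru_last_state(padded, np.array([5]), params),
)
```

Note that this only removes the padding's effect on the recurrent layer. If the convolutions in front of it run on padded frames, their outputs near the end of each sequence can still see the padding.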
How do we ensure that the padding of the reference mel spectrogram is accounted for when the reference encoder is applied to a batch of mels?
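One possible starting point is to track how many valid frames each reference mel has left after the convolution stack, and then mask or pack the recurrent layer with those lengths. The sketch below assumes a stack of six convolutions with kernel 3, stride 2 and padding 1 along the time axis, which may not match this repository's configuration. The function names are hypothetical, and the length formula is the standard one for a strided convolution without dilation.

```python
def conv_out_length(length, kernel=3, stride=2, padding=1):
    """Output length along time for one strided convolution (no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1


def reference_lengths(mel_lengths, num_layers=6, kernel=3, stride=2, padding=1):
    """Valid frame count per reference after the whole conv stack."""
    out = []
    for n in mel_lengths:
        for _ in range(num_layers):
            n = conv_out_length(n, kernel, stride, padding)
        out.append(n)
    return out


def frame_mask(lengths, max_len=None):
    """Boolean mask per sequence: True for valid frames, False for padding."""
    if max_len is None:
        max_len = max(lengths)
    return [[t < n for t in range(max_len)] for n in lengths]


# Usage: three reference mels with 64, 200 and 800 frames in one padded batch.
lengths_after_convs = reference_lengths([64, 200, 800])
mask = frame_mask(lengths_after_convs)
```

The resulting lengths can be passed to a packed or masked GRU so that the final state is read at the last valid frame rather than at the end of the padded batch. Padded frames may still leak into valid ones through the convolutions unless the features are re-masked between layers, so this is a partial fix at best.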