Differentiable Duration Modeling for End-to-End Text-to-Speech
Abstract: Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis. However, such models typically require external alignment models, which are not necessarily optimized for the decoder because they are not jointly trained. In this paper, we propose a differentiable duration method for learning monotonic alignments between input and output sequences. Our method is based on a soft-duration mechanism that optimizes a stochastic process in expectation. Using this differentiable duration method, we introduce a direct text-to-waveform TTS model that produces raw audio as output rather than relying on a separate neural vocoding step. Our model learns to perform high-fidelity speech synthesis through a combination of adversarial training and matching the total ground-truth duration. Experimental results show that our model obtains competitive results while enjoying a much simpler training pipeline.
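To make the soft-duration idea concrete, below is a minimal sketch of one common way to turn predicted per-token durations into a differentiable monotonic alignment: cumulative durations define an interval for each token along the output time axis, and a difference of sigmoids gives each output frame a soft, differentiable membership in its interval. The function name, the temperature parameter, and this exact construction are illustrative assumptions for this page, not the paper's precise formulation.

```python
import torch

def soft_alignment(durations: torch.Tensor, num_frames: int,
                   temperature: float = 10.0) -> torch.Tensor:
    """Soft monotonic alignment from per-token durations.

    durations: (batch, num_tokens) non-negative, differentiable durations
               measured in frames. Returns (batch, num_frames, num_tokens).
    """
    # Cumulative end position of each token's interval, and its start position.
    ends = torch.cumsum(durations, dim=-1)      # (B, T)
    starts = ends - durations                   # (B, T)
    # Frame centres 0.5, 1.5, ... so each frame sits strictly inside an interval.
    t = torch.arange(num_frames, device=durations.device, dtype=durations.dtype)
    t = (t + 0.5).view(1, -1, 1)                # (1, F, 1)
    # sigmoid(t - start) - sigmoid(t - end) is a smooth indicator of
    # start <= t < end, so each frame attends mostly to "its" token while
    # gradients still flow back into the predicted durations.
    left = torch.sigmoid(temperature * (t - starts.unsqueeze(1)))
    right = torch.sigmoid(temperature * (t - ends.unsqueeze(1)))
    return left - right                         # (B, F, T)

# Example: expand token-level features to frame-level features.
tokens = torch.randn(1, 5, 64)                       # (B, T, channels)
durs = torch.rand(1, 5, requires_grad=True) * 4 + 1  # predicted durations
weights = soft_alignment(durs, num_frames=30)        # (B, F, T)
frames = weights @ tokens                            # (B, F, channels)
```

Under this kind of construction, matching the total predicted duration to the ground-truth utterance length (as the abstract describes) gives the duration predictor a supervision signal even though per-token alignments remain latent.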
Audio Samples
Audio samples are taken from the LJ Speech data set [1].
Text: In the preceding chapter I have been tempted by the importance of the general question to give it prominence and precedence over the particular branch of which I am treating.
Ground Truth
HiFi-GAN+Mel
Tacotron 2
FastSpeech 2
AutoTTS
Text: but in a separate chamber, that belonging to one of the warders of the jail.
Ground Truth
HiFi-GAN+Mel
Tacotron 2
FastSpeech 2
AutoTTS
Text: the same callous indifference to the moral well-being of the prisoners, the same want of employment and of all disciplinary control.
Ground Truth
HiFi-GAN+Mel
Tacotron 2
FastSpeech 2
AutoTTS
Text: radical improvement was generally considered impossible. The great evil, however, had been sensibly diminished.
Ground Truth
HiFi-GAN+Mel
Tacotron 2
FastSpeech 2
AutoTTS
Text: The referral of the copy to local Secret Service should not delay the immediate referral of the information by the fastest available means of communication.