Le Maguer, Sébastien; Steiner, Ingmar; Hewer, Alexander

An HMM/DNN comparison for synchronized text-to-speech and tongue motion synthesis

Proc. Interspeech 2017, pp. 239-243, Stockholm, Sweden, 2017.

We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a statistical shape space model of the tongue surface to an articulatory speech corpus and training a speech synthesis system directly on the tongue model parameter weights. We focus our analysis on the application of two standard methodologies, based on Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs), respectively, to train both acoustic models and the tongue model parameter weights. We evaluate both methodologies at every step by comparing the predicted articulatory movements against the reference data. The results show that even with less than 2 hours of data, DNNs already outperform HMMs.