Publications

Le Maguer, Sébastien; Möbius, Bernd; Steiner, Ingmar

Toward the use of information density based descriptive features in HMM based speech synthesis Inproceedings

8th International Conference on Speech Prosody, pp. 1029-1033, Boston, MA, USA, 2016.
Over the last decades, acoustic modeling for speech synthesis has been improved significantly. However, in most systems, the descriptive feature set used to represent annotated text has been the same for many years. Specifically, the prosody models in most systems are based on low level information such as syllable stress or word part-of-speech tags. In this paper, we propose to enrich the descriptive feature set by adding a linguistic measure computed from the predictability of an event, such as the occurrence of a syllable or word. By adding such descriptive features, we assume that we will improve prosody modeling. This new feature set is then used to train prosody models for speech synthesis. Results from an evaluation study indicate a preference for the new descriptive feature set over the conventional one.

@inproceedings{LeMaguer2016SP,
title = {Toward the use of information density based descriptive features in HMM based speech synthesis},
author = {S{\'e}bastien Le Maguer and Bernd M{\"o}bius and Ingmar Steiner},
url = {https://www.researchgate.net/publication/305684951_Toward_the_use_of_information_density_based_descriptive_features_in_HMM_based_speech_synthesis},
year = {2016},
date = {2016},
booktitle = {8th International Conference on Speech Prosody},
pages = {1029-1033},
address = {Boston, MA, USA},
abstract = {

Over the last decades, acoustic modeling for speech synthesis has been improved significantly. However, in most systems, the descriptive feature set used to represent annotated text has been the same for many years. Specifically, the prosody models in most systems are based on low level information such as syllable stress or word part-of-speech tags. In this paper, we propose to enrich the descriptive feature set by adding a linguistic measure computed from the predictability of an event, such as the occurrence of a syllable or word. By adding such descriptive features, we assume that we will improve prosody modeling. This new feature set is then used to train prosody models for speech synthesis. Results from an evaluation study indicate a preference for the new descriptive feature set over the conventional one.
},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Projects:   C1 C5

Le Maguer, Sébastien; Möbius, Bernd; Steiner, Ingmar; Lolive, Damien

De l'utilisation de descripteurs issus de la linguistique computationnelle dans le cadre de la synthèse par HMM Inproceedings

Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP, AFCP - ATALA, pp. 714-722, Paris, France, 2016.

Durant les dernières décennies, la modélisation acoustique effectuée par les systèmes de synthèse de parole paramétrique a fait l’objet d’une attention particulière. Toutefois, dans la plupart des systèmes connus, l’ensemble des descripteurs linguistiques utilisés pour représenter le texte reste identique. Plus specifiquement, la modélisation de la prosodie reste guidée par des descripteurs de bas niveau comme l’information d’accentuation de la syllabe ou bien l’étiquette grammaticale du mot. Dans cet article, nous proposons d’intégrer des informations basées sur la prédictibilité d’un évènement (la syllabe ou le mot). Plusieurs études indiquent une corrélation forte entre cette mesure, fortement présente dans la linguistique computationnelle, et certaines spécificités lors de la production humaine de la parole. Notre hypothèse est donc que l’ajout de ces descripteurs améliore la modélisation de la prosodie. Cet article se focalise sur une analyse objective de l’apport de ces descripteurs sur la synthèse HMM pour la langue anglaise et française.

@inproceedings{Lemaguer/etal:2016b,
title = {De l'utilisation de descripteurs issus de la linguistique computationnelle dans le cadre de la synthèse par HMM},
author = {S{\'e}bastien Le Maguer and Bernd M{\"o}bius and Ingmar Steiner and Damien Lolive},
url = {https://aclanthology.org/2016.jeptalnrecital-jep.80},
year = {2016},
date = {2016},
booktitle = {Actes de la conf{\'e}rence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP},
pages = {714-722},
publisher = {AFCP - ATALA},
address = {Paris, France},
abstract = {Durant les dernières d{\'e}cennies, la mod{\'e}lisation acoustique effectu{\'e}e par les systèmes de synthèse de parole param{\'e}trique a fait l’objet d’une attention particulière. Toutefois, dans la plupart des systèmes connus, l’ensemble des descripteurs linguistiques utilis{\'e}s pour repr{\'e}senter le texte reste identique. Plus specifiquement, la mod{\'e}lisation de la prosodie reste guid{\'e}e par des descripteurs de bas niveau comme l’information d’accentuation de la syllabe ou bien l’{\'e}tiquette grammaticale du mot. Dans cet article, nous proposons d’int{\'e}grer des informations bas{\'e}es sur la pr{\'e}dictibilit{\'e} d’un {\'e}vènement (la syllabe ou le mot). Plusieurs {\'e}tudes indiquent une corr{\'e}lation forte entre cette mesure, fortement pr{\'e}sente dans la linguistique computationnelle, et certaines sp{\'e}cificit{\'e}s lors de la production humaine de la parole. Notre hypothèse est donc que l’ajout de ces descripteurs am{\'e}liore la mod{\'e}lisation de la prosodie. Cet article se focalise sur une analyse objective de l’apport de ces descripteurs sur la synthèse HMM pour la langue anglaise et française.

},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Projects:   C1 C5

Schulz, Erika; Malisz, Zofia; Andreeva, Bistra; Möbius, Bernd

Einfluss von Informationsdichte und prosodischer Struktur auf Vokalraumausdehnung Inproceedings

Phonetik und Phonologie 11, Marburg, 2015.

Vokalraumausdehnung wird von mehreren Faktoren bestimmt, z. B. von Geschlecht (Simpson und Ericsdotter 2007), Sprechstil (Bradlow, Kraus und Hayes 2003), Prosodie (Bergem 1993), Sprechgeschwindigkeit (Weirich und Simpson 2014) oder phonologischer Nachbarschaftsdichte (Munson und Solomon 2004). Auch Sprachredundanz kann als Prädiktor spektraler Ausprägung von Vokalen dienen (Aylett und Turk 2006). Diese Studie untersucht den Einfluss von Informationsdichte und prosodischen Strukturen auf Vokalraumausdehnung in Französisch, Deutsch, Amerikanischem Englisch und Finnisch.

@inproceedings{pundp11,
title = {Einfluss von Informationsdichte und prosodischer Struktur auf Vokalraumausdehnung},
author = {Erika Schulz and Zofia Malisz and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://www.online.uni-marburg.de/pundp11/talks/Schulz_etal.pdf},
year = {2015},
date = {2015},
booktitle = {Phonetik und Phonologie 11},
address = {Marburg},
abstract = {Vokalraumausdehnung wird von mehreren Faktoren bestimmt, z. B. von Geschlecht (Simpson und Ericsdotter 2007), Sprechstil (Bradlow, Kraus und Hayes 2003), Prosodie (Bergem 1993), Sprechgeschwindigkeit (Weirich und Simpson 2014) oder phonologischer Nachbarschaftsdichte (Munson und Solomon 2004). Auch Sprachredundanz kann als Pr{\"a}diktor spektraler Auspr{\"a}gung von Vokalen dienen (Aylett und Turk 2006). Diese Studie untersucht den Einfluss von Informationsdichte und prosodischen Strukturen auf Vokalraumausdehnung in Franz{\"o}sisch, Deutsch, Amerikanischem Englisch und Finnisch.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   C1

Malisz, Zofia; Schulz, Erika; Oh, Yoon Mi; Andreeva, Bistra; Möbius, Bernd

Dimensions of segmental variability: relationships between information density and prosodic structure Inproceedings

Workshop "Modeling variability in speech", Stuttgart, 2015.

Contextual predictability variation affects phonological and phonetic structure. Reduction and expansion of acoustic-phonetic features is also characteristic of prosodic variability. In this study, we assess the impact of surprisal and prosodic structure on phonetic encoding, both independently of each other and in interaction. We model segmental duration, vowel space size and spectral characteristics of vowels and consonants as a function of surprisal as well as of syllable prominence, phrase boundary, and speech rate. Correlates of phonetic encoding density are extracted from a subset of the BonnTempo corpus for six languages: American English, Czech, Finnish, French, German, and Polish. Surprisal is estimated from segmental n-gram language models trained on large text corpora. Our findings are generally compatible with a weak version of Aylett and Turk’s Smooth Signal Redundancy hypothesis, suggesting that prosodic structure mediates between the requirements of efficient communication and the speech signal. However, this mediation is not perfect, as we found evidence for additional, direct effects of changes in surprisal on the phonetic structure of utterances. These effects appear to be stable across different speech rates.

@inproceedings{malisz15,
title = {Dimensions of segmental variability: relationships between information density and prosodic structure},
author = {Zofia Malisz and Erika Schulz and Yoon Mi Oh and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://www.frontiersin.org/articles/10.3389/fcomm.2018.00025/full},
year = {2015},
date = {2015},
booktitle = {Workshop "Modeling variability in speech"},
address = {Stuttgart},
abstract = {

Contextual predictability variation affects phonological and phonetic structure. Reduction and expansion of acoustic-phonetic features is also characteristic of prosodic variability. In this study, we assess the impact of surprisal and prosodic structure on phonetic encoding, both independently of each other and in interaction. We model segmental duration, vowel space size and spectral characteristics of vowels and consonants as a function of surprisal as well as of syllable prominence, phrase boundary, and speech rate. Correlates of phonetic encoding density are extracted from a subset of the BonnTempo corpus for six languages: American English, Czech, Finnish, French, German, and Polish. Surprisal is estimated from segmental n-gram language models trained on large text corpora. Our findings are generally compatible with a weak version of Aylett and Turk's Smooth Signal Redundancy hypothesis, suggesting that prosodic structure mediates between the requirements of efficient communication and the speech signal. However, this mediation is not perfect, as we found evidence for additional, direct effects of changes in surprisal on the phonetic structure of utterances. These effects appear to be stable across different speech rates.
},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   C1

Successfully