Publications

Raveh, Eran; Steiner, Ingmar; Gessinger, Iona; Möbius, Bernd

Studying Mutual Phonetic Influence With a Web-Based Spoken Dialogue System Inproceedings

20th International Conference on Speech and Computer (SPECOM), Leipzig, Germany, 2018.

This paper presents a study on mutual speech variation influences in a human-computer setting. The study highlights behavioral patterns in data collected as part of a shadowing experiment, and is performed using a novel end-to-end platform for studying phonetic variation in dialogue. It includes a spoken dialogue system capable of detecting and tracking the state of phonetic features in the user’s speech and adapting accordingly. It provides visual and numeric representations of the changes in real time, offering a high degree of customization, and can be used for simulating or reproducing speech variation scenarios. The replicated experiment presented in this paper along with the analysis of the relationship between the human and non-human interlocutors lays the groundwork for a spoken dialogue system with personalized speaking style, which we expect will improve the naturalness and efficiency of human-computer interaction.
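
As a reading aid for the adaptation idea described above, here is a minimal sketch (the class name FeatureTracker, its methods, and the update rule are hypothetical illustrations, not the system's actual implementation): the system keeps a running estimate of a scalar phonetic feature observed in the user's speech and shifts its own target realization part of the way toward that estimate.

# Minimal illustration (not the paper's implementation): track a scalar
# phonetic feature, e.g. a vowel's first formant in Hz, across user turns
# and nudge the system's own target toward the user's running average.

class FeatureTracker:
    def __init__(self, system_default, adaptation_rate=0.3):
        self.system_default = system_default    # the system's preferred value
        self.user_estimate = None               # running estimate of the user
        self.adaptation_rate = adaptation_rate  # 0 = never adapt, 1 = copy the user

    def observe(self, user_value, smoothing=0.5):
        # exponential smoothing of the user's observed realizations
        if self.user_estimate is None:
            self.user_estimate = user_value
        else:
            self.user_estimate = (smoothing * user_value
                                  + (1 - smoothing) * self.user_estimate)

    def target(self):
        # value the system realizes in its next utterance
        if self.user_estimate is None:
            return self.system_default
        return (self.system_default
                + self.adaptation_rate * (self.user_estimate - self.system_default))

tracker = FeatureTracker(system_default=450.0)  # e.g. F1 in Hz (made-up value)
for observed in (520.0, 540.0, 530.0):          # user's realizations per turn
    tracker.observe(observed)
    print(round(tracker.target(), 1))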

@inproceedings{Raveh2018SPECOM,
title = {Studying Mutual Phonetic Influence With a Web-Based Spoken Dialogue System},
author = {Eran Raveh and Ingmar Steiner and Iona Gessinger and Bernd M{\"o}bius},
url = {https://arxiv.org/abs/1809.04945},
year = {2018},
date = {2018},
booktitle = {20th International Conference on Speech and Computer (SPECOM)},
address = {Leipzig, Germany},
abstract = {This paper presents a study on mutual speech variation influences in a human-computer setting. The study highlights behavioral patterns in data collected as part of a shadowing experiment, and is performed using a novel end-to-end platform for studying phonetic variation in dialogue. It includes a spoken dialogue system capable of detecting and tracking the state of phonetic features in the user's speech and adapting accordingly. It provides visual and numeric representations of the changes in real time, offering a high degree of customization, and can be used for simulating or reproducing speech variation scenarios. The replicated experiment presented in this paper along with the analysis of the relationship between the human and non-human interlocutors lays the groundwork for a spoken dialogue system with personalized speaking style, which we expect will improve the naturalness and efficiency of human-computer interaction.},
pubstate = {published},
type = {inproceedings}
}

Project:   C5

Gessinger, Iona; Raveh, Eran; Möbius, Bernd; Steiner, Ingmar

Phonetic Accommodation in HCI: Introducing a Wizard-of-Oz Experiment Inproceedings

Phonetik & Phonologie 14, Vienna, Austria, 2018.

This paper discusses phonetic accommodation of 20 native German speakers interacting with the simulated spoken dialogue system Mirabella in a Wizard-of-Oz experiment. The study examines intonation of wh-questions and pronunciation of allophonic contrasts in German. In a question-and-answer exchange with the system, the users produce predominantly falling intonation patterns for wh-questions when the system does so as well. The number of rising patterns on the part of the users increases significantly when Mirabella produces questions with rising intonation. In a map task, Mirabella provides information about hidden items while producing variants of two allophonic contrasts which are dispreferred by the users. For the [ɪç] vs. [ɪk] contrast in the suffix ⟨-ig⟩, the number of dispreferred variants on the part of the users increases significantly during the map task. For the [ɛː] vs. [eː] contrast as a realization of stressed ⟨-ä-⟩, such a convergence effect is not found on the group level, yet still occurs for some individual users. Almost every user converges to the system to a substantial degree for a subset of the examined features, but we also find maintenance of preferred variants and even occasional divergence. This individual variation is in line with previous findings in accommodation research.

@inproceedings{Gessinger2018PuP,
title = {Phonetic Accommodation in HCI: Introducing a Wizard-of-Oz Experiment},
author = {Iona Gessinger and Eran Raveh and Bernd M{\"o}bius and Ingmar Steiner},
url = {https://www.coli.uni-saarland.de/~moebius/documents/gessinger_etal_is2019.pdf},
year = {2018},
date = {2018-09-06},
booktitle = {Phonetik & Phonologie 14},
address = {Vienna, Austria},
abstract = {This paper discusses phonetic accommodation of 20 native German speakers interacting with the simulated spoken dialogue system Mirabella in a Wizard-of-Oz experiment. The study examines intonation of wh-questions and pronunciation of allophonic contrasts in German. In a question-and-answer exchange with the system, the users produce predominantly falling intonation patterns for wh-questions when the system does so as well. The number of rising patterns on the part of the users increases significantly when Mirabella produces questions with rising intonation. In a map task, Mirabella provides information about hidden items while producing variants of two allophonic contrasts which are dispreferred by the users. For the [ɪç] vs. [ɪk] contrast in the suffix ⟨-ig⟩, the number of dispreferred variants on the part of the users increases significantly during the map task. For the [ɛː] vs. [eː] contrast as a realization of stressed ⟨-ä-⟩, such a convergence effect is not found on the group level, yet still occurs for some individual users. Almost every user converges to the system to a substantial degree for a subset of the examined features, but we also find maintenance of preferred variants and even occasional divergence. This individual variation is in line with previous findings in accommodation research.},
pubstate = {published},
type = {inproceedings}
}

Project:   C5

Gessinger, Iona; Schweitzer, Antje; Andreeva, Bistra; Raveh, Eran; Möbius, Bernd; Steiner, Ingmar

Convergence of Pitch Accents in a Shadowing Task Inproceedings

Proceedings of the 9th International Conference on Speech Prosody, Speech Prosody Special Interest Group, pp. 225-229, Poznań, Poland, 2018.

In the present study, a corpus of short German sentences collected in a shadowing task was examined with respect to pitch accent realization. The pitch accents were parameterized with the PaIntE model, which describes the f0 contour of intonation events concerning their height, slope, and temporal alignment. Convergence was quantified as decrease in Euclidean distance, and hence increase in similarity, between the PaIntE parameter vectors. This was assessed for three stimulus types: natural speech, diphone based speech synthesis, or HMM based speech synthesis. The factors tested in the analysis were experimental phase – was the sentence uttered before or while shadowing the model, accent type – a distinction was made between prenuclear and nuclear pitch accents, and sex of speaker and shadowed model. For the natural and HMM stimuli, Euclidean distance decreased in the shadowing task. This convergence effect did not depend on the accent type. However, prenuclear pitch accents showed generally lower values in Euclidean distance than nuclear pitch accents. Whether the sex of the speaker and the shadowed model matched did not explain any variance in the data. For the diphone stimuli, no convergence of pitch accents was observed.
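
To make the convergence measure concrete, here is a minimal sketch with made-up numbers (the helper name euclidean_distance and the vectors are illustrative assumptions; PaIntE parameters are simply treated as fixed-length numeric vectors): convergence holds when the distance to the model speaker is smaller while shadowing than before shadowing.

import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two PaIntE-style parameter vectors."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Hypothetical PaIntE-style vectors (e.g. peak height, slopes, alignment values)
model    = [180.0, 60.0, 40.0, 0.4, 0.6]   # pitch accent of the model speaker
baseline = [150.0, 30.0, 55.0, 0.6, 0.8]   # same accent, speaker before shadowing
shadowed = [165.0, 45.0, 48.0, 0.5, 0.7]   # same accent, speaker while shadowing

d_before = euclidean_distance(baseline, model)
d_during = euclidean_distance(shadowed, model)
print(d_before, d_during, "convergence" if d_during < d_before else "no convergence")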

@inproceedings{Gessinger2018SP,
title = {Convergence of Pitch Accents in a Shadowing Task},
author = {Iona Gessinger and Antje Schweitzer and Bistra Andreeva and Eran Raveh and Bernd M{\"o}bius and Ingmar Steiner},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/29618},
doi = {https://doi.org/10.21437/SpeechProsody.2018-46},
year = {2018},
date = {2018},
booktitle = {Proceedings of the 9th International Conference on Speech Prosody},
pages = {225-229},
publisher = {Speech Prosody Special Interest Group},
address = {Pozna{\'n}, Poland},
abstract = {In the present study, a corpus of short German sentences collected in a shadowing task was examined with respect to pitch accent realization. The pitch accents were parameterized with the PaIntE model, which describes the f0 contour of intonation events concerning their height, slope, and temporal alignment. Convergence was quantified as decrease in Euclidean distance, and hence increase in similarity, between the PaIntE parameter vectors. This was assessed for three stimulus types: natural speech, diphone based speech synthesis, or HMM based speech synthesis. The factors tested in the analysis were experimental phase - was the sentence uttered before or while shadowing the model, accent type - a distinction was made between prenuclear and nuclear pitch accents, and sex of speaker and shadowed model. For the natural and HMM stimuli, Euclidean distance decreased in the shadowing task. This convergence effect did not depend on the accent type. However, prenuclear pitch accents showed generally lower values in Euclidean distance than nuclear pitch accents. Whether the sex of the speaker and the shadowed model matched did not explain any variance in the data. For the diphone stimuli, no convergence of pitch accents was observed.},
pubstate = {published},
type = {inproceedings}
}

Project:   C5

Steiner, Ingmar; Le Maguer, Sébastien

Creating New Language and Voice Components for the Updated MaryTTS Text-to-Speech Synthesis Platform Inproceedings

11th Language Resources and Evaluation Conference (LREC), pp. 3171-3175, Miyazaki, Japan, 2018.

We present a new workflow to create components for the MaryTTS text-to-speech synthesis platform, which is popular with researchers and developers, extending it to support new languages and custom synthetic voices. This workflow replaces the previous toolkit with an efficient, flexible process that leverages modern build automation and cloud-hosted infrastructure. Moreover, it is compatible with the updated MaryTTS architecture, enabling new features and state-of-the-art paradigms such as synthesis based on deep neural networks (DNNs). Like MaryTTS itself, the new tools are free, open source software (FOSS), and promote the use of open data.

@inproceedings{Steiner2018LREC,
title = {Creating New Language and Voice Components for the Updated MaryTTS Text-to-Speech Synthesis Platform},
author = {Ingmar Steiner and S{\'e}bastien Le Maguer},
url = {https://arxiv.org/abs/1712.04787},
year = {2018},
date = {2018-05-10},
booktitle = {11th Language Resources and Evaluation Conference (LREC)},
pages = {3171-3175},
address = {Miyazaki, Japan},
abstract = {We present a new workflow to create components for the MaryTTS text-to-speech synthesis platform, which is popular with researchers and developers, extending it to support new languages and custom synthetic voices. This workflow replaces the previous toolkit with an efficient, flexible process that leverages modern build automation and cloud-hosted infrastructure. Moreover, it is compatible with the updated MaryTTS architecture, enabling new features and state-of-the-art paradigms such as synthesis based on deep neural networks (DNNs). Like MaryTTS itself, the new tools are free, open source software (FOSS), and promote the use of open data.},
pubstate = {published},
type = {inproceedings}
}

Project:   C5

Steiner, Ingmar; Le Maguer, Sébastien; Hewer, Alexander

Synthesis of Tongue Motion and Acoustics from Text using a Multimodal Articulatory Database Journal Article

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12), pp. 2351-2361, 2017.

We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a 3D model of the tongue surface to an articulatory dataset and training a statistical parametric speech synthesis system directly on the tongue model parameters. We evaluate the model at every step by comparing the spatial coordinates of predicted articulatory movements against the reference data. The results indicate a global mean Euclidean distance of less than 2.8 mm, and our approach can be adapted to add an articulatory modality to conventional TTS applications without the need for extra data.
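
The reported evaluation criterion (a global mean Euclidean distance below 2.8 mm) amounts to averaging point-wise 3D distances between predicted and reference coordinates. A minimal sketch follows; the helper name mean_euclidean_distance_mm, the array shapes, and the values are assumptions for illustration, not the paper's data format.

import numpy as np

def mean_euclidean_distance_mm(predicted, reference):
    """Mean point-wise Euclidean distance between (frames, points, 3) arrays in mm."""
    diff = np.asarray(predicted) - np.asarray(reference)
    return float(np.linalg.norm(diff, axis=-1).mean())

# Hypothetical data: 2 frames, 3 tongue-surface points, xyz coordinates in mm
reference = np.array([[[0.0, 0.0, 0.0], [10.0, 5.0, 2.0], [20.0, 8.0, 4.0]],
                      [[0.5, 0.2, 0.1], [10.5, 5.2, 2.1], [20.4, 8.1, 4.2]]])
predicted = reference + 1.5  # constant 1.5 mm offset per axis for illustration

print(round(mean_euclidean_distance_mm(predicted, reference), 2))  # ~2.6 mm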

@article{Steiner2017TASLP,
title = {Synthesis of Tongue Motion and Acoustics from Text using a Multimodal Articulatory Database},
author = {Ingmar Steiner and S{\'e}bastien Le Maguer and Alexander Hewer},
url = {https://arxiv.org/abs/1612.09352},
year = {2017},
date = {2017},
journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
pages = {2351-2361},
volume = {25},
number = {12},
abstract = {We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a 3D model of the tongue surface to an articulatory dataset and training a statistical parametric speech synthesis system directly on the tongue model parameters. We evaluate the model at every step by comparing the spatial coordinates of predicted articulatory movements against the reference data. The results indicate a global mean Euclidean distance of less than 2.8 mm, and our approach can be adapted to add an articulatory modality to conventional TTS applications without the need for extra data.},
pubstate = {published},
type = {article}
}

Project:   C5

Le Maguer, Sébastien; Steiner, Ingmar

The "Uprooted" MaryTTS Entry for the Blizzard Challenge 2017 Inproceedings

Blizzard Challenge, Stockholm, Sweden, 2017.

The MaryTTS system is a modular text-to-speech (TTS) system which has been developed for nearly 20 years. This paper describes the MaryTTS entry for the Blizzard Challenge 2017. In contrast to last year’s MaryTTS system, based on a unit selection baseline using the latest stable MaryTTS version, the basis for this year’s system is a new, experimental version with a completely redesigned architecture.

@inproceedings{LeMaguer2017BC,
title = {The "Uprooted" MaryTTS Entry for the Blizzard Challenge 2017},
author = {S{\'e}bastien Le Maguer and Ingmar Steiner},
url = {http://mary.dfki.de/documentation/publications/index.html},
year = {2017},
date = {2017},
booktitle = {Blizzard Challenge},
address = {Stockholm, Sweden},
abstract = {The MaryTTS system is a modular text-to-speech (TTS) system which has been developed for nearly 20 years. This paper describes the MaryTTS entry for the Blizzard Challenge 2017. In contrast to last year’s MaryTTS system, based on a unit selection baseline using the latest stable MaryTTS version, the basis for this year’s system is a new, experimental version with a completely redesigned architecture.},
pubstate = {published},
type = {inproceedings}
}

Project:   C5

Gessinger, Iona; Raveh, Eran; Le Maguer, Sébastien; Möbius, Bernd; Steiner, Ingmar

Shadowing Synthesized Speech - Segmental Analysis of Phonetic Convergence Inproceedings

Interspeech, pp. 3797-3801, Stockholm, Sweden, 2017.

To shed light on the question whether humans converge phonetically to synthesized speech, a shadowing experiment was conducted using three different types of stimuli – natural speaker, diphone synthesis, and HMM synthesis. Three segment-level phonetic features of German that are well-known to vary across native speakers were examined. The first feature triggered convergence in roughly one third of the cases for all stimulus types. The second feature showed generally a small amount of convergence, which may be due to the nature of the feature itself. Still the effect was strongest for the natural stimuli, followed by the HMM stimuli and weakest for the diphone stimuli. The effect of the third feature was clearly observable for the natural stimuli and less pronounced in the synthetic stimuli. This is presumably a result of the partly insufficient perceptibility of this target feature in the synthetic stimuli and demonstrates the necessity of gaining fine-grained control over the synthesis output, should it be intended to implement capabilities of phonetic convergence on the segmental level in spoken dialogue systems.

@inproceedings{Gessinger2017IS,
title = {Shadowing Synthesized Speech - Segmental Analysis of Phonetic Convergence},
author = {Iona Gessinger and Eran Raveh and S{\'e}bastien Le Maguer and Bernd M{\"o}bius and Ingmar Steiner},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/29623},
year = {2017},
date = {2017},
booktitle = {Interspeech},
pages = {3797-3801},
address = {Stockholm, Sweden},
abstract = {To shed light on the question whether humans converge phonetically to synthesized speech, a shadowing experiment was conducted using three different types of stimuli – natural speaker, diphone synthesis, and HMM synthesis. Three segment-level phonetic features of German that are well-known to vary across native speakers were examined. The first feature triggered convergence in roughly one third of the cases for all stimulus types. The second feature showed generally a small amount of convergence, which may be due to the nature of the feature itself. Still the effect was strongest for the natural stimuli, followed by the HMM stimuli and weakest for the diphone stimuli. The effect of the third feature was clearly observable for the natural stimuli and less pronounced in the synthetic stimuli. This is presumably a result of the partly insufficient perceptibility of this target feature in the synthetic stimuli and demonstrates the necessity of gaining fine-grained control over the synthesis output, should it be intended to implement capabilities of phonetic convergence on the segmental level in spoken dialogue systems},
pubstate = {published},
type = {inproceedings}
}

Project:   C5

Le Maguer, Sébastien; Steiner, Ingmar; Hewer, Alexander

An HMM/DNN comparison for synchronized text-to-speech and tongue motion synthesis Inproceedings

Proc. Interspeech 2017, pp. 239-243, Stockholm, Sweden, 2017.

We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a statistical shape space model of the tongue surface to an articulatory speech corpus and training a speech synthesis system directly on the tongue model parameter weights. We focus our analysis on the application of two standard methodologies, based on Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs), respectively, to train both acoustic models and the tongue model parameter weights. We evaluate both methodologies at every step by comparing the predicted articulatory movements against the reference data. The results show that even with less than 2h of data, DNNs already outperform HMMs.

@inproceedings{LeMaguer2017IS,
title = {An HMM/DNN comparison for synchronized text-to-speech and tongue motion synthesis},
author = {S{\'e}bastien Le Maguer and Ingmar Steiner and Alexander Hewer},
url = {https://www.isca-speech.org/archive/interspeech_2017/maguer17_interspeech.html},
doi = {https://doi.org/10.21437/Interspeech.2017-936},
year = {2017},
date = {2017},
booktitle = {Proc. Interspeech 2017},
pages = {239-243},
address = {Stockholm, Sweden},
abstract = {We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a statistical shape space model of the tongue surface to an articulatory speech corpus and training a speech synthesis system directly on the tongue model parameter weights. We focus our analysis on the application of two standard methodologies, based on Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs), respectively, to train both acoustic models and the tongue model parameter weights. We evaluate both methodologies at every step by comparing the predicted articulatory movements against the reference data. The results show that even with less than 2h of data, DNNs already outperform HMMs.},
pubstate = {published},
type = {inproceedings}
}

Project:   C5

Raveh, Eran; Gessinger, Iona; Le Maguer, Sébastien; Möbius, Bernd; Steiner, Ingmar

Investigating Phonetic Convergence in a Shadowing Experiment with Synthetic Stimuli Inproceedings

Trouvain, Jürgen; Steiner, Ingmar; Möbius, Bernd (Ed.): 28th Conference on Electronic Speech Signal Processing (ESSV), pp. 254-261, Saarbrücken, Germany, 2017.

This paper presents a shadowing experiment with synthetic stimuli, whose goal is to investigate phonetic convergence in a human-computer interaction paradigm. Comparisons to the results of a previous experiment with natural stimuli are made. The process of generating the synthetic stimuli, which are based on the natural ones, is described as well.

@inproceedings{Raveh2017ESSV,
title = {Investigating Phonetic Convergence in a Shadowing Experiment with Synthetic Stimuli},
author = {Eran Raveh and Iona Gessinger and S{\'e}bastien Le Maguer and Bernd M{\"o}bius and Ingmar Steiner},
editor = {J{\"u}rgen Trouvain and Ingmar Steiner and Bernd M{\"o}bius},
url = {https://www.semanticscholar.org/paper/Investigating-Phonetic-Convergence-in-a-Shadowing-Raveh-Gessinger/c296fb0e3ad53cd690a2845827c762046fce2bbe},
year = {2017},
date = {2017},
booktitle = {28th Conference on Electronic Speech Signal Processing (ESSV)},
pages = {254-261},
address = {Saarbr{\"u}cken, Germany},
abstract = {This paper presents a shadowing experiment with synthetic stimuli, whose goal is to investigate phonetic convergence in a human-computer interaction paradigm. Comparisons to the results of a previous experiment with natural stimuli are made. The process of generating the synthetic stimuli, which are based on the natural ones, is described as well.},
pubstate = {published},
type = {inproceedings}
}

Project:   C5

Steiner, Ingmar; Le Maguer, Sébastien; Manzoni, Judith; Gilles, Peter; Trouvain, Jürgen

Developing new language tools for MaryTTS: the case of Luxembourgish Inproceedings

Trouvain, Jürgen; Steiner, Ingmar; Möbius, Bernd (Ed.): 28th Conference on Electronic Speech Signal Processing (ESSV), pp. 186-192, Saarbrücken, Germany, 2017.

We present new methods and resources which have been used to create a text to speech (TTS) synthesis system for the Luxembourgish language. The system uses the MaryTTS platform, which is extended with new natural language processing (NLP) components. We designed and recorded a multilingual, phonetically balanced speech corpus, and used it to build a new Luxembourgish synthesis voice. All speech data and software has been published under an open-source license and is freely available online.

@inproceedings{Steiner2017ESSVb,
title = {Developing new language tools for MaryTTS: the case of Luxembourgish},
author = {Ingmar Steiner and S{\'e}bastien Le Maguer and Judith Manzoni and Peter Gilles and J{\"u}rgen Trouvain},
editor = {J{\"u}rgen Trouvain and Ingmar Steiner and Bernd M{\"o}bius},
url = {https://www.semanticscholar.org/paper/THE-CASE-OF-LUXEMBOURGISH-Steiner-Maguer/7ca34b3c6460008c013a6ac799336a5f30fc9878},
year = {2017},
date = {2017},
booktitle = {28th Conference on Electronic Speech Signal Processing (ESSV)},
pages = {186-192},
address = {Saarbr{\"u}cken, Germany},
abstract = {We present new methods and resources which have been used to create a text to speech (TTS) synthesis system for the Luxembourgish language. The system uses the MaryTTS platform, which is extended with new natural language processing (NLP) components. We designed and recorded a multilingual, phonetically balanced speech corpus, and used it to build a new Luxembourgish synthesis voice. All speech data and software has been published under an open-source license and is freely available online.},
pubstate = {published},
type = {inproceedings}
}

Project:   C5

Le Maguer, Sébastien; Steiner, Ingmar

Uprooting MaryTTS: Agile Processing and Voicebuilding Inproceedings

Trouvain, Jürgen; Steiner, Ingmar; Möbius, Bernd (Ed.): 28th Conference on Electronic Speech Signal Processing (ESSV), pp. 152-159, Saarbrücken, Germany, 2017.

MaryTTS is a modular speech synthesis system whose development started around 2003. The system is open-source and has grown significantly thanks to the contribution of the community. However, the drawback is an increase in the complexity of the system. This complexity has now reached a stage where the system is complicated to analyze and maintain. The current paper presents the new architecture of the MaryTTS system. This architecture aims to simplify the maintenance but also to provide more flexibility in the use of the system. To achieve this goal we have completely redesigned the core of the system using the structure ROOTS. We also have changed the module sequence logic to make the system more consistent with the designer. Finally, the voicebuilding has been redesigned to follow a continuous delivery methodology. All of these changes lead to more accurate development of the system and therefore more consistent results in its use.

@inproceedings{LeMaguer2017ESSV,
title = {Uprooting MaryTTS: Agile Processing and Voicebuilding},
author = {S{\'e}bastien Le Maguer and Ingmar Steiner},
editor = {J{\"u}rgen Trouvain and Ingmar Steiner and Bernd M{\"o}bius},
url = {https://www.essv.de/paper.php?id=232},
year = {2017},
date = {2017-03-15},
booktitle = {28th Conference on Electronic Speech Signal Processing (ESSV)},
pages = {152-159},
address = {Saarbr{\"u}cken, Germany},
abstract = {MaryTTS is a modular speech synthesis system whose development started around 2003. The system is open-source and has grown significantly thanks to the contribution of the community. However, the drawback is an increase in the complexity of the system. This complexity has now reached a stage where the system is complicated to analyze and maintain. The current paper presents the new architecture of the MaryTTS system. This architecture aims to simplify the maintenance but also to provide more flexibility in the use of the system. To achieve this goal we have completely redesigned the core of the system using the structure ROOTS. We also have changed the module sequence logic to make the system more consistent with the designer. Finally, the voicebuilding has been redesigned to follow a continuous delivery methodology. All of these changes lead to more accurate development of the system and therefore more consistent results in its use.},
pubstate = {published},
type = {inproceedings}
}

Project:   C5

Le Maguer, Sébastien; Möbius, Bernd; Steiner, Ingmar

Toward the use of information density based descriptive features in HMM based speech synthesis Inproceedings

8th International Conference on Speech Prosody, pp. 1029-1033, Boston, MA, USA, 2016.

Over the last decades, acoustic modeling for speech synthesis has been improved significantly. However, in most systems, the descriptive feature set used to represent annotated text has been the same for many years. Specifically, the prosody models in most systems are based on low level information such as syllable stress or word part-of-speech tags. In this paper, we propose to enrich the descriptive feature set by adding a linguistic measure computed from the predictability of an event, such as the occurrence of a syllable or word. By adding such descriptive features, we assume that we will improve prosody modeling. This new feature set is then used to train prosody models for speech synthesis. Results from an evaluation study indicate a preference for the new descriptive feature set over the conventional one.
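
The "predictability of an event" mentioned above is commonly operationalized as surprisal, i.e. the negative log-probability of a unit given its context. The following sketch estimates bigram surprisal from raw counts for illustration only; the helper name bigram_surprisal and the toy text are assumptions, not the feature extraction pipeline used in the paper.

import math
from collections import Counter

def bigram_surprisal(words):
    """Surprisal -log2 P(w_i | w_{i-1}) per word, estimated from the text itself."""
    unigrams = Counter(words[:-1])
    bigrams = Counter(zip(words[:-1], words[1:]))
    scores = []
    for prev, cur in zip(words[:-1], words[1:]):
        p = bigrams[(prev, cur)] / unigrams[prev]
        scores.append(-math.log2(p))
    return scores

text = "the cat sat on the mat the cat slept on the mat".split()
for (prev, cur), s in zip(zip(text[:-1], text[1:]), bigram_surprisal(text)):
    print(f"{prev} -> {cur}: {s:.2f} bits")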

@inproceedings{LeMaguer2016SP,
title = {Toward the use of information density based descriptive features in HMM based speech synthesis},
author = {S{\'e}bastien Le Maguer and Bernd M{\"o}bius and Ingmar Steiner},
url = {https://www.researchgate.net/publication/305684951_Toward_the_use_of_information_density_based_descriptive_features_in_HMM_based_speech_synthesis},
year = {2016},
date = {2016},
booktitle = {8th International Conference on Speech Prosody},
pages = {1029-1033},
address = {Boston, MA, USA},
abstract = {Over the last decades, acoustic modeling for speech synthesis has been improved significantly. However, in most systems, the descriptive feature set used to represent annotated text has been the same for many years. Specifically, the prosody models in most systems are based on low level information such as syllable stress or word part-of-speech tags. In this paper, we propose to enrich the descriptive feature set by adding a linguistic measure computed from the predictability of an event, such as the occurrence of a syllable or word. By adding such descriptive features, we assume that we will improve prosody modeling. This new feature set is then used to train prosody models for speech synthesis. Results from an evaluation study indicate a preference for the new descriptive feature set over the conventional one.},
pubstate = {published},
type = {inproceedings}
}

Projects:   C1 C5

Le Maguer, Sébastien; Möbius, Bernd; Steiner, Ingmar; Lolive, Damien

De l'utilisation de descripteurs issus de la linguistique computationnelle dans le cadre de la synthèse par HMM Inproceedings

Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP, AFCP - ATALA, pp. 714-722, Paris, France, 2016.

Over the last decades, the acoustic modeling performed by parametric speech synthesis systems has received particular attention. However, in most known systems, the set of linguistic descriptors used to represent the text has remained the same. More specifically, prosody modeling is still driven by low-level descriptors such as syllable stress information or the word's part-of-speech tag. In this article, we propose to integrate information based on the predictability of an event (the syllable or the word). Several studies indicate a strong correlation between this measure, widely used in computational linguistics, and certain characteristics of human speech production. Our hypothesis is therefore that adding these descriptors improves prosody modeling. This article focuses on an objective analysis of the contribution of these descriptors to HMM-based synthesis for English and French.

@inproceedings{Lemaguer/etal:2016b,
title = {De l'utilisation de descripteurs issus de la linguistique computationnelle dans le cadre de la synthèse par HMM},
author = {S{\'e}bastien Le Maguer and Bernd M{\"o}bius and Ingmar Steiner and Damien Lolive},
url = {https://aclanthology.org/2016.jeptalnrecital-jep.80},
year = {2016},
date = {2016},
booktitle = {Actes de la conf{\'e}rence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP},
pages = {714-722},
publisher = {AFCP - ATALA},
address = {Paris, France},
abstract = {Durant les dernières d{\'e}cennies, la mod{\'e}lisation acoustique effectu{\'e}e par les systèmes de synthèse de parole param{\'e}trique a fait l’objet d’une attention particulière. Toutefois, dans la plupart des systèmes connus, l’ensemble des descripteurs linguistiques utilis{\'e}s pour repr{\'e}senter le texte reste identique. Plus specifiquement, la mod{\'e}lisation de la prosodie reste guid{\'e}e par des descripteurs de bas niveau comme l’information d’accentuation de la syllabe ou bien l’{\'e}tiquette grammaticale du mot. Dans cet article, nous proposons d’int{\'e}grer des informations bas{\'e}es sur la pr{\'e}dictibilit{\'e} d’un {\'e}vènement (la syllabe ou le mot). Plusieurs {\'e}tudes indiquent une corr{\'e}lation forte entre cette mesure, fortement pr{\'e}sente dans la linguistique computationnelle, et certaines sp{\'e}cificit{\'e}s lors de la production humaine de la parole. Notre hypothèse est donc que l’ajout de ces descripteurs am{\'e}liore la mod{\'e}lisation de la prosodie. Cet article se focalise sur une analyse objective de l’apport de ces descripteurs sur la synthèse HMM pour la langue anglaise et française.},
pubstate = {published},
type = {inproceedings}
}

Projects:   C1 C5

Le Maguer, Sébastien; Steiner, Ingmar

The MaryTTS entry for the Blizzard Challenge 2016 Inproceedings

Blizzard Challenge, Cupertino, CA, USA, 2016.

The MaryTTS system is a modular architecture text-to-speech (TTS) system whose development started around 15 years ago. This paper presents the MaryTTS entry for the Blizzard Challenge 2016. For this entry, we used the default configuration of MaryTTS based on the unit selection paradigm.

However, the architecture is currently undergoing a massive refactoring process in order to provide a more fully modular system. This will allow researchers to focus only on some part of the synthesis process. The current participation objective includes assessing the current baseline quality in order to evaluate any future improvements. These can be achieved more easily thanks to a more flexible and robust architecture. The results obtained in this challenge prove that our system is not obsolete, but improvements need to be made to maintain it in the state of the art in the future.

@inproceedings{LeMaguer2016BC,
title = {The MaryTTS entry for the Blizzard Challenge 2016},
author = {S{\'e}bastien Le Maguer and Ingmar Steiner},
url = {https://www.semanticscholar.org/paper/The-MaryTTS-entry-for-the-Blizzard-Challenge-2016-Maguer-Steiner/62e04ad78ba1a531e419bea25cb9eb8799aaf07e},
year = {2016},
date = {2016-09-16},
booktitle = {Blizzard Challenge},
address = {Cupertino, CA, USA},
abstract = {The MaryTTS system is a modular architecture text-to-speech (TTS) system whose development started around 15 years ago. This paper presents the MaryTTS entry for the Blizzard Challenge 2016. For this entry, we used the default configuration of MaryTTS based on the unit selection paradigm. However, the architecture is currently undergoing a massive refactoring process in order to provide a more fully modular system. This will allow researchers to focus only on some part of the synthesis process. The current participation objective includes assessing the current baseline quality in order to evaluate any future improvements. These can be achieved more easily thanks to a more flexible and robust architecture. The results obtained in this challenge prove that our system is not obsolete, but improvements need to be made to maintain it in the state of the art in the future.},
pubstate = {published},
type = {inproceedings}
}

Project:   C5

Le Maguer, Sébastien; Steiner, Ingmar; Möbius, Bernd

Toward a Speech Synthesis Guided by the Modeling of Unexpected Events Inproceedings

Schweitzer, Antje; Dogil, Grzegorz (Ed.): Workshop on Modeling Variability in Speech, Stuttgart, Germany, 2015.

@inproceedings{LeMaguer2015Variability,
title = {Toward a Speech Synthesis Guided by the Modeling of Unexpected Events},
author = {S{\'e}bastien Le Maguer and Ingmar Steiner and Bernd M{\"o}bius},
editor = {Antje Schweitzer and Grzegorz Dogil},
url = {https://www.bibsonomy.org/bibtex/217fb65d2ef291a8a10df15db8a8cf5c7/sfb1102},
year = {2015},
date = {2015},
booktitle = {Workshop on Modeling Variability in Speech},
address = {Stuttgart, Germany},
pubstate = {published},
type = {inproceedings}
}

Project:   C5
