Publications

Kudera, Jacek; van Os, Marjolein; Möbius, Bernd

Natural and synthetic speech comprehension in simulated tonal and pulsatile tinnitus: A pilot study Inproceedings

Elektronische Sprachsignalverarbeitung 2021, Tagungsband der 32. Konferenz (Berlin), TUDpress, pp. 273-280, Dresden, 2021.

This paper summarizes the results of a Modified Rhyme Test conducted with masked stimuli to simulate two common types of hearing impairment: bilateral pulsatile and pure tinnitus. Two types of stimuli, meaningful German words (natural read speech and TTS output) differing in initial or final positioned minimal pairs were modified to correspond to six listening conditions. Results showed higher recognition scores for natural speech compared to synthetic and better intelligibility for pulsatile tinnitus noise over pure tone tinnitus. These insights are of relevance given the alarming rates of tinnitus in epidemiological reports.

@inproceedings{Kudera2021,
title = {Natural and synthetic speech comprehension in simulated tonal and pulsatile tinnitus: A pilot study},
author = {Jacek Kudera and Marjolein van Os and Bernd M{\"o}bius},
url = {https://www.essv.de/paper.php?id=1129},
year = {2021},
date = {2021},
booktitle = {Elektronische Sprachsignalverarbeitung 2021, Tagungsband der 32. Konferenz (Berlin)},
pages = {273-280},
publisher = {TUDpress},
address = {Dresden},
abstract = {This paper summarizes the results of a Modified Rhyme Test conducted with masked stimuli to simulate two common types of hearing impairment: bilateral pulsatile and pure tinnitus. Two types of stimuli, meaningful German words (natural read speech and TTS output) differing in initial or final positioned minimal pairs were modified to correspond to six listening conditions. Results showed higher recognition scores for natural speech compared to synthetic and better intelligibility for pulsatile tinnitus noise over pure tone tinnitus. These insights are of relevance given the alarming rates of tinnitus in epidemiological reports.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Brandt, Erika; Möbius, Bernd; Andreeva, Bistra

Dynamic Formant Trajectories in German Read Speech: Impact of Predictability and Prominence Journal Article

Frontiers in Communication, section Language Sciences, 6, pp. 1-15, 2021.

Phonetic structures expand temporally and spectrally when they are difficult to predict from their context. To some extent, effects of predictability are modulated by prosodic structure. So far, studies on the impact of contextual predictability and prosody on phonetic structures have neglected the dynamic nature of the speech signal. This study investigates the impact of predictability and prominence on the dynamic structure of the first and second formants of German vowels. We expect to find differences in the formant movements between vowels standing in different predictability contexts and a modulation of this effect by prominence. First and second formant values are extracted from a large German corpus. Formant trajectories of peripheral vowels are modeled using generalized additive mixed models, which estimate nonlinear regressions between a dependent variable and predictors. Contextual predictability is measured as biphone and triphone surprisal based on a statistical German language model. We test for the effects of the information-theoretic measures surprisal and word frequency, as well as prominence, on formant movement, while controlling for vowel phonemes and duration. Primary lexical stress and vowel phonemes are significant predictors of first and second formant trajectory shape. We replicate previous findings that vowels are more dispersed in stressed syllables than in unstressed syllables. The interaction of stress and surprisal explains formant movement: unstressed vowels show more variability in their formant trajectory shape at different surprisal levels than stressed vowels. This work shows that effects of contextual predictability on fine phonetic detail can be observed not only in pointwise measures but also in dynamic features of phonetic segments.

@article{Brandt/etal:2021,
title = {Dynamic Formant Trajectories in German Read Speech: Impact of Predictability and Prominence},
author = {Erika Brandt and Bernd M{\"o}bius and Bistra Andreeva},
url = {https://www.frontiersin.org/articles/10.3389/fcomm.2021.643528/full},
doi = {10.3389/fcomm.2021.643528},
year = {2021},
date = {2021-06-21},
journal = {Frontiers in Communication, section Language Sciences},
pages = {1-15},
volume = {6},
number = {643528},
abstract = {Phonetic structures expand temporally and spectrally when they are difficult to predict from their context. To some extent, effects of predictability are modulated by prosodic structure. So far, studies on the impact of contextual predictability and prosody on phonetic structures have neglected the dynamic nature of the speech signal. This study investigates the impact of predictability and prominence on the dynamic structure of the first and second formants of German vowels. We expect to find differences in the formant movements between vowels standing in different predictability contexts and a modulation of this effect by prominence. First and second formant values are extracted from a large German corpus. Formant trajectories of peripheral vowels are modeled using generalized additive mixed models, which estimate nonlinear regressions between a dependent variable and predictors. Contextual predictability is measured as biphone and triphone surprisal based on a statistical German language model. We test for the effects of the information-theoretic measures surprisal and word frequency, as well as prominence, on formant movement, while controlling for vowel phonemes and duration. Primary lexical stress and vowel phonemes are significant predictors of first and second formant trajectory shape. We replicate previous findings that vowels are more dispersed in stressed syllables than in unstressed syllables. The interaction of stress and surprisal explains formant movement: unstressed vowels show more variability in their formant trajectory shape at different surprisal levels than stressed vowels. This work shows that effects of contextual predictability on fine phonetic detail can be observed not only in pointwise measures but also in dynamic features of phonetic segments.},
pubstate = {published},
type = {article}
}

Project:   C1

Meier, David; Andreeva, Bistra

Einflussfaktoren auf die Wahrnehmung von Prominenz im natürlichen Dialog Inproceedings

Elektronische Sprachsignalverarbeitung 2020, Tagungsband der 31. Konferenz, pp. 257-264, Magdeburg, 2020.

Turnbull et al. [1] find that several competing factors affect the perception of prosodic prominence in isolated adjective-noun pairs, namely phonology, discourse context, and knowledge about the discourse. This paper aims to examine the relative influence of evoked focus (narrow contrastive vs. broad contrastive) and accentuation (accented vs. unaccented) on the perception of prominence, and to test whether the concepts presented in Turnbull et al. can be reproduced in a setting that is more comparable to natural-language dialogue. For the study, 144 realized sentences from a single male speaker were spliced together so that a semantic contrast arises either on the noun in question or on the adjective. The metrically strong syllables of the adjective or the noun were accented either in accordance with the focus structure or contrary to expectation. The results show that accentuation has a greater influence on prominence perception than the focus condition, which is consistent with the findings of Turnbull et al. In addition, adjectives are consistently rated as more prominent than nouns in comparable contexts. Extending the discourse context and the background information available to the participants, however, had only negligible effects in the experimental setup presented here.

@inproceedings{Meier2020,
title = {Einflussfaktoren auf die Wahrnehmung von Prominenz im nat{\"u}rlichen Dialog},
author = {David Meier and Bistra Andreeva},
url = {https://www.essv.de/paper.php?id=465},
year = {2020},
date = {2020},
booktitle = {Elektronische Sprachsignalverarbeitung 2020, Tagungsband der 31. Konferenz},
pages = {257-264},
address = {Magdeburg},
abstract = {Turnbull et al. [1] find that several competing factors affect the perception of prosodic prominence in isolated adjective-noun pairs, namely phonology, discourse context, and knowledge about the discourse. This paper aims to examine the relative influence of evoked focus (narrow contrastive vs. broad contrastive) and accentuation (accented vs. unaccented) on the perception of prominence, and to test whether the concepts presented in Turnbull et al. can be reproduced in a setting that is more comparable to natural-language dialogue. For the study, 144 realized sentences from a single male speaker were spliced together so that a semantic contrast arises either on the noun in question or on the adjective. The metrically strong syllables of the adjective or the noun were accented either in accordance with the focus structure or contrary to expectation. The results show that accentuation has a greater influence on prominence perception than the focus condition, which is consistent with the findings of Turnbull et al. In addition, adjectives are consistently rated as more prominent than nouns in comparable contexts. Extending the discourse context and the background information available to the participants, however, had only negligible effects in the experimental setup presented here.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Andreeva, Bistra; Möbius, Bernd; Whang, James

Effects of surprisal and boundary strength on phrase-final lengthening Inproceedings

Proc. 10th International Conference on Speech Prosody 2020, pp. 146-150, 2020.

This study examines the influence of prosodic structure (pitch accents and boundary strength) and information density (ID) on phrase-final syllable duration. Phrase-final syllable durations and following pause durations were measured in a subset of a German radio-news corpus (DIRNDL), consisting of about 5 hours of manually annotated speech. The prosodic annotation is in accordance with the autosegmental intonation model and includes labels for pitch accents and boundary tones. We treated pause duration as a quantitative proxy for boundary strength.

ID was calculated as the surprisal of the syllable trigram of the preceding context, based on language models trained on the DeWaC corpus. We found a significant positive correlation between surprisal and phrase-final syllable duration. Syllable duration was statistically modeled as a function of prosodic factors (pitch accent and boundary strength) and surprisal in linear mixed effects models. The results revealed an interaction of surprisal and boundary strength with respect to phrase-final syllable duration. Syllables with high surprisal values are longer before stronger boundaries, whereas low-surprisal syllables are longer before weaker boundaries. This modulation of pre-boundary syllable duration is observed above and beyond the well-established phrase-final lengthening effect.

@inproceedings{Andreeva2020,
title = {Effects of surprisal and boundary strength on phrase-final lengthening},
author = {Bistra Andreeva and Bernd M{\"o}bius and James Whang},
url = {http://dx.doi.org/10.21437/SpeechProsody.2020-30},
year = {2020},
date = {2020-10-20},
booktitle = {Proc. 10th International Conference on Speech Prosody 2020},
pages = {146-150},
abstract = {This study examines the influence of prosodic structure (pitch accents and boundary strength) and information density (ID) on phrase-final syllable duration. Phrase-final syllable durations and following pause durations were measured in a subset of a German radio-news corpus (DIRNDL), consisting of about 5 hours of manually annotated speech. The prosodic annotation is in accordance with the autosegmental intonation model and includes labels for pitch accents and boundary tones. We treated pause duration as a quantitative proxy for boundary strength. ID was calculated as the surprisal of the syllable trigram of the preceding context, based on language models trained on the DeWaC corpus. We found a significant positive correlation between surprisal and phrase-final syllable duration. Syllable duration was statistically modeled as a function of prosodic factors (pitch accent and boundary strength) and surprisal in linear mixed effects models. The results revealed an interaction of surprisal and boundary strength with respect to phrase-final syllable duration. Syllables with high surprisal values are longer before stronger boundaries, whereas low-surprisal syllables are longer before weaker boundaries. This modulation of pre-boundary syllable duration is observed above and beyond the well-established phrase-final lengthening effect.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Batliner, Anton; Möbius, Bernd

Prosody in automatic speech processing Book Chapter

Gussenhoven, Carlos; Chen, Aoju (Ed.): The Oxford Handbook of Language Prosody, Chap. 46, Oxford University Press, pp. 633-645, 2020, ISBN 9780198832232.

Automatic speech processing (ASP) is understood as covering word recognition, the processing of higher linguistic components (syntax, semantics, and pragmatics), and the processing of computational paralinguistics (CP), which deals with speaker states and traits. This chapter attempts to track the role of prosody in ASP from the word level up to CP. A short history of the field from 1980 to 2020 distinguishes the early years (until 2000)— when the prosodic contribution to the modelling of linguistic phenomena, such as accents, boundaries, syntax, semantics, and dialogue acts, was the focus—from the later years, when the focus shifted to paralinguistics; prosody ceased to be visible. Different types of predictor variables are addressed, among them high-performance power features as well as leverage features, which can also be employed in teaching and therapy.

@inbook{Batliner/Moebius:2020,
title = {Prosody in automatic speech processing},
author = {Anton Batliner and Bernd M{\"o}bius},
editor = {Carlos Gussenhoven and Aoju Chen},
url = {https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780198832232.001.0001/oxfordhb-9780198832232-e-42},
doi = {10.1093/oxfordhb/9780198832232.013.42},
year = {2020},
date = {2020},
booktitle = {The Oxford Handbook of Language Prosody, Chap. 46},
isbn = {9780198832232},
pages = {633-645},
publisher = {Oxford University Press},
abstract = {Automatic speech processing (ASP) is understood as covering word recognition, the processing of higher linguistic components (syntax, semantics, and pragmatics), and the processing of computational paralinguistics (CP), which deals with speaker states and traits. This chapter attempts to track the role of prosody in ASP from the word level up to CP. A short history of the field from 1980 to 2020 distinguishes the early years (until 2000)— when the prosodic contribution to the modelling of linguistic phenomena, such as accents, boundaries, syntax, semantics, and dialogue acts, was the focus—from the later years, when the focus shifted to paralinguistics; prosody ceased to be visible. Different types of predictor variables are addressed, among them high-performance power features as well as leverage features, which can also be employed in teaching and therapy.},
pubstate = {published},
type = {inbook}
}

Project:   C1

Karpiński, Maciej; Andreeva, Bistra; Asu, Eva Liina; Beňuš, Štefan; Daugavet, Anna; Mády, Katalin

Central and Eastern Europe Book Chapter

Gussenhoven, Carlos; Chen, Aoju (Ed.): The Oxford Handbook of Language Prosody, Chap. 15, Oxford University Press, pp. 225-235, 2020, ISBN 9780198832232.

The languages of Central and Eastern Europe addressed in this chapter form a typologically divergent collection that includes Slavic (Belarusian, Bulgarian, Czech, Macedonian, Polish, Russian, pluricentric Bosnian-Croatian-Montenegrin-Serbian, Slovak, Slovenian, Ukrainian), Baltic (Latvian, Lithuanian), Finno-Ugric (Hungarian, Finnish, Estonian), and Romance (Romanian). Their prosodic features and structures have been explored to various depths, from different theoretical perspectives, sometimes on the basis of relatively sparse material. Still, enough is known to see that their typological divergence as well as other factors contribute to vivid differences in their prosodic systems. While belonging to intonational languages, they differ in pitch patterns and their usage, duration, and rhythm (some involve phonological duration), as well as prominence mechanisms, accentuation, and word stress (fixed or mobile). Several languages in the area have what is referred to by different traditions as pitch accents, tones or syllable accents, or intonations.

@inbook{Karpinski/etal:2020,
title = {Central and Eastern Europe},
author = {Maciej Karpi{\'n}ski and Bistra Andreeva and Eva Liina Asu and {\v{S}}tefan Be{\v{n}}u{\v{s}} and Anna Daugavet and Katalin M{\'a}dy},
editor = {Carlos Gussenhoven and Aoju Chen},
url = {https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780198832232.001.0001/oxfordhb-9780198832232-e-14},
year = {2020},
date = {2020},
booktitle = {The Oxford Handbook of Language Prosody, Chap. 15},
isbn = {9780198832232},
pages = {225-235},
publisher = {Oxford University Press},
abstract = {The languages of Central and Eastern Europe addressed in this chapter form a typologically divergent collection that includes Slavic (Belarusian, Bulgarian, Czech, Macedonian, Polish, Russian, pluricentric Bosnian-Croatian-Montenegrin-Serbian, Slovak, Slovenian, Ukrainian), Baltic (Latvian, Lithuanian), Finno-Ugric (Hungarian, Finnish, Estonian), and Romance (Romanian). Their prosodic features and structures have been explored to various depths, from different theoretical perspectives, sometimes on the basis of relatively sparse material. Still, enough is known to see that their typological divergence as well as other factors contribute to vivid differences in their prosodic systems. While belonging to intonational languages, they differ in pitch patterns and their usage, duration, and rhythm (some involve phonological duration), as well as prominence mechanisms, accentuation, and word stress (fixed or mobile). Several languages in the area have what is referred to by different traditions as pitch accents, tones or syllable accents, or intonations.},
pubstate = {published},
type = {inbook}
}

Project:   C1

Abdullah, Badr M.; Avgustinova, Tania; Möbius, Bernd; Klakow, Dietrich

Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages Inproceedings

Proceedings of Interspeech 2020, pp. 477-481, 2020.

State-of-the-art spoken language identification (LID) systems, which are based on end-to-end deep neural networks, have shown remarkable success not only in discriminating between distant languages but also between closely-related languages or even different spoken varieties of the same language. However, it is still unclear to what extent neural LID models generalize to speech samples with different acoustic conditions due to domain shift. In this paper, we present a set of experiments to investigate the impact of domain mismatch on the performance of neural LID systems for a subset of six Slavic languages across two domains (read speech and radio broadcast) and examine two low-level signal descriptors (spectral and cepstral features) for this task. Our experiments show that (1) out-of-domain speech samples severely hinder the performance of neural LID models, and (2) while both spectral and cepstral features show comparable performance within-domain, spectral features show more robustness under domain mismatch. Moreover, we apply unsupervised domain adaptation to minimize the discrepancy between the two domains in our study. We achieve relative accuracy improvements that range from 9% to 77% depending on the diversity of acoustic conditions in the source domain.

@inproceedings{abdullah_etal_is2020,
title = {Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages},
author = {Badr M. Abdullah and Tania Avgustinova and Bernd M{\"o}bius and Dietrich Klakow},
url = {https://arxiv.org/abs/2008.00545},
doi = {10.21437/Interspeech.2020-2930},
year = {2020},
date = {2020},
booktitle = {Proceedings of Interspeech 2020},
pages = {477-481},
abstract = {State-of-the-art spoken language identification (LID) systems, which are based on end-to-end deep neural networks, have shown remarkable success not only in discriminating between distant languages but also between closely-related languages or even different spoken varieties of the same language. However, it is still unclear to what extent neural LID models generalize to speech samples with different acoustic conditions due to domain shift. In this paper, we present a set of experiments to investigate the impact of domain mismatch on the performance of neural LID systems for a subset of six Slavic languages across two domains (read speech and radio broadcast) and examine two low-level signal descriptors (spectral and cepstral features) for this task. Our experiments show that (1) out-of-domain speech samples severely hinder the performance of neural LID models, and (2) while both spectral and cepstral features show comparable performance within-domain, spectral features show more robustness under domain mismatch. Moreover, we apply unsupervised domain adaptation to minimize the discrepancy between the two domains in our study. We achieve relative accuracy improvements that range from 9% to 77% depending on the diversity of acoustic conditions in the source domain.},
pubstate = {published},
type = {inproceedings}
}

Projects:   C1 C4

Abdullah, Badr M.; Kudera, Jacek; Avgustinova, Tania; Möbius, Bernd; Klakow, Dietrich

Rediscovering the Slavic Continuum in Representations Emerging from Neural Models of Spoken Language Identification Inproceedings

Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2020), International Committee on Computational Linguistics (ICCL), pp. 128-139, Barcelona, Spain (Online), 2020.

Deep neural networks have been employed for various spoken language recognition tasks, including tasks that are multilingual by definition such as spoken language identification (LID). In this paper, we present a neural model for Slavic language identification in speech signals and analyze its emergent representations to investigate whether they reflect objective measures of language relatedness or non-linguists’ perception of language similarity. While our analysis shows that the language representation space indeed captures language relatedness to a great extent, we find perceptual confusability to be the best predictor of the language representation similarity.

@inproceedings{abdullah_etal_vardial2020,
title = {Rediscovering the Slavic Continuum in Representations Emerging from Neural Models of Spoken Language Identification},
author = {Badr M. Abdullah and Jacek Kudera and Tania Avgustinova and Bernd M{\"o}bius and Dietrich Klakow},
url = {https://www.aclweb.org/anthology/2020.vardial-1.12},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2020)},
pages = {128-139},
publisher = {International Committee on Computational Linguistics (ICCL)},
address = {Barcelona, Spain (Online)},
abstract = {Deep neural networks have been employed for various spoken language recognition tasks, including tasks that are multilingual by definition such as spoken language identification (LID). In this paper, we present a neural model for Slavic language identification in speech signals and analyze its emergent representations to investigate whether they reflect objective measures of language relatedness or non-linguists’ perception of language similarity. While our analysis shows that the language representation space indeed captures language relatedness to a great extent, we find perceptual confusability to be the best predictor of the language representation similarity.},
pubstate = {published},
type = {inproceedings}
}

Projects:   C1 C4

Brandt, Erika

Information density and phonetic structure: explaining segmental variability PhD Thesis

Saarland University, Saarbruecken, Germany, 2019.

There is growing evidence that information-theoretic principles influence linguistic structures. Regarding speech several studies have found that phonetic structures lengthen in duration and strengthen in their spectral features when they are difficult to predict from their context, whereas easily predictable phonetic structures are shortened and reduced spectrally. Most of this evidence comes from studies on American English, only some studies have shown similar tendencies in Dutch, Finnish, or Russian. In this context, the Smooth Signal Redundancy hypothesis (Aylett and Turk 2004, Aylett and Turk 2006) emerged claiming that the effect of information-theoretic factors on the segmental structure is moderated through the prosodic structure. In this thesis, we investigate the impact and interaction of information density and prosodic structure on segmental variability in production analyses, mainly based on German read speech, and also listeners' perception of differences in phonetic detail caused by predictability effects. Information density (ID) is defined as contextual predictability or surprisal (S(unit_i) = -log2 P(unit_i|context)) and estimated from language models based on large text corpora. In addition to surprisal, we include word frequency, and prosodic factors, such as primary lexical stress, prosodic boundary, and articulation rate, as predictors of segmental variability in our statistical analysis. As acoustic-phonetic measures, we investigate segment duration and deletion, voice onset time (VOT), vowel dispersion, global spectral characteristics of vowels, dynamic formant measures and voice quality metrics. Vowel dispersion is analyzed in the context of German learners' speech and in a cross-linguistic study. As results, we replicate previous findings of reduced segment duration (and VOT), higher likelihood to delete, and less vowel dispersion for easily predictable segments.
Easily predictable German vowels have less formant change in their vowel section length (VSL), F1 slope and velocity, are less curved in their F2, and show increased breathiness values in cepstral peak prominence (smoothed) than vowels that are difficult to predict from their context. Results for word frequency show similar tendencies: German segments in high-frequency words are shorter, more likely to delete, less dispersed, and show less magnitude in formant change, less F2 curvature, as well as less harmonic richness in open quotient smoothed than German segments in low-frequency words. These effects are found even though we control for the expected and much more effective effects of stress, boundary, and speech rate. In the cross-linguistic analysis of vowel dispersion, the effect of ID is robust across almost all of the six languages and the three intended speech rates. Surprisal does not affect vowel dispersion of non-native German speakers. Surprisal and prosodic factors interact in explaining segmental variability. Especially, stress and surprisal complement each other in their positive effect on segment duration, vowel dispersion and magnitude in formant change. Regarding perception we observe that listeners are sensitive to differences in phonetic detail stemming from high and low surprisal contexts for the same lexical target.

@phdthesis{Brandt_diss_2019,
title = {Information density and phonetic structure: explaining segmental variability},
author = {Erika Brandt},
url = {http://nbn-resolving.de/urn:nbn:de:bsz:291--ds-279181},
doi = {10.22028/D291-27918},
year = {2019},
date = {2019},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {There is growing evidence that information-theoretic principles influence linguistic structures. Regarding speech several studies have found that phonetic structures lengthen in duration and strengthen in their spectral features when they are difficult to predict from their context, whereas easily predictable phonetic structures are shortened and reduced spectrally. Most of this evidence comes from studies on American English, only some studies have shown similar tendencies in Dutch, Finnish, or Russian. In this context, the Smooth Signal Redundancy hypothesis (Aylett and Turk 2004, Aylett and Turk 2006) emerged claiming that the effect of information-theoretic factors on the segmental structure is moderated through the prosodic structure. In this thesis, we investigate the impact and interaction of information density and prosodic structure on segmental variability in production analyses, mainly based on German read speech, and also listeners' perception of differences in phonetic detail caused by predictability effects. Information density (ID) is defined as contextual predictability or surprisal (S(unit_i) = -log2 P(unit_i|context)) and estimated from language models based on large text corpora. In addition to surprisal, we include word frequency, and prosodic factors, such as primary lexical stress, prosodic boundary, and articulation rate, as predictors of segmental variability in our statistical analysis. As acoustic-phonetic measures, we investigate segment duration and deletion, voice onset time (VOT), vowel dispersion, global spectral characteristics of vowels, dynamic formant measures and voice quality metrics. Vowel dispersion is analyzed in the context of German learners' speech and in a cross-linguistic study. As results, we replicate previous findings of reduced segment duration (and VOT), higher likelihood to delete, and less vowel dispersion for easily predictable segments. 
Easily predictable German vowels have less formant change in their vowel section length (VSL), F1 slope and velocity, are less curved in their F2, and show increased breathiness values in cepstral peak prominence (smoothed) than vowels that are difficult to predict from their context. Results for word frequency show similar tendencies: German segments in high-frequency words are shorter, more likely to delete, less dispersed, and show less magnitude in formant change, less F2 curvature, as well as less harmonic richness in open quotient smoothed than German segments in low-frequency words. These effects are found even though we control for the expected and much more effective effects of stress, boundary, and speech rate. In the cross-linguistic analysis of vowel dispersion, the effect of ID is robust across almost all of the six languages and the three intended speech rates. Surprisal does not affect vowel dispersion of non-native German speakers. Surprisal and prosodic factors interact in explaining segmental variability. Especially, stress and surprisal complement each other in their positive effect on segment duration, vowel dispersion and magnitude in formant change. Regarding perception we observe that listeners are sensitive to differences in phonetic detail stemming from high and low surprisal contexts for the same lexical target.},
pubstate = {published},
type = {phdthesis}
}

Project:   C1

Brandt, Erika; Andreeva, Bistra; Möbius, Bernd

Information density and vowel dispersion in the productions of Bulgarian L2 speakers of German Inproceedings

Proceedings of the 19th International Congress of Phonetic Sciences, pp. 3165-3169, Melbourne, Australia, 2019.

We investigated the influence of information density (ID) on vowel space size in L2. Vowel dispersion was measured for the stressed tense vowels /i:, o:, a:/ and their lax counterparts /I, O, a/ in read speech from six German speakers, six advanced and six intermediate Bulgarian speakers of German. The Euclidean distance between the center of the vowel space and formant values for each speaker was used as a measure for vowel dispersion. ID was calculated as the surprisal of the triphone of the preceding context. We found a significant positive correlation between surprisal and vowel dispersion in German native speakers. The advanced L2 speakers showed a significant positive relationship between these two measures, while this was not observed in intermediate L2 vowel productions. The intermediate speakers raised their vowel space, reflecting native Bulgarian vowel raising in unstressed positions.
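
The dispersion measure described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the vowel space center was computed, so the sketch takes it to be the per-speaker mean of F1 and F2, and the function name `vowel_dispersion` and the formant values are hypothetical.

```python
import math
from statistics import mean


def vowel_dispersion(tokens):
    """Return each token's Euclidean distance from the vowel space center.

    tokens: list of (f1, f2) formant pairs in Hz for ONE speaker.
    Assumption: the center is the mean F1 and mean F2 over all tokens;
    the abstract does not specify how the center was obtained.
    """
    center_f1 = mean(f1 for f1, _ in tokens)
    center_f2 = mean(f2 for _, f2 in tokens)
    return [math.hypot(f1 - center_f1, f2 - center_f2) for f1, f2 in tokens]


# Hypothetical formant values in Hz (not data from the study).
speaker_tokens = [(300.0, 2300.0), (700.0, 1300.0), (400.0, 800.0), (600.0, 1600.0)]
dispersion = vowel_dispersion(speaker_tokens)
```

Larger values mean a token lies further from the center, that is, the vowel is produced more peripherally in the F1/F2 plane.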

@inproceedings{Brandt2019,
title = {Information density and vowel dispersion in the productions of Bulgarian L2 speakers of German},
author = {Erika Brandt and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/29548},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 19th International Congress of Phonetic Sciences},
pages = {3165-3169},
address = {Melbourne, Australia},
abstract = {We investigated the influence of information density (ID) on vowel space size in L2. Vowel dispersion was measured for the stressed tense vowels /i:, o:, a:/ and their lax counterparts /I, O, a/ in read speech from six German speakers, six advanced and six intermediate Bulgarian speakers of German. The Euclidean distance between the center of the vowel space and formant values for each speaker was used as a measure for vowel dispersion. ID was calculated as the surprisal of the triphone of the preceding context. We found a significant positive correlation between surprisal and vowel dispersion in German native speakers. The advanced L2 speakers showed a significant positive relationship between these two measures, while this was not observed in intermediate L2 vowel productions. The intermediate speakers raised their vowel space, reflecting native Bulgarian vowel raising in unstressed positions.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Whang, James

Effects of phonotactic predictability on sensitivity to phonetic detail Journal Article

Laboratory Phonology: Journal of the Association for Laboratory Phonology, 10, pp. 1-28, 2019.

Japanese speakers systematically devoice or delete high vowels [i, u] between two voiceless consonants. Japanese listeners also report perceiving the same high vowels between consonant clusters even in the absence of a vocalic segment. Although perceptual vowel epenthesis has been described primarily as a phonotactic repair strategy, where a phonetically minimal vowel is epenthesized by default, few studies have investigated how the predictability of a vowel in a given context affects the choice of epenthetic vowel. The present study uses a forced-choice labeling task to test how sensitive Japanese listeners are to coarticulatory cues of high vowels [i, u] and non-high vowel [a] in devoicing and non-devoicing contexts. Devoicing contexts were further divided into high-predictability contexts, where the phonotactic distribution strongly favors one of the high vowels, and low-predictability contexts, where both high vowels are allowed, to specifically test for the effects of predictability. Results reveal a strong tendency towards [u] epenthesis as previous studies have found, but the results also reveal a sensitivity to coarticulatory cues that override the default [u] epenthesis, particularly in low-predictability contexts. Previous studies have shown that predictability affects phonetic implementation during production, and this study provides evidence predictability has similar effects during perception.

@article{Whang2019,
title = {Effects of phonotactic predictability on sensitivity to phonetic detail},
author = {James Whang},
url = {https://www.journal-labphon.org/articles/10.5334/labphon.125/},
doi = {https://doi.org/10.5334/labphon.125},
year = {2019},
date = {2019-04-23},
journal = {Laboratory Phonology: Journal of the Association for Laboratory Phonology},
pages = {1-28},
volume = {10},
number = {1},
abstract = {Japanese speakers systematically devoice or delete high vowels [i, u] between two voiceless consonants. Japanese listeners also report perceiving the same high vowels between consonant clusters even in the absence of a vocalic segment. Although perceptual vowel epenthesis has been described primarily as a phonotactic repair strategy, where a phonetically minimal vowel is epenthesized by default, few studies have investigated how the predictability of a vowel in a given context affects the choice of epenthetic vowel. The present study uses a forced-choice labeling task to test how sensitive Japanese listeners are to coarticulatory cues of high vowels [i, u] and non-high vowel [a] in devoicing and non-devoicing contexts. Devoicing contexts were further divided into high-predictability contexts, where the phonotactic distribution strongly favors one of the high vowels, and low-predictability contexts, where both high vowels are allowed, to specifically test for the effects of predictability. Results reveal a strong tendency towards [u] epenthesis as previous studies have found, but the results also reveal a sensitivity to coarticulatory cues that override the default [u] epenthesis, particularly in low-predictability contexts. Previous studies have shown that predictability affects phonetic implementation during production, and this study provides evidence predictability has similar effects during perception.},
pubstate = {published},
type = {article}
}

Project:   C1

Malisz, Zofia; Brandt, Erika; Möbius, Bernd; Oh, Yoon Mi; Andreeva, Bistra

Dimensions of segmental variability: interaction of prosody and surprisal in six languages Journal Article

Frontiers in Communication / Language Sciences, 3, pp. 1-18, 2018.

Contextual predictability variation affects phonological and phonetic structure. Reduction and expansion of acoustic-phonetic features is also characteristic of prosodic variability. In this study, we assess the impact of surprisal and prosodic structure on phonetic encoding, both independently of each other and in interaction. We model segmental duration, vowel space size and spectral characteristics of vowels and consonants as a function of surprisal as well as of syllable prominence, phrase boundary, and speech rate. Correlates of phonetic encoding density are extracted from a subset of the BonnTempo corpus for six languages: American English, Czech, Finnish, French, German, and Polish. Surprisal is estimated from segmental n-gram language models trained on large text corpora. Our findings are generally compatible with a weak version of Aylett and Turk’s Smooth Signal Redundancy hypothesis, suggesting that prosodic structure mediates between the requirements of efficient communication and the speech signal. However, this mediation is not perfect, as we found evidence for additional, direct effects of changes in surprisal on the phonetic structure of utterances. These effects appear to be stable across different speech rates.

@article{Malisz2018,
title = {Dimensions of segmental variability: interaction of prosody and surprisal in six languages},
author = {Zofia Malisz and Erika Brandt and Bernd M{\"o}bius and Yoon Mi Oh and Bistra Andreeva},
url = {https://www.frontiersin.org/articles/10.3389/fcomm.2018.00025/full},
doi = {https://doi.org/10.3389/fcomm.2018.00025},
year = {2018},
date = {2018-07-20},
journal = {Frontiers in Communication / Language Sciences},
pages = {1-18},
volume = {3},
number = {25},
abstract = {Contextual predictability variation affects phonological and phonetic structure. Reduction and expansion of acoustic-phonetic features is also characteristic of prosodic variability. In this study, we assess the impact of surprisal and prosodic structure on phonetic encoding, both independently of each other and in interaction. We model segmental duration, vowel space size and spectral characteristics of vowels and consonants as a function of surprisal as well as of syllable prominence, phrase boundary, and speech rate. Correlates of phonetic encoding density are extracted from a subset of the BonnTempo corpus for six languages: American English, Czech, Finnish, French, German, and Polish. Surprisal is estimated from segmental n-gram language models trained on large text corpora. Our findings are generally compatible with a weak version of Aylett and Turk's Smooth Signal Redundancy hypothesis, suggesting that prosodic structure mediates between the requirements of efficient communication and the speech signal. However, this mediation is not perfect, as we found evidence for additional, direct effects of changes in surprisal on the phonetic structure of utterances. These effects appear to be stable across different speech rates.},
pubstate = {published},
type = {article}
}

Project:   C1

Zimmerer, Frank; Brandt, Erika; Andreeva, Bistra; Möbius, Bernd

Idiomatic or literal? Production of collocations in German read speech Inproceedings

Proc. Speech Prosody 2018, pp. 428-432, Poznan, 2018.

Collocations have been identified as an interesting field to study the effects of frequency of occurrence in language and speech. We report results of a production experiment including a duration analysis based on the production of German collocations. The collocations occurred in a condition where the phrase was produced with a literal meaning and in another condition where it was idiomatic. A durational difference was found for the collocations, which were reduced in the idiomatic condition. This difference was also observed for the function word und (‘and’) in collocations like Mord und Totschlag (‘murder and manslaughter’). However, an analysis of the vowel /U/ of the function word did not show a durational difference. Some explanations as to why speakers showed different patterns of reduction (not all collocations were produced with a shorter duration in the idiomatic condition by all speakers) and why not all speakers use the durational cue (one out of eight speakers produced the conditions identically) are proposed.

@inproceedings{Zimmerer2018SpPro,
title = {Idiomatic or literal? Production of collocations in German read speech},
author = {Frank Zimmerer and Erika Brandt and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://www.isca-speech.org/archive/speechprosody_2018/zimmerer18_speechprosody.html},
doi = {https://doi.org/10.21437/SpeechProsody.2018-87},
year = {2018},
date = {2018},
booktitle = {Proc. Speech Prosody 2018},
pages = {428-432},
address = {Poznan},
abstract = {Collocations have been identified as an interesting field to study the effects of frequency of occurrence in language and speech. We report results of a production experiment including a duration analysis based on the production of German collocations. The collocations occurred in a condition where the phrase was produced with a literal meaning and in another condition where it was idiomatic. A durational difference was found for the collocations, which were reduced in the idiomatic condition. This difference was also observed for the function word und (‘and’) in collocations like Mord und Totschlag (‘murder and manslaughter’). However, an analysis of the vowel /U/ of the function word did not show a durational difference. Some explanations as to why speakers showed different patterns of reduction (not all collocations were produced with a shorter duration in the idiomatic condition by all speakers) and why not all speakers use the durational cue (one out of eight speakers produced the conditions identically) are proposed.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Brandt, Erika; Zimmerer, Frank; Andreeva, Bistra; Möbius, Bernd

Impact of prosodic structure and information density on dynamic formant trajectories in German Inproceedings

Klessa, Katarzyna; Bachan, Jolanta; Wagner, Agnieszka; Karpiński, Maciej; Śledziński, Daniel (Ed.): Speech Prosody 2018, Speech Prosody Special Interest Group, pp. 119-123, Urbana, 2018, ISSN 2333-2042.

This study investigated the influence of prosodic structure and information density (ID), defined as contextual predictability, on vowel-inherent spectral change (VISC). We extracted formant measurements from the onset and offset of the vowels of a large German corpus of newspaper read speech. Vector length (VL), the Euclidean distance between F1 and F2 trajectory, and F1 and F2 slope, formant deltas of onset and offset relative to vowel duration, were calculated as measures of formant change. ID factors were word frequency and phoneme-based surprisal measures, while the prosodic factors contained global and local articulation rate, primary lexical stress, and prosodic boundary. We expected that vowels increased in spectral change when they were difficult to predict from the context, or stood in low-frequency words while controlling for known effects of prosodic structure. The ID effects were assumed to be modulated by prosodic factors to a certain extent. We confirmed our hypotheses for VL, and found expected independent effects of prosody and ID on F1 slope and F2 slope.
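
One possible reading of the formant change measures named in this abstract is sketched below. It assumes that vector length is the Euclidean distance between the onset and offset points in the F1/F2 plane and that each slope is the onset-to-offset formant delta divided by vowel duration. The function name `formant_change`, the units, and the numbers are hypothetical and not taken from the paper.

```python
import math


def formant_change(f1_on, f2_on, f1_off, f2_off, duration_s):
    """Compute vector length (VL) and F1/F2 slopes for one vowel token.

    Formants are in Hz, duration in seconds. Assumption: VL is the
    distance between the onset and offset points in F1/F2 space, and
    each slope is the formant delta divided by the vowel duration.
    """
    d_f1 = f1_off - f1_on
    d_f2 = f2_off - f2_on
    return {
        "vl": math.hypot(d_f1, d_f2),    # Hz
        "f1_slope": d_f1 / duration_s,   # Hz per second
        "f2_slope": d_f2 / duration_s,   # Hz per second
    }


# Hypothetical token: F1 rises by 30 Hz and F2 by 40 Hz over 100 ms.
example = formant_change(500.0, 1500.0, 530.0, 1540.0, duration_s=0.1)
```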

@inproceedings{Brandt2018SpPro,
title = {Impact of prosodic structure and information density on dynamic formant trajectories in German},
author = {Erika Brandt and Frank Zimmerer and Bistra Andreeva and Bernd M{\"o}bius},
editor = {Katarzyna Klessa and Jolanta Bachan and Agnieszka Wagner and Maciej Karpiński and Daniel Śledziński},
url = {https://www.researchgate.net/publication/325744530_Impact_of_prosodic_structure_and_information_density_on_dynamic_formant_trajectories_in_German},
doi = {https://doi.org/10.22028/D291-32050},
year = {2018},
date = {2018},
booktitle = {Speech Prosody 2018},
issn = {2333-2042},
pages = {119-123},
publisher = {Speech Prosody Special Interest Group},
address = {Urbana},
abstract = {This study investigated the influence of prosodic structure and information density (ID), defined as contextual predictability, on vowel-inherent spectral change (VISC). We extracted formant measurements from the onset and offset of the vowels of a large German corpus of newspaper read speech. Vector length (VL), the Euclidean distance between F1 and F2 trajectory, and F1 and F2 slope, formant deltas of onset and offset relative to vowel duration, were calculated as measures of formant change. ID factors were word frequency and phoneme-based surprisal measures, while the prosodic factors contained global and local articulation rate, primary lexical stress, and prosodic boundary. We expected that vowels increased in spectral change when they were difficult to predict from the context, or stood in low-frequency words while controlling for known effects of prosodic structure. The ID effects were assumed to be modulated by prosodic factors to a certain extent. We confirmed our hypotheses for VL, and found expected independent effects of prosody and ID on F1 slope and F2 slope.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Brandt, Erika; Zimmerer, Frank; Möbius, Bernd; Andreeva, Bistra

Mel-cepstral distortion of German vowels in different information density contexts Inproceedings

Proceedings of Interspeech, Stockholm, Sweden, 2017.

This study investigated whether German vowels differ significantly from each other in mel-cepstral distortion (MCD) when they stand in different information density (ID) contexts. We hypothesized that vowels in the same ID contexts are more similar to each other than vowels that stand in different ID conditions. Read speech material from PhonDat2 of 16 German natives (m = 10, f = 6) was analyzed. Bi-phone and word language models were calculated based on DeWaC. To account for additional variability in the data, prosodic factors, as well as corpus-specific frequency values were also entered into the statistical models. Results showed that vowels in different ID conditions were significantly different in their MCD values. Unigram word probability and corpus-specific word frequency showed the expected effect on vowel similarity with a hierarchy between noncontrasting and contrasting conditions. However, these did not form a homogeneous group since there were group-internal significant differences. The largest distance can be found between vowels produced at fast speech rate, and between unstressed vowels.
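
For readers unfamiliar with the measure, the sketch below shows the commonly used definition of mel-cepstral distortion between two frames, MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), with the 0th (energy) coefficient excluded. Whether the study used exactly this variant, and how frames were extracted and aligned, is not stated in the abstract. The function name and the coefficient values are hypothetical.

```python
import math


def mel_cepstral_distortion(mcep_a, mcep_b):
    """Return the MCD in dB between two mel-cepstral coefficient vectors.

    Assumption: index 0 holds the energy coefficient and is skipped,
    as in the commonly used definition of the measure.
    """
    if len(mcep_a) != len(mcep_b):
        raise ValueError("coefficient vectors must have the same length")
    squared_diff = sum((a - b) ** 2 for a, b in zip(mcep_a[1:], mcep_b[1:]))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_diff)


# Hypothetical coefficient vectors (not data from the study).
frame_a = [1.0, 0.50, 0.20, -0.10]
frame_b = [1.2, 0.40, 0.25, -0.05]
mcd_same = mel_cepstral_distortion(frame_a, frame_a)
mcd_diff = mel_cepstral_distortion(frame_a, frame_b)
```

Identical frames yield a distortion of zero, and the value grows as the spectral envelopes diverge.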

@inproceedings{Brandt/etal:2017,
title = {Mel-cepstral distortion of German vowels in different information density contexts},
author = {Erika Brandt and Frank Zimmerer and Bernd M{\"o}bius and Bistra Andreeva},
url = {https://www.researchgate.net/publication/319185343_Mel-Cepstral_Distortion_of_German_Vowels_in_Different_Information_Density_Contexts},
year = {2017},
date = {2017},
booktitle = {Proceedings of Interspeech},
address = {Stockholm, Sweden},
abstract = {This study investigated whether German vowels differ significantly from each other in mel-cepstral distortion (MCD) when they stand in different information density (ID) contexts. We hypothesized that vowels in the same ID contexts are more similar to each other than vowels that stand in different ID conditions. Read speech material from PhonDat2 of 16 German natives (m = 10, f = 6) was analyzed. Bi-phone and word language models were calculated based on DeWaC. To account for additional variability in the data, prosodic factors, as well as corpus-specific frequency values were also entered into the statistical models. Results showed that vowels in different ID conditions were significantly different in their MCD values. Unigram word probability and corpus-specific word frequency showed the expected effect on vowel similarity with a hierarchy between noncontrasting and contrasting conditions. However, these did not form a homogeneous group since there were group-internal significant differences. The largest distance can be found between vowels produced at fast speech rate, and between unstressed vowels.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Zimmerer, Frank; Andreeva, Bistra; Möbius, Bernd; Malisz, Zofia; Ferragne, Emmanuel; Pellegrino, François; Brandt, Erika

Perzeption von Sprechgeschwindigkeit und der (nicht nachgewiesene) Einfluss von Surprisal Inproceedings

Möbius, Bernd (Ed.): Elektronische Sprachsignalverarbeitung 2017 - Tagungsband der 28. Konferenz, Saarbrücken, 15.-17. März 2017. Studientexte zur Sprachkommunikation, Band 86, pp. 174-179, 2017.

Two perception experiments investigated the perception of speech rate. One factor of particular interest is surprisal, an information-theoretic measure of the predictability of a linguistic unit in context. Taken together, the results of the experiments suggest that surprisal has no significant influence on the perception of speech rate.

@inproceedings{Zimmerer/etal:2017a,
title = {Perzeption von Sprechgeschwindigkeit und der (nicht nachgewiesene) Einfluss von Surprisal},
author = {Frank Zimmerer and Bistra Andreeva and Bernd M{\"o}bius and Zofia Malisz and Emmanuel Ferragne and François Pellegrino and Erika Brandt},
editor = {Bernd M{\"o}bius},
url = {https://www.researchgate.net/publication/318589916_PERZEPTION_VON_SPRECHGESCHWINDIGKEIT_UND_DER_NICHT_NACHGEWIESENE_EINFLUSS_VON_SURPRISAL},
year = {2017},
date = {2017-03-15},
booktitle = {Elektronische Sprachsignalverarbeitung 2017 - Tagungsband der 28. Konferenz, Saarbr{\"u}cken, 15.-17. M{\"a}rz 2017. Studientexte zur Sprachkommunikation, Band 86},
pages = {174-179},
abstract = {Two perception experiments investigated the perception of speech rate. One factor of particular interest is surprisal, an information-theoretic measure of the predictability of a linguistic unit in context. Taken together, the results of the experiments suggest that surprisal has no significant influence on the perception of speech rate.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Malisz, Zofia; O'Dell, Michael; Nieminen, Tommi; Wagner, Petra

Perspectives on speech timing: Coupled oscillator modeling of Polish and Finnish Journal Article

Phonetica, 73, pp. 229-255, 2016.

This study was aimed at analyzing empirical duration data for Polish spoken at different tempos using an updated version of the Coupled Oscillator Model of speech timing and rhythm variability (O’Dell and Nieminen, 1999, 2009). We use Bayesian inference on parameters relating to speech rate to investigate how tempo affects timing in Polish. The model parameters found are then compared with parameters obtained for equivalent material in Finnish to shed light on which of the effects represent general speech rate mechanisms and which are specific to Polish. We discuss the model and its predictions in the context of current perspectives on speech timing.

@article{Malisz/etal:2016,
title = {Perspectives on speech timing: Coupled oscillator modeling of Polish and Finnish},
author = {Zofia Malisz and Michael O'Dell and Tommi Nieminen and Petra Wagner},
url = {https://www.degruyter.com/document/doi/10.1159/000450829/html},
doi = {https://doi.org/10.1159/000450829},
year = {2016},
date = {2016},
journal = {Phonetica},
pages = {229-255},
volume = {73},
abstract = {This study was aimed at analyzing empirical duration data for Polish spoken at different tempos using an updated version of the Coupled Oscillator Model of speech timing and rhythm variability (O'Dell and Nieminen, 1999, 2009). We use Bayesian inference on parameters relating to speech rate to investigate how tempo affects timing in Polish. The model parameters found are then compared with parameters obtained for equivalent material in Finnish to shed light on which of the effects represent general speech rate mechanisms and which are specific to Polish. We discuss the model and its predictions in the context of current perspectives on speech timing.},
pubstate = {published},
type = {article}
}

Project:   C1

Schulz, Erika; Oh, Yoon Mi; Andreeva, Bistra; Möbius, Bernd

Impact of Prosodic Structure and Information Density on Vowel Space Size Inproceedings

Proceedings of Speech Prosody, pp. 350-354, Boston, MA, USA, 2016.

We investigated the influence of prosodic structure and information density on vowel space size. Vowels were measured in five languages from the BonnTempo corpus, French, German, Finnish, Czech, and Polish, each with three female and three male speakers. Speakers read the text at normal, slow, and fast speech rate. The Euclidean distance between vowel space midpoint and formant values for each speaker was used as a measure for vowel distinctiveness. The prosodic model consisted of prominence and boundary. Information density was calculated for each language using the surprisal of the biphone Xn|Xn−1. On average, there is a positive relationship between vowel space expansion and information density. Detailed analysis revealed that this relationship did not hold for Finnish, and was only weak for Polish. When vowel distinctiveness was modeled as a function of prosodic factors and information density in linear mixed effects models (LMM), only prosodic factors were significant in explaining the variance in vowel space expansion. All prosodic factors, except word boundary, showed significant positive results in LMM. Vowels were more distinct in stressed syllables, before a prosodic boundary and at normal and slow speech rate compared to fast speech.
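
The biphone surprisal mentioned in this abstract, -log2 P(Xn|Xn−1), can be illustrated with a minimal sketch. It uses unsmoothed relative frequencies over a toy phone sequence, which is an assumption made for brevity; the abstract does not describe the training data, smoothing, or toolkit behind the estimates. The function name `biphone_surprisal` and the toy sequence are hypothetical.

```python
import math
from collections import Counter


def biphone_surprisal(phones):
    """Map each biphone (prev, cur) to -log2 P(cur | prev) in bits.

    Probabilities are unsmoothed relative frequencies from `phones`,
    a simplifying assumption; a real language model would be trained
    on a large corpus and smoothed.
    """
    pair_counts = Counter(zip(phones, phones[1:]))
    # Every phone except the last one serves as a left context.
    context_counts = Counter(phones[:-1])
    return {
        (prev, cur): -math.log2(count / context_counts[prev])
        for (prev, cur), count in pair_counts.items()
    }


# Toy phone sequence (hypothetical, not corpus data).
toy_phones = list("abababac")
surprisal = biphone_surprisal(toy_phones)
```

In the toy sequence, "c" follows "a" in only one of four cases, so that biphone carries more bits of surprisal than the frequent "a" then "b" transition.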

@inproceedings{Schulz/etal:2016a,
title = {Impact of Prosodic Structure and Information Density on Vowel Space Size},
author = {Erika Schulz and Yoon Mi Oh and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://www.researchgate.net/publication/303755409_Impact_of_prosodic_structure_and_information_density_on_vowel_space_size},
year = {2016},
date = {2016},
booktitle = {Proceedings of Speech Prosody},
pages = {350-354},
address = {Boston, MA, USA},
abstract = {We investigated the influence of prosodic structure and information density on vowel space size. Vowels were measured in five languages from the BonnTempo corpus, French, German, Finnish, Czech, and Polish, each with three female and three male speakers. Speakers read the text at normal, slow, and fast speech rate. The Euclidean distance between vowel space midpoint and formant values for each speaker was used as a measure for vowel distinctiveness. The prosodic model consisted of prominence and boundary. Information density was calculated for each language using the surprisal of the biphone Xn|Xn−1. On average, there is a positive relationship between vowel space expansion and information density. Detailed analysis revealed that this relationship did not hold for Finnish, and was only weak for Polish. When vowel distinctiveness was modeled as a function of prosodic factors and information density in linear mixed effects models (LMM), only prosodic factors were significant in explaining the variance in vowel space expansion. All prosodic factors, except word boundary, showed significant positive results in LMM. Vowels were more distinct in stressed syllables, before a prosodic boundary and at normal and slow speech rate compared to fast speech.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Le Maguer, Sébastien; Möbius, Bernd; Steiner, Ingmar

Toward the use of information density based descriptive features in HMM based speech synthesis Inproceedings

8th International Conference on Speech Prosody, pp. 1029-1033, Boston, MA, USA, 2016.

Over the last decades, acoustic modeling for speech synthesis has been improved significantly. However, in most systems, the descriptive feature set used to represent annotated text has been the same for many years. Specifically, the prosody models in most systems are based on low level information such as syllable stress or word part-of-speech tags. In this paper, we propose to enrich the descriptive feature set by adding a linguistic measure computed from the predictability of an event, such as the occurrence of a syllable or word. By adding such descriptive features, we assume that we will improve prosody modeling. This new feature set is then used to train prosody models for speech synthesis. Results from an evaluation study indicate a preference for the new descriptive feature set over the conventional one.

@inproceedings{LeMaguer2016SP,
title = {Toward the use of information density based descriptive features in HMM based speech synthesis},
author = {S{\'e}bastien Le Maguer and Bernd M{\"o}bius and Ingmar Steiner},
url = {https://www.researchgate.net/publication/305684951_Toward_the_use_of_information_density_based_descriptive_features_in_HMM_based_speech_synthesis},
year = {2016},
date = {2016},
booktitle = {8th International Conference on Speech Prosody},
pages = {1029-1033},
address = {Boston, MA, USA},
abstract = {Over the last decades, acoustic modeling for speech synthesis has been improved significantly. However, in most systems, the descriptive feature set used to represent annotated text has been the same for many years. Specifically, the prosody models in most systems are based on low level information such as syllable stress or word part-of-speech tags. In this paper, we propose to enrich the descriptive feature set by adding a linguistic measure computed from the predictability of an event, such as the occurrence of a syllable or word. By adding such descriptive features, we assume that we will improve prosody modeling. This new feature set is then used to train prosody models for speech synthesis. Results from an evaluation study indicate a preference for the new descriptive feature set over the conventional one.},
pubstate = {published},
type = {inproceedings}
}

Projects:   C1 C5

Le Maguer, Sébastien; Möbius, Bernd; Steiner, Ingmar; Lolive, Damien

De l'utilisation de descripteurs issus de la linguistique computationnelle dans le cadre de la synthèse par HMM Inproceedings

Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP, AFCP - ATALA, pp. 714-722, Paris, France, 2016.

Over the last decades, the acoustic modeling performed by parametric speech synthesis systems has received particular attention. However, in most known systems, the set of linguistic descriptors used to represent the text has remained the same. More specifically, prosody modeling is still guided by low-level descriptors such as syllable stress information or the part-of-speech tag of the word. In this article, we propose to integrate information based on the predictability of an event (the syllable or the word). Several studies indicate a strong correlation between this measure, which is widely used in computational linguistics, and certain characteristics of human speech production. Our hypothesis is therefore that adding these descriptors improves prosody modeling. This article focuses on an objective analysis of the contribution of these descriptors to HMM synthesis for English and French.

@inproceedings{Lemaguer/etal:2016b,
title = {De l'utilisation de descripteurs issus de la linguistique computationnelle dans le cadre de la synthèse par HMM},
author = {S{\'e}bastien Le Maguer and Bernd M{\"o}bius and Ingmar Steiner and Damien Lolive},
url = {https://aclanthology.org/2016.jeptalnrecital-jep.80},
year = {2016},
date = {2016},
booktitle = {Actes de la conf{\'e}rence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP},
pages = {714-722},
publisher = {AFCP - ATALA},
address = {Paris, France},
abstract = {Over the last decades, the acoustic modeling performed by parametric speech synthesis systems has received particular attention. However, in most known systems, the set of linguistic descriptors used to represent the text has remained the same. More specifically, prosody modeling is still guided by low-level descriptors such as syllable stress information or the part-of-speech tag of the word. In this article, we propose to integrate information based on the predictability of an event (the syllable or the word). Several studies indicate a strong correlation between this measure, which is widely used in computational linguistics, and certain characteristics of human speech production. Our hypothesis is therefore that adding these descriptors improves prosody modeling. This article focuses on an objective analysis of the contribution of these descriptors to HMM synthesis for English and French.},
pubstate = {published},
type = {inproceedings}
}

Projects:   C1 C5
