Publications

Yuen, Ivan; Andreeva, Bistra; Ibrahim, Omnia; Möbius, Bernd

Prosodic factors do not always suppress discourse or surprisal factors on word-final syllable duration in German polysyllabic words Incollection

Lemke, Robin; Schäfer, Lisa; Reich, Ingo (Ed.): Information Structure and Information Theory, Language Science Press, pp. 215-234, Berlin, 2024.

Predictability is known to influence acoustic duration (e.g., Ibrahim et al. 2022) and prosodic factors such as accenting and boundary-related lengthening have been postulated to account for this effect (e.g., Aylett & Turk 2004). However, it has also been shown that other factors such as information status or speech styles could contribute to acoustic duration (e.g. Baker & Bradlow 2009). This raises the question as to whether acoustic duration is primarily subject to the influence of prosody that reflects linguistic structure including predictability. The current study addressed this question by examining the acoustic duration of word-final syllables in polysyllabic words in DIRNDL, a German radio broadcast corpus (e.g. Eckart et al. 2012). We analysed polysyllabic words followed by an intermediate phrase or an intonational phrase boundary, with or without accenting, and with given or new information status. Our results indicate that the acoustic duration of the word-final syllable was subject to the effect of prosodic boundary for long host words, in line with Aylett & Turk (2004); however, we also observed additional effects of information status, log surprisal and accenting for short host words, in line with Baker & Bradlow (2009). These results suggest that acoustic duration is subject to the influence of prosodic (e.g., boundary and accenting) and linguistic factors (e.g., information status and surprisal), and that the primacy of prosodic factors impacting on acoustic duration is further constrained by some intrinsic durational constraints, for example word length.

@incollection{Yuen/etal:2024b,
title = {Prosodic factors do not always suppress discourse or surprisal factors on word-final syllable duration in German polysyllabic words},
author = {Ivan Yuen and Bistra Andreeva and Omnia Ibrahim and Bernd M{\"o}bius},
editor = {Robin Lemke and Lisa Sch{\"a}fer and Ingo Reich},
url = {https://zenodo.org/records/13383799},
doi = {https://doi.org/10.5281/zenodo.13383799},
year = {2024},
date = {2024},
booktitle = {Information Structure and Information Theory},
pages = {215-234},
publisher = {Language Science Press},
address = {Berlin},
abstract = {Predictability is known to influence acoustic duration (e.g., Ibrahim et al. 2022) and prosodic factors such as accenting and boundary-related lengthening have been postulated to account for this effect (e.g., Aylett & Turk 2004). However, it has also been shown that other factors such as information status or speech styles could contribute to acoustic duration (e.g. Baker & Bradlow 2009). This raises the question as to whether acoustic duration is primarily subject to the influence of prosody that reflects linguistic structure including predictability. The current study addressed this question by examining the acoustic duration of word-final syllables in polysyllabic words in DIRNDL, a German radio broadcast corpus (e.g. Eckart et al. 2012). We analysed polysyllabic words followed by an intermediate phrase or an intonational phrase boundary, with or without accenting, and with given or new information status. Our results indicate that the acoustic duration of the word-final syllable was subject to the effect of prosodic boundary for long host words, in line with Aylett & Turk (2004); however, we also observed additional effects of information status, log surprisal and accenting for short host words, in line with Baker & Bradlow (2009). These results suggest that acoustic duration is subject to the influence of prosodic (e.g., boundary and accenting) and linguistic factors (e.g., information status and surprisal), and that the primacy of prosodic factors impacting on acoustic duration is further constrained by some intrinsic durational constraints, for example word length.},
pubstate = {published},
type = {incollection}
}

Project:   C1

Pellegrino, Elisa; Dellwo, Volker; Pardo, Jennifer; Möbius, Bernd

Forms, factors and functions of phonetic convergence: Editorial Journal Article

Speech Communication, 165, 2024.
This introductory article for the Special Issue on Forms, Factors and Functions of Phonetic Convergence offers a comprehensive overview of the dominant theoretical paradigms, elicitation methods, and computational approaches pertaining to phonetic convergence, and discusses the role of established factors shaping interspeakers’ acoustic adjustments. The nine papers in this collection offer new insights into the fundamental mechanisms, factors and functions behind accommodation in production and perception, and in the perception of accommodation. By integrating acoustic, articulatory and perceptual evaluations of convergence, and combining traditional experimental phonetic analysis with computational modeling, the nine papers (1) emphasize the roles of cognitive adaptability and phonetic variability as triggers for convergence, (2) reveal fundamental similarities between the mechanisms of convergence perception and speaker identification, and (3) shed light on the evolutionary link between adaptation in human and animal vocalizations.

@article{Pellegrino/etal:2024,
title = {Forms, factors and functions of phonetic convergence: Editorial},
author = {Elisa Pellegrino and Volker Dellwo and Jennifer Pardo and Bernd M{\"o}bius},
url = {https://www.sciencedirect.com/science/article/pii/S0167639324001134},
doi = {https://doi.org/10.1016/j.specom.2024.103142},
year = {2024},
date = {2024},
journal = {Speech Communication},
volume = {165},
abstract = {
This introductory article for the Special Issue on Forms, Factors and Functions of Phonetic Convergence offers a comprehensive overview of the dominant theoretical paradigms, elicitation methods, and computational approaches pertaining to phonetic convergence, and discusses the role of established factors shaping interspeakers’ acoustic adjustments. The nine papers in this collection offer new insights into the fundamental mechanisms, factors and functions behind accommodation in production and perception, and in the perception of accommodation. By integrating acoustic, articulatory and perceptual evaluations of convergence, and combining traditional experimental phonetic analysis with computational modeling, the nine papers (1) emphasize the roles of cognitive adaptability and phonetic variability as triggers for convergence, (2) reveal fundamental similarities between the mechanisms of convergence perception and speaker identification, and (3) shed light on the evolutionary link between adaptation in human and animal vocalizations.
},
pubstate = {published},
type = {article}
}

Project:   C1

Gessinger, Iona; Andreeva, Bistra; Cowan, Benjamin R.

The Use of Modifiers and f0 in Remote Referential Communication with Human and Computer Partners Inproceedings

Proc. Interspeech 2024, pp. 1575-1579, 2024, ISSN 2958-1796.

The present study investigates referring expressions in a remote interaction context with a human or computer partner (both simulated). Across these conditions, we compare the effect of competitor information being available to both partners (common ground) or only the speaker (privileged ground) on target item descriptions. We analyse the number of adjectival modifiers uttered and show that participants responded to the manipulation of information status in both partner conditions. In addition, we examine whether the information status also affects the prosodic realisation of the descriptions. No sufficient evidence was found for this. As expected, adjectives showed a slightly higher peak f0 when a competitor was present in the common ground than when there was no competitor. However, when analysing the overall f0 contour, there was no systematic difference between conditions.

@inproceedings{gessinger24_interspeech,
title = {The Use of Modifiers and f0 in Remote Referential Communication with Human and Computer Partners},
author = {Iona Gessinger and Bistra Andreeva and Benjamin R. Cowan},
url = {https://www.isca-archive.org/interspeech_2024/gessinger24_interspeech.html},
doi = {https://doi.org/10.21437/Interspeech.2024-1169},
year = {2024},
date = {2024},
booktitle = {Proc. Interspeech 2024},
issn = {2958-1796},
pages = {1575-1579},
abstract = {
The present study investigates referring expressions in a remote interaction context with a human or computer partner (both simulated). Across these conditions, we compare the effect of competitor information being available to both partners (common ground) or only the speaker (privileged ground) on target item descriptions. We analyse the number of adjectival modifiers uttered and show that participants responded to the manipulation of information status in both partner conditions. In addition, we examine whether the information status also affects the prosodic realisation of the descriptions. No sufficient evidence was found for this. As expected, adjectives showed a slightly higher peak f0 when a competitor was present in the common ground than when there was no competitor. However, when analysing the overall f0 contour, there was no systematic difference between conditions.
},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Xue, Wei; Yuen, Ivan; Möbius, Bernd

Towards a better understanding of receptive multilingualism: listening conditions and priming effects Inproceedings

Proceedings of Interspeech 2024, ISCA, pp. 12-16, Kos, Greece, 2024.

Receptive multilingualism is a form of communication where speakers can comprehend an utterance of a foreign language (Lx) using their native language (L1) when L1 and Lx share similarities in, e.g., vocabulary and pronunciation. The success of receptive multilingualism can be tested by examining accuracy and reaction time of auditory word recognition (AWR) of target words in lexical decision tasks. AWR in such tasks can be affected by adverse listening conditions due to environmental noises and by the presence of a preceding prime word. This study explores whether AWR of L1 in Lx-L1 pairs (Lx = Dutch; L1 = German or English) will be affected by different degrees of similarities in their phonology and semantics and whether such an influence will differ as a function of listening condition. We observed less accurate and slower responses without semantic similarity but a null effect on accuracy without phonological overlap. The interaction with listening conditions is language-dependent.

@inproceedings{Xue/etal:2024a,
title = {Towards a better understanding of receptive multilingualism: listening conditions and priming effects},
author = {Wei Xue and Ivan Yuen and Bernd M{\"o}bius},
url = {https://www.isca-archive.org/interspeech_2024/xue24_interspeech.html},
doi = {https://doi.org/10.21437/Interspeech.2024-418},
year = {2024},
date = {2024},
booktitle = {Proceedings of Interspeech 2024},
pages = {12-16},
publisher = {ISCA},
address = {Kos, Greece},
abstract = {Receptive multilingualism is a form of communication where speakers can comprehend an utterance of a foreign language (Lx) using their native language (L1) when L1 and Lx share similarities in, e.g., vocabulary and pronunciation. The success of receptive multilingualism can be tested by examining accuracy and reaction time of auditory word recognition (AWR) of target words in lexical decision tasks. AWR in such tasks can be affected by adverse listening conditions due to environmental noises and by the presence of a preceding prime word. This study explores whether AWR of L1 in Lx-L1 pairs (Lx = Dutch; L1 = German or English) will be affected by different degrees of similarities in their phonology and semantics and whether such an influence will differ as a function of listening condition. We observed less accurate and slower responses without semantic similarity but a null effect on accuracy without phonological overlap. The interaction with listening conditions is language-dependent.},
pubstate = {published},
type = {inproceedings}
}

Projects:   C1 C4

Manzoni-Luxenburger, Judith; Andreeva, Bistra; Zahner-Ritter, Katharina

Intonational Patterns under Time Pressure: Phonetic Strategies in Bulgarian Learners of German and English Inproceedings

Proc. Speech Prosody 2024, pp. 369-373, 2024.

Research on the second-language (L2) acquisition of intonation is a growing field but only few studies have (so far) focused on the fine phonetic detail of intonational patterns in the L2. The present study concentrates on the phonetic realization of nuclear intonation contours under time pressure, testing Bulgarian learners in their L2s German and English – two languages in which intonation contours are accommodated differently by native speakers (L1) when little sonorant material is available. In particular, nuclear falling contours (H* L-%) tend to be truncated in L1 German while they are compressed in L1 English. Here we recorded 14 Bulgarian learners in their L2s German and English (within subjects, language order counterbalanced) when producing utterances in a statement context. The target word, a surname placed at the end of the utterance, differed in the available sonorant material (disyllable vs. monosyllables with long and short vowels). Our findings showed that Bulgarian speakers primarily truncate nuclear falling movements ((L+)H* L-%) in both L2s, suggesting transfer irrespective of the target strategy. However, our data show substantial inter- and intra-individual variation which we will discuss, along with factors that might explain this variation.

@inproceedings{manzoniluxenburger24_speechprosody,
title = {Intonational Patterns under Time Pressure: Phonetic Strategies in Bulgarian Learners of German and English},
author = {Judith Manzoni-Luxenburger and Bistra Andreeva and Katharina Zahner-Ritter},
url = {https://www.isca-archive.org/speechprosody_2024/manzoniluxenburger24_speechprosody.html},
doi = {https://doi.org/10.21437/SpeechProsody.2024-75},
year = {2024},
date = {2024},
booktitle = {Proc. Speech Prosody 2024},
pages = {369-373},
abstract = {Research on the second-language (L2) acquisition of intonation is a growing field but only few studies have (so far) focused on the fine phonetic detail of intonational patterns in the L2. The present study concentrates on the phonetic realization of nuclear intonation contours under time pressure, testing Bulgarian learners in their L2s German and English – two languages in which intonation contours are accommodated differently by native speakers (L1) when little sonorant material is available. In particular, nuclear falling contours (H* L-%) tend to be truncated in L1 German while they are compressed in L1 English. Here we recorded 14 Bulgarian learners in their L2s German and English (within subjects, language order counterbalanced) when producing utterances in a statement context. The target word, a surname placed at the end of the utterance, differed in the available sonorant material (disyllable vs. monosyllables with long and short vowels). Our findings showed that Bulgarian speakers primarily truncate nuclear falling movements ((L+)H* L-%) in both L2s, suggesting transfer irrespective of the target strategy. However, our data show substantial inter- and intra-individual variation which we will discuss, along with factors that might explain this variation.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Yuen, Ivan; Andreeva, Bistra; Ibrahim, Omnia; Möbius, Bernd

Differential effects of word frequency and utterance position on the duration of tense and lax vowels in German Inproceedings

Proc. Speech Prosody 2024, pp. 442-446, Leiden, The Netherlands, 2024.

Acoustic duration is subject to modification from multiple sources, for example, utterance position [13] and predictability such as occurrence frequency at word and syllable levels [e.g., 2, 3, 4]. A study of German radio corpus data showed that these two sources interact to modify syllable duration. On the one hand, the predictability effect can percolate downstream to the segmental level, and this downstream effect is sensitive to phonological contrasts [9]. On the other, [6] showed that utterance-final lengthening is uniformly applied to tense and lax vowels in German. This then raises some questions as to whether the effects of the two sources of durational variation are uniformly applied or sensitive to phonological contrasts. The current study focused on the duration of tense and lax vowels in the stressed syllable of monosyllabic and disyllabic words in utterance-medial and utterance-final positions. Twenty German speakers participated in a question-answer elicitation task. A preliminary analysis of seven speakers showed effects of utterance position and word frequency, as well as interactions with vowel type, suggesting a non-uniform application of durational adjustments contingent on phonological vowel length. Interestingly, the frequency effect affects the duration of lax vowels, but utterance position affects the duration of tense vowels.

@inproceedings{Yuen/etal:2024a,
title = {Differential effects of word frequency and utterance position on the duration of tense and lax vowels in German},
author = {Ivan Yuen and Bistra Andreeva and Omnia Ibrahim and Bernd M{\"o}bius},
url = {https://www.isca-archive.org/speechprosody_2024/yuen24_speechprosody.html},
doi = {https://doi.org/10.21437/SpeechProsody.2024-90},
year = {2024},
date = {2024},
booktitle = {Proc. Speech Prosody 2024},
pages = {442-446},
address = {Leiden, The Netherlands},
abstract = {Acoustic duration is subject to modification from multiple sources, for example, utterance position [13] and predictability such as occurrence frequency at word and syllable levels [e.g., 2, 3, 4]. A study of German radio corpus data showed that these two sources interact to modify syllable duration. On the one hand, the predictability effect can percolate downstream to the segmental level, and this downstream effect is sensitive to phonological contrasts [9]. On the other, [6] showed that utterance-final lengthening is uniformly applied to tense and lax vowels in German. This then raises some questions as to whether the effects of the two sources of durational variation are uniformly applied or sensitive to phonological contrasts. The current study focused on the duration of tense and lax vowels in the stressed syllable of monosyllabic and disyllabic words in utterance-medial and utterance-final positions. Twenty German speakers participated in a question-answer elicitation task. A preliminary analysis of seven speakers showed effects of utterance position and word frequency, as well as interactions with vowel type, suggesting a non-uniform application of durational adjustments contingent on phonological vowel length. Interestingly, the frequency effect affects the duration of lax vowels, but utterance position affects the duration of tense vowels.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Ibrahim, Omnia; Yuen, Ivan; Xue, Wei; Andreeva, Bistra; Möbius, Bernd

Listener-oriented consequences of predictability-based acoustic adjustment Inproceedings

Baumann, Timo (Ed.): Elektronische Sprachsignalverarbeitung 2024, Tagungsband der 35. Konferenz (Regensburg), TUD Press, pp. 196-202, 2024, ISBN 978-3-95908-325-6.

This paper investigated whether predictability-based adjustments in production have listener-oriented consequences in perception. By manipulating the acoustic features of a target syllable in different predictability contexts in German, we tested 40 listeners’ perceptual preference for the manipulation. Four source words underwent acoustic modifications on the target syllable. Our results revealed a general preference for the original (unmodified) version over the modified one. However, listeners generally favored the unmodified version more when the source word had a higher predictable context compared to a less predictable one. The results showed that predictability-based adjustments have perceptual consequences and that listeners have predictability-based expectations in perception.

@inproceedings{Ibrahim_etal_2024,
title = {Listener-oriented consequences of predictability-based acoustic adjustment},
author = {Omnia Ibrahim and Ivan Yuen and Wei Xue and Bistra Andreeva and Bernd M{\"o}bius},
editor = {Timo Baumann},
url = {https://opus4.kobv.de/opus4-oth-regensburg/frontdoor/index/index/docId/7098},
doi = {https://doi.org/10.35096/othr/pub-7098},
year = {2024},
date = {2024},
booktitle = {Elektronische Sprachsignalverarbeitung 2024, Tagungsband der 35. Konferenz (Regensburg)},
isbn = {978-3-95908-325-6},
pages = {196-202},
publisher = {TUD Press},
abstract = {This paper investigated whether predictability-based adjustments in production have listener-oriented consequences in perception. By manipulating the acoustic features of a target syllable in different predictability contexts in German, we tested 40 listeners’ perceptual preference for the manipulation. Four source words underwent acoustic modifications on the target syllable. Our results revealed a general preference for the original (unmodified) version over the modified one. However, listeners generally favored the unmodified version more when the source word had a higher predictable context compared to a less predictable one. The results showed that predictability-based adjustments have perceptual consequences and that listeners have predictability-based expectations in perception.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Elmers, Mikey

Evaluating pause particles and their functions in natural and synthesized speech in laboratory and lecture settings PhD Thesis

Saarland University, Saarbruecken, Germany, 2023.

Pause-internal phonetic particles (PINTs) comprise a variety of phenomena including: phonetic-acoustic silence, inhalation and exhalation breath noises, filler particles “uh” and “um” in English, tongue clicks, and many others. These particles are omni-present in spontaneous speech, however, they are under-researched in both natural speech and synthetic speech. The present work explores the influence of PINTs in small-context recall experiments, develops a bespoke speech synthesis system that incorporates the PINTs pattern of a single speaker, and evaluates the influence of PINTs on recall for larger material lengths, namely university lectures. The benefit of PINTs on recall has been documented in natural speech in small-context laboratory settings, however, this area of research has been under-explored for synthetic speech. We devised two experiments to evaluate if PINTs have the same recall benefit for synthetic material that is found with natural material. In the first experiment, we evaluated the recollection of consecutive missing digits for a randomized 7-digit number. Results indicated that an inserted silence improved recall accuracy for digits immediately following. In the second experiment, we evaluated sentence recollection. Results indicated that sentences preceded by an inhalation breath noise were better recalled than those with no inhalation. Together, these results reveal that in single-sentence laboratory settings PINTs can improve recall for synthesized speech. The speech synthesis systems used in the small-context recall experiments did not provide much freedom in terms of controlling PINT type or location. Therefore, we endeavoured to develop bespoke speech synthesis systems. Two neural text-to-speech (TTS) systems were created: one that used PINTs annotation labels in the training data, and another that did not include any PINTs labeling in the training material. 
The first system allowed fine-tuned control for inserting PINTs material into the rendered material. The second system produced PINTs probabilistically. To the best of our knowledge, these are the first TTS systems to render tongue clicks. Equipped with greater control of synthesized PINTs, we returned to evaluating the recall benefit of PINTs. This time we evaluated the influence of PINTs on the recollection of key information in lectures, an ecologically valid task that focused on larger material lengths. Results indicated that key information that followed PINTs material was less likely to be recalled. We were unable to replicate the benefits of PINTs found in the small-context laboratory settings. This body of work showcases that PINTs improve recall for TTS in small-context environments just like previous work had indicated for natural speech. Additionally, we’ve provided a technological contribution via a neural TTS system that exerts finer control over PINT type and placement. Lastly, we’ve shown the importance of using material rendered by speech synthesis systems in perceptual studies.

@phdthesis{Elmers_Diss_2023,
title = {Evaluating pause particles and their functions in natural and synthesized speech in laboratory and lecture settings},
author = {Mikey Elmers},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/36999},
doi = {https://doi.org/10.22028/D291-41118},
year = {2023},
date = {2023},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {Pause-internal phonetic particles (PINTs) comprise a variety of phenomena including: phonetic-acoustic silence, inhalation and exhalation breath noises, filler particles “uh” and “um” in English, tongue clicks, and many others. These particles are omni-present in spontaneous speech, however, they are under-researched in both natural speech and synthetic speech. The present work explores the influence of PINTs in small-context recall experiments, develops a bespoke speech synthesis system that incorporates the PINTs pattern of a single speaker, and evaluates the influence of PINTs on recall for larger material lengths, namely university lectures. The benefit of PINTs on recall has been documented in natural speech in small-context laboratory settings, however, this area of research has been under-explored for synthetic speech. We devised two experiments to evaluate if PINTs have the same recall benefit for synthetic material that is found with natural material. In the first experiment, we evaluated the recollection of consecutive missing digits for a randomized 7-digit number. Results indicated that an inserted silence improved recall accuracy for digits immediately following. In the second experiment, we evaluated sentence recollection. Results indicated that sentences preceded by an inhalation breath noise were better recalled than those with no inhalation. Together, these results reveal that in single-sentence laboratory settings PINTs can improve recall for synthesized speech. The speech synthesis systems used in the small-context recall experiments did not provide much freedom in terms of controlling PINT type or location. Therefore, we endeavoured to develop bespoke speech synthesis systems. Two neural text-to-speech (TTS) systems were created: one that used PINTs annotation labels in the training data, and another that did not include any PINTs labeling in the training material. 
The first system allowed fine-tuned control for inserting PINTs material into the rendered material. The second system produced PINTs probabilistically. To the best of our knowledge, these are the first TTS systems to render tongue clicks. Equipped with greater control of synthesized PINTs, we returned to evaluating the recall benefit of PINTs. This time we evaluated the influence of PINTs on the recollection of key information in lectures, an ecologically valid task that focused on larger material lengths. Results indicated that key information that followed PINTs material was less likely to be recalled. We were unable to replicate the benefits of PINTs found in the small-context laboratory settings. This body of work showcases that PINTs improve recall for TTS in small-context environments just like previous work had indicated for natural speech. Additionally, we’ve provided a technological contribution via a neural TTS system that exerts finer control over PINT type and placement. Lastly, we’ve shown the importance of using material rendered by speech synthesis systems in perceptual studies.},
pubstate = {published},
type = {phdthesis}
}

Project:   C1

Werner, Raphael

The phonetics of speech breathing : pauses, physiology, acoustics, and perception PhD Thesis

Saarland University, Saarbruecken, Germany, 2023.

Speech is made up of a continuous stream of speech sounds that is interrupted by pauses and breathing. As phoneticians are primarily interested in describing the segments of the speech stream, pauses and breathing are often neglected in phonetic studies, even though they are vital for speech. The present work adds to a more detailed view of both pausing and speech breathing with a special focus on the latter and the resulting breath noises, investigating their acoustic, physiological, and perceptual aspects. We present an overview of how a selection of corpora annotate pauses and pause-internal particles, as well as a recording setup that can be used for further studies on speech breathing. For pauses, this work emphasized their optionality and variability under different tempos, as well as the temporal composition of silence and breath noise in breath pauses. For breath noises, we first focused on acoustic and physiological characteristics: We explored alignment between the onsets and offsets of audible breath noises with the start and end of expansion of both rib cage and abdomen. Further, we found similarities between speech breath noises and aspiration phases of /k/, as well as that breath noises may be produced with a more open and slightly more front place of articulation than realizations of schwa. We found positive correlations between acoustic and physiological parameters, suggesting that when speakers inhale faster, the resulting breath noises were more intense and produced more anterior in the mouth. Inspecting the entire spectrum of speech breath noises, we showed relatively flat spectra and several weak peaks. These peaks largely overlapped with resonances reported for inhalations produced with a central vocal tract configuration. We used 3D-printed vocal tract models representing four vowels and four fricatives to simulate in- and exhalations by reversing airflow direction. 
We found the direction to not have a general effect for all models, but only for those with high-tongue configurations, as opposed to those that were more open. Then, we compared inhalations produced with the schwa-model to human inhalations in an attempt to approach the vocal tract configuration in speech breathing. There were some similarities, however, several complexities of human speech breathing not captured in the models complicated comparisons. In two perception studies, we investigated how much information listeners could auditorily extract from breath noises. First, we tested categorizing different breath noises into six different types, based on airflow direction and airway usage, e.g. oral inhalation. Around two thirds of all answers were correct. Second, we investigated how well breath noises could be used to discriminate between speakers and to extract coarse information on speaker characteristics, such as age (old/young) and sex (female/male). We found that listeners were able to distinguish between two breath noises coming from the same or different speakers in around two thirds of all cases. Hearing one breath noise, classification of sex was successful in around 64%, while for age it was 50%, suggesting that sex was more perceivable than age in breath noises.

@phdthesis{Werner_Diss_2023,
title = {The phonetics of speech breathing : pauses, physiology, acoustics, and perception},
author = {Raphael Werner},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/36987},
doi = {https://doi.org/10.22028/D291-41147},
year = {2023},
date = {2023},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {Speech is made up of a continuous stream of speech sounds that is interrupted by pauses and breathing. As phoneticians are primarily interested in describing the segments of the speech stream, pauses and breathing are often neglected in phonetic studies, even though they are vital for speech. The present work adds to a more detailed view of both pausing and speech breathing with a special focus on the latter and the resulting breath noises, investigating their acoustic, physiological, and perceptual aspects. We present an overview of how a selection of corpora annotate pauses and pause-internal particles, as well as a recording setup that can be used for further studies on speech breathing. For pauses, this work emphasized their optionality and variability under different tempos, as well as the temporal composition of silence and breath noise in breath pauses. For breath noises, we first focused on acoustic and physiological characteristics: We explored alignment between the onsets and offsets of audible breath noises with the start and end of expansion of both rib cage and abdomen. Further, we found similarities between speech breath noises and aspiration phases of /k/, as well as that breath noises may be produced with a more open and slightly more front place of articulation than realizations of schwa. We found positive correlations between acoustic and physiological parameters, suggesting that when speakers inhale faster, the resulting breath noises were more intense and produced more anterior in the mouth. Inspecting the entire spectrum of speech breath noises, we showed relatively flat spectra and several weak peaks. These peaks largely overlapped with resonances reported for inhalations produced with a central vocal tract configuration. We used 3D-printed vocal tract models representing four vowels and four fricatives to simulate in- and exhalations by reversing airflow direction. 
We found the direction to not have a general effect for all models, but only for those with high-tongue configurations, as opposed to those that were more open. Then, we compared inhalations produced with the schwa-model to human inhalations in an attempt to approach the vocal tract configuration in speech breathing. There were some similarities, however, several complexities of human speech breathing not captured in the models complicated comparisons. In two perception studies, we investigated how much information listeners could auditorily extract from breath noises. First, we tested categorizing different breath noises into six different types, based on airflow direction and airway usage, e.g. oral inhalation. Around two thirds of all answers were correct. Second, we investigated how well breath noises could be used to discriminate between speakers and to extract coarse information on speaker characteristics, such as age (old/young) and sex (female/male). We found that listeners were able to distinguish between two breath noises coming from the same or different speakers in around two thirds of all cases. Hearing one breath noise, classification of sex was successful in around 64%, while for age it was 50%, suggesting that sex was more perceivable than age in breath noises.},
pubstate = {published},
type = {phdthesis}
}

Project:   C1

Gessinger, Iona; Cohn, Michelle; Cowan, Benjamin R.; Zellou, Georgia; Möbius, Bernd

Cross-linguistic emotion perception in human and TTS voices Inproceedings

Proceedings of Interspeech 2023, pp. 5222-5226, Dublin, Ireland, 2023.

This study investigates how German listeners perceive changes in the emotional expression of German and American English human voices and Amazon Alexa text-to-speech (TTS) voices, respectively. Participants rated sentences containing emotionally neutral lexico-semantic information that were resynthesized to vary in prosodic emotional expressiveness. Starting from an emotionally neutral production, three levels of increasing 'happiness' were created. Results show that 'happiness' manipulations lead to higher ratings of emotional valence (i.e., more positive) and arousal (i.e., more excited) for German and English voices, with stronger effects for the German voices. In particular, changes in valence were perceived more prominently in German TTS compared to English TTS. Additionally, both TTS voices were rated lower than the respective human voices on scales that reflect anthropomorphism (e.g., human-likeness). We discuss these findings in the context of cross-linguistic emotion accounts.

@inproceedings{Gessinger/etal:2023,
title = {Cross-linguistic emotion perception in human and TTS voices},
author = {Iona Gessinger and Michelle Cohn and Benjamin R. Cowan and Georgia Zellou and Bernd M{\"o}bius},
url = {https://www.isca-speech.org/archive/interspeech_2023/gessinger23_interspeech.html},
doi = {https://doi.org/10.21437/Interspeech.2023-711},
year = {2023},
date = {2023},
booktitle = {Proceedings of Interspeech 2023},
pages = {5222-5226},
address = {Dublin, Ireland},
abstract = {This study investigates how German listeners perceive changes in the emotional expression of German and American English human voices and Amazon Alexa text-to-speech (TTS) voices, respectively. Participants rated sentences containing emotionally neutral lexico-semantic information that were resynthesized to vary in prosodic emotional expressiveness. Starting from an emotionally neutral production, three levels of increasing 'happiness' were created. Results show that 'happiness' manipulations lead to higher ratings of emotional valence (i.e., more positive) and arousal (i.e., more excited) for German and English voices, with stronger effects for the German voices. In particular, changes in valence were perceived more prominently in German TTS compared to English TTS. Additionally, both TTS voices were rated lower than the respective human voices on scales that reflect anthropomorphism (e.g., human-likeness). We discuss these findings in the context of cross-linguistic emotion accounts.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Yuen, Ivan; Ibrahim, Omnia; Andreeva, Bistra; Möbius, Bernd

Non-uniform cue-trading: differential effects of surprisal on pause usage and pause duration in German Inproceedings

Proceedings of the 20th International Congress of Phonetic Sciences, ICPhS 2023 (Prague, Czech Rep.), pp. 619-623, 2023.

Pause occurrence is conditional on contextual (un)predictability (in terms of surprisal) [10, 11], and so is the acoustic implementation of duration at multiple linguistic levels. Although these cues (i.e., pause usage/pause duration and syllable duration) are subject to the influence of the same factor, it is not clear how they are related to one another. A recent study in [1] using pause duration to define prosodic boundary strength reported a more pronounced surprisal effect on syllable duration, hinting at a trading relationship. The current study aimed to directly test for trading relationships among pause usage, pause duration and syllable duration in different surprisal contexts, analysing German radio news in the DIRNDL corpus. No trading relationship was observed between pause usage and surprisal, or between pause usage and syllable duration. However, a trading relationship was found between the durations of a pause and a syllable for accented items.

@inproceedings{Yuen/etal:2023a,
title = {Non-uniform cue-trading: differential effects of surprisal on pause usage and pause duration in German},
author = {Ivan Yuen and Omnia Ibrahim and Bistra Andreeva and Bernd M{\"o}bius},
year = {2023},
date = {2023},
booktitle = {Proceedings of the 20th International Congress of Phonetic Sciences, ICPhS 2023 (Prague, Czech Rep.)},
pages = {619-623},
abstract = {Pause occurrence is conditional on contextual (un)predictability (in terms of surprisal) [10, 11], and so is the acoustic implementation of duration at multiple linguistic levels. Although these cues (i.e., pause usage/pause duration and syllable duration) are subject to the influence of the same factor, it is not clear how they are related to one another. A recent study in [1] using pause duration to define prosodic boundary strength reported a more pronounced surprisal effect on syllable duration, hinting at a trading relationship. The current study aimed to directly test for trading relationships among pause usage, pause duration and syllable duration in different surprisal contexts, analysing German radio news in the DIRNDL corpus. No trading relationship was observed between pause usage and surprisal, or between pause usage and syllable duration. However, a trading relationship was found between the durations of a pause and a syllable for accented items.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Ibrahim, Omnia; Yuen, Ivan; Andreeva, Bistra; Möbius, Bernd

The interplay between syllable-based predictability and voicing during closure in intersonorant German stops Inproceedings

Conference: Phonetics and Phonology in Europe 2023 (PaPE 2023), Nijmegen, the Netherlands, 2023.

Contextual predictability has pervasive effects on the acoustic realization of speech. Generally, duration is shortened in more predictable contexts and conversely lengthened in less predictable contexts. There are several measures to quantify predictability in a message. One of them is surprisal, which is calculated as S(Unit_i) = -log_2 P(Unit_i | Context). In a recent work, Ibrahim et al. have found that the effect of syllable-based surprisal on the temporal dimension(s) of a syllable selectively extends to the segmental level, for example, consonant voicing in German. Closure duration was uniformly longer for both voiceless and voiced consonants, but voice onset time was not. The voice onset time pattern might be related to German being typically considered an 'aspirating' language, using [+spread glottis] for voiceless consonants and [-spread glottis] for their voiced counterparts. However, voicing has also been reported in an intervocalic context for both voiceless and voiced consonants to varying extents. To further test whether the previously reported surprisal-based effect on voice onset time is driven by the phonological feature [spread glottis], the current study re-examined the downstream effect of syllable-based predictability on segmental voicing in German stops by measuring the degree of residual (phonetic) voicing during stop closure in an inter-sonorant context. Method: Data were based on a subset of stimuli (speech produced in a quiet acoustic condition) from Ibrahim et al. 38 German speakers recorded 60 sentences. Each sentence contained a target stressed CV syllable in a polysyllabic word. Each target syllable began with one of the stops /p, k, b, d/, combined with one of the vowels /a:, e:, i:, o:, u:/. The analyzed data contained voiceless vs. voiced initial stops in a low or high surprisal syllable. Closure duration (CD) and voicing during closure (VDC) were extracted using in-house Python and Praat scripts. 
A ratio measure VDC/CD was used to factor out any potential covariation between VDC and CD. Linear mixed-effects modeling was used to evaluate the effect(s) of surprisal and target stop voicing status on VDC/CD ratio using the lmer package in R. The final model was: VDC/CD ratio ∼ Surprisal + Target stop voicing status + (1 | Speaker) + (1 | Syllable ) + (1 | PrevManner ) + (1 | Sentence). Results: In an inter-sonorant context, we found a smaller VDC/CD ratio in voiceless stops than in voiced ones (p=2.04e-08***). As expected, residual voicing is shorter during a voiceless closure than during a voiced closure. This is consistent with the idea of preserving a phonological voicing distinction, as well as the physiological constraint of sustaining voicing for a long period during the closure of a voiceless stop. Moreover, the results yielded a significant effect of surprisal on VDC/CD ratio (p=.017*), with no interaction between the two factors (voicing and surprisal). The VDC/CD ratio is larger in a low than in a high surprisal syllable, irrespective of the voicing status of the target stops. That is, the syllable-based surprisal effect percolated down to German voicing, and the effect is uniform for a voiceless and voiced stop, when residual voicing was measured. Such a uniform effect on residual voicing is consistent with the previous result on closure duration. These findings reveal that the syllable-based surprisal effect can spread downstream to the segmental level and the effect is uniform for acoustic cues that are not directly tied to a phonological feature in German voicing (i.e. [spread glottis]).

@inproceedings{inproceedings,
title = {The interplay between syllable-based predictability and voicing during closure in intersonorant German stops},
author = {Omnia Ibrahim and Ivan Yuen and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://www.researchgate.net/publication/371138687_The_interplay_between_syllable-based_predictability_and_voicing_during_closure_in_intersonorant_German_stops},
year = {2023},
date = {2023},
booktitle = {Conference: Phonetics and Phonology in Europe 2023 (PaPE 2023)},
address = {Nijmegen, the Netherlands},
abstract = {

Contextual predictability has pervasive effects on the acoustic realization of speech. Generally, duration is shortened in more predictable contexts and conversely lengthened in less predictable contexts. There are several measures to quantify predictability in a message. One of them is surprisal, which is calculated as $S(\mathrm{Unit}_i) = -\log_2 P(\mathrm{Unit}_i \mid \mathrm{Context})$. In a recent work, Ibrahim et al. have found that the effect of syllable-based surprisal on the temporal dimension(s) of a syllable selectively extends to the segmental level, for example, consonant voicing in German. Closure duration was uniformly longer for both voiceless and voiced consonants, but voice onset time was not. The voice onset time pattern might be related to German being typically considered an 'aspirating' language, using [+spread glottis] for voiceless consonants and [-spread glottis] for their voiced counterparts. However, voicing has also been reported in an intervocalic context for both voiceless and voiced consonants to varying extents. To further test whether the previously reported surprisal-based effect on voice onset time is driven by the phonological feature [spread glottis], the current study re-examined the downstream effect of syllable-based predictability on segmental voicing in German stops by measuring the degree of residual (phonetic) voicing during stop closure in an inter-sonorant context. Method: Data were based on a subset of stimuli (speech produced in a quiet acoustic condition) from Ibrahim et al. 38 German speakers recorded 60 sentences. Each sentence contained a target stressed CV syllable in a polysyllabic word. Each target syllable began with one of the stops /p, k, b, d/, combined with one of the vowels /a:, e:, i:, o:, u:/. The analyzed data contained voiceless vs. voiced initial stops in a low or high surprisal syllable. Closure duration (CD) and voicing during closure (VDC) were extracted using in-house Python and Praat scripts. 
A ratio measure VDC/CD was used to factor out any potential covariation between VDC and CD. Linear mixed-effects modeling was used to evaluate the effect(s) of surprisal and target stop voicing status on VDC/CD ratio using the lmer package in R. The final model was: VDC/CD ratio ∼ Surprisal + Target stop voicing status + (1 | Speaker) + (1 | Syllable ) + (1 | PrevManner ) + (1 | Sentence). Results: In an inter-sonorant context, we found a smaller VDC/CD ratio in voiceless stops than in voiced ones (p=2.04e-08***). As expected, residual voicing is shorter during a voiceless closure than during a voiced closure. This is consistent with the idea of preserving a phonological voicing distinction, as well as the physiological constraint of sustaining voicing for a long period during the closure of a voiceless stop. Moreover, the results yielded a significant effect of surprisal on VDC/CD ratio (p=.017*), with no interaction between the two factors (voicing and surprisal). The VDC/CD ratio is larger in a low than in a high surprisal syllable, irrespective of the voicing status of the target stops. That is, the syllable-based surprisal effect percolated down to German voicing, and the effect is uniform for a voiceless and voiced stop, when residual voicing was measured. Such a uniform effect on residual voicing is consistent with the previous result on closure duration. These findings reveal that the syllable-based surprisal effect can spread downstream to the segmental level and the effect is uniform for acoustic cues that are not directly tied to a phonological feature in German voicing (i.e. [spread glottis]).
},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Ibrahim, Omnia

Speaker Adaptations as a Function of Message, Channel and Listener Variability PhD Thesis

University of Zürich, Zürich, Switzerland, 2022.

Speech is a highly dynamic process. Some variability is inherited directly from the language itself, while other variability stems from adapting to the surrounding environment or interlocutor. This Ph.D. thesis consists of seven studies investigating speech adaptation concerning the message, channel, and listener variability. It starts with investigating speakers’ adaptation to the linguistic message. Previous work has shown that duration is shortened in more predictable contexts, and conversely lengthened in less predictable contexts. This pervasive predictability effect is well studied in multiple languages and linguistic levels. However, syllable level predictability has been generally overlooked so far. This thesis aims to fill that gap. It focuses on the effect of information-theoretic factors at both the syllable and segmental levels. Furthermore, it found that the predictability effect is not uniform across all durational cues but is somewhat sensitive to the phonological relevance of a language-specific phonetic cue.
Speakers adapt not only to their message but also to the channel of transfer. For example, it is known that speakers modulate the characteristics of their speech and produce clear speech in response to background noise – syllables in noise have a longer duration, with higher average intensity, larger intensity range, and higher F0. Hence, speakers choose redundant multi-dimensional acoustic modifications to make their voices more salient and detectable in a noisy environment. This Ph.D. thesis provides new insights into speakers’ adaptation to noise and predictability on the acoustic realizations of syllables in German; showing that the speakers’ response to background noise is independent of syllable predictability.
Regarding speaker-to-listener adaptations, this thesis finds that speech variability is not necessarily a function of the interaction’s duration. Instead, speakers constantly position themselves concerning the ongoing social interaction. Indeed, speakers’ cooperation during the discussion would lead to a higher convergence behavior. Moreover, interpersonal power dynamics between interlocutors were found to serve as a predictor for accommodation behavior. This adaptation holds for both human-human interaction and human-robot interaction. In an ecological validity study, speakers changed their voice depending on whether they were addressing a human or a robot. Those findings align with previous studies on robot-directed speech and confirm that this difference also holds when the conversations are more natural and spontaneous.
The results of this thesis provide compelling evidence that speech adaptation is socially motivated and, to some extent, consciously controlled by the speaker. These findings have implications for including environment-based and listener-based formulations in speech production models along with message-based formulations. Furthermore, this thesis aims to advance our understanding of verbal and non-verbal behavior mechanisms for social communication. Finally, it contributes to the broader literature on information-theoretical factors and accommodation effects on speakers’ acoustic realization.

@phdthesis{Ibrahim_Diss_2022,
title = {Speaker Adaptations as a Function of Message, Channel and Listener Variability},
author = {Omnia Ibrahim},
url = {https://www.zora.uzh.ch/id/eprint/233694/},
doi = {https://doi.org/10.5167/uzh-233694},
year = {2022},
date = {2022},
school = {University of Z{\"u}rich},
address = {Z{\"u}rich, Switzerland},
abstract = {Speech is a highly dynamic process. Some variability is inherited directly from the language itself, while other variability stems from adapting to the surrounding environment or interlocutor. This Ph.D. thesis consists of seven studies investigating speech adaptation concerning the message, channel, and listener variability. It starts with investigating speakers’ adaptation to the linguistic message. Previous work has shown that duration is shortened in more predictable contexts, and conversely lengthened in less predictable contexts. This pervasive predictability effect is well studied in multiple languages and linguistic levels. However, syllable level predictability has been generally overlooked so far. This thesis aims to fill that gap. It focuses on the effect of information-theoretic factors at both the syllable and segmental levels. Furthermore, it found that the predictability effect is not uniform across all durational cues but is somewhat sensitive to the phonological relevance of a language-specific phonetic cue. Speakers adapt not only to their message but also to the channel of transfer. For example, it is known that speakers modulate the characteristics of their speech and produce clear speech in response to background noise – syllables in noise have a longer duration, with higher average intensity, larger intensity range, and higher F0. Hence, speakers choose redundant multi-dimensional acoustic modifications to make their voices more salient and detectable in a noisy environment. This Ph.D. thesis provides new insights into speakers’ adaptation to noise and predictability on the acoustic realizations of syllables in German; showing that the speakers’ response to background noise is independent of syllable predictability. Regarding speaker-to-listener adaptations, this thesis finds that speech variability is not necessarily a function of the interaction’s duration. 
Instead, speakers constantly position themselves concerning the ongoing social interaction. Indeed, speakers’ cooperation during the discussion would lead to a higher convergence behavior. Moreover, interpersonal power dynamics between interlocutors were found to serve as a predictor for accommodation behavior. This adaptation holds for both human-human interaction and human-robot interaction. In an ecological validity study, speakers changed their voice depending on whether they were addressing a human or a robot. Those findings align with previous studies on robot-directed speech and confirm that this difference also holds when the conversations are more natural and spontaneous. The results of this thesis provide compelling evidence that speech adaptation is socially motivated and, to some extent, consciously controlled by the speaker. These findings have implications for including environment-based and listener-based formulations in speech production models along with message-based formulations. Furthermore, this thesis aims to advance our understanding of verbal and non-verbal behavior mechanisms for social communication. Finally, it contributes to the broader literature on information-theoretical factors and accommodation effects on speakers’ acoustic realization.},
pubstate = {published},
type = {phdthesis}
}

Project:   C1

Ibrahim, Omnia; Yuen, Ivan; van Os, Marjolein; Andreeva, Bistra; Möbius, Bernd

The combined effects of contextual predictability and noise on the acoustic realisation of German syllables Journal Article

The Journal of the Acoustical Society of America, 152, 2022.

Speakers tend to speak clearly in noisy environments, while they tend to reserve effort by shortening word duration in predictable contexts. It is unclear how these two communicative demands are met. The current study investigates the acoustic realizations of syllables in predictable vs unpredictable contexts across different background noise levels. Thirty-eight German native speakers produced 60 CV syllables in two predictability contexts in three noise conditions (reference = quiet, 0 dB and −10 dB signal-to-noise ratio). Duration, intensity (average and range), F0 (median), and vowel formants of the target syllables were analysed. The presence of noise yielded significantly longer duration, higher average intensity, larger intensity range, and higher F0. Noise levels affected intensity (average and range) and F0. Low predictability syllables exhibited longer duration and larger intensity range. However, no interaction was found between noise and predictability. This suggests that noise-related modifications might be independent of predictability-related changes, with implications for including channel-based and message-based formulations in speech production.

@article{ibrahim_etal_jasa2022,
title = {The combined effects of contextual predictability and noise on the acoustic realisation of German syllables},
author = {Omnia Ibrahim and Ivan Yuen and Marjolein van Os and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://asa.scitation.org/doi/10.1121/10.0013413},
doi = {https://doi.org/10.1121/10.0013413},
year = {2022},
date = {2022-08-10},
journal = {The Journal of the Acoustical Society of America},
volume = {152},
number = {2},
abstract = {Speakers tend to speak clearly in noisy environments, while they tend to reserve effort by shortening word duration in predictable contexts. It is unclear how these two communicative demands are met. The current study investigates the acoustic realizations of syllables in predictable vs unpredictable contexts across different background noise levels. Thirty-eight German native speakers produced 60 CV syllables in two predictability contexts in three noise conditions (reference = quiet, 0 dB and −10 dB signal-to-noise ratio). Duration, intensity (average and range), F0 (median), and vowel formants of the target syllables were analysed. The presence of noise yielded significantly longer duration, higher average intensity, larger intensity range, and higher F0. Noise levels affected intensity (average and range) and F0. Low predictability syllables exhibited longer duration and larger intensity range. However, no interaction was found between noise and predictability. This suggests that noise-related modifications might be independent of predictability-related changes, with implications for including channel-based and message-based formulations in speech production.},
pubstate = {published},
type = {article}
}

Projects:   C1 A4

Yuen, Ivan; Demuth, Katherine; Shattuck-Hufnagel, Stefanie

Planning of prosodic clitics in Australian English Journal Article

Language, Cognition and Neuroscience, Routledge, pp. 1-6, 2022.

The prosodic word (PW) has been proposed as a planning unit in speech production (Levelt et al. [1999. A theory of lexical access in speech production. Behavioral and Brain Sciences, 22, 1–75]), supported by evidence that speech initiation time (RT) is faster for Dutch utterances with fewer PWs due to cliticisation (with the number of lexical words and syllables kept constant) (Wheeldon & Lahiri [1997. Prosodic units in speech production. Journal of Memory and Language, 37(3), 356–381. https://doi.org/10.1006/jmla.1997.2517], W&L). The present study examined prosodic cliticisation (and resulting RT) for a different set of potential clitics (articles, direct-object pronouns), in English, using a different response task (immediate reading aloud). W&L’s result of shorter RTs for fewer PWs was replicated for articles, but not for pronouns, suggesting a difference in cliticisation for these two function word types. However, a post-hoc analysis of the duration of the verb preceding the clitic suggests that both are cliticised. These findings highlight the importance of supplementing production latency measures with phonetic duration measures to understand different stages of language production during utterance planning.

@article{Yuen_of_2022,
title = {Planning of prosodic clitics in Australian English},
author = {Ivan Yuen and Katherine Demuth and Stefanie Shattuck-Hufnagel},
url = {https://www.tandfonline.com/eprint/4K7DVYQIWRKITU3JCACY/full?target=10.1080/23273798.2022.2060517},
doi = {https://doi.org/10.1080/23273798.2022.2060517},
year = {2022},
date = {2022-04-05},
journal = {Language, Cognition and Neuroscience},
pages = {1-6},
publisher = {Routledge},
abstract = {The prosodic word (PW) has been proposed as a planning unit in speech production (Levelt et al. [1999. A theory of lexical access in speech production. Behavioral and Brain Sciences, 22, 1–75]), supported by evidence that speech initiation time (RT) is faster for Dutch utterances with fewer PWs due to cliticisation (with the number of lexical words and syllables kept constant) (Wheeldon & Lahiri [1997. Prosodic units in speech production. Journal of Memory and Language, 37(3), 356–381. https://doi.org/10.1006/jmla.1997.2517], W&L). The present study examined prosodic cliticisation (and resulting RT) for a different set of potential clitics (articles, direct-object pronouns), in English, using a different response task (immediate reading aloud). W&L’s result of shorter RTs for fewer PWs was replicated for articles, but not for pronouns, suggesting a difference in cliticisation for these two function word types. However, a post-hoc analysis of the duration of the verb preceding the clitic suggests that both are cliticised. These findings highlight the importance of supplementing production latency measures with phonetic duration measures to understand different stages of language production during utterance planning.},
pubstate = {published},
type = {article}
}

Project:   C1

Gessinger, Iona; Cohn, Michelle; Zellou, Georgia; Möbius, Bernd

Cross-cultural comparison of gradient emotion perception: Human vs. Alexa TTS voices Inproceedings

Proceedings of Interspeech 2022, pp. 4970-4974, 2022.

This study compares how American (US) and German (DE) listeners perceive emotional expressiveness from Amazon Alexa text-to-speech (TTS) and human voices. Participants heard identical stimuli, manipulated from an emotionally ‘neutral’ production to three levels of increased happiness generated by resynthesis. Results show that, for both groups, ‘happiness’ manipulations lead to higher ratings of emotional valence (i.e., more positive) for the human voice. Moreover, there was a difference across the groups in their perception of arousal (i.e., excitement): US listeners show higher ratings for human voices with manipulations, while DE listeners perceive the Alexa voice as sounding less ‘excited’ overall. We discuss these findings in terms of theories of cross-cultural emotion perception and human-computer interaction.

@inproceedings{Gessinger/etal:2022a,
title = {Cross-cultural comparison of gradient emotion perception: Human vs. Alexa TTS voices},
author = {Iona Gessinger and Michelle Cohn and Georgia Zellou and Bernd M{\"o}bius},
url = {https://www.isca-speech.org/archive/interspeech_2022/gessinger22_interspeech.html},
doi = {10.21437/Interspeech.2022-146},
year = {2022},
date = {2022},
booktitle = {Proceedings of Interspeech 2022},
pages = {4970-4974},
abstract = {This study compares how American (US) and German (DE) listeners perceive emotional expressiveness from Amazon Alexa text-to-speech (TTS) and human voices. Participants heard identical stimuli, manipulated from an emotionally ‘neutral’ production to three levels of increased happiness generated by resynthesis. Results show that, for both groups, ‘happiness’ manipulations lead to higher ratings of emotional valence (i.e., more positive) for the human voice. Moreover, there was a difference across the groups in their perception of arousal (i.e., excitement): US listeners show higher ratings for human voices with manipulations, while DE listeners perceive the Alexa voice as sounding less ‘excited’ overall. We discuss these findings in terms of theories of cross-cultural emotion perception and human-computer interaction.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Pardo, Jennifer; Pellegrino, Elisa; Dellwo, Volker; Möbius, Bernd

Special issue: Vocal accommodation in speech communication Journal Article

Journal of Phonetics, 95, paper 101196, pp. 1-9, 2022.

This introductory article for the Special Issue on Vocal Accommodation in Speech Communication provides an overview of prevailing theories of vocal accommodation and summarizes the ten papers in the collection. Communication Accommodation Theory focusses on social factors evoking accent convergence or divergence, while the Interactive Alignment Model proposes cognitive integration of perception and production as an automatic priming mechanism driving convergence language production. Recent research including most of the papers in this Special Issue indicates that a hybrid or interactive synergy model provides a more comprehensive account of observed patterns of phonetic convergence than purely automatic mechanisms. Some of the fundamental questions that this special collection aimed to cover concerned (1) the nature of vocal accommodation in terms of underlying mechanisms and social functions in human–human and human–computer interaction; (2) the effect of task-specific and talker-specific characteristics (gender, age, personality, linguistic and cultural background, role in interaction) on degree and direction of convergence towards human and computer interlocutors; (3) integration of articulatory, perceptual, neurocognitive, and/or multimodal data to the analysis of acoustic accommodation in interactive and non-interactive speech tasks; and (4) the contribution of short/long-term accommodation in human–human and human–computer interactions to the diffusion of linguistic innovation and ultimately language variation and change.

@article{Pardo_etal22,
title = {Special issue: Vocal accommodation in speech communication},
author = {Jennifer Pardo and Elisa Pellegrino and Volker Dellwo and Bernd M{\"o}bius},
url = {https://www.coli.uni-saarland.de/~moebius/documents/pardo_etal_jphon-si2022.pdf},
year = {2022},
date = {2022},
journal = {Journal of Phonetics},
pages = {1-9},
volume = {95},
note = {paper 101196},
abstract = {This introductory article for the Special Issue on Vocal Accommodation in Speech Communication provides an overview of prevailing theories of vocal accommodation and summarizes the ten papers in the collection. Communication Accommodation Theory focusses on social factors evoking accent convergence or divergence, while the Interactive Alignment Model proposes cognitive integration of perception and production as an automatic priming mechanism driving convergence language production. Recent research including most of the papers in this Special Issue indicates that a hybrid or interactive synergy model provides a more comprehensive account of observed patterns of phonetic convergence than purely automatic mechanisms. Some of the fundamental questions that this special collection aimed to cover concerned (1) the nature of vocal accommodation in terms of underlying mechanisms and social functions in human–human and human–computer interaction; (2) the effect of task-specific and talker-specific characteristics (gender, age, personality, linguistic and cultural background, role in interaction) on degree and direction of convergence towards human and computer interlocutors; (3) integration of articulatory, perceptual, neurocognitive, and/or multimodal data to the analysis of acoustic accommodation in interactive and non-interactive speech tasks; and (4) the contribution of short/long-term accommodation in human–human and human–computer interactions to the diffusion of linguistic innovation and ultimately language variation and change.},
pubstate = {published},
type = {article}
}

Project:   C1

Andreeva, Bistra; Dimitrova, Snezhina

The influence of L1 prosody on Bulgarian-accented German and English Inproceedings

Proc. Speech Prosody 2022, pp. 764-768, Lisbon, 2022.

The present study investigates L2 prosodic realizations in the readings of two groups of Bulgarian informants: (a) with L2 German, and (b) with L2 English. Each group consisted of ten female learners, who read the fable “The North Wind and the Sun” in their L1 and in the respective L2. We also recorded two groups of female native speakers of the target languages as controls. The following durational parameters were obtained: mean accented syllable duration, accented/unaccented duration ratio, speaking rate. With respect to F0 parameters, mean, median, minimum, maximum, span in semitones, and standard deviations per IP were measured. Additionally, we calculated the number of accented and unaccented syllables, IPs and pauses in each reading. Statistical analyses show that the two groups differ in their use of F0. Both groups use higher standard deviation and level in their L2, whereas the ‘German group’ use higher pitch span as well. The number of accented syllables, IPs and pauses is also higher in L2. Regarding duration, both groups use slower articulation rate. The accented/unaccented syllable duration ratio is lower in L2 for the ‘English group’. We also provide original data on speaking rate in Bulgarian from an information theoretical perspective.

@inproceedings{andreeva_2022_speechprosody,
title = {The influence of L1 prosody on Bulgarian-accented German and English},
author = {Bistra Andreeva and Snezhina Dimitrova},
url = {https://www.isca-speech.org/archive/speechprosody_2022/andreeva22_speechprosody.html},
doi = {10.21437/SpeechProsody.2022-155},
year = {2022},
date = {2022},
booktitle = {Proc. Speech Prosody 2022},
pages = {764-768},
address = {Lisbon},
abstract = {The present study investigates L2 prosodic realizations in the readings of two groups of Bulgarian informants: (a) with L2 German, and (b) with L2 English. Each group consisted of ten female learners, who read the fable “The North Wind and the Sun” in their L1 and in the respective L2. We also recorded two groups of female native speakers of the target languages as controls. The following durational parameters were obtained: mean accented syllable duration, accented/unaccented duration ratio, speaking rate. With respect to F0 parameters, mean, median, minimum, maximum, span in semitones, and standard deviations per IP were measured. Additionally, we calculated the number of accented and unaccented syllables, IPs and pauses in each reading. Statistical analyses show that the two groups differ in their use of F0. Both groups use higher standard deviation and level in their L2, whereas the ‘German group’ use higher pitch span as well. The number of accented syllables, IPs and pauses is also higher in L2. Regarding duration, both groups use slower articulation rate. The accented/unaccented syllable duration ratio is lower in L2 for the ‘English group’. We also provide original data on speaking rate in Bulgarian from an information theoretical perspective.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Ibrahim, Omnia; Yuen, Ivan; Andreeva, Bistra; Möbius, Bernd

The effect of predictability on German stop voicing is phonologically selective Inproceedings

Proc. Speech Prosody 2022, pp. 669-673, Lisbon, 2022.

Cross-linguistic evidence suggests that syllables in predictable contexts have shorter duration than in unpredictable contexts. However, it is not clear if predictability uniformly affects phonetic cues of a phonological feature in a segment. The current study explored the effect of syllable-based predictability on the durational correlates of the phonological stop voicing contrast in German, viz. voice onset time (VOT) and closure duration (CD), using data in Ibrahim et al. [1]. The target stop consonants /b, p, d, k/ occurred in stressed CV syllables in polysyllabic words embedded in a sentence, with either voiced or voiceless preceding contexts. The syllable occurred in either a low or a high predictable condition, which was based on a syllable-level trigram language model. We measured VOT and CD of the target consonants (voiced vs. voiceless). Our results showed an interaction effect of predictability and the voicing status of the target consonants on VOT, but a uniform effect on closure duration. This interaction effect on a primary cue like VOT indicates a selective effect of predictability on VOT, but not on CD. This suggests that the effect of predictability is sensitive to the phonological relevance of a language-specific phonetic cue.

@inproceedings{ibrahim_2022_speechprosody,
title = {The effect of predictability on German stop voicing is phonologically selective},
author = {Omnia Ibrahim and Ivan Yuen and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://www.isca-speech.org/archive/pdfs/speechprosody_2022/ibrahim22_speechprosody.pdf},
doi = {10.21437/SpeechProsody.2022-136},
year = {2022},
date = {2022},
booktitle = {Proc. Speech Prosody 2022},
pages = {669-673},
address = {Lisbon},
abstract = {Cross-linguistic evidence suggests that syllables in predictable contexts have shorter duration than in unpredictable contexts. However, it is not clear if predictability uniformly affects phonetic cues of a phonological feature in a segment. The current study explored the effect of syllable-based predictability on the durational correlates of the phonological stop voicing contrast in German, viz. voice onset time (VOT) and closure duration (CD), using data in Ibrahim et al. [1]. The target stop consonants /b, p, d, k/ occurred in stressed CV syllables in polysyllabic words embedded in a sentence, with either voiced or voiceless preceding contexts. The syllable occurred in either a low or a high predictable condition, which was based on a syllable-level trigram language model. We measured VOT and CD of the target consonants (voiced vs. voiceless). Our results showed an interaction effect of predictability and the voicing status of the target consonants on VOT, but a uniform effect on closure duration. This interaction effect on a primary cue like VOT indicates a selective effect of predictability on VOT, but not on CD. This suggests that the effect of predictability is sensitive to the phonological relevance of a language-specific phonetic cue.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Żygis, Marzena; Beňuš, Štefan; Andreeva, Bistra

Intonation: Other pragmatic functions and phonetic / phonological effects Book Chapter

Bermel, Neil; Fellerer, Jan (Ed.): The Oxford Guide to the Slavonic Languages, Oxford University Press, 2022.

@inbook{Zygis2021intonation,
title = {Intonation: Other pragmatic functions and phonetic / phonological effects},
author = {Marzena Żygis and Štefan Beňuš and Bistra Andreeva},
editor = {Neil Bermel and Jan Fellerer},
year = {2022},
date = {2022},
booktitle = {The Oxford Guide to the Slavonic Languages},
publisher = {Oxford University Press},
pubstate = {published},
type = {inbook}
}

Project:   C1
