Publications

Ibrahim, Omnia; Yuen, Ivan; Xue, Wei; Andreeva, Bistra; Möbius, Bernd

Listener-oriented consequences of predictability-based acoustic adjustment Inproceedings

Baumann, Timo (Ed.): Elektronische Sprachsignalverarbeitung 2024, Tagungsband der 35. Konferenz (Regensburg), TUD Press, pp. 196-202, 2024, ISBN 978-3-95908-325-6.

This paper investigated whether predictability-based adjustments in production have listener-oriented consequences in perception. By manipulating the acoustic features of a target syllable in different predictability contexts in German, we tested 40 listeners’ perceptual preference for the manipulation. Four source words underwent acoustic modifications on the target syllable. Our results revealed a general preference for the original (unmodified) version over the modified one. However, listeners generally favored the unmodified version more when the source word occurred in a more predictable context compared to a less predictable one. The results showed that predictability-based adjustments have perceptual consequences and that listeners have predictability-based expectations in perception.

@inproceedings{Ibrahim_etal_2024,
title = {Listener-oriented consequences of predictability-based acoustic adjustment},
author = {Omnia Ibrahim and Ivan Yuen and Wei Xue and Bistra Andreeva and Bernd M{\"o}bius},
editor = {Timo Baumann},
url = {https://opus4.kobv.de/opus4-oth-regensburg/frontdoor/index/index/docId/7098},
doi = {https://doi.org/10.35096/othr/pub-7098},
year = {2024},
date = {2024},
booktitle = {Elektronische Sprachsignalverarbeitung 2024, Tagungsband der 35. Konferenz (Regensburg)},
isbn = {978-3-95908-325-6},
pages = {196-202},
publisher = {TUD Press},
abstract = {This paper investigated whether predictability-based adjustments in production have listener-oriented consequences in perception. By manipulating the acoustic features of a target syllable in different predictability contexts in German, we tested 40 listeners’ perceptual preference for the manipulation. Four source words underwent acoustic modifications on the target syllable. Our results revealed a general preference for the original (unmodified) version over the modified one. However, listeners generally favored the unmodified version more when the source word occurred in a more predictable context compared to a less predictable one. The results showed that predictability-based adjustments have perceptual consequences and that listeners have predictability-based expectations in perception.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Elmers, Mikey

Evaluating pause particles and their functions in natural and synthesized speech in laboratory and lecture settings PhD Thesis

Saarland University, Saarbruecken, Germany, 2023.

Pause-internal phonetic particles (PINTs) comprise a variety of phenomena, including phonetic-acoustic silence, inhalation and exhalation breath noises, the filler particles “uh” and “um” in English, tongue clicks, and many others. These particles are omnipresent in spontaneous speech; however, they are under-researched in both natural and synthetic speech. The present work explores the influence of PINTs in small-context recall experiments, develops a bespoke speech synthesis system that incorporates the PINTs pattern of a single speaker, and evaluates the influence of PINTs on recall for larger material lengths, namely university lectures. The benefit of PINTs on recall has been documented in natural speech in small-context laboratory settings; however, this area of research has been under-explored for synthetic speech. We devised two experiments to evaluate whether PINTs have the same recall benefit for synthetic material that is found with natural material. In the first experiment, we evaluated the recollection of consecutive missing digits for a randomized 7-digit number. Results indicated that an inserted silence improved recall accuracy for the digits immediately following. In the second experiment, we evaluated sentence recollection. Results indicated that sentences preceded by an inhalation breath noise were better recalled than those with no inhalation. Together, these results reveal that in single-sentence laboratory settings PINTs can improve recall for synthesized speech. The speech synthesis systems used in the small-context recall experiments did not provide much freedom in terms of controlling PINT type or location. Therefore, we endeavoured to develop bespoke speech synthesis systems. Two neural text-to-speech (TTS) systems were created: one that used PINTs annotation labels in the training data, and another that did not include any PINTs labeling in the training material. The first system allowed fine-tuned control for inserting PINTs material into the rendered speech. The second system produced PINTs probabilistically. To the best of our knowledge, these are the first TTS systems to render tongue clicks. Equipped with greater control of synthesized PINTs, we returned to evaluating the recall benefit of PINTs. This time we evaluated the influence of PINTs on the recollection of key information in lectures, an ecologically valid task that focused on larger material lengths. Results indicated that key information that followed PINTs material was less likely to be recalled. We were unable to replicate the benefits of PINTs found in the small-context laboratory settings. This body of work showcases that PINTs improve recall for TTS in small-context environments, just as previous work had indicated for natural speech. Additionally, we’ve provided a technological contribution via a neural TTS system that exerts finer control over PINT type and placement. Lastly, we’ve shown the importance of using material rendered by speech synthesis systems in perceptual studies.

@phdthesis{Elmers_Diss_2023,
title = {Evaluating pause particles and their functions in natural and synthesized speech in laboratory and lecture settings},
author = {Mikey Elmers},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/36999},
doi = {https://doi.org/10.22028/D291-41118},
year = {2023},
date = {2023},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {Pause-internal phonetic particles (PINTs) comprise a variety of phenomena, including phonetic-acoustic silence, inhalation and exhalation breath noises, the filler particles “uh” and “um” in English, tongue clicks, and many others. These particles are omnipresent in spontaneous speech; however, they are under-researched in both natural and synthetic speech. The present work explores the influence of PINTs in small-context recall experiments, develops a bespoke speech synthesis system that incorporates the PINTs pattern of a single speaker, and evaluates the influence of PINTs on recall for larger material lengths, namely university lectures. The benefit of PINTs on recall has been documented in natural speech in small-context laboratory settings; however, this area of research has been under-explored for synthetic speech. We devised two experiments to evaluate whether PINTs have the same recall benefit for synthetic material that is found with natural material. In the first experiment, we evaluated the recollection of consecutive missing digits for a randomized 7-digit number. Results indicated that an inserted silence improved recall accuracy for the digits immediately following. In the second experiment, we evaluated sentence recollection. Results indicated that sentences preceded by an inhalation breath noise were better recalled than those with no inhalation. Together, these results reveal that in single-sentence laboratory settings PINTs can improve recall for synthesized speech. The speech synthesis systems used in the small-context recall experiments did not provide much freedom in terms of controlling PINT type or location. Therefore, we endeavoured to develop bespoke speech synthesis systems. Two neural text-to-speech (TTS) systems were created: one that used PINTs annotation labels in the training data, and another that did not include any PINTs labeling in the training material. The first system allowed fine-tuned control for inserting PINTs material into the rendered speech. The second system produced PINTs probabilistically. To the best of our knowledge, these are the first TTS systems to render tongue clicks. Equipped with greater control of synthesized PINTs, we returned to evaluating the recall benefit of PINTs. This time we evaluated the influence of PINTs on the recollection of key information in lectures, an ecologically valid task that focused on larger material lengths. Results indicated that key information that followed PINTs material was less likely to be recalled. We were unable to replicate the benefits of PINTs found in the small-context laboratory settings. This body of work showcases that PINTs improve recall for TTS in small-context environments, just as previous work had indicated for natural speech. Additionally, we’ve provided a technological contribution via a neural TTS system that exerts finer control over PINT type and placement. Lastly, we’ve shown the importance of using material rendered by speech synthesis systems in perceptual studies.},
pubstate = {published},
type = {phdthesis}
}

Project:   C1

Werner, Raphael

The phonetics of speech breathing: pauses, physiology, acoustics, and perception PhD Thesis

Saarland University, Saarbruecken, Germany, 2023.

Speech is made up of a continuous stream of speech sounds that is interrupted by pauses and breathing. As phoneticians are primarily interested in describing the segments of the speech stream, pauses and breathing are often neglected in phonetic studies, even though they are vital for speech. The present work adds to a more detailed view of both pausing and speech breathing, with a special focus on the latter and the resulting breath noises, investigating their acoustic, physiological, and perceptual aspects. We present an overview of how a selection of corpora annotate pauses and pause-internal particles, as well as a recording setup that can be used for further studies on speech breathing. For pauses, this work emphasized their optionality and variability under different tempos, as well as the temporal composition of silence and breath noise in breath pauses. For breath noises, we first focused on acoustic and physiological characteristics: we explored the alignment of the onsets and offsets of audible breath noises with the start and end of expansion of both rib cage and abdomen. Further, we found similarities between speech breath noises and aspiration phases of /k/, as well as evidence that breath noises may be produced with a more open and slightly more front place of articulation than realizations of schwa. We found positive correlations between acoustic and physiological parameters, suggesting that when speakers inhaled faster, the resulting breath noises were more intense and produced more anteriorly in the mouth. Inspecting the entire spectrum of speech breath noises, we showed relatively flat spectra with several weak peaks. These peaks largely overlapped with resonances reported for inhalations produced with a central vocal tract configuration. We used 3D-printed vocal tract models representing four vowels and four fricatives to simulate in- and exhalations by reversing airflow direction. We found that airflow direction did not have a general effect across all models, but only affected those with high-tongue configurations, as opposed to those that were more open. Then, we compared inhalations produced with the schwa model to human inhalations in an attempt to approximate the vocal tract configuration in speech breathing. There were some similarities; however, several complexities of human speech breathing not captured in the models complicated comparisons. In two perception studies, we investigated how much information listeners could auditorily extract from breath noises. First, we tested categorizing different breath noises into six different types, based on airflow direction and airway usage, e.g., oral inhalation. Around two thirds of all answers were correct. Second, we investigated how well breath noises could be used to discriminate between speakers and to extract coarse information on speaker characteristics, such as age (old/young) and sex (female/male). We found that listeners were able to distinguish between two breath noises coming from the same or different speakers in around two thirds of all cases. Hearing one breath noise, classification of sex was successful in around 64% of cases, while for age it was at 50%, suggesting that sex was more perceivable than age in breath noises.

@phdthesis{Werner_Diss_2023,
title = {The phonetics of speech breathing: pauses, physiology, acoustics, and perception},
author = {Raphael Werner},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/36987},
doi = {https://doi.org/10.22028/D291-41147},
year = {2023},
date = {2023},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {Speech is made up of a continuous stream of speech sounds that is interrupted by pauses and breathing. As phoneticians are primarily interested in describing the segments of the speech stream, pauses and breathing are often neglected in phonetic studies, even though they are vital for speech. The present work adds to a more detailed view of both pausing and speech breathing, with a special focus on the latter and the resulting breath noises, investigating their acoustic, physiological, and perceptual aspects. We present an overview of how a selection of corpora annotate pauses and pause-internal particles, as well as a recording setup that can be used for further studies on speech breathing. For pauses, this work emphasized their optionality and variability under different tempos, as well as the temporal composition of silence and breath noise in breath pauses. For breath noises, we first focused on acoustic and physiological characteristics: we explored the alignment of the onsets and offsets of audible breath noises with the start and end of expansion of both rib cage and abdomen. Further, we found similarities between speech breath noises and aspiration phases of /k/, as well as evidence that breath noises may be produced with a more open and slightly more front place of articulation than realizations of schwa. We found positive correlations between acoustic and physiological parameters, suggesting that when speakers inhaled faster, the resulting breath noises were more intense and produced more anteriorly in the mouth. Inspecting the entire spectrum of speech breath noises, we showed relatively flat spectra with several weak peaks. These peaks largely overlapped with resonances reported for inhalations produced with a central vocal tract configuration. We used 3D-printed vocal tract models representing four vowels and four fricatives to simulate in- and exhalations by reversing airflow direction. We found that airflow direction did not have a general effect across all models, but only affected those with high-tongue configurations, as opposed to those that were more open. Then, we compared inhalations produced with the schwa model to human inhalations in an attempt to approximate the vocal tract configuration in speech breathing. There were some similarities; however, several complexities of human speech breathing not captured in the models complicated comparisons. In two perception studies, we investigated how much information listeners could auditorily extract from breath noises. First, we tested categorizing different breath noises into six different types, based on airflow direction and airway usage, e.g., oral inhalation. Around two thirds of all answers were correct. Second, we investigated how well breath noises could be used to discriminate between speakers and to extract coarse information on speaker characteristics, such as age (old/young) and sex (female/male). We found that listeners were able to distinguish between two breath noises coming from the same or different speakers in around two thirds of all cases. Hearing one breath noise, classification of sex was successful in around 64% of cases, while for age it was at 50%, suggesting that sex was more perceivable than age in breath noises.},
pubstate = {published},
type = {phdthesis}
}

Project:   C1

Gessinger, Iona; Cohn, Michelle; Cowan, Benjamin R.; Zellou, Georgia; Möbius, Bernd

Cross-linguistic emotion perception in human and TTS voices Inproceedings

Proceedings of Interspeech 2023, pp. 5222-5226, Dublin, Ireland, 2023.

This study investigates how German listeners perceive changes in the emotional expression of German and American English human voices and Amazon Alexa text-to-speech (TTS) voices, respectively. Participants rated sentences containing emotionally neutral lexico-semantic information that were resynthesized to vary in prosodic emotional expressiveness. Starting from an emotionally neutral production, three levels of increasing ‘happiness’ were created. Results show that ‘happiness’ manipulations lead to higher ratings of emotional valence (i.e., more positive) and arousal (i.e., more excited) for German and English voices, with stronger effects for the German voices. In particular, changes in valence were perceived more prominently in German TTS compared to English TTS. Additionally, both TTS voices were rated lower than the respective human voices on scales that reflect anthropomorphism (e.g., human-likeness). We discuss these findings in the context of cross-linguistic emotion accounts.

@inproceedings{Gessinger/etal:2023,
title = {Cross-linguistic emotion perception in human and TTS voices},
author = {Iona Gessinger and Michelle Cohn and Benjamin R. Cowan and Georgia Zellou and Bernd M{\"o}bius},
url = {https://www.isca-speech.org/archive/interspeech_2023/gessinger23_interspeech.html},
doi = {https://doi.org/10.21437/Interspeech.2023-711},
year = {2023},
date = {2023},
booktitle = {Proceedings of Interspeech 2023},
pages = {5222-5226},
address = {Dublin, Ireland},
abstract = {This study investigates how German listeners perceive changes in the emotional expression of German and American English human voices and Amazon Alexa text-to-speech (TTS) voices, respectively. Participants rated sentences containing emotionally neutral lexico-semantic information that were resynthesized to vary in prosodic emotional expressiveness. Starting from an emotionally neutral production, three levels of increasing 'happiness' were created. Results show that 'happiness' manipulations lead to higher ratings of emotional valence (i.e., more positive) and arousal (i.e., more excited) for German and English voices, with stronger effects for the German voices. In particular, changes in valence were perceived more prominently in German TTS compared to English TTS. Additionally, both TTS voices were rated lower than the respective human voices on scales that reflect anthropomorphism (e.g., human-likeness). We discuss these findings in the context of cross-linguistic emotion accounts.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Yuen, Ivan; Ibrahim, Omnia; Andreeva, Bistra; Möbius, Bernd

Non-uniform cue-trading: differential effects of surprisal on pause usage and pause duration in German Inproceedings

Proceedings of the 20th International Congress of Phonetic Sciences, ICPhS 2023 (Prague, Czech Rep.), pp. 619-623, 2023.

Pause occurrence is conditional on contextual (un)predictability (in terms of surprisal) [10, 11], and so is the acoustic implementation of duration at multiple linguistic levels. Although these cues (i.e., pause usage/pause duration and syllable duration) are subject to the influence of the same factor, it is not clear how they are related to one another. A recent study in [1] using pause duration to define prosodic boundary strength reported a more pronounced surprisal effect on syllable duration, hinting at a trading relationship. The current study aimed to directly test for trading relationships among pause usage, pause duration and syllable duration in different surprisal contexts, analysing German radio news in the DIRNDL corpus. No trading relationship was observed between pause usage and surprisal, or between pause usage and syllable duration. However, a trading relationship was found between the durations of a pause and a syllable for accented items.

@inproceedings{Yuen/etal:2023a,
title = {Non-uniform cue-trading: differential effects of surprisal on pause usage and pause duration in German},
author = {Ivan Yuen and Omnia Ibrahim and Bistra Andreeva and Bernd M{\"o}bius},
year = {2023},
date = {2023},
booktitle = {Proceedings of the 20th International Congress of Phonetic Sciences, ICPhS 2023 (Prague, Czech Rep.)},
pages = {619-623},
abstract = {Pause occurrence is conditional on contextual (un)predictability (in terms of surprisal) [10, 11], and so is the acoustic implementation of duration at multiple linguistic levels. Although these cues (i.e., pause usage/pause duration and syllable duration) are subject to the influence of the same factor, it is not clear how they are related to one another. A recent study in [1] using pause duration to define prosodic boundary strength reported a more pronounced surprisal effect on syllable duration, hinting at a trading relationship. The current study aimed to directly test for trading relationships among pause usage, pause duration and syllable duration in different surprisal contexts, analysing German radio news in the DIRNDL corpus. No trading relationship was observed between pause usage and surprisal, or between pause usage and syllable duration. However, a trading relationship was found between the durations of a pause and a syllable for accented items.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Ibrahim, Omnia; Yuen, Ivan; Andreeva, Bistra; Möbius, Bernd

The interplay between syllable-based predictability and voicing during closure in intersonorant German stops Inproceedings

Phonetics and Phonology in Europe 2023 (PaPE 2023), Nijmegen, the Netherlands, 2023.
Contextual predictability has pervasive effects on the acoustic realization of speech. Generally, duration is shortened in more predictable contexts and conversely lengthened in less predictable contexts. There are several measures to quantify predictability in a message. One of them is surprisal, which is calculated as S(Unit_i) = -log2 P(Unit_i | Context). In recent work, Ibrahim et al. found that the effect of syllable-based surprisal on the temporal dimension(s) of a syllable selectively extends to the segmental level, for example, consonant voicing in German. Closure duration was uniformly longer for both voiceless and voiced consonants, but voice onset time was not. The voice onset time pattern might be related to German being typically considered an 'aspirating' language, using [+spread glottis] for voiceless consonants and [-spread glottis] for their voiced counterparts. However, voicing has also been reported in an intervocalic context for both voiceless and voiced consonants to varying extents. To further test whether the previously reported surprisal-based effect on voice onset time is driven by the phonological feature [spread glottis], the current study re-examined the downstream effect of syllable-based predictability on segmental voicing in German stops by measuring the degree of residual (phonetic) voicing during stop closure in an inter-sonorant context. Method: Data were based on a subset of stimuli (speech produced in a quiet acoustic condition) from Ibrahim et al. Thirty-eight German speakers recorded 60 sentences. Each sentence contained a target stressed CV syllable in a polysyllabic word. Each target syllable began with one of the stops /p, k, b, d/, combined with one of the vowels /a:, e:, i:, o:, u:/. The analyzed data contained voiceless vs. voiced initial stops in a low or high surprisal syllable. Closure duration (CD) and voicing during closure (VDC) were extracted using in-house Python and Praat scripts. A ratio measure VDC/CD was used to factor out any potential covariation between VDC and CD. Linear mixed-effects modeling was used to evaluate the effect(s) of surprisal and target stop voicing status on the VDC/CD ratio using the lmer package in R. The final model was: VDC/CD ratio ∼ Surprisal + Target stop voicing status + (1 | Speaker) + (1 | Syllable) + (1 | PrevManner) + (1 | Sentence). Results: In an inter-sonorant context, we found a smaller VDC/CD ratio in voiceless stops than in voiced ones (p=2.04e-08***). As expected, residual voicing is shorter during a voiceless closure than during a voiced closure. This is consistent with the idea of preserving a phonological voicing distinction, as well as the physiological constraint on sustaining voicing for a long period during the closure of a voiceless stop. Moreover, the results yielded a significant effect of surprisal on the VDC/CD ratio (p=.017*), with no interaction between the two factors (voicing and surprisal). The VDC/CD ratio is larger in a low than in a high surprisal syllable, irrespective of the voicing status of the target stops. That is, the syllable-based surprisal effect percolated down to German voicing, and the effect is uniform for voiceless and voiced stops when residual voicing is measured. Such a uniform effect on residual voicing is consistent with the previous result on closure duration.
These findings reveal that the syllable-based surprisal effect can spread downstream to the segmental level, and that the effect is uniform for acoustic cues that are not directly tied to a phonological feature in German voicing (i.e., [spread glottis]).
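
A note for readers on the surprisal measure above: S(Unit_i) = -log2 P(Unit_i | Context) is the standard information-theoretic definition, with the context here being the syllables preceding the target in a trigram model. The following minimal Python sketch illustrates the computation; the toy corpus, the add-one smoothing, and the function names are illustrative assumptions, not the authors' actual scripts (the paper states only that in-house Python and Praat scripts were used).

import math
from collections import defaultdict

def train_trigram_counts(syllable_sequences):
    """Count trigrams and their bigram contexts over syllabified utterances."""
    trigrams, bigrams = defaultdict(int), defaultdict(int)
    for syllables in syllable_sequences:
        padded = ["<s>", "<s>"] + syllables + ["</s>"]
        for i in range(2, len(padded)):
            trigrams[(padded[i - 2], padded[i - 1], padded[i])] += 1
            bigrams[(padded[i - 2], padded[i - 1])] += 1
    return trigrams, bigrams

def surprisal(syllable, context, trigrams, bigrams):
    """S(Unit_i) = -log2 P(Unit_i | Context), with add-one smoothing."""
    vocab = {w for (_, _, w) in trigrams}
    count_tri = trigrams[(context[0], context[1], syllable)] + 1
    count_ctx = bigrams[(context[0], context[1])] + len(vocab)
    return -math.log2(count_tri / count_ctx)

# Toy usage: a frequent continuation receives lower surprisal than a rare one.
corpus = [["der", "gar", "ten"], ["der", "gar", "ten"], ["der", "bo", "den"]]
tri, bi = train_trigram_counts(corpus)
print(surprisal("ten", ("der", "gar"), tri, bi))  # ~1.42 bits (low surprisal)
print(surprisal("bo", ("der", "gar"), tri, bi))   # 3.00 bits (higher surprisal)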

@inproceedings{Ibrahim_etal_PaPE2023,
title = {The interplay between syllable-based predictability and voicing during closure in intersonorant German stops},
author = {Omnia Ibrahim and Ivan Yuen and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://www.researchgate.net/publication/371138687_The_interplay_between_syllable-based_predictability_and_voicing_during_closure_in_intersonorant_German_stops},
year = {2023},
date = {2023},
booktitle = {Phonetics and Phonology in Europe 2023 (PaPE 2023)},
address = {Nijmegen, the Netherlands},
abstract = {

Contextual predictability has pervasive effects on the acoustic realization of speech. Generally, duration is shortened in more predictable contexts and conversely lengthened in less predictable contexts. There are several measures to quantify predictability in a message. One of them is surprisal, which is calculated as S(Unit_i) = -log2 P(Unit_i | Context). In recent work, Ibrahim et al. found that the effect of syllable-based surprisal on the temporal dimension(s) of a syllable selectively extends to the segmental level, for example, consonant voicing in German. Closure duration was uniformly longer for both voiceless and voiced consonants, but voice onset time was not. The voice onset time pattern might be related to German being typically considered an 'aspirating' language, using [+spread glottis] for voiceless consonants and [-spread glottis] for their voiced counterparts. However, voicing has also been reported in an intervocalic context for both voiceless and voiced consonants to varying extents. To further test whether the previously reported surprisal-based effect on voice onset time is driven by the phonological feature [spread glottis], the current study re-examined the downstream effect of syllable-based predictability on segmental voicing in German stops by measuring the degree of residual (phonetic) voicing during stop closure in an inter-sonorant context. Method: Data were based on a subset of stimuli (speech produced in a quiet acoustic condition) from Ibrahim et al. Thirty-eight German speakers recorded 60 sentences. Each sentence contained a target stressed CV syllable in a polysyllabic word. Each target syllable began with one of the stops /p, k, b, d/, combined with one of the vowels /a:, e:, i:, o:, u:/. The analyzed data contained voiceless vs. voiced initial stops in a low or high surprisal syllable. Closure duration (CD) and voicing during closure (VDC) were extracted using in-house Python and Praat scripts. A ratio measure VDC/CD was used to factor out any potential covariation between VDC and CD. Linear mixed-effects modeling was used to evaluate the effect(s) of surprisal and target stop voicing status on the VDC/CD ratio using the lmer package in R. The final model was: VDC/CD ratio ∼ Surprisal + Target stop voicing status + (1 | Speaker) + (1 | Syllable) + (1 | PrevManner) + (1 | Sentence). Results: In an inter-sonorant context, we found a smaller VDC/CD ratio in voiceless stops than in voiced ones (p=2.04e-08***). As expected, residual voicing is shorter during a voiceless closure than during a voiced closure. This is consistent with the idea of preserving a phonological voicing distinction, as well as the physiological constraint on sustaining voicing for a long period during the closure of a voiceless stop. Moreover, the results yielded a significant effect of surprisal on the VDC/CD ratio (p=.017*), with no interaction between the two factors (voicing and surprisal). The VDC/CD ratio is larger in a low than in a high surprisal syllable, irrespective of the voicing status of the target stops. That is, the syllable-based surprisal effect percolated down to German voicing, and the effect is uniform for voiceless and voiced stops when residual voicing is measured. Such a uniform effect on residual voicing is consistent with the previous result on closure duration.
These findings reveal that the syllable-based surprisal effect can spread downstream to the segmental level, and that the effect is uniform for acoustic cues that are not directly tied to a phonological feature in German voicing (i.e., [spread glottis]).
},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Ibrahim, Omnia

Speaker Adaptations as a Function of Message, Channel and Listener Variability PhD Thesis

University of Zürich, Zürich, Switzerland, 2022.

Speech is a highly dynamic process. Some variability is inherited directly from the language itself, while other variability stems from adapting to the surrounding environment or interlocutor. This Ph.D. thesis consists of seven studies investigating speech adaptation with respect to message, channel, and listener variability. It starts with investigating speakers’ adaptation to the linguistic message. Previous work has shown that duration is shortened in more predictable contexts, and conversely lengthened in less predictable contexts. This pervasive predictability effect is well studied across multiple languages and linguistic levels. However, syllable-level predictability has been generally overlooked so far. This thesis aims to fill that gap. It focuses on the effect of information-theoretic factors at both the syllable and segmental levels. Furthermore, it found that the predictability effect is not uniform across all durational cues but is somewhat sensitive to the phonological relevance of a language-specific phonetic cue.
Speakers adapt not only to their message but also to the channel of transfer. For example, it is known that speakers modulate the characteristics of their speech and produce clear speech in response to background noise – syllables in noise have a longer duration, with higher average intensity, larger intensity range, and higher F0. Hence, speakers choose redundant multi-dimensional acoustic modifications to make their voices more salient and detectable in a noisy environment. This Ph.D. thesis provides new insights into speakers’ adaptation to noise and predictability in the acoustic realizations of syllables in German, showing that the speakers’ response to background noise is independent of syllable predictability.
Regarding speaker-to-listener adaptations, this thesis finds that speech variability is not necessarily a function of the interaction’s duration. Instead, speakers constantly position themselves in relation to the ongoing social interaction. Indeed, speakers’ cooperation during the discussion led to higher convergence behavior. Moreover, interpersonal power dynamics between interlocutors were found to serve as a predictor of accommodation behavior. This adaptation holds for both human-human interaction and human-robot interaction. In an ecologically valid study, speakers changed their voice depending on whether they were addressing a human or a robot. Those findings align with previous studies on robot-directed speech and confirm that this difference also holds when the conversations are more natural and spontaneous.
The results of this thesis provide compelling evidence that speech adaptation is socially motivated and, to some extent, consciously controlled by the speaker. These findings have implications for including environment-based and listener-based formulations in speech production models along with message-based formulations. Furthermore, this thesis aims to advance our understanding of verbal and non-verbal behavior mechanisms for social communication. Finally, it contributes to the broader literature on information-theoretical factors and accommodation effects on speakers’ acoustic realization.

@phdthesis{Ibrahim_Diss_2022,
title = {Speaker Adaptations as a Function of Message, Channel and Listener Variability},
author = {Omnia Ibrahim},
url = {https://www.zora.uzh.ch/id/eprint/233694/},
doi = {https://doi.org/10.5167/uzh-233694},
year = {2022},
date = {2022},
school = {University of Z{\"u}rich},
address = {Z{\"u}rich, Switzerland},
abstract = {Speech is a highly dynamic process. Some variability is inherited directly from the language itself, while other variability stems from adapting to the surrounding environment or interlocutor. This Ph.D. thesis consists of seven studies investigating speech adaptation with respect to message, channel, and listener variability. It starts with investigating speakers’ adaptation to the linguistic message. Previous work has shown that duration is shortened in more predictable contexts, and conversely lengthened in less predictable contexts. This pervasive predictability effect is well studied across multiple languages and linguistic levels. However, syllable-level predictability has been generally overlooked so far. This thesis aims to fill that gap. It focuses on the effect of information-theoretic factors at both the syllable and segmental levels. Furthermore, it found that the predictability effect is not uniform across all durational cues but is somewhat sensitive to the phonological relevance of a language-specific phonetic cue. Speakers adapt not only to their message but also to the channel of transfer. For example, it is known that speakers modulate the characteristics of their speech and produce clear speech in response to background noise – syllables in noise have a longer duration, with higher average intensity, larger intensity range, and higher F0. Hence, speakers choose redundant multi-dimensional acoustic modifications to make their voices more salient and detectable in a noisy environment. This Ph.D. thesis provides new insights into speakers’ adaptation to noise and predictability in the acoustic realizations of syllables in German, showing that the speakers’ response to background noise is independent of syllable predictability. Regarding speaker-to-listener adaptations, this thesis finds that speech variability is not necessarily a function of the interaction’s duration. Instead, speakers constantly position themselves in relation to the ongoing social interaction. Indeed, speakers’ cooperation during the discussion led to higher convergence behavior. Moreover, interpersonal power dynamics between interlocutors were found to serve as a predictor of accommodation behavior. This adaptation holds for both human-human interaction and human-robot interaction. In an ecologically valid study, speakers changed their voice depending on whether they were addressing a human or a robot. Those findings align with previous studies on robot-directed speech and confirm that this difference also holds when the conversations are more natural and spontaneous. The results of this thesis provide compelling evidence that speech adaptation is socially motivated and, to some extent, consciously controlled by the speaker. These findings have implications for including environment-based and listener-based formulations in speech production models along with message-based formulations. Furthermore, this thesis aims to advance our understanding of verbal and non-verbal behavior mechanisms for social communication. Finally, it contributes to the broader literature on information-theoretical factors and accommodation effects on speakers’ acoustic realization.},
pubstate = {published},
type = {phdthesis}
}

Project:   C1

Ibrahim, Omnia; Yuen, Ivan; van Os, Marjolein; Andreeva, Bistra; Möbius, Bernd

The combined effects of contextual predictability and noise on the acoustic realisation of German syllables Journal Article

The Journal of the Acoustical Society of America, 152 (2), 2022.

Speakers tend to speak clearly in noisy environments, while they tend to reserve effort by shortening word duration in predictable contexts. It is unclear how these two communicative demands are met. The current study investigates the acoustic realizations of syllables in predictable vs unpredictable contexts across different background noise levels. Thirty-eight German native speakers produced 60 CV syllables in two predictability contexts in three noise conditions (reference = quiet, 0 dB and −10 dB signal-to-noise ratio). Duration, intensity (average and range), F0 (median), and vowel formants of the target syllables were analysed. The presence of noise yielded significantly longer duration, higher average intensity, larger intensity range, and higher F0. Noise levels affected intensity (average and range) and F0. Low predictability syllables exhibited longer duration and larger intensity range. However, no interaction was found between noise and predictability. This suggests that noise-related modifications might be independent of predictability-related changes, with implications for including channel-based and message-based formulations in speech production.
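
Background on the noise conditions: signal-to-noise ratio (SNR) is the power ratio SNR_dB = 10·log10(P_speech / P_noise), so 0 dB means speech and noise are equally powerful, and −10 dB means the noise is ten times more powerful than the speech. Below is a minimal Python sketch of how a masker can be scaled to a target SNR; it is an illustration under simple assumptions (white-noise masker, equal-length signals), not the stimulus pipeline actually used in the study.

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then mix."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

# Illustrative use with a white-noise masker at the two SNRs from the study.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for one second of speech at 16 kHz
noise = rng.standard_normal(16000)
mixed_0db = mix_at_snr(speech, noise, 0)
mixed_minus10db = mix_at_snr(speech, noise, -10)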

@article{ibrahim_etal_jasa2022,
title = {The combined effects of contextual predictability and noise on the acoustic realisation of German syllables},
author = {Omnia Ibrahim and Ivan Yuen and Marjolein van Os and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://asa.scitation.org/doi/10.1121/10.0013413},
doi = {https://doi.org/10.1121/10.0013413},
year = {2022},
date = {2022-08-10},
journal = {The Journal of the Acoustical Society of America},
volume = {152},
number = {2},
abstract = {Speakers tend to speak clearly in noisy environments, while they tend to reserve effort by shortening word duration in predictable contexts. It is unclear how these two communicative demands are met. The current study investigates the acoustic realizations of syllables in predictable vs unpredictable contexts across different background noise levels. Thirty-eight German native speakers produced 60 CV syllables in two predictability contexts in three noise conditions (reference = quiet, 0 dB and −10 dB signal-to-noise ratio). Duration, intensity (average and range), F0 (median), and vowel formants of the target syllables were analysed. The presence of noise yielded significantly longer duration, higher average intensity, larger intensity range, and higher F0. Noise levels affected intensity (average and range) and F0. Low predictability syllables exhibited longer duration and larger intensity range. However, no interaction was found between noise and predictability. This suggests that noise-related modifications might be independent of predictability-related changes, with implications for including channel-based and message-based formulations in speech production.},
pubstate = {published},
type = {article}
}

Projects:   C1 A4

Yuen, Ivan; Demuth, Katherine; Shattuck-Hufnagel, Stefanie

Planning of prosodic clitics in Australian English Journal Article

Language, Cognition and Neuroscience, Routledge, pp. 1-6, 2022.

The prosodic word (PW) has been proposed as a planning unit in speech production (Levelt et al. [1999. A theory of lexical access in speech production. Behavioral and Brain Sciences, 22, 1–75]), supported by evidence that speech initiation time (RT) is faster for Dutch utterances with fewer PWs due to cliticisation (with the number of lexical words and syllables kept constant) (Wheeldon & Lahiri [1997. Prosodic units in speech production. Journal of Memory and Language, 37(3), 356–381. https://doi.org/10.1006/jmla.1997.2517], W&L). The present study examined prosodic cliticisation (and resulting RT) for a different set of potential clitics (articles, direct-object pronouns), in English, using a different response task (immediate reading aloud). W&L’s result of shorter RTs for fewer PWs was replicated for articles, but not for pronouns, suggesting a difference in cliticisation for these two function word types. However, a post-hoc analysis of the duration of the verb preceding the clitic suggests that both are cliticised. These findings highlight the importance of supplementing production latency measures with phonetic duration measures to understand different stages of language production during utterance planning.

@article{Yuen_of_2022,
title = {Planning of prosodic clitics in Australian English},
author = {Ivan Yuen and Katherine Demuth and Stefanie Shattuck-Hufnagel},
url = {https://www.tandfonline.com/eprint/4K7DVYQIWRKITU3JCACY/full?target=10.1080/23273798.2022.2060517},
doi = {https://doi.org/10.1080/23273798.2022.2060517},
year = {2022},
date = {2022-04-05},
journal = {Language, Cognition and Neuroscience},
pages = {1-6},
publisher = {Routledge},
abstract = {The prosodic word (PW) has been proposed as a planning unit in speech production (Levelt et al. [1999. A theory of lexical access in speech production. Behavioral and Brain Sciences, 22, 1–75]), supported by evidence that speech initiation time (RT) is faster for Dutch utterances with fewer PWs due to cliticisation (with the number of lexical words and syllables kept constant) (Wheeldon & Lahiri [1997. Prosodic units in speech production. Journal of Memory and Language, 37(3), 356–381. https://doi.org/10.1006/jmla.1997.2517], W&L). The present study examined prosodic cliticisation (and resulting RT) for a different set of potential clitics (articles, direct-object pronouns), in English, using a different response task (immediate reading aloud). W&L’s result of shorter RTs for fewer PWs was replicated for articles, but not for pronouns, suggesting a difference in cliticisation for these two function word types. However, a post-hoc analysis of the duration of the verb preceding the clitic suggests that both are cliticised. These findings highlight the importance of supplementing production latency measures with phonetic duration measures to understand different stages of language production during utterance planning.},
pubstate = {published},
type = {article}
}

Project:   C1

Gessinger, Iona; Cohn, Michelle; Zellou, Georgia; Möbius, Bernd

Cross-cultural comparison of gradient emotion perception: Human vs. Alexa TTS voices Inproceedings

Proceedings of Interspeech 2022, pp. 4970-4974, 2022.

This study compares how American (US) and German (DE) listeners perceive emotional expressiveness from Amazon Alexa text-to-speech (TTS) and human voices. Participants heard identical stimuli, manipulated from an emotionally ‘neutral’ production to three levels of increased happiness generated by resynthesis. Results show that, for both groups, ‘happiness’ manipulations lead to higher ratings of emotional valence (i.e., more positive) for the human voice. Moreover, there was a difference across the groups in their perception of arousal (i.e., excitement): US listeners show higher ratings for human voices with manipulations, while DE listeners perceive the Alexa voice as sounding less ‘excited’ overall. We discuss these findings in terms of theories of cross-cultural emotion perception and human-computer interaction.

@inproceedings{Gessinger/etal:2022a,
title = {Cross-cultural comparison of gradient emotion perception: Human vs. Alexa TTS voices},
author = {Iona Gessinger and Michelle Cohn and Georgia Zellou and Bernd M{\"o}bius},
url = {https://www.isca-speech.org/archive/interspeech_2022/gessinger22_interspeech.html},
doi = {https://doi.org/10.21437/Interspeech.2022-146},
year = {2022},
date = {2022},
booktitle = {Proceedings of Interspeech 2022},
pages = {4970-4974},
abstract = {This study compares how American (US) and German (DE) listeners perceive emotional expressiveness from Amazon Alexa text-to-speech (TTS) and human voices. Participants heard identical stimuli, manipulated from an emotionally 'neutral' production to three levels of increased happiness generated by resynthesis. Results show that, for both groups, 'happiness' manipulations lead to higher ratings of emotional valence (i.e., more positive) for the human voice. Moreover, there was a difference across the groups in their perception of arousal (i.e., excitement): US listeners show higher ratings for human voices with manipulations, while DE listeners perceive the Alexa voice as sounding less 'excited' overall. We discuss these findings in terms of theories of cross-cultural emotion perception and human-computer interaction.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Pardo, Jennifer; Pellegrino, Elisa; Dellwo, Volker; Möbius, Bernd

Special issue: Vocal accommodation in speech communication Journal Article

Journal of Phonetics, 95, pp. 1-9, paper 101196, 2022.

This introductory article for the Special Issue on Vocal Accommodation in Speech Communication provides an overview of prevailing theories of vocal accommodation and summarizes the ten papers in the collection. Communication Accommodation Theory focusses on social factors evoking accent convergence or divergence, while the Interactive Alignment Model proposes cognitive integration of perception and production as an automatic priming mechanism driving convergence in language production. Recent research including most of the papers in this Special Issue indicates that a hybrid or interactive synergy model provides a more comprehensive account of observed patterns of phonetic convergence than purely automatic mechanisms. Some of the fundamental questions that this special collection aimed to cover concerned (1) the nature of vocal accommodation in terms of underlying mechanisms and social functions in human–human and human–computer interaction; (2) the effect of task-specific and talker-specific characteristics (gender, age, personality, linguistic and cultural background, role in interaction) on degree and direction of convergence towards human and computer interlocutors; (3) integration of articulatory, perceptual, neurocognitive, and/or multimodal data to the analysis of acoustic accommodation in interactive and non-interactive speech tasks; and (4) the contribution of short/long-term accommodation in human–human and human–computer interactions to the diffusion of linguistic innovation and ultimately language variation and change.

@article{Pardo_etal22,
title = {Special issue: Vocal accommodation in speech communication},
author = {Jennifer Pardo and Elisa Pellegrino and Volker Dellwo and Bernd M{\"o}bius},
url = {https://www.coli.uni-saarland.de/~moebius/documents/pardo_etal_jphon-si2022.pdf},
year = {2022},
date = {2022},
journal = {Journal of Phonetics},
pages = {1-9, paper 101196},
volume = {95},
abstract = {This introductory article for the Special Issue on Vocal Accommodation in Speech Communication provides an overview of prevailing theories of vocal accommodation and summarizes the ten papers in the collection. Communication Accommodation Theory focusses on social factors evoking accent convergence or divergence, while the Interactive Alignment Model proposes cognitive integration of perception and production as an automatic priming mechanism driving convergence in language production. Recent research including most of the papers in this Special Issue indicates that a hybrid or interactive synergy model provides a more comprehensive account of observed patterns of phonetic convergence than purely automatic mechanisms. Some of the fundamental questions that this special collection aimed to cover concerned (1) the nature of vocal accommodation in terms of underlying mechanisms and social functions in human–human and human–computer interaction; (2) the effect of task-specific and talker-specific characteristics (gender, age, personality, linguistic and cultural background, role in interaction) on degree and direction of convergence towards human and computer interlocutors; (3) integration of articulatory, perceptual, neurocognitive, and/or multimodal data to the analysis of acoustic accommodation in interactive and non-interactive speech tasks; and (4) the contribution of short/long-term accommodation in human–human and human–computer interactions to the diffusion of linguistic innovation and ultimately language variation and change.},
pubstate = {published},
type = {article}
}

Project:   C1

Andreeva, Bistra; Dimitrova, Snezhina

The influence of L1 prosody on Bulgarian-accented German and English Inproceedings

Proc. Speech Prosody 2022, pp. 764-768, Lisbon, 2022.

The present study investigates L2 prosodic realizations in the readings of two groups of Bulgarian informants: (a) with L2 German, and (b) with L2 English. Each group consisted of ten female learners, who read the fable “The North Wind and the Sun” in their L1 and in the respective L2. We also recorded two groups of female native speakers of the target languages as controls. The following durational parameters were obtained: mean accented syllable duration, accented/unaccented duration ratio, and speaking rate. With respect to F0 parameters, mean, median, minimum, maximum, span in semitones, and standard deviations per IP were measured. Additionally, we calculated the number of accented and unaccented syllables, IPs, and pauses in each reading. Statistical analyses show that the two groups differ in their use of F0. Both groups use a higher standard deviation and level in their L2, whereas the ‘German group’ uses a higher pitch span as well. The number of accented syllables, IPs, and pauses is also higher in L2. Regarding duration, both groups use a slower articulation rate. The accented/unaccented syllable duration ratio is lower in L2 for the ‘English group’. We also provide original data on speaking rate in Bulgarian from an information-theoretic perspective.
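
Background on the semitone measure used for F0 span: semitones express F0 on a logarithmic scale, st = 12·log2(f / f_ref), so the span depends only on the ratio between the F0 maximum and minimum within an IP, not on the reference frequency. A minimal Python sketch with illustrative values (not data from the study):

import math

def hz_to_semitones(f_hz, ref_hz=100.0):
    """Convert a frequency in Hz to semitones relative to a reference frequency."""
    return 12.0 * math.log2(f_hz / ref_hz)

# F0 span per intonation phrase: distance between the F0 maximum and minimum.
f0_min, f0_max = 180.0, 310.0  # illustrative Hz values for one IP
span_st = hz_to_semitones(f0_max) - hz_to_semitones(f0_min)
print(round(span_st, 2))  # 9.41 semitones; the reference frequency cancels out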

@inproceedings{andreeva_2022_speechprosody,
title = {The influence of L1 prosody on Bulgarian-accented German and English},
author = {Bistra Andreeva and Snezhina Dimitrova},
url = {https://www.isca-speech.org/archive/speechprosody_2022/andreeva22_speechprosody.html},
doi = {https://doi.org/10.21437/SpeechProsody.2022-155},
year = {2022},
date = {2022},
booktitle = {Proc. Speech Prosody 2022},
pages = {764-768},
address = {Lisbon},
abstract = {The present study investigates L2 prosodic realizations in the readings of two groups of Bulgarian informants: (a) with L2 German, and (b) with L2 English. Each group consisted of ten female learners, who read the fable “The North Wind and the Sun” in their L1 and in the respective L2. We also recorded two groups of female native speakers of the target languages as controls. The following durational parameters were obtained: mean accented syllable duration, accented/unaccented duration ratio, and speaking rate. With respect to F0 parameters, mean, median, minimum, maximum, span in semitones, and standard deviations per IP were measured. Additionally, we calculated the number of accented and unaccented syllables, IPs, and pauses in each reading. Statistical analyses show that the two groups differ in their use of F0. Both groups use a higher standard deviation and level in their L2, whereas the ‘German group’ uses a higher pitch span as well. The number of accented syllables, IPs, and pauses is also higher in L2. Regarding duration, both groups use a slower articulation rate. The accented/unaccented syllable duration ratio is lower in L2 for the ‘English group’. We also provide original data on speaking rate in Bulgarian from an information-theoretic perspective.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Ibrahim, Omnia; Yuen, Ivan; Andreeva, Bistra; Möbius, Bernd

The effect of predictability on German stop voicing is phonologically selective Inproceedings

Proc. Speech Prosody 2022, pp. 669-673, Lisbon, 2022.

Cross-linguistic evidence suggests that syllables in predictable contexts have shorter duration than in unpredictable contexts. However, it is not clear whether predictability uniformly affects the phonetic cues of a phonological feature in a segment. The current study explored the effect of syllable-based predictability on the durational correlates of the phonological stop voicing contrast in German, viz. voice onset time (VOT) and closure duration (CD), using data from Ibrahim et al. [1]. The target stop consonants /b, p, d, k/ occurred in stressed CV syllables in polysyllabic words embedded in a sentence, with either voiced or voiceless preceding contexts. The syllable occurred in either a low or a high predictability condition, which was based on a syllable-level trigram language model. We measured VOT and CD of the target consonants (voiced vs. voiceless). Our results showed an interaction effect of predictability and the voicing status of the target consonants on VOT, but a uniform effect on closure duration. This interaction effect on a primary cue like VOT indicates a selective effect of predictability on VOT, but not on CD. This suggests that the effect of predictability is sensitive to the phonological relevance of a language-specific phonetic cue.

@inproceedings{ibrahim_2022_speechprosody,
title = {The effect of predictability on German stop voicing is phonologically selective},
author = {Omnia Ibrahim and Ivan Yuen and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://www.isca-speech.org/archive/pdfs/speechprosody_2022/ibrahim22_speechprosody.pdf},
doi = {https://doi.org/10.21437/SpeechProsody.2022-136},
year = {2022},
date = {2022},
booktitle = {Proc. Speech Prosody 2022},
pages = {669-673},
address = {Lisbon},
abstract = {Cross-linguistic evidence suggests that syllables in predictable contexts have shorter duration than in unpredictable contexts. However, it is not clear whether predictability uniformly affects the phonetic cues of a phonological feature in a segment. The current study explored the effect of syllable-based predictability on the durational correlates of the phonological stop voicing contrast in German, viz. voice onset time (VOT) and closure duration (CD), using data from Ibrahim et al. [1]. The target stop consonants /b, p, d, k/ occurred in stressed CV syllables in polysyllabic words embedded in a sentence, with either voiced or voiceless preceding contexts. The syllable occurred in either a low or a high predictability condition, which was based on a syllable-level trigram language model. We measured VOT and CD of the target consonants (voiced vs. voiceless). Our results showed an interaction effect of predictability and the voicing status of the target consonants on VOT, but a uniform effect on closure duration. This interaction effect on a primary cue like VOT indicates a selective effect of predictability on VOT, but not on CD. This suggests that the effect of predictability is sensitive to the phonological relevance of a language-specific phonetic cue.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Żygis, Marzena; Beňuš, Štefan; Andreeva, Bistra

Intonation: Other pragmatic functions and phonetic / phonological effects Book Chapter

Bermel, Neil; Fellerer, Jan (Ed.): The Oxford Guide to the Slavonic Languages, Oxford University Press, 2022.

@inbook{Zygis2021intonation,
title = {Intonation: Other pragmatic functions and phonetic / phonological effects},
author = {Marzena Żygis and Štefan Beňuš and Bistra Andreeva},
editor = {Neil Bermel and Jan Fellerer},
year = {2022},
date = {2022},
booktitle = {The Oxford Guide to the Slavonic Languages},
publisher = {Oxford University Press},
pubstate = {published},
type = {inbook}
}


Project:   C1

Andreeva, Bistra; Dimitrova, Snezhina

Intonation and information structure Book Chapter Forthcoming

Bermel, Neil; Fellerer, Jan (Eds.): The Oxford Guide to the Slavonic Languages, Oxford University Press, 2022.

@inbook{Andreeva2022intonation,
title = {Intonation and information structure},
author = {Bistra Andreeva and Snezhina Dimitrova},
editor = {Neil Bermel and Jan Fellerer},
year = {2022},
date = {2022},
booktitle = {The Oxford Guide to the Slavonic Languages},
publisher = {Oxford University Press},
pubstate = {forthcoming},
type = {inbook}
}


Project:   C1

Yuen, Ivan; Xu Rattanasone, Nan; Schmidt, Elaine; Macdonald, Gretel; Holt, Rebecca; Demuth, Katherine

Five-year-olds produce prosodic cues to distinguish compounds from lists in Australian English Journal Article

Journal of Child Language, 48 (1), Cambridge University Press, pp. 110-128, 2021.

Although previous research has indicated that five-year-olds can use acoustic cues to disambiguate compounds (N1 + N2) from lists (N1, N2) (e.g., ‘ice-cream’ vs. ‘ice, cream’) (Yoshida & Katz, 2004, 2006), their productions are not yet fully adult-like (Wells, Peppé & Goulandris, 2004). The goal of this study was to examine this issue in Australian English-speaking children, with a focus on their use of F0, word duration, and pauses. Twenty-four five-year-olds and 20 adults participated in an elicited production experiment. Like adults, children produced distinct F0 patterns for the two structures. They also used longer word durations and more pauses in lists compared to compounds, indicating the presence of a boundary in lists. However, unlike adults, they also inappropriately inserted more pauses within the compound, suggesting the presence of a boundary in compounds as well. The implications for understanding children’s developing knowledge of how to map acoustic cues to prosodic structures are discussed.
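
The duration and pause measures reported here can be illustrated over simple interval annotations; the annotation format and the toy values below are assumptions for illustration (F0 contours are omitted for brevity):

# Sketch: comparing word durations and pause counts between a compound and
# a list production, given (start, end, label) interval annotations.
# The intervals and values are hypothetical, not the study's materials.
from statistics import mean

def word_durations(intervals):
    return [end - start for start, end, label in intervals if label != "<pause>"]

def pause_count(intervals):
    return sum(1 for _, _, label in intervals if label == "<pause>")

compound = [(0.00, 0.42, "ice"), (0.42, 0.95, "cream")]
list_prod = [(0.00, 0.50, "ice"), (0.50, 0.82, "<pause>"), (0.82, 1.45, "cream")]

print(mean(word_durations(list_prod)) > mean(word_durations(compound)))  # True
print(pause_count(list_prod), pause_count(compound))                     # 1 0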

@article{YUENetal2020cues,
title = {Five-year-olds produce prosodic cues to distinguish compounds from lists in Australian English},
author = {Ivan Yuen and Nan Xu Rattanasone and Elaine Schmidt and Gretel Macdonald and Rebecca Holt and Katherine Demuth},
url = {https://doi.org/10.1017/S0305000920000227},
doi = {https://doi.org/10.1017/S0305000920000227},
year = {2021},
date = {2021},
journal = {Journal of Child Language},
pages = {110-128},
publisher = {Cambridge University Press},
volume = {48},
number = {1},
abstract = {

Although previous research has indicated that five-year-olds can use acoustic cues to disambiguate compounds (N1 + N2) from lists (N1, N2) (e.g., ‘ice-cream’ vs. ‘ice, cream’) (Yoshida & Katz, 2004, 2006), their productions are not yet fully adult-like (Wells, Pepp{\'e} & Goulandris, 2004). The goal of this study was to examine this issue in Australian English-speaking children, with a focus on their use of F0, word duration, and pauses. Twenty-four five-year-olds and 20 adults participated in an elicited production experiment. Like adults, children produced distinct F0 patterns for the two structures. They also used longer word durations and more pauses in lists compared to compounds, indicating the presence of a boundary in lists. However, unlike adults, they also inappropriately inserted more pauses within the compound, suggesting the presence of a boundary in compounds as well. The implications for understanding children's developing knowledge of how to map acoustic cues to prosodic structures are discussed.
},
pubstate = {published},
type = {article}
}


Project:   C1

Gessinger, Iona; Möbius, Bernd; Le Maguer, Sébastien; Raveh, Eran; Steiner, Ingmar

Phonetic accommodation in interaction with a virtual language learning tutor: A Wizard-of-Oz study Journal Article

Journal of Phonetics, 86, Article 101029, 2021.

We present a Wizard-of-Oz experiment examining phonetic accommodation of human interlocutors in the context of human-computer interaction. Forty-two native speakers of German engaged in dynamic spoken interaction with a simulated virtual tutor for learning the German language called Mirabella. Mirabella was controlled by the experimenter and used either natural or hidden Markov model-based synthetic speech to communicate with the participants. In the course of four tasks, the participants’ accommodating behavior with respect to wh-question realization and allophonic variation in German was tested. The participants converged to Mirabella with respect to modified wh-question intonation, i.e., rising F0 contour and nuclear pitch accent on the interrogative pronoun, and the allophonic contrast [ɪç] vs. [ɪk] occurring in the word ending -ig. They did not accommodate to the allophonic contrast [ɛː] vs. [eː] as a realization of the long vowel -ä-. The results did not differ between the experimental groups that communicated with either the natural or the synthetic speech version of Mirabella. Testing the influence of the “Big Five” personality traits on the accommodating behavior revealed a tendency for neuroticism to influence the convergence of question intonation. On the level of individual speakers, we found considerable variation with respect to the degree and direction of accommodation. We conclude that phonetic accommodation on the level of local prosody and segmental pronunciation occurs in users of spoken dialog systems, which could be exploited in the context of computer-assisted language learning.
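
Convergence of the kind reported here is often quantified with a difference-in-distance measure: how far the participant is from the model speaker before versus after exposure. The sketch below illustrates that generic measure with hypothetical F0 values; it is not necessarily the exact analysis used in the paper:

# Sketch: difference-in-distance (DID) as a generic convergence measure.
# DID > 0 means the speaker moved toward the model (convergence).
# The F0 values are hypothetical stand-ins, not data from the study.
def did(model_value, baseline_value, exposed_value):
    return abs(model_value - baseline_value) - abs(model_value - exposed_value)

# Hypothetical F0 peaks (Hz) on the interrogative pronoun of a wh-question:
model, baseline, exposed = 220.0, 180.0, 205.0
print(did(model, baseline, exposed))  # 25.0 -> convergence toward Mirabella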

@article{Gessinger/etal:2021a,
title = {Phonetic accommodation in interaction with a virtual language learning tutor: A Wizard-of-Oz study},
author = {Iona Gessinger and Bernd M{\"o}bius and S{\'e}bastien Le Maguer and Eran Raveh and Ingmar Steiner},
url = {https://doi.org/10.1016/j.wocn.2021.101029},
doi = {https://doi.org/10.1016/j.wocn.2021.101029},
year = {2021},
date = {2021},
journal = {Journal of Phonetics},
pages = {101029},
volume = {86},
abstract = {We present a Wizard-of-Oz experiment examining phonetic accommodation of human interlocutors in the context of human-computer interaction. Forty-two native speakers of German engaged in dynamic spoken interaction with a simulated virtual tutor for learning the German language called Mirabella. Mirabella was controlled by the experimenter and used either natural or hidden Markov model-based synthetic speech to communicate with the participants. In the course of four tasks, the participants’ accommodating behavior with respect to wh-question realization and allophonic variation in German was tested. The participants converged to Mirabella with respect to modified wh-question intonation, i.e., rising F0 contour and nuclear pitch accent on the interrogative pronoun, and the allophonic contrast [ɪç] vs. [ɪk] occurring in the word ending -ig. They did not accommodate to the allophonic contrast [ɛː] vs. [eː] as a realization of the long vowel -{\"a}-. The results did not differ between the experimental groups that communicated with either the natural or the synthetic speech version of Mirabella. Testing the influence of the “Big Five” personality traits on the accommodating behavior revealed a tendency for neuroticism to influence the convergence of question intonation. On the level of individual speakers, we found considerable variation with respect to the degree and direction of accommodation. We conclude that phonetic accommodation on the level of local prosody and segmental pronunciation occurs in users of spoken dialog systems, which could be exploited in the context of computer-assisted language learning.},
pubstate = {published},
type = {article}
}


Project:   C1

Gessinger, Iona

Phonetic accommodation of human interlocutors in the context of human-computer interaction PhD Thesis

Saarland University, Saarbruecken, Germany, 2021.

Phonetic accommodation refers to the phenomenon that interlocutors adapt their way of speaking to each other within an interaction. This can have a positive influence on the communication quality. As we increasingly use spoken language to interact with computers these days, the phenomenon of phonetic accommodation is also investigated in the context of human-computer interaction: on the one hand, to find out whether speakers adapt to a computer agent in a similar way as they do to a human interlocutor, on the other hand, to implement accommodation behavior in spoken dialog systems and explore how this affects their users. To date, the focus has been mainly on the global acoustic-prosodic level. The present work demonstrates that speakers interacting with a computer agent also identify locally anchored phonetic phenomena such as segmental allophonic variation and local prosodic features as accommodation targets and converge on them. To this end, we conducted two experiments. First, we applied the shadowing method, where the participants repeated short sentences from natural and synthetic model speakers. In the second experiment, we used the Wizard-of-Oz method, in which an intelligent spoken dialog system is simulated, to enable a dynamic exchange between the participants and a computer agent — the virtual language learning tutor Mirabella. The target language of our experiments was German. Phonetic convergence occurred in both experiments when natural voices were used as well as when synthetic voices were used as stimuli. Moreover, both native and non-native speakers of the target language converged to Mirabella. Thus, accommodation could be relevant, for example, in the context of computer-assisted language learning. Individual variation in accommodation behavior can be attributed in part to speaker-specific characteristics, one of which is assumed to be the personality structure. We included the Big Five personality traits as well as the concept of mental boundaries in the analysis of our data. Different personality traits influenced accommodation to different types of phonetic features. Mental boundaries have not been studied before in the context of phonetic accommodation. We created a validated German adaptation of a questionnaire that assesses the strength of mental boundaries. The latter can be used in future studies involving mental boundaries in native speakers of German.


Bei phonetischer Akkommodation handelt es sich um das Phänomen, dass Gesprächspartner ihre Sprechweise innerhalb einer Interaktion aneinander anpassen. Dies kann die Qualität der Kommunikation positiv beeinflussen. Da wir heutzutage immer öfter mittels gesprochener Sprache mit Computern interagieren, wird das Phänomen der phonetischen Akkommodation auch im Kontext der Mensch-Computer-Interaktion untersucht: zum einen, um herauszufinden, ob sich Sprecher an einen Computeragenten in ähnlicher Weise anpassen wie an einen menschlichen Gesprächspartner, zum anderen, um das Akkommodationsverhalten in Sprachdialogsysteme zu implementieren und zu erforschen, wie dieses auf ihre Benutzer wirkt. Bislang lag der Fokus dabei hauptsächlich auf der globalen akustisch-prosodischen Ebene. Die vorliegende Arbeit zeigt, dass Sprecher in Interaktion mit einem Computeragenten auch lokal verankerte phonetische Phänomene wie segmentale allophone Variation und lokale prosodische Merkmale als Akkommodationsziele identifizieren und in Bezug auf diese konvergieren. Dabei wendeten wir in einem ersten Experiment die Shadowing-Methode an, bei der die Teilnehmer kurze Sätze von natürlichen und synthetischen Modellsprechern wiederholten. In einem zweiten Experiment ermöglichten wir mit der Wizard-of-Oz-Methode, bei der ein intelligentes Sprachdialogsystem simuliert wird, einen dynamischen Austausch zwischen den Teilnehmern und einem Computeragenten — der virtuellen Sprachlerntutorin Mirabella. Die Zielsprache unserer Experimente war Deutsch. Phonetische Konvergenz trat in beiden Experimenten sowohl bei Verwendung natürlicher Stimmen als auch bei Verwendung synthetischer Stimmen als Stimuli auf. Zudem konvergierten sowohl Muttersprachler als auch Nicht-Muttersprachler der Zielsprache zu Mirabella. Somit könnte Akkommodation zum Beispiel im Kontext des computergestützten Sprachenlernens zum Tragen kommen. Individuelle Variation im Akkommodationsverhalten kann unter anderem auf sprecherspezifische Eigenschaften zurückgeführt werden. Es wird vermutet, dass zu diesen auch die Persönlichkeitsstruktur gehört. Wir bezogen die Big Five Persönlichkeitsmerkmale sowie das Konzept der mentalen Grenzen in die Analyse unserer Daten ein. Verschiedene Persönlichkeitsmerkmale beeinflussten die Akkommodation zu unterschiedlichen Typen von phonetischen Merkmalen. Die mentalen Grenzen sind im Zusammenhang mit phonetischer Akkommodation zuvor noch nicht untersucht worden. Wir erstellten eine validierte deutsche Adaptierung eines Fragebogens, der die Stärke der mentalen Grenzen erhebt. Diese kann in zukünftigen Untersuchungen mentaler Grenzen bei Muttersprachlern des Deutschen verwendet werden.

@phdthesis{Gessinger_Diss_2021,
title = {Phonetic accommodation of human interlocutors in the context of human-computer interaction},
author = {Iona Gessinger},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/32213},
doi = {https://doi.org/10.22028/D291-35154},
year = {2021},
date = {2021},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {Phonetic accommodation refers to the phenomenon that interlocutors adapt their way of speaking to each other within an interaction. This can have a positive influence on the communication quality. As we increasingly use spoken language to interact with computers these days, the phenomenon of phonetic accommodation is also investigated in the context of human-computer interaction: on the one hand, to find out whether speakers adapt to a computer agent in a similar way as they do to a human interlocutor, on the other hand, to implement accommodation behavior in spoken dialog systems and explore how this affects their users. To date, the focus has been mainly on the global acoustic-prosodic level. The present work demonstrates that speakers interacting with a computer agent also identify locally anchored phonetic phenomena such as segmental allophonic variation and local prosodic features as accommodation targets and converge on them. To this end, we conducted two experiments. First, we applied the shadowing method, where the participants repeated short sentences from natural and synthetic model speakers. In the second experiment, we used the Wizard-of-Oz method, in which an intelligent spoken dialog system is simulated, to enable a dynamic exchange between the participants and a computer agent — the virtual language learning tutor Mirabella. The target language of our experiments was German. Phonetic convergence occurred in both experiments when natural voices were used as well as when synthetic voices were used as stimuli. Moreover, both native and non-native speakers of the target language converged to Mirabella. Thus, accommodation could be relevant, for example, in the context of computer-assisted language learning. Individual variation in accommodation behavior can be attributed in part to speaker-specific characteristics, one of which is assumed to be the personality structure. We included the Big Five personality traits as well as the concept of mental boundaries in the analysis of our data. Different personality traits influenced accommodation to different types of phonetic features. Mental boundaries have not been studied before in the context of phonetic accommodation. We created a validated German adaptation of a questionnaire that assesses the strength of mental boundaries. The latter can be used in future studies involving mental boundaries in native speakers of German.


Bei phonetischer Akkommodation handelt es sich um das Ph{\"a}nomen, dass Gespr{\"a}chspartner ihre Sprechweise innerhalb einer Interaktion aneinander anpassen. Dies kann die Qualit{\"a}t der Kommunikation positiv beeinflussen. Da wir heutzutage immer {\"o}fter mittels gesprochener Sprache mit Computern interagieren, wird das Ph{\"a}nomen der phonetischen Akkommodation auch im Kontext der Mensch-Computer-Interaktion untersucht: zum einen, um herauszufinden, ob sich Sprecher an einen Computeragenten in {\"a}hnlicher Weise anpassen wie an einen menschlichen Gespr{\"a}chspartner, zum anderen, um das Akkommodationsverhalten in Sprachdialogsysteme zu implementieren und zu erforschen, wie dieses auf ihre Benutzer wirkt. Bislang lag der Fokus dabei haupts{\"a}chlich auf der globalen akustisch-prosodischen Ebene. Die vorliegende Arbeit zeigt, dass Sprecher in Interaktion mit einem Computeragenten auch lokal verankerte phonetische Ph{\"a}nomene wie segmentale allophone Variation und lokale prosodische Merkmale als Akkommodationsziele identifizieren und in Bezug auf diese konvergieren. Dabei wendeten wir in einem ersten Experiment die Shadowing-Methode an, bei der die Teilnehmer kurze S{\"a}tze von nat{\"u}rlichen und synthetischen Modellsprechern wiederholten. In einem zweiten Experiment erm{\"o}glichten wir mit der Wizard-of-Oz-Methode, bei der ein intelligentes Sprachdialogsystem simuliert wird, einen dynamischen Austausch zwischen den Teilnehmern und einem Computeragenten — der virtuellen Sprachlerntutorin Mirabella. Die Zielsprache unserer Experimente war Deutsch. Phonetische Konvergenz trat in beiden Experimenten sowohl bei Verwendung nat{\"u}rlicher Stimmen als auch bei Verwendung synthetischer Stimmen als Stimuli auf. Zudem konvergierten sowohl Muttersprachler als auch Nicht-Muttersprachler der Zielsprache zu Mirabella. Somit k{\"o}nnte Akkommodation zum Beispiel im Kontext des computergest{\"u}tzten Sprachenlernens zum Tragen kommen. Individuelle Variation im Akkommodationsverhalten kann unter anderem auf sprecherspezifische Eigenschaften zur{\"u}ckgef{\"u}hrt werden. Es wird vermutet, dass zu diesen auch die Pers{\"o}nlichkeitsstruktur geh{\"o}rt. Wir bezogen die Big Five Pers{\"o}nlichkeitsmerkmale sowie das Konzept der mentalen Grenzen in die Analyse unserer Daten ein. Verschiedene Pers{\"o}nlichkeitsmerkmale beeinflussten die Akkommodation zu unterschiedlichen Typen von phonetischen Merkmalen. Die mentalen Grenzen sind im Zusammenhang mit phonetischer Akkommodation zuvor noch nicht untersucht worden. Wir erstellten eine validierte deutsche Adaptierung eines Fragebogens, der die St{\"a}rke der mentalen Grenzen erhebt. Diese kann in zuk{\"u}nftigen Untersuchungen mentaler Grenzen bei Muttersprachlern des Deutschen verwendet werden.},
pubstate = {published},
type = {phdthesis}
}


Project:   C1

Raveh, Eran

Vocal accommodation in human-computer interaction: modeling and integration into spoken dialogue systems PhD Thesis

Saarland University, Saarbruecken, Germany, 2021.

With the rapidly increasing usage of voice-activated devices worldwide, verbal communication with computers is steadily becoming more common. Although speech is the principal natural manner of human communication, it is still challenging for computers, and users have grown accustomed to adjusting their speaking style for computers. Such adjustments occur naturally, and typically unconsciously, in humans during an exchange to control the social distance between the interlocutors and improve the conversation’s efficiency. This phenomenon is called accommodation and it occurs on various modalities in human communication, like hand gestures, facial expressions, eye gaze, lexical and grammatical choices, and others. Vocal accommodation deals with phonetic-level changes occurring in segmental and suprasegmental features. A decrease in the difference between the speakers’ feature realizations results in convergence, while an increasing distance leads to divergence. The lack in computers’ speech of such mutual adjustments, which humans make naturally, creates a gap between human-human and human-computer interactions. Moreover, voice-activated systems currently speak in exactly the same manner to all users, regardless of their speech characteristics or realizations of specific features. Detecting phonetic variations and generating adaptive speech output would enhance user personalization, offer more human-like communication, and ultimately should improve the overall interaction experience. Thus, investigating these aspects of accommodation will help to understand and improve human-computer interaction. This thesis provides a comprehensive overview of the required building blocks for a roadmap toward the integration of accommodation capabilities into spoken dialogue systems. These include conducting human-human and human-computer interaction experiments to examine the differences in vocal behaviors, approaches for modeling these empirical findings, methods for introducing phonetic variations in synthesized speech, and a way to combine all these components into an accommodative system. While each component is a wide research field by itself, they depend on each other and hence should be jointly considered. The overarching goal of this thesis is therefore not only to show how each of the aspects can be further developed, but also to demonstrate and motivate the connections between them. A special emphasis is put throughout the thesis on the importance of the temporal aspect of accommodation. Humans constantly change their speech over the course of a conversation. Therefore, accommodation processes should be treated as continuous, dynamic phenomena. Measuring differences at a few discrete points, e.g., the beginning and end of an interaction, may leave many accommodation events undiscovered or overly smoothed. To justify the effort of introducing accommodation in computers, it should first be proven that humans show phonetic adjustments when talking to a computer just as they do with a human being. As there is no definitive metric for measuring accommodation and evaluating its quality, it is important to empirically study humans’ productions to later use as references for possible behaviors. In this work, this investigation encapsulates different experimental configurations to achieve a better picture of accommodation effects. First, vocal accommodation was inspected where it naturally occurs, namely in spontaneous human-human conversations.
For this purpose, a collection of real-world sales conversations, each with a different representative-prospect pair, was compiled and analyzed. These conversations offer a glimpse into accommodation effects in authentic, unscripted interactions with the common goal of negotiating a deal on the one hand, but with the individual facet of each side trying to get the best terms on the other hand. The conversations were analyzed using cross-correlation and time series techniques to capture the change dynamics over time. It was found that successful conversations are distinguishable from failed ones by multiple measures. Furthermore, the sales representatives proved to be better at leading the vocal changes, i.e., making the prospect follow their speech styles rather than the other way around. They also showed a stronger tendency to take that lead at an earlier stage, all the more so in successful conversations. The fact that accommodation occurs more strongly in trained speakers and improves their performance fits the anecdotal best practices of sales experts, which are now also supported scientifically. Following these results, the next experiment came closer to the final goal of this work and investigated vocal accommodation effects in human-computer interaction. This was done via a shadowing experiment, which offers a controlled setting for examining phonetic variations. As spoken dialogue systems with such accommodation capabilities (like this work aims to achieve) do not exist yet, a simulated system was used to introduce these changes to the participants, who believed they were helping to test a language learning tutoring system. After determining their preference concerning three segmental phonetic features, participants were listening to either natural or synthesized voices of male and female speakers, which produced the participants’ dispreferred variation of the aforementioned features. Accommodation occurred in all cases, but the natural voices triggered stronger effects. Nevertheless, it can be concluded that participants were accommodating toward synthetic voices as well, which means that social mechanisms are applied in humans also when speaking with computer-based interlocutors. The shadowing paradigm was also utilized to test whether accommodation is a phenomenon associated only with speech or with other vocal productions as well. To that end, accommodation in the singing of familiar and novel music was examined. Interestingly, accommodation was found in both cases, though in different ways. While participants seemed to use the familiar piece merely as a reference for singing more accurately, the novel piece became the target of complete replication. For example, one difference was that mostly pitch corrections were introduced in the former case, while in the latter case key and rhythmic patterns were also adopted. Some of those findings were expected, and they show that people’s more salient features are also harder to modify using external auditory influence. Lastly, a multiparty experiment with spontaneous human-human-computer interactions was carried out to compare accommodation in human-directed and computer-directed speech. The participants solved tasks for which they needed to talk both with a confederate and with an agent. This allows a direct comparison of their speech based on the addressee within the same conversation, which has not been done so far.
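
The lagged cross-correlation idea used for the sales conversations can be sketched with synthetic feature tracks; who leads is read off the lag at which the correlation peaks (the data and parameter choices below are illustrative assumptions, not the thesis setup):

# Sketch: lagged cross-correlation between two speakers' per-turn feature
# tracks (e.g., median pitch per turn). A peak at a positive lag suggests
# that speaker A leads and speaker B follows. Synthetic data only.
import numpy as np

def peak_lag(a, b, max_lag=5):
    lags = list(range(-max_lag, max_lag + 1))
    corrs = [np.corrcoef(a[max(0, -l):len(a) - max(0, l)],
                         b[max(0, l):len(b) - max(0, -l)])[0, 1] for l in lags]
    return lags[int(np.argmax(corrs))]

rng = np.random.default_rng(0)
rep = rng.normal(200.0, 10.0, 50)                      # representative's track
prospect = np.roll(rep, 2) + rng.normal(0.0, 2.0, 50)  # follows two turns later
print(peak_lag(rep, prospect))  # expected to be close to +2
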
Results show that some participants’ vocal behavior changed similarly when talking to the confederate and the agent, while others’ speech varied only with the confederate. Further analysis found that the greatest factor for this difference was the order in which the participants talked with the interlocutors. Apparently, those who first talked to the agent alone saw it more as a social actor in the conversation, while those who interacted with it after talking to the confederate treated it more as a means to achieve a goal, and thus behaved differently with it. In the latter case, the variations in the human-directed speech were much more prominent. Differences were also found between the analyzed features, but the task type did not influence the degree of accommodation effects. The results of these experiments lead to the conclusion that vocal accommodation does occur in human-computer interactions, even if often to lesser degrees. With the question of whether people also accommodate to computer-based interlocutors now answered, the next step would be to describe accommodative behaviors in a computer-processable manner. Two approaches are proposed here: computational and statistical. The computational model aims to capture the presumed cognitive process associated with accommodation in humans. This comprises various steps, such as detecting the variable feature’s sound, adding instances of it to the feature’s mental memory, and determining how much the sound will change while taking into account both its current representation and the external input. Due to its sequential nature, this model was implemented as a pipeline. Each of the pipeline’s five steps corresponds to a specific part of the cognitive process and can have one or more parameters to control its output (e.g., the size of the feature’s memory or the accommodation pace). Using these parameters, precise accommodative behaviors can be crafted, with expert knowledge motivating the chosen parameter values. These advantages make this approach suitable for experimentation with pre-defined, deterministic behaviors where each step can be changed individually. Ultimately, this approach makes a system vocally responsive to users’ speech input. The second approach grants more evolved behaviors, by defining different core behaviors and adding non-deterministic variations on top of them. This resembles human behavioral patterns, as each person has a base way of accommodating (or not accommodating), which may arbitrarily change based on the specific circumstances. This approach offers a data-driven statistical way to extract accommodation behaviors from a given collection of interactions. First, the target feature’s values of each speaker in an interaction are converted into continuous interpolated lines by drawing one sample from the posterior distribution of a Gaussian process conditioned on the given values. Then, the gradients of these lines, which represent rates of mutual change, are used to define discrete levels of change based on their distribution. Finally, each level is assigned a symbol, which ultimately creates a symbol sequence representation for each interaction. The sequences are clustered so that each cluster stands for a type of behavior. The sequences of a cluster can then be used to calculate n-gram probabilities that enable the generation of new sequences of the captured behavior. The specific output value is sampled from the range corresponding to the generated symbol.
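
A sketch of this statistical extraction step under stated assumptions: fit a Gaussian process to one speaker's feature values, draw one posterior sample as the interpolated line, and discretize its gradient into symbols. The kernel, the noise level, and the three-way quantile split are illustrative choices, and the later clustering and n-gram steps are omitted:

# Sketch: GP posterior sample -> gradient -> discrete change symbols.
# Toy data; scikit-learn is assumed; parameter values are illustrative,
# not the settings used in the thesis.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

t = np.array([0.0, 1.0, 2.5, 4.0, 5.5, 7.0]).reshape(-1, 1)  # utterance times
f = np.array([180.0, 185.0, 192.0, 190.0, 197.0, 203.0])     # feature values

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.5), alpha=1.0)
gp.fit(t, f)

t_dense = np.linspace(0.0, 7.0, 100).reshape(-1, 1)
line = gp.sample_y(t_dense, n_samples=1, random_state=0).ravel()

grad = np.gradient(line, t_dense.ravel())   # rates of mutual change
edges = np.quantile(grad, [1 / 3, 2 / 3])   # three discrete change levels
symbols = "".join("abc"[int(np.digitize(g, edges))] for g in grad)
print(symbols[:20])  # one symbol per time step, e.g. 'bbbcc...'
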
With this approach, accommodation behaviors are extracted directly from data, as opposed to manually crafting them. However, it is harder to describe what exactly these behaviors represent and to motivate the use of one of them over another. To bridge the gap between these two approaches, it is also discussed how they can be combined to benefit from the advantages of both. Furthermore, to generate more structured behaviors, a hierarchy of accommodation complexity levels is suggested here, from a direct adoption of users’ realizations, via specified responsiveness, up to independent core behaviors with non-deterministic variational productions. Besides a way to track and represent vocal changes, an accommodative system also needs a text-to-speech component that is able to realize those changes in the system’s speech output. Speech synthesis models are typically trained once on data with certain characteristics and do not change afterward. This prevents such models from introducing any variation in specific sounds and other phonetic features. Two methods for directly modifying such features are explored here. The first is based on signal modifications applied to the output signal after it was generated by the system. The processing is done between the timestamps of the target features and uses pre-defined scripts that modify the signal to achieve the desired values. This method is more suitable for continuous features like vowel quality, especially in the case of subtle changes that do not necessarily lead to a categorical sound change. The second method aims to capture phonetic variations in the training data. To that end, a training corpus with phonemic representations is used, as opposed to the regular graphemic representations. This way, the model can learn more direct relations between phonemes and sound instead of surface forms and sound, which, depending on the language, might be more complex and depend on the surrounding letters. The target variations themselves do not necessarily need to be explicitly present in the training data, as long as the different sounds are naturally distinguishable. At generation time, the current target feature’s state determines the phoneme to use for generating the desired sound. This method is suitable for categorical changes, especially for contrasts that naturally exist in the language. While both methods have certain limitations, they provide a proof of concept for the idea that spoken dialogue systems may phonetically adapt their speech output in real time and without re-training their text-to-speech models. To combine the behavior definitions and the speech manipulations, a system is required that can connect these elements to create a complete accommodation capability. The architecture suggested here extends the standard spoken dialogue system with an additional module, which receives the transcribed speech signal from the speech recognition component without influencing the input to the language understanding component. While the language understanding component uses only the textual transcription to determine the user’s intention, the added component processes the raw signal along with its phonetic transcription. In this extended architecture, the accommodation model is activated in the added module and the information required for speech manipulation is sent to the text-to-speech component. However, the text-to-speech component now has two inputs, viz. the content of the system’s response coming from the language generation component and the states of the defined target features from the added component. An implementation of a web-based system with this architecture is introduced here, and its functionality is showcased by demonstrating how it can be used to conduct a shadowing experiment automatically. This has two main advantages: First, since the system recognizes the participants’ phonetic variations and automatically selects the appropriate variation to use in its response, the experimenter saves time and avoids manual annotation errors. The experimenter also automatically gains additional information, like exact timestamps of utterances, real-time visualization of the interlocutors’ productions, and the possibility to replay and analyze the interaction after the experiment is finished. The second advantage is scalability. Multiple instances of the system can run on a server and be accessed by multiple clients at the same time. This not only saves the time and logistics of bringing participants into a lab, but also allows running the experiment with different configurations (e.g., other parameter values or target features) in a controlled and reproducible way. This completes a full cycle from examining human behaviors to integrating accommodation capabilities. Though each part of it can undoubtedly be further investigated, the emphasis here is on how the parts depend on and connect to each other. Measuring feature changes without showing how they can be modeled, or achieving flexible speech synthesis without considering the desired final output, might not lead to the final goal of introducing accommodation capabilities into computers. Treating accommodation in human-computer interaction as one large process rather than isolated sub-problems lays the ground for more comprehensive and complete solutions in the future.
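
A schematic sketch of this extended architecture, with hypothetical component names: the added module consumes the phonetic side channel in parallel to language understanding, and the text-to-speech stand-in receives its second input from it (a minimal memory-plus-pace update stands in for the accommodation model; none of this is the thesis implementation):

# Schematic sketch: a spoken dialogue system extended with an accommodation
# module that feeds the TTS component a second input. All names and the
# update rule are hypothetical illustrations of the described architecture.
from collections import deque

class AccommodationModule:
    def __init__(self, memory_size=10, pace=0.3, initial=0.0):
        self.memory = deque([initial], maxlen=memory_size)  # feature memory
        self.pace = pace                                    # accommodation pace

    def observe(self, user_value):
        """Shift the stored feature state toward the user's realization."""
        current = self.state()
        self.memory.append(current + self.pace * (user_value - current))

    def state(self):
        return sum(self.memory) / len(self.memory)

def respond(asr_text, phonetic_value, nlu, nlg, tts, accomm):
    intent = nlu(asr_text)          # language understanding: text only
    accomm.observe(phonetic_value)  # added module: phonetic side channel
    return tts(nlg(intent), accomm.state())  # TTS receives two inputs

# Hypothetical stand-ins for the remaining pipeline components:
nlu = lambda text: {"intent": "statement", "text": text}
nlg = lambda intent: "Alles klar."
tts = lambda text, feature_state: (text, round(feature_state, 2))

module = AccommodationModule(initial=0.2)
print(respond("hallo", 0.8, nlu, nlg, tts, module))  # ('Alles klar.', 0.29)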


Heutzutage wird die verbale Interaktion mit Computern immer gebräuchlicher, was der rasant wachsenden Anzahl von sprachaktivierten Geräten weltweit geschuldet ist. Allerdings stellt die computerseitige Handhabung gesprochener Sprache weiterhin eine große Herausforderung dar, obwohl sie die bevorzugte Art zwischenmenschlicher Kommunikation repräsentiert. Dieser Umstand führt auch dazu, dass Benutzer ihren Sprachstil an das jeweilige Gerät anpassen, um diese Handhabung zu erleichtern. Solche Anpassungen kommen in menschlicher gesprochener Sprache auch in der zwischenmenschlichen Kommunikation vor. Üblicherweise ereignen sie sich unbewusst und auf natürliche Weise während eines Gesprächs, etwa um die soziale Distanz zwischen den Gesprächsteilnehmern zu kontrollieren oder um die Effizienz des Gesprächs zu verbessern. Dieses Phänomen wird als Akkommodation bezeichnet und findet auf verschiedene Weise während menschlicher Kommunikation statt. Sie äußert sich zum Beispiel in der Gestik, Mimik, Blickrichtung oder aber auch in der Wortwahl und dem verwendeten Satzbau. Vokal-Akkommodation beschäftigt sich mit derartigen Anpassungen auf phonetischer Ebene, die sich in segmentalen und suprasegmentalen Merkmalen zeigen. Werden Ausprägungen dieser Merkmale bei den Gesprächsteilnehmern im Laufe des Gesprächs ähnlicher, spricht man von Konvergenz, vergrößern sich allerdings die Unterschiede, so wird dies als Divergenz bezeichnet. Dieser natürliche gegenseitige Anpassungsvorgang fehlt jedoch auf der Seite des Computers, was zu einer Lücke in der Mensch-Maschine-Interaktion führt. Darüber hinaus verwenden sprachaktivierte Systeme immer dieselbe Sprachausgabe und ignorieren folglich etwaige Unterschiede zum Sprachstil des momentanen Benutzers. Die Erkennung dieser phonetischen Abweichungen und die Erstellung von anpassungsfähiger Sprachausgabe würden zur Personalisierung dieser Systeme beitragen und könnten letztendlich die Benutzererfahrung insgesamt verbessern. Aus diesem Grund kann die Erforschung dieser Aspekte von Akkommodation helfen, Mensch-Maschine-Interaktion besser zu verstehen und weiterzuentwickeln. Die vorliegende Dissertation stellt einen umfassenden Überblick zu Bausteinen bereit, die nötig sind, um Akkommodationsfähigkeiten in Sprachdialogsysteme zu integrieren. In diesem Zusammenhang wurden auch interaktive Mensch-Mensch- und Mensch-Maschine-Experimente durchgeführt. In diesen Experimenten wurden Differenzen der vokalen Verhaltensweisen untersucht und Methoden erforscht, wie phonetische Abweichungen in synthetische Sprachausgabe integriert werden können. Um die erhaltenen Ergebnisse empirisch auswerten zu können, wurden hierbei auch verschiedene Modellierungsansätze erforscht. Fernerhin wurde der Frage nachgegangen, wie sich die betreffenden Komponenten kombinieren lassen, um ein Akkommodationssystem zu konstruieren. Jeder dieser Aspekte stellt für sich genommen bereits einen überaus breiten Forschungsbereich dar. Allerdings sind sie voneinander abhängig und sollten zusammen betrachtet werden. Aus diesem Grund liegt ein übergreifender Schwerpunkt dieser Dissertation darauf, nicht nur aufzuzeigen, wie sich diese Aspekte weiterentwickeln lassen, sondern auch zu motivieren, wie sie zusammenhängen. Ein weiterer Schwerpunkt dieser Arbeit befasst sich mit der zeitlichen Komponente des Akkommodationsprozesses, was auf der Beobachtung fußt, dass Menschen im Laufe eines Gesprächs ständig ihren Sprachstil ändern.
Diese Beobachtung legt nahe, derartige Prozesse als kontinuierliche und dynamische Prozesse anzusehen. Fasst man jedoch diesen Prozess als diskret auf und betrachtet z.B. nur den Beginn und das Ende einer Interaktion, kann dies dazu führen, dass viele Akkommodationsereignisse unentdeckt bleiben oder übermäßig geglättet werden. Um die Entwicklung eines vokalen Akkommodationssystems zu rechtfertigen, muss zuerst bewiesen werden, dass Menschen bei der vokalen Interaktion mit einem Computer ein ähnliches Anpassungsverhalten zeigen wie bei der Interaktion mit einem Menschen. Da es keine eindeutig festgelegte Metrik für das Messen des Akkommodationsgrades und für die Evaluierung der Akkommodationsqualität gibt, ist es besonders wichtig, die Sprachproduktion von Menschen empirisch zu untersuchen, um sie als Referenz für mögliche Verhaltensweisen anzuwenden. In dieser Arbeit schließt diese Untersuchung verschiedene experimentelle Anordnungen ein, um einen besseren Überblick über Akkommodationseffekte zu erhalten. In einer ersten Studie wurde die vokale Akkommodation in einer Umgebung untersucht, in der sie natürlich vorkommt: in einem spontanen Mensch-Mensch-Gespräch. Zu diesem Zweck wurde eine Sammlung von echten Verkaufsgesprächen zusammengestellt und analysiert, wobei an jedem dieser Gespräche ein anderes Handelsvertreter-Neukunden-Paar teilgenommen hatte. Diese Gespräche verschaffen einen Einblick in Akkommodationseffekte während spontaner, authentischer Interaktionen, wobei die Gesprächsteilnehmer zwei Ziele verfolgen: zum einen soll ein Geschäft verhandelt werden, zum anderen möchte aber jeder Teilnehmer für sich die besten Bedingungen aushandeln. Die Konversationen wurden mittels Kreuzkorrelations- und Zeitreihenverfahren analysiert, um die dynamischen Änderungen im Zeitverlauf zu erfassen. Hierbei kam zum Vorschein, dass sich erfolgreiche Konversationen von fehlgeschlagenen Gesprächen deutlich unterscheiden lassen. Überdies wurde festgestellt, dass die Handelsvertreter die treibende Kraft von vokalen Änderungen sind, d.h. sie können die Neukunden eher dazu bringen, ihren Sprachstil anzupassen, als andersherum. Es wurde auch beobachtet, dass sie diese Akkommodation oft schon zu einem frühen Zeitpunkt auslösen, was besonders bei erfolgreichen Gesprächen beobachtet werden konnte. Dass diese Akkommodation stärker bei trainierten Sprechern ausgelöst wird, deckt sich mit den meist anekdotischen Empfehlungen von erfahrenen Handelsvertretern, die bisher nie wissenschaftlich nachgewiesen worden sind. Basierend auf diesen Ergebnissen beschäftigte sich die nächste Studie mehr mit dem Hauptziel dieser Arbeit und untersuchte Akkommodationseffekte bei Mensch-Maschine-Interaktionen. Diese Studie führte ein Shadowing-Experiment durch, das ein kontrolliertes Umfeld für die Untersuchung phonetischer Abweichungen anbietet. Da Sprachdialogsysteme mit solchen Akkommodationsfähigkeiten noch nicht existieren, wurde stattdessen ein simuliertes System eingesetzt, um diese Akkommodationsprozesse bei den Teilnehmern auszulösen, wobei diese im Glauben waren, ein Sprachlernsystem zu testen. Nach der Bestimmung ihrer Präferenzen hinsichtlich dreier segmentaler Merkmale hörten die Teilnehmer entweder natürlichen oder synthetischen Stimmen von männlichen und weiblichen Sprechern zu, die nicht die bevorzugte Variation der oben genannten Merkmale produzierten. Akkommodation fand in allen Fällen statt, obwohl die natürlichen Stimmen stärkere Effekte auslösten.
Es kann jedoch gefolgert werden, dass Teilnehmer sich auch an den synthetischen Stimmen orientierten, was bedeutet, dass soziale Mechanismen bei Menschen auch beim Sprechen mit Computern angewendet werden. Das Shadowing-Paradigma wurde auch verwendet, um zu testen, ob Akkommodation ein nur mit Sprache assoziiertes Phänomen ist oder ob sie auch in anderen vokalen Aktivitäten stattfindet. Hierzu wurde Akkommodation im Gesang zu vertrauter und unbekannter Musik untersucht. Interessanterweise wurden in beiden Fällen Akkommodationseffekte gemessen, wenn auch nur auf unterschiedliche Weise. Wohingegen die Teilnehmer das vertraute Stück lediglich als Referenz für einen genaueren Gesang zu verwenden schienen, wurde das neuartige Stück zum Ziel einer vollständigen Nachbildung. Ein Unterschied bestand z.B. darin, dass im ersteren Fall hauptsächlich Tonhöhenkorrekturen durchgeführt wurden, während im zweiten Fall auch Tonart und Rhythmusmuster übernommen wurden. Einige dieser Ergebnisse wurden erwartet und zeigen, dass die hervorstechenderen Merkmale von Menschen auch durch externen auditorischen Einfluss schwerer zu modifizieren sind. Zuletzt wurde ein Mehrparteienexperiment mit spontanen Mensch-Mensch-Computer-Interaktionen durchgeführt, um Akkommodation in mensch- und computergerichteter Sprache zu vergleichen. Die Teilnehmer lösten Aufgaben, für die sie sowohl mit einem Konföderierten als auch mit einem Agenten sprechen mussten. Dies ermöglicht einen direkten Vergleich ihrer Sprache basierend auf dem Adressaten innerhalb derselben Konversation, was bisher noch nicht erforscht worden ist. Die Ergebnisse zeigen, dass sich das vokale Verhalten einiger Teilnehmer im Gespräch mit dem Konföderierten und dem Agenten ähnlich änderte, während die Sprache anderer Teilnehmer nur mit dem Konföderierten variierte. Weitere Analysen ergaben, dass der größte Faktor für diesen Unterschied die Reihenfolge war, in der die Teilnehmer mit den Gesprächspartnern sprachen. Anscheinend sahen die Teilnehmer, die zuerst mit dem Agenten allein sprachen, ihn eher als einen sozialen Akteur im Gespräch, während diejenigen, die erst mit dem Konföderierten interagierten, ihn eher als Mittel zur Erreichung eines Ziels betrachteten und sich deswegen anders verhielten. Im letzteren Fall waren die Variationen in der menschgerichteten Sprache viel ausgeprägter. Unterschiede wurden auch zwischen den analysierten Merkmalen festgestellt, aber der Aufgabentyp hatte keinen Einfluss auf den Grad der Akkommodationseffekte. Die Ergebnisse dieser Experimente lassen den Schluss zu, dass bei Mensch-Computer-Interaktionen vokale Akkommodation auftritt, wenn auch häufig in geringerem Maße. Da nun eine Bestätigung dafür vorliegt, dass Menschen auch bei der Interaktion mit Computern ein Akkommodationsverhalten aufzeigen, liegt der Schritt nahe, dieses Verhalten auf eine computergestützte Weise zu beschreiben. Hier werden zwei Ansätze vorgeschlagen: ein Ansatz basierend auf einem Rechenmodell und einer basierend auf einem statistischen Modell. Das Ziel des Rechenmodells ist es, den vermuteten kognitiven Prozess zu erfassen, der mit der Akkommodation beim Menschen verbunden ist. Dies umfasst verschiedene Schritte, z.B. das Erkennen des Klangs des variablen Merkmals, das Hinzufügen von Instanzen davon zum mentalen Gedächtnis des Merkmals und das Bestimmen, wie stark sich das Merkmal ändert, wobei sowohl seine aktuelle Darstellung als auch die externe Eingabe berücksichtigt werden.
Aufgrund seiner sequenziellen Natur wurde dieses Modell als eine Pipeline implementiert. Jeder der fünf Schritte der Pipeline entspricht einem bestimmten Teil des kognitiven Prozesses und kann einen oder mehrere Parameter zur Steuerung seiner Ausgabe aufweisen (z.B. die Größe des Gedächtnisses des Merkmals oder die Akkommodationsgeschwindigkeit). Mit Hilfe dieser Parameter können präzise akkommodative Verhaltensweisen erstellt werden, wobei Expertenwissen die ausgewählten Parameterwerte motiviert. Durch diese Vorteile ist dieser Ansatz besonders zum Experimentieren mit vordefinierten, deterministischen Verhaltensweisen geeignet, bei denen jeder Schritt einzeln geändert werden kann. Letztendlich macht dieser Ansatz ein System stimmlich auf die Spracheingabe von Benutzern ansprechbar. Der zweite Ansatz gewährt weiterentwickelte Verhaltensweisen, indem verschiedene Kernverhalten definiert und nicht-deterministische Variationen hinzugefügt werden. Dies ähnelt menschlichen Verhaltensmustern, da jede Person eine grundlegende Art von Akkommodationsverhalten hat, das sich je nach den spezifischen Umständen willkürlich ändern kann. Dieser Ansatz bietet eine datengesteuerte statistische Methode, um das Akkommodationsverhalten aus einer bestimmten Sammlung von Interaktionen zu extrahieren. Zunächst werden die Werte des Zielmerkmals jedes Sprechers in einer Interaktion in kontinuierliche interpolierte Linien umgewandelt, indem eine Probe aus der a-posteriori-Verteilung eines Gaußprozesses gezogen wird, der von den angegebenen Werten abhängig ist. Dann werden die Gradienten dieser Linien, die die gegenseitigen Änderungsraten darstellen, verwendet, um diskrete Änderungsniveaus basierend auf ihren Verteilungen zu definieren. Schließlich wird jeder Ebene ein Symbol zugewiesen, was letztendlich eine Symbolsequenzdarstellung für jede Interaktion ergibt. Die Sequenzen werden geclustert, sodass jeder Cluster für eine Art von Verhalten steht. Die Sequenzen eines Clusters können dann verwendet werden, um N-Gramm-Wahrscheinlichkeiten zu berechnen, die die Erzeugung neuer Sequenzen des erfassten Verhaltens ermöglichen. Der spezifische Ausgabewert wird aus dem Bereich abgetastet, der dem erzeugten Symbol entspricht. Bei diesem Ansatz wird das Akkommodationsverhalten direkt aus Daten extrahiert, anstatt manuell erstellt zu werden. Es kann jedoch schwierig sein, zu beschreiben, was genau jedes Verhalten darstellt, und die Verwendung eines Verhaltens gegenüber einem anderen zu motivieren. Um diese Lücke zwischen den beiden Ansätzen zu schließen, wird auch diskutiert, wie sie kombiniert werden könnten, um von den Vorteilen beider zu profitieren. Um darüber hinaus strukturiertere Verhaltensweisen zu generieren, wird hier eine Hierarchie von Akkommodationskomplexitätsstufen vorgeschlagen, die von einer direkten Übernahme der Benutzerrealisierungen über eine bestimmte Änderungssensitivität bis hin zu unabhängigen Kernverhalten mit nicht-deterministischen Variationsproduktionen reicht. Neben der Möglichkeit, Stimmänderungen zu verfolgen und darzustellen, benötigt ein akkommodatives System auch eine Text-zu-Sprache-Komponente, die diese Änderungen in der Sprachausgabe des Systems realisieren kann. Sprachsynthesemodelle werden in der Regel einmal mit Daten mit bestimmten Merkmalen trainiert und ändern sich danach nicht mehr. Dies verhindert, dass solche Modelle Variationen in bestimmten Klängen und anderen phonetischen Merkmalen generieren können.
Zwei Methoden zum direkten Ändern solcher Merkmale werden hier untersucht. Die erste basiert auf Signalverarbeitung, die auf das Ausgangssignal angewendet wird, nachdem es vom System erzeugt wurde. Die Verarbeitung erfolgt zwischen den Zeitstempeln der Zielmerkmale und verwendet vordefinierte Skripte, die das Signal modifizieren, um die gewünschten Werte zu erreichen. Diese Methode eignet sich besser für kontinuierliche Merkmale wie Vokalqualität, insbesondere bei subtilen Änderungen, die nicht unbedingt zu einer kategorialen Klangänderung führen. Die zweite Methode zielt darauf ab, phonetische Variationen in den Trainingsdaten zu erfassen. Zu diesem Zweck wird im Gegensatz zu den regulären graphemischen Darstellungen ein Trainingskorpus mit phonemischen Darstellungen verwendet. Auf diese Weise kann das Modell direktere Beziehungen zwischen Phonemen und Klang anstelle von Oberflächenformen und Klang erlernen, die je nach Sprache komplexer sein und von den umgebenden Buchstaben abhängen können. Die Zielvariationen selbst müssen nicht unbedingt explizit in den Trainingsdaten enthalten sein, solange die verschiedenen Klänge natürlicherweise unterscheidbar sind. In der Generierungsphase bestimmt der Zustand des aktuellen Zielmerkmals das Phonem, das zum Erzeugen des gewünschten Klangs verwendet werden sollte. Diese Methode eignet sich für kategoriale Änderungen, insbesondere für Kontraste, die natürlicherweise in der Sprache existieren. Obwohl beide Methoden gewisse Einschränkungen aufweisen, liefern sie einen Machbarkeitsnachweis für die Idee, dass Sprachdialogsysteme ihre Sprachausgabe in Echtzeit phonetisch anpassen können, ohne ihre Text-zu-Sprache-Modelle neu zu trainieren. Um die Verhaltensdefinitionen und die Sprachmanipulation zu kombinieren, ist ein System erforderlich, das diese Elemente verbinden kann, um ein vollständiges akkommodationsfähiges System zu schaffen. Die hier vorgeschlagene Architektur erweitert den Standardfluss von Sprachdialogsystemen um ein zusätzliches Modul, das das transkribierte Sprachsignal von der Spracherkennungskomponente empfängt, ohne die Eingabe in die Sprachverständniskomponente zu beeinflussen. Während die Sprachverständniskomponente nur die Texttranskription verwendet, um die Absicht des Benutzers zu bestimmen, verarbeitet die hinzugefügte Komponente das Rohsignal zusammen mit seiner phonetischen Transkription. In dieser erweiterten Architektur wird das Akkommodationsmodell in dem hinzugefügten Modul aktiviert und die für die Sprachmanipulation erforderlichen Informationen werden an die Text-zu-Sprache-Komponente gesendet. Die Text-zu-Sprache-Komponente hat jetzt zwei Eingaben, nämlich den Inhalt der Systemantwort, der von der Sprachgenerierungskomponente stammt, und die Zustände der definierten Zielmerkmale von der hinzugefügten Komponente. Hier wird eine Implementierung eines webbasierten Systems mit dieser Architektur vorgestellt, und seine Funktionalität wird anhand eines Vorzeigeszenarios demonstriert, in dem es verwendet wird, um ein Shadowing-Experiment automatisch durchzuführen. Dies hat zwei Hauptvorteile: Erstens spart der Experimentator Zeit und vermeidet manuelle Annotationsfehler, da das System die phonetischen Variationen der Teilnehmer erkennt und automatisch die geeignete Variation für die Rückmeldung auswählt.
Der Experimentator erhält außerdem automatisch zusätzliche Informationen wie genaue Zeitstempel der Äußerungen, Echtzeitvisualisierung der Produktionen der Gesprächspartner und die Möglichkeit, die Interaktion nach Abschluss des Experiments erneut abzuspielen und zu analysieren. Der zweite Vorteil ist Skalierbarkeit. Mehrere Instanzen des Systems können auf einem Server ausgeführt werden, auf die mehrere Clients gleichzeitig zugreifen können. Dies spart nicht nur die Zeit und Logistik, Teilnehmer in ein Labor zu bringen, sondern ermöglicht auch die kontrollierte und reproduzierbare Durchführung von Experimenten mit verschiedenen Konfigurationen (z.B. andere Parameterwerte oder Zielmerkmale). Dies schließt einen vollständigen Zyklus von der Untersuchung des menschlichen Verhaltens bis zur Integration der Akkommodationsfähigkeiten ab. Obwohl jeder Teil davon zweifellos weiter untersucht werden kann, liegt der Schwerpunkt hier darauf, wie sie voneinander abhängen und sich miteinander kombinieren lassen. Das Messen von Merkmalsänderungen, ohne zu zeigen, wie sie modelliert werden können, oder das Erreichen einer flexiblen Sprachsynthese ohne Berücksichtigung der gewünschten endgültigen Ausgabe führt möglicherweise nicht zum endgültigen Ziel, Akkommodationsfähigkeiten in Computer zu integrieren. Indem diese Dissertation die Vokal-Akkommodation in der Mensch-Computer-Interaktion als einen einzigen großen Prozess betrachtet und nicht als eine Sammlung isolierter Unterprobleme, schafft sie ein Fundament für umfassendere und vollständigere Lösungen in der Zukunft.

@phdthesis{Raveh_Diss_2021,
title = {Vocal accommodation in human-computer interaction: modeling and integration into spoken dialogue systems},
author = {Eran Raveh},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/31960},
doi = {https://doi.org/10.22028/D291-34889},
year = {2021},
date = {2021-12-07},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {With the rapidly increasing usage of voice-activated devices worldwide, verbal communication with computers is steadily becoming more common. Although speech is the principal natural manner of human communication, it is still challenging for computers, and users had been growing accustomed to adjusting their speaking style for computers. Such adjustments occur naturally, and typically unconsciously, in humans during an exchange to control the social distance between the interlocutors and improve the conversation’s efficiency. This phenomenon is called accommodation and it occurs on various modalities in human communication, like hand gestures, facial expressions, eye gaze, lexical and grammatical choices, and others. Vocal accommodation deals with phonetic-level changes occurring in segmental and suprasegmental features. A decrease in the difference between the speakers’ feature realizations results in convergence, while an increasing distance leads to divergence. The lack of such mutual adjustments made naturally by humans in computers’ speech creates a gap between human-human and human-computer interactions. Moreover, voice-activated systems currently speak in exactly the same manner to all users, regardless of their speech characteristics or realizations of specific features. Detecting phonetic variations and generating adaptive speech output would enhance user personalization, offer more human-like communication, and ultimately should improve the overall interaction experience. Thus, investigating these aspects of accommodation will help to understand and improving human-computer interaction. This thesis provides a comprehensive overview of the required building blocks for a roadmap toward the integration of accommodation capabilities into spoken dialogue systems. These include conducting human-human and human-computer interaction experiments to examine the differences in vocal behaviors, approaches for modeling these empirical findings, methods for introducing phonetic variations in synthesized speech, and a way to combine all these components into an accommodative system. While each component is a wide research field by itself, they depend on each other and hence should be jointly considered. The overarching goal of this thesis is therefore not only to show how each of the aspects can be further developed, but also to demonstrate and motivate the connections between them. A special emphasis is put throughout the thesis on the importance of the temporal aspect of accommodation. Humans constantly change their speech over the course of a conversation. Therefore, accommodation processes should be treated as continuous, dynamic phenomena. Measuring differences in a few discrete points, e.g., beginning and end of an interaction, may leave many accommodation events undiscovered or overly smoothed. To justify the effort of introducing accommodation in computers, it should first be proven that humans even show any phonetic adjustments when talking to a computer as they do with a human being. As there is no definitive metric for measuring accommodation and evaluating its quality, it is important to empirically study humans productions to later use as references for possible behaviors. In this work, this investigation encapsulates different experimental configurations to achieve a better picture of accommodation effects. First, vocal accommodation was inspected where it naturally occurs, namely in spontaneous human-human conversations. 
For this purpose, a corpus of real-world sales conversations, each with a different representative-prospect pair, was collected and analyzed. These conversations offer a glimpse into accommodation effects in authentic, unscripted interactions, with the common goal of negotiating a deal on the one hand, but with each side’s individual aim of obtaining the best terms on the other. The conversations were analyzed using cross-correlation and time-series techniques to capture the change dynamics over time. It was found that successful conversations are distinguishable from failed ones by multiple measures. Furthermore, the sales representatives proved to be better at leading the vocal changes, i.e., making the prospect follow their speech style rather than the other way around. They also showed a stronger tendency to take that lead at an earlier stage, all the more so in successful conversations. The fact that trained speakers accommodate more, and improve their performance by doing so, fits the anecdotal best practices of sales experts, which are thus also supported scientifically. Following these results, the next experiment came closer to the final goal of this work and investigated vocal accommodation effects in human-computer interaction. This was done via a shadowing experiment, which offers a controlled setting for examining phonetic variations. As spoken dialogue systems with such accommodation capabilities (like those this work aims to achieve) do not exist yet, a simulated system was used to introduce these changes to the participants, who believed they were helping to test a language-learning tutoring system. After determining their preference concerning three segmental phonetic features, participants listened to either natural or synthesized voices of male and female speakers, which produced the participants’ dispreferred variation of the aforementioned features. Accommodation occurred in all cases, but the natural voices triggered stronger effects. Nevertheless, it can be concluded that participants accommodated toward synthetic voices as well, which means that humans apply social mechanisms also when speaking with computer-based interlocutors. The shadowing paradigm was also utilized to test whether accommodation is a phenomenon associated only with speech or with other vocal productions as well. To that end, accommodation in the singing of familiar and novel music was examined. Interestingly, accommodation was found in both cases, though in different ways. While participants seemed to use the familiar piece merely as a reference for singing more accurately, the novel piece became the target of complete replication. For example, one difference was that mostly pitch corrections were introduced in the former case, while in the latter key and rhythmic patterns were adopted as well. Some of these findings were expected, and they show that people’s more salient features are also harder to modify through external auditory influence. Lastly, a multiparty experiment with spontaneous human-human-computer interactions was carried out to compare accommodation in human-directed and computer-directed speech. The participants solved tasks for which they needed to talk both with a confederate and with an agent. This allows a direct comparison of their speech based on the addressee within the same conversation, which had not been done before. 
Results show that some participants’ vocal behavior changed similarly when talking to the confederate and the agent, while others’ speech varied only with the confederate. Further analysis found that the greatest factor behind this difference was the order in which the participants talked with the interlocutors. Apparently, those who first talked to the agent alone saw it more as a social actor in the conversation, while those who interacted with it after talking to the confederate treated it more as a means to achieve a goal, and thus behaved differently with it. In the latter case, the variations in the human-directed speech were much more prominent. Differences were also found between the analyzed features, but the task type did not influence the degree of accommodation effects. The results of these experiments lead to the conclusion that vocal accommodation does occur in human-computer interactions, even if often to lesser degrees. With the question of whether people accommodate to computer-based interlocutors answered, the next step is to describe accommodative behaviors in a computer-processable manner. Two approaches are proposed here: computational and statistical. The computational model aims to capture the presumed cognitive process associated with accommodation in humans. This comprises various steps, such as detecting the variable feature’s sound, adding instances of it to the feature’s mental memory, and determining how much the sound will change, taking into account both its current representation and the external input. Due to its sequential nature, this model was implemented as a pipeline. Each of the pipeline’s five steps corresponds to a specific part of the cognitive process and can have one or more parameters to control its output (e.g., the size of the feature’s memory or the accommodation pace). Using these parameters, precise accommodative behaviors can be crafted, with expert knowledge motivating the chosen parameter values. These advantages make this approach suitable for experimentation with pre-defined, deterministic behaviors where each step can be changed individually. Ultimately, this approach makes a system vocally responsive to users’ speech input. The second approach enables more elaborate behaviors by defining different core behaviors and adding non-deterministic variations on top of them. This resembles human behavioral patterns, as each person has a base way of accommodating (or not accommodating), which may arbitrarily change depending on the specific circumstances. This approach offers a data-driven statistical way to extract accommodation behaviors from a given collection of interactions. First, the target feature’s values of each speaker in an interaction are converted into continuous interpolated lines by drawing one sample from the posterior distribution of a Gaussian process conditioned on the given values. Then, the gradients of these lines, which represent rates of mutual change, are used to define discrete levels of change based on their distribution. Finally, each level is assigned a symbol, which ultimately creates a symbol-sequence representation for each interaction. The sequences are clustered so that each cluster stands for a type of behavior. The sequences of a cluster can then be used to calculate n-gram probabilities that enable the generation of new sequences of the captured behavior. The specific output value is sampled from the range corresponding to the generated symbol. 
With this approach, accommodation behaviors are extracted directly from data, as opposed to crafting them manually. However, it is harder to describe what exactly these behaviors represent and to motivate the use of one of them over another. To bridge the gap between these two approaches, it is also discussed how they can be combined to benefit from the advantages of both. Furthermore, to generate more structured behaviors, a hierarchy of accommodation complexity levels is suggested here, ranging from a direct adoption of users’ realizations, via specified responsiveness, up to independent core behaviors with non-deterministic variational productions. Besides a way to track and represent vocal changes, an accommodative system also needs a text-to-speech component that is able to realize those changes in the system’s speech output. Speech synthesis models are typically trained once on data with certain characteristics and do not change afterward. This prevents such models from introducing any variation in specific sounds and other phonetic features. Two methods for directly modifying such features are explored here. The first is based on signal modifications applied to the output signal after it has been generated by the system. The processing is done between the timestamps of the target features and uses pre-defined scripts that modify the signal to achieve the desired values. This method is more suitable for continuous features like vowel quality, especially in the case of subtle changes that do not necessarily lead to a categorical sound change. The second method aims to capture phonetic variations in the training data. To that end, a training corpus with phonemic representations is used, as opposed to the usual graphemic representations. This way, the model can learn more direct relations between phonemes and sound, instead of between surface forms and sound, which, depending on the language, might be more complex and depend on the surrounding letters. The target variations themselves do not necessarily need to be explicitly present in the training data, as long as the different sounds are naturally distinguishable. At generation time, the current state of the target feature determines the phoneme to use for generating the desired sound. This method is suitable for categorical changes, especially for contrasts that naturally exist in the language. While both methods have certain limitations, they provide a proof of concept for the idea that spoken dialogue systems may phonetically adapt their speech output in real time and without re-training their text-to-speech models. To combine the behavior definitions and the speech manipulations, a system is required that can connect these elements to create a complete accommodation capability. The architecture suggested here extends the standard spoken dialogue system with an additional module, which receives the transcribed speech signal from the speech recognition component without influencing the input to the language understanding component. While the language understanding component uses only the textual transcription to determine the user’s intention, the added component processes the raw signal along with its phonetic transcription. In this extended architecture, the accommodation model is activated in the added module, and the information required for speech manipulation is sent to the text-to-speech component. However, the text-to-speech component now has two inputs, viz. 
the content of the system’s response coming from the language generation component, and the states of the defined target features from the added component. An implementation of a web-based system with this architecture is introduced here, and its functionality is showcased by demonstrating how it can be used to conduct a shadowing experiment automatically. This has two main advantages. First, since the system recognizes the participants’ phonetic variations and automatically selects the appropriate variation to use in its response, the experimenter saves time and avoids manual annotation errors. The experimenter also automatically gains additional information, like exact timestamps of utterances, real-time visualization of the interlocutors’ productions, and the possibility to replay and analyze the interaction after the experiment is finished. The second advantage is scalability. Multiple instances of the system can run on a server and be accessed by multiple clients at the same time. This not only saves the time and logistics of bringing participants into a lab, but also allows running the experiment with different configurations (e.g., other parameter values or target features) in a controlled and reproducible way. This completes a full cycle from examining human behaviors to integrating accommodation capabilities. Though each part of it can undoubtedly be investigated further, the emphasis here is on how they depend on and connect to each other. Measuring feature changes without showing how they can be modeled, or achieving flexible speech synthesis without considering the desired final output, might not lead to the final goal of introducing accommodation capabilities into computers. Treating accommodation in human-computer interaction as one large process rather than as isolated sub-problems lays the ground for more comprehensive and complete solutions in the future.},
pubstate = {published},
type = {phdthesis}
}

Copy BibTeX to Clipboard

Project:   C1
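
The computational model described in the abstract above is a five-step pipeline: detect the feature, store heard instances in a bounded memory, and update the system's own representation at a controlled pace. The following Python sketch illustrates that idea in miniature. It is not the thesis implementation; the class name, the mean-of-memory target, and the single linear update rule are assumptions made for this example.

from collections import deque

class AccommodationPipeline:
    """Toy accommodation model: bounded memory plus paced update."""

    def __init__(self, initial_value, memory_size=10, pace=0.2):
        self.own_value = initial_value            # current internal representation
        self.memory = deque(maxlen=memory_size)   # bounded "mental memory" of heard tokens
        self.pace = pace                          # accommodation rate in [0, 1]

    def hear(self, heard_value):
        """Process one detected instance of the tracked feature."""
        self.memory.append(heard_value)                           # store the instance
        target = sum(self.memory) / len(self.memory)              # summarize the memory
        self.own_value += self.pace * (target - self.own_value)   # paced update toward it
        return self.own_value

# Example: a speaker whose F1 is 400 Hz repeatedly hears realizations near 480 Hz.
model = AccommodationPipeline(initial_value=400.0, memory_size=5, pace=0.3)
for heard in (470.0, 485.0, 480.0, 475.0):
    print(round(model.hear(heard), 1))   # drifts from 400 toward the heard values

Setting pace to 0 yields a non-accommodating core behavior, and adding random noise to the update would move the sketch toward the non-deterministic variations of the second approach.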

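The statistical approach in the same abstract builds symbol sequences from a Gaussian-process posterior sample. Below is a minimal sketch of that chain (posterior sample, gradients, distribution-based symbols, bigram generation) using scikit-learn and synthetic values; the clustering of sequences across many interactions is omitted, and all data and parameter choices are placeholders.

import numpy as np
from collections import Counter
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Placeholder feature track: observation times (s) and feature values (e.g., F0 in Hz).
times = np.array([1.0, 4.0, 7.0, 11.0, 15.0, 20.0]).reshape(-1, 1)
values = np.array([182.0, 190.0, 187.0, 201.0, 206.0, 204.0])

# 1) Draw one sample from the GP posterior conditioned on the observed values.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=3.0), normalize_y=True)
gp.fit(times, values)
grid = np.linspace(times.min(), times.max(), 200).reshape(-1, 1)
line = gp.sample_y(grid, n_samples=1, random_state=0).ravel()

# 2) Gradients of the interpolated line represent local rates of change.
grads = np.gradient(line, grid.ravel())

# 3) Discretize the gradients into symbols based on their distribution.
edges = np.quantile(grads, [1 / 3, 2 / 3])
symbols = np.digitize(grads, edges)        # 0 = falling, 1 = steady, 2 = rising

# 4) Bigram probabilities over the symbol sequence enable generating new ones.
bigrams = Counter(zip(symbols[:-1], symbols[1:]))

def next_symbol(current):
    cand = {b: c for (a, b), c in bigrams.items() if a == current}
    if not cand:                           # unseen context: stay at the current level
        return current
    total = sum(cand.values())
    return rng.choice(list(cand.keys()), p=[c / total for c in cand.values()])

generated = [int(symbols[0])]
for _ in range(20):
    generated.append(int(next_symbol(generated[-1])))
print(generated)                           # a new sequence of the captured behavior

A concrete output value for each generated symbol would then be sampled from the gradient range that the symbol stands for, as the abstract describes.
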
Ibrahim, Omnia; Yuen, Ivan; van Os, Marjolein; Andreeva, Bistra; Möbius, Bernd

The effect of Lombard speech modifications in different information density contexts Inproceedings

Elektronische Sprachsignalverarbeitung 2021, Tagungsband der 32. Konferenz (Berlin), TUDpress, pp. 185-191, Dresden, 2021.

Speakers adapt their speech to increase clarity in the presence of background noise (Lombard speech) [1, 2]. However, they also modify their speech to be efficient by shortening word duration in more predictable contexts [3]. To meet these two communicative functions, speakers will attempt to resolve any conflicting communicative demands. The present study focuses on how this can be resolved in the acoustic domain. A total of 1520 target CV syllables were annotated and analysed from 38 German speakers in 2 white-noise (no noise vs. -10 dB SNR) and 2 surprisal (H vs. L) contexts. Median fundamental frequency (F0), intensity range, and syllable duration were extracted. Our results revealed effects of both noise and surprisal on syllable duration and intensity range, but only an effect of noise on F0. This might suggest redundant (multi-dimensional) acoustic coding in Lombard speech modification, but not so in surprisal modification.

@inproceedings{Ibrahim2021,
title = {The effect of Lombard speech modifications in different information density contexts},
author = {Omnia Ibrahim and Ivan Yuen and Marjolein van Os and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://www.essv.de/paper.php?id=1117},
year = {2021},
date = {2021},
booktitle = {Elektronische Sprachsignalverarbeitung 2021, Tagungsband der 32. Konferenz (Berlin)},
pages = {185-191},
publisher = {TUDpress},
address = {Dresden},
abstract = {Speakers adapt their speech to increase clarity in the presence of background noise (Lombard speech) [1, 2]. However, they also modify their speech to be efficient by shortening word duration in more predictable contexts [3]. To meet these two communicative functions, speakers will attempt to resolve any conflicting communicative demands. The present study focuses on how this can be resolved in the acoustic domain. A total of 1520 target CV syllables were annotated and analysed from 38 German speakers in 2 white-noise (no noise vs. -10 dB SNR) and 2 surprisal (H vs. L) contexts. Median fundamental frequency (F0), intensity range, and syllable duration were extracted. Our results revealed effects of both noise and surprisal on syllable duration and intensity range, but only an effect of noise on F0. This might suggest redundant (multi-dimensional) acoustic coding in Lombard speech modification, but not so in surprisal modification.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   C1
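
The three acoustic measures analysed in this paper (median F0, intensity range, and syllable duration) can be approximated with standard phonetics tooling. Below is a minimal sketch using the praat-parselmouth package; the file name and the syllable boundaries are placeholders, and the study's exact measurement settings are not reproduced here.

import numpy as np
import parselmouth  # the praat-parselmouth package

def syllable_measures(wav_path, start, end):
    """Median F0 (Hz), intensity range (dB), and duration (s) for one syllable."""
    syllable = parselmouth.Sound(wav_path).extract_part(from_time=start, to_time=end)

    pitch = syllable.to_pitch()
    f0 = pitch.selected_array['frequency']
    f0 = f0[f0 > 0]                                     # discard unvoiced frames (F0 = 0)
    median_f0 = float(np.median(f0)) if f0.size else float('nan')

    intensity = syllable.to_intensity().values.ravel()  # intensity contour in dB
    intensity_range = float(intensity.max() - intensity.min())

    return median_f0, intensity_range, end - start

# Placeholder file and boundaries, e.g., from a forced alignment of the target CV syllable:
print(syllable_measures("speaker01_trial03.wav", start=0.42, end=0.61))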
