Publications

Raveh, Eran

Vocal accommodation in human-computer interaction: modeling and integration into spoken dialogue systems (PhD Thesis)

Saarland University, Saarbrücken, Germany, 2021.

With the rapidly increasing usage of voice-activated devices worldwide, verbal communication with computers is steadily becoming more common. Although speech is the principal natural manner of human communication, it is still challenging for computers, and users have grown accustomed to adjusting their speaking style for computers. Such adjustments occur naturally, and typically unconsciously, in humans during an exchange to control the social distance between the interlocutors and to improve the conversation's efficiency. This phenomenon is called accommodation, and it occurs in various modalities of human communication, such as hand gestures, facial expressions, eye gaze, and lexical and grammatical choices. Vocal accommodation deals with phonetic-level changes occurring in segmental and suprasegmental features. A decrease in the difference between the speakers' feature realizations results in convergence, while an increasing distance leads to divergence. The lack of such mutual adjustments, which humans make naturally, in computers' speech creates a gap between human-human and human-computer interactions. Moreover, voice-activated systems currently speak in exactly the same manner to all users, regardless of their speech characteristics or realizations of specific features. Detecting phonetic variations and generating adaptive speech output would enhance user personalization, offer more human-like communication, and ultimately improve the overall interaction experience. Investigating these aspects of accommodation will thus help to understand and improve human-computer interaction.

This thesis provides a comprehensive overview of the building blocks required for a roadmap toward the integration of accommodation capabilities into spoken dialogue systems. These include human-human and human-computer interaction experiments that examine differences in vocal behaviors, approaches for modeling these empirical findings, methods for introducing phonetic variations into synthesized speech, and a way to combine all these components into an accommodative system. While each component is a wide research field by itself, they depend on each other and should hence be considered jointly. The overarching goal of this thesis is therefore not only to show how each of these aspects can be further developed, but also to demonstrate and motivate the connections between them. Special emphasis is placed throughout the thesis on the temporal aspect of accommodation: humans constantly change their speech over the course of a conversation, so accommodation processes should be treated as continuous, dynamic phenomena. Measuring differences at a few discrete points, e.g., the beginning and end of an interaction, may leave many accommodation events undiscovered or overly smoothed.

To justify the effort of introducing accommodation into computers, it should first be shown that humans exhibit phonetic adjustments when talking to a computer as they do with a human being. As there is no definitive metric for measuring accommodation and evaluating its quality, it is important to study human productions empirically and later use them as references for possible behaviors. In this work, this investigation encompasses different experimental configurations to achieve a fuller picture of accommodation effects. First, vocal accommodation was inspected where it naturally occurs, namely in spontaneous human-human conversations. For this purpose, a collection of real-world sales conversations, each with a different representative-prospect pair, was compiled and analyzed. These conversations offer a glance into accommodation effects in authentic, unscripted interactions with the common goal of negotiating a deal on the one hand, but with each side individually trying to get the best terms on the other. The conversations were analyzed using cross-correlation and time-series techniques to capture the change dynamics over time. Successful conversations proved distinguishable from failed ones by multiple measures. Furthermore, the sales representatives proved better at leading the vocal changes, i.e., making the prospect follow their speech style rather than the other way around. They also showed a stronger tendency to take that lead at an earlier stage, all the more so in successful conversations. The finding that accommodation occurs more with trained speakers and improves their performance fits the anecdotal best practices of sales experts, which are hereby also supported scientifically.
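To make this kind of time-series analysis concrete, here is a minimal sketch, not the thesis's actual procedure, of a windowed cross-correlation between two speakers' feature tracks; the window length, lag range, and the synthetic pitch tracks are illustrative assumptions.

import numpy as np

def windowed_xcorr(a, b, win=20, max_lag=5):
    """Sliding-window normalized cross-correlation between two speakers'
    feature tracks (e.g., per-utterance pitch).  Each output row holds the
    correlations for lags -max_lag..+max_lag within one window, so
    leader/follower dynamics can be traced over the conversation."""
    rows = []
    for start in range(max_lag, len(a) - win - max_lag + 1):
        ref = a[start:start + win]
        ref = (ref - ref.mean()) / (ref.std() + 1e-9)
        row = []
        for lag in range(-max_lag, max_lag + 1):
            seg = b[start + lag:start + lag + win]
            seg = (seg - seg.mean()) / (seg.std() + 1e-9)
            row.append(float(np.mean(ref * seg)))
        rows.append(row)
    return np.array(rows)

# Synthetic check: speaker B trails speaker A by two steps, so the peak
# correlation should sit at a positive lag of about +2.
rng = np.random.default_rng(0)
spk_a = np.cumsum(rng.normal(size=200))
spk_b = np.roll(spk_a, 2) + rng.normal(scale=0.1, size=200)
lags = windowed_xcorr(spk_a, spk_b).argmax(axis=1) - 5
print("median leading lag:", np.median(lags))

A consistently positive lag of one speaker relative to the other is the kind of evidence that identifies who leads the vocal changes.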
Following these results, the next experiment came closer to the final goal of this work and investigated vocal accommodation effects in human-computer interaction. This was done via a shadowing experiment, which offers a controlled setting for examining phonetic variations. As spoken dialogue systems with the accommodation capabilities this work aims at do not yet exist, a simulated system was used to introduce these changes to the participants, who believed they were helping to test a language-learning tutoring system. After determining their preferences concerning three segmental phonetic features, participants listened to either natural or synthesized voices of male and female speakers, which produced the participants' dispreferred variants of the aforementioned features. Accommodation occurred in all cases, but the natural voices triggered stronger effects. Nevertheless, it can be concluded that participants accommodated toward the synthetic voices as well, which means that humans apply social mechanisms also when speaking with computer-based interlocutors.

The shadowing paradigm was also utilized to test whether accommodation is a phenomenon associated only with speech or with other vocal productions as well. To that end, accommodation in the singing of familiar and novel music was examined. Interestingly, accommodation was found in both cases, though in different ways. While participants seemed to use the familiar piece merely as a reference for singing more accurately, the novel piece became the target of complete replication. For example, mostly pitch corrections were introduced in the former case, while in the latter, key and rhythmic patterns were adopted as well. Some of these findings were expected, and they show that people's more salient features are also harder to modify through external auditory influence.

Lastly, a multiparty experiment with spontaneous human-human-computer interactions was carried out to compare accommodation in human-directed and computer-directed speech. The participants solved tasks for which they needed to talk both with a confederate and with an agent. This allows a direct comparison of their speech based on the addressee within the same conversation, which had not been done before. Results show that some participants' vocal behavior changed similarly when talking to the confederate and to the agent, while others' speech varied only with the confederate. Further analysis found that the greatest factor behind this difference was the order in which the participants talked with the interlocutors. Apparently, those who first talked to the agent alone saw it more as a social actor in the conversation, while those who interacted with it after talking to the confederate treated it more as a means to achieve a goal, and thus behaved differently toward it. In the latter case, the variations in the human-directed speech were much more prominent. Differences were also found between the analyzed features, but the task type did not influence the degree of the accommodation effects. The results of these experiments lead to the conclusion that vocal accommodation does occur in human-computer interactions, even if often to lesser degrees.

With the question of whether people accommodate to computer-based interlocutors answered, the next step is to describe accommodative behaviors in a computer-processable manner. Two approaches are proposed here: a computational one and a statistical one. The computational model aims to capture the presumed cognitive process associated with accommodation in humans. This comprises various steps, such as detecting the variable feature's sound, adding instances of it to the feature's mental memory, and determining how much the sound will change while taking into account both its current representation and the external input. Due to its sequential nature, this model was implemented as a pipeline. Each of the pipeline's five steps corresponds to a specific part of the cognitive process and can have one or more parameters to control its output (e.g., the size of the feature's memory or the accommodation pace). Using these parameters, precise accommodative behaviors can be crafted, with expert knowledge motivating the chosen parameter values. These advantages make this approach suitable for experimentation with pre-defined, deterministic behaviors in which each step can be changed individually. Ultimately, this approach makes a system vocally responsive to users' speech input.
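As an illustration of such a parameterized pipeline, the following sketch collapses the described steps into a small class with a bounded feature memory and an accommodation-pace parameter; the class name, the reduced step granularity, and the linear update rule are assumptions made here for brevity, not the thesis's implementation.

from collections import deque

class AccommodationPipeline:
    """Illustrative skeleton of the cognitive-process pipeline: detect the
    variable feature, store instances in a bounded mental memory, and
    decide how far the system's own realization moves."""

    def __init__(self, own_value, memory_size=10, pace=0.2):
        self.memory = deque(maxlen=memory_size)  # feature's "mental memory"
        self.own_value = own_value               # current own realization
        self.pace = pace                         # accommodation pace (0..1)

    def detect(self, signal_value):
        # Extract the target feature's value from the incoming speech.
        return signal_value

    def memorize(self, value):
        # Add the perceived instance to the bounded memory.
        self.memory.append(value)

    def update(self):
        # Move the own realization toward the memory's average, weighing
        # the current representation against the external input.
        if self.memory:
            target = sum(self.memory) / len(self.memory)
            self.own_value += self.pace * (target - self.own_value)
        return self.own_value

    def step(self, signal_value):
        self.memorize(self.detect(signal_value))
        return self.update()

# A converging run: repeated exposure to 320 Hz pulls a 260 Hz baseline up.
pipe = AccommodationPipeline(own_value=260.0, memory_size=5, pace=0.3)
for _ in range(8):
    out = pipe.step(320.0)
print(f"realization after exposure: {out:.1f} Hz")

Setting the pace to zero or inverting its sign would yield non-accommodating or diverging behaviors with the same machinery.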
The second approach grants more evolved behaviors by defining different core behaviors and adding non-deterministic variations on top of them. This resembles human behavioral patterns, as each person has a base way of accommodating (or not accommodating), which may change arbitrarily based on the specific circumstances. This approach offers a data-driven, statistical way to extract accommodation behaviors from a given collection of interactions. First, each speaker's values of the target feature in an interaction are converted into a continuous interpolated line by drawing one sample from the posterior distribution of a Gaussian process conditioned on the given values. Then, the gradients of these lines, which represent rates of mutual change, are used to define discrete levels of change based on their distribution. Finally, each level is assigned a symbol, which ultimately creates a symbol-sequence representation for each interaction. The sequences are clustered so that each cluster stands for a type of behavior. The sequences of a cluster can then be used to calculate n-gram probabilities that enable the generation of new sequences of the captured behavior; the specific output value is sampled from the range corresponding to each generated symbol.
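The following sketch illustrates the first two stages of this approach under simplifying assumptions: a one-dimensional Gaussian-process posterior sample (RBF kernel) turns a handful of observed feature values into a continuous line, whose gradients are then binned into a three-symbol alphabet. The kernel, length scale, and tercile binning are illustrative choices, not the thesis's exact configuration.

import numpy as np

def gp_posterior_sample(x_obs, y_obs, x_new, length=3.0, noise=1e-4, seed=0):
    """Draw one sample from the posterior of a Gaussian process (RBF
    kernel) conditioned on the observed feature values -- the
    'continuous interpolated line' of the statistical approach."""
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)
    K_inv = np.linalg.inv(k(x_obs, x_obs) + noise * np.eye(len(x_obs)))
    Ks = k(x_new, x_obs)
    mean = Ks @ K_inv @ y_obs
    cov = k(x_new, x_new) - Ks @ K_inv @ Ks.T
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov + 1e-9 * np.eye(len(x_new)))

def symbolize(line, symbols="abc"):
    """Bin the line's gradients (rates of change) into discrete levels by
    terciles of their distribution and map each level to a symbol."""
    grad = np.gradient(line)
    edges = np.quantile(grad, [1 / 3, 2 / 3])
    return "".join(symbols[i] for i in np.digitize(grad, edges))

# Five observed feature values become a dense line, then a symbol sequence.
x_obs = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
y_obs = np.array([220.0, 228.0, 225.0, 238.0, 236.0])
line = gp_posterior_sample(x_obs, y_obs, np.linspace(0.0, 8.0, 40))
print(symbolize(line))

Clustering the resulting symbol sequences and estimating n-gram probabilities per cluster would then follow as described above.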
With this approach, accommodation behaviors are extracted directly from data rather than crafted manually. However, it is harder to describe what exactly these behaviors represent and to motivate the use of one of them over another. To bridge the gap between the two approaches, it is also discussed how they can be combined to benefit from the advantages of both. Furthermore, to generate more structured behaviors, a hierarchy of accommodation complexity levels is suggested here, ranging from direct adoption of users' realizations, via specified responsiveness, up to independent core behaviors with non-deterministic variation in production.

Besides a way to track and represent vocal changes, an accommodative system also needs a text-to-speech component that is able to realize those changes in the system's speech output. Speech synthesis models are typically trained once on data with certain characteristics and do not change afterward. This prevents such models from introducing any variation in specific sounds or other phonetic features. Two methods for directly modifying such features are explored here. The first is based on signal modifications applied to the output signal after it has been generated by the system. The processing is done between the timestamps of the target features and uses pre-defined scripts that modify the signal to achieve the desired values. This method is more suitable for continuous features like vowel quality, especially for subtle changes that do not necessarily lead to a categorical sound change. The second method aims to capture phonetic variations in the training data. To that end, a training corpus with phonemic representations is used instead of the usual graphemic representations. This way, the model can learn more direct relations between phonemes and sounds rather than between surface forms and sounds, which, depending on the language, may be more complex and depend on the surrounding letters. The target variations themselves do not necessarily need to be explicitly present in the training data, as long as the different sounds are naturally distinguishable. At generation time, the current state of the target feature determines the phoneme used to generate the desired sound. This method is suitable for categorical changes, especially for contrasts that naturally exist in the language. While both methods have certain limitations, they provide a proof of concept for the idea that spoken dialogue systems can phonetically adapt their speech output in real time and without re-training their text-to-speech models. Sketches of both methods follow below.
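For the first method, a minimal sketch of the mechanics, assuming the feature's start and end timestamps are known: a pre-defined "script" (here just an amplitude change standing in for a real formant- or duration-modifying routine) is applied only between the timestamps, with short crossfades so the splice stays inaudible.

import numpy as np

def modify_between(signal, sr, start_s, end_s, script, fade_ms=5.0):
    """Apply a modification 'script' only to the samples between two
    timestamps of the target feature.  The script is any function mapping
    a signal slice to an equally long modified slice; a real system would
    plug in e.g. a formant-shifting routine at this point."""
    out = signal.copy()
    i, j = int(start_s * sr), int(end_s * sr)
    modified = script(signal[i:j])
    fade = np.linspace(0.0, 1.0, max(2, int(sr * fade_ms / 1000)))
    ramp = np.ones(j - i)
    ramp[:len(fade)] = fade            # fade the edit in ...
    ramp[-len(fade):] = fade[::-1]     # ... and out, to avoid clicks
    out[i:j] = ramp * modified + (1.0 - ramp) * signal[i:j]
    return out

# Toy usage: soften a "vowel" spanning 0.30-0.45 s of a synthetic tone.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
edited = modify_between(tone, sr, 0.30, 0.45, lambda seg: 0.6 * seg)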
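For the second method, the generation-time decision can be as simple as resolving a variable slot in the phonemic input sequence; the toy lexicon, feature states, and phoneme symbols below are invented for illustration.

# With a phoneme-level TTS front end, the tracked state of a target
# feature selects which phoneme symbol enters the input sequence.
LEXICON = {"bath": ["b", "{A}", "T"]}    # "{A}" marks the variable slot
VARIANTS = {"trap": "ae", "palm": "A:"}  # two categorical realizations

def phonemize(word, feature_state):
    """Resolve the variable slot according to the current feature state,
    yielding the phoneme sequence handed to the synthesis model."""
    return [VARIANTS[feature_state] if p == "{A}" else p
            for p in LEXICON[word]]

print(phonemize("bath", "trap"))   # ['b', 'ae', 'T']
print(phonemize("bath", "palm"))   # ['b', 'A:', 'T']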
To combine the behavior definitions and the speech manipulations, a system is required that connects these elements into a complete accommodation capability. The architecture suggested here extends the standard spoken dialogue system with an additional module, which receives the transcribed speech signal from the speech recognition component without influencing the input to the language understanding component. While the language understanding component uses only the textual transcription to determine the user's intention, the added component processes the raw signal along with its phonetic transcription. In this extended architecture, the accommodation model is activated in the added module, and the information required for speech manipulation is sent to the text-to-speech component. The text-to-speech component thus now has two inputs, viz. the content of the system's response coming from the language generation component, and the states of the defined target features coming from the added component.
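The data flow of this extended architecture might be sketched as follows; the component names and payload fields are assumptions for illustration, not the system's actual interfaces.

from dataclasses import dataclass

@dataclass
class AsrOutput:
    text: str           # transcription, forwarded to NLU as before
    phones: list        # phonetic transcription, for the added module
    signal: bytes       # raw audio, for feature measurement

@dataclass
class TtsRequest:
    response_text: str  # from the language generation component
    feature_states: dict  # target-feature states from the added module

def dialogue_turn(asr: AsrOutput, nlu, dm, nlg, accommodation, tts):
    # Standard path: text only, unchanged by the extension.
    intent = nlu(asr.text)
    # Added parallel path: the accommodation module taps the recognizer's
    # output without touching the NLU input.
    states = accommodation(asr.signal, asr.phones)
    # Text-to-speech now receives two inputs.
    response = nlg(dm(intent))
    return tts(TtsRequest(response, states))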
An implementation of a web-based system with this architecture is introduced here, and its functionality is showcased by demonstrating how it can be used to conduct a shadowing experiment automatically. This has two main advantages. First, since the system recognizes the participants' phonetic variations and automatically selects the appropriate variant to use in its response, the experimenter saves time and avoids manual annotation errors. The experimenter also automatically gains additional information, like exact timestamps of utterances, real-time visualization of the interlocutors' productions, and the possibility to replay and analyze the interaction after the experiment is finished. The second advantage is scalability: multiple instances of the system can run on a server and be accessed by multiple clients at the same time. This not only saves the time and logistics of bringing participants into a lab, but also allows running the experiment with different configurations (e.g., other parameter values or target features) in a controlled and reproducible way.

This completes a full cycle from examining human behaviors to integrating accommodation capabilities. Though each part of it can undoubtedly be investigated further, the emphasis here is on how the parts depend on and connect to each other. Measuring changes in features without showing how they can be modeled, or achieving flexible speech synthesis without considering the desired final output, might not lead to the final goal of introducing accommodation capabilities into computers. Treating accommodation in human-computer interaction as one large process rather than as isolated sub-problems lays the ground for more comprehensive and complete solutions in the future.


@phdthesis{Raveh_Diss_2021,
title = {Vocal accommodation in human-computer interaction: modeling and integration into spoken dialogue systems},
author = {Eran Raveh},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/31960},
doi = {10.22028/D291-34889},
year = {2021},
date = {2021-12-07},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {With the rapidly increasing usage of voice-activated devices worldwide, verbal communication with computers is steadily becoming more common. Although speech is the principal natural manner of human communication, it is still challenging for computers, and users had been growing accustomed to adjusting their speaking style for computers. Such adjustments occur naturally, and typically unconsciously, in humans during an exchange to control the social distance between the interlocutors and improve the conversation’s efficiency. This phenomenon is called accommodation and it occurs on various modalities in human communication, like hand gestures, facial expressions, eye gaze, lexical and grammatical choices, and others. Vocal accommodation deals with phonetic-level changes occurring in segmental and suprasegmental features. A decrease in the difference between the speakers’ feature realizations results in convergence, while an increasing distance leads to divergence. The lack of such mutual adjustments made naturally by humans in computers’ speech creates a gap between human-human and human-computer interactions. Moreover, voice-activated systems currently speak in exactly the same manner to all users, regardless of their speech characteristics or realizations of specific features. Detecting phonetic variations and generating adaptive speech output would enhance user personalization, offer more human-like communication, and ultimately should improve the overall interaction experience. Thus, investigating these aspects of accommodation will help to understand and improving human-computer interaction. This thesis provides a comprehensive overview of the required building blocks for a roadmap toward the integration of accommodation capabilities into spoken dialogue systems. These include conducting human-human and human-computer interaction experiments to examine the differences in vocal behaviors, approaches for modeling these empirical findings, methods for introducing phonetic variations in synthesized speech, and a way to combine all these components into an accommodative system. While each component is a wide research field by itself, they depend on each other and hence should be jointly considered. The overarching goal of this thesis is therefore not only to show how each of the aspects can be further developed, but also to demonstrate and motivate the connections between them. A special emphasis is put throughout the thesis on the importance of the temporal aspect of accommodation. Humans constantly change their speech over the course of a conversation. Therefore, accommodation processes should be treated as continuous, dynamic phenomena. Measuring differences in a few discrete points, e.g., beginning and end of an interaction, may leave many accommodation events undiscovered or overly smoothed. To justify the effort of introducing accommodation in computers, it should first be proven that humans even show any phonetic adjustments when talking to a computer as they do with a human being. As there is no definitive metric for measuring accommodation and evaluating its quality, it is important to empirically study humans productions to later use as references for possible behaviors. In this work, this investigation encapsulates different experimental configurations to achieve a better picture of accommodation effects. First, vocal accommodation was inspected where it naturally occurs, namely in spontaneous human-human conversations. 
For this purpose, a collection of real-world sales conversations, each with a different representative-prospect pair, was collected and analyzed. These conversations offer a glance into accommodation effects in authentic, unscripted interactions with the common goal of negotiating a deal on the one hand, but with the individual facet of each side of trying to get the best terms on the other hand. The conversations were analyzed using cross-correlation and time series techniques to capture the change dynamics over time. It was found that successful conversations are distinguishable from failed ones by multiple measures. Furthermore, the sales representative proved to be better at leading the vocal changes, i.e., making the prospect follow their speech styles rather than the other way around. They also showed a stronger tendency to take that lead at an earlier stage, all the more so in successful conversations. The fact that accommodation occurs more by trained speakers and improves their performances fits anecdotal best practices of sales experts, which are now also proven scientifically. Following these results, the next experiment came closer to the final goal of this work and investigated vocal accommodation effects in human-computer interaction. This was done via a shadowing experiment, which offers a controlled setting for examining phonetic variations. As spoken dialogue systems with such accommodation capabilities (like this work aims to achieve) do not exist yet, a simulated system was used to introduce these changes to the participants, who believed they help with the testing of a language learning tutoring system. After determining their preference concerning three segmental phonetic features, participants were listen-ing to either natural or synthesized voices of male and female speakers, which produced the participants’ dispreferred variation of the aforementioned features. Accommodation occurred in all cases, but the natural voices triggered stronger effects. Nevertheless, it can be concluded that participants were accommodating toward synthetic voices as well, which means that social mechanisms are applied in humans also when speaking with computer-based interlocutors. The shadowing paradigm was utilized also to test whether accommodation is a phenomenon associated only with speech or with other vocal productions as well. To that end, accommodation in the singing of familiar and novel music was examined. Interestingly, accommodation was found in both cases, though in different ways. While participants seemed to use the familiar piece merely as a reference for singing more accurately, the novel piece became the goal for complete replicate. For example, one difference was that mostly pitch corrections were introduced in the former case, while in the latter also key and rhythmic patterns were adopted. Some of those findings were expected and they show that people’s more salient features are also harder to modify using external auditory influence. Lastly, a multiparty experiment with spontaneous human-human-computer interactions was carried out to compare accommodation in human-directed and computer-directed speech. The participants solved tasks for which they needed to talk both with a confederate and with an agent. This allows a direct comparison of their speech based on the addressee within the same conversation, which has not been done so far. 
Results show that some participants’ vocal behavior changed similarly when talking to the confederate and the agent, while others’ speech varied only with the confederate. Further analysis found that the greatest factor for this difference was the order in which the participants talked with the interlocutors. Apparently, those who first talked to the agent alone saw it more as a social actor in the conversation, while those who interacted with it after talking to the confederate treated it more as a means to achieve a goal, and thus behaved differently with it. In the latter case, the variations in the human-directed speech were much more prominent. Differences were also found between the analyzed features, but the task type did not influence the degree of accommodation effects. The results of these experiments lead to the conclusion that vocal accommodation does occur in human-computer interactions, even if often to lesser degrees. With the question of whether people accommodate to computer-based interlocutors as well answered, the next step would be to describe accommodative behaviors in a computer-processable manner. Two approaches are proposed here: computational and statistical. The computational model aims to capture the presumed cognitive process associated with accommodation in humans. This comprises various steps, such as detecting the variable feature’s sound, adding instances of it to the feature’s mental memory, and determining how much the sound will change while taking into account both its current representation and the external input. Due to its sequential nature, this model was implemented as a pipeline. Each of the pipeline’s five steps corresponds to a specific part of the cognitive process and can have one or more parameters to control its output (e.g., the size of the feature’s memory or the accommodation pace). Using these parameters, precise accommodative behaviors can be crafted while applying expert knowledge to motivate the chosen parameter values. These advantages make this approach suitable for experimentation with pre-defined, deterministic behaviors where each step can be changed individually. Ultimately, this approach makes a system vocally responsive to users’ speech input. The second approach grants more evolved behaviors, by defining different core behaviors and adding non-deterministic variations on top of them. This resembles human behavioral patterns, as each person has a base way of accommodating (or not accommodating), which may arbitrarily change based on the specific circumstances. This approach offers a data-driven statistical way to extract accommodation behaviors from a given collection of interactions. First, the target feature’s values of each speaker in an interaction are converted into continuous interpolated lines by drawing one sample from the posterior distribution of a Gaussian process conditioned on the given values. Then, the gradients of these lines, which represent rates of mutual change, are used to defined discrete levels of change based on their distribution. Finally, each level is assigned a symbol, which ultimately creates a symbol sequence representation for each interaction. The sequences are clustered so that each cluster stands for a type of behavior. The sequences of a cluster can then be used to calculate n-gram probabilities that enable the generation of new sequences of the captured behavior. The specific output value is sampled from the range corresponding to the generated symbol. 
With this approach, accommodation behaviors are extracted directly from data, as opposed to manually crafting them. However, it is harder to describe what exactly these behaviors represent and motivate the use of one of them over the other. To bridge this gap between these two approaches, it is also discussed how they can be combined to benefit from the advantages of both. Furthermore, to generate more structured behaviors, a hierarchy of accommodation complexity levels is suggested here, from a direct adoption of users’ realizations, via specified responsiveness, and up to independent core behaviors with non-deterministic variational productions. Besides a way to track and represent vocal changes, an accommodative system also needs a text-to-speech component that is able to realize those changes in the system’s speech output. Speech synthesis models are typically trained once on data with certain characteristics and do not change afterward. This prevents such models from introducing any variation in specific sounds and other phonetic features. Two methods for directly modifying such features are explored here. The first is based on signal modifications applied to the output signal after it was generated by the system. The processing is done between the timestamps of the target features and uses pre-defined scripts that modify the signal to achieve the desired values. This method is more suitable for continuous features like vowel quality, especially in the case of subtle changes that do not necessarily lead to a categorical sound change. The second method aims to capture phonetic variations in the training data. To that end, a training corpus with phonemic representations is used, as opposed to the regular graphemic representations. This way, the model can learn more direct relations between phonemes and sound instead of surface forms and sound, which, depending on the language, might be more complex and depend on their surrounding letters. The target variations themselves don’t necessarily need to be explicitly present in the training data, all time the different sounds are naturally distinguishable. In generation time, the current target feature’s state determines the phoneme to use for generating the desired sound. This method is suitable for categorical changes, especially for contrasts that naturally exist in the language. While both methods have certain limitations, they provide a proof of concept for the idea that spoken dialogue systems may phonetically adapt their speech output in real-time and without re-training their text-to-speech models. To combine the behavior definitions and the speech manipulations, a system is required, which can connect these elements to create a complete accommodation capability. The architecture suggested here extends the standard spoken dialogue system with an additional module, which receives the transcribed speech signal from the speech recognition component without influencing the input to the language understanding component. While language the understanding component uses only textual transcription to determine the user’s intention, the added component process the raw signal along with its phonetic transcription. In this extended architecture, the accommodation model is activated in the added module and the information required for speech manipulation is sent to the text-to-speech component. However, the text-to-speech component now has two inputs, viz. 
the content of the system’s response coming from the language generation component and the states of the defined target features from the added component. An implementation of a web-based system with this architecture is introduced here, and its functionality is showcased by demonstrating how it can be used to conduct a shadowing experiment automatically. This has two main advantage: First, since the system recognizes the participants’ phonetic variations and automatically selects the appropriate variation to use in its response, the experimenter saves time and prevents manual annotation errors. The experimenter also automatically gains additional information, like exact timestamps of utterances, real-time visualization of the interlocutors’ productions, and the possibility to replay and analyze the interaction after the experiment is finished. The second advantage is scalability. Multiple instances of the system can run on a server and be accessed by multiple clients at the same time. This not only saves time and the logistics of bringing participants into a lab, but also allows running the experiment with different configurations (e.g., other parameter values or target features) in a controlled and reproducible way. This completes a full cycle from examining human behaviors to integrating accommodation capabilities. Though each part of it can undoubtedly be further investigated, the emphasis here is on how they depend and connect to each other. Measuring changes features without showing how they can be modeled or achieving flexible speech synthesis without considering the desired final output might not lead to the final goal of introducing accommodation capabilities into computers. Treating accommodation in human-computer interaction as one large process rather than isolated sub-problems lays the ground for more comprehensive and complete solutions in the future.


Heutzutage wird die verbale Interaktion mit Computern immer gebr{\"a}uchlicher, was der rasant wachsenden Anzahl von sprachaktivierten Ger{\"a}ten weltweit geschuldet ist. Allerdings stellt die computerseitige Handhabung gesprochener Sprache weiterhin eine gro{\ss}e Herausforderung dar, obwohl sie die bevorzugte Art zwischenmenschlicher Kommunikation repr{\"a}sentiert. Dieser Umstand führt auch dazu, dass Benutzer ihren Sprachstil an das jeweilige Ger{\"a}t anpassen, um diese Handhabung zu erleichtern. Solche Anpassungen kommen in menschlicher gesprochener Sprache auch in der zwischenmenschlichen Kommunikation vor. {\"U}blicherweise ereignen sie sich unbewusst und auf natürliche Weise w{\"a}hrend eines Gespr{\"a}chs, etwa um die soziale Distanz zwischen den Gespr{\"a}chsteilnehmern zu kontrollieren oder um die Effizienz des Gespr{\"a}chs zu verbessern. Dieses Ph{\"a}nomen wird als Akkommodation bezeichnet und findet auf verschiedene Weise w{\"a}hrend menschlicher Kommunikation statt. Sie {\"a}u{\ss}ert sich zum Beispiel in der Gestik, Mimik, Blickrichtung oder aber auch in der Wortwahl und dem verwendeten Satzbau. Vokal- Akkommodation besch{\"a}ftigt sich mit derartigen Anpassungen auf phonetischer Ebene, die sich in segmentalen und suprasegmentalen Merkmalen zeigen. Werden Auspr{\"a}gungen dieser Merkmale bei den Gespr{\"a}chsteilnehmern im Laufe des Gespr{\"a}chs {\"a}hnlicher, spricht man von Konvergenz, vergr{\"o}{\ss}ern sich allerdings die Unterschiede, so wird dies als Divergenz bezeichnet. Dieser natürliche gegenseitige Anpassungsvorgang fehlt jedoch auf der Seite des Computers, was zu einer Lücke in der Mensch-Maschine-Interaktion führt. Darüber hinaus verwenden sprachaktivierte Systeme immer dieselbe Sprachausgabe und ignorieren folglich etwaige Unterschiede zum Sprachstil des momentanen Benutzers. Die Erkennung dieser phonetischen Abweichungen und die Erstellung von anpassungsf{\"a}higer Sprachausgabe würden zur Personalisierung dieser Systeme beitragen und k{\"o}nnten letztendlich die insgesamte Benutzererfahrung verbessern. Aus diesem Grund kann die Erforschung dieser Aspekte von Akkommodation helfen, Mensch-Maschine-Interaktion besser zu verstehen und weiterzuentwickeln. Die vorliegende Dissertation stellt einen umfassenden {\"U}berblick zu Bausteinen bereit, die n{\"o}tig sind, um Akkommodationsf{\"a}higkeiten in Sprachdialogsysteme zu integrieren. In diesem Zusammenhang wurden auch interaktive Mensch-Mensch- und Mensch- Maschine-Experimente durchgeführt. In diesen Experimenten wurden Differenzen der vokalen Verhaltensweisen untersucht und Methoden erforscht, wie phonetische Abweichungen in synthetische Sprachausgabe integriert werden k{\"o}nnen. Um die erhaltenen Ergebnisse empirisch auswerten zu k{\"o}nnen, wurden hierbei auch verschiedene Modellierungsans{\"a}tze erforscht. Fernerhin wurde der Frage nachgegangen, wie sich die betreffenden Komponenten kombinieren lassen, um ein Akkommodationssystem zu konstruieren. Jeder dieser Aspekte stellt für sich genommen bereits einen überaus breiten Forschungsbereich dar. Allerdings sind sie voneinander abh{\"a}ngig und sollten zusammen betrachtet werden. Aus diesem Grund liegt ein übergreifender Schwerpunkt dieser Dissertation darauf, nicht nur aufzuzeigen, wie sich diese Aspekte weiterentwickeln lassen, sondern auch zu motivieren, wie sie zusammenh{\"a}ngen. 
Ein weiterer Schwerpunkt dieser Arbeit befasst sich mit der zeitlichen Komponente des Akkommodationsprozesses, was auf der Beobachtung fu{\ss}t, dass Menschen im Laufe eines Gespr{\"a}chs st{\"a}ndig ihren Sprachstil {\"a}ndern. Diese Beobachtung legt nahe, derartige Prozesse als kontinuierliche und dynamische Prozesse anzusehen. Fasst man jedoch diesen Prozess als diskret auf und betrachtet z.B. nur den Beginn und das Ende einer Interaktion, kann dies dazu führen, dass viele Akkommodationsereignisse unentdeckt bleiben oder überm{\"a}{\ss}ig gegl{\"a}ttet werden. Um die Entwicklung eines vokalen Akkommodationssystems zu rechtfertigen, muss zuerst bewiesen werden, dass Menschen bei der vokalen Interaktion mit einem Computer ein {\"a}hnliches Anpassungsverhalten zeigen wie bei der Interaktion mit einem Menschen. Da es keine eindeutig festgelegte Metrik für das Messen des Akkommodationsgrades und für die Evaluierung der Akkommodationsqualit{\"a}t gibt, ist es besonders wichtig, die Sprachproduktion von Menschen empirisch zu untersuchen, um sie als Referenz für m{\"o}gliche Verhaltensweisen anzuwenden. In dieser Arbeit schlie{\ss}t diese Untersuchung verschiedene experimentelle Anordnungen ein, um einen besseren {\"U}berblick über Akkommodationseffekte zu erhalten. In einer ersten Studie wurde die vokale Akkommodation in einer Umgebung untersucht, in der sie natürlich vorkommt: in einem spontanen Mensch-Mensch Gespr{\"a}ch. Zu diesem Zweck wurde eine Sammlung von echten Verkaufsgespr{\"a}chen gesammelt und analysiert, wobei in jedem dieser Gespr{\"a}che ein anderes Handelsvertreter-Neukunde Paar teilgenommen hatte. Diese Gespr{\"a}che verschaffen einen Einblick in Akkommodationseffekte w{\"a}hrend spontanen authentischen Interaktionen, wobei die Gespr{\"a}chsteilnehmer zwei Ziele verfolgen: zum einen soll ein Gesch{\"a}ft verhandelt werden, zum anderen m{\"o}chte aber jeder Teilnehmer für sich die besten Bedingungen aushandeln. Die Konversationen wurde durch das Kreuzkorrelation-Zeitreihen-Verfahren analysiert, um die dynamischen {\"A}nderungen im Zeitverlauf zu erfassen. Hierbei kam zum Vorschein, dass sich erfolgreiche Konversationen von fehlgeschlagenen Gespr{\"a}chen deutlich unterscheiden lassen. {\"U}berdies wurde festgestellt, dass die Handelsvertreter die treibende Kraft von vokalen {\"A}nderungen sind, d.h. sie k{\"o}nnen die Neukunden eher dazu zu bringen, ihren Sprachstil anzupassen, als andersherum. Es wurde auch beobachtet, dass sie diese Akkommodation oft schon zu einem frühen Zeitpunkt ausl{\"o}sen, was besonders bei erfolgreichen Gespr{\"a}chen beobachtet werden konnte. Dass diese Akkommodation st{\"a}rker bei trainierten Sprechern ausgel{\"o}st wird, deckt sich mit den meist anekdotischen Empfehlungen von erfahrenen Handelsvertretern, die bisher nie wissenschaftlich nachgewiesen worden sind. Basierend auf diesen Ergebnissen besch{\"a}ftigte sich die n{\"a}chste Studie mehr mit dem Hauptziel dieser Arbeit und untersuchte Akkommodationseffekte bei Mensch-Maschine-Interaktionen. Diese Studie führte ein Shadowing-Experiment durch, das ein kontrolliertes Umfeld für die Untersuchung phonetischer Abweichungen anbietet. Da Sprachdialogsysteme mit solchen Akkommodationsf{\"a}higkeiten noch nicht existieren, wurde stattdessen ein simuliertes System eingesetzt, um diese Akkommodationsprozesse bei den Teilnehmern auszul{\"o}sen, wobei diese im Glauben waren, ein Sprachlernsystem zu testen. 
After their preferences regarding three segmental features had been determined, the participants listened to either natural or synthetic voices of male and female speakers who produced the non-preferred variants of the features in question. Accommodation occurred in all cases, although the natural voices triggered stronger effects. It can nevertheless be concluded that participants also oriented themselves toward the synthetic voices, which means that humans apply social mechanisms when talking to computers as well. The shadowing paradigm was also used to test whether accommodation is a phenomenon associated only with speech or whether it also occurs in other vocal activities. To this end, accommodation was examined in singing along to familiar and unfamiliar music. Interestingly, accommodation effects were measured in both cases, albeit in different ways. Whereas the participants seemed to use the familiar piece merely as a reference for more accurate singing, the novel piece became the target of a complete imitation. One difference was, for example, that in the former case mainly pitch corrections were made, while in the latter case key and rhythm patterns were adopted as well. Some of these results were expected and show that the more salient features of humans are also harder to modify through external auditory influence. Finally, a multiparty experiment with spontaneous human-human-computer interactions was conducted in order to compare accommodation in human-directed and computer-directed speech. The participants solved tasks for which they had to talk to both a confederate and an agent. This allows a direct comparison of their speech depending on the addressee within the same conversation, which had not been explored before. The results show that the vocal behavior of some participants changed similarly when talking to the confederate and to the agent, while the speech of other participants varied only with the confederate. Further analyses revealed that the biggest factor behind this difference was the order in which the participants spoke to the interlocutors. Apparently, participants who first talked to the agent alone regarded it more as a social actor in the conversation, whereas those who first interacted with the confederate viewed it more as a means to achieve a goal and therefore behaved differently. In the latter case, the variation in human-directed speech was much more pronounced. Differences were also found between the analyzed features, but the task type had no influence on the degree of the accommodation effects. The results of these experiments support the conclusion that vocal accommodation does occur in human-computer interactions, albeit often to a lesser degree. Now that there is confirmation that humans also display accommodation behavior when interacting with computers, the natural next step is to describe this behavior in a computational way. Two approaches are proposed here: one based on a computational model and one based on a statistical model.
The goal of the computational model is to capture the presumed cognitive process associated with accommodation in humans. This comprises several steps, e.g., detecting the sound of the variable feature, adding instances of it to the feature's mental memory, and determining how strongly the feature changes, taking into account both its current representation and the external input. Owing to its sequential nature, this model was implemented as a pipeline. Each of the pipeline's five steps corresponds to a specific part of the cognitive process and may expose one or more parameters to control its output (e.g., the size of the feature's memory or the accommodation speed). With the help of these parameters, precise accommodative behaviors can be created, together with expert knowledge to motivate the selected parameter values. These advantages make this approach particularly suitable for experimenting with predefined, deterministic behaviors in which each step can be modified individually. Ultimately, this approach makes a system vocally responsive to users' speech input. The second approach affords more advanced behaviors by defining different core behaviors and adding non-deterministic variation. This resembles human behavior patterns, since every person has a basic type of accommodation behavior that can change arbitrarily depending on the specific circumstances. This approach offers a data-driven statistical method for extracting accommodation behavior from a given collection of interactions. First, the values of each speaker's target feature in an interaction are converted into continuous interpolated lines by drawing a sample from the posterior distribution of a Gaussian process conditioned on the given values. Then the gradients of these lines, which represent the mutual rates of change, are used to define discrete change levels based on their distributions. Finally, each level is assigned a symbol, which ultimately yields a symbol-sequence representation for each interaction. The sequences are clustered, so that each cluster stands for one type of behavior. The sequences of a cluster can then be used to compute n-gram probabilities, which enable the generation of new sequences of the captured behavior. The specific output value is sampled from the range corresponding to the generated symbol. In this approach, the accommodation behavior is extracted directly from data instead of being crafted manually. However, it can be difficult to describe what exactly each behavior represents and to motivate the use of one of them over another. To close the gap between these two approaches, it is also discussed how they could be combined in order to benefit from the advantages of both.
Furthermore, in order to generate more structured behaviors, a hierarchy of accommodation complexity levels is proposed here, ranging from a direct adoption of the user's realizations, via a certain sensitivity to change, up to independent core behaviors with non-deterministic variation production. Besides the ability to track and represent vocal changes, an accommodative system also needs a text-to-speech component that can realize these changes in the system's speech output. Speech synthesis models are usually trained once on data with particular characteristics and do not change afterwards. This prevents such models from generating variation in specific sounds and other phonetic features. Two methods for changing such features directly are examined here. The first is based on signal processing applied to the output signal after it has been generated by the system. The processing takes place between the timestamps of the target features and uses predefined scripts that modify the signal to reach the desired values. This method is better suited to continuous features such as vowel quality, especially for subtle changes that do not necessarily lead to a categorical change of the sound. The second method aims to capture phonetic variation in the training data. To this end, a training corpus with phonemic representations is used, in contrast to the usual graphemic representations. In this way, the model can learn more direct relationships between phonemes and sound, instead of between surface forms and sound, which, depending on the language, can be more complex and dependent on the surrounding letters. The target variations themselves do not necessarily have to be explicitly contained in the training data, as long as the different sounds are always naturally distinguishable. In the generation phase, the state of the current target feature determines the phoneme that should be used to produce the desired sound. This method is suitable for categorical changes, especially for contrasts that are naturally distinguished in the language. Although both methods clearly have their limitations, they provide a proof of concept for the idea that spoken dialogue systems can phonetically adapt their speech output in real time without retraining their text-to-speech models. To combine the behavior definitions and the speech manipulation, a system is required that can connect these elements into a complete accommodation-capable system. The architecture proposed here extends the standard flow of spoken dialogue systems with an additional module that receives the transcribed speech signal from the speech recognition component without affecting the input to the language understanding component. While the language understanding component uses only the text transcription to determine the user's intent, the added component processes the raw signal together with its phonetic transcription.
In this extended architecture, the accommodation model is activated in the added module, and the information required for the speech manipulation is sent to the text-to-speech component. The text-to-speech component now has two inputs, namely the content of the system response, which comes from the language generation component, and the states of the defined target features from the added component. An implementation of a web-based system with this architecture is presented here, and its functionality was demonstrated in a showcase scenario in which it is used to run a shadowing experiment automatically. This has two main advantages: first, the experimenter saves time and avoids manual annotation errors, since the system detects the participants' phonetic variations and automatically selects the appropriate variant for the feedback. The experimenter also automatically receives additional information, such as exact timestamps of the utterances, real-time visualization of the interlocutors' productions, and the possibility to replay and analyze the interaction after the experiment has been completed. The second advantage is scalability. Several instances of the system can run on one server and be accessed by several clients simultaneously. This not only saves the time and logistics of bringing participants into a laboratory, but also enables the controlled and reproducible execution of experiments with different configurations (e.g., other parameter values or target features). This closes a complete cycle, from the investigation of human behavior to the integration of accommodation capabilities. Although each part of it can undoubtedly be investigated further, the focus here is on how they depend on each other and can be combined with one another. Measuring feature changes without showing how they can be modeled, or achieving flexible speech synthesis without considering the desired final output, may not lead to the ultimate goal of integrating accommodation capabilities into computers. By treating vocal accommodation in human-computer interaction as one large process rather than as a collection of isolated subproblems, this dissertation lays a foundation for more comprehensive and more complete solutions in the future.},
pubstate = {published},
type = {phdthesis}
}
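The data-driven behavior-extraction pipeline summarized in the thesis abstract above (a posterior sample from a Gaussian process, gradient discretization into symbols, n-gram statistics over symbol sequences) can be illustrated with a short sketch. This is not the author's implementation: the RBF kernel, the number of levels, the grid size, and the reduction of the n-gram step to smoothed bigrams are assumptions made for brevity, and the clustering step is omitted.

# Illustrative sketch only; inputs are NumPy arrays of utterance times and
# feature values (e.g., median F0 per utterance) for one speaker.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def to_symbol_sequence(times, values, n_levels=5, n_grid=200, seed=0):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=5.0), normalize_y=True)
    gp.fit(times.reshape(-1, 1), values)
    grid = np.linspace(times.min(), times.max(), n_grid).reshape(-1, 1)
    # One posterior sample yields a continuous interpolated line.
    line = gp.sample_y(grid, n_samples=1, random_state=seed).ravel()
    grad = np.gradient(line, grid.ravel())            # rate of change over time
    # Discretize gradients into levels via their empirical quantiles.
    edges = np.quantile(grad, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.digitize(grad, edges)                   # symbols 0..n_levels-1

def bigram_probs(sequences, n_levels=5):
    # Stand-in for the n-gram step: add-one-smoothed bigram probabilities
    # estimated from the symbol sequences of one behavior cluster.
    counts = np.ones((n_levels, n_levels))
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

t = np.linspace(0, 60, 30)                            # synthetic example input
v = 120 + 10 * np.sin(t / 10) + np.random.default_rng(1).normal(0, 2, t.size)
print(bigram_probs([to_symbol_sequence(t, v)]))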


Project:   C1

Ibrahim, Omnia; Yuen, Ivan; van Os, Marjolein; Andreeva, Bistra; Möbius, Bernd

The effect of Lombard speech modifications in different information density contexts Inproceedings

Elektronische Sprachsignalverarbeitung 2021, Tagungsband der 32. Konferenz (Berlin), TUDpress, pp. 185-191, Dresden, 2021.

Speakers adapt their speech to increase clarity in the presence of background noise (Lombard speech) [1, 2]. However, they also modify their speech to be efficient by shortening word duration in more predictable contexts [3]. To meet these two communicative functions, speakers will attempt to resolve any conflicting communicative demands. The present study focuses on how this can be resolved in the acoustic domain. A total of 1520 target CV syllables were annotated and analysed from 38 German speakers in 2 white-noise (no noise vs. -10 dB SNR) and 2 surprisal (H vs. L) contexts. Median fundamental frequency (F0), intensity range, and syllable duration were extracted. Our results revealed effects of both noise and surprisal on syllable duration and intensity range, but only an effect of noise on F0. This might suggest redundant (multi-dimensional) acoustic coding in Lombard speech modification, but not so in surprisal modification.
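As a minimal sketch of how the three acoustic measures named above (median F0, intensity range, syllable duration) could be extracted from an annotated syllable interval, the snippet below uses the praat-parselmouth library; the file name and interval times in the usage comment are hypothetical placeholders, not the authors' pipeline.

import numpy as np
import parselmouth

def syllable_measures(wav_path, t0, t1):
    # Median F0 (Hz), intensity range (dB), and duration (s) of [t0, t1].
    snd = parselmouth.Sound(wav_path)
    syll = snd.extract_part(from_time=t0, to_time=t1)
    f0 = syll.to_pitch().selected_array["frequency"]
    f0 = f0[f0 > 0]                                   # voiced frames only
    median_f0 = float(np.median(f0)) if f0.size else float("nan")
    intensity = syll.to_intensity().values.ravel()
    return median_f0, float(intensity.max() - intensity.min()), t1 - t0

# e.g. syllable_measures("recording.wav", 1.20, 1.45)  # hypothetical values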

@inproceedings{Ibrahim2021,
title = {The effect of Lombard speech modifications in different information density contexts},
author = {Omnia Ibrahim and Ivan Yuen and Marjolein van Os and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://www.essv.de/paper.php?id=1117},
year = {2021},
date = {2021},
booktitle = {Elektronische Sprachsignalverarbeitung 2021, Tagungsband der 32. Konferenz (Berlin)},
pages = {185-191},
publisher = {TUDpress},
address = {Dresden},
abstract = {Speakers adapt their speech to increase clarity in the presence of background noise (Lombard speech) [1, 2]. However, they also modify their speech to be efficient by shortening word duration in more predictable contexts [3]. To meet these two communicative functions, speakers will attempt to resolve any conflicting communicative demands. The present study focuses on how this can be resolved in the acoustic domain. A total of 1520 target CV syllables were annotated and analysed from 38 German speakers in 2 white-noise (no noise vs. -10 dB SNR) and 2 surprisal (H vs. L) contexts. Median fundamental frequency (F0), intensity range, and syllable duration were extracted. Our results revealed effects of both noise and surprisal on syllable duration and intensity range, but only an effect of noise on F0. This might suggest redundant (multi-dimensional) acoustic coding in Lombard speech modification, but not so in surprisal modification.},
pubstate = {published},
type = {inproceedings}
}


Project:   C1

Kudera, Jacek; van Os, Marjolein; Möbius, Bernd

Natural and synthetic speech comprehension in simulated tonal and pulsatile tinnitus: A pilot study Inproceedings

Elektronische Sprachsignalverarbeitung 2021, Tagungsband der 32. Konferenz (Berlin), TUDpress, pp. 273-280, Dresden, 2021.

This paper summarizes the results of a Modified Rhyme Test conducted with masked stimuli to simulate two common types of hearing impairment: bilateral pulsatile and pure tinnitus. Two types of stimuli, meaningful German words (natural read speech and TTS output) differing in initial or final positioned minimal pairs were modified to correspond to six listening conditions. Results showed higher recognition scores for natural speech compared to synthetic and better intelligibility for pulsatile tinnitus noise over pure tone tinnitus. These insights are of relevance given the alarming rates of tinnitus in epidemiological reports.

@inproceedings{Kudera2021,
title = {Natural and synthetic speech comprehension in simulated tonal and pulsatile tinnitus: A pilot study},
author = {Jacek Kudera and Marjolein van Os and Bernd M{\"o}bius},
url = {https://www.essv.de/paper.php?id=1129},
year = {2021},
date = {2021},
booktitle = {Elektronische Sprachsignalverarbeitung 2021, Tagungsband der 32. Konferenz (Berlin)},
pages = {273-280},
publisher = {TUDpress},
address = {Dresden},
abstract = {This paper summarizes the results of a Modified Rhyme Test conducted with masked stimuli to simulate two common types of hearing impairment: bilateral pulsatile and pure tinnitus. Two types of stimuli, meaningful German words (natural read speech and TTS output) differing in initial or final positioned minimal pairs were modified to correspond to six listening conditions. Results showed higher recognition scores for natural speech compared to synthetic and better intelligibility for pulsatile tinnitus noise over pure tone tinnitus. These insights are of relevance given the alarming rates of tinnitus in epidemiological reports.},
pubstate = {published},
type = {inproceedings}
}


Project:   C1

Brandt, Erika; Möbius, Bernd; Andreeva, Bistra

Dynamic Formant Trajectories in German Read Speech: Impact of Predictability and Prominence Journal Article

Frontiers in Communication, section Language Sciences, 6, pp. 1-15, 2021.

Phonetic structures expand temporally and spectrally when they are difficult to predict from their context. To some extent, effects of predictability are modulated by prosodic structure. So far, studies on the impact of contextual predictability and prosody on phonetic structures have neglected the dynamic nature of the speech signal. This study investigates the impact of predictability and prominence on the dynamic structure of the first and second formants of German vowels. We expect to find differences in the formant movements between vowels standing in different predictability contexts and a modulation of this effect by prominence. First and second formant values are extracted from a large German corpus. Formant trajectories of peripheral vowels are modeled using generalized additive mixed models, which estimate nonlinear regressions between a dependent variable and predictors. Contextual predictability is measured as biphone and triphone surprisal based on a statistical German language model. We test for the effects of the information-theoretic measures surprisal and word frequency, as well as prominence, on formant movement, while controlling for vowel phonemes and duration. Primary lexical stress and vowel phonemes are significant predictors of first and second formant trajectory shape. We replicate previous findings that vowels are more dispersed in stressed syllables than in unstressed syllables. The interaction of stress and surprisal explains formant movement: unstressed vowels show more variability in their formant trajectory shape at different surprisal levels than stressed vowels. This work shows that effects of contextual predictability on fine phonetic detail can be observed not only in pointwise measures but also in dynamic features of phonetic segments.
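The article models trajectories with generalized additive mixed models (GAMMs); as a lightweight, plainly-substituted stand-in, the sketch below fits a smoothing spline to F2 over normalized time separately for stressed and unstressed vowels, using synthetic placeholder data rather than the corpus.

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 1, 200))              # normalized time in the vowel
f2 = {
    "stressed": 1500 + 300 * np.sin(np.pi * t) + rng.normal(0, 40, t.size),
    "unstressed": 1500 + 120 * np.sin(np.pi * t) + rng.normal(0, 40, t.size),
}
for label, y in f2.items():
    # Smoothing parameter follows the usual n * sigma^2 heuristic.
    spline = UnivariateSpline(t, y, s=t.size * 40.0 ** 2)
    print(label, np.round(spline(np.linspace(0, 1, 11))))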

@article{Brandt/etal:2021,
title = {Dynamic Formant Trajectories in German Read Speech: Impact of Predictability and Prominence},
author = {Erika Brandt and Bernd M{\"o}bius and Bistra Andreeva},
url = {https://www.frontiersin.org/articles/10.3389/fcomm.2021.643528/full},
doi = {https://doi.org/10.3389/fcomm.2021.643528},
year = {2021},
date = {2021-06-21},
journal = {Frontiers in Communication, section Language Sciences},
pages = {1-15},
volume = {6},
number = {643528},
abstract = {Phonetic structures expand temporally and spectrally when they are difficult to predict from their context. To some extent, effects of predictability are modulated by prosodic structure. So far, studies on the impact of contextual predictability and prosody on phonetic structures have neglected the dynamic nature of the speech signal. This study investigates the impact of predictability and prominence on the dynamic structure of the first and second formants of German vowels. We expect to find differences in the formant movements between vowels standing in different predictability contexts and a modulation of this effect by prominence. First and second formant values are extracted from a large German corpus. Formant trajectories of peripheral vowels are modeled using generalized additive mixed models, which estimate nonlinear regressions between a dependent variable and predictors. Contextual predictability is measured as biphone and triphone surprisal based on a statistical German language model. We test for the effects of the information-theoretic measures surprisal and word frequency, as well as prominence, on formant movement, while controlling for vowel phonemes and duration. Primary lexical stress and vowel phonemes are significant predictors of first and second formant trajectory shape. We replicate previous findings that vowels are more dispersed in stressed syllables than in unstressed syllables. The interaction of stress and surprisal explains formant movement: unstressed vowels show more variability in their formant trajectory shape at different surprisal levels than stressed vowels. This work shows that effects of contextual predictability on fine phonetic detail can be observed not only in pointwise measures but also in dynamic features of phonetic segments.},
pubstate = {published},
type = {article}
}


Project:   C1

Meier, David; Andreeva, Bistra

Einflussfaktoren auf die Wahrnehmung von Prominenz im natürlichen Dialog Inproceedings

Elektronische Sprachsignalverarbeitung 2020, Tagungsband der 31. Konferenz , pp. 257-264, Magdeburg, 2020.

Turnbull et al. [1] observe that the perception of prosodic prominence in isolated adjective-noun pairs is affected by several competing factors, namely phonology, discourse context, and knowledge about the discourse. The present paper aims to examine the relative influence of evoked focus (narrow contrastive vs. broad contrastive) and accentuation (accented vs. unaccented) on the perception of prominence, and to test whether the concepts presented in Turnbull et al. can be reproduced in a setting that is more comparable to natural spoken dialogue. For the study, 144 sentence realizations by a single male speaker were spliced together such that a semantic contrast arises either on the noun in question or on the adjective. The metrically strong syllables of the adjective or the noun were accented either in accordance with the focus structure or contrary to expectation. The results show that accentuation has a greater influence on prominence perception than the focus condition, in line with the findings of Turnbull et al. Moreover, adjectives are consistently rated as more prominent than nouns in comparable contexts. Extending the discourse context and the background information available to the participants, however, had only negligible effects in the experimental setup presented here.

@inproceedings{Meier2020,
title = {Einflussfaktoren auf die Wahrnehmung von Prominenz im nat{\"u}rlichen Dialog},
author = {David Meier and Bistra Andreeva},
url = {https://www.essv.de/paper.php?id=465},
year = {2020},
date = {2020},
booktitle = {Elektronische Sprachsignalverarbeitung 2020, Tagungsband der 31. Konferenz},
pages = {257-264},
address = {Magdeburg},
abstract = {Turnbull et al. [1] observe that the perception of prosodic prominence in isolated adjective-noun pairs is affected by several competing factors, namely phonology, discourse context, and knowledge about the discourse. The present paper aims to examine the relative influence of evoked focus (narrow contrastive vs. broad contrastive) and accentuation (accented vs. unaccented) on the perception of prominence, and to test whether the concepts presented in Turnbull et al. can be reproduced in a setting that is more comparable to natural spoken dialogue. For the study, 144 sentence realizations by a single male speaker were spliced together such that a semantic contrast arises either on the noun in question or on the adjective. The metrically strong syllables of the adjective or the noun were accented either in accordance with the focus structure or contrary to expectation. The results show that accentuation has a greater influence on prominence perception than the focus condition, in line with the findings of Turnbull et al. Moreover, adjectives are consistently rated as more prominent than nouns in comparable contexts. Extending the discourse context and the background information available to the participants, however, had only negligible effects in the experimental setup presented here.},
pubstate = {published},
type = {inproceedings}
}


Project:   C1

Andreeva, Bistra; Möbius, Bernd; Whang, James

Effects of surprisal and boundary strength on phrase-final lengthening Inproceedings

Proc. 10th International Conference on Speech Prosody 2020, pp. 146-150, 2020.

This study examines the influence of prosodic structure (pitch accents and boundary strength) and information density (ID) on phrase-final syllable duration. Phrase-final syllable durations and following pause durations were measured in a subset of a German radio-news corpus (DIRNDL), consisting of about 5 hours of manually annotated speech. The prosodic annotation is in accordance with the autosegmental intonation model and includes labels for pitch accents and boundary tones. We treated pause duration as a quantitative proxy for boundary strength.

ID was calculated as the surprisal of the syllable trigram of the preceding context, based on language models trained on the DeWaC corpus. We found a significant positive correlation between surprisal and phrase-final syllable duration. Syllable duration was statistically modeled as a function of prosodic factors (pitch accent and boundary strength) and surprisal in linear mixed effects models. The results revealed an interaction of surprisal and boundary strength with respect to phrase-final syllable duration. Syllables with high surprisal values are longer before stronger boundaries, whereas low-surprisal syllables are longer before weaker boundaries. This modulation of pre-boundary syllable duration is observed above and beyond the well-established phrase-final lengthening effect.
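A hedged sketch of the model family named above: a linear mixed-effects model of phrase-final syllable duration with a surprisal-by-boundary-strength interaction and random speaker intercepts, fitted with statsmodels. All column names and the synthetic data are assumptions for illustration, not the DIRNDL corpus.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame({
    "speaker": rng.integers(0, 10, n).astype(str),
    "surprisal": rng.uniform(2, 14, n),
    "boundary_strength": rng.choice(["weak", "strong"], n),
    "pitch_accent": rng.choice(["yes", "no"], n),
})
strong = (df["boundary_strength"] == "strong")
# Build synthetic durations that contain a surprisal-by-boundary interaction.
df["duration"] = (0.18 + 0.004 * df["surprisal"] + 0.03 * strong
                  + 0.003 * df["surprisal"] * strong
                  + rng.normal(0, 0.02, n))

model = smf.mixedlm("duration ~ surprisal * boundary_strength + pitch_accent",
                    data=df, groups=df["speaker"])
print(model.fit().summary())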

@inproceedings{Andreeva2020,
title = {Effects of surprisal and boundary strength on phrase-final lengthening},
author = {Bistra Andreeva and Bernd M{\"o}bius and James Whang},
url = {http://dx.doi.org/10.21437/SpeechProsody.2020-30},
year = {2020},
date = {2020-10-20},
booktitle = {Proc. 10th International Conference on Speech Prosody 2020},
pages = {146-150},
abstract = {This study examines the influence of prosodic structure (pitch accents and boundary strength) and information density (ID) on phrase-final syllable duration. Phrase-final syllable durations and following pause durations were measured in a subset of a German radio-news corpus (DIRNDL), consisting of about 5 hours of manually annotated speech. The prosodic annotation is in accordance with the autosegmental intonation model and includes labels for pitch accents and boundary tones. We treated pause duration as a quantitative proxy for boundary strength. ID was calculated as the surprisal of the syllable trigram of the preceding context, based on language models trained on the DeWaC corpus. We found a significant positive correlation between surprisal and phrase-final syllable duration. Syllable duration was statistically modeled as a function of prosodic factors (pitch accent and boundary strength) and surprisal in linear mixed effects models. The results revealed an interaction of surprisal and boundary strength with respect to phrase-final syllable duration. Syllables with high surprisal values are longer before stronger boundaries, whereas low-surprisal syllables are longer before weaker boundaries. This modulation of pre-boundary syllable duration is observed above and beyond the well-established phrase-final lengthening effect.},
pubstate = {published},
type = {inproceedings}
}


Project:   C1

Batliner, Anton; Möbius, Bernd

Prosody in automatic speech processing Book Chapter

Gussenhoven, Carlos; Chen, Aoju (Ed.): The Oxford Handbook of Language Prosody, Chap. 46, Oxford University Press, pp. 633-645, 2020, ISBN 9780198832232.

Automatic speech processing (ASP) is understood as covering word recognition, the processing of higher linguistic components (syntax, semantics, and pragmatics), and the processing of computational paralinguistics (CP), which deals with speaker states and traits. This chapter attempts to track the role of prosody in ASP from the word level up to CP. A short history of the field from 1980 to 2020 distinguishes the early years (until 2000), when the prosodic contribution to the modelling of linguistic phenomena, such as accents, boundaries, syntax, semantics, and dialogue acts, was the focus, from the later years, when the focus shifted to paralinguistics; prosody ceased to be visible. Different types of predictor variables are addressed, among them high-performance power features as well as leverage features, which can also be employed in teaching and therapy.

@inbook{Batliner/Moebius:2020,
title = {Prosody in automatic speech processing},
author = {Anton Batliner and Bernd M{\"o}bius},
editor = {Carlos Gussenhoven and Aoju Chen},
url = {https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780198832232.001.0001/oxfordhb-9780198832232-e-42},
doi = {https://doi.org/10.1093/oxfordhb/9780198832232.013.42},
year = {2020},
date = {2020},
booktitle = {The Oxford Handbook of Language Prosody, Chap. 46},
isbn = {9780198832232},
pages = {633-645},
publisher = {Oxford University Press},
abstract = {Automatic speech processing (ASP) is understood as covering word recognition, the processing of higher linguistic components (syntax, semantics, and pragmatics), and the processing of computational paralinguistics (CP), which deals with speaker states and traits. This chapter attempts to track the role of prosody in ASP from the word level up to CP. A short history of the field from 1980 to 2020 distinguishes the early years (until 2000)— when the prosodic contribution to the modelling of linguistic phenomena, such as accents, boundaries, syntax, semantics, and dialogue acts, was the focus—from the later years, when the focus shifted to paralinguistics; prosody ceased to be visible. Different types of predictor variables are addressed, among them high-performance power features as well as leverage features, which can also be employed in teaching and therapy.},
pubstate = {published},
type = {inbook}
}


Project:   C1

Karpiński, Maciej; Andreeva, Bistra; Asu, Eva Liina; Beňuš, Štefan; Daugavet, Anna; Mády, Katalin

Central and Eastern Europe Book Chapter

Gussenhoven, Carlos; Chen, Aoju (Ed.): The Oxford Handbook of Language Prosody, Chap. 15, Oxford University Press, pp. 225-235, 2020, ISBN 9780198832232.

The languages of Central and Eastern Europe addressed in this chapter form a typologically divergent collection that includes Slavic (Belarusian, Bulgarian, Czech, Macedonian, Polish, Russian, pluricentric Bosnian-Croatian-Montenegrin-Serbian, Slovak, Slovenian, Ukrainian), Baltic (Latvian, Lithuanian), Finno-Ugric (Hungarian, Finnish, Estonian), and Romance (Romanian). Their prosodic features and structures have been explored to various depths, from different theoretical perspectives, sometimes on the basis of relatively sparse material. Still, enough is known to see that their typological divergence as well as other factors contribute to vivid differences in their prosodic systems. While belonging to intonational languages, they differ in pitch patterns and their usage, duration, and rhythm (some involve phonological duration), as well as prominence mechanisms, accentuation, and word stress (fixed or mobile). Several languages in the area have what is referred to by different traditions as pitch accents, tones or syllable accents, or intonations.


@inbook{Karpinski/etal:2020,
title = {Central and Eastern Europe},
author = {Maciej Karpiński and Bistra Andreeva and Eva Liina Asu and Štefan Beňuš and Anna Daugavet and Katalin M{\'a}dy},
editor = {Carlos Gussenhoven and Aoju Chen},
url = {https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780198832232.001.0001/oxfordhb-9780198832232-e-14},
year = {2020},
date = {2020},
booktitle = {The Oxford Handbook of Language Prosody, Chap. 15},
isbn = {9780198832232},
pages = {225-235},
publisher = {Oxford University Press},
abstract = {The languages of Central and Eastern Europe addressed in this chapter form a typologically divergent collection that includes Slavic (Belarusian, Bulgarian, Czech, Macedonian, Polish, Russian, pluricentric Bosnian-Croatian-Montenegrin-Serbian, Slovak, Slovenian, Ukrainian), Baltic (Latvian, Lithuanian), Finno-Ugric (Hungarian, Finnish, Estonian), and Romance (Romanian). Their prosodic features and structures have been explored to various depths, from different theoretical perspectives, sometimes on the basis of relatively sparse material. Still, enough is known to see that their typological divergence as well as other factors contribute to vivid differences in their prosodic systems. While belonging to intonational languages, they differ in pitch patterns and their usage, duration, and rhythm (some involve phonological duration), as well as prominence mechanisms, accentuation, and word stress (fixed or mobile). Several languages in the area have what is referred to by different traditions as pitch accents, tones or syllable accents, or intonations.},
pubstate = {published},
type = {inbook}
}


Project:   C1

Abdullah, Badr M.; Avgustinova, Tania; Möbius, Bernd; Klakow, Dietrich

Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages Inproceedings

Proceedings of Interspeech 2020, pp. 477-481, 2020.

State-of-the-art spoken language identification (LID) systems, which are based on end-to-end deep neural networks, have shown remarkable success not only in discriminating between distant languages but also between closely-related languages or even different spoken varieties of the same language. However, it is still unclear to what extent neural LID models generalize to speech samples with different acoustic conditions due to domain shift. In this paper, we present a set of experiments to investigate the impact of domain mismatch on the performance of neural LID systems for a subset of six Slavic languages across two domains (read speech and radio broadcast) and examine two low-level signal descriptors (spectral and cepstral features) for this task. Our experiments show that (1) out-of-domain speech samples severely hinder the performance of neural LID models, and (2) while both spectral and cepstral features show comparable performance within-domain, spectral features show more robustness under domain mismatch. Moreover, we apply unsupervised domain adaptation to minimize the discrepancy between the two domains in our study. We achieve relative accuracy improvements that range from 9% to 77% depending on the diversity of acoustic conditions in the source domain.
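To make the spectral-vs-cepstral contrast concrete, the sketch below computes both descriptor types with librosa and mean-pools them over time; a simple classifier stands in for the end-to-end networks actually used in the paper. Paths, sampling rate, and feature dimensions are illustrative assumptions.

import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def spectral_feats(path):
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
    return np.log(mel + 1e-10).mean(axis=1)          # time-pooled log-mel

def cepstral_feats(path):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

# With stacked features X and language labels y for six Slavic languages:
# clf = LogisticRegression(max_iter=1000).fit(X_read_speech, y_read_speech)
# print(clf.score(X_radio_broadcast, y_radio_broadcast))  # domain mismatch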

@inproceedings{abdullah_etal_is2020,
title = {Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages},
author = {Badr M. Abdullah and Tania Avgustinova and Bernd M{\"o}bius and Dietrich Klakow},
url = {https://arxiv.org/abs/2008.00545},
doi = {https://doi.org/10.21437/Interspeech.2020-2930},
year = {2020},
date = {2020},
booktitle = {Proceedings of Interspeech 2020},
pages = {477-481},
abstract = {State-of-the-art spoken language identification (LID) systems, which are based on end-to-end deep neural networks, have shown remarkable success not only in discriminating between distant languages but also between closely-related languages or even different spoken varieties of the same language. However, it is still unclear to what extent neural LID models generalize to speech samples with different acoustic conditions due to domain shift. In this paper, we present a set of experiments to investigate the impact of domain mismatch on the performance of neural LID systems for a subset of six Slavic languages across two domains (read speech and radio broadcast) and examine two low-level signal descriptors (spectral and cepstral features) for this task. Our experiments show that (1) out-of-domain speech samples severely hinder the performance of neural LID models, and (2) while both spectral and cepstral features show comparable performance within-domain, spectral features show more robustness under domain mismatch. Moreover, we apply unsupervised domain adaptation to minimize the discrepancy between the two domains in our study. We achieve relative accuracy improvements that range from 9% to 77% depending on the diversity of acoustic conditions in the source domain.},
pubstate = {published},
type = {inproceedings}
}


Projects:   C1 C4

Abdullah, Badr M.; Kudera, Jacek; Avgustinova, Tania; Möbius, Bernd; Klakow, Dietrich

Rediscovering the Slavic Continuum in Representations Emerging from Neural Models of Spoken Language Identification Inproceedings

Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2020), International Committee on Computational Linguistics (ICCL), pp. 128-139, Barcelona, Spain (Online), 2020.

Deep neural networks have been employed for various spoken language recognition tasks, including tasks that are multilingual by definition such as spoken language identification (LID). In this paper, we present a neural model for Slavic language identification in speech signals and analyze its emergent representations to investigate whether they reflect objective measures of language relatedness or non-linguists’ perception of language similarity. While our analysis shows that the language representation space indeed captures language relatedness to a great extent, we find perceptual confusability to be the best predictor of the language representation similarity.
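The probing idea in the abstract can be sketched as follows: mean-pool a model's utterance embeddings per language, compute pairwise cosine similarities, and correlate them with an external perceptual-confusability matrix. The embeddings and the confusability matrix below are random placeholders, and the language codes are assumptions.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

langs = ["be", "bg", "cs", "pl", "ru", "uk"]         # hypothetical language set
emb = {l: np.random.default_rng(i).normal(size=(100, 256))
       for i, l in enumerate(langs)}                 # per-utterance embeddings
centroids = np.stack([emb[l].mean(axis=0) for l in langs])
sim = 1 - squareform(pdist(centroids, metric="cosine"))

confusability = np.random.default_rng(42).uniform(size=(6, 6))  # placeholder
iu = np.triu_indices(len(langs), k=1)                # upper triangle only
rho, p = spearmanr(sim[iu], confusability[iu])
print(rho, p)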

@inproceedings{abdullah_etal_vardial2020,
title = {Rediscovering the Slavic Continuum in Representations Emerging from Neural Models of Spoken Language Identification},
author = {Badr M. Abdullah and Jacek Kudera and Tania Avgustinova and Bernd M{\"o}bius and Dietrich Klakow},
url = {https://www.aclweb.org/anthology/2020.vardial-1.12},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2020)},
pages = {128-139},
publisher = {International Committee on Computational Linguistics (ICCL)},
address = {Barcelona, Spain (Online)},
abstract = {Deep neural networks have been employed for various spoken language recognition tasks, including tasks that are multilingual by definition such as spoken language identification (LID). In this paper, we present a neural model for Slavic language identification in speech signals and analyze its emergent representations to investigate whether they reflect objective measures of language relatedness or non-linguists’ perception of language similarity. While our analysis shows that the language representation space indeed captures language relatedness to a great extent, we find perceptual confusability to be the best predictor of the language representation similarity.},
pubstate = {published},
type = {inproceedings}
}


Projects:   C1 C4

Brandt, Erika

Information density and phonetic structure: explaining segmental variability PhD Thesis

Saarland University, Saarbruecken, Germany, 2019.

There is growing evidence that information-theoretic principles influence linguistic structures. Regarding speech several studies have found that phonetic structures lengthen in duration and strengthen in their spectral features when they are difficult to predict from their context, whereas easily predictable phonetic structures are shortened and reduced spectrally. Most of this evidence comes from studies on American English, only some studies have shown similar tendencies in Dutch, Finnish, or Russian. In this context, the Smooth Signal Redundancy hypothesis (Aylett and Turk 2004, Aylett and Turk 2006) emerged claiming that the effect of information-theoretic factors on the segmental structure is moderated through the prosodic structure. In this thesis, we investigate the impact and interaction of information density and prosodic structure on segmental variability in production analyses, mainly based on German read speech, and also listeners' perception of differences in phonetic detail caused by predictability effects. Information density (ID) is defined as contextual predictability or surprisal (S(unit_i) = -log2 P(unit_i|context)) and estimated from language models based on large text corpora. In addition to surprisal, we include word frequency, and prosodic factors, such as primary lexical stress, prosodic boundary, and articulation rate, as predictors of segmental variability in our statistical analysis. As acoustic-phonetic measures, we investigate segment duration and deletion, voice onset time (VOT), vowel dispersion, global spectral characteristics of vowels, dynamic formant measures and voice quality metrics. Vowel dispersion is analyzed in the context of German learners' speech and in a cross-linguistic study. As results, we replicate previous findings of reduced segment duration (and VOT), higher likelihood to delete, and less vowel dispersion for easily predictable segments. Easily predictable German vowels have less formant change in their vowel section length (VSL), F1 slope and velocity, are less curved in their F2, and show increased breathiness values in cepstral peak prominence (smoothed) than vowels that are difficult to predict from their context. Results for word frequency show similar tendencies: German segments in high-frequency words are shorter, more likely to delete, less dispersed, and show less magnitude in formant change, less F2 curvature, as well as less harmonic richness in open quotient smoothed than German segments in low-frequency words. These effects are found even though we control for the expected and much more effective effects of stress, boundary, and speech rate. In the cross-linguistic analysis of vowel dispersion, the effect of ID is robust across almost all of the six languages and the three intended speech rates. Surprisal does not affect vowel dispersion of non-native German speakers. Surprisal and prosodic factors interact in explaining segmental variability. Especially, stress and surprisal complement each other in their positive effect on segment duration, vowel dispersion and magnitude in formant change. Regarding perception we observe that listeners are sensitive to differences in phonetic detail stemming from high and low surprisal contexts for the same lexical target.
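The surprisal definition given above, S(unit_i) = -log2 P(unit_i|context), can be estimated from n-gram counts. The sketch below uses a plain maximum-likelihood triphone estimate without the smoothing a large language model would apply, so it is illustrative only.

import math
from collections import Counter

def make_triphone_surprisal(phones):
    tri = Counter(zip(phones, phones[1:], phones[2:]))
    bi = Counter(zip(phones, phones[1:]))
    def surprisal(p1, p2, p3):
        if tri[(p1, p2, p3)] == 0:
            return float("inf")          # unseen triphone: no smoothing here
        return -math.log2(tri[(p1, p2, p3)] / bi[(p1, p2)])
    return surprisal

s = make_triphone_surprisal(list("abracadabra"))
print(s("a", "b", "r"))                  # 0.0 bits: 'r' is certain after 'a b'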

@phdthesis{Brandt_diss_2019,
title = {Information density and phonetic structure: explaining segmental variability},
author = {Erika Brandt},
url = {http://nbn-resolving.de/urn:nbn:de:bsz:291--ds-279181},
doi = {https://doi.org/10.22028/D291-27918},
year = {2019},
date = {2019},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {There is growing evidence that information-theoretic principles influence linguistic structures. Regarding speech several studies have found that phonetic structures lengthen in duration and strengthen in their spectral features when they are difficult to predict from their context, whereas easily predictable phonetic structures are shortened and reduced spectrally. Most of this evidence comes from studies on American English, only some studies have shown similar tendencies in Dutch, Finnish, or Russian. In this context, the Smooth Signal Redundancy hypothesis (Aylett and Turk 2004, Aylett and Turk 2006) emerged claiming that the effect of information-theoretic factors on the segmental structure is moderated through the prosodic structure. In this thesis, we investigate the impact and interaction of information density and prosodic structure on segmental variability in production analyses, mainly based on German read speech, and also listeners' perception of differences in phonetic detail caused by predictability effects. Information density (ID) is defined as contextual predictability or surprisal (S(unit_i) = -log2 P(unit_i|context)) and estimated from language models based on large text corpora. In addition to surprisal, we include word frequency, and prosodic factors, such as primary lexical stress, prosodic boundary, and articulation rate, as predictors of segmental variability in our statistical analysis. As acoustic-phonetic measures, we investigate segment duration and deletion, voice onset time (VOT), vowel dispersion, global spectral characteristics of vowels, dynamic formant measures and voice quality metrics. Vowel dispersion is analyzed in the context of German learners' speech and in a cross-linguistic study. As results, we replicate previous findings of reduced segment duration (and VOT), higher likelihood to delete, and less vowel dispersion for easily predictable segments. Easily predictable German vowels have less formant change in their vowel section length (VSL), F1 slope and velocity, are less curved in their F2, and show increased breathiness values in cepstral peak prominence (smoothed) than vowels that are difficult to predict from their context. Results for word frequency show similar tendencies: German segments in high-frequency words are shorter, more likely to delete, less dispersed, and show less magnitude in formant change, less F2 curvature, as well as less harmonic richness in open quotient smoothed than German segments in low-frequency words. These effects are found even though we control for the expected and much more effective effects of stress, boundary, and speech rate. In the cross-linguistic analysis of vowel dispersion, the effect of ID is robust across almost all of the six languages and the three intended speech rates. Surprisal does not affect vowel dispersion of non-native German speakers. Surprisal and prosodic factors interact in explaining segmental variability. Especially, stress and surprisal complement each other in their positive effect on segment duration, vowel dispersion and magnitude in formant change. Regarding perception we observe that listeners are sensitive to differences in phonetic detail stemming from high and low surprisal contexts for the same lexical target.},
pubstate = {published},
type = {phdthesis}
}


Project:   C1

Brandt, Erika; Andreeva, Bistra; Möbius, Bernd

Information density and vowel dispersion in the productions of Bulgarian L2 speakers of German Inproceedings

Proceedings of the 19th International Congress of Phonetic Sciences , pp. 3165-3169, Melbourne, Australia, 2019.

We investigated the influence of information density (ID) on vowel space size in L2. Vowel dispersion was measured for the stressed tense vowels /i:, o:, a:/ and their lax counterparts /I, O, a/ in read speech from six German speakers, six advanced and six intermediate Bulgarian speakers of German. The Euclidean distance between the center of the vowel space and the formant values for each speaker was used as a measure for vowel dispersion. ID was calculated as the surprisal of the triphone of the preceding context. We found a significant positive correlation between surprisal and vowel dispersion in German native speakers. The advanced L2 speakers showed a significant positive relationship between these two measures, while this was not observed in intermediate L2 vowel productions. The intermediate speakers raised their vowel space, reflecting native Bulgarian vowel raising in unstressed positions.
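The dispersion measure described above is straightforward to compute: the Euclidean distance of each vowel token's (F1, F2) point from the speaker's vowel-space center. Taking the mean of all tokens as that center, and the formant values below, are assumptions for illustration.

import numpy as np

def vowel_dispersion(f1, f2):
    pts = np.column_stack([f1, f2])
    center = pts.mean(axis=0)                     # speaker's vowel-space center
    return np.linalg.norm(pts - center, axis=1)   # one distance per token

f1 = np.array([300.0, 320.0, 700.0, 680.0, 420.0, 440.0])
f2 = np.array([2300.0, 800.0, 1200.0, 1250.0, 1800.0, 1000.0])
print(vowel_dispersion(f1, f2).round(1))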

@inproceedings{Brandt2019,
title = {Information density and vowel dispersion in the productions of Bulgarian L2 speakers of German},
author = {Erika Brandt and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/29548},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 19th International Congress of Phonetic Sciences},
pages = {3165-3169},
address = {Melbourne, Australia},
abstract = {We investigated the influence of information density (ID) on vowel space size in L2. Vowel dispersion was measured for the stressed tense vowels /i:, o:, a:/ and their lax counterpart /I, O, a/ in read speech from six German speakers, six advanced and six intermediate Bulgarian speakers of German. The Euclidean distance between center of the vowel space and formant values for each speaker was used as a measure for vowel dispersion. ID was calculated as the surprisal of the triphone of the preceding context. We found a significant positive correlation between surprisal and vowel dispersion in German native speakers. The advanced L2 speakers showed a significant positive relationship between these two measures, while this was not observed in intermediate L2 vowel productions. The intermediate speakers raised their vowel space, reflecting native Bulgarian vowel raising in unstressed positions.},
pubstate = {published},
type = {inproceedings}
}


Project:   C1

Whang, James

Effects of phonotactic predictability on sensitivity to phonetic detail Journal Article

Laboratory Phonology: Journal of the Association for Laboratory Phonology, 10, pp. 1-28, 2019.

Japanese speakers systematically devoice or delete high vowels [i, u] between two voiceless consonants. Japanese listeners also report perceiving the same high vowels between consonant clusters even in the absence of a vocalic segment. Although perceptual vowel epenthesis has been described primarily as a phonotactic repair strategy, where a phonetically minimal vowel is epenthesized by default, few studies have investigated how the predictability of a vowel in a given context affects the choice of epenthetic vowel. The present study uses a forced-choice labeling task to test how sensitive Japanese listeners are to coarticulatory cues of high vowels [i, u] and non-high vowel [a] in devoicing and non-devoicing contexts. Devoicing contexts were further divided into high-predictability contexts, where the phonotactic distribution strongly favors one of the high vowels, and low-predictability contexts, where both high vowels are allowed, to specifically test for the effects of predictability. Results reveal a strong tendency towards [u] epenthesis as previous studies have found, but the results also reveal a sensitivity to coarticulatory cues that override the default [u] epenthesis, particularly in low-predictability contexts. Previous studies have shown that predictability affects phonetic implementation during production, and this study provides evidence predictability has similar effects during perception.

@article{Whang2019,
title = {Effects of phonotactic predictability on sensitivity to phonetic detail},
author = {James Whang},
url = {https://www.journal-labphon.org/articles/10.5334/labphon.125/},
doi = {https://doi.org/10.5334/labphon.125},
year = {2019},
date = {2019-04-23},
journal = {Laboratory Phonology: Journal of the Association for Laboratory Phonology},
pages = {1-28},
volume = {10},
number = {1},
abstract = {Japanese speakers systematically devoice or delete high vowels [i, u] between two voiceless consonants. Japanese listeners also report perceiving the same high vowels between consonant clusters even in the absence of a vocalic segment. Although perceptual vowel epenthesis has been described primarily as a phonotactic repair strategy, where a phonetically minimal vowel is epenthesized by default, few studies have investigated how the predictability of a vowel in a given context affects the choice of epenthetic vowel. The present study uses a forced-choice labeling task to test how sensitive Japanese listeners are to coarticulatory cues of high vowels [i, u] and non-high vowel [a] in devoicing and non-devoicing contexts. Devoicing contexts were further divided into high-predictability contexts, where the phonotactic distribution strongly favors one of the high vowels, and low-predictability contexts, where both high vowels are allowed, to specifically test for the effects of predictability. Results reveal a strong tendency towards [u] epenthesis as previous studies have found, but the results also reveal a sensitivity to coarticulatory cues that override the default [u] epenthesis, particularly in low-predictability contexts. Previous studies have shown that predictability affects phonetic implementation during production, and this study provides evidence predictability has similar effects during perception.},
pubstate = {published},
type = {article}
}


Project:   C1

Malisz, Zofia; Brandt, Erika; Möbius, Bernd; Oh, Yoon Mi; Andreeva, Bistra

Dimensions of segmental variability: interaction of prosody and surprisal in six languages Journal Article

Frontiers in Communication / Language Sciences, 3, pp. 1-18, 2018.

Contextual predictability variation affects phonological and phonetic structure. Reduction and expansion of acoustic-phonetic features is also characteristic of prosodic variability. In this study, we assess the impact of surprisal and prosodic structure on phonetic encoding, both independently of each other and in interaction. We model segmental duration, vowel space size and spectral characteristics of vowels and consonants as a function of surprisal as well as of syllable prominence, phrase boundary, and speech rate. Correlates of phonetic encoding density are extracted from a subset of the BonnTempo corpus for six languages: American English, Czech, Finnish, French, German, and Polish. Surprisal is estimated from segmental n-gram language models trained on large text corpora. Our findings are generally compatible with a weak version of Aylett and Turk’s Smooth Signal Redundancy hypothesis, suggesting that prosodic structure mediates between the requirements of efficient communication and the speech signal. However, this mediation is not perfect, as we found evidence for additional, direct effects of changes in surprisal on the phonetic structure of utterances. These effects appear to be stable across different speech rates.

@article{Malisz2018,
title = {Dimensions of segmental variability: interaction of prosody and surprisal in six languages},
author = {Zofia Malisz and Erika Brandt and Bernd M{\"o}bius and Yoon Mi Oh and Bistra Andreeva},
url = {https://www.frontiersin.org/articles/10.3389/fcomm.2018.00025/full},
doi = {https://doi.org/10.3389/fcomm.2018.00025},
year = {2018},
date = {2018-07-20},
journal = {Frontiers in Communication / Language Sciences},
pages = {1-18},
volume = {3},
number = {25},
abstract = {Contextual predictability variation affects phonological and phonetic structure. Reduction and expansion of acoustic-phonetic features is also characteristic of prosodic variability. In this study, we assess the impact of surprisal and prosodic structure on phonetic encoding, both independently of each other and in interaction. We model segmental duration, vowel space size and spectral characteristics of vowels and consonants as a function of surprisal as well as of syllable prominence, phrase boundary, and speech rate. Correlates of phonetic encoding density are extracted from a subset of the BonnTempo corpus for six languages: American English, Czech, Finnish, French, German, and Polish. Surprisal is estimated from segmental n-gram language models trained on large text corpora. Our findings are generally compatible with a weak version of Aylett and Turk's Smooth Signal Redundancy hypothesis, suggesting that prosodic structure mediates between the requirements of efficient communication and the speech signal. However, this mediation is not perfect, as we found evidence for additional, direct effects of changes in surprisal on the phonetic structure of utterances. These effects appear to be stable across different speech rates.},
pubstate = {published},
type = {article}
}

Project:   C1

Zimmerer, Frank; Brandt, Erika; Andreeva, Bistra; Möbius, Bernd

Idiomatic or literal? Production of collocations in German read speech Inproceedings

Proc. Speech Prosody 2018, pp. 428-432, Poznan, 2018.

Collocations have been identified as an interesting field to study the effects of frequency of occurrence in language and speech. We report results of a production experiment including a duration analysis based on the production of German collocations. The collocations occurred in a condition where the phrase was produced with a literal meaning and in another condition where it was idiomatic. A durational difference was found for the collocations, which were reduced in the idiomatic condition. This difference was also observed for the function word und (‘and’) in collocations like Mord und Totschlag (‘murder and manslaughter’). However, an analysis of the vowel /U/ of the function word did not show a durational difference. Some explanations as to why speakers showed different patterns of reduction (not all collocations were produced with a shorter duration in the idiomatic condition by all speakers) and why not all speakers use the durational cue (one out of eight speakers produced the conditions identically) are proposed.

@inproceedings{Zimmerer2018SpPro,
title = {Idiomatic or literal? Production of collocations in German read speech},
author = {Frank Zimmerer and Erika Brandt and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://www.isca-speech.org/archive/speechprosody_2018/zimmerer18_speechprosody.html},
doi = {https://doi.org/10.21437/SpeechProsody.2018-87},
year = {2018},
date = {2018},
booktitle = {Proc. Speech Prosody 2018},
pages = {428-432},
address = {Poznan},
abstract = {Collocations have been identified as an interesting field to study the effects of frequency of occurrence in language and speech. We report results of a production experiment including a duration analysis based on the production of German collocations. The collocations occurred in a condition where the phrase was produced with a literal meaning and in another condition where it was idiomatic. A durational difference was found for the collocations, which were reduced in the idiomatic condition. This difference was also observed for the function word und (‘and’) in collocations like Mord und Totschlag (‘murder and manslaughter’). However, an analysis of the vowel /U/ of the function word did not show a durational difference. Some explanations as to why speakers showed different patterns of reduction (not all collocations were produced with a shorter duration in the idiomatic condition by all speakers) and why not all speakers use the durational cue (one out of eight speakers produced the conditions identically) are proposed.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Brandt, Erika; Zimmerer, Frank; Andreeva, Bistra; Möbius, Bernd

Impact of prosodic structure and information density on dynamic formant trajectories in German Inproceedings

Klessa, Katarzyna; Bachan, Jolanta; Wagner, Agnieszka; Karpiński, Maciej; Śledziński, Daniel (Eds.): Speech Prosody 2018, Speech Prosody Special Interest Group, pp. 119-123, Urbana, 2018, ISSN 2333-2042.

This study investigated the influence of prosodic structure and information density (ID), defined as contextual predictability, on vowel-inherent spectral change (VISC). We extracted formant measurements from the onset and offset of the vowels of a large German corpus of newspaper read speech. Two measures of formant change were calculated: vector length (VL), the Euclidean distance between trajectory onset and offset in the F1-F2 plane, and the F1 and F2 slopes, the formant deltas between onset and offset relative to vowel duration. ID factors were word frequency and phoneme-based surprisal measures, while the prosodic factors comprised global and local articulation rate, primary lexical stress, and prosodic boundary. We expected vowels to increase in spectral change when they were difficult to predict from the context or stood in low-frequency words, while controlling for known effects of prosodic structure. The ID effects were assumed to be modulated by prosodic factors to a certain extent. We confirmed our hypotheses for VL, and found the expected independent effects of prosody and ID on F1 slope and F2 slope.
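
Both trajectory measures reduce to simple arithmetic over the onset and offset formant values; a minimal sketch (function and variable names are assumptions, not taken from the paper):

import math

def formant_change(f1_on, f2_on, f1_off, f2_off, duration):
    """Vector length and formant slopes from vowel onset/offset formants (Hz) and duration (s)."""
    vl = math.hypot(f1_off - f1_on, f2_off - f2_on)  # Euclidean distance in the F1-F2 plane
    f1_slope = (f1_off - f1_on) / duration           # formant delta relative to vowel duration
    f2_slope = (f2_off - f2_on) / duration
    return vl, f1_slope, f2_slope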

@inproceedings{Brandt2018SpPro,
title = {Impact of prosodic structure and information density on dynamic formant trajectories in German},
author = {Erika Brandt and Frank Zimmerer and Bistra Andreeva and Bernd M{\"o}bius},
editor = {Katarzyna Klessa and Jolanta Bachan and Agnieszka Wagner and Maciej Karpiński and Daniel Śledziński},
url = {https://www.researchgate.net/publication/325744530_Impact_of_prosodic_structure_and_information_density_on_dynamic_formant_trajectories_in_German},
doi = {https://doi.org/10.22028/D291-32050},
year = {2018},
date = {2018},
booktitle = {Speech Prosody 2018},
issn = {2333-2042},
pages = {119-123},
publisher = {Speech Prosody Special Interest Group},
address = {Urbana},
abstract = {This study investigated the influence of prosodic structure and information density (ID), defined as contextual predictability, on vowel-inherent spectral change (VISC). We extracted formant measurements from the onset and offset of the vowels of a large German corpus of newspaper read speech. Vector length (VL), the Euclidean distance between F1 and F2 trajectory, and F1 and F2 slope, formant deltas of onset and offset relative to vowel duration, were calculated as measures of formant change. ID factors were word frequency and phoneme-based surprisal measures, while the prosodic factors contained global and local articulation rate, primary lexical stress, and prosodic boundary. We expected that vowels increased in spectral change when they were difficult to predict from the context, or stood in low-frequency words while controlling for known effects of prosodic structure. The ID effects were assumed to be modulated by prosodic factors to a certain extent. We confirmed our hypotheses for VL, and found expected independent effects of prosody and ID on F1 slope and F2 slope.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Brandt, Erika; Zimmerer, Frank; Möbius, Bernd; Andreeva, Bistra

Mel-cepstral distortion of German vowels in different information density contexts Inproceedings

Proceedings of Interspeech, Stockholm, Sweden, 2017.

This study investigated whether German vowels differ significantly from each other in mel-cepstral distortion (MCD) when they stand in different information density (ID) contexts. We hypothesized that vowels in the same ID contexts are more similar to each other than vowels that stand in different ID conditions. Read speech material from PhonDat2 of 16 German native speakers (m = 10, f = 6) was analyzed. Biphone and word language models were calculated based on DeWaC. To account for additional variability in the data, prosodic factors as well as corpus-specific frequency values were also entered into the statistical models. Results showed that vowels in different ID conditions were significantly different in their MCD values. Unigram word probability and corpus-specific word frequency showed the expected effect on vowel similarity, with a hierarchy between non-contrasting and contrasting conditions. However, these did not form a homogeneous group, since there were group-internal significant differences. The largest distances were found between vowels produced at fast speech rate and between unstressed vowels.
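
For reference, MCD compares two mel-cepstral coefficient vectors frame by frame; a minimal sketch under the common convention of excluding the energy coefficient c0 (the abstract does not state which variant was used):

import math

def mel_cepstral_distortion(c_a, c_b):
    """MCD in dB between two mel-cepstral vectors; c0 (energy) is skipped by convention."""
    sq_sum = sum((a - b) ** 2 for a, b in zip(c_a[1:], c_b[1:]))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * sq_sum)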

@inproceedings{Brandt/etal:2017,
title = {Mel-cepstral distortion of German vowels in different information density contexts},
author = {Erika Brandt and Frank Zimmerer and Bernd M{\"o}bius and Bistra Andreeva},
url = {https://www.researchgate.net/publication/319185343_Mel-Cepstral_Distortion_of_German_Vowels_in_Different_Information_Density_Contexts},
year = {2017},
date = {2017},
booktitle = {Proceedings of Interspeech},
address = {Stockholm, Sweden},
abstract = {This study investigated whether German vowels differ significantly from each other in mel-cepstral distortion (MCD) when they stand in different information density (ID) contexts. We hypothesized that vowels in the same ID contexts are more similar to each other than vowels that stand in different ID conditions. Read speech material from PhonDat2 of 16 German natives (m = 10, f = 6) was analyzed. Biphone and word language models were calculated based on DeWaC. To account for additional variability in the data, prosodic factors, as well as corpus-specific frequency values were also entered into the statistical models. Results showed that vowels in different ID conditions were significantly different in their MCD values. Unigram word probability and corpus-specific word frequency showed the expected effect on vowel similarity with a hierarchy between non-contrasting and contrasting conditions. However, these did not form a homogeneous group since there were group-internal significant differences. The largest distance can be found between vowels produced at fast speech rate, and between unstressed vowels.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Zimmerer, Frank; Andreeva, Bistra; Möbius, Bernd; Malisz, Zofia; Ferragne, Emmanuel; Pellegrino, François; Brandt, Erika

Perzeption von Sprechgeschwindigkeit und der (nicht nachgewiesene) Einfluss von Surprisal Inproceedings

Möbius, Bernd (Ed.): Elektronische Sprachsignalverarbeitung 2017 - Tagungsband der 28. Konferenz, Saarbrücken, 15.-17. März 2017. Studientexte zur Sprachkommunikation, Band 86, pp. 174-179, 2017.

Two perception experiments investigated the perception of speech rate. A factor of particular interest is surprisal, an information-theoretic measure of the predictability of a linguistic unit in its context. Taken together, the results of the experiments suggest that surprisal does not exert a significant influence on the perception of speech rate.

@inproceedings{Zimmerer/etal:2017a,
title = {Perzeption von Sprechgeschwindigkeit und der (nicht nachgewiesene) Einfluss von Surprisal},
author = {Frank Zimmerer and Bistra Andreeva and Bernd M{\"o}bius and Zofia Malisz and Emmanuel Ferragne and François Pellegrino and Erika Brandt},
editor = {Bernd M{\"o}bius},
url = {https://www.researchgate.net/publication/318589916_PERZEPTION_VON_SPRECHGESCHWINDIGKEIT_UND_DER_NICHT_NACHGEWIESENE_EINFLUSS_VON_SURPRISAL},
year = {2017},
date = {2017-03-15},
booktitle = {Elektronische Sprachsignalverarbeitung 2017 - Tagungsband der 28. Konferenz, Saarbr{\"u}cken, 15.-17. M{\"a}rz 2017. Studientexte zur Sprachkommunikation, Band 86},
pages = {174-179},
abstract = {In zwei Perzeptionsexperimenten wurde die Perzeption von Sprechgeschwindigkeit untersucht. Ein Faktor, der dabei besonders im Zentrum des Interesses steht, ist Surprisal, ein informationstheoretisches Ma{\ss} f{\"u}r die Vorhersagbarkeit einer linguistischen Einheit im Kontext. Zusammengenommen legen die Ergebnisse der Experimente den Schluss nahe, dass Surprisal keinen signifikanten Einfluss auf die Wahrnehmung von Sprechgeschwindigkeit aus{\"u}bt.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Malisz, Zofia; O'Dell, Michael; Nieminen, Tommi; Wagner, Petra

Perspectives on speech timing: Coupled oscillator modeling of Polish and Finnish Journal Article

Phonetica, 73, pp. 229-255, 2016.

This study was aimed at analyzing empirical duration data for Polish spoken at different tempos using an updated version of the Coupled Oscillator Model of speech timing and rhythm variability (O’Dell and Nieminen, 1999, 2009). We use Bayesian inference on parameters relating to speech rate to investigate how tempo affects timing in Polish. The model parameters found are then compared with parameters obtained for equivalent material in Finnish to shed light on which of the effects represent general speech rate mechanisms and which are specific to Polish. We discuss the model and its predictions in the context of current perspectives on speech timing.
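
In the O'Dell and Nieminen formulation, the central empirical prediction is that the duration of a stress group grows linearly with the number of syllables it contains, and the ratio of the two fitted parameters indexes the relative coupling strength of the stress and syllable oscillators. A schematic sketch of that prediction (parameter names are assumptions; the paper estimates such parameters with Bayesian inference):

def stress_group_duration(n_syllables, a, b):
    """Predicted stress-group duration: a + b * n_syllables."""
    return a + b * n_syllables

def coupling_ratio(a, b):
    """r = a / b; larger r indicates a stronger pull toward stress-based timing."""
    return a / b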

@article{Malisz/etal:2016,
title = {Perspectives on speech timing: Coupled oscillator modeling of Polish and Finnish},
author = {Zofia Malisz and Michael O'Dell and Tommi Nieminen and Petra Wagner},
url = {https://www.degruyter.com/document/doi/10.1159/000450829/html},
doi = {https://doi.org/10.1159/000450829},
year = {2016},
date = {2016},
journal = {Phonetica},
pages = {229-255},
volume = {73},
abstract = {This study was aimed at analyzing empirical duration data for Polish spoken at different tempos using an updated version of the Coupled Oscillator Model of speech timing and rhythm variability (O'Dell and Nieminen, 1999, 2009). We use Bayesian inference on parameters relating to speech rate to investigate how tempo affects timing in Polish. The model parameters found are then compared with parameters obtained for equivalent material in Finnish to shed light on which of the effects represent general speech rate mechanisms and which are specific to Polish. We discuss the model and its predictions in the context of current perspectives on speech timing.},
pubstate = {published},
type = {article}
}

Project:   C1

Schulz, Erika; Oh, Yoon Mi; Andreeva, Bistra; Möbius, Bernd

Impact of Prosodic Structure and Information Density on Vowel Space Size Inproceedings

Proceedings of Speech Prosody, pp. 350-354, Boston, MA, USA, 2016.

We investigated the influence of prosodic structure and information density on vowel space size. Vowels were measured in five languages from the BonnTempo corpus: French, German, Finnish, Czech, and Polish, each with three female and three male speakers. Speakers read the text at normal, slow, and fast speech rate. The Euclidean distance between the vowel space midpoint and the formant values for each speaker was used as a measure of vowel distinctiveness. The prosodic model consisted of prominence and boundary. Information density was calculated for each language using biphone surprisal, -log2 P(Xn | Xn-1). On average, there is a positive relationship between vowel space expansion and information density. Detailed analysis revealed that this relationship did not hold for Finnish and was only weak for Polish. When vowel distinctiveness was modeled as a function of prosodic factors and information density in linear mixed-effects models (LMM), only prosodic factors were significant in explaining the variance in vowel space expansion. All prosodic factors except word boundary showed significant positive results in the LMM. Vowels were more distinct in stressed syllables, before a prosodic boundary, and at normal and slow speech rate compared to fast speech.
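
The distinctiveness measure itself is straightforward to compute; a minimal sketch, assuming the vowel space midpoint is the mean of a speaker's (F1, F2) measurements (the abstract does not specify how the midpoint was obtained):

import math
import statistics

def vowel_distinctiveness(tokens):
    """Mean Euclidean distance of (F1, F2) tokens from the speaker's vowel space midpoint."""
    mid_f1 = statistics.mean(f1 for f1, _ in tokens)
    mid_f2 = statistics.mean(f2 for _, f2 in tokens)
    return statistics.mean(math.hypot(f1 - mid_f1, f2 - mid_f2) for f1, f2 in tokens)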

@inproceedings{Schulz/etal:2016a,
title = {Impact of Prosodic Structure and Information Density on Vowel Space Size},
author = {Erika Schulz and Yoon Mi Oh and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://www.researchgate.net/publication/303755409_Impact_of_prosodic_structure_and_information_density_on_vowel_space_size},
year = {2016},
date = {2016},
booktitle = {Proceedings of Speech Prosody},
pages = {350-354},
address = {Boston, MA, USA},
abstract = {We investigated the influence of prosodic structure and information density on vowel space size. Vowels were measured in five languages from the BonnTempo corpus, French, German, Finnish, Czech, and Polish, each with three female and three male speakers. Speakers read the text at normal, slow, and fast speech rate. The Euclidean distance between vowel space midpoint and formant values for each speaker was used as a measure for vowel distinctiveness. The prosodic model consisted of prominence and boundary. Information density was calculated for each language using the surprisal of the biphone Xn|Xn−1. On average, there is a positive relationship between vowel space expansion and information density. Detailed analysis revealed that this relationship did not hold for Finnish, and was only weak for Polish. When vowel distinctiveness was modeled as a function of prosodic factors and information density in linear mixed effects models (LMM), only prosodic factors were significant in explaining the variance in vowel space expansion. All prosodic factors, except word boundary, showed significant positive results in LMM. Vowels were more distinct in stressed syllables, before a prosodic boundary and at normal and slow speech rate compared to fast speech.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1
