Publications - SFB 1102

Bourgonje, Peter; Demberg, Vera

Generalizing across Languages and Domains for Discourse Relation Classification Inproceedings

Kawahara, Tatsuya; Demberg, Vera; Ultes, Stefan; Inoue, Koji; Mehri, Shikib; Howcroft, David; Komatani, Kazunori (Ed.): Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics, pp. 554-565, Kyoto, Japan, 2024.

The availability of corpora annotated for discourse relations is limited and discourse relation classification performance varies greatly depending on both language and domain. This is a problem for downstream applications that are intended for a language (i.e., not English) or a domain (i.e., not financial news) with comparatively low coverage for discourse annotations. In this paper, we experiment with a state-of-the-art model for discourse relation classification, originally developed for English, extend it to a multi-lingual setting (testing on Italian, Portuguese and Turkish), and employ a simple, yet effective method to mark out-of-domain training instances. By doing so, we aim to contribute to better generalization and more robust discourse relation classification performance across both language and domain.

@inproceedings{bourgonje-demberg-2024-generalizing,
title = {Generalizing across Languages and Domains for Discourse Relation Classification},
author = {Peter Bourgonje and Vera Demberg},
editor = {Tatsuya Kawahara and Vera Demberg and Stefan Ultes and Koji Inoue and Shikib Mehri and David Howcroft and Kazunori Komatani},
url = {https://aclanthology.org/2024.sigdial-1.47/},
doi = {10.18653/v1/2024.sigdial-1.47},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue},
pages = {554--565},
publisher = {Association for Computational Linguistics},
address = {Kyoto, Japan},
abstract = {The availability of corpora annotated for discourse relations is limited and discourse relation classification performance varies greatly depending on both language and domain. This is a problem for downstream applications that are intended for a language (i.e., not English) or a domain (i.e., not financial news) with comparatively low coverage for discourse annotations. In this paper, we experiment with a state-of-the-art model for discourse relation classification, originally developed for English, extend it to a multi-lingual setting (testing on Italian, Portuguese and Turkish), and employ a simple, yet effective method to mark out-of-domain training instances. By doing so, we aim to contribute to better generalization and more robust discourse relation classification performance across both language and domain.},
pubstate = {published},
type = {inproceedings}
}

Project: B2

Sommerfeld, Linda

Predictive language processing in the complex visual world in children and adults PhD Thesis

Saarländische Universitäts- und Landesbibliothek, Saarland University, Saarbruecken, Germany, 2024.

Given the sentence “On sunny days, Rose rides to work with her …”, it is likely that you predict the word “bicycle” before reading it. Notably, not only adults but even children from an early age predict language, which is seen as one reason why language comprehension is remarkably fast and accurate. In everyday life, language is usually received in visual contexts, which can influence what a comprehender predicts. Imagine processing the above sentence while looking at the picture of a bicycle. This could make you even more likely to predict the noun “bicycle”. Thus, prediction research often applies the Visual World Paradigm. Here, participants listen to predictable sentences like the one above while looking at visual scenes that show one visual prediction option that is consistent with the predictive sentence context (e.g., a bicycle) and one distractor object that is not (e.g., a cake). When participants show an increase in fixations to the visual prediction option after the predictive cue was played (e.g., “ride”), but prior to the target noun, this indexes prediction. Cognitive models argue that visually situated prediction involves two mechanisms. Predictive linguistic cues (e.g., the semantically constraining verb “ride”) cause the pre-activation of the mental representations of prediction options such as “bicycle” in long-term memory. If a visual context allows the comprehender to commit to a prediction option, this option is pre-updated (i.e., pre-processed) in working memory. Given this, individual differences in verbal and cognitive abilities could influence visually situated prediction. That is, language experience could determine which long-term memory representations can be pre-activated, while working memory capacity could affect the ability to pre-update prediction options.
Since children have less language experience and a smaller working memory capacity than adults, we used a developmental approach and compared children and adults in their prediction behavior in the visual world to test the above model assumptions. First, we compared children and adults in their ability to make multiple predictions in parallel. With the Visual World Paradigm, adults have already been shown to rely on visual contexts to make multiple predictions: when hearing the sentence “Rose rides to work with her …” while looking at multiple “ridable” objects, adults have been shown to predict up to four sentence continuations in parallel. We examined whether children can also follow a multiple-predictions pattern, or whether their limited language experience and cognitive capacity prevent them from doing so. In addition, since working memory engages more mental resources when more stimuli are processed, we examined whether children and adults show an increase in cognitive load when pre-updating multiple rather than single prediction options in working memory. We examined whether this effect is more prominent in children given their smaller cognitive capacity. Finally, we investigated whether the processing load of a predictable target word (e.g., “bicycle”) is smaller when that word was pre-updated alone rather than among multiple competitors. In Chapter 1, we outline the theoretical background of this work. This is followed by an empirical section that addresses the above questions. We conducted two studies in which children and adults were presented with sentences with semantically constraining verbs and predictable target nouns (e.g., “The father eats the waffle”) in visual scenes of four object pictures each. Across four conditions, the scenes varied in predictability: either 0, 1, 3, or 4 visual objects were consistent with the verb constraints and thus viewed as visual prediction options. Chapter 2 presents a pretest of the sentences and the scenes with young children (4–6 years).
Experiment 1 was an eye-tracking study in which children (5–6 years) and adults listened to the sentences while looking at the visual scenes. In Chapter 3, we used their anticipatory object fixations as an index of prediction behavior. Chapter 4 presents data collected in the same study. Here, the Index of Cognitive Activity (ICA) and pupil sizes were used as measures of the cognitive load engaged in sentence processing in the different visual conditions. Chapter 5 presents Experiment 2, where literate children (8–12 years) and adults were presented with the same sentences and scenes in a self-paced reading task. They read the sentences word by word while inspecting the scenes. We relied on word processing times as an index of cognitive load. The anticipatory object fixations (Experiment 1) showed that children and adults followed a multiple-predictions pattern. For children, this ability was positively related to their language experience, supporting the view that prediction involves the pre-activation of mental representations in long-term memory. We found no consistent evidence that children and adults engaged higher cognitive load to make multiple predictions. Both age groups’ ICA and pupil size values did not suggest additional processing costs for multiple predictions (Experiment 1), but their word processing times did (Experiment 2). The latter result is in line with the view that prediction involves the pre-updating of input in the cognitive system. Finally, both studies found that children and adults engaged less processing load for target nouns that could be pre-updated alone rather than among multiple competitors. In sum, we provide indications that visual contexts can influence the ease of (predictive) language processing, which is discussed against the background of cognitive perspectives on prediction in Chapter 6. Here, we also consider which questions about predictive language processing still remain open, in particular for children.

Imagine the following sentence: “To get to work on sunny days, Rosa rides her …”. You probably anticipated the word “bicycle” without having read it. This is called predictive language processing and is seen as one reason for the remarkable accuracy and speed of language comprehension. Notably, not only adults but also children are able to predict language. In everyday life, language is often received in visual contexts, which influence prediction. Imagine hearing the above sentence while looking at a picture of a bicycle. This could increase the likelihood that you predict the word “bicycle”. Empirical studies of linguistic prediction therefore often use the Visual World Paradigm. Here, participants hear predictable sentences like the one above while viewing visual scenes. These typically show one visual prediction option (e.g., a picture of a bicycle) and one further object that is inconsistent with the predictive sentence context (e.g., a picture of a cake). This paradigm demonstrates linguistic prediction when participants show an increase in fixations on the visual prediction option relative to the inconsistent object after the predictive cue (e.g., “ride”) but before the target word (e.g., “bicycle”). Cognitive models postulate that two mechanisms are involved in prediction in visual contexts. Predictive linguistic cues (e.g., the verb “ride”) bring about the pre-activation of prediction options (e.g., means of transport) in long-term memory. If, in addition, a visual prediction option is available (e.g., a picture of a bicycle), this option is pre-processed in working memory. Consequently, verbal and cognitive abilities could influence linguistic prediction in visual contexts.
Language experience could thus determine which information can be pre-activated in long-term memory. Working memory capacity, by contrast, could influence the ability to pre-process prediction options. Since children have less language experience and a smaller working memory capacity than adults, this thesis used a developmental approach to test the above assumptions about linguistic prediction. First, children and adults were compared in their ability to make several predictions simultaneously. The Visual World Paradigm has already shown that adults use visual contexts to make multiple predictions: adults who hear the above example sentence while viewing several “ridable” objects have been shown to predict up to four potential target words simultaneously. This thesis examines whether children also make multiple predictions simultaneously or whether their limited language experience and cognitive capacity constrain such a prediction pattern. It further tests whether children and adults show a higher cognitive load when they pre-process several prediction options rather than only one. This would be plausible, since working memory generally demands more mental resources when it processes more information. In addition, it examines whether this effect is more pronounced in children than in adults because of their limited cognitive capacity. Finally, it determines whether more mental resources are needed to process a target word when that word was pre-processed together with further prediction options (rather than as the only option).
Chapter 1 presents the theoretical background of this thesis. It is followed by an empirical part that addresses the above questions. This part comprises two studies in which children and adults were shown sentences with predictive verbs and target words (e.g., “The father eats the waffle”). The sentences were presented together with visual scenes, each showing four pictures of objects. The scenes varied in their predictability: based on the predictive verb, 0, 1, 3, or 4 of the objects constituted a visual prediction option. Chapter 2 reports a study in which the sentences and scenes were normed with children (4–6 years). Experiment 1 was an eye-tracking study in which children (5–6 years) and adults viewed the scenes while the sentences were played to them. In Chapter 3, the participants’ object fixations were used as an index of prediction behavior. Chapter 4 presents data collected in the same study. Here, pupil size and the Index of Cognitive Activity (ICA) were used as measures of the cognitive load of sentence processing in the different visual conditions. Chapter 5 presents Experiment 2. Here, children (8–12 years) and adults were presented with the same sentences and scenes, but the sentences were shown on the screen within the scenes and read word by word. Word processing time was taken as a measure of cognitive load.
Based on the object fixations, Experiment 1 showed that both age groups made multiple predictions simultaneously. In children, this ability was positively related to their language experience. We found no consistent evidence that children and adults show a higher cognitive load when they make multiple predictions simultaneously. This effect was shown by the word processing times of both age groups (Experiment 2), but not by their pupil sizes and ICA data (Experiment 1). In both studies, children and adults showed a higher cognitive load when processing target words that were anticipated together with several prediction options (rather than as the only option). Overall, the results of this thesis show that visual contexts can influence predictive language processing and its ease. This is discussed in Chapter 6 against the background of cognitive models of prediction. Open questions about linguistic prediction, particularly in children, are also addressed there.

@phdthesis{Sommerfeld_Diss,
title = {Predictive language processing in the complex visual world in children and adults},
author = {Linda Sommerfeld},
url = {https://jahrbib.sulb.uni-saarland.de/handle/20.500.11880/37808},
doi = {10.22028/D291-42078},
year = {2024},
date = {2024},
school = {Saarland University},
publisher = {Saarl{\"a}ndische Universit{\"a}ts- und Landesbibliothek},
address = {Saarbruecken, Germany},
abstract = {Given the sentence “On sunny days, Rose rides to work with her …”, it is likely that you predict the word “bicycle” before reading it. Notably, not only adults, but even children from an early age predict language, which is seen as one reason of why language comprehension is remarkably fast and accurate. In everyday-life language is usually received in visual contexts which can influence what a comprehender predicts. Imagine processing the above sentence while looking at the picture of a bicycle. This could make you even more likely to predict the noun “bicycle”. Thus, prediction research often applies the Visual World Paradigm. Here, participants listen to predictable sentences like the above while looking at visual scenes that show one visual prediction option that is (e.g., bicycle) and one distractor object that is not (e.g., cake) consistent with the predictive sentence context. When participants show an increase in fixations to the visual prediction option after the predictive cue was played (e.g., “ride”), but prior to the target noun, this indexes prediction. Cognitive models argue that visually situated prediction involves two mechanisms. Predictive linguistic cues (e.g., the semantically constraining verb “ride”) cause the pre-activation of the mental representations of prediction options such as “bicycle” in long-term memory. If a visual context allows to commit to a prediction option, this option is pre-updated (i.e., pre-processed) in working memory. Given this, individual differences in verbal and cognitive abilities could influence visually situated prediction. That is, language experience could determine which long-term memory representations can be pre-activated, while working memory capacity could affect the ability to pre-update prediction options. 
Since children have smaller language experience and working memory capacity than adults, we used a developmental approach and compared children and adults in their prediction behavior in the visual world to test the above model assumptions. First, we compared children and adults in their ability to make multiple predictions in parallel. With the Visual World Paradigm, adults have already been shown to rely on visual contexts to make multiple predictions: When hearing the sentence (“Rose rides to work with her …”) while looking at multiple “ridable” objects, adults have been shown to predict up to four sentence continuations in parallel. We examined whether also children can follow a multiple predictions pattern, or whether their limited language experience and cognitive capacity prevent them from doing so. Besides, since working memory engages more mental resources when more stimuli are processed, we examined whether children and adults show an increase in cognitive load to pre-update multiple versus only single prediction options in working memory. We examined whether this effect is more prominent in children given their smaller cognitive capacity. We finally investigated whether processing load of a predictable target word (e.g., “bicycle”) is smaller when that word was pre-updated alone or among multiple competitors. In Chapter 1 we outline the theoretical background of this work. This is followed by an empirical section that addresses the above questions. We conducted two studies in which children and adults were presented with sentences with semantically constraining verbs and predictable target nouns (e.g., “The father eats the waffle”) in visual scenes of four object pictures each. Across four conditions, the scenes varied in predictability: Either 0, 1, 3, or 4 visual objects were consistent with the verb constraints and thus viewed as visual prediction options. Chapter 2 shows a pretest of the sentences and the scenes with young children (4–6 years). 
Experiment 1 was an eye-tracking study in which children (5–6) and adults listened to the sentences while looking at the visual scenes. In Chapter 3, we used their anticipatory object fixations as an index of prediction behavior. Chapter 4 presents data collected in the same study. Here, the Index of Cognitive Activity (ICA) and pupil sizes were used as a measure of cognitive load engaged in sentence processing in the different visual conditions. Chapter 5 presents Experiment 2, where literate children (8–12 years) and adults were presented with the same sentences and scenes in a self-paced reading task. They read the sentences word-by-word while inspecting the scenes. We relied on word processing times as an index of cognitive load. Their anticipatory object fixations (Experiment 1) showed that children and adults followed a multiple predictions pattern. For children, this ability was positively related to their language experience, supporting the view that prediction involves the pre-activation of mental representations in long-term memory. We found no consistent evidence of whether children and adults engaged higher cognitive load to make multiple predictions. Both age groups’ ICA and pupil size values did not (Experiment 1) but their word processing times did (Experiment 2) suggest additional processing costs for multiple predictions. The latter result is in line with the view that prediction involves the pre-updating of input in the cognitive system. Finally, both studies found children and adults to engage less processing load for target nouns that could be pre-updated alone versus among multiple competitors. In sum, we provide indication that visual contexts can influence the ease of (predictive) language processing, which is discussed beyond cognitive perspectives of prediction in Chapter 6. Here, we also consider which questions about predictive language processing still remain open, in particular for children.

Stellen Sie sich folgenden Satz vor: „Um an sonnigen Tagen zur Arbeit zu kommen, f{\"a}hrt Rosa mit ihrem ...“. Vermutlich haben Sie das Wort „Fahrrad“ antizipiert, ohne es gelesen zu haben. Dies wird pr{\"a}diktive Sprachverarbeitung genannt und als ein Grund f{\"u}r die enorme Genauigkeit und Geschwindigkeit des Sprachverst{\"a}ndnisses gesehen. Bemerkenswerterweise weisen nicht nur Erwachsene, sondern auch Kinder, die F{\"a}higkeit zur sprachlichen Vorhersage auf. Im Alltag wird Sprache oft in visuellen Kontexten rezipiert, welche die Vorhersage beeinflussen. Stellen Sie sich vor, Sie h{\"o}ren obigen Satz, w{\"a}hrend Sie das Bild eines Fahrrades betrachten. Dies k{\"o}nnte die Wahrscheinlichkeit erh{\"o}hen, dass Sie das Wort „Fahrrad“ vorhersagen. Empirische Studien zur sprachlichen Vorhersage nutzen daher h{\"a}ufig das Visual World Paradigma. Hier h{\"o}ren Versuchspersonen vorhersagbare S{\"a}tze, wie den obigen, w{\"a}hrend sie visuelle Szenen betrachten. Diese zeigen typischerweise eine visuelle Vorhersageoption (z.B. das Bild eines Fahrrades) und ein weiteres Objekt, das inkonsistent mit dem pr{\"a}diktiven Satzkontext ist (z.B. das Bild eines Kuchens). Dieses Paradigma weist sprachliche Vorhersage nach, wenn Versuchspersonen bereits nach dem pr{\"a}diktive Hinweisreiz (z.B. „fahren“) und vor dem Zielwort (z.B. „Fahrrad“) einen Anstieg an Fixationen der visuellen Vorhersageoption im Vergleich zum inkonsistenten Objekt zeigen. Kognitive Modelle postulieren, dass zwei Mechanismen an der Vorhersage im visuellen Kontext beteiligt sind. Pr{\"a}diktive sprachliche Hinweisreize (z.B. das Verb „fahren“) erwirken die Voraktivierung von Vorhersageoptionen (z.B. Fortbewegungsmitteln) im Langzeitged{\"a}chtnis. Wenn zudem eine visuelle Vorhersageoption verf{\"u}gbar ist (z.B. das Bild eines Fahrrades), wird diese Option im Arbeitsged{\"a}chtnis vorverarbeitet. 
Infolgedessen k{\"o}nnten verbale und kognitive F{\"a}higkeiten die sprachliche Vorhersage im visuellen Kontext beeinflussen. So k{\"o}nnte die Spracherfahrung bestimmen, welche Informationen im Langzeitged{\"a}chtnis voraktiviert werden k{\"o}nnen. Die Arbeitsged{\"a}chtniskapazit{\"a}t hingegen k{\"o}nnte die F{\"a}higkeit zur Vorverarbeitung von Vorhersageoptionen beeinflussen. Da Kinder im Vergleich zu Erwachsenen {\"u}ber eine geringere Spracherfahrung sowie Kapazit{\"a}t des Arbeitsged{\"a}chtnisses verf{\"u}gen, nutzte diese Arbeit einen entwicklungspsychologischen Ansatz, um obige Annahmen zur sprachlichen Vorhersage zu pr{\"u}fen. Zun{\"a}chst wurden Kinder und Erwachsene in ihrer F{\"a}higkeit verglichen, mehrere Vorhersagen gleichzeitig zu treffen. Mit dem Visual World Paradigma wurde bereits gezeigt, dass Erwachsene visuelle Kontexte nutzen, um mehrere Vorhersagen zu treffen: Erwachsene, die obigen Beispielsatz h{\"o}ren und gleichzeitig mehrere „fahrbare“ Objekte betrachten, konnten nachweislich bis zu vier potentielle Zielw{\"o}rter gleichzeitig vorhersagen. Diese Arbeit untersucht, ob auch Kinder mehrere Vorhersagen gleichzeig treffen oder ob ihre geringe Spracherfahrung und kognitive Kapazit{\"a}t ein solches Muster der Vorhersage einschr{\"a}nken. Weiterhin wird gepr{\"u}ft, ob Kinder und Erwachsene eine h{\"o}here kognitive Belastung zeigen, wenn sie mehrere, statt nur einer Vorhersageoption, vorverarbeiten. Dies w{\"a}re plausibel, da das Arbeitsged{\"a}chtnis in der Regel mehr mentale Ressourcen beansprucht, wenn es mehr Informationen verarbeitet. Zudem wird untersucht, ob dieser Effekt bei Kindern aufgrund ihrer geringen kognitiven Kapazit{\"a}t st{\"a}rker ausgepr{\"a}gt ist als bei Erwachsenen. Zuletzt wird ermittelt, ob mehr mentale Ressourcen zur Verarbeitung eines Zielwortes ben{\"o}tigt werden, wenn dieses Wort mit weiteren Vorhersageoptionen (statt als einzige Option) vorverarbeitet wurde. 
Kapitel 1 pr{\"a}sentiert den theoretischen Hintergrund dieser Arbeit. Es folgt ein empirischer Teil, in dem obige Fragen adressiert werden. Dieser umfasst zwei Studien, in denen Kindern und Erwachsenen S{\"a}tze mit pr{\"a}diktiven Verben und Zielw{\"o}rtern gezeigt wurden (z.B. „Der Vater isst die Waffel“). Die S{\"a}tze wurden zusammen mit visuellen Szenen pr{\"a}sentiert, die jeweils vier Bilder von Objekten zeigten. Die Szenen variierten in ihrer Vorhersagbarkeit: Basierend auf dem pr{\"a}diktiven Verb stellten 0, 1, 3 oder 4 der Objekte eine visuelle Vorhersageoption dar. Kapitel 2 zeigt eine Studie, in der die S{\"a}tze und Szenen mit Kindern (4–6 Jahre) normiert wurden. Experiment 1 war eine Eye-Tracking Studie, in der Kinder (5–6 Jahre) und Erwachsene die Szenen betrachteten, w{\"a}hrend ihnen die S{\"a}tze vorgespielt wurden. In Kapitel 3 wurden die Objektfixationen der Versuchspersonen als Index f{\"u}r das Vorhersageverhalten verwendet. Kapitel 4 pr{\"a}sentiert Daten, die in derselben Studie erhoben wurden. Hier wurde die Pupillengr{\"o}{\ss}e sowie der Index of Cognitive Activity (ICA) als Ma{\ss} f{\"u}r die kognitive Belastung der Satzverarbeitung in den verschiedenen visuellen Konditionen verwendet. Kapitel 5 pr{\"a}sentiert Experiment 2. Hier wurden Kindern (8–12 Jahre) und Erwachsenen dieselben S{\"a}tze und Szenen pr{\"a}sentiert, jedoch wurden die S{\"a}tze auf dem Bildschirm innerhalb der Szenen gezeigt und Wort f{\"u}r Wort gelesen. Die Wortverarbeitungszeit wurde als Ma{\ss} f{\"u}r die kognitive Belastung gewertet. Anhand der Objektfixationen zeigte Experiment 1, dass beide Altersgruppen mehrere Vorhersagen gleichzeitig trafen. Bei Kindern stand diese F{\"a}higkeit in positiver Relation zu ihrer Spracherfahrung. Wir fanden keine konsistente Evidenz, dass Kinder und Erwachsene eine h{\"o}here kognitive Belastung zeigen, wenn sie mehrere Vorhersagen gleichzeitig treffen. 
Dieser Effekt wurde durch die Wortverarbeitungszeiten beider Altersgruppen nachgewiesen (Experiment 2), nicht jedoch durch ihre Pupillengr{\"o}{\ss}en und ICA-Daten (Experiment 1). In beiden Studien zeigten Kinder und Erwachsene eine h{\"o}here kognitive Belastung bei der Verarbeitung von Zielw{\"o}rtern, die mit mehreren Vorhersageoptionen (statt als einzige Option) antizipiert wurden. Insgesamt zeigen die Ergebnisse dieser Arbeit, dass visuelle Kontexte einen Einfluss auf die pr{\"a}diktive Sprachverarbeitung und ihre Leichtigkeit haben k{\"o}nnen. Dies wird in Kapitel 6 vor dem Hintergrund kognitiver Modelle der Vorhersage diskutiert. Hier werden zudem offene Fragen zur sprachlichen Vorhersage, insbesondere bei Kindern, thematisiert.},
pubstate = {published},
type = {phdthesis}
}

Project: A5

Kunilovskaya, Maria; Dutta Chowdhury, Koel; Przybyl, Heike; España-Bonet, Cristina; van Genabith, Josef

Mitigating Translationese with GPT-4: Strategies and Performance Inproceedings

Proceedings of the 25th Annual Conference of the European Association for Machine Translation, 1, European Association for Machine Translation, pp. 411–430, 2024.

Translations differ in systematic ways from texts originally authored in the same language. These differences, collectively known as translationese, can pose challenges in cross-lingual natural language processing: models trained or tested on translated input might struggle when presented with non-translated language. Translationese mitigation can alleviate this problem. This study investigates the generative capacities of GPT-4 to reduce translationese in human-translated texts. The task is framed as a rewriting process aimed at modified translations indistinguishable from the original text in the target language. Our focus is on prompt engineering that tests the utility of linguistic knowledge as part of the instruction for GPT-4. Through a series of prompt design experiments, we show that GPT4-generated revisions are more similar to originals in the target language when the prompts incorporate specific linguistic instructions instead of relying solely on the model’s internal knowledge. Furthermore, we release the segment-aligned bidirectional German–English data built from the Europarl corpus that underpins this study.

https://eamt2024.github.io/proceedings/vol1.pdf

@inproceedings{kunilovskaya-etal-2024-mitigating,
title = {Mitigating Translationese with GPT-4: Strategies and Performance},
author = {Maria Kunilovskaya and Koel Dutta Chowdhury and Heike Przybyl and Cristina Espa{\~n}a-Bonet and Josef van Genabith},
url = {https://eamt2024.github.io/proceedings/vol1.pdf},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 25th Annual Conference of the European Association for Machine Translation},
pages = {411--430},
publisher = {European Association for Machine Translation},
abstract = {Translations differ in systematic ways from texts originally authored in the same language. These differences, collectively known as translationese, can pose challenges in cross-lingual natural language processing: models trained or tested on translated input might struggle when presented with non-translated language. Translationese mitigation can alleviate this problem. This study investigates the generative capacities of GPT-4 to reduce translationese in human-translated texts. The task is framed as a rewriting process aimed at modified translations indistinguishable from the original text in the target language. Our focus is on prompt engineering that tests the utility of linguistic knowledge as part of the instruction for GPT-4. Through a series of prompt design experiments, we show that GPT4-generated revisions are more similar to originals in the target language when the prompts incorporate specific linguistic instructions instead of relying solely on the model’s internal knowledge. Furthermore, we release the segment-aligned bidirectional German–English data built from the Europarl corpus that underpins this study.},
pubstate = {published},
type = {inproceedings}
}

Projects: B6 B7

Bafna, Niyati; España-Bonet, Cristina; van Genabith, Josef; Sagot, Benoît; Bawden, Rachel

When Your Cousin Has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages Inproceedings

Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani; Xue, Nianwen (Ed.): Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, pp. 17544-17556, Torino, Italia, 2024.

Most existing approaches for unsupervised bilingual lexicon induction (BLI) depend on good quality static or contextual embeddings requiring large monolingual corpora for both languages. However, unsupervised BLI is most likely to be useful for low-resource languages (LRLs), where large datasets are not available. Often we are interested in building bilingual resources for LRLs against related high-resource languages (HRLs), resulting in severely imbalanced data settings for BLI. We first show that state-of-the-art BLI methods in the literature exhibit near-zero performance for severely data-imbalanced language pairs, indicating that these settings require more robust techniques. We then present a new method for unsupervised BLI between a related LRL and HRL that only requires inference on a masked language model of the HRL, and demonstrate its effectiveness on truly low-resource languages Bhojpuri and Magahi (with <5M monolingual tokens each), against Hindi. We further present experiments on (mid-resource) Marathi and Nepali to compare approach performances by resource range, and release our resulting lexicons for five low-resource Indic languages: Bhojpuri, Magahi, Awadhi, Braj, and Maithili, against Hindi.

@inproceedings{bafna-etal-2024-cousin-right,
title = {When Your Cousin Has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages},
author = {Niyati Bafna and Cristina Espa{\~n}a-Bonet and Josef van Genabith and Benoît Sagot and Rachel Bawden},
editor = {Nicoletta Calzolari and Min-Yen Kan and Veronique Hoste and Alessandro Lenci and Sakriani Sakti and Nianwen Xue},
url = {https://aclanthology.org/2024.lrec-main.1526},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
pages = {17544-17556},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {Most existing approaches for unsupervised bilingual lexicon induction (BLI) depend on good quality static or contextual embeddings requiring large monolingual corpora for both languages. However, unsupervised BLI is most likely to be useful for low-resource languages (LRLs), where large datasets are not available. Often we are interested in building bilingual resources for LRLs against related high-resource languages (HRLs), resulting in severely imbalanced data settings for BLI. We first show that state-of-the-art BLI methods in the literature exhibit near-zero performance for severely data-imbalanced language pairs, indicating that these settings require more robust techniques. We then present a new method for unsupervised BLI between a related LRL and HRL that only requires inference on a masked language model of the HRL, and demonstrate its effectiveness on truly low-resource languages Bhojpuri and Magahi (with <5M monolingual tokens each), against Hindi. We further present experiments on (mid-resource) Marathi and Nepali to compare approach performances by resource range, and release our resulting lexicons for five low-resource Indic languages: Bhojpuri, Magahi, Awadhi, Braj, and Maithili, against Hindi.},
pubstate = {published},
type = {inproceedings}
}

Project: B6

Manzoni-Luxenburger, Judith; Andreeva, Bistra; Zahner-Ritter, Katharina

Intonational Patterns under Time Pressure: Phonetic Strategies in Bulgarian Learners of German and English Inproceedings

Proc. Speech Prosody 2024, pp. 369-373, 2024.

Research on the second-language (L2) acquisition of intonation is a growing field but only few studies have (so far) focused on the fine phonetic detail of intonational patterns in the L2. The present study concentrates on the phonetic realization of nuclear intonation contours under time pressure, testing Bulgarian learners in their L2s German and English – two languages in which intonation contours are accommodated differently by native speakers (L1) when little sonorant material is available. In particular, nuclear falling contours (H* L-%) tend to be truncated in L1 German while they are compressed in L1 English. Here we recorded 14 Bulgarian learners in their L2s German and English (within subjects, language order counterbalanced) when producing utterances in a statement context. The target word, a surname placed at the end of the utterance, differed in the available sonorant material (disyllable vs. monosyllables with long and short vowels). Our findings showed that Bulgarian speakers primarily truncate nuclear falling movements ((L+)H* L-%) in both L2s, suggesting transfer irrespective of the target strategy. However, our data show substantial inter- and intra-individual variation which we will discuss, along with factors that might explain this variation.

@inproceedings{manzoniluxenburger24_speechprosody,
title = {Intonational Patterns under Time Pressure: Phonetic Strategies in Bulgarian Learners of German and English},
author = {Judith Manzoni-Luxenburger and Bistra Andreeva and Katharina Zahner-Ritter},
url = {https://www.isca-archive.org/speechprosody_2024/manzoniluxenburger24_speechprosody.html},
doi = {10.21437/SpeechProsody.2024-75},
year = {2024},
date = {2024},
booktitle = {Proc. Speech Prosody 2024},
pages = {369-373},
abstract = {Research on the second-language (L2) acquisition of intonation is a growing field but only few studies have (so far) focused on the fine phonetic detail of intonational patterns in the L2. The present study concentrates on the phonetic realization of nuclear intonation contours under time pressure, testing Bulgarian learners in their L2s German and English – two languages in which intonation contours are accommodated differently by native speakers (L1) when little sonorant material is available. In particular, nuclear falling contours (H* L-%) tend to be truncated in L1 German while they are compressed in L1 English. Here we recorded 14 Bulgarian learners in their L2s German and English (within subjects, language order counterbalanced) when producing utterances in a statement context. The target word, a surname placed at the end of the utterance, differed in the available sonorant material (disyllable vs. monosyllables with long and short vowels). Our findings showed that Bulgarian speakers primarily truncate nuclear falling movements ((L+)H* L-%) in both L2s, suggesting transfer irrespective of the target strategy. However, our data show substantial inter- and intra-individual variation which we will discuss, along with factors that might explain this variation.},
pubstate = {published},
type = {inproceedings}
}

Project: C1

Yuen, Ivan; Andreeva, Bistra; Ibrahim, Omnia; Möbius, Bernd

Differential effects of word frequency and utterance position on the duration of tense and lax vowels in German Inproceedings

Proc. Speech Prosody 2024, pp. 442-446, Leiden, The Netherlands, 2024.

Acoustic duration is subject to modification from multiple sources, for example, utterance position [13] and predictability such as occurrence frequency at word and syllable levels [e.g., 2, 3, 4]. A study of German radio corpus data showed that these two sources interact to modify syllable duration. On the one hand, the predictability effect can percolate downstream to the segmental level, and this downstream effect is sensitive to phonological contrasts [9]. On the other, [6] showed that utterance-final lengthening is uniformly applied to tense and lax vowels in German. This then raises some questions as to whether the effects of the two sources of durational variation are uniformly applied or sensitive to phonological contrasts. The current study focused on the duration of tense and lax vowels in the stressed syllable of monosyllabic and disyllabic words in utterance-medial and utterance-final positions. Twenty German speakers participated in a question-answer elicitation task. A preliminary analysis of seven speakers showed effects of utterance position and word frequency, as well as interactions with vowel type, suggesting a non-uniform application of durational adjustments contingent on phonological vowel length. Interestingly, the frequency effect affects the duration of lax vowels, but utterance position affects the duration of tense vowels.

@inproceedings{Yuen/etal:2024a,
title = {Differential effects of word frequency and utterance position on the duration of tense and lax vowels in German},
author = {Ivan Yuen and Bistra Andreeva and Omnia Ibrahim and Bernd M{\"o}bius},
url = {https://www.isca-archive.org/speechprosody_2024/yuen24_speechprosody.html},
doi = {10.21437/SpeechProsody.2024-90},
year = {2024},
date = {2024},
booktitle = {Proc. Speech Prosody 2024},
pages = {442-446},
address = {Leiden, The Netherlands},
abstract = {Acoustic duration is subject to modification from multiple sources, for example, utterance position [13] and predictability such as occurrence frequency at word and syllable levels [e.g., 2, 3, 4]. A study of German radio corpus data showed that these two sources interact to modify syllable duration. On the one hand, the predictability effect can percolate downstream to the segmental level, and this downstream effect is sensitive to phonological contrasts [9]. On the other, [6] showed that utterance-final lengthening is uniformly applied to tense and lax vowels in German. This then raises some questions as to whether the effects of the two sources of durational variation are uniformly applied or sensitive to phonological contrasts. The current study focused on the duration of tense and lax vowels in the stressed syllable of monosyllabic and disyllabic words in utterance-medial and utterance-final positions. Twenty German speakers participated in a question-answer elicitation task. A preliminary analysis of seven speakers showed effects of utterance position and word frequency, as well as interactions with vowel type, suggesting a non-uniform application of durational adjustments contingent on phonological vowel length. Interestingly, the frequency effect affects the duration of lax vowels, but utterance position affects the duration of tense vowels.},
pubstate = {published},
type = {inproceedings}
}

Project: C1

Chingacham, Anupama; Zhang, Miaoran; Demberg, Vera; Klakow, Dietrich

Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It? Inproceedings

Soni, Nikita; Flek, Lucie; Sharma, Ashish; Yang, Diyi; Hooker, Sara; Schwartz, H. Andrew (Ed.): Proceedings of the 1st Human-Centered Large Language Modeling Workshop, ACL, pp. 1-15, TBD, 2024.

Large Language Models (LLMs) can generate text by transferring style attributes like formality resulting in formal or informal text. However, instructing LLMs to generate text that when spoken, is more intelligible in an acoustically difficult environment, is an under-explored topic. We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise. Our experiments in English demonstrated that with standard prompting, LLMs struggle to control the non-textual attribute, i.e., acoustic intelligibility, while efficiently capturing the desired textual attributes like semantic equivalence. To remedy this issue, we propose a simple prompting approach, prompt-and-select, which generates paraphrases by decoupling the desired textual and non-textual attributes in the text generation pipeline. Our approach resulted in a 40% relative improvement in human speech perception, by paraphrasing utterances that are highly distorted in a listening condition with babble noise at signal-to-noise ratio (SNR) -5 dB. This study reveals the limitation of LLMs in capturing non-textual attributes, and our proposed method showcases the potential of using LLMs for better human speech perception in noise.

2024.hucllm-1.1 (0.41MB)
https://aclanthology.org/2024.hucllm-1.1

@inproceedings{chingacham-etal-2024-human,
title = {Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It?},
author = {Anupama Chingacham and Miaoran Zhang and Vera Demberg and Dietrich Klakow},
editor = {Nikita Soni and Lucie Flek and Ashish Sharma and Diyi Yang and Sara Hooker and H. Andrew Schwartz},
url = {https://aclanthology.org/2024.hucllm-1.1},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 1st Human-Centered Large Language Modeling Workshop},
pages = {1-15},
publisher = {ACL},
address = {TBD},
abstract = {Large Language Models (LLMs) can generate text by transferring style attributes like formality resulting in formal or informal text. However, instructing LLMs to generate text that when spoken, is more intelligible in an acoustically difficult environment, is an under-explored topic. We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise. Our experiments in English demonstrated that with standard prompting, LLMs struggle to control the non-textual attribute, i.e., acoustic intelligibility, while efficiently capturing the desired textual attributes like semantic equivalence. To remedy this issue, we propose a simple prompting approach, prompt-and-select, which generates paraphrases by decoupling the desired textual and non-textual attributes in the text generation pipeline. Our approach resulted in a 40% relative improvement in human speech perception, by paraphrasing utterances that are highly distorted in a listening condition with babble noise at signal-to-noise ratio (SNR) -5 dB. This study reveals the limitation of LLMs in capturing non-textual attributes, and our proposed method showcases the potential of using LLMs for better human speech perception in noise.},
pubstate = {published},
type = {inproceedings}
}

Project: A4

Verkerk, Annemarie; Talamo, Luigi

mini-CIEP+ : A Shareable Parallel Corpus of Prose Inproceedings

Zweigenbaum, Pierre; Rapp, Reinhard; Sharoff, Serge (Ed.): Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024, ELRA and ICCL, pp. 135-143, Torino, Italia, 2024.

In this paper we present mini-CIEP+, a shareable parallel corpus of prose. mini-CIEP+ consists of the first part of ten different works of prose across many different languages, allowing for the cross-linguistic investigation of larger discourse units. Subcorpora typically contain 5750 sentences and almost 125K tokens. Subcorpora have dependency grammar annotation based on the Universal Dependencies standard (de Marneffe et al., 2021). mini-CIEP+ version 1.0 is available in 35 languages, with the aim of increasing the sample to 50 languages. It is shareable due to recent developments in German law, which allow researchers to share up to 15% of copyrighted material with a select group of people for their own research. Hence, mini-CIEP+ is not publicly available, but is rather shareable in a modular fashion with select researchers. We additionally describe future plans for further annotation of mini-CIEP+ as well as its limitations.

2024.bucc-1.15 (0.23MB)
https://aclanthology.org/2024.bucc-1.15

@inproceedings{verkerk-talamo-2024-mini,
title = {mini-CIEP+ : A Shareable Parallel Corpus of Prose},
author = {Annemarie Verkerk and Luigi Talamo},
editor = {Pierre Zweigenbaum and Reinhard Rapp and Serge Sharoff},
url = {https://aclanthology.org/2024.bucc-1.15},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024},
pages = {135-143},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {In this paper we present mini-CIEP+, a shareable parallel corpus of prose. mini-CIEP+ consists of the first part of ten different works of prose across many different languages, allowing for the cross-linguistic investigation of larger discourse units. Subcorpora typically contain 5750 sentences and almost 125K tokens. Subcorpora have dependency grammar annotation based on the Universal Dependencies standard (de Marneffe et al., 2021). mini-CIEP+ version 1.0 is available in 35 languages, with the aim of increasing the sample to 50 languages. It is shareable due to recent developments in German law, which allow researchers to share up to 15% of copyrighted material with a select group of people for their own research. Hence, mini-CIEP+ is not publicly available, but is rather shareable in a modular fashion with select researchers. We additionally describe future plans for further annotation of mini-CIEP+ as well as its limitations.},
pubstate = {published},
type = {inproceedings}
}

Project: C7

Jablotschkin, Sarah; Teich, Elke; Zinsmeister, Heike

DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis Inproceedings

Raya Chakravarthi, Bharathi; B, Bharathi; Buitelaar, Paul; Durairaj, Thenmozhi; Kovács, György; Ángel García Cumbreras, Miguel (Ed.): Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion, Association for Computational Linguistics, pp. 106-117, St. Julians, Malta, 2024.

In this paper, we report on a new corpus of simplified German. Public agencies in Germany have recently been asked to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools, consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step in gaining insights into these issues and to further LT development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler Divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctivity of the two language variants.

2024.ltedi-1.9 (0.27MB)
https://aclanthology.org/2024.ltedi-1.9

@inproceedings{jablotschkin-etal-2024-de,
title = {DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis},
author = {Sarah Jablotschkin and Elke Teich and Heike Zinsmeister},
editor = {Bharathi Raya Chakravarthi and Bharathi B and Paul Buitelaar and Thenmozhi Durairaj and Gy{\"o}rgy Kov{\'a}cs and Miguel {\'A}ngel Garc{\'i}a Cumbreras},
url = {https://aclanthology.org/2024.ltedi-1.9},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion},
pages = {106-117},
publisher = {Association for Computational Linguistics},
address = {St. Julians, Malta},
abstract = {In this paper, we report on a new corpus of simplified German. Public agencies in Germany have recently been asked to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools, consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step in gaining insights into these issues and to further LT development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler Divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctivity of the two language variants.},
pubstate = {published},
type = {inproceedings}
}

Project: T1

Fischer, Stefan; Haidarzhyi, Kateryna; Knappen, Jörg; Polishchuk, Olha; Stodolinska, Yuliya; Teich, Elke

A Contemporary News Corpus of Ukrainian (CNC-UA): Compilation, Annotation, Publication Inproceedings

Romanyshyn, Mariana; Romanyshyn, Nataliia; Hlybovets, Andrii; Ignatenko, Oleksii (Ed.): Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024, ELRA and ICCL, pp. 1-7, Torino, Italia, 2024.

We present a corpus of contemporary Ukrainian news articles published between 2019 and 2022 on the news website of the national public broadcaster of Ukraine, commonly known as SUSPILNE. The current release comprises 87 210 364 words in 292 955 texts. Texts are annotated with titles and their time of publication. In addition, the corpus has been linguistically annotated at the token level with a dependency parser. To provide further aspects for investigation, a topic model was trained on the corpus. The corpus is hosted (Fischer et al., 2023) at the Saarbrücken CLARIN center under a CC BY-NC-ND 4.0 license and available in two tab-separated formats: CoNLL-U (de Marneffe et al., 2021) and vertical text format (VRT) as used by the IMS Open Corpus Workbench (CWB; Evert and Hardie, 2011) and CQPweb (Hardie, 2012). We show examples of using the CQPweb interface, which allows users to extract the quantitative data necessary for distributional and collocation analyses of the CNC-UA. As the CNC-UA contains news texts documenting recent events, it is highly relevant not only for linguistic analyses of the modern Ukrainian language but also for socio-cultural and political studies.

2024.unlp-1.1 (0.67MB)
https://aclanthology.org/2024.unlp-1.1

@inproceedings{fischer-etal-2024-contemporary,
title = {A Contemporary News Corpus of Ukrainian (CNC-UA): Compilation, Annotation, Publication},
author = {Stefan Fischer and Kateryna Haidarzhyi and J{\"o}rg Knappen and Olha Polishchuk and Yuliya Stodolinska and Elke Teich},
editor = {Mariana Romanyshyn and Nataliia Romanyshyn and Andrii Hlybovets and Oleksii Ignatenko},
url = {https://aclanthology.org/2024.unlp-1.1},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024},
pages = {1-7},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {We present a corpus of contemporary Ukrainian news articles published between 2019 and 2022 on the news website of the national public broadcaster of Ukraine, commonly known as SUSPILNE. The current release comprises 87 210 364 words in 292 955 texts. Texts are annotated with titles and their time of publication. In addition, the corpus has been linguistically annotated at the token level with a dependency parser. To provide further aspects for investigation, a topic model was trained on the corpus. The corpus is hosted (Fischer et al., 2023) at the Saarbr{\"u}cken CLARIN center under a CC BY-NC-ND 4.0 license and available in two tab-separated formats: CoNLL-U (de Marneffe et al., 2021) and vertical text format (VRT) as used by the IMS Open Corpus Workbench (CWB; Evert and Hardie, 2011) and CQPweb (Hardie, 2012). We show examples of using the CQPweb interface, which allows users to extract the quantitative data necessary for distributional and collocation analyses of the CNC-UA. As the CNC-UA contains news texts documenting recent events, it is highly relevant not only for linguistic analyses of the modern Ukrainian language but also for socio-cultural and political studies.},
pubstate = {published},
type = {inproceedings}
}

Project: B7

Menzel, Katrin

Exploring Word Formation Trends in Written, Spoken, Translated and Interpreted European Parliament Data - A Case Study on Initialisms in English and German Inproceedings

Fiser, Darja; Eskevich, Maria; Bordon, David (Ed.): Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024, ELRA and ICCL, pp. 57-65, Torino, Italia, 2024.

This paper demonstrates the research potential of a unique European Parliament dataset for register studies, contrastive linguistics, translation and interpreting studies. The dataset consists of parallel data for several European languages, including written source texts and their translations as well as spoken source texts and the transcripts of their simultaneously interpreted versions. The paper presents a cross-linguistic, corpus-based case study on a word formation phenomenon in these European Parliament data that are enriched with various linguistic annotations and metadata as well as with information-theoretic surprisal scores. It addresses the questions of how initialisms are used across languages and production modes in the English and German corpus sections of these European Parliament data, whether there is a correlation between the use of initialisms and the use of their corresponding multiword full forms in the analysed corpus sections and what insights on the informativity and possible processing difficulties of initialisms we can gain from an analysis of information-theoretic surprisal values. The results show that English written originals and German translations are the corpus sections with the highest frequencies of initialisms. The majority of cross-language transfer situations lead to fewer initialisms in the target texts than in the source texts. In the English data, there is a positive correlation between the frequency of initialisms and the frequency of the respective full forms. There is a similar correlation in the German data, apart from the interpreted data. Additionally, the results show that initialisms represent peaks of information with regard to their surprisal values within their segments. 
Particularly the German data show higher surprisal values of initialisms in mediated language than in non-mediated discourse types, which indicates that in German mediated discourse, initialisms tend to be used in less conventionalised textual contexts than in English.

@inproceedings{menzel-2024-exploring,
title = {Exploring Word Formation Trends in Written, Spoken, Translated and Interpreted European Parliament Data - A Case Study on Initialisms in English and German},
author = {Katrin Menzel},
editor = {Darja Fiser and Maria Eskevich and David Bordon},
url = {https://aclanthology.org/2024.parlaclarin-1.9},
year = {2024},
date = {2024},
booktitle = {Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024},
pages = {57-65},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {This paper demonstrates the research potential of a unique European Parliament dataset for register studies, contrastive linguistics, translation and interpreting studies. The dataset consists of parallel data for several European languages, including written source texts and their translations as well as spoken source texts and the transcripts of their simultaneously interpreted versions. The paper presents a cross-linguistic, corpus-based case study on a word formation phenomenon in these European Parliament data that are enriched with various linguistic annotations and metadata as well as with information-theoretic surprisal scores. It addresses the questions of how initialisms are used across languages and production modes in the English and German corpus sections of these European Parliament data, whether there is a correlation between the use of initialisms and the use of their corresponding multiword full forms in the analysed corpus sections and what insights on the informativity and possible processing difficulties of initialisms we can gain from an analysis of information-theoretic surprisal values. The results show that English written originals and German translations are the corpus sections with the highest frequencies of initialisms. The majority of cross-language transfer situations lead to fewer initialisms in the target texts than in the source texts. In the English data, there is a positive correlation between the frequency of initialisms and the frequency of the respective full forms. There is a similar correlation in the German data, apart from the interpreted data. Additionally, the results show that initialisms represent peaks of information with regard to their surprisal values within their segments. 
Particularly the German data show higher surprisal values of initialisms in mediated language than in non-mediated discourse types, which indicates that in German mediated discourse, initialisms tend to be used in less conventionalised textual contexts than in English.},
pubstate = {published},
type = {inproceedings}
}

Project: B7

Alves, Diego; Degaetano-Ortlieb, Stefania; Schmidt, Elena; Teich, Elke

Diachronic Analysis of Multi-word Expression Functional Categories in Scientific English Inproceedings

Bhatia, Archna; Bouma, Gosse; Dogruoz, A. Seza; Evang, Kilian; Garcia, Marcos; Giouli, Voula; Han, Lifeng; Nivre, Joakim; Rademaker, Alexandre (Ed.): Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024, ELRA and ICCL, pp. 81-87, Torino, Italia, 2024.

We present a diachronic analysis of multi-word expressions (MWEs) in English based on the Royal Society Corpus, a dataset containing 300+ years of the scientific publications of the Royal Society of London. Specifically, we investigate the functions of MWEs, such as stance markers (“it is interesting”) or discourse organizers (“in this section”), and their development over time. Our approach is multi-disciplinary: to detect MWEs we use Universal Dependencies, to classify them functionally we use an approach from register linguistics, and to assess their role in diachronic development we use an information-theoretic measure, relative entropy.

2024.mwe-1.12 (0.52MB)
https://aclanthology.org/2024.mwe-1.12

@inproceedings{alves-etal-2024-diachronic,
title = {Diachronic Analysis of Multi-word Expression Functional Categories in Scientific English},
author = {Diego Alves and Stefania Degaetano-Ortlieb and Elena Schmidt and Elke Teich},
editor = {Archna Bhatia and Gosse Bouma and A. Seza Dogruoz and Kilian Evang and Marcos Garcia and Voula Giouli and Lifeng Han and Joakim Nivre and Alexandre Rademaker},
url = {https://aclanthology.org/2024.mwe-1.12},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024},
pages = {81-87},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {We present a diachronic analysis of multi-word expressions (MWEs) in English based on the Royal Society Corpus, a dataset containing 300+ years of the scientific publications of the Royal Society of London. Specifically, we investigate the functions of MWEs, such as stance markers (“it is interesting”) or discourse organizers (“in this section”), and their development over time. Our approach is multi-disciplinary: to detect MWEs we use Universal Dependencies, to classify them functionally we use an approach from register linguistics, and to assess their role in diachronic development we use an information-theoretic measure, relative entropy.},
pubstate = {published},
type = {inproceedings}
}

Project: B1

Bagdasarov, Sergei; Degaetano-Ortlieb, Stefania

Applying Information-theoretic Notions to Measure Effects of the Plain English Movement on English Law Reports and Scientific Articles Inproceedings

Bizzoni, Yuri; Degaetano-Ortlieb, Stefania; Kazantseva, Anna; Szpakowicz, Stan (Ed.): Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024), Association for Computational Linguistics, pp. 101-110, St. Julians, Malta, 2024.

We investigate the impact of the Plain English Movement (PEM) on the complexity of legal language in UK law reports from the 1950s-2010s, contrasting it with the evolution of scientific language. The PEM, emerging in the late 20th century, advocated for clear and understandable legal language. We define complexity through the concept of surprisal – an information-theoretic measure correlating with cognitive processing difficulty. Our research contrasts surprisal with traditional readability measures, which often overlook content. We hypothesize that, if the PEM has influenced legal language, there would be a reduction in complexity over time and a shift from a nominal to a more verbal style. We analyze text complexity and lexico-grammatical changes in line with PEM recommendations. Results indicate minimal impact of the PEM on both legal and scientific domains. This finding suggests future research should consider processing effort when advocating for linguistic norms to enhance accessibility.

@inproceedings{bagdasarov-degaetano-ortlieb-2024-applying,
title = {Applying Information-theoretic Notions to Measure Effects of the Plain English Movement on English Law Reports and Scientific Articles},
author = {Sergei Bagdasarov and Stefania Degaetano-Ortlieb},
editor = {Yuri Bizzoni and Stefania Degaetano-Ortlieb and Anna Kazantseva and Stan Szpakowicz},
url = {https://aclanthology.org/2024.latechclfl-1.11},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)},
pages = {101-110},
publisher = {Association for Computational Linguistics},
address = {St. Julians, Malta},
abstract = {We investigate the impact of the Plain English Movement (PEM) on the complexity of legal language in UK law reports from the 1950s-2010s, contrasting it with the evolution of scientific language. The PEM, emerging in the late 20th century, advocated for clear and understandable legal language. We define complexity through the concept of surprisal - an information-theoretic measure correlating with cognitive processing difficulty. Our research contrasts surprisal with traditional readability measures, which often overlook content. We hypothesize that, if the PEM has influenced legal language, there would be a reduction in complexity over time and a shift from a nominal to a more verbal style. We analyze text complexity and lexico-grammatical changes in line with PEM recommendations. Results indicate minimal impact of the PEM on both legal and scientific domains. This finding suggests future research should consider processing effort when advocating for linguistic norms to enhance accessibility.},
pubstate = {published},
type = {inproceedings}
}

Project: B1

Kray, Jutta; Sommerfeld, Linda; Borovsky, Arielle; Häuser, Katja

The role of prediction error in the development of language learning and memory Journal Article

Child Development Perspectives, 18, pp. 190-203, 2024.

Prediction error plays a pivotal role in theories of learning, including theories of language acquisition and use. Researchers have investigated whether and under which conditions children, like adults, use prediction to facilitate language comprehension at different levels of linguistic representation. However, many aspects of the reciprocal relation between prediction error and the development of language learning remain unclear. In this article, we review studies in language development that can inform us about the role of prediction error in updating, learning, and retrieving linguistic information. We argue that the study of individual differences in linguistic and cognitive skills will help the field understand more thoroughly whether, when, and why prediction aids language learning, and whether prediction error necessarily results in language learning and retrieval from memory. We close with a discussion of the needs and challenges for researchers to answer these questions.

@article{Kray_etal_2024,
title = {The role of prediction error in the development of language learning and memory},
author = {Jutta Kray and Linda Sommerfeld and Arielle Borovsky and Katja H{\"a}user},
url = {https://srcd.onlinelibrary.wiley.com/doi/10.1111/cdep.12515},
doi = {https://doi.org/10.1111/cdep.12515},
year = {2024},
date = {2024},
journal = {Child Development Perspectives},
pages = {190-203},
volume = {18},
number = {4},
abstract = {Prediction error plays a pivotal role in theories of learning, including theories of language acquisition and use. Researchers have investigated whether and under which conditions children, like adults, use prediction to facilitate language comprehension at different levels of linguistic representation. However, many aspects of the reciprocal relation between prediction error and the development of language learning remain unclear. In this article, we review studies in language development that can inform us about the role of prediction error in updating, learning, and retrieving linguistic information. We argue that the study of individual differences in linguistic and cognitive skills will help the field understand more thoroughly whether, when, and why prediction aids language learning, and whether prediction error necessarily results in language learning and retrieval from memory. We close with a discussion of the needs and challenges for researchers to answer these questions.},
pubstate = {published},
type = {article}
}

Project: A5

Marchal, Marian; Scholman, Merel; Sanders, Ted J. M.; Demberg, Vera

What processing instructions do connectives provide? Modeling the facilitative effect of the connective Inproceedings

Proceedings of the Annual Meeting of the Cognitive Science Society, 46, pp. 3435-3441, 2024.

Connectives like ‘because’ are referred to as ‘processing instructions’ as they facilitate processing of linguistic material directly following the connective. In an expectation-driven account of discourse processing, this can be attributed to predictions that readers make about the upcoming discourse relation, but also to predictions about up-coming discourse content. By modeling these two accounts, termed the relation prediction account and the content prediction account respectively, we show that they make different predictions about when the presence of a connective is most beneficial. In a self-paced reading study, we replicate the facilitative effect of the connective on processing, but do not find any evidence that this effect can be explained by a strong or weak version of either of the two accounts. This suggests that the role of the connective goes above and beyond informing the reader about the upcoming relation and content and possibly triggers a different processing strategy.

@inproceedings{marchal-etal-2024,
title = {What processing instructions do connectives provide? Modeling the facilitative effect of the connective},
author = {Marian Marchal and Merel Scholman and Ted J. M. Sanders and Vera Demberg},
url = {https://escholarship.org/uc/item/2sc1k7pf},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Annual Meeting of the Cognitive Science Society},
pages = {3435-3441},
abstract = {Connectives like ‘because’ are referred to as ‘processing instructions’ as they facilitate processing of linguistic material directly following the connective. In an expectation-driven account of discourse processing, this can be attributed to predictions that readers make about the upcoming discourse relation, but also to predictions about up-coming discourse content. By modeling these two accounts, termed the relation prediction account and the content prediction account respectively, we show that they make different predictions about when the presence of a connective is most beneficial. In a self-paced reading study, we replicate the facilitative effect of the connective on processing, but do not find any evidence that this effect can be explained by a strong or weak version of either of the two accounts. This suggests that the role of the connective goes above and beyond informing the reader about the upcoming relation and content and possibly triggers a different processing strategy.},
pubstate = {published},
type = {inproceedings}
}

Project: B2

Achimova, Asya; van Os, Marjolein; Demberg, Vera; Butz, Martin V.

Interpreting implausible event descriptions under noise Inproceedings

Proceedings of the Annual Meeting of the Cognitive Science Society, 46, pp. 3399-3406, 2024.

Gricean maxims prescribe cooperative speakers to make their utterances maximally informative so that listeners have the highest chance of understanding the utterances. At the same time, speakers are expected to save effort and not produce descriptions that are more explicit than necessary. In this work, we first ask how predictability of the described events affects the choice of anaphoric referring expressions. We show that speakers prefer phonologically overt descriptions, such as definite NPs, when they refer to agents that behave in an unexpected way. We further test how the interpretation of referring expressions changes depending on the listening conditions and prior expectations about the plausibility of an event. Our work shows that the speaker’s extra effort in choosing a more phonologically overt referring expression is justified by listeners’ behavior: they report having heard an utterance which is more plausible than the originally spoken utterance and which contains additional phonological material.

@inproceedings{Achimova-etal-2024,
title = {Interpreting implausible event descriptions under noise},
author = {Asya Achimova and Marjolein van Os and Vera Demberg and Martin V. Butz},
url = {https://escholarship.org/uc/item/13n5660h},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Annual Meeting of the Cognitive Science Society},
pages = {3399-3406},
abstract = {Gricean maxims prescribe cooperative speakers to make their utterances maximally informative so that listeners have the highest chance of understanding the utterances. At the same time, speakers are expected to save effort and not produce descriptions that are more explicit than necessary. In this work, we first ask how predictability of the described events affects the choice of anaphoric referring expressions. We show that speakers prefer phonologically overt descriptions, such as definite NPs, when they refer to agents that behave in an unexpected way. We further test how the interpretation of referring expressions changes depending on the listening conditions and prior expectations about the plausibility of an event. Our work shows that the speaker's extra effort in choosing a more phonologically overt referring expression is justified by listeners' behavior: they report having heard an utterance which is more plausible than the originally spoken utterance and which contains additional phonological material.},
pubstate = {published},
type = {inproceedings}
}

Project: A4

Liang, Yiming; Amsili, Pascal; Burnett, Heather; Demberg, Vera

Uniform information density explains subject doubling in French Inproceedings

Proceedings of the Annual Meeting of the Cognitive Science Society, 46, pp. 780-788, 2024.

In this paper we investigate whether subject doubling in French is affected by the Uniform Information Density (UID) principle, which states that speakers prefer language encoding that minimizes fluctuations in information density. We show that, other factors being controlled, speakers are more likely to double the NP subject when it has a high surprisal, thus providing further empirical evidence to the UID principle which predicts a surprisal-redundancy trade-off as a property of natural languages. We argue for the importance of employing GPT-2 to investigate complex linguistic phenomena such as subject doubling, as it enables the estimation of subject surprisal by considering a rather large conversational context, a task made possible by powerful language models that incorporate linguistic knowledge through pre-training on extensive datasets.

@inproceedings{Liang-etal-2024,
title = {Uniform information density explains subject doubling in French},
author = {Yiming Liang and Pascal Amsili and Heather Burnett and Vera Demberg},
url = {https://escholarship.org/uc/item/645673fs},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Annual Meeting of the Cognitive Science Society},
pages = {780-788},
abstract = {In this paper we investigate whether subject doubling in French is affected by the Uniform Information Density (UID) principle, which states that speakers prefer language encoding that minimizes fluctuations in information density. We show that, other factors being controlled, speakers are more likely to double the NP subject when it has a high surprisal, thus providing further empirical evidence to the UID principle which predicts a surprisal-redundancy trade-off as a property of natural languages. We argue for the importance of employing GPT-2 to investigate complex linguistic phenomena such as subject doubling, as it enables the estimation of subject surprisal by considering a rather large conversational context, a task made possible by powerful language models that incorporate linguistic knowledge through pre-training on extensive datasets.},
pubstate = {published},
type = {inproceedings}
}

Project: A4

Voigtmann, Sophia

Wie Informationsdichte Extraposition beeinflusst. Eine Korpusuntersuchung an wissenschaftlichen Texten des frühen Neuhochdeutschen PhD Thesis

Saarländische Universitäts- und Landesbibliothek, Saarland University, Saarbruecken, Germany, 2024.

This thesis uses a corpus study to investigate the placement of noun phrases, prepositional phrases, and relative clauses in the postfield (Nachfeld) in scientific texts from 1650 to 1900. It relates extraposition to processing. The assumption is that the postfield offers processing advantages because, once the right sentence bracket has been processed, all required arguments of the clause are either known or can be predicted with (greater) certainty. More cognitive capacity is therefore available in the postfield for processing lexical information. To operationalize this expectation-based processing in a historical context as well, surprisal in the sense of Shannon (1948), Hale (2001), and Levy (2008) is used. At the same time, because previous research associates extraposition primarily with length, a memory-based processing account is also taken into consideration. The thesis also examines whether extraposition is influenced by the conceptual orality of a text (cf. Koch & Österreicher 2007, Ortmann & Dipper 2024). Changes within the period under study were considered as well. This yields three hypotheses: 1) Relative clauses, noun phrases, and prepositional phrases with high surprisal values are extraposed. 2a) Extraposition is used more frequently in texts that are closer to orality. 2b) In texts closer to orality, the influence of high surprisal values is greater than in texts closer to literacy. 3) Over time, the influence of information density on extraposition decreases. To test these hypotheses, a corpus of medical and theological texts was compiled from the Deutsches Textarchiv (DTA, BAW 2019). In this corpus, all extraposed noun and prepositional phrases with their counterparts (so-called minimal pairs), all adjacent and extraposed relative clauses, the sentence brackets, and, where applicable, antecedents were annotated manually. Lemma-based skip-gram values were also computed for each 50-year period using the tool by Kusmirek et al. (2023). From these values, the “average surprisal” of the embedded or extraposed (partial) constituents was calculated. The orality score, an automated score for determining proximity to orality, was obtained with the COAST tool (Ortmann & Dipper 2022, 2024). In addition, the length of each constituent was determined. Overall, surprisal values were shown to predict mainly the position of noun phrases, which is explained by their more varied functions compared with prepositional phrases and attributive relative clauses. For the other two phenomena, length plays a larger role. There are also differences between the two genres, which are linked to the content of the texts, the writing practices of the respective groups of authors, and changes in the two scientific disciplines. The theological texts examined are, moreover, closer to orality than the medical texts. Both genres nevertheless move closer to literacy over the period studied, which also seems to point to a convergence of the two writing styles. Furthermore, the relationship between proximity to orality and extraposition can be confirmed only for noun phrases. When the corpus is split into orality-oriented and literacy-oriented texts, surprisal values are better able to explain extraposition in the orality-oriented texts. Regarding the third hypothesis, it was shown that in more recent texts the importance of length exceeds that of surprisal values. It is argued that writers became accustomed to shorter sentence frames and that the writing practice of theologians and physicians became more professional. Beyond the differences between genres and registers, the thesis foregrounds above all the importance of the sentence bracket for processing.

@phdthesis{Voigtmann_Diss_2024,
title = {Wie Informationsdichte Extraposition beeinflusst. Eine Korpusuntersuchung an wissenschaftlichen Texten des fr{\"u}hen Neuhochdeutschen},
author = {Sophia Voigtmann},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/37369},
doi = {https://doi.org/10.22028/D291-41751},
year = {2024},
date = {2024},
school = {Saarland University},
publisher = {Saarl{\"a}ndische Universit{\"a}ts- und Landesbibliothek},
address = {Saarbruecken, Germany},
abstract = {Die vorliegende Arbeit untersucht die Nachfeldstellung von Nominalphrasen, Pr{\"a}positionalphrasen und Relativs{\"a}tzen in wissenschaftlichen Texten des Zeitraums von 1650 bis 1900 mit einer Korpusstudie. Sie setzt Extraposition in Zusammenhang mit Verarbeitung. Dabei wird angenommen, dass das Nachfeld Vorteile f{\"u}r die Verarbeitung bietet, da hier alle notwendigen Aktanten des Satzes aufgrund der erfolgten Verarbeitung der rechten Satzklammer entweder bekannt sind oder mit (gr{\"o}{\ss}erer) Sicherheit vorhergesagt werden k{\"o}nnen. Somit ist im Nachfeld mehr kognitive Kapazit{\"a}t zur Verarbeitung lexikalischer Information frei. Um diese erwartungsbasierte Verarbeitung auch im historischen Kontext operationalisieren zu k{\"o}nnen, wird Surprisal im Sinne von Shannon (1948), Hale (2001) und Levy (2008) genutzt. Gleichzeitig ist aufgrund der bisherigen Forschung, die Extraposition vor allem mit L{\"a}nge assoziiert, auch ein ged{\"a}chtnisbasierter Verarbeitungsansatz in die Betrachtung von Extraposition eingeflossen. Au{\ss}erdem wurde untersucht, ob Extraposition von der konzeptionellen M{\"u}ndlichkeit eines Textes (vgl. Koch & {\"O}sterreicher 2007, Ortmann & Dipper 2024) beeinflusst wird. Auch Ver{\"a}nderungen innerhalb der untersuchten Periode wurden betrachtet. Daraus ergeben sich drei Hypothesen: 1) Relativs{\"a}tze und Nominal- sowie Pr{\"a}positionalphrasen mit hohen Surprisalwerten werden ausgelagert. 2a) Auslagerung wird verst{\"a}rkt in m{\"u}ndlichkeitsnahen Texten verwendet. 2b) In Texten, die m{\"u}ndlichkeitsn{\"a}her sind, ist der Einfluss von hohen Surprisalwerten gr{\"o}{\ss}er als in schriftlichkeitsnahen Texten. 3) {\"U}ber die Zeit wird der Einfluss der Informationsdichte auf Auslagerung geringer. Zur {\"U}berpr{\"u}fung dieser Hypothesen wurde ein Korpus aus medizinischen und theologischen Texten aus dem Deutschen Textarchiv (DTA, BAW 2019) gebildet. 
Darin wurden h{\"a}ndisch alle extraponierten Nominal- und Pr{\"a}positionalphrasen mit Gegenst{\"u}cken, sog. Minimalpaaren, sowie alle adjazenten und extraponierten Relativs{\"a}tze, die Satzklammern und gegebenenfalls Antezedenzien annotiert. Ebenso wurden die lemmabasierten Skipgramwerte pro 50-Jahresstufe {\"u}ber das Tool von Kusmirek et al. (2023) berechnet. Aus den so ermittelten Werten wurde das „durchschnittliche Surprisal“ der eingebetteten beziehungsweise extraponierten (Teil-)Konstituenten berechnet. {\"U}ber das COAST-Tool (Ortmann & Dipper 2022, 2024) wurde der Orality Score, ein automatisierter Score zur Bestimmung der M{\"u}ndlichkeitsn{\"a}he, ermittelt. Zus{\"a}tzlich wurde die L{\"a}nge f{\"u}r jede Konstituente bestimmt. Insgesamt konnte gezeigt werden, dass Surprisalwerte vor allem die Position von Nominalphrasen vorhersagen k{\"o}nnen, was mit deren vielf{\"a}ltigeren Funktionen – verglichen mit den Pr{\"a}positionalphrasen und attributiven Relativs{\"a}tzen – erkl{\"a}rt wird. Bei den beiden anderen Ph{\"a}nomenen spielt die L{\"a}nge eine gr{\"o}{\ss}ere Rolle. Des Weiteren finden sich Unterschiede zwischen den beiden Genres, die mit den Inhalten der Texte und der Schreibpraxis der jeweiligen Autorengruppen sowie Ver{\"a}nderungen in den beiden Wissenschaftsrichtungen in Zusammenhang gebracht werden. Die untersuchten theologischen Texte sind au{\ss}erdem m{\"u}ndlichkeitsn{\"a}her als die medizinischen Texte. Beide Genre werden {\"u}ber den untersuchten Zeitraum hinweg aber schriftlichkeitsn{\"a}her, was auch f{\"u}r eine Ann{\"a}herung beider Schreibstile zu sprechen scheint. Zudem kann der Zusammenhang zwischen M{\"u}ndlichkeitsn{\"a}he und Extraposition nur f{\"u}r Nominalphrasen best{\"a}tigt werden. Bei einer Zweiteilung des Korpus in m{\"u}ndlichkeitsnahe und schriftlichkeitsnahe Texte zeigt sich, dass die Surprisalwerte eher in den m{\"u}ndlichkeitsnahen Texten Extraposition erkl{\"a}ren k{\"o}nnen. 
Im Zusammenhang mit der dritten Hypothese wurde gezeigt, dass die Bedeutung der L{\"a}nge die der Surprisalwerte in j{\"u}ngeren Texten {\"u}bersteigt. Es wurde daf{\"u}r argumentiert, dass eine Gew{\"o}hnung an k{\"u}rzere Satzrahmen erfolgte und die Schreibpraxis der Theologen und Mediziner professioneller wird. Neben den Unterschieden zwischen den Genres und den Registern, stellt die Arbeit vor allem die Bedeutung der Satzklammer f{\"u}r die Verarbeitung in den Mittelpunkt.},
pubstate = {published},
type = {phdthesis}
}

Project: C6

Bourgonje, Peter; Lin, Pin-Jie

Projecting Annotations for Discourse Relations: Connective Identification for Low-Resource Languages Inproceedings

Strube, Michael; Braud, Chloe; Hardmeier, Christian; Jessy Li, Junyi; Loaiciga, Sharid; Zeldes, Amir; Li, Chuyuan (Ed.): Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024), Association for Computational Linguistics, pp. 39-49, St. Julians, Malta, 2024.

We present a pipeline for multi-lingual Shallow Discourse Parsing. The pipeline exploits Machine Translation and Word Alignment, by translating any incoming non-English input text into English, applying an English discourse parser, and projecting the found relations onto the original input text through word alignments. While the purpose of the pipeline is to provide rudimentary discourse relation annotations for low-resource languages, in order to get an idea of performance, we evaluate it on the sub-task of discourse connective identification for several languages for which gold data are available. We experiment with different setups of our modular pipeline architecture and analyze intermediate results. Our code is made available on GitHub.

https://aclanthology.org/2024.codi-1.4

@inproceedings{bourgonje-lin-2024-projecting,
title = {Projecting Annotations for Discourse Relations: Connective Identification for Low-Resource Languages},
author = {Peter Bourgonje and Pin-Jie Lin},
editor = {Michael Strube and Chloe Braud and Christian Hardmeier and Junyi Jessy Li and Sharid Loaiciga and Amir Zeldes and Chuyuan Li},
url = {https://aclanthology.org/2024.codi-1.4},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)},
pages = {39-49},
publisher = {Association for Computational Linguistics},
address = {St. Julians, Malta},
abstract = {We present a pipeline for multi-lingual Shallow Discourse Parsing. The pipeline exploits Machine Translation and Word Alignment, by translating any incoming non-English input text into English, applying an English discourse parser, and projecting the found relations onto the original input text through word alignments. While the purpose of the pipeline is to provide rudimentary discourse relation annotations for low-resource languages, in order to get an idea of performance, we evaluate it on the sub-task of discourse connective identification for several languages for which gold data are available. We experiment with different setups of our modular pipeline architecture and analyze intermediate results. Our code is made available on GitHub.},
pubstate = {published},
type = {inproceedings}
}

Project: B2

Meßmer, Julia

A functional perspective on schema-based learning and recognition of novel word associations PhD Thesis

Saarländische Universitäts- und Landesbibliothek, Saarland University, Saarbruecken, Germany, 2024.

With the current research, we sought to develop a functional perspective on schema-based learning of novel word associations, i.e., novel compound words and their later recognition. In combining the idea that both schema-based learning (e.g., Hebscher et al., 2019; van Kesteren et al., 2012) and unitization (e.g., Bader et al., 2014; Haskins et al., 2008; see Henke, 2010) might rely less on hippocampal contribution than traditional associative learning, we hypothesized that schema-congruency might support the formation of unitized representations that could then be recognized by means of an absolute familiarity process (Mecklinger & Bader, 2020). All three experiments presented include an incidental learning phase, in which novel compound words were learned together with a preceding definition that was either congruent or neutral (experimental manipulation of schema congruency). After a retention interval of about 10 minutes, a surprise memory test followed. In the test phase, participants were shown different types of compound words and instructed to classify each as intact, recombined, or new (Exp. 1), as old (intact) or new (recombined, similar lures; Exp. 3) or underwent an implicit lexical decision task (Exp. 2). Our results imply that three processes might underlie schema-based learning. Semantic priming, indicated by an N400 attenuation effect in the schema-congruent condition, establishes schema congruency. Condition-independent semantic integration of the constituents is beneficial for memory formation, as indicated by an N400 subsequent memory effect (SME). Lastly, we found a larger parietal SME in the congruent than in the neutral condition. This might reflect the formation of a conceptual (unitized) representation under the influence of a congruent schema. Second, based on our results, schema-congruency might support the formation of unitized representations, indicated by schema-congruency being more beneficial for associative than item memory performance (see Parks & Yonelinas, 2015). The neurocognitive processes underlying recognition of those compound words might include larger absolute familiarity contributing to associative recognition in the congruent than in a neutral control condition, indicated by an N400 attenuation effect. Based on data from our third experiment including semantically similar distractors during the recognition memory test, we concluded that the representations formed under the influence of a schema might be gist-like. Those might be created next to episodic associations that are probably also formed in traditional associative learning. Lastly, those unitized memory representations formed under the influence of a schema can not only be accessed in an explicit memory test, but also affect performance in an implicit memory test.

Das Ziel der vorliegenden Arbeit war es, eine funktionelle Perspektive auf das schema-basierte Lernen neuer Wortassoziationen (Komposita) und deren späteres Wiedererkennen zu entwickeln. Dazu wurden zwei Forschungsideen zusammengeführt. Da sowohl schema-basiertes Lernen (z.B., Hebscher et al., 2019; van Kesteren et al., 2012) als auch Unitarisierung (z.B., Bader et al., 2014; Haskins et al., 2008; siehe auch Henke, 2010) weniger hippocampale Beteiligung aufweisen als traditionelles Assoziationslernen, formulierten wir die Hypothese, dass Schemakongruenz die Bildung unitarisierter Repräsentationen unterstützen könnte, die dann mittels eines absoluten Vertrautheitsprozesses wiedererkannt werden könnten (Mecklinger & Bader, 2020). Die drei Experimente, die in der vorliegenden Arbeit dargestellt sind, beinhalten alle eine inzidentelle Lernphase, in der neue Komposita zusammen mit einer kongruenten oder neutralen vorangehenden Definition gelernt wurden (experimentelle Manipulation von Schemakongruenz). Nach einem Retentionsintervall von etwa 10 Minuten folgte ein überraschender, nicht vorangekündigter Gedächtnistest. In dieser Testphase sahen die Teilnehmenden verschiedene Arten von Komposita und sollten diese als intakt, rekombiniert oder neu klassifizieren (Experiment 1), als alt (intakt) oder neu (rekombiniert, ähnliche Distraktoren; Experiment 3) oder bearbeiteten eine lexikalische Entscheidungsaufgabe (Experiment 2). Unsere Ergebnisse implizieren, dass drei Prozesse am schema-basiertem Lernen beteiligt sind. Semantisches Priming, angezeigt durch eine reduzierte N400 Amplitude in der schema-kongruenten Bedingungen, führt zu Schemakongruenz. Die bedingungsunabhängige semantische Integration der Wortbestandteile ist förderlich für die Gedächtnisbildung, indiziert durch einen N400 Subsequent Memory Effect (SME). 
Der dritte Prozess, die schemakongruenzgetriebene Bildung einer konzeptuellen (unitarisierten) Repräsentation wird angezeigt durch einen größeren parietalen SME in der kongruenten im Vergleich zur neutralen Bedingung. Basierend auf dem behavioralen Ergebnismuster, dass assoziatives Gedächtnis mehr von Schemakongruenz profitiert als Itemgedächtnis (siehe auch Parks & Yonelinas, 2015), könnte Schemakongruenz die Bildung von unitarisierten Repräsentationen fördern. Die neurokognitiven Prozesse, die dem Wiedererkennen solcher Komposita unterliegen, beinhalten wahrscheinlich einen höheren Anteil absoluter Vertrautheit in der kongruenten als in der neutralen Bedingung, indiziert durch einen entsprechenden reduzierten N400-Effekt. Basierend auf den Ergebnissen des dritten Experiments, bei dem der Rekognitionstest semantisch ähnliche Distraktoren beinhaltete, schlussfolgerten wir, dass die Repräsentationen, die unter dem Einfluss eines Schemas gebildet werden, detailarm sind und lediglich die semantische Konzeptstruktur (gist) beinhalten. Diese Repräsentationen könnten parallel zu episodischen Assoziationen geformt werden, die wahrscheinlich beim traditionellen Assoziationslernen gebildet werden. Die unitarisierten Repräsentationen konnten hierbei nicht nur in einem expliziten Gedächtnistest verwendet werden, sondern auch die Performanz in einer impliziten Gedächtnisaufgabe beeinflussen.

@phdthesis{Meßmer_Diss,
title = {A functional perspective on schema-based learning and recognition of novel word associations},
author = {Julia Me{\ss}mer},
year = {2024},
date = {2024},
school = {Saarland University},
publisher = {Saarl{\"a}ndische Universit{\"a}ts- und Landesbibliothek},
address = {Saarbruecken, Germany},
abstract = {With the current research, we sought to develop a functional perspective on schema-based learning of novel word associations, i.e., novel compound words and their later recognition. In combining the idea that both schema-based learning (e.g., Hebscher et al., 2019; van Kesteren et al., 2012) and unitization (e.g., Bader et al., 2014; Haskins et al., 2008; see Henke, 2010) might rely less on hippocampal contribution than traditional associative learning, we hypothesized that schema-congruency might support the formation of unitized representations that could then be recognized by means of an absolute familiarity process (Mecklinger & Bader, 2020). All three experiments presented include an incidental learning phase, in which novel compound words were learned together with a preceding definition that was either congruent or neutral (experimental manipulation of schema congruency). After a retention interval of about 10 minutes, a surprise memory test followed. In the test phase, participants were shown different types of compound words and instructed to classify each as intact, recombined, or new (Exp. 1), as old (intact) or new (recombined, similar lures; Exp. 3) or underwent an implicit lexical decision task (Exp. 2). Our results imply that three processes might underlie schema-based learning. Semantic priming, indicated by an N400 attenuation effect in the schema-congruent condition, establishes schema congruency. Condition-independent semantic integration of the constituents is beneficial for memory formation, as indicated by an N400 subsequent memory effect (SME). Lastly, we found a larger parietal SME in the congruent than in the neutral condition. This might reflect the formation of a conceptual (unitized) representation under the influence of a congruent schema. 
Second, based on our results, schema-congruency might support the formation of unitized representations, indicated by schema-congruency being more beneficial for associative than item memory performance (see Parks & Yonelinas, 2015). The neurocognitive processes underlying recognition of those compound words might include larger absolute familiarity contributing to associative recognition in the congruent than in a neutral control condition, indicated by an N400 attenuation effect. Based on data from our third experiment including semantically similar distractors during the recognition memory test, we concluded that the representations formed under the influence of a schema might be gist-like. Those might be created next to episodic associations that are probably also formed in traditional associative learning. Lastly, those unitized memory representations formed under the influence of a schema can not only be accessed in an explicit memory test, but also affect performance in an implicit memory test.

Das Ziel der vorliegenden Arbeit war es, eine funktionelle Perspektive auf das schema-basierte Lernen neuer Wortassoziationen (Komposita) und deren sp{\"a}teres Wiedererkennen zu entwickeln. Dazu wurden zwei Forschungsideen zusammengef{\"u}hrt. Da sowohl schema-basiertes Lernen (z.B., Hebscher et al., 2019; van Kesteren et al., 2012) als auch Unitarisierung (z.B., Bader et al., 2014; Haskins et al., 2008; siehe auch Henke, 2010) weniger hippocampale Beteiligung aufweisen als traditionelles Assoziationslernen, formulierten wir die Hypothese, dass Schemakongruenz die Bildung unitarisierter Repr{\"a}sentationen unterst{\"u}tzen k{\"o}nnte, die dann mittels eines absoluten Vertrautheitsprozesses wiedererkannt werden k{\"o}nnten (Mecklinger & Bader, 2020). Die drei Experimente, die in der vorliegenden Arbeit dargestellt sind, beinhalten alle eine inzidentelle Lernphase, in der neue Komposita zusammen mit einer kongruenten oder neutralen vorangehenden Definition gelernt wurden (experimentelle Manipulation von Schemakongruenz). Nach einem Retentionsintervall von etwa 10 Minuten folgte ein {\"u}berraschender, nicht vorangek{\"u}ndigter Ged{\"a}chtnistest. In dieser Testphase sahen die Teilnehmenden verschiedene Arten von Komposita und sollten diese als intakt, rekombiniert oder neu klassifizieren (Experiment 1), als alt (intakt) oder neu (rekombiniert, {\"a}hnliche Distraktoren; Experiment 3) oder bearbeiteten eine lexikalische Entscheidungsaufgabe (Experiment 2). Unsere Ergebnisse implizieren, dass drei Prozesse am schema-basierten Lernen beteiligt sind. Semantisches Priming, angezeigt durch eine reduzierte N400 Amplitude in der schema-kongruenten Bedingung, f{\"u}hrt zu Schemakongruenz. Die bedingungsunabh{\"a}ngige semantische Integration der Wortbestandteile ist f{\"o}rderlich f{\"u}r die Ged{\"a}chtnisbildung, indiziert durch einen N400 Subsequent Memory Effect (SME). 
Der dritte Prozess, die schemakongruenzgetriebene Bildung einer konzeptuellen (unitarisierten) Repr{\"a}sentation wird angezeigt durch einen gr{\"o}{\ss}eren parietalen SME in der kongruenten im Vergleich zur neutralen Bedingung. Basierend auf dem behavioralen Ergebnismuster, dass assoziatives Ged{\"a}chtnis mehr von Schemakongruenz profitiert als Itemged{\"a}chtnis (siehe auch Parks & Yonelinas, 2015), k{\"o}nnte Schemakongruenz die Bildung von unitarisierten Repr{\"a}sentationen f{\"o}rdern. Die neurokognitiven Prozesse, die dem Wiedererkennen solcher Komposita unterliegen, beinhalten wahrscheinlich einen h{\"o}heren Anteil absoluter Vertrautheit in der kongruenten als in der neutralen Bedingung, indiziert durch einen entsprechenden reduzierten N400-Effekt. Basierend auf den Ergebnissen des dritten Experiments, bei dem der Rekognitionstest semantisch {\"a}hnliche Distraktoren beinhaltete, schlussfolgerten wir, dass die Repr{\"a}sentationen, die unter dem Einfluss eines Schemas gebildet werden, detailarm sind und lediglich die semantische Konzeptstruktur (gist) beinhalten. Diese Repr{\"a}sentationen k{\"o}nnten parallel zu episodischen Assoziationen geformt werden, die wahrscheinlich beim traditionellen Assoziationslernen gebildet werden. Die unitarisierten Repr{\"a}sentationen konnten hierbei nicht nur in einem expliziten Ged{\"a}chtnistest verwendet werden, sondern auch die Performanz in einer impliziten Ged{\"a}chtnisaufgabe beeinflussen.},
pubstate = {published},
type = {phdthesis}
}


Project: A6
