Publications

Zaitova, Iuliia; Abdullah, Badr M.; Xue, Wei; Klakow, Dietrich; Möbius, Bernd; Avgustinova, Tania

It’s not a walk in the park! Challenges of idiom translation in speech-to-text systems Inproceedings

Che, Wanxiang; Nabende, Joyce; Shutova, Ekaterina; Taher Pilehvar, Mohammad (Ed.): Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, Association for Computational Linguistics, pp. 31310-31322, Vienna, Austria, 2025, ISBN 979-8-89176-251-0.
Idioms are defined as a group of words with a figurative meaning not deducible from their individual components. Although modern machine translation systems have made remarkable progress, translating idioms remains a major challenge, especially for speech-to-text systems, where research on this topic is notably sparse. In this paper, we systematically evaluate idiom translation as compared to conventional news translation in both text-to-text machine translation (MT) and speech-to-text translation (SLT) systems across two language pairs (German to English, Russian to English). We compare state-of-the-art end-to-end SLT systems (SeamlessM4T SLT-to-text, Whisper Large v3) with MT systems (SeamlessM4T SLT-to-text, No Language Left Behind), Large Language Models (DeepSeek, LLaMA) and cascaded alternatives. Our results reveal that SLT systems experience a pronounced performance drop on idiomatic data, often reverting to literal translations even in higher layers, whereas MT systems and Large Language Models demonstrate better handling of idioms. These findings underscore the need for idiom-specific strategies and improved internal representations in SLT architectures.

@inproceedings{Zaitova/etal:2025,
title = {It’s not a walk in the park! Challenges of idiom translation in speech-to-text systems},
author = {Iuliia Zaitova and Badr M. Abdullah and Wei Xue and Dietrich Klakow and Bernd M{\"o}bius and Tania Avgustinova},
editor = {Wanxiang Che and Joyce Nabende and Ekaterina Shutova and Mohammad Taher Pilehvar},
url = {https://aclanthology.org/2025.acl-long.1512/},
year = {2025},
date = {2025},
booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
isbn = {979-8-89176-251-0},
pages = {31310-31322},
publisher = {Association for Computational Linguistics},
address = {Vienna, Austria},
abstract = {Idioms are defined as a group of words with a figurative meaning not deducible from their individual components. Although modern machine translation systems have made remarkable progress, translating idioms remains a major challenge, especially for speech-to-text systems, where research on this topic is notably sparse. In this paper, we systematically evaluate idiom translation as compared to conventional news translation in both text-to-text machine translation (MT) and speech-to-text translation (SLT) systems across two language pairs (German to English, Russian to English). We compare state-of-the-art end-to-end SLT systems (SeamlessM4T SLT-to-text, Whisper Large v3) with MT systems (SeamlessM4T SLT-to-text, No Language Left Behind), Large Language Models (DeepSeek, LLaMA) and cascaded alternatives. Our results reveal that SLT systems experience a pronounced performance drop on idiomatic data, often reverting to literal translations even in higher layers, whereas MT systems and Large Language Models demonstrate better handling of idioms. These findings underscore the need for idiom-specific strategies and improved internal representations in SLT architectures.},
pubstate = {published},
type = {inproceedings}
}

Project:   C4

Andreeva, Bistra; Möbius, Bernd; Yuen, Ivan; Ibrahim, Omnia

Informationsdichte und die Vorhersagbarkeit der phonetischen Struktur Journal Article

Petkova-Kessanlis, Mikaela; Ivanova, Radka; Kileva-Stamenova, Reneta; Arnaudova, Svetlana; Endreva, Maria (Ed.): Journal for German and Scandinavian Studies, 1, pp. 251-266, 2025.

The study examines the relationship between information density and linguistic encoding in phonetics and in human language processing. The information density of a linguistic unit is defined in terms of surprisal (the negative logarithm of the probability of a unit in a given context). The effects of surprisal on phonetic encoding were investigated with respect to several aspects, such as vowel formant trajectories, voicing of plosives, syllable duration, and vowel dispersion (also in L2), taking into account control factors of prosodic structure as well as possible interactions with the Lombard effect and prosodic structure. The results indicate that speakers adjust phonetic details to maintain a balance between information density and phonetic encoding.
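The surprisal definition in the abstract above (the negative logarithm of a unit's probability in a given context) can be illustrated with a minimal Python sketch. This is not code from the study: the function name `surprisal`, the base-2 logarithm (bits), and the example probabilities are all assumptions made for illustration.

```python
import math


def surprisal(probability: float) -> float:
    """Return the surprisal -log2(p) in bits (base 2 is an assumed choice)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must lie in the interval (0, 1]")
    return -math.log2(probability)


# Hypothetical probabilities of the same unit in two different contexts.
predictable = surprisal(0.5)      # unit is likely in its context
unpredictable = surprisal(0.125)  # unit is unlikely in its context
```

A less probable unit receives a higher surprisal value, which is the quantity the abstract relates to phonetic encoding.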

@article{Andreeva/etal:2025a,
title = {Informationsdichte und die Vorhersagbarkeit der phonetischen Struktur},
author = {Bistra Andreeva and Bernd M{\"o}bius and Ivan Yuen and Omnia Ibrahim},
editor = {Mikaela Petkova-Kessanlis and Radka Ivanova and Reneta Kileva-Stamenova and Svetlana Arnaudova and Maria Endreva},
url = {https://doi.org/10.60055/GerSk.2025.izv.1},
doi = {10.60055/GerSk.2025.izv.1},
year = {2025},
date = {2025},
journal = {Journal for German and Scandinavian Studies},
pages = {251-266},
volume = {1},
abstract = {The study examines the relationship between information density and linguistic encoding in phonetics and in human language processing. The information density of a linguistic unit is defined in terms of surprisal (the negative logarithm of the probability of a unit in a given context). The effects of surprisal on phonetic encoding were investigated with respect to several aspects, such as vowel formant trajectories, voicing of plosives, syllable duration, and vowel dispersion (also in L2), taking into account control factors of prosodic structure as well as possible interactions with the Lombard effect and prosodic structure. The results indicate that speakers adjust phonetic details to maintain a balance between information density and phonetic encoding.},
pubstate = {published},
type = {article}
}

Project:   C1

Menzel, Katrin; Przybyl, Heike; Lapshinova-Koltunski, Ekaterina

EPIC-UdS – ein mehrsprachiges Korpus als Grundlage für die korpusbasierte Dolmetsch- und Übersetzungswissenschaft Incollection

Schmidhofer, Astrid; Ángeles Recio Ariza, María (Ed.): Zukunftsperspektiven in der Translationswissenschaft Ausgewählte Beiträge der Translata IV, innsbruck university press, pp. 323-349, Innsbruck, 2025, ISBN 978-3-99106-165-6.

In this paper, we present examples of current research foci and results of analyses of the Collaborative Research Centre subproject “Translation as Rational Communication” that focuses on the specific linguistic properties of interpreted and translated texts distinguishing them from original productions. We describe the creation and annotation of EPIC-UdS, a multilingual corpus of simultaneous interpreting for English, German and Spanish. We give an overview of the corpus variants and explore various applications of the corpus. Building on the ‘translationese’ hypothesis from translation studies, we investigate whether simultaneous interpreted language resembles written translated language with regard to specific features or whether it carries ‘interpretese’ features as a result of a unique language transfer process so that traces of these features tend to occur in all simultaneously interpreted texts and distinguish them from other texts. For instance, with regard to the simplification hypothesis put forward by various translation scholars, we can observe in interpreted texts that there is a tendency towards syntactic simplification. Another analysis shows that interpreted language is characterised by a particular use of discourse particles. EPIC-UdS contains rich metadata and fine-grained linguistic annotations tailored for diverse applications across a broad range of linguistic subfields. This paper provides the first overview in German on the EPIC-UdS corpus with the aim of bringing together results from individual studies, such as Bizzoni/Teich 2019; Karakanta/Vela/Teich 2018; Lapshinova-Koltunski et al. 2021a; Lapshinova-Koltunski/Przybyl/Bizzoni 2021b; Lapshinova-Koltunski/Pollkläsener/Przybyl 2022; Przybyl/Teich 2021; Pollkläsener 2021; Przybyl et al. 2022a, 2022b, and giving a concise and informative summary of applications and impacts of the project.

@incollection{Menzel_2025_EPIC,
title = {EPIC-UdS – ein mehrsprachiges Korpus als Grundlage f{\"u}r die korpusbasierte Dolmetsch- und {\"U}bersetzungswissenschaft},
author = {Katrin Menzel and Heike Przybyl and Ekaterina Lapshinova-Koltunski},
editor = {Astrid Schmidhofer and Mar{\'i}a {\'A}ngeles Recio Ariza},
url = {https://www.uibk.ac.at/iup/buecher/9783991061656.html},
doi = {10.15203/99106-165-6-16},
year = {2025},
date = {2025},
booktitle = {Zukunftsperspektiven in der Translationswissenschaft Ausgew{\"a}hlte Beitr{\"a}ge der Translata IV},
isbn = {978-3-99106-165-6},
pages = {323-349},
publisher = {innsbruck university press},
address = {Innsbruck},
abstract = {In this paper, we present examples of current research foci and results of analyses of the Collaborative Research Centre subproject “Translation as Rational Communication” that focuses on the specific linguistic properties of interpreted and translated texts distinguishing them from original productions. We describe the creation and annotation of EPIC-UdS, a multilingual corpus of simultaneous interpreting for English, German and Spanish. We give an overview of the corpus variants and explore various applications of the corpus. Building on the ‘translationese’ hypothesis from translation studies, we investigate whether simultaneous interpreted language resembles written translated language with regard to specific features or whether it carries ‘interpretese’ features as a result of a unique language transfer process so that traces of these features tend to occur in all simultaneously interpreted texts and distinguish them from other texts. For instance, with regard to the simplification hypothesis put forward by various translation scholars, we can observe in interpreted texts that there is a tendency towards syntactic simplification. Another analysis shows that interpreted language is characterised by a particular use of discourse particles. EPIC-UdS contains rich metadata and fine-grained linguistic annotations tailored for diverse applications across a broad range of linguistic subfields. This paper provides the first overview in German on the EPIC-UdS corpus with the aim of bringing together results from individual studies, such as Bizzoni/Teich 2019; Karakanta/Vela/Teich 2018; Lapshinova-Koltunski et al. 2021a; Lapshinova-Koltunski/Przybyl/Bizzoni 2021b; Lapshinova-Koltunski/Pollkl{\"a}sener/Przybyl 2022; Przybyl/Teich 2021; Pollkl{\"a}sener 2021; Przybyl et al. 2022a, 2022b, and giving a concise and informative summary of applications and impacts of the project.},
pubstate = {published},
type = {incollection}
}

Project:   B7

Spalek, Katharina; Bader, Regine; Glaser, Sandra; Höltje, Gerrit; Mecklinger, Axel

Contrastive focus accent retroactively modulates memory for focus alternatives: evidence from event-related potentials Journal Article

Language, Cognition and Neuroscience, Routledge, pp. 1-18, 2025.

Contrastive focus accent in spoken language indicates that alternatives to the focused element are relevant for interpretation. The sentence “Could I have some TEA, please?”, with contrastive accent on tea, is probably the response to an offer of several alternative beverages. Research shows that contrastive focus accent helps listeners remember such alternatives. We investigated the time-course of and mechanisms behind the effects of contrastive focus accent on memory with a variant of the subsequent memory effect paradigm with event-related potentials (ERPs). The ERP time-locked to a critical word was more positive-going if participants remembered two earlier mentioned alternatives than just one, but only if the critical word had been contrastively accented. This effect further was only observed when the critical word itself was remembered. These findings suggest that contrastive focus marking triggers a reinstatement of the preceding sentence context (retrieval practice) by which these elements are prioritised in memory.

@article{Spaleck.etal.2025,
title = {Contrastive focus accent retroactively modulates memory for focus alternatives: evidence from event-related potentials},
author = {Katharina Spalek and Regine Bader and Sandra Glaser and Gerrit H{\"o}ltje and Axel Mecklinger},
url = {https://doi.org/10.1080/23273798.2025.2503906},
doi = {10.1080/23273798.2025.2503906},
year = {2025},
date = {2025},
journal = {Language, Cognition and Neuroscience},
pages = {1-18},
publisher = {Routledge},
abstract = {Contrastive focus accent in spoken language indicates that alternatives to the focused element are relevant for interpretation. The sentence “Could I have some TEA, please?”, with contrastive accent on tea, is probably the response to an offer of several alternative beverages. Research shows that contrastive focus accent helps listeners remember such alternatives. We investigated the time-course of and mechanisms behind the effects of contrastive focus accent on memory with a variant of the subsequent memory effect paradigm with event-related potentials (ERPs). The ERP time-locked to a critical word was more positive-going if participants remembered two earlier mentioned alternatives than just one, but only if the critical word had been contrastively accented. This effect further was only observed when the critical word itself was remembered. These findings suggest that contrastive focus marking triggers a reinstatement of the preceding sentence context (retrieval practice) by which these elements are prioritised in memory.},
pubstate = {published},
type = {article}
}

Project:   A6

Scholman, Merel; Marchal, Marian; Brown, AriaRay; Demberg, Vera

DiscoNaija: A discourse-annotated parallel Nigerian Pidgin-English corpus Journal Article

Language Resources and Evaluation, pp. 3597-3633, 2025.

This article presents a parallel English-Nigerian Pidgin corpus of PDTB 3.0-style discourse relation annotations, named DiscoNaija. We explain the corpus design criteria, report inter-annotator agreement, and alignment and projection evaluations. We also present an update to a Nigerian Pidgin connective lexicon, named NaijaLex 2.0. An exploratory corpus analysis focused on comparing the distributions found in DiscoNaija to those found in PDTB 3.0 and a comparable corpus of English, DiscoSPICE. We identify various features of Nigerian Pidgin discourse coherence: (i) relations tend to be expressed implicitly more often in Nigerian Pidgin in general; (ii) anti-chronological temporal relations tend to be expressed less and are more likely to be expressed explicitly in Nigerian Pidgin; and (iii) coordinating conjunctions occur less frequently in Nigerian Pidgin than in English. The DiscoNaija corpus can facilitate a multitude of applications and research purposes, for example to function as training data to improve the performance of discourse relation parsers for Nigerian Pidgin, and to facilitate research into discourse features of creole languages.

@article{scholman_etal_2025_disconaija,
title = {DiscoNaija: A discourse-annotated parallel Nigerian Pidgin-English corpus},
author = {Merel Scholman and Marian Marchal and AriaRay Brown and Vera Demberg},
url = {https://link.springer.com/article/10.1007/s10579-025-09850-3},
doi = {10.1007/s10579-025-09850-3},
year = {2025},
date = {2025},
journal = {Language Resources and Evaluation},
pages = {3597-3633},
abstract = {This article presents a parallel English-Nigerian Pidgin corpus of PDTB 3.0-style discourse relation annotations, named DiscoNaija. We explain the corpus design criteria, report inter-annotator agreement, and alignment and projection evaluations. We also present an update to a Nigerian Pidgin connective lexicon, named NaijaLex 2.0. An exploratory corpus analysis focused on comparing the distributions found in DiscoNaija to those found in PDTB 3.0 and a comparable corpus of English, DiscoSPICE. We identify various features of Nigerian Pidgin discourse coherence: (i) relations tend to be expressed implicitly more often in Nigerian Pidgin in general; (ii) anti-chronological temporal relations tend to be expressed less and are more likely to be expressed explicitly in Nigerian Pidgin; and (iii) coordinating conjunctions occur less frequently in Nigerian Pidgin than in English. The DiscoNaija corpus can facilitate a multitude of applications and research purposes, for example to function as training data to improve the performance of discourse relation parsers for Nigerian Pidgin, and to facilitate research into discourse features of creole languages.},
pubstate = {published},
type = {article}
}

Project:   B2

Zaitova, Iuliia; Hirak, Vitalii; Abdullah, Badr M.; Klakow, Dietrich; Möbius, Bernd; Avgustinova, Tania

Attention on Multiword Expressions: A Multilingual Study of BERT-based Models with Regard to Idiomaticity and Microsyntax Inproceedings

Chiruzzo, Luis; Ritter, Alan; Wang, Lu (Ed.): Findings of the Association for Computational Linguistics: NAACL 2025, Association for Computational Linguistics, pp. 4083-4092, Albuquerque, New Mexico, 2025, ISBN 979-8-89176-195-7.

This study analyzes the attention patterns of fine-tuned encoder-only models based on the BERT architecture (BERT-based models) towards two distinct types of Multiword Expressions (MWEs): idioms and microsyntactic units (MSUs). Idioms present challenges in semantic non-compositionality, whereas MSUs demonstrate unconventional syntactic behavior that does not conform to standard grammatical categorizations. We aim to understand whether fine-tuning BERT-based models on specific tasks influences their attention to MWEs, and how this attention differs between semantic and syntactic tasks. We examine attention scores to MWEs in both pre-trained and fine-tuned BERT-based models. We utilize monolingual models and datasets in six Indo-European languages —  English, German, Dutch, Polish, Russian, and Ukrainian. Our results show that fine-tuning significantly influences how models allocate attention to MWEs. Specifically, models fine-tuned on semantic tasks tend to distribute attention to idiomatic expressions more evenly across layers. Models fine-tuned on syntactic tasks show an increase in attention to MSUs in the lower layers, corresponding with syntactic processing requirements.

@inproceedings{zaitova-etal-2025-attention,
title = {Attention on Multiword Expressions: A Multilingual Study of BERT-based Models with Regard to Idiomaticity and Microsyntax},
author = {Iuliia Zaitova and Vitalii Hirak and Badr M. Abdullah and Dietrich Klakow and Bernd M{\"o}bius and Tania Avgustinova},
editor = {Luis Chiruzzo and Alan Ritter and Lu Wang},
url = {https://aclanthology.org/2025.findings-naacl.228/},
year = {2025},
date = {2025},
booktitle = {Findings of the Association for Computational Linguistics: NAACL 2025},
isbn = {979-8-89176-195-7},
pages = {4083-4092},
publisher = {Association for Computational Linguistics},
address = {Albuquerque, New Mexico},
abstract = {This study analyzes the attention patterns of fine-tuned encoder-only models based on the BERT architecture (BERT-based models) towards two distinct types of Multiword Expressions (MWEs): idioms and microsyntactic units (MSUs). Idioms present challenges in semantic non-compositionality, whereas MSUs demonstrate unconventional syntactic behavior that does not conform to standard grammatical categorizations. We aim to understand whether fine-tuning BERT-based models on specific tasks influences their attention to MWEs, and how this attention differs between semantic and syntactic tasks. We examine attention scores to MWEs in both pre-trained and fine-tuned BERT-based models. We utilize monolingual models and datasets in six Indo-European languages —  English, German, Dutch, Polish, Russian, and Ukrainian. Our results show that fine-tuning significantly influences how models allocate attention to MWEs. Specifically, models fine-tuned on semantic tasks tend to distribute attention to idiomatic expressions more evenly across layers. Models fine-tuned on syntactic tasks show an increase in attention to MSUs in the lower layers, corresponding with syntactic processing requirements.},
pubstate = {published},
type = {inproceedings}
}

Project:   C4

Sabev, Mitko; Andreeva, Bistra; Möbius, Bernd; Yuen, Ivan; Ibrahim, Omnia

The effects of lexical frequency on anticipatory voice assimilation in Bulgarian obstruents Inproceedings

Grawunder, Sven (Ed.): Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2025, TUDpress, pp. 163-169, Dresden, 2025, ISBN 978-3-95908-803-9, ISSN 0940-6832.

This study investigates the relation between the surprisal (or unpredictability) of linguistic items and anticipatory voicing assimilation in Bulgarian obstruents. Using a corpus of speech read by 140 Bulgarian speakers and word-level language models, we calculated unigram surprisal for word forms ending in obstruents followed by a word-initial obstruent of the opposite underlying [±voice] specification. Percentage of voicing was computed for 9,712 word-final obstruents. Linear mixed models were used to determine the effect of surprisal on the percentage of voicing in assimilating obstruents. The results confirm that Bulgarian obstruents do indeed in general assimilate to the voicing of a following obstruent: voiceless obstruents become voiced before voiced ones, while voiced obstruents are devoiced before voiceless ones. Crucially, however, surprisal had a significant effect on the percentage of voicing found in assimilating obstruents: in words with higher surprisal values, we found significantly lower degrees of voicing in voiceless obstruents before voiced ones, as well as significantly less devoicing of voiced obstruents before voiceless ones. This shows that assimilation is stronger in low-surprisal words, while in high-surprisal words speakers attempt to maintain the underlying [±voice] specification of an obstruent to a higher degree. Our findings add to a growing body of research that demonstrates that processes once thought of as entirely categorical in fact exhibit gradient variation in fine phonetic detail, which is attributable to speakers’ awareness of statistical patterns in language use and their response to the predictability of linguistic items in maintaining a balance between phonetic encoding and information density.
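To make the notion of unigram surprisal in the abstract above concrete, here is a minimal Python sketch that estimates it from relative word-form frequencies. It is an illustration only, not the authors' pipeline: the abstract does not describe how the word-level language models were built, so the relative-frequency estimate, the function name `unigram_surprisal`, and the toy corpus are all assumptions.

```python
import math
from collections import Counter


def unigram_surprisal(tokens):
    """Map each word form to -log2 of its relative frequency in `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}


# Hypothetical toy corpus: "a" is frequent, "b" is rare.
toy_corpus = ["a", "a", "a", "b"]
values = unigram_surprisal(toy_corpus)
```

In this sketch the rare form receives the higher surprisal, which corresponds to the high-surprisal words for which the abstract reports weaker assimilation.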

@inproceedings{sabev_etal_essv2025,
title = {The effects of lexical frequency on anticipatory voice assimilation in Bulgarian obstruents},
author = {Mitko Sabev and Bistra Andreeva and Bernd M{\"o}bius and Ivan Yuen and Omnia Ibrahim},
editor = {Sven Grawunder},
url = {https://www.essv.de/paper.php?id=1249},
year = {2025},
date = {2025},
booktitle = {Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2025},
isbn = {978-3-95908-803-9},
issn = {0940-6832},
pages = {163-169},
publisher = {TUDpress},
address = {Dresden},
abstract = {This study investigates the relation between the surprisal (or unpredictability) of linguistic items and anticipatory voicing assimilation in Bulgarian obstruents. Using a corpus of speech read by 140 Bulgarian speakers and word-level language models, we calculated unigram surprisal for word forms ending in obstruents followed by a word-initial obstruent of the opposite underlying [±voice] specification. Percentage of voicing was computed for 9,712 word-final obstruents. Linear mixed models were used to determine the effect of surprisal on the percentage of voicing in assimilating obstruents. The results confirm that Bulgarian obstruents do indeed in general assimilate to the voicing of a following obstruent: voiceless obstruents become voiced before voiced ones, while voiced obstruents are devoiced before voiceless ones. Crucially, however, surprisal had a significant effect on the percentage of voicing found in assimilating obstruents: in words with higher surprisal values, we found significantly lower degrees of voicing in voiceless obstruents before voiced ones, as well as significantly less devoicing of voiced obstruents before voiceless ones. This shows that assimilation is stronger in low-surprisal words, while in high-surprisal words speakers attempt to maintain the underlying [±voice] specification of an obstruent to a higher degree. Our findings add to a growing body of research that demonstrates that processes once thought of as entirely categorical in fact exhibit gradient variation in fine phonetic detail, which is attributable to speakers’ awareness of statistical patterns in language use and their response to the predictability of linguistic items in maintaining a balance between phonetic encoding and information density.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

On the Persistence of Discourse Predictions: The Facilitative Effect of Discourse Markers Diminishes in the Presence of Intervening Material

2025.

The current study investigates for how long readers maintain expectations about an upcoming discourse relation. We use the pair of discourse markers On the one hand (OT1H) and On the other hand (OTOH) to test the facilitative effect of OT1H on the processing of OTOH and the sensitivity of this effect to the presence of intervening material. Results from a story continuation study indicate that intervening material slightly weakens the effect of OT1H on offline representations of the discourse. Results from a self-paced reading and two eye-tracking studies suggest that the presence of intervening material diminishes the facilitative effect of OT1H in online processing. These results support memory-based models of processing by showing that discourse dependencies, while they are built as fine-grained representations, are not unbounded in real-time processing.

Andreeva, Bistra; Sabev, Mitko

Ефектът На Лексикалната Честотност И Типа Морфема Върху Регресивната Асимилация На Българските Обструенти По Признака Звучност (The Effects Of Lexical Frequency And Morpheme Type On Anticipatory Voice Assimilation In Bulgarian Obstruents) Journal Article

БЪЛГАРСКИ ЕЗИК, ПРИЛОЖЕНИЕ / BULGARIAN LANGUAGE, SUPPLEMENT, 72, pp. 357–369, 2025, ISSN 2603-3372.

Настоящото изследване разглежда връзката между честотността (предсказуемостта) на езиковите единици и регресивната асимилация на обструентите по признака звучност в българския език. Въз основа на данните от речеви корпус и езикови модели на ниво словоформа изчисляваме стойността на изненадата на думи с краесловни съгласни /t/, /ʃ/ и /x/ в състава на формообразуващи и словообразуващи морфеми. Чрез смесен линеен модел (LMM) анализираме как предсказуемостта, звучността на следващия обструент и типът морфема влияят върху реализираната звучност на съгласните. Резултатите показват, че езиковата предсказуемост влияе върху прецизността на артикулацията, като модулира степента на регресивната асимилация в зависимост от морфологичния контекст. Установяваме, че обструентите в по-малко предсказуемите думи се произнасят с по-прецизна артикулация, отколкото обструентите в по-предсказуемите думи, особено в състава на словообразуващи морфеми. Това подчертава комплексната динамика между информационната плътност, морфологията и артикулационния процес, което може да даде основа за по-нататъшни изследвания в областта на фонетиката и фонологията.


This study investigates the relationship between the frequency (or predictability) of linguistic units and the anticipatory voice assimilation of obstruents in Bulgarian. Using a speech corpus and word-level language models, we calculate the surprisal of word forms ending in the consonants /t/, /ʃ/, and /x/ in inflectional and lexical morphemes. We then employ a linear mixed model (LMM) to analyse how surprisal, the voicing of the following obstruent, and morpheme type affect the voicing realised in the examined consonants. The findings demonstrate that linguistic predictability affects articulatory precision by modulating the degree of anticipatory assimilation, with different effect sizes for different morpheme types. More specifically, obstruents in less predictable words are articulated with greater precision than those in more predictable words, especially within lexical morphemes. This reveals a complex interaction between information density, morphology and articulation, providing avenues for further research in phonetics and phonology.

@article{AndreevaSabev2025,
title = {Ефектът На Лексикалната Честотност И Типа Морфема Върху Регресивната Асимилация На Българските Обструенти По Признака Звучност (The Effects Of Lexical Frequency And Morpheme Type On Anticipatory Voice Assimilation In Bulgarian Obstruents)},
author = {Bistra Andreeva and Mitko Sabev},
url = {https://www.balgarskiezik.eu/p-2025/0_0_25_BISTRA%20ANDREEVA,%20MITKO%20SABEV_357-369_BG.pdf},
year = {2025},
date = {2025},
journal = {БЪЛГАРСКИ ЕЗИК, ПРИЛОЖЕНИЕ / BULGARIAN LANGUAGE, SUPPLEMENT},
pages = {357-369},
volume = {72},
abstract = {Настоящото изследване разглежда връзката между честотността (предсказуемостта) на езиковите единици и регресивната асимилация на обструентите по признака звучност в българския език. Въз основа на данните от речеви корпус и езикови модели на ниво словоформа изчисляваме стойността на изненадата на думи с краесловни съгласни /t/, /ʃ/ и /x/ в състава на формообразуващи и словообразуващи морфеми. Чрез смесен линеен модел (LMM) анализираме как предсказуемостта, звучността на следващия обструент и типът морфема влияят върху реализираната звучност на съгласните. Резултатите показват, че езиковата предсказуемост влияе върху прецизността на артикулацията, като модулира степента на регресивната асимилация в зависимост от морфологичния контекст. Установяваме, че обструентите в по-малко предсказуемите думи се произнасят с по-прецизна артикулация, отколкото обструентите в по-предсказуемите думи, особено в състава на словообразуващи морфеми. Това подчертава комплексната динамика между информационната плътност, морфологията и артикулационния процес, което може да даде основа за по-нататъшни изследвания в областта на фонетиката и фонологията.


This study investigates the relationship between the frequency (or predictability) of linguistic units and the anticipatory voice assimilation of obstruents in Bulgarian. Using a speech corpus and word-level language models, we calculate the surprisal of word forms ending in the consonants /t/, /ʃ/, and /x/ in inflectional and lexical morphemes. We then employ a linear mixed model (LMM) to analyse how surprisal, the voicing of the following obstruent, and morpheme type affect the voicing realised in the examined consonants. The findings demonstrate that linguistic predictability affects articulatory precision by modulating the degree of anticipatory assimilation, with different effect sizes for different morpheme types. More specifically, obstruents in less predictable words are articulated with greater precision than those in more predictable words, especially within lexical morphemes. This reveals a complex interaction between information density, morphology and articulation, providing avenues for further research in phonetics and phonology.},
pubstate = {published},
type = {article}
}

Project:   C1

Alves, Diego

Information Theory and Linguistic Variation: A Study of Brazilian and European Portuguese Inproceedings

Scherrer, Yves; Jauhiainen, Tommi; Ljubešić, Nikola; Nakov, Preslav; Tiedemann, Jorg; Zampieri, Marcos (Ed.): Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Association for Computational Linguistics, pp. 9-19, Abu Dhabi, UAE, 2025.

We present a general analysis of the lexical and grammatical differences between Brazilian and European Portuguese by applying entropy measures, including Kullback-Leibler divergence and word order entropy, across various linguistic levels. Using a parallel corpus of BP and EP sentences translated from English, we quantified these differences and identified characteristic phenomena underlying the divergences between the two varieties. The highest divergence was observed at the lexical level due to word pairs unique to each variety but also related to grammatical distinctions. Furthermore, the analysis of parts-of-speech (POS), dependency relations, and POS tri-grams provided information concerning distinctive grammatical constructions. Finally, the word order entropy analysis revealed that while most of the syntactic features analysed showed similar patterns across BP and EP, specific word order preferences were still apparent.

@inproceedings{alves-2025-information,
title = {Information Theory and Linguistic Variation: A Study of Brazilian and European Portuguese},
author = {Diego Alves},
editor = {Yves Scherrer and Tommi Jauhiainen and Nikola Ljubešić and Preslav Nakov and Jorg Tiedemann and Marcos Zampieri},
url = {https://aclanthology.org/2025.vardial-1.2/},
year = {2025},
date = {2025},
booktitle = {Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects},
pages = {9-19},
publisher = {Association for Computational Linguistics},
address = {Abu Dhabi, UAE},
abstract = {We present a general analysis of the lexical and grammatical differences between Brazilian and European Portuguese by applying entropy measures, including Kullback-Leibler divergence and word order entropy, across various linguistic levels. Using a parallel corpus of BP and EP sentences translated from English, we quantified these differences and identified characteristic phenomena underlying the divergences between the two varieties. The highest divergence was observed at the lexical level due to word pairs unique to each variety but also related to grammatical distinctions. Furthermore, the analysis of parts-of-speech (POS), dependency relations, and POS tri-grams provided information concerning distinctive grammatical constructions. Finally, the word order entropy analysis revealed that while most of the syntactic features analysed showed similar patterns across BP and EP, specific word order preferences were still apparent.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Delogu, Francesca; Aurnhammer, Christoph; Brouwer, Harm; Crocker, Matthew W.

On the biphasic nature of the N400-P600 complex underlying language comprehension Journal Article

Brain & Cognition, 2025.

The ERP literature on language comprehension reveals variability in observing monophasic N400 versus biphasic N400-P600 effects in response to incongruent input, with the reasons for this inconsistency remaining unclear. Two interrelated factors may contribute: spatiotemporal overlap between the N400 and P600, where a strong N400-effect can obscure the P600, and the P600’s sensitivity to depth of processing, as determined by the experimental setting. Building on previous findings reporting monophasic N400-effects with plausibility judgments, we investigated whether comprehension questions, encouraging more natural reading and deeper processing of the full content, would elicit a biphasic effect, suggesting reduced component overlap in such settings. Using a design fully crossing lexical association and plausibility, we found that the N400 is modulated by association and the P600 by plausibility. Crucially, a biphasic pattern emerged for implausible and unrelated words, suggesting a mitigation of component overlap compared to previous studies employing plausibility judgments. We interpret the results in light of current accounts of the N400 and P600, arguing that the empirical evidence strongly supports single-stream over multi-stream models. Importantly, our findings highlight the critical role of both component overlap and task demands in shaping the data that inform the development and evaluation of theoretical models.

@article{Delogu-etal-2025,
title = {On the biphasic nature of the N400-P600 complex underlying language comprehension},
author = {Francesca Delogu and Christoph Aurnhammer and Harm Brouwer and Matthew W. Crocker},
url = {https://www.sciencedirect.com/science/article/pii/S0278262625000338},
doi = {https://doi.org/10.1016/j.bandc.2025.106293},
year = {2025},
date = {2025},
journal = {Brain \& Cognition},
abstract = {The ERP literature on language comprehension reveals variability in observing monophasic N400 versus biphasic N400-P600 effects in response to incongruent input, with the reasons for this inconsistency remaining unclear. Two interrelated factors may contribute: spatiotemporal overlap between the N400 and P600, where a strong N400-effect can obscure the P600, and the P600’s sensitivity to depth of processing, as determined by the experimental setting. Building on previous findings reporting monophasic N400-effects with plausibility judgments, we investigated whether comprehension questions, encouraging more natural reading and deeper processing of the full content, would elicit a biphasic effect, suggesting reduced component overlap in such settings. Using a design fully crossing lexical association and plausibility, we found that the N400 is modulated by association and the P600 by plausibility. Crucially, a biphasic pattern emerged for implausible and unrelated words, suggesting a mitigation of component overlap compared to previous studies employing plausibility judgments. We interpret the results in light of current accounts of the N400 and P600, arguing that the empirical evidence strongly supports single-stream over multi-stream models. Importantly, our findings highlight the critical role of both component overlap and task demands in shaping the data that inform the development and evaluation of theoretical models.},
pubstate = {published},
type = {article}
}

Project:   A1

Yung, Frances Pik Yu; Demberg, Vera

On Crowdsourcing Task Design for Discourse Relation Annotation Inproceedings

Roth, Michael; Schlechtweg, Dominik (Ed.): Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation, International Committee on Computational Linguistics, pp. 12-19, Abu Dhabi, UAE, 2025.

Interpreting implicit discourse relations involves complex reasoning, requiring the integration of semantic cues with background knowledge, as overt connectives like “because” or “then” are absent. These relations often allow multiple interpretations, best represented as distributions. In this study, we compare two established methods that crowdsource implicit discourse relation annotation by connective insertion: a free-choice approach, which allows annotators to select any suitable connective, and a forced-choice approach, which asks them to select among a set of predefined options. Specifically, we re-annotate the whole DiscoGeM 1.0 corpus – initially annotated with the free-choice method – using the forced-choice approach. The free-choice approach allows for flexible and intuitive insertion of various connectives, which are context-dependent. Comparison among over 130,000 annotations, however, shows that the free-choice strategy produces less diverse annotations, often converging on common labels. Analysis of the results reveals the interplay between task design and the annotators’ abilities to interpret and produce discourse relations.

@inproceedings{yung-demberg-2025-crowdsourcing,
title = {On Crowdsourcing Task Design for Discourse Relation Annotation},
author = {Frances Pik Yu Yung and Vera Demberg},
editor = {Michael Roth and Dominik Schlechtweg},
url = {https://aclanthology.org/2025.comedi-1.2/},
year = {2025},
date = {2025},
booktitle = {Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation},
pages = {12-19},
publisher = {International Committee on Computational Linguistics},
address = {Abu Dhabi, UAE},
abstract = {Interpreting implicit discourse relations involves complex reasoning, requiring the integration of semantic cues with background knowledge, as overt connectives like “because” or “then” are absent. These relations often allow multiple interpretations, best represented as distributions. In this study, we compare two established methods that crowdsource implicit discourse relation annotation by connective insertion: a free-choice approach, which allows annotators to select any suitable connective, and a forced-choice approach, which asks them to select among a set of predefined options. Specifically, we re-annotate the whole DiscoGeM 1.0 corpus - initially annotated with the free-choice method - using the forced-choice approach. The free-choice approach allows for flexible and intuitive insertion of various connectives, which are context-dependent. Comparison among over 130,000 annotations, however, shows that the free-choice strategy produces less diverse annotations, often converging on common labels. Analysis of the results reveals the interplay between task design and the annotators’ abilities to interpret and produce discourse relations.

},
pubstate = {published},
type = {inproceedings}
}

Project:   B2

Thillainathan, Sarubi; Koller, Alexander

Controllable Text Adaptation Using In-context Learning with Linguistic Features Inproceedings

AAAI2025 AI for Education - Tools, Opportunities, and Risks in the Generative AI Era, 2025.

The diversity in readers’ cognitive abilities, including working memory capacity and prior knowledge, necessitates texts that align with individual comprehension levels. We address the challenge of rewriting text to match readers’ unique needs, approximating readers to specific grade levels. Unlike prior approaches that rely on fine-tuned models and large training datasets, our method leverages in-context learning (ICL), making it effective in data-sparse scenarios. By precisely controlling linguistic features such as syntactic depth, our approach delivers tailored rewrites aligned with specific grade levels. We demonstrate state-of-the-art performance in generating grade-specific adaptations, highlighting the potential of ICL-based methods to enhance text accessibility and inclusivity.

@inproceedings{Thillainathan2025Controllable,
title = {Controllable Text Adaptation Using In-context Learning with Linguistic Features},
author = {Sarubi Thillainathan and Alexander Koller},
url = {https://ai4ed.cc/workshops/aaai2025},
year = {2025},
date = {2025},
booktitle = {AAAI2025 AI for Education - Tools, Opportunities, and Risks in the Generative AI Era},
abstract = {The diversity in readers’ cognitive abilities, including working memory capacity and prior knowledge, necessitates texts that align with individual comprehension levels. We address the challenge of rewriting text to match readers’ unique needs, approximating readers to specific grade levels. Unlike prior approaches that rely on fine-tuned models and large training datasets, our method leverages in-context learning (ICL), making it effective in data-sparse scenarios. By precisely controlling linguistic features such as syntactic depth, our approach delivers tailored rewrites aligned with specific grade levels. We demonstrate state-of-the-art performance in generating grade-specific adaptations, highlighting the potential of ICL-based methods to enhance text accessibility and inclusivity.},
pubstate = {published},
type = {inproceedings}
}

Project:   A8

Sentsova, Uliana; Ciminari, Debora; van Genabith, Josef; España-Bonet, Cristina

MultiCoPIE: A Multilingual Corpus of Potentially Idiomatic Expressions for Cross-lingual PIE Disambiguation Inproceedings

Kr. Ojha, Atul; Giouli, Voula; Barbu Mititelu, Verginica; Constant, Mathieu; Korvel, Gražina; Seza Doğruöz, A.; Rademaker, Alexandre (Ed.): 21st Workshop on Multiword Expressions (MWE 2025) @NAACL2025, Association for Computational Linguistics, pp. 67-81, Albuquerque, New Mexico, U.S.A., 2025, ISBN 979-8-89176-243-5.

Language models are able to handle compositionality and, to some extent, non-compositional phenomena such as semantic idiosyncrasy, a feature most prominent in the case of idioms. This work introduces the MultiCoPIE corpus that includes potentially idiomatic expressions in Catalan, Italian, and Russian, extending the language coverage of PIE corpus data. The new corpus provides additional linguistic features of idioms, such as their semantic compositionality, part-of-speech of idiom head as well as their corresponding idiomatic expressions in English. With this new resource at hand, we first fine-tune an XLM-RoBERTa model to classify figurative and literal usage of potentially idiomatic expressions in English. We then study cross-lingual transfer to the languages represented in the MultiCoPIE corpus, evaluating the model’s ability to generalize an idiom-related task to languages not seen during fine-tuning. We show the effect of ‘cross-lingual lexical overlap’: the performance of the model, fine-tuned on English idiomatic expressions and tested on the MultiCoPIE languages, increases significantly when classifying ‘shared idioms’ - idiomatic expressions that have direct counterparts in English with similar form and meaning. While this observation raises questions about the generalizability of cross-lingual learning, the results from experiments on PIEs demonstrate strong evidence of effective cross-lingual transfer, even when accounting for idioms similar across languages.

@inproceedings{Sentsova-etal-2025-multicopie,
title = {MultiCoPIE: A Multilingual Corpus of Potentially Idiomatic Expressions for Cross-lingual PIE Disambiguation},
author = {Uliana Sentsova and Debora Ciminari and Josef van Genabith and Cristina Espa{\~n}a-Bonet},
editor = {Atul Kr. Ojha and Voula Giouli and Verginica Barbu Mititelu and Mathieu Constant and Gra{\v{z}}ina Korvel and A. Seza Doğru{\"o}z and Alexandre Rademaker},
url = {https://aclanthology.org/2025.mwe-1.8/},
doi = {https://doi.org/10.18653/v1/2025.mwe-1.8},
year = {2025},
date = {2025},
booktitle = {21st Workshop on Multiword Expressions (MWE 2025) @NAACL2025},
isbn = {979-8-89176-243-5},
pages = {67-81},
publisher = {Association for Computational Linguistics},
address = {Albuquerque, New Mexico, U.S.A.},
abstract = {Language models are able to handle compositionality and, to some extent, non-compositional phenomena such as semantic idiosyncrasy, a feature most prominent in the case of idioms. This work introduces the MultiCoPIE corpus that includes potentially idiomatic expressions in Catalan, Italian, and Russian, extending the language coverage of PIE corpus data. The new corpus provides additional linguistic features of idioms, such as their semantic compositionality, part-of-speech of idiom head as well as their corresponding idiomatic expressions in English. With this new resource at hand, we first fine-tune an XLM-RoBERTa model to classify figurative and literal usage of potentially idiomatic expressions in English. We then study cross-lingual transfer to the languages represented in the MultiCoPIE corpus, evaluating the model{'}s ability to generalize an idiom-related task to languages not seen during fine-tuning. We show the effect of `cross-lingual lexical overlap': the performance of the model, fine-tuned on English idiomatic expressions and tested on the MultiCoPIE languages, increases significantly when classifying `shared idioms' - idiomatic expressions that have direct counterparts in English with similar form and meaning. While this observation raises questions about the generalizability of cross-lingual learning, the results from experiments on PIEs demonstrate strong evidence of effective cross-lingual transfer, even when accounting for idioms similar across languages.},
pubstate = {published},
type = {inproceedings}
}

Project:   B6

Alves, Diego

Diachronic Analysis of Phrasal Verbs in English Scientific Writing Inproceedings

Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), University of Tartu Library, Tallinn, Estonia, 2025.
Phrasal verbs (PVs) are a specific type of multi-word expressions and a specific feature of the English language. However, their usage in scientific prose is limited. Our study focuses on the analysis of phrasal verbs in the scientific domain using information theory methods to describe diachronic phenomena such as conventionalization and diversification regarding the usage of PVs. Thus, we analysed their developmental trajectory over time from the mid-17th century to the end of the 20th century by measuring the relative entropy (Kullback-Leibler divergence), predictability in context of the phrasal verbs particles (surprisal), and the paradigmatic variability using word embedding spaces. We were able to identify interesting phenomena such as the process of conventionalization over the 20th century and the peaks of diversification throughout the centuries.

@inproceedings{Alves-2025,
title = {Diachronic Analysis of Phrasal Verbs in English Scientific Writing},
author = {Diego Alves},
url = {https://dspace.ut.ee/items/ef26bd7f-e708-41b3-b5c8-84cf8057ab71},
year = {2025},
date = {2025},
booktitle = {Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)},
publisher = {University of Tartu Library},
address = {Tallinn, Estonia},
abstract = {

Phrasal verbs (PVs) are a specific type of multi-word expressions and a specific feature of the English language. However, their usage in scientific prose is limited. Our study focuses on the analysis of phrasal verbs in the scientific domain using information theory methods to describe diachronic phenomena such as conventionalization and diversification regarding the usage of PVs. Thus, we analysed their developmental trajectory over time from the mid-17th century to the end of the 20th century by measuring the relative entropy (Kullback-Leibler divergence), predictability in context of the phrasal verbs particles (surprisal), and the paradigmatic variability using word embedding spaces. We were able to identify interesting phenomena such as the process of conventionalization over the 20th century and the peaks of diversification throughout the centuries.
},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Kunilovskaya, Maria; Zaitova, Iuliia; Xue, Wei; Stenger, Irina; Avgustinova, Tania

Predictability of Microsyntactic Units across Slavic Languages: A translation-based Study Inproceedings

Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), University of Tartu Library, Tallinn, Estonia, 2025.
The paper presents the results of a free translation experiment, which was set up to explore Slavic cross-language intelligibility. In the experiment, native speakers of Russian were asked to read a sentence in one of the five Slavic languages and return a Russian translation of a highlighted item. The experiment is focused on microsyntactic units because they offer an increased intercomprehension difficulty due to opaque semantics. Each language is represented by at least 50 stimuli, and each stimulus has generated at least 20 responses. The levels of intercomprehension are captured by categorising participants’ responses into seven types of translation solutions (paraphrase, correct, fluent_literal, awkward_literal, fantasy, noise, and empty), generally reflecting the level of the cross-linguistic intelligibility of the stimuli. The study aims to reveal linguistic factors that favour intercomprehension across Slavic languages. We use regression and correlation analysis to identify the most important intercomprehension predictors and statistical analysis to bring up the most typical cases and outliers. We explore several feature types that reflect the properties of the translation tasks and their outcomes, including point-wise phonological and orthographic distances, cosine similarities, surprisals, translation quality scores and translation solution entropy indices. The experimental data confirms the expected gradual increase of intelligibility from West-Slavic to East-Slavic languages for the speakers of Russian. We show that intelligibility is highly contingent on the ability of speakers to recognise and interpret formal similarities between languages as well as on the size of these similarities. For several Slavic languages, the context sentence complexity was a significant predictor of intelligibility.

@inproceedings{Kunilovskaya-etal-2025,
title = {Predictability of Microsyntactic Units across Slavic Languages: A translation-based Study},
author = {Maria Kunilovskaya and Iuliia Zaitova and Wei Xue and Irina Stenger and Tania Avgustinova},
url = {https://dspace.ut.ee/items/26e26504-9379-42cf-8f85-361a04dcd114},
year = {2025},
date = {2025},
booktitle = {Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)},
publisher = {University of Tartu Library},
address = {Tallinn, Estonia},
abstract = {

The paper presents the results of a free translation experiment, which was set up to explore Slavic cross-language intelligibility. In the experiment, native speakers of Russian were asked to read a sentence in one of the five Slavic languages and return a Russian translation of a highlighted item. The experiment is focused on microsyntactic units because they offer an increased intercomprehension difficulty due to opaque semantics. Each language is represented by at least 50 stimuli, and each stimulus has generated at least 20 responses. The levels of intercomprehension are captured by categorising participants' responses into seven types of translation solutions (paraphrase, correct, fluent_literal, awkward_literal, fantasy, noise, and empty), generally reflecting the level of the cross-linguistic intelligibility of the stimuli. The study aims to reveal linguistic factors that favour intercomprehension across Slavic languages. We use regression and correlation analysis to identify the most important intercomprehension predictors and statistical analysis to bring up the most typical cases and outliers. We explore several feature types that reflect the properties of the translation tasks and their outcomes, including point-wise phonological and orthographic distances, cosine similarities, surprisals, translation quality scores and translation solution entropy indices. The experimental data confirms the expected gradual increase of intelligibility from West-Slavic to East-Slavic languages for the speakers of Russian. We show that intelligibility is highly contingent on the ability of speakers to recognise and interpret formal similarities between languages as well as on the size of these similarities. For several Slavic languages, the context sentence complexity was a significant predictor of intelligibility.
},
pubstate = {published},
type = {inproceedings}
}

Projects:   B7 C4

Lemke, Tyll Robin

Investigating fragment usage with a gamified utterance selection task Proceeding

Experiments in Linguistic Meaning, 3, pp. 447-459, 2025.
Nonsentential utterances, or fragments, like A coffee, please! can often be used to communicate a propositional meaning otherwise encoded by a complete sentence (I’d like to order a coffee, please!). Previous research focused mostly on the syntax and licensing of fragments, but the questions of why speakers use fragments and how listeners interpret them are still underexplored. I propose a simple game-theoretic account of fragment usage, which predicts that (i) listeners assign fragments the most likely interpretation in context and (ii) that speakers are aware of this and trade off production cost and the risk of being misunderstood when choosing their utterance. Using a corpus of production data, empirically founded and precise model predictions are generated. These predictions are evaluated with two experiments using a novel gamified utterance selection paradigm. The experiments suggest that, as predicted, speakers take into account both potential gain in efficiency and the risk of being misunderstood when choosing their utterance.

@proceeding{Lemke_2025,
title = {Investigating fragment usage with a gamified utterance selection task},
author = {Tyll Robin Lemke},
url = {https://journals.linguisticsociety.org/proceedings/index.php/ELM/article/view/5836},
doi = {https://doi.org/10.3765/elm.3.5836},
year = {2025},
date = {2025},
booktitle = {Experiments in Linguistic Meaning},
pages = {447-459},
abstract = {

Nonsentential utterances, or fragments, like A coffee, please! can often be used to communicate a propositional meaning otherwise encoded by a complete sentence (I'd like to order a coffee, please!). Previous research focused mostly on the syntax and licensing of fragments, but the questions of why speakers use fragments and how listeners interpret them are still underexplored. I propose a simple game-theoretic account of fragment usage, which predicts that (i) listeners assign fragments the most likely interpretation in context and (ii) that speakers are aware of this and trade off production cost and the risk of being misunderstood when choosing their utterance. Using a corpus of production data, empirically founded and precise model predictions are generated. These predictions are evaluated with two experiments using a novel gamified utterance selection paradigm. The experiments suggest that, as predicted, speakers take into account both potential gain in efficiency and the risk of being misunderstood when choosing their utterance.
},
pubstate = {published},
type = {proceeding}
}

Project:   B3

Häuser, Katja; Borovsky, Arielle

Got it right up front? Further evidence for parallel graded prediction during prenominal article processing in a self-paced reading study Journal Article

Glossa Psycholinguistics, 4, 2025.

Recent studies suggest that language users generate and maintain multiple predictions in parallel, especially in tasks that explicitly instruct participants to generate predictions. Here, we investigated the possibility of parallel gradedness of linguistic predictions in a simple reading task, using a new measure—imbalance—that captures the probabilistic difference between multiple sentence completions. We focus on prenominal gender-marked articles in German to obtain prediction-specific effects. Native speakers of German read predictable or unpredictable gender-marked nouns that were preceded by prediction-consistent or -inconsistent prenominal articles. Sentence frames either biased expectations more strongly toward the most likely continuation of the sentence, or balanced expectations between the first and second most likely continuation. The results showed reading facilitation for gender-marked articles when sentences were more biased but slowing when sentences were more balanced, irrespective of article predictability. We conclude that readers issue multiple prenominal predictions and weigh them according to their likelihood, providing evidence for parallel gradedness of prenominal predictions. The results are discussed in light of theoretical models on prediction and rational sentence processing.

@article{haeuser-borovsky-2025,
title = {Got it right up front? Further evidence for parallel graded prediction during prenominal article processing in a self-paced reading study},
author = {Katja H{\"a}user and Arielle Borovsky},
url = {https://escholarship.org/uc/item/7g30m0th},
doi = {https://doi.org/10.5070/G6011.1636},
year = {2025},
date = {2025},
journal = {Glossa Psycholinguistics},
volume = {4},
number = {1},
abstract = {Recent studies suggest that language users generate and maintain multiple predictions in parallel, especially in tasks that explicitly instruct participants to generate predictions. Here, we investigated the possibility of parallel gradedness of linguistic predictions in a simple reading task, using a new measure—imbalance—that captures the probabilistic difference between multiple sentence completions. We focus on prenominal gender-marked articles in German to obtain prediction-specific effects. Native speakers of German read predictable or unpredictable gender-marked nouns that were preceded by prediction-consistent or -inconsistent prenominal articles. Sentence frames either biased expectations more strongly toward the most likely continuation of the sentence, or balanced expectations between the first and second most likely continuation. The results showed reading facilitation for gender-marked articles when sentences were more biased but slowing when sentences were more balanced, irrespective of article predictability. We conclude that readers issue multiple prenominal predictions and weigh them according to their likelihood, providing evidence for parallel gradedness of prenominal predictions. The results are discussed in light of theoretical models on prediction and rational sentence processing.},
pubstate = {published},
type = {article}
}

Project:   A5

Talamo, Luigi

Introducing STAF: The Saarbrücken Treebank of Albanian Fiction Journal Article

Journal of Open Humanities Data, 11, pp. 1–6, 2025.

The present paper describes the building of STAF, a Universal Dependencies treebank for Albanian. STAF was bootstrapped using a Stanza model trained on previously unreleased data and then manually corrected by three Albanian speakers supervised by the author, who also revised all sentences. STAF focuses on the fiction genre, featuring 200 sentences selected from nine literary texts written by Albanian contemporary authors.

@article{Talamo-2025,
title = {Introducing STAF: The Saarbr{\"u}cken Treebank of Albanian Fiction},
author = {Luigi Talamo},
url = {https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.285},
doi = {https://doi.org/10.5334/johd.285},
year = {2025},
date = {2025},
journal = {Journal of Open Humanities Data},
pages = {1-6},
volume = {11},
number = {3},
abstract = {

The present paper describes the building of STAF, a Universal Dependencies treebank for Albanian. STAF was bootstrapped using a Stanza model trained on previously unreleased data and then manually corrected by three Albanian speakers supervised by the author, who also revised all sentences. STAF focuses on the fiction genre, featuring 200 sentences selected from nine literary texts written by Albanian contemporary authors.
},
pubstate = {published},
type = {article}
}

Project:   C7

Li, Muqing

Informativity and linearization in reference production PhD Thesis

Saarland University, Saarbruecken, Germany, 2025.

In visually-situated referential communication tasks, speakers must select relevant visual properties and determine their linear order within a syntactic structure in order to encode a message that enables the listener to successfully identify the intended referent. While previous studies have primarily focused on the influence of informativity on property selection, especially overspecification, little is known about how informativity affects the linearization of property order, particularly when syntactic variation is involved. This thesis investigates whether and how the informativity of property words, as determined by visually-situated contexts and quantified via Referential Entropy Reduction, influences syntactic linearization. Five referential communication experiments investigate whether informativity modulates speakers’ syntactic choice between pre-nominal and post-nominal modifications in German, when describing referents in visual scenes depicting animals performing actions. Additionally, the project explores the role of communication engagement by reinforcing perspective-taking and comparing web-based and face-to-face interaction settings. The results reveal two groups of speakers: *Group Consistent*, who are insensitive to informativity and adhere to a fixed syntactic structure, in line with a speaker-oriented, heuristic production approach; and *Group Varied*, who vary the use of syntactic structures to adjust property orders based on informativity, favouring an informative-first linearization strategy that facilitates target identification for listeners. The proportion of *Group Varied* speakers increases with communication engagement, particularly in the most engaging face-to-face interactions and when perspective-taking is reinforced. This thesis advances our understanding of referential production and communication efficiency.
Examining informativity, a speaker-external, listener-oriented factor, provides a clearer distinction between the speaker-oriented and listener-oriented views of reference production, as reflected in the different linearization strategies adopted by the two speaker groups. The distribution of these two groups is mediated by perspective-taking and communication engagement. The informative-first linearization preference offers novel evidence for the role of communication efficiency and audience design in shaping syntactic choices during the early grammatical encoding phase of language production.


This dissertation investigates the influence of informativity on the syntactic linearization of object properties in reference production. In visually situated referential communication tasks, speakers construct expressions to help a co-present listener identify a specific target object in a visual scene. This process comprises two central components: 1) property selection, that is, choosing relevant properties that support unambiguous target identification, and 2) property ordering, that is, arranging the selected properties in a syntactically linearized form. While previous studies have paid particular attention to overspecification (OS) in property selection, where speakers include more information than necessary (e.g., "a blue triangle" when only one triangle is present), little is known about how property ordering works as a linearization process in reference production, especially when syntactic variation is possible, as with pre-nominal versus post-nominal structures (e.g., der weinende Hase 'the crying rabbit' vs. der Hase, der weint 'the rabbit that is crying'). By treating referential communication as a cooperative process (Clark & Wilkes-Gibbs, 1986; Grice, 1975) and as an act of information transmission (Sperber & Wilson, 1995) grounded in principles of information theory (Shannon, 1948), this dissertation examines whether speakers' referential utterances are influenced by informativity in the sense of communicative efficiency. More specifically, it explores whether informativity affects the linearization of referential properties when this linearization involves syntactic variation in referential encoding.
The aim of investigating the influence of informativity on syntactic linearization is to contribute to the theoretical debate between speaker-oriented and listener-oriented views of reference production by addressing a subject that goes beyond previous research on OS.

@phdthesis{Li_Diss_2025,
title = {Informativity and linearization in reference production},
author = {Muqing Li},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/40721},
doi = {https://doi.org/20.500.11880/40721},
year = {2025},
date = {2025},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {In visually situated referential communication tasks, speakers must select relevant visual properties and determine their linear order within a syntactic structure in order to encode a message that enables the listener to successfully identify the intended referent. While previous studies have primarily focused on the influence of informativity on property selection, especially overspecification, little is known about how informativity affects the linearization of property order, particularly when syntactic variation is involved. This thesis investigates whether and how the informativity of property words, as determined by visually situated contexts and quantified via Referential Entropy Reduction, influences syntactic linearization. Five referential communication experiments investigate whether informativity modulates speakers' syntactic choice between pre-nominal and post-nominal modification in German when describing referents in visual scenes depicting animals performing actions. Additionally, the project explores the role of communication engagement by reinforcing perspective-taking and comparing web-based and face-to-face interaction settings. The results reveal two groups of speakers: *Group Consistent*, who are insensitive to informativity and adhere to a fixed syntactic structure, in line with a speaker-oriented, heuristic production approach; and *Group Varied*, who vary the use of syntactic structures to adjust property orders based on informativity, favouring an informative-first linearization strategy that facilitates target identification for listeners. The proportion of *Group Varied* speakers increases with communication engagement, particularly in the most engaging face-to-face interactions and when perspective-taking is reinforced. This thesis advances our understanding of referential production and communication efficiency.
Examining informativity, a speaker-external, listener-oriented factor, provides a clearer distinction between the speaker-oriented and listener-oriented views of reference production, as reflected in the different linearization strategies adopted by the two speaker groups. The distribution of these two groups is mediated by perspective-taking and communication engagement. The informative-first linearization preference offers novel evidence for the role of communication efficiency and audience design in shaping syntactic choices during the early grammatical encoding phase of language production.


This dissertation investigates the influence of informativity on the syntactic linearization of object properties in reference production. In visually situated referential communication tasks, speakers construct expressions to help a co-present listener identify a specific target object in a visual scene. This process comprises two central components: 1) property selection, that is, choosing relevant properties that support unambiguous target identification, and 2) property ordering, that is, arranging the selected properties in a syntactically linearized form. While previous studies have paid particular attention to overspecification (OS) in property selection, where speakers include more information than necessary (e.g., "a blue triangle" when only one triangle is present), little is known about how property ordering works as a linearization process in reference production, especially when syntactic variation is possible, as with pre-nominal versus post-nominal structures (e.g., der weinende Hase 'the crying rabbit' vs. der Hase, der weint 'the rabbit that is crying'). By treating referential communication as a cooperative process (Clark & Wilkes-Gibbs, 1986; Grice, 1975) and as an act of information transmission (Sperber & Wilson, 1995) grounded in principles of information theory (Shannon, 1948), this dissertation examines whether speakers' referential utterances are influenced by informativity in the sense of communicative efficiency. More specifically, it explores whether informativity affects the linearization of referential properties when this linearization involves syntactic variation in referential encoding.
The aim of investigating the influence of informativity on syntactic linearization is to contribute to the theoretical debate between speaker-oriented and listener-oriented views of reference production by addressing a subject that goes beyond previous research on OS.},
pubstate = {published},
type = {phdthesis}
}


Project:   C3
