Publications

Landwehr, Isabell

The Interplay of Noun Phrase Complexity and Modification Type in Scientific Writing Inproceedings

Chen, Xinying; Wang, Yaqin (Ed.): Proceedings of the Third Workshop on Quantitative Syntax (QUASY, SyntaxFest 2025), Association for Computational Linguistics, pp. 72-82, Ljubljana, Slovenia, 2025, ISBN 979-8-89176-293-0.

We investigate the interplay of noun phrase (NP) complexity and modification type, namely the choice between pre- and postmodification, using a corpus-based approach. Our dataset is the Royal Society Corpus (RSC, Fischer et al. 2020), a diachronic corpus of English scientific writing. We find that the number of dependents, length of the head noun and distance to the head noun's own syntactic head (typically the main verb) affect the likelihood of pre- vs. postmodification: NPs with more dependents are more likely to be premodified, NPs with a longer head noun and a head noun closer to its own head are more likely to be postmodified. In addition, we find an effect of syntactic role and definiteness as well as time: The likelihood of premodification over postmodification increases with time and subject NPs as well as indefinite NPs are more likely to be premodified than NPs in other syntactic roles or definite NPs.

@inproceedings{landwehr-2025-interplay,
title = {The Interplay of Noun Phrase Complexity and Modification Type in Scientific Writing},
author = {Isabell Landwehr},
editor = {Xinying Chen and Yaqin Wang},
url = {https://aclanthology.org/2025.quasy-1.10/},
year = {2025},
date = {2025},
booktitle = {Proceedings of the Third Workshop on Quantitative Syntax (QUASY, SyntaxFest 2025)},
isbn = {979-8-89176-293-0},
pages = {72-82},
publisher = {Association for Computational Linguistics},
address = {Ljubljana, Slovenia},
abstract = {We investigate the interplay of noun phrase (NP) complexity and modification type, namely the choice between pre- and postmodification, using a corpus-based approach. Our dataset is the Royal Society Corpus (RSC, Fischer et al. 2020), a diachronic corpus of English scientific writing. We find that the number of dependents, length of the head noun and distance to the head noun{'}s own syntactic head (typically the main verb) affect the likelihood of pre- vs. postmodification: NPs with more dependents are more likely to be premodified, NPs with a longer head noun and a head noun closer to its own head are more likely to be postmodified. In addition, we find an effect of syntactic role and definiteness as well as time: The likelihood of premodification over postmodification increases with time and subject NPs as well as indefinite NPs are more likely to be premodified than NPs in other syntactic roles or definite NPs.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Landwehr, Isabell; Krielke, Marie-Pauline; Degaetano-Ortlieb, Stefania; Zhao, Jin; Wang, Mingyang; Liu, Zhu

Exploring the Effect of Nominal Compound Structure in Scientific Texts on Reading Times of Experts and Novices Inproceedings

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), Association for Computational Linguistics, pp. 396-408, Vienna, Austria, 2025, ISBN 979-8-89176-254-1.

We explore how different types of nominal compound complexity in scientific writing, in particular different types of compound structure, affect the reading times of experts and novices. We consider both in-domain and out-of-domain reading and use PoTeC (Jakobi et al. 2024), a corpus containing eye-tracking data of German native speakers reading passages from scientific textbooks. Our results suggest that some compound types are associated with longer reading times and that experts may not only have an advantage while reading in-domain texts, but also while reading out-of-domain.

@inproceedings{landwehr-etal-2025-exploring,
title = {Exploring the Effect of Nominal Compound Structure in Scientific Texts on Reading Times of Experts and Novices},
author = {Isabell Landwehr and Marie-Pauline Krielke and Stefania Degaetano-Ortlieb and Jin Zhao and Mingyang Wang and Zhu Liu},
url = {https://aclanthology.org/2025.acl-srw.25/},
doi = {https://doi.org/10.18653/v1/2025.acl-srw.25},
year = {2025},
date = {2025},
booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)},
isbn = {979-8-89176-254-1},
pages = {396-408},
publisher = {Association for Computational Linguistics},
address = {Vienna, Austria},
abstract = {We explore how different types of nominal compound complexity in scientific writing, in particular different types of compound structure, affect the reading times of experts and novices. We consider both in-domain and out-of-domain reading and use PoTeC (Jakobi et al. 2024), a corpus containing eye-tracking data of German native speakers reading passages from scientific textbooks. Our results suggest that some compound types are associated with longer reading times and that experts may not only have an advantage while reading in-domain texts, but also while reading out-of-domain.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Pollkläsener, Christina; Kunilovskaya, Maria

Euh... where do interpreters hesitate? An information-theoretic perspective on sentence-initial filler particles in simultaneous interpreting Inproceedings

12th edition of the Disfluency in Spontaneous Speech Workshop (DiSS 2025), pp. 92-96, 2025.

This study investigates the occurrence of sentence-initial filler particles (e.g. euh, hum) in simultaneously interpreted and original speeches using a bidirectional English-German corpus of European Parliament debates. We assume that sentence-initial filler particles indicate planning difficulties at the conceptual level, whereas sentence-medial filler particles mark hesitations over syntactic structure or lexical access. Since interpreters convey the source speech and do not plan their own message, we expect differences between interpreting and original speeches. We operationalise conceptual complexity as average word surprisal per sentence and local lexical or syntactic production problems as surprisal of the word following the filler particle. Our findings indicate that sentence-initial filler particles appear in sentences with higher conceptual complexity but are not well associated with local retrieval difficulty.

@inproceedings{pollklasener25_diss,
title = {Euh... where do interpreters hesitate? An information-theoretic perspective on sentence-initial filler particles in simultaneous interpreting},
author = {Christina Pollkl{\"a}sener and Maria Kunilovskaya},
url = {https://www.isca-archive.org/tmp/diss_2025/pollklasener25_diss.pdf},
doi = {https://doi.org/10.21437/DiSS.2025-19},
year = {2025},
date = {2025},
booktitle = {12th edition of the Disfluency in Spontaneous Speech Workshop (DiSS 2025)},
pages = {92-96},
abstract = {This study investigates the occurrence of sentence-initial filler particles (e.g. euh, hum) in simultaneously interpreted and original speeches using a bidirectional English-German corpus of European Parliament debates. We assume that sentence-initial filler particles indicate planning difficulties at the conceptual level, whereas sentence-medial filler particles mark hesitations over syntactic structure or lexical access. Since interpreters convey the source speech and do not plan their own message, we expect differences between interpreting and original speeches. We operationalise conceptual complexity as average word surprisal per sentence and local lexical or syntactic production problems as surprisal of the word following the filler particle. Our findings indicate that sentence-initial filler particles appear in sentences with higher conceptual complexity but are not well associated with local retrieval difficulty.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Krieger, Benedict; Brouwer, Harm; Aurnhammer, Christoph; Crocker, Matthew W.

On the Limits of LLM Surprisal as a Functional Explanation of the N400 and P600 Journal Article

Brain Research, 1865, pp. 149841, 2025, ISSN 0006-8993.

Expectations about upcoming words play a central role in language comprehension, with expected words being processed more easily than less expected ones. Surprisal theory formalizes this relationship by positing that cognitive effort is proportional to a word’s negative log-probability in context, as determined by distributional, linguistic, and world knowledge constraints. The emergence of large language models (LLMs) demonstrating the capacity to compute richly contextualized surprisal estimates, has motivated their consideration as models of comprehension. We assess here the relationship of LLM surprisal with two key neural correlates of comprehension — the N400 and the P600 — which differ in sensitivity to semantic association and contextual expectancy. While prior work has focused on the N400, we propose that the P600 may offer a better index of surprisal, as it is unaffected by association while still patterning continuously with expectancy. Using regression-based ERPs (rERPs), we examine data from three German factorial studies to evaluate the extent to which LLM surprisal can account for ERP differences. Our results show that LLM surprisal captures neither component consistently. We find that it is contaminated by simple association, particularly in smaller LLMs. As a result, LLM surprisal can partially account for association-driven N400 effects, but not for the full attenuation of N400 effects. Correspondingly, this property of LLMs compromises their ability to model the P600, which is sensitive to expectancy but not to association.

@article{KRIEGER2025149841,
title = {On the Limits of LLM Surprisal as a Functional Explanation of the N400 and P600},
author = {Benedict Krieger and Harm Brouwer and Christoph Aurnhammer and Matthew W. Crocker},
url = {https://www.sciencedirect.com/science/article/pii/S0006899325004020},
doi = {https://doi.org/10.1016/j.brainres.2025.149841},
year = {2025},
date = {2025},
journal = {Brain Research},
pages = {149841},
volume = {1865},
abstract = {Expectations about upcoming words play a central role in language comprehension, with expected words being processed more easily than less expected ones. Surprisal theory formalizes this relationship by positing that cognitive effort is proportional to a word's negative log-probability in context, as determined by distributional, linguistic, and world knowledge constraints. The emergence of large language models (LLMs) demonstrating the capacity to compute richly contextualized surprisal estimates, has motivated their consideration as models of comprehension. We assess here the relationship of LLM surprisal with two key neural correlates of comprehension -- the N400 and the P600 -- which differ in sensitivity to semantic association and contextual expectancy. While prior work has focused on the N400, we propose that the P600 may offer a better index of surprisal, as it is unaffected by association while still patterning continuously with expectancy. Using regression-based ERPs (rERPs), we examine data from three German factorial studies to evaluate the extent to which LLM surprisal can account for ERP differences. Our results show that LLM surprisal captures neither component consistently. We find that it is contaminated by simple association, particularly in smaller LLMs. As a result, LLM surprisal can partially account for association-driven N400 effects, but not for the full attenuation of N400 effects. Correspondingly, this property of LLMs compromises their ability to model the P600, which is sensitive to expectancy but not to association.},
pubstate = {published},
type = {article}
}

Project:   A1

Andreeva, Bistra; Yuen, Ivan; Möbius, Bernd; Ibrahim, Omnia

Informationsdichte und die Vorhersagbarkeit der phonetischen Struktur Journal Article

Petkova-Kessanlis, Mikaela; Ivanova, Radka; Kileva-Stamenova, Reneta; Arnaudova, Svetlana; Endreva, Maria (Ed.): Journal for German and Scandinavian Studies, 1, 2025.

The study examines the relationship between information density and linguistic encoding in phonetics and in human language processing. The information density of a linguistic unit is defined in terms of surprisal (the negative logarithm of the probability of a unit in a given context). The effects of surprisal on phonetic encoding were examined with respect to several aspects, such as vowel formant trajectories, voicing of plosives, syllable duration and vowel dispersion (including in the L2), taking into account control factors of prosodic structure as well as possible interactions with the Lombard effect and prosodic structure. The results suggest that speakers adjust phonetic detail to maintain a balance between information density and phonetic encoding.

@article{andreeva_2025_informationsdichte,
title = {Informationsdichte und die Vorhersagbarkeit der phonetischen Struktur},
author = {Bistra Andreeva and Ivan Yuen and Bernd M{\"o}bius and Omnia Ibrahim},
editor = {Mikaela Petkova-Kessanlis and Radka Ivanova and Reneta Kileva-Stamenova and Svetlana Arnaudova and Maria Endreva},
url = {https://journalgermscand.fcml.uni-sofia.bg/wp-content/uploads/2025/04/14.-Andreeva-Spisanie___Wege-und-Umwege-zum-Wandel___VOL-1_Final-dragged.pdf},
year = {2025},
date = {2025},
journal = {Journal for German and Scandinavian Studies},
volume = {1},
abstract = {Die Studie untersucht die Beziehung zwischen Informationsdichte und linguistischer Kodierung in der Phonetik sowie der menschlichen Sprachverarbeitung. Die Informationsdichte einer linguistischen Einheit wird in Bezug auf Surprisal (den negativen Logarithmus der Wahrscheinlichkeit einer Einheit in einem gegebenen Kontext) definiert. Die Effekte von Surprisal auf die phonetische Kodierung wurden hinsichtlich verschiedener Aspekte wie Formantenverl{\"a}ufe von Vokalen, Stimmhaftigkeit von Plosiven, Silbendauer und Vokaldispersion (auch im L2) untersucht, wobei Kontrollfaktoren der prosodischen Struktur sowie m{\"o}gliche Interaktionen mit dem Lombard-Effekt und der prosodischen Struktur ber{\"u}cksichtigt wurden. Die Ergebnisse deuten darauf hin, dass Sprecher phonetische Details anpassen, um ein Gleichgewicht zwischen Informationsdichte und phonetischer Kodierung aufrechtzuerhalten.},
pubstate = {published},
type = {article}
}

Project:   C1

Häuser, Katja; Kray, Jutta

Not so SUBTLE(X): Word frequency estimates and their fit to sentential reading times in interaction with predictability Journal Article

Linguistics - An Interdisciplinary Journal of the Language Sciences, 2025.

Frequency and predictability are two prominent psycholinguistic variables that determine the ease of word comprehension and have informed models of language processing. Here, we pooled the data from five self-paced reading studies to investigate (1) the usefulness of three well-known frequency databases of German in accounting for word reading times in context (i.e., the SUBTLEX-DE, CELEX, and dlexDB databases), and (2) whether frequency and predictability have additive or interactive effects on lexical processing. Regarding (1), goodness of fit comparisons between the three frequency measures showed that, in the majority of models, dlexDB frequencies performed best (in contrast to earlier investigations recommending to use SUBTLEX), even though nearly all frequency effects were statistically invariant and dwarfed by the contributions of other more potent variables such as predictability or trial number. Regarding (2), we found that, even though predictability influenced reading times, there was no evidence for interactive effects of frequency and predictability. Our results call into question the current default practice in many psycholinguistic studies to rely on subtitle norms when it comes to estimating lexical frequencies, but they also suggest that frequency effects may be negligible in paradigms which promote contextual word-by-word reading. Our findings are more in line with modular models of language comprehension in which lexical access operates independently from contextual predictability.

@article{haeuser_2025_subtlex,
title = {Not so SUBTLE(X): Word frequency estimates and their fit to sentential reading times in interaction with predictability},
author = {Katja H{\"a}user and Jutta Kray},
url = {https://www.degruyterbrill.com/document/doi/10.1515/ling-2024-0143/html},
doi = {https://doi.org/10.1515/ling-2024-0143},
year = {2025},
date = {2025},
journal = {Linguistics - An Interdisciplinary Journal of the Language Sciences},
abstract = {
Frequency and predictability are two prominent psycholinguistic variables that determine the ease of word comprehension and have informed models of language processing. Here, we pooled the data from five self-paced reading studies to investigate (1) the usefulness of three well-known frequency databases of German in accounting for word reading times in context (i.e., the SUBTLEX-DE, CELEX, and dlexDB databases), and (2) whether frequency and predictability have additive or interactive effects on lexical processing. Regarding (1), goodness of fit comparisons between the three frequency measures showed that, in the majority of models, dlexDB frequencies performed best (in contrast to earlier investigations recommending to use SUBTLEX), even though nearly all frequency effects were statistically invariant and dwarfed by the contributions of other more potent variables such as predictability or trial number. Regarding (2), we found that, even though predictability influenced reading times, there was no evidence for interactive effects of frequency and predictability. Our results call into question the current default practice in many psycholinguistic studies to rely on subtitle norms when it comes to estimating lexical frequencies, but they also suggest that frequency effects may be negligible in paradigms which promote contextual word-by-word reading. Our findings are more in line with modular models of language comprehension in which lexical access operates independently from contextual predictability.
},
pubstate = {published},
type = {article}
}

Project:   A5

Xue, Wei; Zaitova, Iuliia; Möbius, Bernd

The Effect of Word Predictability on Spoken Cross-Language Intelligibility Inproceedings

Proceedings of Interspeech 2025, pp. 3798-3802, Rotterdam, The Netherlands, 2025, ISSN 2958-1796.

Cross-language intelligibility refers to how well speakers of language A understand language B without prior learning. While the impact of linguistic and extra-linguistic factors on cross-language intelligibility has been widely studied, the effect of word predictability, known to impact comprehension and speech perception, remains underexplored. This study examines this effect by comparing German and English native speakers translating Dutch words presented in Dutch spoken sentential utterances with varying word predictability. We also investigate whether additional written context would aid cross-language intelligibility. Our results showed that word predictability significantly influences cross-language intelligibility, with German speakers experiencing even stronger effects, whereas only English speakers benefit from the additional written context. These findings suggest that word predictability dynamically shapes cross-language intelligibility, tending to be language-specific.

@inproceedings{Xue/etal:2025a,
title = {The Effect of Word Predictability on Spoken Cross-Language Intelligibility},
author = {Wei Xue and Iuliia Zaitova and Bernd M{\"o}bius},
url = {https://www.isca-archive.org/interspeech_2025/xue25b_interspeech.html},
doi = {https://doi.org/10.21437/Interspeech.2025-1676},
year = {2025},
date = {2025},
booktitle = {Proceedings of Interspeech 2025},
issn = {2958-1796},
pages = {3798-3802},
address = {Rotterdam, The Netherlands},
abstract = {
Cross-language intelligibility refers to how well speakers of language A understand language B without prior learning. While the impact of linguistic and extra-linguistic factors on cross-language intelligibility has been widely studied, the effect of word predictability, known to impact comprehension and speech perception, remains underexplored. This study examines this effect by comparing German and English native speakers translating Dutch words presented in Dutch spoken sentential utterances with varying word predictability. We also investigate whether additional written context would aid cross-language intelligibility. Our results showed that word predictability significantly influences cross-language intelligibility, with German speakers experiencing even stronger effects, whereas only English speakers benefit from the additional written context. These findings suggest that word predictability dynamically shapes cross-language intelligibility, tending to be language-specific.
},
pubstate = {published},
type = {inproceedings}
}

Project:   C4

Abdullah, Badr M.; Baas, Matthew; Möbius, Bernd; Klakow, Dietrich

Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification Inproceedings

Proceedings of Interspeech 2025, pp. 2790-2794, Rotterdam, The Netherlands, 2025, ISSN 2958-1796.

Arabic dialect identification (ADI) systems are essential for large-scale data collection pipelines that enable the development of inclusive speech technologies for Arabic language varieties. However, the reliability of current ADI systems is limited by poor generalization to out-of-domain speech. In this paper, we present an effective approach based on voice conversion for training ADI models that achieves state-of-the-art performance and significantly improves robustness in cross-domain scenarios. Evaluated on a newly collected real-world test set spanning four different domains, our approach yields consistent improvements of up to +34.1% in accuracy across domains. Furthermore, we present an analysis of our approach and demonstrate that voice conversion helps mitigate the speaker bias in the ADI dataset. We release our robust ADI model and cross-domain evaluation dataset to support the development of inclusive speech technologies for Arabic.

@inproceedings{Abdullah/etal:2025a,
title = {Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification},
author = {Badr M. Abdullah and Matthew Baas and Bernd M{\"o}bius and Dietrich Klakow},
url = {https://www.isca-archive.org/interspeech_2025/abdullah25_interspeech.html},
doi = {https://doi.org/10.21437/Interspeech.2025-1809},
year = {2025},
date = {2025},
booktitle = {Proceedings of Interspeech 2025},
issn = {2958-1796},
pages = {2790-2794},
address = {Rotterdam, The Netherlands},
abstract = {
Arabic dialect identification (ADI) systems are essential for large-scale data collection pipelines that enable the development of inclusive speech technologies for Arabic language varieties. However, the reliability of current ADI systems is limited by poor generalization to out-of-domain speech. In this paper, we present an effective approach based on voice conversion for training ADI models that achieves state-of-the-art performance and significantly improves robustness in cross-domain scenarios. Evaluated on a newly collected real-world test set spanning four different domains, our approach yields consistent improvements of up to +34.1% in accuracy across domains. Furthermore, we present an analysis of our approach and demonstrate that voice conversion helps mitigate the speaker bias in the ADI dataset. We release our robust ADI model and cross-domain evaluation dataset to support the development of inclusive speech technologies for Arabic.
},
pubstate = {published},
type = {inproceedings}
}

Project:   C4

Schacht, Carmen; Nischk, Tobias; Yazdanfar, Oleksandra; Dipper, Stefanie

Cheap Annotation of Complex Information: A Study on the Annotation of Information Status in German TEDx Talks Inproceedings

Peng, Siyao; Rehbein, Ines (Ed.): Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025), Association for Computational Linguistics, pp. 297-307, Vienna, Austria, 2025, ISBN 979-8-89176-262-6.

We present an annotation experiment for the annotation of information status in German TEDx Talks with the main goal to reduce annotation costs in terms of time and personnel. We aim for maximizing efficiency while keeping annotation quality constant by testing various different annotation scenarios for an optimal ratio of annotation expenses to resulting quality of the annotations. We choose the RefLex scheme of Riester and Baumann (2017) as a basis for our annotations, refine their annotation guidelines for a more generalizable tagset and conduct the experiment on German Tedx talks, applying different constellations of annotators, curators and correctors to test for an optimal annotation scenario. Our results show that we can achieve equally good and possibly even better results with significantly less effort, by using correctors instead of additional annotators.

@inproceedings{schacht-etal-2025-cheap,
title = {Cheap Annotation of Complex Information: A Study on the Annotation of Information Status in German TEDx Talks},
author = {Carmen Schacht and Tobias Nischk and Oleksandra Yazdanfar and Stefanie Dipper},
editor = {Siyao Peng and Ines Rehbein},
url = {https://aclanthology.org/2025.law-1.25/},
doi = {https://doi.org/10.18653/v1/2025.law-1.25},
year = {2025},
date = {2025},
booktitle = {Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)},
isbn = {979-8-89176-262-6},
pages = {297-307},
publisher = {Association for Computational Linguistics},
address = {Vienna, Austria},
abstract = {We present an annotation experiment for the annotation of information status in German TEDx Talks with the main goal to reduce annotation costs in terms of time and personnel. We aim for maximizing efficiency while keeping annotation quality constant by testing various different annotation scenarios for an optimal ratio of annotation expenses to resulting quality of the annotations. We choose the RefLex scheme of Riester and Baumann (2017) as a basis for our annotations, refine their annotation guidelines for a more generalizable tagset and conduct the experiment on German Tedx talks, applying different constellations of annotators, curators and correctors to test for an optimal annotation scenario. Our results show that we can achieve equally good and possibly even better results with significantly less effort, by using correctors instead of additional annotators.},
pubstate = {published},
type = {inproceedings}
}

Project:   C6

Zaitova, Iuliia; Abdullah, Badr M.; Xue, Wei; Klakow, Dietrich; Möbius, Bernd; Avgustinova, Tania

It’s not a walk in the park! Challenges of idiom translation in speech-to-text systems Inproceedings

Che, Wanxiang; Nabende, Joyce; Shutova, Ekaterina; Pilehvar, Mohammad Taher (Ed.): Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, Association for Computational Linguistics, pp. 31310-31322, Vienna, Austria, 2025, ISBN 979-8-89176-251-0.

Idioms are defined as a group of words with a figurative meaning not deducible from their individual components. Although modern machine translation systems have made remarkable progress, translating idioms remains a major challenge, especially for speech-to-text systems, where research on this topic is notably sparse. In this paper, we systematically evaluate idiom translation as compared to conventional news translation in both text-to-text machine translation (MT) and speech-to-text translation (SLT) systems across two language pairs (German to English, Russian to English). We compare state-of-the-art end-to-end SLT systems (SeamlessM4T SLT-to-text, Whisper Large v3) with MT systems (SeamlessM4T SLT-to-text, No Language Left Behind), Large Language Models (DeepSeek, LLaMA) and cascaded alternatives. Our results reveal that SLT systems experience a pronounced performance drop on idiomatic data, often reverting to literal translations even in higher layers, whereas MT systems and Large Language Models demonstrate better handling of idioms. These findings underscore the need for idiom-specific strategies and improved internal representations in SLT architectures.

@inproceedings{Zaitova/etal:2025,
title = {It’s not a walk in the park! Challenges of idiom translation in speech-to-text systems},
author = {Iuliia Zaitova and Badr M. Abdullah and Wei Xue and Dietrich Klakow and Bernd M{\"o}bius and Tania Avgustinova},
editor = {Wanxiang Che and Joyce Nabende and Ekaterina Shutova and Mohammad Taher Pilehvar},
url = {https://aclanthology.org/2025.acl-long.1512/},
year = {2025},
date = {2025},
booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
isbn = {979-8-89176-251-0},
pages = {31310-31322},
publisher = {Association for Computational Linguistics},
address = {Vienna, Austria},
abstract = {
Idioms are defined as a group of words with a figurative meaning not deducible from their individual components. Although modern machine translation systems have made remarkable progress, translating idioms remains a major challenge, especially for speech-to-text systems, where research on this topic is notably sparse. In this paper, we systematically evaluate idiom translation as compared to conventional news translation in both text-to-text machine translation (MT) and speech-to-text translation (SLT) systems across two language pairs (German to English, Russian to English). We compare state-of-the-art end-to-end SLT systems (SeamlessM4T SLT-to-text, Whisper Large v3) with MT systems (SeamlessM4T SLT-to-text, No Language Left Behind), Large Language Models (DeepSeek, LLaMA) and cascaded alternatives. Our results reveal that SLT systems experience a pronounced performance drop on idiomatic data, often reverting to literal translations even in higher layers, whereas MT systems and Large Language Models demonstrate better handling of idioms. These findings underscore the need for idiom-specific strategies and improved internal representations in SLT architectures.
},
pubstate = {published},
type = {inproceedings}
}

Project:   C4

Andreeva, Bistra; Möbius, Bernd; Yuen, Ivan; Ibrahim, Omnia

Informationsdichte und die Vorhersagbarkeit der phonetischen Struktur Journal Article

Petkova-Kessanlis, Mikaela; Ivanova, Radka; Kileva-Stamenova, Reneta; Arnaudova, Svetlana; Endreva, Maria (Ed.): Journal for German and Scandinavian Studies, 1, pp. 251-266, 2025.

The study examines the relationship between information density and linguistic encoding in phonetics and in human language processing. The information density of a linguistic unit is defined in terms of surprisal (the negative logarithm of the probability of a unit in a given context). The effects of surprisal on phonetic encoding were examined with respect to several aspects, such as vowel formant trajectories, voicing of plosives, syllable duration and vowel dispersion (including in the L2), taking into account control factors of prosodic structure as well as possible interactions with the Lombard effect and prosodic structure. The results suggest that speakers adjust phonetic detail to maintain a balance between information density and phonetic encoding.

@article{Andreeva/etal:2025a,
title = {Informationsdichte und die Vorhersagbarkeit der phonetischen Struktur},
author = {Bistra Andreeva and Bernd M{\"o}bius and Ivan Yuen and Omnia Ibrahim},
editor = {Mikaela Petkova-Kessanlis and Radka Ivanova and Reneta Kileva-Stamenova and Svetlana Arnaudova and Maria Endreva},
url = {https://doi.org/10.60055/GerSk.2025.izv.1},
doi = {10.60055/GerSk.2025.izv.1},
year = {2025},
date = {2025},
journal = {Journal for German and Scandinavian Studies},
pages = {251-266},
volume = {1},
abstract = {The study examines the relationship between information density and linguistic encoding in phonetics and in human speech processing. The information density of a linguistic unit is defined in terms of surprisal (the negative logarithm of the probability of a unit in a given context). The effects of surprisal on phonetic encoding were investigated with respect to several aspects, such as vowel formant trajectories, plosive voicing, syllable duration and vowel dispersion (including in L2), taking into account control factors of prosodic structure as well as possible interactions with the Lombard effect and prosodic structure. The results suggest that speakers adjust phonetic detail in order to maintain a balance between information density and phonetic encoding.},
pubstate = {published},
type = {article}
}

Project:   C1

Menzel, Katrin; Przybyl, Heike; Lapshinova-Koltunski, Ekaterina

EPIC-UdS – ein mehrsprachiges Korpus als Grundlage für die korpusbasierte Dolmetsch- und Übersetzungswissenschaft Incollection

Schmidhofer, Astrid; Ángeles Recio Ariza, María (Ed.): Zukunftsperspektiven in der Translationswissenschaft Ausgewählte Beiträge der Translata IV, innsbruck university press, pp. 323-349, Innsbruck, 2025, ISBN 978-3-99106-165-6.

In this paper, we present examples of current research foci and results of analyses of the Collaborative Research Centre subproject “Translation as Rational Communication” that focuses on the specific linguistic properties of interpreted and translated texts distinguishing them from original productions. We describe the creation and annotation of EPIC-UdS, a multilingual corpus of simultaneous interpreting for English, German and Spanish. We give an overview of the corpus variants and explore various applications of the corpus. Building on the ‘translationese’ hypothesis from translation studies, we investigate whether simultaneous interpreted language resembles written translated language with regard to specific features or whether it carries ‘interpretese’ features as a result of a unique language transfer process so that traces of these features tend to occur in all simultaneously interpreted texts and distinguish them from other texts. For instance, with regard to the simplification hypothesis put forward by various translation scholars, we can observe in interpreted texts that there is a tendency towards syntactic simplification. Another analysis shows that interpreted language is characterised by a particular use of discourse particles. EPIC-UdS contains rich metadata and fine-grained linguistic annotations tailored for diverse applications across a broad range of linguistic subfields. This paper provides the first overview in German on the EPIC-UdS corpus with the aim of bringing together results from individual studies, such as Bizzoni/Teich 2019; Karakanta/Vela/Teich 2018; Lapshinova-Koltunski et al. 2021a; Lapshinova-Koltunski/Przybyl/Bizzoni 2021b; Lapshinova-Koltunski/Pollkläsener/Przybyl 2022; Przybyl/Teich 2021; Pollkläsener 2021; Przybyl et al. 2022a, 2022b, and giving a concise and informative summary of applications and impacts of the project.

@incollection{Menzel_2025_EPIC,
title = {EPIC-UdS – ein mehrsprachiges Korpus als Grundlage f{\"u}r die korpusbasierte Dolmetsch- und {\"U}bersetzungswissenschaft},
author = {Katrin Menzel and Heike Przybyl and Ekaterina Lapshinova-Koltunski},
editor = {Astrid Schmidhofer and Mar{\'i}a {\'A}ngeles Recio Ariza},
url = {https://www.uibk.ac.at/iup/buecher/9783991061656.html},
doi = {10.15203/99106-165-6-16},
year = {2025},
date = {2025},
booktitle = {Zukunftsperspektiven in der Translationswissenschaft Ausgew{\"a}hlte Beitr{\"a}ge der Translata IV},
isbn = {978-3-99106-165-6},
pages = {323-349},
publisher = {innsbruck university press},
address = {Innsbruck},
abstract = {In this paper, we present examples of current research foci and results of analyses of the Collaborative Research Centre subproject “Translation as Rational Communication” that focuses on the specific linguistic properties of interpreted and translated texts distinguishing them from original productions. We describe the creation and annotation of EPIC-UdS, a multilingual corpus of simultaneous interpreting for English, German and Spanish. We give an overview of the corpus variants and explore various applications of the corpus. Building on the ‘translationese’ hypothesis from translation studies, we investigate whether simultaneous interpreted language resembles written translated language with regard to specific features or whether it carries ‘interpretese’ features as a result of a unique language transfer process so that traces of these features tend to occur in all simultaneously interpreted texts and distinguish them from other texts. For instance, with regard to the simplification hypothesis put forward by various translation scholars, we can observe in interpreted texts that there is a tendency towards syntactic simplification. Another analysis shows that interpreted language is characterised by a particular use of discourse particles. EPIC-UdS contains rich metadata and fine-grained linguistic annotations tailored for diverse applications across a broad range of linguistic subfields. This paper provides the first overview in German on the EPIC-UdS corpus with the aim of bringing together results from individual studies, such as Bizzoni/Teich 2019; Karakanta/Vela/Teich 2018; Lapshinova-Koltunski et al. 2021a; Lapshinova-Koltunski/Przybyl/Bizzoni 2021b; Lapshinova-Koltunski/Pollkl{\"a}sener/Przybyl 2022; Przybyl/Teich 2021; Pollkl{\"a}sener 2021; Przybyl et al. 2022a, 2022b, and giving a concise and informative summary of applications and impacts of the project.},
pubstate = {published},
type = {incollection}
}

Project:   B7

Spalek, Katharina; Bader, Regine; Glaser, Sandra; Höltje, Gerrit; Mecklinger, Axel

Contrastive focus accent retroactively modulates memory for focus alternatives: evidence from event-related potentials Journal Article

Language, Cognition and Neuroscience, Routledge, pp. 1-18, 2025.

Contrastive focus accent in spoken language indicates that alternatives to the focused element are relevant for interpretation. The sentence “Could I have some TEA, please?”, with contrastive accent on tea, is probably the response to an offer of several alternative beverages. Research shows that contrastive focus accent helps listeners remember such alternatives. We investigated the time-course of and mechanisms behind the effects of contrastive focus accent on memory with a variant of the subsequent memory effect paradigm with event-related potentials (ERPs). The ERP time-locked to a critical word was more positive-going if participants remembered two earlier mentioned alternatives than just one, but only if the critical word had been contrastively accented. This effect further was only observed when the critical word itself was remembered. These findings suggest that contrastive focus marking triggers a reinstatement of the preceding sentence context (retrieval practice) by which these elements are prioritised in memory.

@article{Spaleck.etal.2025,
title = {Contrastive focus accent retroactively modulates memory for focus alternatives: evidence from event-related potentials},
author = {Katharina Spalek and Regine Bader and Sandra Glaser and Gerrit H{\"o}ltje and Axel Mecklinger},
url = {https://doi.org/10.1080/23273798.2025.2503906},
doi = {10.1080/23273798.2025.2503906},
year = {2025},
date = {2025},
journal = {Language, Cognition and Neuroscience},
pages = {1-18},
publisher = {Routledge},
abstract = {Contrastive focus accent in spoken language indicates that alternatives to the focused element are relevant for interpretation. The sentence “Could I have some TEA, please?”, with contrastive accent on tea, is probably the response to an offer of several alternative beverages. Research shows that contrastive focus accent helps listeners remember such alternatives. We investigated the time-course of and mechanisms behind the effects of contrastive focus accent on memory with a variant of the subsequent memory effect paradigm with event-related potentials (ERPs). The ERP time-locked to a critical word was more positive-going if participants remembered two earlier mentioned alternatives than just one, but only if the critical word had been contrastively accented. This effect further was only observed when the critical word itself was remembered. These findings suggest that contrastive focus marking triggers a reinstatement of the preceding sentence context (retrieval practice) by which these elements are prioritised in memory.},
pubstate = {published},
type = {article}
}

Project:   A6

Scholman, Merel; Marchal, Marian; Brown, AriaRay; Demberg, Vera

DiscoNaija: A discourse-annotated parallel Nigerian Pidgin-English corpus Journal Article Forthcoming

Language Resources and Evaluation, 2025.

@article{scholman_etal_2025_disconaija,
title = {DiscoNaija: A discourse-annotated parallel Nigerian Pidgin-English corpus},
author = {Merel Scholman and Marian Marchal and AriaRay Brown and Vera Demberg},
year = {2025},
date = {2025},
journal = {Language Resources and Evaluation},
pubstate = {forthcoming},
type = {article}
}

Project:   B2

Zaitova, Iuliia; Hirak, Vitalii; Abdullah, Badr M.; Klakow, Dietrich; Möbius, Bernd; Avgustinova, Tania

Attention on Multiword Expressions: A Multilingual Study of BERT-based Models with Regard to Idiomaticity and Microsyntax Inproceedings

Chiruzzo, Luis; Ritter, Alan; Wang, Lu (Ed.): Findings of the Association for Computational Linguistics: NAACL 2025, Association for Computational Linguistics, pp. 4083-4092, Albuquerque, New Mexico, 2025, ISBN 979-8-89176-195-7.

This study analyzes the attention patterns of fine-tuned encoder-only models based on the BERT architecture (BERT-based models) towards two distinct types of Multiword Expressions (MWEs): idioms and microsyntactic units (MSUs). Idioms present challenges in semantic non-compositionality, whereas MSUs demonstrate unconventional syntactic behavior that does not conform to standard grammatical categorizations. We aim to understand whether fine-tuning BERT-based models on specific tasks influences their attention to MWEs, and how this attention differs between semantic and syntactic tasks. We examine attention scores to MWEs in both pre-trained and fine-tuned BERT-based models. We utilize monolingual models and datasets in six Indo-European languages: English, German, Dutch, Polish, Russian, and Ukrainian. Our results show that fine-tuning significantly influences how models allocate attention to MWEs. Specifically, models fine-tuned on semantic tasks tend to distribute attention to idiomatic expressions more evenly across layers. Models fine-tuned on syntactic tasks show an increase in attention to MSUs in the lower layers, corresponding with syntactic processing requirements.

@inproceedings{zaitova-etal-2025-attention,
title = {Attention on Multiword Expressions: A Multilingual Study of BERT-based Models with Regard to Idiomaticity and Microsyntax},
author = {Iuliia Zaitova and Vitalii Hirak and Badr M. Abdullah and Dietrich Klakow and Bernd M{\"o}bius and Tania Avgustinova},
editor = {Luis Chiruzzo and Alan Ritter and Lu Wang},
url = {https://aclanthology.org/2025.findings-naacl.228/},
year = {2025},
date = {2025},
booktitle = {Findings of the Association for Computational Linguistics: NAACL 2025},
isbn = {979-8-89176-195-7},
pages = {4083-4092},
publisher = {Association for Computational Linguistics},
address = {Albuquerque, New Mexico},
abstract = {This study analyzes the attention patterns of fine-tuned encoder-only models based on the BERT architecture (BERT-based models) towards two distinct types of Multiword Expressions (MWEs): idioms and microsyntactic units (MSUs). Idioms present challenges in semantic non-compositionality, whereas MSUs demonstrate unconventional syntactic behavior that does not conform to standard grammatical categorizations. We aim to understand whether fine-tuning BERT-based models on specific tasks influences their attention to MWEs, and how this attention differs between semantic and syntactic tasks. We examine attention scores to MWEs in both pre-trained and fine-tuned BERT-based models. We utilize monolingual models and datasets in six Indo-European languages: English, German, Dutch, Polish, Russian, and Ukrainian. Our results show that fine-tuning significantly influences how models allocate attention to MWEs. Specifically, models fine-tuned on semantic tasks tend to distribute attention to idiomatic expressions more evenly across layers. Models fine-tuned on syntactic tasks show an increase in attention to MSUs in the lower layers, corresponding with syntactic processing requirements.},
pubstate = {published},
type = {inproceedings}
}

Project:   C4

Sabev, Mitko; Andreeva, Bistra; Möbius, Bernd; Yuen, Ivan; Ibrahim, Omnia

The effects of lexical frequency on anticipatory voice assimilation in Bulgarian obstruents Inproceedings

Grawunder, Sven (Ed.): Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2025, TUDpress, pp. 163-169, Dresden, 2025, ISBN 978-3-95908-803-9, ISSN 0940-6832.

This study investigates the relation between the surprisal (or unpredictability) of linguistic items and anticipatory voicing assimilation in Bulgarian obstruents. Using a corpus of speech read by 140 Bulgarian speakers and word-level language models, we calculated unigram surprisal for word forms ending in obstruents followed by a word-initial obstruent of the opposite underlying [±voice] specification. Percentage of voicing was computed for 9,712 word-final obstruents. Linear mixed models were used to determine the effect of surprisal on the percentage of voicing in assimilating obstruents. The results confirm that Bulgarian obstruents do indeed in general assimilate to the voicing of a following obstruent: voiceless obstruents become voiced before voiced ones, while voiced obstruents are devoiced before voiceless ones. Crucially, however, surprisal had a significant effect on the percentage of voicing found in assimilating obstruents: in words with higher surprisal values, we found significantly lower degrees of voicing in voiceless obstruents before voiced ones, as well as significantly less devoicing of voiced obstruents before voiceless ones. This shows that assimilation is stronger in low-surprisal words, while in high-surprisal words speakers attempt to maintain the underlying [±voice] specification of an obstruent to a higher degree. Our findings add to a growing body of research that demonstrates that processes once thought of as entirely categorical in fact exhibit gradient variation in fine phonetic detail, which is attributable to speakers’ awareness of statistical patterns in language use and their response to the predictability of linguistic items in maintaining a balance between phonetic encoding and information density.

@inproceedings{sabev_etal_essv2025,
title = {The effects of lexical frequency on anticipatory voice assimilation in Bulgarian obstruents},
author = {Mitko Sabev and Bistra Andreeva and Bernd M{\"o}bius and Ivan Yuen and Omnia Ibrahim},
editor = {Sven Grawunder},
url = {https://www.essv.de/paper.php?id=1249},
year = {2025},
date = {2025},
booktitle = {Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2025},
isbn = {978-3-95908-803-9},
issn = {0940-6832},
pages = {163-169},
publisher = {TUDpress},
address = {Dresden},
abstract = {This study investigates the relation between the surprisal (or unpredictability) of linguistic items and anticipatory voicing assimilation in Bulgarian obstruents. Using a corpus of speech read by 140 Bulgarian speakers and word-level language models, we calculated unigram surprisal for word forms ending in obstruents followed by a word-initial obstruent of the opposite underlying [±voice] specification. Percentage of voicing was computed for 9,712 word-final obstruents. Linear mixed models were used to determine the effect of surprisal on the percentage of voicing in assimilating obstruents. The results confirm that Bulgarian obstruents do indeed in general assimilate to the voicing of a following obstruent: voiceless obstruents become voiced before voiced ones, while voiced obstruents are devoiced before voiceless ones. Crucially, however, surprisal had a significant effect on the percentage of voicing found in assimilating obstruents: in words with higher surprisal values, we found significantly lower degrees of voicing in voiceless obstruents before voiced ones, as well as significantly less devoicing of voiced obstruents before voiceless ones. This shows that assimilation is stronger in low-surprisal words, while in high-surprisal words speakers attempt to maintain the underlying [±voice] specification of an obstruent to a higher degree. Our findings add to a growing body of research that demonstrates that processes once thought of as entirely categorical in fact exhibit gradient variation in fine phonetic detail, which is attributable to speakers’ awareness of statistical patterns in language use and their response to the predictability of linguistic items in maintaining a balance between phonetic encoding and information density.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Andreeva, Bistra; Sabev, Mitko

Ефектът На Лексикалната Честотност И Типа Морфема Върху Регресивната Асимилация На Българските Обструенти По Признака Звучност (The Effects Of Lexical Frequency And Morpheme Type On Anticipatory Voice Assimilation In Bulgarian Obstruents) Journal Article

БЪЛГАРСКИ ЕЗИК, ПРИЛОЖЕНИЕ / BULGARIAN LANGUAGE, SUPPLEMENT, 72, pp. 357–369, 2025, ISSN 2603-3372.

Настоящото изследване разглежда връзката между честотността (предсказуемостта) на езиковите единици и регресивната асимилация на обструентите по признака звучност в българския език. Въз основа на данните от речеви корпус и езикови модели на ниво словоформа изчисляваме стойността на изненадата на думи с краесловни съгласни /t/, /ʃ/ и /x/ в състава на формообразуващи и словообразуващи морфеми. Чрез смесен линеен модел (LMM) анализираме как предсказуемостта, звучността на следващия обструент и типът морфема влияят върху реализираната звучност на съгласните. Резултатите показват, че езиковата предсказуемост влияе върху прецизността на артикулацията, като модулира степента на регресивната асимилация в зависимост от морфологичния контекст. Установяваме, че обструентите в по-малко предсказуемите думи се произнасят с по-прецизна артикулация, отколкото обструентите в по-предсказуемите думи, особено в състава на словообразуващи морфеми. Това подчертава комплексната динамика между информационната плътност, морфологията и артикулационния процес, което може да даде основа за по-нататъшни изследвания в областта на фонетиката и фонологията.


This study investigates the relationship between the frequency (or predictability) of linguistic units and the anticipatory voice assimilation of obstruents in Bulgarian. Using a speech corpus and word-level language models, we calculate the surprisal of word forms ending in the consonants /t/, /ʃ/, and /x/ in inflectional and lexical morphemes. We then employ a linear mixed model (LMM) to analyse how surprisal, the voicing of the following obstruent, and morpheme type affect the voicing realised in the examined consonants. The findings demonstrate that linguistic predictability affects articulatory precision by modulating the degree of anticipatory assimilation, with different effect sizes for different morpheme types. More specifically, obstruents in less predictable words are articulated with greater precision than those in more predictable words, especially within lexical morphemes. This reveals a complex interaction between information density, morphology and articulation, providing avenues for further research in phonetics and phonology.

@article{AndreevaSabev2025,
title = {Ефектът На Лексикалната Честотност И Типа Морфема Върху Регресивната Асимилация На Българските Обструенти По Признака Звучност (The Effects Of Lexical Frequency And Morpheme Type On Anticipatory Voice Assimilation In Bulgarian Obstruents)},
author = {Bistra Andreeva and Mitko Sabev},
url = {https://www.balgarskiezik.eu/p-2025/0_0_25_BISTRA%20ANDREEVA,%20MITKO%20SABEV_357-369_BG.pdf},
year = {2025},
date = {2025},
journal = {БЪЛГАРСКИ ЕЗИК, ПРИЛОЖЕНИЕ / BULGARIAN LANGUAGE, SUPPLEMENT},
pages = {357-369},
volume = {72},
abstract = {Настоящото изследване разглежда връзката между честотността (предсказуемостта) на езиковите единици и регресивната асимилация на обструентите по признака звучност в българския език. Въз основа на данните от речеви корпус и езикови модели на ниво словоформа изчисляваме стойността на изненадата на думи с краесловни съгласни /t/, /ʃ/ и /x/ в състава на формообразуващи и словообразуващи морфеми. Чрез смесен линеен модел (LMM) анализираме как предсказуемостта, звучността на следващия обструент и типът морфема влияят върху реализираната звучност на съгласните. Резултатите показват, че езиковата предсказуемост влияе върху прецизността на артикулацията, като модулира степента на регресивната асимилация в зависимост от морфологичния контекст. Установяваме, че обструентите в по-малко предсказуемите думи се произнасят с по-прецизна артикулация, отколкото обструентите в по-предсказуемите думи, особено в състава на словообразуващи морфеми. Това подчертава комплексната динамика между информационната плътност, морфологията и артикулационния процес, което може да даде основа за по-нататъшни изследвания в областта на фонетиката и фонологията.


This study investigates the relationship between the frequency (or predictability) of linguistic units and the anticipatory voice assimilation of obstruents in Bulgarian. Using a speech corpus and word-level language models, we calculate the surprisal of word forms ending in the consonants /t/, /ʃ/, and /x/ in inflectional and lexical morphemes. We then employ a linear mixed model (LMM) to analyse how surprisal, the voicing of the following obstruent, and morpheme type affect the voicing realised in the examined consonants. The findings demonstrate that linguistic predictability affects articulatory precision by modulating the degree of anticipatory assimilation, with different effect sizes for different morpheme types. More specifically, obstruents in less predictable words are articulated with greater precision than those in more predictable words, especially within lexical morphemes. This reveals a complex interaction between information density, morphology and articulation, providing avenues for further research in phonetics and phonology.},
pubstate = {published},
type = {article}
}

Project:   C1

Alves, Diego

Information Theory and Linguistic Variation: A Study of Brazilian and European Portuguese Inproceedings

Scherrer, Yves; Jauhiainen, Tommi; Ljubešić, Nikola; Nakov, Preslav; Tiedemann, Jörg; Zampieri, Marcos (Ed.): Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Association for Computational Linguistics, pp. 9-19, Abu Dhabi, UAE, 2025.

We present a general analysis of the lexical and grammatical differences between Brazilian and European Portuguese by applying entropy measures, including Kullback-Leibler divergence and word order entropy, across various linguistic levels. Using a parallel corpus of BP and EP sentences translated from English, we quantified these differences and identified characteristic phenomena underlying the divergences between the two varieties. The highest divergence was observed at the lexical level due to word pairs unique to each variety but also related to grammatical distinctions. Furthermore, the analysis of parts-of-speech (POS), dependency relations, and POS tri-grams provided information concerning distinctive grammatical constructions. Finally, the word order entropy analysis revealed that while most of the syntactic features analysed showed similar patterns across BP and EP, specific word order preferences were still apparent.

@inproceedings{alves-2025-information,
title = {Information Theory and Linguistic Variation: A Study of Brazilian and European Portuguese},
author = {Diego Alves},
editor = {Yves Scherrer and Tommi Jauhiainen and Nikola Ljube{\v{s}}i{\'c} and Preslav Nakov and J{\"o}rg Tiedemann and Marcos Zampieri},
url = {https://aclanthology.org/2025.vardial-1.2/},
year = {2025},
date = {2025},
booktitle = {Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects},
pages = {9-19},
publisher = {Association for Computational Linguistics},
address = {Abu Dhabi, UAE},
abstract = {We present a general analysis of the lexical and grammatical differences between Brazilian and European Portuguese by applying entropy measures, including Kullback-Leibler divergence and word order entropy, across various linguistic levels. Using a parallel corpus of BP and EP sentences translated from English, we quantified these differences and identified characteristic phenomena underlying the divergences between the two varieties. The highest divergence was observed at the lexical level due to word pairs unique to each variety but also related to grammatical distinctions. Furthermore, the analysis of parts-of-speech (POS), dependency relations, and POS tri-grams provided information concerning distinctive grammatical constructions. Finally, the word order entropy analysis revealed that while most of the syntactic features analysed showed similar patterns across BP and EP, specific word order preferences were still apparent.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Delogu, Francesca; Aurnhammer, Christoph; Brouwer, Harm; Crocker, Matthew W.

On the biphasic nature of the N400-P600 complex underlying language comprehension Journal Article Forthcoming

Brain & Cognition, 2025.

@article{Delogu-etal-2025,
title = {On the biphasic nature of the N400-P600 complex underlying language comprehension},
author = {Francesca Delogu and Christoph Aurnhammer and Harm Brouwer and Matthew W. Crocker},
year = {2025},
date = {2025},
journal = {Brain \& Cognition},
pubstate = {forthcoming},
type = {article}
}

Project:   A1

Yung, Frances Pik Yu; Demberg, Vera

On Crowdsourcing Task Design for Discourse Relation Annotation Inproceedings

Roth, Michael; Schlechtweg, Dominik (Ed.): Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation, International Committee on Computational Linguistics, pp. 12-19, Abu Dhabi, UAE, 2025.

Interpreting implicit discourse relations involves complex reasoning, requiring the integration of semantic cues with background knowledge, as overt connectives like “because” or “then” are absent. These relations often allow multiple interpretations, best represented as distributions. In this study, we compare two established methods that crowdsource implicit discourse relation annotation by connective insertion: a free-choice approach, which allows annotators to select any suitable connective, and a forced-choice approach, which asks them to select among a set of predefined options. Specifically, we re-annotate the whole DiscoGeM 1.0 corpus – initially annotated with the free-choice method – using the forced-choice approach. The free-choice approach allows for flexible and intuitive insertion of various connectives, which are context-dependent. Comparison among over 130,000 annotations, however, shows that the free-choice strategy produces less diverse annotations, often converging on common labels. Analysis of the results reveals the interplay between task design and the annotators’ abilities to interpret and produce discourse relations.

@inproceedings{yung-demberg-2025-crowdsourcing,
title = {On Crowdsourcing Task Design for Discourse Relation Annotation},
author = {Frances Pik Yu Yung and Vera Demberg},
editor = {Michael Roth and Dominik Schlechtweg},
url = {https://aclanthology.org/2025.comedi-1.2/},
year = {2025},
date = {2025},
booktitle = {Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation},
pages = {12-19},
publisher = {International Committee on Computational Linguistics},
address = {Abu Dhabi, UAE},
abstract = {Interpreting implicit discourse relations involves complex reasoning, requiring the integration of semantic cues with background knowledge, as overt connectives like “because” or “then” are absent. These relations often allow multiple interpretations, best represented as distributions. In this study, we compare two established methods that crowdsource implicit discourse relation annotation by connective insertion: a free-choice approach, which allows annotators to select any suitable connective, and a forced-choice approach, which asks them to select among a set of predefined options. Specifically, we re-annotate the whole DiscoGeM 1.0 corpus - initially annotated with the free-choice method - using the forced-choice approach. The free-choice approach allows for flexible and intuitive insertion of various connectives, which are context-dependent. Comparison among over 130,000 annotations, however, shows that the free-choice strategy produces less diverse annotations, often converging on common labels. Analysis of the results reveals the interplay between task design and the annotators’ abilities to interpret and produce discourse relations.},
pubstate = {published},
type = {inproceedings}
}

Project:   B2
