Publications

Sabev, Mitko; Andreeva, Bistra; Möbius, Bernd; Yuen, Ivan; Ibrahim, Omnia

The effects of lexical frequency on anticipatory voice assimilation in Bulgarian obstruents Inproceedings

Grawunder, Sven (Ed.): Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2025, TUDpress, pp. 163-169, Dresden, 2025, ISBN 978-3-95908-803-9, ISSN 0940-6832.

This study investigates the relation between the surprisal (or unpredictability) of linguistic items and anticipatory voicing assimilation in Bulgarian obstruents. Using a corpus of speech read by 140 Bulgarian speakers and word-level language models, we calculated unigram surprisal for word forms ending in obstruents followed by a word-initial obstruent of the opposite underlying [±voice] specification. Percentage of voicing was computed for 9,712 word-final obstruents. Linear mixed models were used to determine the effect of surprisal on the percentage of voicing in assimilating obstruents. The results confirm that Bulgarian obstruents do indeed in general assimilate to the voicing of a following obstruent: voiceless obstruents become voiced before voiced ones, while voiced obstruents are devoiced before voiceless ones. Crucially, however, surprisal had a significant effect on the percentage of voicing found in assimilating obstruents: in words with higher surprisal values, we found significantly lower degrees of voicing in voiceless obstruents before voiced ones, as well as significantly less devoicing of voiced obstruents before voiceless ones. This shows that assimilation is stronger in low-surprisal words, while in high-surprisal words speakers attempt to maintain the underlying [±voice] specification of an obstruent to a higher degree. Our findings add to a growing body of research that demonstrates that processes once thought of as entirely categorical in fact exhibit gradient variation in fine phonetic detail, which is attributable to speakers’ awareness of statistical patterns in language use and their response to the predictability of linguistic items in maintaining a balance between phonetic encoding and information density.

@inproceedings{sabev_etal_essv2025,
title = {The effects of lexical frequency on anticipatory voice assimilation in Bulgarian obstruents},
author = {Mitko Sabev and Bistra Andreeva and Bernd M{\"o}bius and Ivan Yuen and Omnia Ibrahim},
editor = {Sven Grawunder},
url = {https://www.essv.de/paper.php?id=1249},
year = {2025},
date = {2025},
booktitle = {Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2025},
isbn = {978-3-95908-803-9},
issn = {0940-6832},
pages = {163--169},
publisher = {TUDpress},
address = {Dresden},
abstract = {This study investigates the relation between the surprisal (or unpredictability) of linguistic items and anticipatory voicing assimilation in Bulgarian obstruents. Using a corpus of speech read by 140 Bulgarian speakers and word-level language models, we calculated unigram surprisal for word forms ending in obstruents followed by a word-initial obstruent of the opposite underlying [±voice] specification. Percentage of voicing was computed for 9,712 word-final obstruents. Linear mixed models were used to determine the effect of surprisal on the percentage of voicing in assimilating obstruents. The results confirm that Bulgarian obstruents do indeed in general assimilate to the voicing of a following obstruent: voiceless obstruents become voiced before voiced ones, while voiced obstruents are devoiced before voiceless ones. Crucially, however, surprisal had a significant effect on the percentage of voicing found in assimilating obstruents: in words with higher surprisal values, we found significantly lower degrees of voicing in voiceless obstruents before voiced ones, as well as significantly less devoicing of voiced obstruents before voiceless ones. This shows that assimilation is stronger in low-surprisal words, while in high-surprisal words speakers attempt to maintain the underlying [±voice] specification of an obstruent to a higher degree. Our findings add to a growing body of research that demonstrates that processes once thought of as entirely categorical in fact exhibit gradient variation in fine phonetic detail, which is attributable to speakers’ awareness of statistical patterns in language use and their response to the predictability of linguistic items in maintaining a balance between phonetic encoding and information density.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Andreeva, Bistra; Sabev, Mitko

Ефектът На Лексикалната Честотност И Типа Морфема Върху Регресивната Асимилация На Българските Обструенти По Признака Звучност (The Effects Of Lexical Frequency And Morpheme Type On Anticipatory Voice Assimilation In Bulgarian Obstruents) Journal Article

БЪЛГАРСКИ ЕЗИК, ПРИЛОЖЕНИЕ / BULGARIAN LANGUAGE, SUPPLEMENT, 72, pp. 357–369, 2025, ISSN 2603-3372.

Настоящото изследване разглежда връзката между честотността (предсказуемостта) на езиковите единици и регресивната асимилация на обструентите по признака звучност в българския език. Въз основа на данните от речеви корпус и езикови модели на ниво словоформа изчисляваме стойността на изненадата на думи с краесловни съгласни /t/, /ʃ/ и /x/ в състава на формообразуващи и словообразуващи морфеми. Чрез смесен линеен модел (LMM) анализираме как предсказуемостта, звучността на следващия обструент и типът морфема влияят върху реализираната звучност на съгласните. Резултатите показват, че езиковата предсказуемост влияе върху прецизността на артикулацията, като модулира степента на регресивната асимилация в зависимост от морфологичния контекст. Установяваме, че обструентите в по-малко предсказуемите думи се произнасят с по-прецизна артикулация, отколкото обструентите в по-предсказуемите думи, особено в състава на словообразуващи морфеми. Това подчертава комплексната динамика между информационната плътност, морфологията и артикулационния процес, което може да даде основа за по-нататъшни изследвания в областта на фонетиката и фонологията.


This study investigates the relationship between the frequency (or predictability) of linguistic units and the anticipatory voice assimilation of obstruents in Bulgarian. Using a speech corpus and word-level language models, we calculate the surprisal of word forms ending in the consonants /t/, /ʃ/, and /x/ in inflectional and lexical morphemes. We then employ a linear mixed model (LMM) to analyse how surprisal, the voicing of the following obstruent, and morpheme type affect the voicing realised in the examined consonants. The findings demonstrate that linguistic predictability affects articulatory precision by modulating the degree of anticipatory assimilation, with different effect sizes for different morpheme types. More specifically, obstruents in less predictable words are articulated with greater precision than those in more predictable words, especially within lexical morphemes. This reveals a complex interaction between information density, morphology and articulation, providing avenues for further research in phonetics and phonology.

@article{AndreevaSabev2025,
title = {Ефектът На Лексикалната Честотност И Типа Морфема Върху Регресивната Асимилация На Българските Обструенти По Признака Звучност (The Effects Of Lexical Frequency And Morpheme Type On Anticipatory Voice Assimilation In Bulgarian Obstruents)},
author = {Bistra Andreeva and Mitko Sabev},
url = {https://www.balgarskiezik.eu/p-2025/0_0_25_BISTRA%20ANDREEVA,%20MITKO%20SABEV_357-369_BG.pdf},
year = {2025},
date = {2025},
journal = {БЪЛГАРСКИ ЕЗИК, ПРИЛОЖЕНИЕ / BULGARIAN LANGUAGE, SUPPLEMENT},
pages = {357--369},
volume = {72},
abstract = {Настоящото изследване разглежда връзката между честотността (предсказуемостта) на езиковите единици и регресивната асимилация на обструентите по признака звучност в българския език. Въз основа на данните от речеви корпус и езикови модели на ниво словоформа изчисляваме стойността на изненадата на думи с краесловни съгласни /t/, /ʃ/ и /x/ в състава на формообразуващи и словообразуващи морфеми. Чрез смесен линеен модел (LMM) анализираме как предсказуемостта, звучността на следващия обструент и типът морфема влияят върху реализираната звучност на съгласните. Резултатите показват, че езиковата предсказуемост влияе върху прецизността на артикулацията, като модулира степента на регресивната асимилация в зависимост от морфологичния контекст. Установяваме, че обструентите в по-малко предсказуемите думи се произнасят с по-прецизна артикулация, отколкото обструентите в по-предсказуемите думи, особено в състава на словообразуващи морфеми. Това подчертава комплексната динамика между информационната плътност, морфологията и артикулационния процес, което може да даде основа за по-нататъшни изследвания в областта на фонетиката и фонологията.


This study investigates the relationship between the frequency (or predictability) of linguistic units and the anticipatory voice assimilation of obstruents in Bulgarian. Using a speech corpus and word-level language models, we calculate the surprisal of word forms ending in the consonants /t/, /ʃ/, and /x/ in inflectional and lexical morphemes. We then employ a linear mixed model (LMM) to analyse how surprisal, the voicing of the following obstruent, and morpheme type affect the voicing realised in the examined consonants. The findings demonstrate that linguistic predictability affects articulatory precision by modulating the degree of anticipatory assimilation, with different effect sizes for different morpheme types. More specifically, obstruents in less predictable words are articulated with greater precision than those in more predictable words, especially within lexical morphemes. This reveals a complex interaction between information density, morphology and articulation, providing avenues for further research in phonetics and phonology.},
pubstate = {published},
type = {article}
}

Project:   C1

Sabev, Mitko; Andreeva, Bistra; Möbius, Bernd; Yuen, Ivan; Ibrahim, Omnia

The effect of lexical frequency on anticipatory voicing assimilation in Bulgarian obstruents Inproceedings Forthcoming

Proc. 36th Conf. Elektronische Sprachsignalverarbeitung (ESSV '25), Halle/Saale, 2025.

@inproceedings{Sabev-etal-2025,
title = {The effect of lexical frequency on anticipatory voicing assimilation in Bulgarian obstruents},
author = {Mitko Sabev and Bistra Andreeva and Bernd M{\"o}bius and Ivan Yuen and Omnia Ibrahim},
year = {2025},
date = {2025},
booktitle = {Proc. 36th Conf. Elektronische Sprachsignalverarbeitung (ESSV '25)},
address = {Halle/Saale},
pubstate = {forthcoming},
type = {inproceedings}
}

Project:   C1

Alves, Diego

Information Theory and Linguistic Variation: A Study of Brazilian and European Portuguese Inproceedings

Scherrer, Yves; Jauhiainen, Tommi; Ljubešić, Nikola; Nakov, Preslav; Tiedemann, Jörg; Zampieri, Marcos (Ed.): Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Association for Computational Linguistics, pp. 9-19, Abu Dhabi, UAE, 2025.

We present a general analysis of the lexical and grammatical differences between Brazilian and European Portuguese by applying entropy measures, including Kullback-Leibler divergence and word order entropy, across various linguistic levels. Using a parallel corpus of BP and EP sentences translated from English, we quantified these differences and identified characteristic phenomena underlying the divergences between the two varieties. The highest divergence was observed at the lexical level due to word pairs unique to each variety but also related to grammatical distinctions. Furthermore, the analysis of parts-of-speech (POS), dependency relations, and POS tri-grams provided information concerning distinctive grammatical constructions. Finally, the word order entropy analysis revealed that while most of the syntactic features analysed showed similar patterns across BP and EP, specific word order preferences were still apparent.

@inproceedings{alves-2025-information,
title = {Information Theory and Linguistic Variation: A Study of Brazilian and European Portuguese},
author = {Diego Alves},
editor = {Yves Scherrer and Tommi Jauhiainen and Nikola Ljubešić and Preslav Nakov and J{\"o}rg Tiedemann and Marcos Zampieri},
url = {https://aclanthology.org/2025.vardial-1.2/},
year = {2025},
date = {2025},
booktitle = {Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects},
pages = {9--19},
publisher = {Association for Computational Linguistics},
address = {Abu Dhabi, UAE},
abstract = {We present a general analysis of the lexical and grammatical differences between Brazilian and European Portuguese by applying entropy measures, including Kullback-Leibler divergence and word order entropy, across various linguistic levels. Using a parallel corpus of BP and EP sentences translated from English, we quantified these differences and identified characteristic phenomena underlying the divergences between the two varieties. The highest divergence was observed at the lexical level due to word pairs unique to each variety but also related to grammatical distinctions. Furthermore, the analysis of parts-of-speech (POS), dependency relations, and POS tri-grams provided information concerning distinctive grammatical constructions. Finally, the word order entropy analysis revealed that while most of the syntactic features analysed showed similar patterns across BP and EP, specific word order preferences were still apparent.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Delogu, Francesca; Aurnhammer, Christoph; Brouwer, Harm; Crocker, Matthew W.

On the biphasic nature of the N400-P600 complex underlying language comprehension Journal Article Forthcoming

Brain & Cognition, 2025.

@article{Delogu-etal-2025,
title = {On the biphasic nature of the N400-P600 complex underlying language comprehension},
author = {Francesca Delogu and Christoph Aurnhammer and Harm Brouwer and Matthew W. Crocker},
year = {2025},
date = {2025},
journal = {Brain \& Cognition},
pubstate = {forthcoming},
type = {article}
}

Project:   A1

Yung, Frances Pik Yu; Demberg, Vera

On Crowdsourcing Task Design for Discourse Relation Annotation Inproceedings

Roth, Michael; Schlechtweg, Dominik (Ed.): Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation, International Committee on Computational Linguistics, pp. 12-19, Abu Dhabi, UAE, 2025.

Interpreting implicit discourse relations involves complex reasoning, requiring the integration of semantic cues with background knowledge, as overt connectives like “because” or “then” are absent. These relations often allow multiple interpretations, best represented as distributions. In this study, we compare two established methods that crowdsource implicit discourse relation annotation by connective insertion: a free-choice approach, which allows annotators to select any suitable connective, and a forced-choice approach, which asks them to select among a set of predefined options. Specifically, we re-annotate the whole DiscoGeM 1.0 corpus – initially annotated with the free-choice method – using the forced-choice approach. The free-choice approach allows for flexible and intuitive insertion of various connectives, which are context-dependent. Comparison among over 130,000 annotations, however, shows that the free-choice strategy produces less diverse annotations, often converging on common labels. Analysis of the results reveals the interplay between task design and the annotators’ abilities to interpret and produce discourse relations.

@inproceedings{yung-demberg-2025-crowdsourcing,
title = {On Crowdsourcing Task Design for Discourse Relation Annotation},
author = {Frances Pik Yu Yung and Vera Demberg},
editor = {Michael Roth and Dominik Schlechtweg},
url = {https://aclanthology.org/2025.comedi-1.2/},
year = {2025},
date = {2025},
booktitle = {Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation},
pages = {12--19},
publisher = {International Committee on Computational Linguistics},
address = {Abu Dhabi, UAE},
abstract = {Interpreting implicit discourse relations involves complex reasoning, requiring the integration of semantic cues with background knowledge, as overt connectives like “because” or “then” are absent. These relations often allow multiple interpretations, best represented as distributions. In this study, we compare two established methods that crowdsource implicit discourse relation annotation by connective insertion: a free-choice approach, which allows annotators to select any suitable connective, and a forced-choice approach, which asks them to select among a set of predefined options. Specifically, we re-annotate the whole DiscoGeM 1.0 corpus - initially annotated with the free-choice method - using the forced-choice approach. The free-choice approach allows for flexible and intuitive insertion of various connectives, which are context-dependent. Comparison among over 130,000 annotations, however, shows that the free-choice strategy produces less diverse annotations, often converging on common labels. Analysis of the results reveals the interplay between task design and the annotators’ abilities to interpret and produce discourse relations.},
pubstate = {published},
type = {inproceedings}
}

Project:   B2

Häuser, Katja; Borovsky, Arielle

Got it right up front? Further evidence for parallel graded prediction during prenominal article processing in a self-paced reading study Journal Article

Glossa Psycholinguistics, 4, 2025.

Recent studies suggest that language users generate and maintain multiple predictions in parallel, especially in tasks that explicitly instruct participants to generate predictions. Here, we investigated the possibility of parallel gradedness of linguistic predictions in a simple reading task, using a new measure—imbalance—that captures the probabilistic difference between multiple sentence completions. We focus on prenominal gender-marked articles in German to obtain prediction-specific effects. Native speakers of German read predictable or unpredictable gender-marked nouns that were preceded by prediction-consistent or -inconsistent prenominal articles. Sentence frames either biased expectations more strongly toward the most likely continuation of the sentence, or balanced expectations between the first and second most likely continuation. The results showed reading facilitation for gender-marked articles when sentences were more biased but slowing when sentences were more balanced, irrespective of article predictability. We conclude that readers issue multiple prenominal predictions and weigh them according to their likelihood, providing evidence for parallel gradedness of prenominal predictions. The results are discussed in light of theoretical models on prediction and rational sentence processing.

@article{HäuserBorovsky2025,
title = {Got it right up front? Further evidence for parallel graded prediction during prenominal article processing in a self-paced reading study},
author = {Katja H{\"a}user and Arielle Borovsky},
url = {https://escholarship.org/uc/item/7g30m0th},
doi = {10.5070/G6011.1636},
year = {2025},
date = {2025},
journal = {Glossa Psycholinguistics},
volume = {4},
number = {1},
abstract = {Recent studies suggest that language users generate and maintain multiple predictions in parallel, especially in tasks that explicitly instruct participants to generate predictions. Here, we investigated the possibility of parallel gradedness of linguistic predictions in a simple reading task, using a new measure—imbalance—that captures the probabilistic difference between multiple sentence completions. We focus on prenominal gender-marked articles in German to obtain prediction-specific effects. Native speakers of German read predictable or unpredictable gender-marked nouns that were preceded by prediction-consistent or -inconsistent prenominal articles. Sentence frames either biased expectations more strongly toward the most likely continuation of the sentence, or balanced expectations between the first and second most likely continuation. The results showed reading facilitation for gender-marked articles when sentences were more biased but slowing when sentences were more balanced, irrespective of article predictability. We conclude that readers issue multiple prenominal predictions and weigh them according to their likelihood, providing evidence for parallel gradedness of prenominal predictions. The results are discussed in light of theoretical models on prediction and rational sentence processing.},
pubstate = {published},
type = {article}
}

Project:   A5

Thillainathan, Sarubi; Koller, Alexander

Controllable Text Adaptation Using In-context Learning with Linguistic Features Inproceedings

AAAI2025 AI for Education - Tools, Opportunities, and Risks in the Generative AI Era, 2025.

The diversity in readers’ cognitive abilities, including working memory capacity and prior knowledge, necessitates texts that align with individual comprehension levels. We address the challenge of rewriting text to match readers’ unique needs, approximating readers to specific grade levels. Unlike prior approaches that rely on fine-tuned models and large training datasets, our method leverages in-context learning (ICL), making it effective in data-sparse scenarios. By precisely controlling linguistic features such as syntactic depth, our approach delivers tailored rewrites aligned with specific grade levels. We demonstrate state-of-the-art performance in generating grade-specific adaptations, highlighting the potential of ICL-based methods to enhance text accessibility and inclusivity.

@inproceedings{Thillainathan2025Controllable,
title = {Controllable Text Adaptation Using In-context Learning with Linguistic Features},
author = {Sarubi Thillainathan and Alexander Koller},
url = {https://ai4ed.cc/workshops/aaai2025},
year = {2025},
date = {2025},
booktitle = {AAAI2025 AI for Education - Tools, Opportunities, and Risks in the Generative AI Era},
abstract = {The diversity in readers’ cognitive abilities, including working memory capacity and prior knowledge, necessitates texts that align with individual comprehension levels. We address the challenge of rewriting text to match readers’ unique needs, approximating readers to specific grade levels. Unlike prior approaches that rely on fine-tuned models and large training datasets, our method leverages in-context learning (ICL), making it effective in data-sparse scenarios. By precisely controlling linguistic features such as syntactic depth, our approach delivers tailored rewrites aligned with specific grade levels. We demonstrate state-of-the-art performance in generating grade-specific adaptations, highlighting the potential of ICL-based methods to enhance text accessibility and inclusivity.},
pubstate = {published},
type = {inproceedings}
}

Project:   A8

Sentsova, Uliana; Ciminari, Debora; van Genabith, Josef; España-Bonet, Cristina

MultiCoPIE: A Multilingual Corpus of Potentially Idiomatic Expressions for Cross-lingual PIE Disambiguation Inproceedings Forthcoming

21st Workshop on Multiword Expressions (MWE 2025) @NAACL2025, Albuquerque, New Mexico, U.S.A., 2025.

Language models are able to handle compositionality and, to some extent, noncompositional phenomena such as semantic idiosyncrasy, a feature most prominent in the case of idioms. This work introduces the MultiCoPIE corpus that includes potentially idiomatic expressions in Catalan, Italian, and Russian, extending the language coverage of PIE corpus data. The new corpus provides additional linguistic features of idioms, such as their semantic compositionality, part-of-speech of idiom head as well as their corresponding idiomatic expressions in English. With this new resource at hand, we first fine-tune an XLM-RoBERTa model to classify figurative and literal usage of potentially idiomatic expressions in English. We then study cross-lingual transfer to the languages represented in the MultiCoPIE corpus, evaluating the model’s ability to generalize an idiom-related task to languages not seen during fine-tuning. We show the effect of ‘cross-lingual lexical overlap’: the performance of the model, fine-tuned on English idiomatic expressions and tested on the MultiCoPIE languages, increases significantly when classifying ‘shared idioms’— idiomatic expressions that have direct counterparts in English with similar form and meaning. While this observation raises questions about the generalizability of cross-lingual learning, the results from experiments on PIEs demonstrate strong evidence of effective cross-lingual transfer, even when accounting for idioms similar across languages.

@inproceedings{Sentsova-etal-2025,
title = {MultiCoPIE: A Multilingual Corpus of Potentially Idiomatic Expressions for Cross-lingual PIE Disambiguation},
author = {Uliana Sentsova and Debora Ciminari and Josef van Genabith and Cristina Espa{\~n}a-Bonet},
url = {https://multiword.org/mwe2025/},
year = {2025},
date = {2025},
booktitle = {21st Workshop on Multiword Expressions (MWE 2025) @NAACL2025},
address = {Albuquerque, New Mexico, U.S.A.},
abstract = {Language models are able to handle compositionality and, to some extent, noncompositional phenomena such as semantic idiosyncrasy, a feature most prominent in the case of idioms. This work introduces the MultiCoPIE corpus that includes potentially idiomatic expressions in Catalan, Italian, and Russian, extending the language coverage of PIE corpus data. The new corpus provides additional linguistic features of idioms, such as their semantic compositionality, part-of-speech of idiom head as well as their corresponding idiomatic expressions in English. With this new resource at hand, we first fine-tune an XLM-RoBERTa model to classify figurative and literal usage of potentially idiomatic expressions in English. We then study cross-lingual transfer to the languages represented in the MultiCoPIE corpus, evaluating the model’s ability to generalize an idiom-related task to languages not seen during fine-tuning. We show the effect of ‘cross-lingual lexical overlap’: the performance of the model, fine-tuned on English idiomatic expressions and tested on the MultiCoPIE languages, increases significantly when classifying ‘shared idioms’— idiomatic expressions that have direct counterparts in English with similar form and meaning. While this observation raises questions about the generalizability of cross-lingual learning, the results from experiments on PIEs demonstrate strong evidence of effective cross-lingual transfer, even when accounting for idioms similar across languages.},
pubstate = {forthcoming},
type = {inproceedings}
}

Project:   B6

Alves, Diego

Diachronic Analysis of Phrasal Verbs in English Scientific Writing Inproceedings

Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), University of Tartu Library, Tallinn, Estonia, 2025.

Phrasal verbs (PVs) are a specific type of multi-word expression and a specific feature of the English language. However, their usage in scientific prose is limited. Our study focuses on the analysis of phrasal verbs in the scientific domain using information theory methods to describe diachronic phenomena such as conventionalization and diversification regarding the usage of PVs. Thus, we analysed their developmental trajectory over time from the mid-17th century to the end of the 20th century by measuring the relative entropy (Kullback-Leibler divergence), predictability in context of the phrasal verb particles (surprisal), and the paradigmatic variability using word embedding spaces. We were able to identify interesting phenomena such as the process of conventionalization over the 20th century and the peaks of diversification throughout the centuries.

@inproceedings{Alves-2025,
title = {Diachronic Analysis of Phrasal Verbs in English Scientific Writing},
author = {Diego Alves},
url = {https://dspace.ut.ee/items/ef26bd7f-e708-41b3-b5c8-84cf8057ab71},
year = {2025},
date = {2025},
booktitle = {Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)},
publisher = {University of Tartu Library},
address = {Tallinn, Estonia},
abstract = {Phrasal verbs (PVs) are a specific type of multi-word expression and a specific feature of the English language. However, their usage in scientific prose is limited. Our study focuses on the analysis of phrasal verbs in the scientific domain using information theory methods to describe diachronic phenomena such as conventionalization and diversification regarding the usage of PVs. Thus, we analysed their developmental trajectory over time from the mid-17th century to the end of the 20th century by measuring the relative entropy (Kullback-Leibler divergence), predictability in context of the phrasal verb particles (surprisal), and the paradigmatic variability using word embedding spaces. We were able to identify interesting phenomena such as the process of conventionalization over the 20th century and the peaks of diversification throughout the centuries.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Kunilovskaya, Maria; Zaitova, Iuliia; Xue, Wei; Stenger, Irina; Avgustinova, Tania

Predictability of Microsyntactic Units across Slavic Languages: A translation-based Study Inproceedings

Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), University of Tartu Library, Tallinn, Estonia, 2025.

The paper presents the results of a free translation experiment, which was set up to explore Slavic cross-language intelligibility. In the experiment, native speakers of Russian were asked to read a sentence in one of the five Slavic languages and return a Russian translation of a highlighted item. The experiment is focused on microsyntactic units because they offer an increased intercomprehension difficulty due to opaque semantics. Each language is represented by at least 50 stimuli, and each stimulus has generated at least 20 responses. The levels of intercomprehension are captured by categorising participants’ responses into seven types of translation solutions (paraphrase, correct, fluent_literal, awkward_literal, fantasy, noise, and empty), generally reflecting the level of the cross-linguistic intelligibility of the stimuli. The study aims to reveal linguistic factors that favour intercomprehension across Slavic languages. We use regression and correlation analysis to identify the most important intercomprehension predictors and statistical analysis to bring up the most typical cases and outliers. We explore several feature types that reflect the properties of the translation tasks and their outcomes, including point-wise phonological and orthographic distances, cosine similarities, surprisals, translation quality scores and translation solution entropy indices. The experimental data confirms the expected gradual increase of intelligibility from West-Slavic to East-Slavic languages for the speakers of Russian. We show that intelligibility is highly contingent on the ability of speakers to recognise and interpret formal similarities between languages as well as on the size of these similarities. For several Slavic languages, the context sentence complexity was a significant predictor of intelligibility.

@inproceedings{Kunilovskaya-etal-2025,
title = {Predictability of Microsyntactic Units across Slavic Languages: A translation-based Study},
author = {Maria Kunilovskaya and Iuliia Zaitova and Wei Xue and Irina Stenger and Tania Avgustinova},
url = {https://dspace.ut.ee/items/26e26504-9379-42cf-8f85-361a04dcd114},
year = {2025},
date = {2025},
booktitle = {Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)},
publisher = {University of Tartu Library},
address = {Tallinn, Estonia},
abstract = {

The paper presents the results of a free translation experiment, which was set up to explore Slavic cross-language intelligibility. In the experiment, native speakers of Russian were asked to read a sentence in one of the five Slavic languages and return a Russian translation of a highlighted item. The experiment is focused on microsyntactic units because they offer an increased intercomprehension difficulty due to opaque semantics. Each language is represented by at least 50 stimuli, and each stimulus has generated at least 20 responses. The levels of intercomprehension are captured by categorising participants' responses into seven types of translation solutions (paraphrase, correct, fluent_literal, awkward_literal, fantasy, noise, and empty), generally reflecting the level of the cross-linguistic intelligibility of the stimuli. The study aims to reveal linguistic factors that favour intercomprehension across Slavic languages. We use regression and correlation analysis to identify the most important intercomprehension predictors and statistical analysis to bring up the most typical cases and outliers. We explore several feature types that reflect the properties of the translation tasks and their outcomes, including point-wise phonological and orthographic distances, cosine similarities, surprisals, translation quality scores and translation solution entropy indices. The experimental data confirms the expected gradual increase of intelligibility from West-Slavic to East-Slavic languages for the speakers of Russian. We show that intelligibility is highly contingent on the ability of speakers to recognise and interpret formal similarities between languages as well as on the size of these similarities. For several Slavic languages, the context sentence complexity was a significant predictor of intelligibility.
},
pubstate = {published},
type = {inproceedings}
}


Projects:   B7 C4

Lemke, Tyll Robin

Investigating fragment usage with a gamified utterance selection task Proceeding

Experiments in Linguistic Meaning, 3, pp. 447-459, 2025.
Nonsentential utterances, or fragments, like A coffee, please! can often be used to communicate a propositional meaning otherwise encoded by a complete sentence (I’d like to order a coffee, please!). Previous research focused mostly on the syntax and licensing of fragments, but the questions of why speakers use fragments and how listeners interpret them are still underexplored. I propose a simple game-theoretic account of fragment usage, which predicts that (i) listeners assign fragments the most likely interpretation in context and (ii) that speakers are aware of this and trade off production cost and the risk of being misunderstood when choosing their utterance. Using a corpus of production data, empirically founded and precise model predictions are generated. These predictions are evaluated with two experiments using a novel gamified utterance selection paradigm. The experiments suggest that, as predicted, speakers take into account both potential gain in efficiency and the risk of being misunderstood when choosing their utterance.

@proceeding{Lemke_2025,
title = {Investigating fragment usage with a gamified utterance selection task},
author = {Tyll Robin Lemke},
url = {https://journals.linguisticsociety.org/proceedings/index.php/ELM/article/view/5836},
doi = {https://doi.org/10.3765/elm.3.5836},
year = {2025},
date = {2025},
booktitle = {Experiments in Linguistic Meaning},
pages = {447-459},
abstract = {

Nonsentential utterances, or fragments, like A coffee, please! can often be used to communicate a propositional meaning otherwise encoded by a complete sentence (I'd like to order a coffee, please!). Previous research focused mostly on the syntax and licensing of fragments, but the questions of why speakers use fragments and how listeners interpret them are still underexplored. I propose a simple game-theoretic account of fragment usage, which predicts that (i) listeners assign fragments the most likely interpretation in context and (ii) that speakers are aware of this and trade off production cost and the risk of being misunderstood when choosing their utterance. Using a corpus of production data, empirically founded and precise model predictions are generated. These predictions are evaluated with two experiments using a novel gamified utterance selection paradigm. The experiments suggest that, as predicted, speakers take into account both potential gain in efficiency and the risk of being misunderstood when choosing their utterance.
},
pubstate = {published},
type = {proceeding}
}


Project:   B3

Häuser, Katja; Borovsky, Arielle

Got it right up front? Further evidence for parallel graded prediction during prenominal article processing in a self-paced reading study Journal Article

Glossa Psycholinguistics, 4, 2025.

Recent studies suggest that language users generate and maintain multiple predictions in parallel, especially in tasks that explicitly instruct participants to generate predictions. Here, we investigated the possibility of parallel gradedness of linguistic predictions in a simple reading task, using a new measure—imbalance—that captures the probabilistic difference between multiple sentence completions. We focus on prenominal gender-marked articles in German to obtain prediction-specific effects. Native speakers of German read predictable or unpredictable gender-marked nouns that were preceded by prediction-consistent or -inconsistent prenominal articles. Sentence frames either biased expectations more strongly toward the most likely continuation of the sentence, or balanced expectations between the first and second most likely continuation. The results showed reading facilitation for gender-marked articles when sentences were more biased but slowing when sentences were more balanced, irrespective of article predictability. We conclude that readers issue multiple prenominal predictions and weigh them according to their likelihood, providing evidence for parallel gradedness of prenominal predictions. The results are discussed in light of theoretical models on prediction and rational sentence processing.

@article{haeuser-borovsky-2025,
title = {Got it right up front? Further evidence for parallel graded prediction during prenominal article processing in a self-paced reading study},
author = {Katja H{\"a}user and Arielle Borovsky},
url = {https://escholarship.org/uc/item/7g30m0th},
doi = {https://doi.org/10.5070/G6011.1636},
year = {2025},
date = {2025},
journal = {Glossa Psycholinguistics},
volume = {4},
number = {1},
abstract = {Recent studies suggest that language users generate and maintain multiple predictions in parallel, especially in tasks that explicitly instruct participants to generate predictions. Here, we investigated the possibility of parallel gradedness of linguistic predictions in a simple reading task, using a new measure—imbalance—that captures the probabilistic difference between multiple sentence completions. We focus on prenominal gender-marked articles in German to obtain prediction-specific effects. Native speakers of German read predictable or unpredictable gender-marked nouns that were preceded by prediction-consistent or -inconsistent prenominal articles. Sentence frames either biased expectations more strongly toward the most likely continuation of the sentence, or balanced expectations between the first and second most likely continuation. The results showed reading facilitation for gender-marked articles when sentences were more biased but slowing when sentences were more balanced, irrespective of article predictability. We conclude that readers issue multiple prenominal predictions and weigh them according to their likelihood, providing evidence for parallel gradedness of prenominal predictions. The results are discussed in light of theoretical models on prediction and rational sentence processing.},
pubstate = {published},
type = {article}
}


Project:   A5

Talamo, Luigi

Introducing STAF: The Saarbrücken Treebank of Albanian Fiction Journal Article

Journal of Open Humanities Data, 11, pp. 1–6, 2025.

The present paper describes the building of STAF, a Universal Dependencies treebank for Albanian. STAF was bootstrapped using a Stanza model trained on previously unreleased data and then manually corrected by three Albanian speakers supervised by the author, who also revised all sentences. STAF focuses on the fiction genre, featuring 200 sentences selected from nine literary texts written by Albanian contemporary authors.

@article{Talamo-2025,
title = {Introducing STAF: The Saarbr{\"u}cken Treebank of Albanian Fiction},
author = {Luigi Talamo},
url = {https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.285},
doi = {https://doi.org/10.5334/johd.285},
year = {2025},
date = {2025},
journal = {Journal of Open Humanities Data},
pages = {1–6},
volume = {11},
number = {3},
abstract = {

The present paper describes the building of STAF, a Universal Dependencies treebank for Albanian. STAF was bootstrapped using a Stanza model trained on previously unreleased data and then manually corrected by three Albanian speakers supervised by the author, who also revised all sentences. STAF focuses on the fiction genre, featuring 200 sentences selected from nine literary texts written by Albanian contemporary authors.
},
pubstate = {published},
type = {article}
}


Project:   C7

Dyer, Andrew; Betul, Ruveyda; Rajestari, Maryam; Rouvalis, Andreas; Singhal, Aarushi; Stodolinska, Yuliya; Asma, Syahidah; Rodrigues, Helena

A Multilingual Parallel Corpus for Coreference Resolution and Information Status in the Literary Domain Inproceedings

Dakota, Daniel; Jablotschkin, Sarah; Kübler, Sandra; Zinsmeister, Heike (Ed.): Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024), Association for Computational Linguistics, pp. 55-64, Hamburg, Germany, 2024.

Information status, the newness or givenness of referents in discourse, is known to affect the production of language at many different levels. At the morphosyntactic level, information status gives rise to special word orders, elisions, and other phenomena that challenge the notion that morphosyntax can be considered independent of discourse context. Though there are many language-specific corpora annotated for information status and its related phenomena, coreference and anaphora resolution, what is not available at present is a cross-lingually consistently annotated corpus or annotation scheme that would allow for comparative study of these phenomena across many diverse languages. In this paper we present our work to build such a resource. We are annotating a parsed, parallel corpus of prose in many languages for information status and coreference resolution, so that like-for-like cross-lingual comparisons can be made at the intersection of discourse and syntax. Our corpus can and will be used both for corpus analysis and for model training.

@inproceedings{dyer-etal-2024-multilingual,
title = {A Multilingual Parallel Corpus for Coreference Resolution and Information Status in the Literary Domain},
author = {Andrew Dyer and Ruveyda Betul Bahceci and Maryam Rajestari and Andreas Rouvalis and Aarushi Singhal and Yuliya Stodolinska and Syahidah Asma Umniyati and Helena Rodrigues Menezes de Oliveira Vaz},
editor = {Daniel Dakota and Sarah Jablotschkin and Sandra K{\"u}bler and Heike Zinsmeister},
url = {https://aclanthology.org/2024.tlt-1.7/},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024)},
pages = {55-64},
publisher = {Association for Computational Linguistics},
address = {Hamburg, Germany},
abstract = {Information status, the newness or givenness of referents in discourse, is known to affect the production of language at many different levels. At the morphosyntactic level, information status gives rise to special word orders, elisions, and other phenomena that challenge the notion that morphosyntax can be considered independent of discourse context. Though there are many language-specific corpora annotated for information status and its related phenomena, coreference and anaphora resolution, what is not available at present is a cross-lingually consistently annotated corpus or annotation scheme that would allow for comparative study of these phenomena across many diverse languages. In this paper we present our work to build such a resource. We are annotating a parsed, parallel corpus of prose in many languages for information status and coreference resolution, so that like-for-like cross-lingual comparisons can be made at the intersection of discourse and syntax. Our corpus can and will be used both for corpus analysis and for model training.},
pubstate = {published},
type = {inproceedings}
}


Project:   C7

Talamo, Luigi; Verkerk, Annemarie; Salaberri, Iker

A quantitative approach to clause type and syntactic change in two Indo-European corpora Journal Article

Italian Journal of Linguistics, 36, pp. 53-82, 2024.

The aim of this paper is to empirically test the claim that subordinate clauses tend to preserve conservative features in language change. To this end, the diachronic behavior of two well-understood and frequently adduced features of grammar, namely null subject pronouns and order of subject, object and verb, is analyzed for main and adverbial clauses in a balanced corpus of 45 Indo-European languages. This study combines qualitative and quantitative analysis by drawing on individual descriptive grammars and parallel corpora respectively. Additionally, diachronic change is modeled using phylogenetic comparative methods. The data suggest that adverbial clauses can in some cases develop asymmetries with respect to their independent counterparts, either through innovation or through preservation of conservative features, possibly due to a communicative need to distinguish clause types by means of grammar. However, the general tendency is for adverbial clauses to change much in the same way as main clauses. This finding contradicts previous claims and calls for a reassessment of studies on the diachronic nature of distinct clause types.

@article{Talamo-etal-2024,
title = {A quantitative approach to clause type and syntactic change in two Indo-European corpora},
author = {Luigi Talamo and Annemarie Verkerk and Iker Salaberri},
url = {https://www.italian-journal-linguistics.com/current-issue/},
doi = {https://doi.org/10.26346/1120-2726-225},
year = {2024},
date = {2024},
journal = {Italian Journal of Linguistics},
pages = {53-82},
volume = {36},
number = {2},
abstract = {The aim of this paper is to empirically test the claim that subordinate clauses tend to preserve conservative features in language change. To this end, the diachronic behavior of two well-understood and frequently adduced features of grammar, namely null subject pronouns and order of subject, object and verb, is analyzed for main and adverbial clauses in a balanced corpus of 45 Indo-European languages. This study combines qualitative and quantitative analysis by drawing on individual descriptive grammars and parallel corpora respectively. Additionally, diachronic change is modeled using phylogenetic comparative methods. The data suggest that adverbial clauses can in some cases develop asymmetries with respect to their independent counterparts, either through innovation or through preservation of conservative features, possibly due to a communicative need to distinguish clause types by means of grammar. However, the general tendency is for adverbial clauses to change much in the same way as main clauses. This finding contradicts previous claims and calls for a reassessment of studies on the diachronic nature of distinct clause types.},
pubstate = {published},
type = {article}
}


Project:   C7

Menzel, Katrin

Noun + noun Compounds and Verbal Complements as Non-normalised Features in Late Modern English Scientific Translations Inproceedings

Proceedings of 7th Translation in Transition Conference, Batumi: Shota Rustaveli State University, 2024.

This paper presents a study on the usage of noun+noun compounds and verbal complement structures in 18th century scientific articles in the Royal Society Corpus (RSC) comparing translated to non-translated English texts. Departing from the hypothesis that the translations will conform stronger to traditional patterns of the English language, the analysis shows that these historical translations and non-translated texts are similarly marked by the ongoing reorganisation of the noun phrase, but translations contain more innovative complementation patterns. Additionally, a surprisal analysis shows that the analysed patterns tend to occur in more predictable and conventionalised contexts in non-translated texts than in translation.

 

@inproceedings{Menzel2024Noun,
title = {Noun + noun Compounds and Verbal Complements as Non-normalised Features in Late Modern English Scientific Translations},
author = {Katrin Menzel},
url = {https://sites.google.com/view/tt2024/schedule-and-proceedings},
year = {2024},
date = {2024-12-26},
booktitle = {Proceedings of 7th Translation in Transition Conference},
address = {Batumi: Shota Rustaveli State University},
abstract = {This paper presents a study on the usage of noun+noun compounds and verbal complement structures in 18th century scientific articles in the Royal Society Corpus (RSC) comparing translated to non-translated English texts. Departing from the hypothesis that the translations will conform stronger to traditional patterns of the English language, the analysis shows that these historical translations and non-translated texts are similarly marked by the ongoing reorganisation of the noun phrase, but translations contain more innovative complementation patterns. Additionally, a surprisal analysis shows that the analysed patterns tend to occur in more predictable and conventionalised contexts in non-translated texts than in translation.},
pubstate = {published},
type = {inproceedings}
}


Project:   B1

Menzel, Katrin

Initialisms in Scientific Writing in the 19th and Early 20th Centuries Journal Article

Zeitschrift für Wortbildung / Journal of Word Formation (ZWJW) (Special issue Historical English Word-Formation), 8, pp. 7-27, 2024.
This paper focusses on the role of initialisms in scientific English articles in the Royal Society Corpus (Fischer et al. 2020; Kermes et al. 2016). The development of scientific initialisms is illustrated with frequency data, a discussion of the evolution of the text topics obtained from topic modelling and an analysis of the development of information-theoretic surprisal values of initialisms in three time spans between 1830 and 1919. The overall frequency and diversity of initialisms for scientific concepts has risen considerably between 1830 and 1919 in the context of the ongoing specialisation of the sciences. Particularly from the 1860s onwards scientific initialisms increasingly become shortcuts for multiword units with wordhood and term status. The surprisal values of scientific initialisms decrease over time as such forms more regularly occur in conventionalised textual contexts and fixed expressions. Overall, the analysis of the RSC texts shows that key developments towards the conventionalisation of scientific initialisms as term formation patterns took place in the transitional period from Late Modern to Present-day English.

@article{Menzel2024,
title = {Initialisms in Scientific Writing in the 19th and Early 20th Centuries},
author = {Katrin Menzel},
url = {https://journals.linguistik.de/zwjw/article/view/108},
doi = {https://doi.org/10.21248/zwjw.2024.2.108},
year = {2024},
date = {2024},
journal = {Zeitschrift f{\"u}r Wortbildung / Journal of Word Formation (ZWJW) (Special issue Historical English Word-Formation)},
pages = {7-27},
volume = {8},
number = {2},
abstract = {

This paper focusses on the role of initialisms in scientific English articles in the Royal Society Corpus (Fischer et al. 2020; Kermes et al. 2016). The development of scientific initialisms is illustrated with frequency data, a discussion of the evolution of the text topics obtained from topic modelling and an analysis of the development of information-theoretic surprisal values of initialisms in three time spans between 1830 and 1919. The overall frequency and diversity of initialisms for scientific concepts has risen considerably between 1830 and 1919 in the context of the ongoing specialisation of the sciences. Particularly from the 1860s onwards scientific initialisms increasingly become shortcuts for multiword units with wordhood and term status. The surprisal values of scientific initialisms decrease over time as such forms more regularly occur in conventionalised textual contexts and fixed expressions. Overall, the analysis of the RSC texts shows that key developments towards the conventionalisation of scientific initialisms as term formation patterns took place in the transitional period from Late Modern to Present-day English.
},
pubstate = {published},
type = {article}
}


Project:   B1

Steuer, Julius; Krielke, Marie-Pauline; Fischer, Stefan; Degaetano-Ortlieb, Stefania; Mosbach, Marius; Klakow, Dietrich

Modeling Diachronic Change in English Scientific Writing over 300+ Years with Transformer-based Language Model Surprisal Inproceedings

Zweigenbaum, Pierre; Rapp, Reinhard; Sharoff, Serge (Ed.): Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024, ELRA and ICCL, pp. 12-23, Torino, Italia, 2024.

This study presents an analysis of diachronic linguistic changes in English scientific writing, utilizing surprisal from transformer-based language models. Unlike traditional n-gram models, transformer-based models are potentially better at capturing nuanced linguistic changes such as long-range dependencies by considering variable context sizes. However, to create diachronically comparable language models there are several challenges with historical data, notably an exponential increase in no. of texts, tokens per text and vocabulary size over time. We address these by using a shared vocabulary and employing a robust training strategy that includes initial uniform sampling from the corpus and continuing pre-training on specific temporal segments. Our empirical analysis highlights the predictive power of surprisal from transformer-based models, particularly in analyzing complex linguistic structures like relative clauses. The models’ broader contextual awareness and the inclusion of dependency length annotations contribute to a more intricate understanding of communicative efficiency. While our focus is on scientific English, our approach can be applied to other low-resource scenarios.

@inproceedings{steuer-etal-2024-modeling,
title = {Modeling Diachronic Change in English Scientific Writing over 300+ Years with Transformer-based Language Model Surprisal},
author = {Julius Steuer and Marie-Pauline Krielke and Stefan Fischer and Stefania Degaetano-Ortlieb and Marius Mosbach and Dietrich Klakow},
editor = {Pierre Zweigenbaum and Reinhard Rapp and Serge Sharoff},
url = {https://aclanthology.org/2024.bucc-1.2/},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024},
pages = {12-23},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {This study presents an analysis of diachronic linguistic changes in English scientific writing, utilizing surprisal from transformer-based language models. Unlike traditional n-gram models, transformer-based models are potentially better at capturing nuanced linguistic changes such as long-range dependencies by considering variable context sizes. However, to create diachronically comparable language models there are several challenges with historical data, notably an exponential increase in no. of texts, tokens per text and vocabulary size over time. We address these by using a shared vocabulary and employing a robust training strategy that includes initial uniform sampling from the corpus and continuing pre-training on specific temporal segments. Our empirical analysis highlights the predictive power of surprisal from transformer-based models, particularly in analyzing complex linguistic structures like relative clauses. The models’ broader contextual awareness and the inclusion of dependency length annotations contribute to a more intricate understanding of communicative efficiency. While our focus is on scientific English, our approach can be applied to other low-resource scenarios.},
pubstate = {published},
type = {inproceedings}
}


Projects:   B1 B4

Bagdasarov, Sergei; Teich, Elke

Multi-word expressions in biomedical abstracts and their plain English adaptations Inproceedings

Hämäläinen, Mika; Öhman, Emily; Miyagawa, So; Alnajjar, Khalid; Bizzoni, Yuri (Ed.): Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, Association for Computational Linguistics, pp. 483-488, Miami, USA, 2024.

This study analyzes the use of multi-word expressions (MWEs), prefabricated sequences of words (e.g. in this case, this means that, healthcare service, follow up) in biomedical abstracts and their plain language adaptations. While English academic writing became highly specialized and complex from the late 19th century onwards, recent decades have seen a rising demand for a lay-friendly language in scientific content, especially in the health domain, to bridge a communication gap between experts and laypersons. Based on previous research showing that MWEs are easier to process than non-formulaic word sequences of comparable length, we hypothesize that they can potentially be used to create a more reader-friendly language. Our preliminary results suggest some significant differences between complex and plain abstracts when it comes to the usage patterns and informational load of MWEs.

@inproceedings{bagdasarov-teich-2024-multi,
title = {Multi-word expressions in biomedical abstracts and their plain English adaptations},
author = {Sergei Bagdasarov and Elke Teich},
editor = {Mika H{\"a}m{\"a}l{\"a}inen and Emily {\"O}hman and So Miyagawa and Khalid Alnajjar and Yuri Bizzoni},
url = {https://aclanthology.org/2024.nlp4dh-1.46/},
doi = {https://doi.org/10.18653/v1/2024.nlp4dh-1.46},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities},
pages = {483-488},
publisher = {Association for Computational Linguistics},
address = {Miami, USA},
abstract = {This study analyzes the use of multi-word expressions (MWEs), prefabricated sequences of words (e.g. in this case, this means that, healthcare service, follow up) in biomedical abstracts and their plain language adaptations. While English academic writing became highly specialized and complex from the late 19th century onwards, recent decades have seen a rising demand for a lay-friendly language in scientific content, especially in the health domain, to bridge a communication gap between experts and laypersons. Based on previous research showing that MWEs are easier to process than non-formulaic word sequences of comparable length, we hypothesize that they can potentially be used to create a more reader-friendly language. Our preliminary results suggest some significant differences between complex and plain abstracts when it comes to the usage patterns and informational load of MWEs.},
pubstate = {published},
type = {inproceedings}
}


Project:   B1
