Publications

Thillainathan, Sarubi; Koller, Alexander

Controllable Text Adaptation Using In-context Learning with Linguistic Features Inproceedings

AAAI2025 AI for Education - Tools, Opportunities, and Risks in the Generative AI Era, 2025.

The diversity in readers’ cognitive abilities, including working memory capacity and prior knowledge, necessitates texts that align with individual comprehension levels. We address the challenge of rewriting text to match readers’ unique needs, approximating readers to specific grade levels. Unlike prior approaches that rely on fine-tuned models and large training datasets, our method leverages in-context learning (ICL), making it effective in data-sparse scenarios. By precisely controlling linguistic features such as syntactic depth, our approach delivers tailored rewrites aligned with specific grade levels. We demonstrate state-of-the-art performance in generating grade-specific adaptations, highlighting the potential of ICL-based methods to enhance text accessibility and inclusivity.

@inproceedings{Thillainathan2025Controllable,
title = {Controllable Text Adaptation Using In-context Learning with Linguistic Features},
author = {Sarubi Thillainathan and Alexander Koller},
url = {https://ai4ed.cc/workshops/aaai2025},
year = {2025},
date = {2025},
booktitle = {AAAI2025 AI for Education - Tools, Opportunities, and Risks in the Generative AI Era},
abstract = {The diversity in readers’ cognitive abilities, including working memory capacity and prior knowledge, necessitates texts that align with individual comprehension levels. We address the challenge of rewriting text to match readers’ unique needs, approximating readers to specific grade levels. Unlike prior approaches that rely on fine-tuned models and large training datasets, our method leverages in-context learning (ICL), making it effective in data-sparse scenarios. By precisely controlling linguistic features such as syntactic depth, our approach delivers tailored rewrites aligned with specific grade levels. We demonstrate state-of-the-art performance in generating grade-specific adaptations, highlighting the potential of ICL-based methods to enhance text accessibility and inclusivity.},
pubstate = {published},
type = {inproceedings}
}

Project:   A8

Sentsova, Uliana; Ciminari, Debora; van Genabith, Josef; España-Bonet, Cristina

MultiCoPIE: A Multilingual Corpus of Potentially Idiomatic Expressions for Cross-lingual PIE Disambiguation Inproceedings Forthcoming

21st Workshop on Multiword Expressions (MWE 2025) @NAACL2025, Albuquerque, New Mexico, U.S.A., 2025.

Language models are able to handle compositionality and, to some extent, noncompositional phenomena such as semantic idiosyncrasy, a feature most prominent in the case of idioms. This work introduces the MultiCoPIE corpus that includes potentially idiomatic expressions in Catalan, Italian, and Russian, extending the language coverage of PIE corpus data. The new corpus provides additional linguistic features of idioms, such as their semantic compositionality, part-of-speech of idiom head as well as their corresponding idiomatic expressions in English. With this new resource at hand, we first fine-tune an XLM-RoBERTa model to classify figurative and literal usage of potentially idiomatic expressions in English. We then study cross-lingual transfer to the languages represented in the MultiCoPIE corpus, evaluating the model’s ability to generalize an idiom-related task to languages not seen during fine-tuning. We show the effect of ‘cross-lingual lexical overlap’: the performance of the model, fine-tuned on English idiomatic expressions and tested on the MultiCoPIE languages, increases significantly when classifying ‘shared idioms’— idiomatic expressions that have direct counterparts in English with similar form and meaning. While this observation raises questions about the generalizability of cross-lingual learning, the results from experiments on PIEs demonstrate strong evidence of effective cross-lingual transfer, even when accounting for idioms similar across languages.

@inproceedings{Sentsova-etal-2025,
title = {MultiCoPIE: A Multilingual Corpus of Potentially Idiomatic Expressions for Cross-lingual PIE Disambiguation},
author = {Uliana Sentsova and Debora Ciminari and Josef van Genabith and Cristina Espa{\~n}a-Bonet},
url = {https://multiword.org/mwe2025/},
year = {2025},
date = {2025},
booktitle = {21st Workshop on Multiword Expressions (MWE 2025) @NAACL2025},
address = {Albuquerque, New Mexico, U.S.A.},
abstract = {Language models are able to handle compositionality and, to some extent, noncompositional phenomena such as semantic idiosyncrasy, a feature most prominent in the case of idioms. This work introduces the MultiCoPIE corpus that includes potentially idiomatic expressions in Catalan, Italian, and Russian, extending the language coverage of PIE corpus data. The new corpus provides additional linguistic features of idioms, such as their semantic compositionality, part-of-speech of idiom head as well as their corresponding idiomatic expressions in English. With this new resource at hand, we first fine-tune an XLM-RoBERTa model to classify figurative and literal usage of potentially idiomatic expressions in English. We then study cross-lingual transfer to the languages represented in the MultiCoPIE corpus, evaluating the model’s ability to generalize an idiom-related task to languages not seen during fine-tuning. We show the effect of ‘cross-lingual lexical overlap’: the performance of the model, fine-tuned on English idiomatic expressions and tested on the MultiCoPIE languages, increases significantly when classifying ‘shared idioms’— idiomatic expressions that have direct counterparts in English with similar form and meaning. While this observation raises questions about the generalizability of cross-lingual learning, the results from experiments on PIEs demonstrate strong evidence of effective cross-lingual transfer, even when accounting for idioms similar across languages.},
pubstate = {forthcoming},
type = {inproceedings}
}

Project:   B6

Alves, Diego

Diachronic Analysis of Phrasal Verbs in English Scientific Writing Inproceedings

Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), University of Tartu Library, Tallinn, Estonia, 2025.
Phrasal verbs (PVs) are a specific type of multi-word expressions and a specific feature of the English language. However, their usage in scientific prose is limited. Our study focuses on the analysis of phrasal verbs in the scientific domain using information theory methods to describe diachronic phenomena such as conventionalization and diversification regarding the usage of PVs. Thus, we analysed their developmental trajectory over time from the mid-17th century to the end of the 20th century by measuring the relative entropy (Kullback-Leibler divergence), predictability in context of the phrasal verbs particles (surprisal), and the paradigmatic variability using word embedding spaces. We were able to identify interesting phenomena such as the process of conventionalization over the 20th century and the peaks of diversification throughout the centuries.

@inproceedings{Alves-2025,
title = {Diachronic Analysis of Phrasal Verbs in English Scientific Writing},
author = {Diego Alves},
url = {https://dspace.ut.ee/items/ef26bd7f-e708-41b3-b5c8-84cf8057ab71},
year = {2025},
date = {2025},
booktitle = {Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)},
publisher = {University of Tartu Library},
address = {Tallinn, Estonia},
abstract = {

Phrasal verbs (PVs) are a specific type of multi-word expressions and a specific feature of the English language. However, their usage in scientific prose is limited. Our study focuses on the analysis of phrasal verbs in the scientific domain using information theory methods to describe diachronic phenomena such as conventionalization and diversification regarding the usage of PVs. Thus, we analysed their developmental trajectory over time from the mid-17th century to the end of the 20th century by measuring the relative entropy (Kullback-Leibler divergence), predictability in context of the phrasal verbs particles (surprisal), and the paradigmatic variability using word embedding spaces. We were able to identify interesting phenomena such as the process of conventionalization over the 20th century and the peaks of diversification throughout the centuries.
},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Kunilovskaya, Maria; Zaitova, Iuliia; Xue, Wei; Stenger, Irina; Avgustinova, Tania

Predictability of Microsyntactic Units across Slavic Languages: A translation-based Study Inproceedings

Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), University of Tartu Library, Tallinn, Estonia, 2025.
The paper presents the results of a free translation experiment, which was set up to explore Slavic cross-language intelligibility. In the experiment, native speakers of Russian were asked to read a sentence in one of the five Slavic languages and return a Russian translation of a highlighted item. The experiment is focused on microsyntactic units because they offer an increased intercomprehension difficulty due to opaque semantics. Each language is represented by at least 50 stimuli, and each stimulus has generated at least 20 responses. The levels of intercomprehension are captured by categorising participants’ responses into seven types of translation solutions (paraphrase, correct, fluent_literal, awkward_literal, fantasy, noise, and empty), generally reflecting the level of the cross-linguistic intelligibility of the stimuli. The study aims to reveal linguistic factors that favour intercomprehension across Slavic languages. We use regression and correlation analysis to identify the most important intercomprehension predictors and statistical analysis to bring up the most typical cases and outliers. We explore several feature types that reflect the properties of the translation tasks and their outcomes, including point-wise phonological and orthographic distances, cosine similarities, surprisals, translation quality scores and translation solution entropy indices. The experimental data confirms the expected gradual increase of intelligibility from West-Slavic to East-Slavic languages for the speakers of Russian. We show that intelligibility is highly contingent on the ability of speakers to recognise and interpret formal similarities between languages as well as on the size of these similarities. For several Slavic languages, the context sentence complexity was a significant predictor of intelligibility.

@inproceedings{Kunilovskaya-etal-2025,
title = {Predictability of Microsyntactic Units across Slavic Languages: A translation-based Study},
author = {Maria Kunilovskaya and Iuliia Zaitova and Wei Xue and Irina Stenger and Tania Avgustinova},
url = {https://dspace.ut.ee/items/26e26504-9379-42cf-8f85-361a04dcd114},
year = {2025},
date = {2025},
booktitle = {Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)},
publisher = {University of Tartu Library},
address = {Tallinn, Estonia},
abstract = {

The paper presents the results of a free translation experiment, which was set up to explore Slavic cross-language intelligibility. In the experiment, native speakers of Russian were asked to read a sentence in one of the five Slavic languages and return a Russian translation of a highlighted item. The experiment is focused on microsyntactic units because they offer an increased intercomprehension difficulty due to opaque semantics. Each language is represented by at least 50 stimuli, and each stimulus has generated at least 20 responses. The levels of intercomprehension are captured by categorising participants' responses into seven types of translation solutions (paraphrase, correct, fluent_literal, awkward_literal, fantasy, noise, and empty), generally reflecting the level of the cross-linguistic intelligibility of the stimuli. The study aims to reveal linguistic factors that favour intercomprehension across Slavic languages. We use regression and correlation analysis to identify the most important intercomprehension predictors and statistical analysis to bring up the most typical cases and outliers. We explore several feature types that reflect the properties of the translation tasks and their outcomes, including point-wise phonological and orthographic distances, cosine similarities, surprisals, translation quality scores and translation solution entropy indices. The experimental data confirms the expected gradual increase of intelligibility from West-Slavic to East-Slavic languages for the speakers of Russian. We show that intelligibility is highly contingent on the ability of speakers to recognise and interpret formal similarities between languages as well as on the size of these similarities. For several Slavic languages, the context sentence complexity was a significant predictor of intelligibility.
},
pubstate = {published},
type = {inproceedings}
}

Projects:   B7 C4

Lemke, Tyll Robin

Investigating fragment usage with a gamified utterance selection task Proceeding

Experiments in Linguistic Meaning, 3, pp. 447-459, 2025.
Nonsentential utterances, or fragments, like A coffee, please! can often be used to communicate a propositional meaning otherwise encoded by a complete sentence (I’d like to order a coffee, please!). Previous research focused mostly on the syntax and licensing of fragments, but the questions of why speakers use fragments and how listeners interpret them are still underexplored. I propose a simple game-theoretic account of fragment usage, which predicts that (i) listeners assign fragments the most likely interpretation in context and (ii) that speakers are aware of this and trade off production cost and the risk of being misunderstood when choosing their utterance. Using a corpus of production data, empirically founded and precise model predictions are generated. These predictions are evaluated with two experiments using a novel gamified utterance selection paradigm. The experiments suggest that, as predicted, speakers take into account both potential gain in efficiency and the risk of being misunderstood when choosing their utterance.

@proceeding{Lemke_2025,
title = {Investigating fragment usage with a gamified utterance selection task},
author = {Tyll Robin Lemke},
url = {https://journals.linguisticsociety.org/proceedings/index.php/ELM/article/view/5836},
doi = {https://doi.org/10.3765/elm.3.5836},
year = {2025},
date = {2025},
booktitle = {Experiments in Linguistic Meaning},
pages = {447-459},
abstract = {

Nonsentential utterances, or fragments, like A coffee, please! can often be used to communicate a propositional meaning otherwise encoded by a complete sentence (I'd like to order a coffee, please!). Previous research focused mostly on the syntax and licensing of fragments, but the questions of why speakers use fragments and how listeners interpret them are still underexplored. I propose a simple game-theoretic account of fragment usage, which predicts that (i) listeners assign fragments the most likely interpretation in context and (ii) that speakers are aware of this and trade off production cost and the risk of being misunderstood when choosing their utterance. Using a corpus of production data, empirically founded and precise model predictions are generated. These predictions are evaluated with two experiments using a novel gamified utterance selection paradigm. The experiments suggest that, as predicted, speakers take into account both potential gain in efficiency and the risk of being misunderstood when choosing their utterance.
},
pubstate = {published},
type = {proceeding}
}

Project:   B3

Häuser, Katja; Borovsky, Arielle

Got it right up front? Further evidence for parallel graded prediction during prenominal article processing in a self-paced reading study Journal Article

Glossa Psycholinguistics, 4, 2025.

Recent studies suggest that language users generate and maintain multiple predictions in parallel, especially in tasks that explicitly instruct participants to generate predictions. Here, we investigated the possibility of parallel gradedness of linguistic predictions in a simple reading task, using a new measure—imbalance—that captures the probabilistic difference between multiple sentence completions. We focus on prenominal gender-marked articles in German to obtain prediction-specific effects. Native speakers of German read predictable or unpredictable gender-marked nouns that were preceded by prediction-consistent or -inconsistent prenominal articles. Sentence frames either biased expectations more strongly toward the most likely continuation of the sentence, or balanced expectations between the first and second most likely continuation. The results showed reading facilitation for gender-marked articles when sentences were more biased but slowing when sentences were more balanced, irrespective of article predictability. We conclude that readers issue multiple prenominal predictions and weigh them according to their likelihood, providing evidence for parallel gradedness of prenominal predictions. The results are discussed in light of theoretical models on prediction and rational sentence processing.

@article{haeuser-borovsky-2025,
title = {Got it right up front? Further evidence for parallel graded prediction during prenominal article processing in a self-paced reading study},
author = {Katja H{\"a}user and Arielle Borovsky},
url = {https://escholarship.org/uc/item/7g30m0th},
doi = {https://doi.org/10.5070/G6011.1636},
year = {2025},
date = {2025},
journal = {Glossa Psycholinguistics},
volume = {4},
number = {1},
abstract = {Recent studies suggest that language users generate and maintain multiple predictions in parallel, especially in tasks that explicitly instruct participants to generate predictions. Here, we investigated the possibility of parallel gradedness of linguistic predictions in a simple reading task, using a new measure—imbalance—that captures the probabilistic difference between multiple sentence completions. We focus on prenominal gender-marked articles in German to obtain prediction-specific effects. Native speakers of German read predictable or unpredictable gender-marked nouns that were preceded by prediction-consistent or -inconsistent prenominal articles. Sentence frames either biased expectations more strongly toward the most likely continuation of the sentence, or balanced expectations between the first and second most likely continuation. The results showed reading facilitation for gender-marked articles when sentences were more biased but slowing when sentences were more balanced, irrespective of article predictability. We conclude that readers issue multiple prenominal predictions and weigh them according to their likelihood, providing evidence for parallel gradedness of prenominal predictions. The results are discussed in light of theoretical models on prediction and rational sentence processing.},
pubstate = {published},
type = {article}
}

Project:   A5

Talamo, Luigi

Introducing STAF: The Saarbrücken Treebank of Albanian Fiction Journal Article

Journal of Open Humanities Data, 11, pp. 1–6, 2025.

The present paper describes the building of STAF, a Universal Dependencies treebank for Albanian. STAF was bootstrapped using a Stanza model trained on previously unreleased data and then manually corrected by three Albanian speakers supervised by the author, who also revised all sentences. STAF focuses on the fiction genre, featuring 200 sentences selected from nine literary texts written by Albanian contemporary authors.

@article{Talamo-2025,
title = {Introducing STAF: The Saarbr{\"u}cken Treebank of Albanian Fiction},
author = {Luigi Talamo},
url = {https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.285},
doi = {https://doi.org/10.5334/johd.285},
year = {2025},
date = {2025},
journal = {Journal of Open Humanities Data},
pages = {1–6},
volume = {11},
number = {3},
abstract = {

The present paper describes the building of STAF, a Universal Dependencies treebank for Albanian. STAF was bootstrapped using a Stanza model trained on previously unreleased data and then manually corrected by three Albanian speakers supervised by the author, who also revised all sentences. STAF focuses on the fiction genre, featuring 200 sentences selected from nine literary texts written by Albanian contemporary authors.
},
pubstate = {published},
type = {article}
}

Project:   C7

Dyer, Andrew; Bahceci, Ruveyda Betul; Rajestari, Maryam; Rouvalis, Andreas; Singhal, Aarushi; Stodolinska, Yuliya; Umniyati, Syahidah Asma; Rodrigues Menezes de Oliveira Vaz, Helena

A Multilingual Parallel Corpus for Coreference Resolution and Information Status in the Literary Domain Inproceedings

Dakota, Daniel; Jablotschkin, Sarah; Kübler, Sandra; Zinsmeister, Heike (Ed.): Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024), Association for Computational Linguistics, pp. 55-64, Hamburg, Germany, 2024.

Information status — the newness or givenness of referents in discourse — is known to affect the production of language at many different levels. At the morphosyntactic level, information status gives rise to special word orders, elisions, and other phenomena that challenge the notion that morphosyntax can be considered independent of discourse context. Though there are many language-specific corpora annotated for information status and its related phenomena, coreference and anaphora resolution, what is not available at present is a cross-lingually consistently annotated corpus or annotation scheme that would allow for comparative study of these phenomena across many diverse languages. In this paper we present our work to build such a resource. We are annotating a parsed, parallel corpus of prose in many languages for information status and coreference resolution, so that like-for-like cross-lingual comparisons can be made at the intersection of discourse and syntax. Our corpus can and will be used both for corpus analysis and for model training.

@inproceedings{dyer-etal-2024-multilingual,
title = {A Multilingual Parallel Corpus for Coreference Resolution and Information Status in the Literary Domain},
author = {Andrew Dyer and Ruveyda Betul Bahceci and Maryam Rajestari and Andreas Rouvalis and Aarushi Singhal and Yuliya Stodolinska and Syahidah Asma Umniyati and Helena Rodrigues Menezes de Oliveira Vaz},
editor = {Daniel Dakota and Sarah Jablotschkin and Sandra K{\"u}bler and Heike Zinsmeister},
url = {https://aclanthology.org/2024.tlt-1.7/},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024)},
pages = {55-64},
publisher = {Association for Computational Linguistics},
address = {Hamburg, Germany},
abstract = {Information status — the newness or givenness of referents in discourse — is known to affect the production of language at many different levels. At the morphosyntactic level, information status gives rise to special word orders, elisions, and other phenomena that challenge the notion that morphosyntax can be considered independent of discourse context. Though there are many language-specific corpora annotated for information status and its related phenomena, coreference and anaphora resolution, what is not available at present is a cross-lingually consistently annotated corpus or annotation scheme that would allow for comparative study of these phenomena across many diverse languages. In this paper we present our work to build such a resource. We are annotating a parsed, parallel corpus of prose in many languages for information status and coreference resolution, so that like-for-like cross-lingual comparisons can be made at the intersection of discourse and syntax. Our corpus can and will be used both for corpus analysis and for model training.},
pubstate = {published},
type = {inproceedings}
}

Project:   C7

Talamo, Luigi; Verkerk, Annemarie; Salaberri, Iker

A quantitative approach to clause type and syntactic change in two Indo-European corpora Journal Article

Italian Journal of Linguistics, 36, pp. 53-82, 2024.

The aim of this paper is to empirically test the claim that subordinate clauses tend to preserve conservative features in language change. To this end, the diachronic behavior of two well-understood and frequently adduced features of grammar, namely null subject pronouns and order of subject, object and verb, is analyzed for main and adverbial clauses in a balanced corpus of 45 Indo-European languages. This study combines qualitative and quantitative analysis by drawing on individual descriptive grammars and parallel corpora respectively. Additionally, diachronic change is modeled using phylogenetic comparative methods. The data suggest that adverbial clauses can in some cases develop asymmetries with respect to their independent counterparts, either through innovation or through preservation of conservative features, possibly due to a communicative need to distinguish clause types by means of grammar. However, the general tendency is for adverbial clauses to change much in the same way as main clauses. This finding contradicts previous claims and calls for a reassessment of studies on the diachronic nature of distinct clause types.

@article{Talamo-etal-2024,
title = {A quantitative approach to clause type and syntactic change in two Indo-European corpora},
author = {Luigi Talamo and Annemarie Verkerk and Iker Salaberri},
url = {https://www.italian-journal-linguistics.com/current-issue/},
doi = {https://doi.org/10.26346/1120-2726-225},
year = {2024},
date = {2024},
journal = {Italian Journal of Linguistics},
pages = {53-82},
volume = {36},
number = {2},
abstract = {The aim of this paper is to empirically test the claim that subordinate clauses tend to preserve conservative features in language change. To this end, the diachronic behavior of two well-understood and frequently adduced features of grammar, namely null subject pronouns and order of subject, object and verb, is analyzed for main and adverbial clauses in a balanced corpus of 45 Indo-European languages. This study combines qualitative and quantitative analysis by drawing on individual descriptive grammars and parallel corpora respectively. Additionally, diachronic change is modeled using phylogenetic comparative methods. The data suggest that adverbial clauses can in some cases develop asymmetries with respect to their independent counterparts, either through innovation or through preservation of conservative features, possibly due to a communicative need to distinguish clause types by means of grammar. However, the general tendency is for adverbial clauses to change much in the same way as main clauses. This finding contradicts previous claims and calls for a reassessment of studies on the diachronic nature of distinct clause types.},
pubstate = {published},
type = {article}
}

Project:   C7

Menzel, Katrin

Noun + noun Compounds and Verbal Complements as Non-normalised Features in Late Modern English Scientific Translations Inproceedings

Proceedings of 7th Translation in Transition Conference, Batumi: Shota Rustaveli State University, 2024.

This paper presents a study on the usage of noun+noun compounds and verbal complement structures in 18th century scientific articles in the Royal Society Corpus (RSC) comparing translated to non-translated English texts. Departing from the hypothesis that the translations will conform more strongly to traditional patterns of the English language, the analysis shows that these historical translations and non-translated texts are similarly marked by the ongoing reorganisation of the noun phrase, but translations contain more innovative complementation patterns. Additionally, a surprisal analysis shows that the analysed patterns tend to occur in more predictable and conventionalised contexts in non-translated texts than in translation.

@inproceedings{Menzel2024Noun,
title = {Noun + noun Compounds and Verbal Complements as Non-normalised Features in Late Modern English Scientific Translations},
author = {Katrin Menzel},
url = {https://sites.google.com/view/tt2024/schedule-and-proceedings},
year = {2024},
date = {2024-12-26},
booktitle = {Proceedings of 7th Translation in Transition Conference},
address = {Batumi: Shota Rustaveli State University},
abstract = {This paper presents a study on the usage of noun+noun compounds and verbal complement structures in 18th century scientific articles in the Royal Society Corpus (RSC) comparing translated to non-translated English texts. Departing from the hypothesis that the translations will conform more strongly to traditional patterns of the English language, the analysis shows that these historical translations and non-translated texts are similarly marked by the ongoing reorganisation of the noun phrase, but translations contain more innovative complementation patterns. Additionally, a surprisal analysis shows that the analysed patterns tend to occur in more predictable and conventionalised contexts in non-translated texts than in translation.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Menzel, Katrin

Initialisms in Scientific Writing in the 19th and Early 20th Centuries Journal Article

Zeitschrift für Wortbildung / Journal of Word Formation (ZWJW) (Special issue Historical English Word-Formation), 8, pp. 7-27, 2024.
This paper focusses on the role of initialisms in scientific English articles in the Royal Society Corpus (Fischer et al. 2020; Kermes et al. 2016). The development of scientific initialisms is illustrated with frequency data, a discussion of the evolution of the text topics obtained from topic modelling and an analysis of the development of information-theoretic surprisal values of initialisms in three time spans between 1830 and 1919. The overall frequency and diversity of initialisms for scientific concepts has risen considerably between 1830 and 1919 in the context of the ongoing specialisation of the sciences. Particularly from the 1860s onwards scientific initialisms increasingly become shortcuts for multiword units with wordhood and term status. The surprisal values of scientific initialisms decrease over time as such forms more regularly occur in conventionalised textual contexts and fixed expressions. Overall, the analysis of the RSC texts shows that key developments towards the conventionalisation of scientific initialisms as term formation patterns took place in the transitional period from Late Modern to Present-day English.

@article{Menzel2024,
title = {Initialisms in Scientific Writing in the 19th and Early 20th Centuries},
author = {Katrin Menzel},
url = {https://journals.linguistik.de/zwjw/article/view/108},
doi = {https://doi.org/10.21248/zwjw.2024.2.108},
year = {2024},
date = {2024},
journal = {Zeitschrift f{\"u}r Wortbildung / Journal of Word Formation (ZWJW) (Special issue Historical English Word-Formation)},
pages = {7-27},
volume = {8},
number = {2},
abstract = {

This paper focusses on the role of initialisms in scientific English articles in the Royal Society Corpus (Fischer et al. 2020; Kermes et al. 2016). The development of scientific initialisms is illustrated with frequency data, a discussion of the evolution of the text topics obtained from topic modelling and an analysis of the development of information-theoretic surprisal values of initialisms in three time spans between 1830 and 1919. The overall frequency and diversity of initialisms for scientific concepts has risen considerably between 1830 and 1919 in the context of the ongoing specialisation of the sciences. Particularly from the 1860s onwards scientific initialisms increasingly become shortcuts for multiword units with wordhood and term status. The surprisal values of scientific initialisms decrease over time as such forms more regularly occur in conventionalised textual contexts and fixed expressions. Overall, the analysis of the RSC texts shows that key developments towards the conventionalisation of scientific initialisms as term formation patterns took place in the transitional period from Late Modern to Present-day English.
},
pubstate = {published},
type = {article}
}

Project:   B1

Steuer, Julius; Krielke, Marie-Pauline; Fischer, Stefan; Degaetano-Ortlieb, Stefania; Mosbach, Marius; Klakow, Dietrich

Modeling Diachronic Change in English Scientific Writing over 300+ Years with Transformer-based Language Model Surprisal Inproceedings

Zweigenbaum, Pierre; Rapp, Reinhard; Sharoff, Serge (Ed.): Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024, ELRA and ICCL, pp. 12-23, Torino, Italia, 2024.

This study presents an analysis of diachronic linguistic changes in English scientific writing, utilizing surprisal from transformer-based language models. Unlike traditional n-gram models, transformer-based models are potentially better at capturing nuanced linguistic changes such as long-range dependencies by considering variable context sizes. However, to create diachronically comparable language models there are several challenges with historical data, notably an exponential increase in no. of texts, tokens per text and vocabulary size over time. We address these by using a shared vocabulary and employing a robust training strategy that includes initial uniform sampling from the corpus and continuing pre-training on specific temporal segments. Our empirical analysis highlights the predictive power of surprisal from transformer-based models, particularly in analyzing complex linguistic structures like relative clauses. The models’ broader contextual awareness and the inclusion of dependency length annotations contribute to a more intricate understanding of communicative efficiency. While our focus is on scientific English, our approach can be applied to other low-resource scenarios.

@inproceedings{steuer-etal-2024-modeling,
title = {Modeling Diachronic Change in English Scientific Writing over 300+ Years with Transformer-based Language Model Surprisal},
author = {Julius Steuer and Marie-Pauline Krielke and Stefan Fischer and Stefania Degaetano-Ortlieb and Marius Mosbach and Dietrich Klakow},
editor = {Pierre Zweigenbaum and Reinhard Rapp and Serge Sharoff},
url = {https://aclanthology.org/2024.bucc-1.2/},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024},
pages = {12-23},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {This study presents an analysis of diachronic linguistic changes in English scientific writing, utilizing surprisal from transformer-based language models. Unlike traditional n-gram models, transformer-based models are potentially better at capturing nuanced linguistic changes such as long-range dependencies by considering variable context sizes. However, to create diachronically comparable language models there are several challenges with historical data, notably an exponential increase in no. of texts, tokens per text and vocabulary size over time. We address these by using a shared vocabulary and employing a robust training strategy that includes initial uniform sampling from the corpus and continuing pre-training on specific temporal segments. Our empirical analysis highlights the predictive power of surprisal from transformer-based models, particularly in analyzing complex linguistic structures like relative clauses. The models’ broader contextual awareness and the inclusion of dependency length annotations contribute to a more intricate understanding of communicative efficiency. While our focus is on scientific English, our approach can be applied to other low-resource scenarios.},
pubstate = {published},
type = {inproceedings}
}

Projects:   B1 B4

Bagdasarov, Sergei; Teich, Elke

Multi-word expressions in biomedical abstracts and their plain English adaptations Inproceedings

Hämäläinen, Mika; Öhman, Emily; Miyagawa, So; Alnajjar, Khalid; Bizzoni, Yuri (Ed.): Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, Association for Computational Linguistics, pp. 483-488, Miami, USA, 2024.

This study analyzes the use of multi-word expressions (MWEs), prefabricated sequences of words (e.g. in this case, this means that, healthcare service, follow up) in biomedical abstracts and their plain language adaptations. While English academic writing became highly specialized and complex from the late 19th century onwards, recent decades have seen a rising demand for a lay-friendly language in scientific content, especially in the health domain, to bridge a communication gap between experts and laypersons. Based on previous research showing that MWEs are easier to process than non-formulaic word sequences of comparable length, we hypothesize that they can potentially be used to create a more reader-friendly language. Our preliminary results suggest some significant differences between complex and plain abstracts when it comes to the usage patterns and informational load of MWEs.

@inproceedings{bagdasarov-teich-2024-multi,
title = {Multi-word expressions in biomedical abstracts and their plain English adaptations},
author = {Sergei Bagdasarov and Elke Teich},
editor = {Mika H{\"a}m{\"a}l{\"a}inen and Emily {\"O}hman and So Miyagawa and Khalid Alnajjar and Yuri Bizzoni},
url = {https://aclanthology.org/2024.nlp4dh-1.46/},
doi = {https://doi.org/10.18653/v1/2024.nlp4dh-1.46},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities},
pages = {483-488},
publisher = {Association for Computational Linguistics},
address = {Miami, USA},
abstract = {This study analyzes the use of multi-word expressions (MWEs), prefabricated sequences of words (e.g. in this case, this means that, healthcare service, follow up) in biomedical abstracts and their plain language adaptations. While English academic writing became highly specialized and complex from the late 19th century onwards, recent decades have seen a rising demand for a lay-friendly language in scientific content, especially in the health domain, to bridge a communication gap between experts and laypersons. Based on previous research showing that MWEs are easier to process than non-formulaic word sequences of comparable length, we hypothesize that they can potentially be used to create a more reader-friendly language. Our preliminary results suggest some significant differences between complex and plain abstracts when it comes to the usage patterns and informational load of MWEs.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Häuser, Katja; Borovsky, Arielle

Predictive processing suppresses form-related words with overlapping onsets Inproceedings

Proceedings of the 46th Annual Conference of the Cognitive Science Society, 46, Cognitive Science Society, pp. 5042-5048, Rotterdam (The Netherlands), 2024.

Do language users predict word forms as readily as they predict semantic features? Previous studies are conflicting, possibly because they did not differentiate between two types of word form relationship: Head and rhyme relationships, sharing onset or offset features with predictable words. Here, we investigated prediction of form and meaning by means of a priming lexical decision task. People read constraining sentences that disconfirmed their expectations, and indicated, at sentence offset, whether a letter string was a word. Targets were predictable but not presented nouns, semantically related nouns, as well as head- and rhyme-related nouns. Unrelated control nouns were also presented. Results showed facilitation for predictable and semantically related words, with no difference between the two. While no effects emerged for rhymes, head-related words showed slowing, indicating suppression of lexical neighbors following prediction of word forms. Our findings align with word recognition models and prediction-by-production models of predictive processing.

@inproceedings{HaeuserBorovsky2024,
title = {Predictive processing suppresses form-related words with overlapping onsets},
author = {Katja H{\"a}user and Arielle Borovsky},
url = {https://escholarship.org/uc/item/95w210ck},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 46th Annual Conference of the Cognitive Science Society},
pages = {5042-5048},
publisher = {Cognitive Science Society},
address = {Rotterdam (The Netherlands)},
abstract = {Do language users predict word forms as readily as they predict semantic features? Previous studies are conflicting, possibly because they did not differentiate between two types of word form relationship: Head and rhyme relationships, sharing onset or offset features with predictable words. Here, we investigated prediction of form and meaning by means of a priming lexical decision task. People read constraining sentences that disconfirmed their expectations, and indicated, at sentence offset, whether a letter string was a word. Targets were predictable but not presented nouns, semantically related nouns, as well as head- and rhyme-related nouns. Unrelated control nouns were also presented. Results showed facilitation for predictable and semantically related words, with no difference between the two. While no effects emerged for rhymes, head-related words showed slowing, indicating suppression of lexical neighbors following prediction of word forms. Our findings align with word recognition models and prediction-by-production models of predictive processing.},
pubstate = {published},
type = {inproceedings}
}

Project:   A5

Häuser, Katja; Kray, Jutta

Age Differences in Context Use During Reading and Downstream Effects on Recognition Memory Journal Article

Psychology and Aging, 39, pp. 715–730, 2024.

It is well-known that sentential context modulates sentence processing. But does context also have effects that extend beyond the immediate moment, for example, by impacting the memory representations that people store? And are there age-related differences in this process? Here, we investigated this question. German readers who varied in age self-paced through constraining sentences that continued in a predictable or less predictable fashion. Participants’ recognition memory was then tested for previously seen (i.e., “old”) words and for initially predictable but not actually presented words (i.e., “lures”). The results showed that readers of all ages slowed down when reading unpredictable sentences. However, aging individuals maintained less sentence-specific information than younger adults: They not only understood sentential materials less correctly on the fly, but they also showed disproportionate rates of false remembering and less successful old–new discrimination in the recognition memory test. Of note, rates of false remembering were reduced in those aging readers who allocated more time toward reading unpredictable sentence continuations. Together, our results show that aging increases reliance on gist or schema-congruent processing but that more attentive encoding of text can buffer against some of the resulting memory distortions.

@article{HaeuserKray2024,
title = {Age Differences in Context Use During Reading and Downstream Effects on Recognition Memory},
author = {Katja H{\"a}user and Jutta Kray},
url = {https://psycnet.apa.org/fulltext/2025-19000-001.html},
doi = {https://doi.org/10.1037/pag0000845},
year = {2024},
date = {2024},
journal = {Psychology and Aging},
pages = {715–730},
volume = {39},
number = {7},
abstract = {
It is well-known that sentential context modulates sentence processing. But does context also have effects that extend beyond the immediate moment, for example, by impacting the memory representations that people store? And are there age-related differences in this process? Here, we investigated this question. German readers who varied in age self-paced through constraining sentences that continued in a predictable or less predictable fashion. Participants’ recognition memory was then tested for previously seen (i.e., “old”) words and for initially predictable but not actually presented words (i.e., “lures”). The results showed that readers of all ages slowed down when reading unpredictable sentences. However, aging individuals maintained less sentence-specific information than younger adults: They not only understood sentential materials less correctly on the fly, but they also showed disproportionate rates of false remembering and less successful old–new discrimination in the recognition memory test. Of note, rates of false remembering were reduced in those aging readers who allocated more time toward reading unpredictable sentence continuations. Together, our results show that aging increases reliance on gist or schema-congruent processing but that more attentive encoding of text can buffer against some of the resulting memory distortions.
},
pubstate = {published},
type = {article}
}

Project:   A5

Sizov, Fedor; España-Bonet, Cristina; van Genabith, Josef; Xie, Roy; Dutta Chowdhury, Koel

Analysing Translation Artifacts: A Comparative Study of LLMs, NMTs, and Human Translations Inproceedings

Haddow, Barry; Kocmi, Tom; Koehn, Philipp; Monz, Christof (Ed.): Proceedings of the Ninth Conference on Machine Translation, Association for Computational Linguistics, pp. 1183-1199, Miami, Florida, USA, 2024.

Translated texts exhibit a range of characteristics that make them appear distinct from texts originally written in the same target language. With the rise of Large Language Models (LLMs), which are designed for a wide range of language generation and understanding tasks, there has been significant interest in their application to Machine Translation. While several studies have focused on improving translation quality through fine-tuning or few-shot prompting techniques, there has been limited exploration of how LLM-generated translations qualitatively differ from those produced by Neural Machine Translation (NMT) models, and human translations. Our study employs explainability methods such as Leave-One-Out (LOO) and Integrated Gradients (IG) to analyze the lexical features distinguishing human translations from those produced by LLMs and NMT systems. Specifically, we apply a two-stage approach: first, classifying texts based on their origin – whether they are original or translations – and second, extracting significant lexical features (highly attributed input words) using post-hoc interpretability methods. Our analysis shows that different methods of feature extraction vary in their effectiveness, with LOO being generally better at pinpointing critical input words and IG capturing a broader range of important words. Finally, our results show that while LLMs and NMT systems can produce translations of a good quality, they still differ from texts originally written by native speakers. Specifically, we find that while some LLMs often align closely with human translations, traditional NMT systems exhibit distinct characteristics, particularly in their use of certain linguistic features.

@inproceedings{sizov-etal-2024-analysing,
title = {Analysing Translation Artifacts: A Comparative Study of LLMs, NMTs, and Human Translations},
author = {Fedor Sizov and Cristina Espa{\~n}a-Bonet and Josef van Genabith and Roy Xie and Koel Dutta Chowdhury},
editor = {Barry Haddow and Tom Kocmi and Philipp Koehn and Christof Monz},
url = {https://aclanthology.org/2024.wmt-1.116},
doi = {https://doi.org/10.18653/v1/2024.wmt-1.116},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Ninth Conference on Machine Translation},
pages = {1183-1199},
publisher = {Association for Computational Linguistics},
address = {Miami, Florida, USA},
abstract = {Translated texts exhibit a range of characteristics that make them appear distinct from texts originally written in the same target language. With the rise of Large Language Models (LLMs), which are designed for a wide range of language generation and understanding tasks, there has been significant interest in their application to Machine Translation. While several studies have focused on improving translation quality through fine-tuning or few-shot prompting techniques, there has been limited exploration of how LLM-generated translations qualitatively differ from those produced by Neural Machine Translation (NMT) models, and human translations. Our study employs explainability methods such as Leave-One-Out (LOO) and Integrated Gradients (IG) to analyze the lexical features distinguishing human translations from those produced by LLMs and NMT systems. Specifically, we apply a two-stage approach: first, classifying texts based on their origin {--} whether they are original or translations {--} and second, extracting significant lexical features (highly attributed input words) using post-hoc interpretability methods. Our analysis shows that different methods of feature extraction vary in their effectiveness, with LOO being generally better at pinpointing critical input words and IG capturing a broader range of important words. Finally, our results show that while LLMs and NMT systems can produce translations of a good quality, they still differ from texts originally written by native speakers. Specifically, we find that while some LLMs often align closely with human translations, traditional NMT systems exhibit distinct characteristics, particularly in their use of certain linguistic features.},
pubstate = {published},
type = {inproceedings}
}

Project:   B6

Ryzhova, Margarita; Ellsiepen, Emilia; Trinley, Katharina; Skrjanec, Iza; Demberg, Vera

The Effects of Linguistic Context on Comprehension of Unknown Words Inproceedings

The 2nd Workshop on Eye Movements and the Assessment of Reading Comprehension (MultiplEYE), 2024.

Words that are unfamiliar to us can elicit processing difficulties. Word familiarity can be modulated by the intrinsic properties of the word like frequency and length (Rayner, 1998, Kliegl et al. 2004). However, the literature shows that the context also affects comprehension (Nieuwland & van Berkum 2006; Lowell & Morris, 2014; Williams & Morris, 2004). For example, scientific or technical texts may contain more specialized vocabulary that is unfamiliar to the general reader, while everyday texts such as newspapers or novels may contain more familiar language. In such common contexts, the reader can be surprised to encounter an unknown word, or attribute it to a typo, while in a more scientific context, the reader might expect to encounter special domain terms that they don’t know.

In our study on processing unknown words in German, we manipulate the type of context to explore whether it affects the reader’s sensitivity to processing unfamiliar words. We conduct a self-paced reading experiment and ask participants to read texts for comprehension. Each text includes a target word: either a real word or a pseudoword. The target words were embedded into two types of context: everyday and scientific, making this study follow a 2×2 design. Everyday stories concern familiar events from daily life (e.g. children playing in a park), while scientific stories take place in less common settings with characters with a specialized profession (e.g. researchers conducting experiments in a laboratory). The scientific stories themselves are not expository texts, but rather narratives describing a less familiar scenario.

We find that in both contexts subjects showed sensitivity to pseudowords, resulting in higher reading times. However, this effect was significantly stronger in the everyday context, compared to the scientific context condition. The context alone didn’t affect the reading times. Our results show that unknown words, despite lacking defined meaning, are more anticipated in domain-specific texts than in general narratives. The scientific context increases the expectancy of encountering unknown words, resulting in faster reading.

At the time of abstract submission, we are conducting an eye-tracking counterpart of this study, additionally collecting information on language experience and domain expertise.

@inproceedings{Ryzhova_etal_2024,
title = {The Effects of Linguistic Context on Comprehension of Unknown Words},
author = {Margarita Ryzhova and Emilia Ellsiepen and Katharina Trinley and Iza Skrjanec and Vera Demberg},
url = {https://multipleye.eu/wp-content/uploads/Book_of_Abstracts24.pdf},
year = {2024},
date = {2024},
booktitle = {The 2nd Workshop on Eye Movements and the Assessment of Reading Comprehension (MultiplEYE)},
abstract = {Words that are unfamiliar to us can elicit processing difficulties. Word familiarity can be modulated by the intrinsic properties of the word like frequency and length (Rayner, 1998, Kliegl et al. 2004). However, the literature shows that the context also affects comprehension (Nieuwland & van Berkum 2006; Lowell & Morris, 2014; Williams & Morris, 2004). For example, scientific or technical texts may contain more specialized vocabulary that is unfamiliar to the general reader, while everyday texts such as newspapers or novels may contain more familiar language. In such common contexts, the reader can be surprised to encounter an unknown word, or attribute it to a typo, while in a more scientific context, the reader might expect to encounter special domain terms that they don’t know. In our study on processing unknown words in German, we manipulate the type of context to explore whether it affects the reader's sensitivity to processing unfamiliar words. We conduct a self-paced reading experiment and ask participants to read texts for comprehension. Each text includes a target word: either a real word or a pseudoword. The target words were embedded into two types of context: everyday and scientific, making this study follow a 2x2 design. Everyday stories concern familiar events from daily life (e.g. children playing in a park), while scientific stories take place in less common settings with characters with a specialized profession (e.g. researchers conducting experiments in a laboratory). The scientific stories themselves are not expository texts, but rather narratives describing a less familiar scenario. We find that in both contexts subjects showed sensitivity to pseudowords, resulting in higher reading times. However, this effect was significantly stronger in the everyday context, compared to the scientific context condition. The context alone didn't affect the reading times. 
Our results show that unknown words, despite lacking defined meaning, are more anticipated in domain-specific texts than in general narratives. The scientific context increases the expectancy of encountering unknown words, resulting in faster reading. At the time of abstract submission, we are conducting an eye-tracking counterpart of this study, additionally collecting information on language experience and domain expertise.},
pubstate = {published},
type = {inproceedings}
}

Project:   A8

Pollkläsener, Christina; Yung, Frances Pik Yu; Lapshinova-Koltunski, Ekaterina

Capturing variation of discourse relations in English parallel data through automatic annotation and alignment Journal Article

Across Languages and Cultures, 25, pp. 288–309, 2024, ISSN 1588-2519.

We present a study of discourse connectives and discourse relations in English parallel texts, i.e. in written and spoken originals, as well as translation and interpreting from German. For this, we apply automatic procedures to annotate discourse connectives and relations they trigger in a parallel corpus. We look at distributions of various connectives and discourse relations, comparing spoken and written mode, as well as original and translated or interpreted language production. Furthermore, we analyse the translation patterns in terms of translation entropy. We link our observations to the phenomena of explicitation and implicitation. We find that in both interpreting and translation, explicitation and implicitation patterns are affected by the cognitive complexity of the discourse relation signalled by the connective. Moreover, we also show that the difference in the specificity of the same connectives in interpreting and translation also depends on the type of relation they trigger.

@article{Pollkläsener-etal-2024,
title = {Capturing variation of discourse relations in English parallel data through automatic annotation and alignment},
author = {Christina Pollkl{\"a}sener and Frances Pik Yu Yung and Ekaterina Lapshinova-Koltunski},
url = {https://akjournals.com/view/journals/084/25/2/article-p288.xml},
doi = {https://doi.org/10.1556/084.2024.00903},
year = {2024},
date = {2024},
journal = {Across Languages and Cultures},
pages = {288–309},
volume = {25},
number = {2},
abstract = {
We present a study of discourse connectives and discourse relations in English parallel texts, i.e. in written and spoken originals, as well as translation and interpreting from German. For this, we apply automatic procedures to annotate discourse connectives and relations they trigger in a parallel corpus. We look at distributions of various connectives and discourse relations, comparing spoken and written mode, as well as original and translated or interpreted language production. Furthermore, we analyse the translation patterns in terms of translation entropy. We link our observations to the phenomena of explicitation and implicitation. We find that in both interpreting and translation, explicitation and implicitation patterns are affected by the cognitive complexity of the discourse relation signalled by the connective. Moreover, we also show that the difference in the specificity of the same connectives in interpreting and translation also depends on the type of relation they trigger.
},
pubstate = {published},
type = {article}
}

Projects:   B7 B2

Ortmann, Katrin; Voigtmann, Sophia; Dipper, Stefanie; Speyer, Augustin

An information-theoretic account of constituent order in the German middle field Book Chapter

Lemke, Tyll Robin; Schäfer, Lisa; Reich, Ingo (Ed.): Information structure and information theory, Language Science Press, pp. 55–86, Berlin, 2024.

This paper proposes a novel approach to explain object order in German. Although the order of constituents is relatively free in modern German, there are clear preferences for the order dative before accusative (nominal) objects and for the order given before new objects. A range of influential factors have been described in the literature, most prominently givenness and length. We assume processing-related reasons and use information-theoretic measures, in particular surprisal and DORM (Cuskley et al. 2021), to explore the interplay of information structure and information density as factors for object order. We propose a measure called DORMdiff and the corpus of variants method for comparing information profiles between different plausible constituent orders. Our investigations show that language users follow information-theoretic principles (UID, Levy & Jaeger 2007) in choosing the object order that leads to a more uniform distribution of information. We argue that this preference also explains deviations from the unmarked object order (i.e., accusative preceding dative and new preceding given) if it is associated with smoother information profiles.

@inbook{Ortmann-etal-2024,
title = {An information-theoretic account of constituent order in the German middle field},
author = {Katrin Ortmann and Sophia Voigtmann and Stefanie Dipper and Augustin Speyer},
editor = {Tyll Robin Lemke and Lisa Sch{\"a}fer and Ingo Reich},
url = {https://langsci-press.org/catalog/book/465},
doi = {https://doi.org/10.5281/zenodo.13383787},
year = {2024},
date = {2024},
booktitle = {Information structure and information theory},
pages = {55–86},
publisher = {Language Science Press},
address = {Berlin},
abstract = {This paper proposes a novel approach to explain object order in German. Although the order of constituents is relatively free in modern German, there are clear preferences for the order dative before accusative (nominal) objects and for the order given before new objects. A range of influential factors have been described in the literature, most prominently givenness and length. We assume processing-related reasons and use information-theoretic measures, in particular surprisal and DORM (Cuskley et al. 2021), to explore the interplay of information structure and information density as factors for object order. We propose a measure called DORMdiff and the corpus of variants method for comparing information profiles between different plausible constituent orders. Our investigations show that language users follow information-theoretic principles (UID, Levy & Jaeger 2007) in choosing the object order that leads to a more uniform distribution of information. We argue that this preference also explains deviations from the unmarked object order (i.e., accusative preceding dative and new preceding given) if it is associated with smoother information profiles.},
pubstate = {published},
type = {inbook}
}

Project:   C6

Yuen, Ivan; Andreeva, Bistra; Ibrahim, Omnia; Möbius, Bernd

Prosodic factors do not always suppress discourse or surprisal factors on word-final syllable duration in German polysyllabic words Incollection

Lemke, Robin; Schäfer, Lisa; Reich, Ingo (Ed.): Information Structure and Information Theory, Language Science Press, pp. 215-234, Berlin, 2024.

Predictability is known to influence acoustic duration (e.g., Ibrahim et al. 2022) and prosodic factors such as accenting and boundary-related lengthening have been postulated to account for this effect (e.g., Aylett & Turk 2004). However, it has also been shown that other factors such as information status or speech styles could contribute to acoustic duration (e.g. Baker & Bradlow 2009). This raises the question as to whether acoustic duration is primarily subject to the influence of prosody that reflects linguistic structure including predictability. The current study addressed this question by examining the acoustic duration of word-final syllables in polysyllabic words in DIRNDL, a German radio broadcast corpus (e.g. Eckart et al. 2012). We analysed polysyllabic words followed by an intermediate phrase or an intonational phrase boundary, with or without accenting, and with given or new information status. Our results indicate that the acoustic duration of the word-final syllable was subject to the effect of prosodic boundary for long host words, in line with Aylett & Turk (2004); however, we also observed additional effects of information status, log surprisal and accenting for short host words, in line with Baker & Bradlow (2009). These results suggest that acoustic duration is subject to the influence of prosodic (e.g., boundary and accenting) and linguistic factors (e.g., information status and surprisal), and that the primacy of prosodic factors impacting on acoustic duration is further constrained by some intrinsic durational constraints, for example word length.

@incollection{Yuen/etal:2024b,
title = {Prosodic factors do not always suppress discourse or surprisal factors on word-final syllable duration in German polysyllabic words},
author = {Ivan Yuen and Bistra Andreeva and Omnia Ibrahim and Bernd M{\"o}bius},
editor = {Robin Lemke and Lisa Sch{\"a}fer and Ingo Reich},
url = {https://zenodo.org/records/13383799},
doi = {https://doi.org/10.5281/zenodo.13383799},
year = {2024},
date = {2024},
booktitle = {Information Structure and Information Theory},
pages = {215-234},
publisher = {Language Science Press},
address = {Berlin},
abstract = {Predictability is known to influence acoustic duration (e.g., Ibrahim et al. 2022) and prosodic factors such as accenting and boundary-related lengthening have been postulated to account for this effect (e.g., Aylett & Turk 2004). However, it has also been shown that other factors such as information status or speech styles could contribute to acoustic duration (e.g. Baker & Bradlow 2009). This raises the question as to whether acoustic duration is primarily subject to the influence of prosody that reflects linguistic structure including predictability. The current study addressed this question by examining the acoustic duration of word-final syllables in polysyllabic words in DIRNDL, a German radio broadcast corpus (e.g. Eckart et al. 2012). We analysed polysyllabic words followed by an intermediate phrase or an intonational phrase boundary, with or without accenting, and with given or new information status. Our results indicate that the acoustic duration of the word-final syllable was subject to the effect of prosodic boundary for long host words, in line with Aylett & Turk (2004); however, we also observed additional effects of information status, log surprisal and accenting for short host words, in line with Baker & Bradlow (2009). These results suggest that acoustic duration is subject to the influence of prosodic (e.g., boundary and accenting) and linguistic factors (e.g., information status and surprisal), and that the primacy of prosodic factors impacting on acoustic duration is further constrained by some intrinsic durational constraints, for example word length.},
pubstate = {published},
type = {incollection}
}

Project:   C1